Slide1
A CRF-based Named Entity Recognition System for Turkish
Information Extraction 10-707 Project
Reyyan Yeniterzi
Slide2
Introduction
Named Entity Recognition (NER) aims to locate and classify the named entities in text.
State-of-the-art NER systems are available for several languages.
Only a limited amount of work has been conducted for Turkish.
We present the first CRF-based NER system for Turkish.
Slide3
Turkish
Turkish is a morphologically complex language with very productive inflectional and derivational processes.
Many local and non-local syntactic structures in English translate to Turkish words with complex morphological structures.
tatlandırabileceksek
"if we are going to be able to make [something] acquire flavor"

tat     +lan     +dır     +abil       +ecek      +se  +k
flavor  acquire  to make  to be able  are going  if   we
Slide4
Related Work
Cucerzan and Yarowsky, 1999
  a language-independent EM-style bootstrapping algorithm
  uses word-internal and contextual information of entities
Tür et al., 2003
  a statistical approach (HMM)
  data sparseness issues due to the agglutinative structure of Turkish
  use the morphological form of the word instead of the surface form
Kucuk and Yazici, 2009
  the first rule-based NER system for Turkish
  information sources such as dictionaries, lists of well-known entities and context patterns
Slide5
Approach
Conditional Random Fields (CRF)
  CRF++, an open-source CRF sequence labeling toolkit
Lexical model
  uses only the word tokens in their surface form
  may encounter data sparseness problems
Morphological forms of the words
Contextual evidence around the named entities
Slide6
Data Set - I
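To illustrate the sparseness concern raised for the lexical model on the Approach slide, here is a minimal sketch. The token list is hypothetical, and cutting a token at its apostrophe is only a crude stand-in for real morphological analysis (Turkish orthography separates suffixes on proper nouns with an apostrophe); it is not the method used in this project.

```python
from collections import Counter

# Hypothetical tokens: one place name carrying different case suffixes.
tokens = ["Ankara", "Ankara'da", "Ankara'ya", "Ankara'dan", "Ankara'da"]


def naive_stem(token: str) -> str:
    """Cut a token at its first apostrophe (a crude stand-in for a lemma)."""
    return token.split("'", 1)[0]


# A purely lexical model sees every suffixed variant as a separate feature.
surface_counts = Counter(tokens)
# Grouping by stem pools the evidence for the same underlying name.
stem_counts = Counter(naive_stem(t) for t in tokens)

print(len(surface_counts), "distinct surface forms")
print(len(stem_counts), "distinct stems")
```

The same entity is spread over several rare surface forms, which is the motivation for adding morphological forms as features.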
the newspaper articles data set
  train set used in (Tür et al., 2003)
  test set not available
split the data in two for evaluation purposes
  90% for training
  10% for testing
Slide7
Data Set - II
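The 90%/10% split from the Data Set - I slide can be sketched as follows. The slides do not say whether the split was made by sentence, by article, or after shuffling, so the sentence-level, in-order split and the name `split_train_test` are assumptions made for illustration.

```python
def split_train_test(sentences, train_percent=90):
    """Split a list into a leading train part and a trailing test part."""
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(sentences) * train_percent // 100
    return sentences[:cut], sentences[cut:]


# Hypothetical stand-in for the list of annotated sentences.
sentences = [f"sentence {i}" for i in range(20)]
train, test = split_train_test(sentences)
print(len(train), len(test))
```

Keeping the original order means no sentence appears in both parts; a shuffled split would only need a `random.shuffle` on a copy of the list beforehand.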
Three types of named entities
  Organization
  Person
  Location

         # words    # person   # organization   # location
Train    445,498    21,701     14,510           12,138
Test     47,344     2,400      1,595            1,402
Slide8
Data Set - III
named entities are marked with the ENAMEX tag
  a type of SGML tag
  TYPE attribute
Slide9
Experiments
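Before a sequence labeler such as CRF++ can be trained, the ENAMEX markup described on the Data Set - III slide has to be turned into one label per token. The sketch below shows one way this might look. The B-/I-/O labeling scheme, the example sentence, and the function name `enamex_to_bio` are assumptions; the slides do not state which encoding was used.

```python
import re

# Matches either a whole ENAMEX-tagged span or a single untagged token.
ENAMEX_OR_TOKEN = re.compile(
    r'<ENAMEX\s+TYPE="(?P<type>[^"]+)">(?P<text>.*?)</ENAMEX>'
    r"|(?P<token>[^\s<]+)"
)


def enamex_to_bio(line):
    """Convert one ENAMEX-annotated line into (token, label) pairs."""
    pairs = []
    for match in ENAMEX_OR_TOKEN.finditer(line):
        entity_type = match.group("type")
        if entity_type is not None:
            # First word of an entity gets B-, the remaining words get I-.
            for i, word in enumerate(match.group("text").split()):
                prefix = "B" if i == 0 else "I"
                pairs.append((word, f"{prefix}-{entity_type}"))
        else:
            pairs.append((match.group("token"), "O"))
    return pairs


# Hypothetical annotated sentence ("Ali Veli went to Ankara yesterday.").
line = (
    '<ENAMEX TYPE="PERSON">Ali Veli</ENAMEX> dün '
    '<ENAMEX TYPE="LOCATION">Ankara</ENAMEX>\'ya gitti .'
)
pairs = enamex_to_bio(line)

# CRF++ reads one token per line with whitespace-separated columns,
# the label in the last column, and a blank line between sentences.
print("\n".join(f"{token}\t{label}" for token, label in pairs))
```

For the lexical model, the token column alone would serve as the feature source; further columns (for example lemmas or POS tags) could be added between the token and the label.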
Lexical Model

               Precision   Recall   F-Measure
Person         0.96        0.73     0.83
Organization   0.95        0.73     0.83
Location       0.96        0.81     0.88
Slide10
Ongoing and Future Work
building the morphological features
  the morphological analysis of the words is done
  currently working on disambiguating the analyses
  will use the POS tags and lemmas of the words
building the contextual features
performing error analyses
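As a check on the Lexical Model results from the Experiments slide, the F-Measure column can be reproduced from the precision and recall columns. The slides do not define the measure, so the sketch assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the reported values are consistent with that assumption.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the lexical model.
results = {
    "Person": (0.96, 0.73),
    "Organization": (0.95, 0.73),
    "Location": (0.96, 0.81),
}

scores = {
    entity_type: round(f_measure(p, r), 2)
    for entity_type, (p, r) in results.items()
}
for entity_type, score in scores.items():
    print(entity_type, score)
```

In every row precision is well above recall, so the F-measure is held down mainly by missed entities rather than by wrong predictions.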