
A CRf - PowerPoint Presentation

natalia-silvester
Uploaded On 2016-09-07

Presentation Transcript

Slide1

A CRF-based Named Entity Recognition System for Turkish

Information Extraction 10-707 Project

Reyyan Yeniterzi

Slide2
Introduction

Named Entity Recognition (NER) aims to locate and classify the named entities in a text.
State-of-the-art NER systems are available for several languages.
Only a limited amount of study has been conducted for Turkish.
We present the first CRF-based NER system for Turkish.

Slide3
Turkish

Turkish is a morphologically complex language with very productive inflectional and derivational processes.
Many local and non-local syntactic structures in English translate to Turkish words with complex morphological structures.

Example: tatlandırabileceksek
"if we are going to be able to make [something] acquire flavor"

Morpheme   Gloss
tat        flavor
+lan       acquire
+dır       to make
+abil      to be able
+ecek      are going
+se        if
+k         we

Slide4
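To make the example from the Turkish slide concrete, the short sketch below stores the segmentation as data and rebuilds the surface form from it. The segmentation is copied by hand from the slide, not produced by a morphological analyzer, and the variable names are only illustrative.

```python
# Hand-written segmentation of the slide's example word (morpheme, gloss).
# This is not analyzer output; it only mirrors the slide.
morphemes = [
    ("tat", "flavor"),
    ("lan", "acquire"),
    ("dır", "to make"),
    ("abil", "to be able"),
    ("ecek", "are going"),
    ("se", "if"),
    ("k", "we"),
]

# Concatenating the morphemes yields the single surface token.
surface_form = "".join(morpheme for morpheme, _ in morphemes)

# The root is the first morpheme; every suffix after it changes the surface
# token, which is why models over surface forms see many rare words.
root = morphemes[0][0]
suffix_count = len(morphemes) - 1

print(surface_form, root, suffix_count)
```

A model that sees only the surface token treats this word as unrelated to any other form of the same root, which is the data sparseness concern raised for Turkish.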
Related Work

Cucerzan and Yarowsky, 1999
  a language-independent EM-style bootstrapping algorithm
  uses word-internal and contextual information of entities

Tür et al., 2003
  a statistical approach (HMM)
  data sparseness issues due to the agglutinative structure of Turkish
  use the morphological form of the word instead of the surface form

Kucuk and Yazici, 2009
  the first rule-based NER system for Turkish
  information sources such as dictionaries, lists of well-known entities, and context patterns

Slide5
Approach

Conditional Random Fields (CRF)
  CRF++, an open source CRF sequence labeling toolkit
Lexical model
  using only the word tokens in their surface form
  may encounter data sparseness problems
Morphological forms of the words
Contextual evidence around the named entities

Slide6
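As a rough illustration of the lexical model listed under Approach, the sketch below writes tokens in the column layout that CRF++ reads (one token per line, the label in the last column, an empty line after each sentence) and builds a small unigram feature template over surface forms only. The window size, the B-/I-/O label scheme, the example sentence, and the function names are all assumptions made for illustration; the slides do not give the template or tag set that was actually used.

```python
def to_crfpp_lines(sentences):
    """Render labeled sentences in CRF++ column format.

    Each sentence is a list of (surface_form, label) pairs. Columns are
    separated by a tab, the label comes last, and an empty line marks
    the end of every sentence.
    """
    lines = []
    for sentence in sentences:
        for token, label in sentence:
            lines.append(f"{token}\t{label}")
        lines.append("")  # sentence boundary
    return lines


def lexical_template(window=2):
    """Build a CRF++ feature template over surface forms (column 0).

    %x[row,col] refers to the token `row` positions from the current one,
    column `col`. The window size is an assumption, not from the slides.
    """
    lines = []
    for i, offset in enumerate(range(-window, window + 1)):
        lines.append(f"U{i:02d}:%x[{offset},0]")
    lines.append("B")  # add label bigram features
    return lines


# Invented example sentence with assumed B-/I-/O labels.
example = [[("Ayşe", "B-PERSON"), ("Yılmaz", "I-PERSON"),
            ("Ankara", "B-LOCATION"), ("yolundaydı", "O")]]

data_lines = to_crfpp_lines(example)
template_lines = lexical_template()
print("\n".join(data_lines))
print("\n".join(template_lines))
```

Files written in this layout would then be passed to the CRF++ command-line tools `crf_learn` (template, training data, model) and `crf_test` (model, test data).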
Data Set - I

the newspaper articles data set
  train set used in (Tür et al., 2003)
  test set not available
split the data in two for evaluation purposes
  90% for training
  10% for testing

Slide7
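The 90%/10% split described under Data Set - I could be done along the following lines. This is a minimal sketch that assumes the split is made on whole sentences, in their original order; the slides do not say whether the data was shuffled or at what unit it was cut, and the function name is hypothetical.

```python
def split_train_test(sentences, train_percent=90):
    """Split a list of sentences into train and test parts.

    Assumption: the split is made on sentence boundaries and keeps the
    original order. Integer arithmetic avoids floating-point rounding.
    """
    cut = len(sentences) * train_percent // 100
    return sentences[:cut], sentences[cut:]


# Placeholder data: 20 dummy "sentences", each a list of tokens.
sentences = [[f"token{i}"] for i in range(20)]
train, test = split_train_test(sentences)
print(len(train), len(test))
```

Splitting on sentences rather than on words would explain why the word counts reported under Data Set - II (445,498 and 47,344) come to about 90.4% and 9.6% rather than exactly 90% and 10%.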
Data Set - II

Three types of named entities
  Organization
  Person
  Location

        # words    # person   # organization   # location
Train   445,498    21,701     14,510           12,138
Test    47,344     2,400      1,595            1,402

Slide8
Data Set - III

named entities are marked with the ENAMEX tag
  a type of SGML tag
  the TYPE attribute gives the entity type

Slide9
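The ENAMEX markup described under Data Set - III can be turned into per-token labels for a sequence labeler roughly as follows. This is a sketch under several assumptions: tags look like `<ENAMEX TYPE="PERSON">...</ENAMEX>` with upper-case type names, tokens are separated by whitespace, and labels use a B-/I-/O scheme. The sample sentence is invented and the function name is hypothetical.

```python
import re

# Assumed tag shape: <ENAMEX TYPE="PERSON">...</ENAMEX>
ENAMEX_RE = re.compile(r'<ENAMEX TYPE="([A-Z]+)">(.*?)</ENAMEX>', re.DOTALL)


def enamex_to_labels(text):
    """Convert ENAMEX-annotated text to (token, label) pairs."""
    pairs = []
    pos = 0
    for match in ENAMEX_RE.finditer(text):
        # Tokens between the previous entity and this one are outside.
        for token in text[pos:match.start()].split():
            pairs.append((token, "O"))
        entity_type = match.group(1)
        # First token of an entity gets B-, the rest get I-.
        for i, token in enumerate(match.group(2).split()):
            prefix = "B" if i == 0 else "I"
            pairs.append((token, f"{prefix}-{entity_type}"))
        pos = match.end()
    # Remaining tokens after the last entity.
    for token in text[pos:].split():
        pairs.append((token, "O"))
    return pairs


# Invented sample, not taken from the data set.
sample = ('<ENAMEX TYPE="PERSON">Ayşe Yılmaz</ENAMEX> dün '
          '<ENAMEX TYPE="LOCATION">Ankara</ENAMEX> yolundaydı .')
labeled = enamex_to_labels(sample)
for token, label in labeled:
    print(token, label)
```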
Experiments

Lexical Model

               Precision   Recall   F-Measure
Person         0.96        0.73     0.83
Organization   0.95        0.73     0.83
Location       0.96        0.81     0.88

Slide10
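The F-Measure values reported for the Lexical Model under Experiments are consistent with the balanced F-measure, F = 2PR / (P + R). The sketch below recomputes them from the reported precision and recall. Treating the metric as the balanced F1 is an assumption, since the slides do not define it.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) from the Lexical Model results.
reported = {
    "Person": (0.96, 0.73, 0.83),
    "Organization": (0.95, 0.73, 0.83),
    "Location": (0.96, 0.81, 0.88),
}

for entity_type, (p, r, f_reported) in reported.items():
    f_computed = f_measure(p, r)
    print(f"{entity_type}: computed {f_computed:.2f}, reported {f_reported:.2f}")
```

In all three rows the high precision is offset by a lower recall, so the F-measure sits closer to the recall value.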
Ongoing and Future Work

building the morphological features
  the morphological analysis of the words is done
  currently working on disambiguating the analyses
  will use the POS tags and lemmas of the words
building the contextual features
performing error analyses
