Uploaded On 2016-09-13



Presentation Transcript


1

Electronic dictionaries

Duško Vitas
University of Belgrade, Faculty of Mathematics

2

One definition of a (traditional) dictionary

A dictionary is a book in which the words and phrases of a language are listed alphabetically, together with their meanings or their translations in another language. (Collins Cobuild, English Dictionary for Advanced Users)

3

From Dictionary.com

... a book, optical disc, mobile device, or online lexical resource containing a selection of the words of a language, giving information about their meanings, pronunciations, etymologies, inflected forms, derived forms, etc., expressed in either the same or another language...

Print dictionaries of various sizes, ranging from small pocket dictionaries to multivolume books, usually sort entries alphabetically...

All electronic dictionaries, whether online or installed on a device, can provide immediate, direct access to a search term, its meanings...

4

Some definitions of e-dictionaries

According to one definition (Schryver, 2003), an e-dictionary is any dictionary that can be used in an automated environment.

According to another definition (Jacquet-Pfau, 2002), electronic dictionaries, which are intended for the automated processing of texts (corpora), differ from machine-readable dictionaries, which are intended for human users.

5

Characteristics of e-dictionaries

E-dictionaries have to fulfill two basic criteria:

They have to be formally established, so that computer programs can process them; in addition, e-dictionaries complement grammars, since all exceptions are listed in them.
They have to be exhaustive, since they must cover 100% of the lexicon of the language in question; a parser that processes a text should not be impeded by unknown words. This aim is difficult to achieve.

As opposed to grammars, an e-dictionary tends to describe the lexical properties of lemmas extensively.
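The exhaustiveness criterion can be made measurable as the share of text tokens that are found in the dictionary. The following is only a minimal sketch under simple assumptions: tokens are runs of word characters, matching ignores case, and the function name `coverage`, the sample sentence and the tiny set of known forms are all hypothetical illustrations, not part of any real e-dictionary.

```python
import re


def coverage(text, known_forms):
    """Return (ratio of known tokens, list of unknown tokens) for a text.

    known_forms is a set of lower-case word forms (an assumption of this sketch).
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0, []
    unknown = [token for token in tokens if token not in known_forms]
    return (len(tokens) - len(unknown)) / len(tokens), unknown


# Hypothetical three-token sentence checked against a two-form dictionary.
ratio, unknown = coverage("Učiteljica vidi učiteljicu", {"učiteljica", "učiteljicu"})
```

The list of unknown tokens is usually the more useful result, since it shows which words still have to be added to reach (close to) full coverage.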

6

Development of e-dictionaries

Can the development of an e-dictionary rely on some excellent traditional dictionary?

Traditional dictionaries are often limited in size (e.g. for commercial reasons).
Information in them is often implicit: they rely on the belief that a human will easily supply all missing data; for instance, a human will correctly deduce a whole paradigm if offered one or two inflectional endings.
Information is often partial (e.g. in Serbian the noun otac has two possible plural forms, očevi and oci; for automatic processing it is necessary to know explicitly whether it is possible to say ?očevi nacije ‘national founding fathers’, ?oci dece ‘fathers of the children’, or even Očevi i oci (title of a novel)).

7

From a list of words to an e-dictionary

Many computer scientists in the past thought that a list of words taken from a traditional dictionary is a good starting point for the development of an e-dictionary.

This attitude was influenced by work done for English, which is not a typical example of a European language (from the point of view of automatic processing, because of its modest inflection).

Before starting to develop an e-dictionary, it should be clear what its basic unit (lemma) is going to be, and then how its other forms can be generated from it.

8

Defining a basic unit of an e-dictionary

Automatic text processing usually begins with simple words as the basic units of texts. This is a natural starting point because they are formalized for most European languages. However, simple words are not always a natural unit of processing, because they are:

ambiguous (dictionaries offer several meanings for them);
pointless (many terms have several constituents, and each of them does not contribute directly to the meaning of the term).

Because of that, dictionaries of simple words have to be complemented with other types of dictionaries and grammars that will provide natural units of processing.

9

Types of e-dictionaries

E-dictionaries of simple words (dictionaries of simple graphemic units; these are usually the entries in traditional dictionaries);
E-dictionaries of multi-word units (multi-word units that contain non-letter characters, terminology, collocations, phrases, ...);
Phonological e-dictionaries (pronunciation of simple words, with rules of how to pronounce inflected forms, words in contact, etc.);
Semantic e-dictionaries (simple words and multi-word units with encoded senses; a network of senses?)

10

A prerequisite for the development of an e-dictionary of simple words

A selection of lexical categories and a way to represent them. Traditional categories:

Part-of-Speech: noun, verb, adjective, ...
subordinated categories: possessive, indefinite, definite, ...
inflectional categories: masculine, feminine, neuter, nominative, genitive, ...
syntactic categories: transitive, intransitive, ...
semantic categories: human, abstraction, concrete object, ...

11

The selection of categories is not a straightforward task

A set of tags used to annotate the Brown corpus (Brown)
A set of tags used to annotate the Penn Treebank (Penn)
A set of tags used for the Multext-East project (Multext-East)

12

LADL format of electronic dictionaries

Unitex works with dictionaries that were developed by members of the Relex network.

Relex is an informal international network of laboratories that work on computational linguistics. It was established by Maurice Gross and his LADL team. (LADL is short for Laboratoire d'Automatique Documentaire et Linguistique.)

Members of the Relex network developed exhaustive e-dictionaries of simple words and compounds (http://infolingu.univ-mlv.fr/Relex/Relex.html).

13

A selection of canonic forms

If a word has several surface forms, one of them is chosen as the canonic representative for the other, subordinate forms.

What are the canonic forms in e-dictionaries of French?

For nouns, as a rule, it is the masculine singular form;
For verbs, it is the infinitive;
...

14

Is the selection of a canonic form unique?

It is neither simple nor unique. For instance:

In French, the gender of nouns is an inflectional category; that is, lecteur, lecteurs, lectrice, lectrices are four forms of the same word, and its canonic form is lecteur.
In Serbian, the gender of nouns is not an inflectional category; so učitelj and učiteljica are (traditionally, as well as in the Serbian e-dictionary) two canonic forms, each with its own subordinate forms.
Similarly in Bulgarian: учител and учителка.

15

Why is the adequate selection of a canonic form so important?

A lot of information about a word is attached to its canonic form, and all subordinate forms share that information: učiteljica has the semantic features +Hum+Prof, and all its inflected forms (učiteljice, učiteljici, učiteljicu, ...) have the same features.

Is this a rule that releases us from making further decisions?

No: in Serbian, gender and animacy are features of subordinate forms, not of canonic forms.

Why?

Nouns can change gender in plural forms: vladika (m), vladike (f).
In order to treat the same category always in the same way (for nouns, adjectives, pronouns, numerals, etc.).

16

More about categories attached to canonic and subordinate forms

First, there was a mouse. This mouse is alive. Its canonic form is miš,N+Zool

Then came a mouse. This mouse is not alive. Its canonic form is miš,N+Conc

What is the value of the grammatical category “animacy” for this new mouse?

Da biste se prebacili na sledeće poglavlje DVD-a, pomerite miša (‘To skip to the next DVD chapter, move the mouse’)
Da biste kontrolisali reprodukciju televizije uživo, pomerite miš kako bi se prikazale kontrole za reprodukciju (‘To control live TV playback, move the mouse so that the playback controls are displayed’)

Google: 28,300
Google: 19,000

17

More on the selection of a canonic form

Passive past participles are not separate entries in Serbian traditional dictionaries; these forms belong to the verb paradigm.

What about passive past participles that are used as adjectives?

A program for automatic text processing has to recognize them somehow and tag them appropriately.

For instance, a sample of Politika with 582,000 simple word tokens contains, in the feminine gender accusative alone, 228 adjectives derived from the past participle (not all of them are correct).

18

And what about ...

Present past participle (functioning as an adjective): Politika occurrences in the feminine accusative singular forms;
Present gerund (functioning as an adjective): Politika occurrences in the feminine accusative singular forms;
Derivational forms:
Possessive adjectives: dečakov, partizanov, ...
Diminutives: tkaninica, futrolica, telefonče, ...
Gender motion: druidica, gutačica, gudačica, guvernerka, ...
šezdesetogodišnjakinja, četvoroipomesečni, dvestopedesetogodišnjica, ...

In the Serbian e-dictionary they all have separate canonic forms, each with its own subordinate forms.

19

In order to obtain (close to) 100% coverage of a text, it is necessary to include:

colors: skerletnocrven, bledoplav, mlečnožut, ...
Proper names:
personal names, geopolitical names
organizations (Ozna, Gestapo, Metropoliten, ...)
objects, trademarks (lajka, spitfajer, mercedes, ...)
Titles and characters of novels, films, operas… (Dezdemona, Asteriks, Plavobradi, ...)
events (Anšlus, ...)
And then also donžuanstvo, arsenlupenovski, neotitoizam, nedićevština, ...

20

Details of the LADL format

There are two dictionaries (or lists) of simple words in the LADL format:

The first dictionary, DELAS, is a dictionary of canonic forms (lemmas). This dictionary is used to generate the second dictionary.
The second dictionary, DELAF, is a dictionary of subordinate (or inflected) forms. Only this dictionary is used in automatic text processing.

21

An entry in a DELAS dictionary

lemma,Kn+Prop

K: A Part-of-Speech code. Usually it is a code consisting of one or more upper-case letters.
n: A relation with subordinate forms, if they exist. Usually it is an alphanumeric code that, together with the PoS code, enables the generation of all subordinate forms for a DELAF dictionary.
Prop: Syntactic, semantic, dialect, usage, domain, … markers. These markers can be freely attached to any canonic form; they take the form of alphanumeric codes.
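The `lemma,Kn+Prop` layout can be split mechanically into its parts. The sketch below is an illustration only, not the parser used by Unitex: it assumes that the PoS code K consists of letters, that the inflection code n (when present) starts with a digit, and that markers are separated by `+`. The names `DelasEntry` and `parse_delas` are hypothetical.

```python
import re
from typing import NamedTuple


class DelasEntry(NamedTuple):
    lemma: str        # canonic form
    pos: str          # K: Part-of-Speech code
    inflection: str   # n: inflection class code, "" if the word does not inflect
    markers: tuple    # Prop: freely attached markers


def parse_delas(line):
    """Split one DELAS line of the shape lemma,Kn+Prop into its parts."""
    lemma, comma, code = line.strip().partition(",")
    head, *markers = code.split("+")
    # Assumption of this sketch: letters form K, and n begins with a digit.
    match = re.fullmatch(r"([^\W\d_]+)(\d\w*)?", head)
    if not lemma or not comma or match is None:
        raise ValueError(f"not a DELAS entry: {line!r}")
    return DelasEntry(lemma, match.group(1), match.group(2) or "", tuple(markers))


entry = parse_delas("učiteljica,N651+Hum+GM")
```

For an uninflected word such as `ćutke,ADV` the inflection code comes back empty, which reflects the "if they exist" clause in the description of n.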

22

An example of a DELAS entry from the Serbian e-dictionary

učiteljica,N651+Hum+GM

učiteljica: canonic form (lemma)
N: Part-of-Speech (noun)
(N)651: inflection class code used to generate all inflected forms
+Hum: human
+GM: feminine gender noun derived from the corresponding masculine gender noun učitelj

23

Examples of Serbian DELAS entries for various PoS

učiteljica,N651+Hum+GM
zagasitocrven,A6+Col
smejati,V516+Imperf+It+Ref+Ek
ćutke,ADV
deset,NUM+v5
poneko,PRO+ProN+Indef+Sr
ali,CONJ
od,PREP+p2
jaoj,INT
naime,PAR

24

An example of a DELAS entry from the Bulgarian e-dictionary

глава,С600+Ж

глава: canonic form (lemma)
С: Part-of-Speech (noun)
(С)600: inflection class code used to generate all inflected forms
+Ж: feminine

25

Examples of Bulgarian DELAS entries for various PoS

глава,С(600)+Ж
червен,ПРИ(3)
абе,абе.МЕЖ
ако,ако.СЮ+П
вместо,вместо.ПРЕД
вредно,вредно.НАР
даже,даже.ЧА
дам.Г+С+Т
…

26

An entry in a DELAF dictionary:

word form,lemma.K+Prop(:gc)*

lemma: canonic form (or lemma);
K: a Part-of-Speech code (inherited from its lemma);
Prop: syntactic, semantic, dialect, usage, domain, … markers (inherited from its lemma);
gc: a set of codes that represent values of the grammatical categories describing a form. Grammatical categories depend on the PoS. These are one-character alphanumeric codes.
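The DELAF pattern `word form,lemma.K+Prop(:gc)*` can be read with a few string splits. This is a minimal sketch, not Unitex's own reader: the names `DelafEntry` and `parse_delaf` are hypothetical, the sketch assumes that neither the form nor the lemma contains `,` or `.`, and it treats an empty lemma field as "lemma identical to the word form".

```python
from typing import NamedTuple


class DelafEntry(NamedTuple):
    form: str      # subordinate (inflected) form
    lemma: str     # canonic form
    pos: str       # K, inherited from the lemma
    markers: tuple # Prop, inherited from the lemma
    codes: tuple   # zero or more gc groups, e.g. ("fs4v",)


def parse_delaf(line):
    """Split one DELAF line of the shape form,lemma.K+Prop(:gc)*."""
    form, comma, rest = line.strip().partition(",")
    lemma, dot, info = rest.partition(".")
    if not form or not comma or not dot or not info:
        raise ValueError(f"not a DELAF entry: {line!r}")
    tags, *codes = info.split(":")
    pos, *markers = tags.split("+")
    # Assumption: an empty lemma field means the lemma equals the form.
    return DelafEntry(form, lemma or form, pos, tuple(markers), tuple(codes))


entry = parse_delaf("učiteljicu,učiteljica.N+Hum+GM:fs4v")
```

Because `(:gc)*` allows repetition, one line may carry several code groups; the parser keeps them as separate items rather than merging them.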

27

An example of a DELAF entry from the Serbian e-dictionary

učiteljicu,učiteljica.N+Hum+GM:fs4v

učiteljicu: subordinate form (realization)
učiteljica: canonic form (lemma)
N: PoS (inherited from the canonic form)
+Hum+GM: markers (inherited from the canonic form)
fs4v: values of grammatical categories:
f: category gender (value feminine)
s: category number (value singular)
4: category case (value accusative)
v: category animacy (value animate)
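Since each grammatical category is a single character in a fixed position, a code such as `fs4v` can be decoded with a small positional table. The sketch below fills in only the four values named in the example above; every other character is reported as unknown rather than guessed. The table `NOUN_CODES` and the function `decode_noun_code` are hypothetical names, and the fixed order gender, number, case, animacy is taken from this one example.

```python
# One (category, known values) pair per character position of a noun code.
# Only the values explained for "fs4v" are listed; this is deliberately incomplete.
NOUN_CODES = [
    ("gender", {"f": "feminine"}),
    ("number", {"s": "singular"}),
    ("case", {"4": "accusative"}),
    ("animacy", {"v": "animate"}),
]


def decode_noun_code(code):
    """Map a four-character noun code to a {category: value} dictionary."""
    if len(code) != len(NOUN_CODES):
        raise ValueError(f"expected {len(NOUN_CODES)} characters, got {code!r}")
    decoded = {}
    for char, (category, values) in zip(code, NOUN_CODES):
        decoded[category] = values.get(char, f"unknown ({char})")
    return decoded


decoded = decode_noun_code("fs4v")
```

A real table would need one such list per PoS, because the grammatical categories depend on the PoS.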

28

The whole paradigm of the lemma učiteljica

učiteljica,učiteljica.N:fp2v
učiteljica,učiteljica.N:fs1v
učiteljice,učiteljica.N:fp5v
učiteljice,učiteljica.N:fp4v
učiteljice,učiteljica.N:fp1v
učiteljice,učiteljica.N:fs5v
učiteljice,učiteljica.N:fw4v
učiteljice,učiteljica.N:fw2v
učiteljice,učiteljica.N:fs2v
učiteljici,učiteljica.N:fs7v
učiteljici,učiteljica.N:fs3v
učiteljicu,učiteljica.N:fs4v
učiteljicom,učiteljica.N:fs6v
učiteljicama,učiteljica.N:fp7v
učiteljicama,učiteljica.N:fp6v
učiteljicama,učiteljica.N:fp3v

The numeric code 651 that connects a canonic form with all of its subordinate forms is deleted because it is no longer of any use.
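The paradigm also shows why a DELAF lookup returns a list of analyses rather than a single one: several entries share the same word form. As a rough sketch (plain Python dictionaries, not the compiled representation Unitex uses), the 16 lines can be grouped by form; `build_index` is a hypothetical helper name.

```python
from collections import defaultdict

PARADIGM = """\
učiteljica,učiteljica.N:fp2v
učiteljica,učiteljica.N:fs1v
učiteljice,učiteljica.N:fp5v
učiteljice,učiteljica.N:fp4v
učiteljice,učiteljica.N:fp1v
učiteljice,učiteljica.N:fs5v
učiteljice,učiteljica.N:fw4v
učiteljice,učiteljica.N:fw2v
učiteljice,učiteljica.N:fs2v
učiteljici,učiteljica.N:fs7v
učiteljici,učiteljica.N:fs3v
učiteljicu,učiteljica.N:fs4v
učiteljicom,učiteljica.N:fs6v
učiteljicama,učiteljica.N:fp7v
učiteljicama,učiteljica.N:fp6v
učiteljicama,učiteljica.N:fp3v
"""


def build_index(lines):
    """Group DELAF lines by word form: form -> list of 'lemma.K:gc' analyses."""
    index = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        form, _, analysis = line.partition(",")
        index[form].append(analysis)
    return dict(index)


index = build_index(PARADIGM.splitlines())
```

Here the 16 entries collapse to six distinct forms, and the form učiteljice alone carries seven analyses, which is the kind of ambiguity a later processing step has to resolve.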

29

The whole paradigm of the lemma глава

глава,глава.С+Ж:s0
главата,глава.С+Ж:sd
глави,глава.С+Ж:p0
главите,глава.С+Ж:pd

30

The whole paradigm of the lemma дам

дадете,дам.Г+С+Т:R2p

дадеш,дам.Г+С+Т:R2s

дадеше,дам.Г+С+Т:D2s:D3s

дадох,дам.Г+С+Т:E1s

дадоха,дам.Г+С+Т:E3p

дадохме,дам.Г+С+Т:E1p

дадохте,дам.Г+С+Т:E2p

дадял,дам.Г+С+Т:Wsm

дадяла,дам.Г+С+Т:Wsf

дадяло,дам.Г+С+Т:Wsn

дадях,дам.Г+С+Т:D1s

дадяха,дам.Г+С+Т:D3p

дадяхме,дам.Г+С+Т:D1p

дадяхте,дам.Г+С+Т:D2p

дай,дам.Г+С+Т:I2s

дайте,дам.Г+С+Т:I2p

дал,дам.Г+С+Т:Xsm0

дала,дам.Г+С+Т:Xsf0

далата,дам.Г+С+Т:Xsfd

дадат,дам.Г+С+Т:R3p

даде,дам.Г+С+Т:E2s:E3s

даде,дам.Г+С+Т:R3s

дадели,дам.Г+С+Т:Wp

дадем,дам.Г+С+Т:R1p

даден,дам.Г+С+Т:Qsm0

дадена,дам.Г+С+Т:Qsf0

дадената,дам.Г+С+Т:Qsfd

дадени,дам.Г+С+Т:Qp0

дадените,дам.Г+С+Т:Qpd

дадения,дам.Г+С+Т:Qsmh

даденият,дам.Г+С+Т:Qsml

дадено,дам.Г+С+Т:Qsn0

даденото,дам.Г+С+Т:Qsnd

дали,дам.Г+С+Т:Xp0

далите,дам.Г+С+Т:Xpd

далия,дам.Г+С+Т:Xsmh

далият,дам.Г+С+Т:Xsml

дало,дам.Г+С+Т:Xsn0

далото,дам.Г+С+Т:Xsnd

дам,дам.Г+С+Т:R1s

31

How is the relation between a canonic form and its subordinate (inflected) forms established?

In the Unitex system, finite-state transducers (FSTs) are used for this.

The inflection class code that follows the PoS code in DELAS (the dictionary of lemmas) is used to generate all inflected forms.

One transducer is usually used to generate the forms for many lemmas. For instance, transducer N2 generates the inflected forms for emir, evrofil, dijetetičar, forenzičar, leptir, šegrt, and many other lemmas.

32

FST defines classes (BG)

In most languages the relation between a lemma and its forms is an intuitive relation of equivalence that is formalized, in the case of the LADL format, by FSTs:

син/sg,indef
синове/pl,indef
сине/sg,voc
сина/pl,count

син(<E>/sg,indef + ове/pl,indef + е/sg,voc + а/pl,count)

(...?...)(<E>/sg,indef + ове/pl,indef + е/sg,voc + а/pl,count)

N01: (<E>/sg,indef + ове/pl,indef + е/sg,voc + а/pl,count)
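The class N01 can be imitated with a table of (suffix, tags) pairs, where the empty suffix plays the role of `<E>`. This is a deliberately simplified sketch: it only appends suffixes to the lemma, whereas a real inflectional transducer can also modify the stem, and the names `INFLECTION_CLASSES` and `inflect` are hypothetical.

```python
# One inflection class = one list of (suffix, grammatical tags) paths.
# The empty string stands for <E>, the empty word.
INFLECTION_CLASSES = {
    "N01": [
        ("", "sg,indef"),
        ("ове", "pl,indef"),
        ("е", "sg,voc"),
        ("а", "pl,count"),
    ],
}


def inflect(lemma, class_code):
    """Generate (form, tags) pairs for a lemma by appending each suffix of its class."""
    return [(lemma + suffix, tags) for suffix, tags in INFLECTION_CLASSES[class_code]]


forms = inflect("син", "N01")
```

Because the table is keyed by the class code and not by the lemma, the same four paths serve every lemma assigned to N01, which is the sense in which the transducer defines a class.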

33

Dictionaries for other languages

Russian: developed at CIS, Munich (CISLEX-RU), derived mostly from Zaliznyak, A., Grammaticheskij slovar' russkogo jazyka; it contains approximately 44,000 lemmas (930,000 forms).

34

разговориться,.V+intr+sv:AI

при,пря.N+anim(j)+gen(F):geF:nm:ajm

при,.PRAEP+gov(q)

при,переть.V+nsv+tr:A2eb

первой,первый.A+Ord:geF:deF:teF:qeF

встрече,встреча.N+anim(j)+gen(F):deF:qeF

объявил,объявить.V+sv+tr:AeMVi

нынешним,нынешний.A:teM:teN:dm

летом,.ADV

летом,лето.N+anim(j)+gen(N):teN

летом,лет.N+anim(j)+gen(M):teM

Капе,Капа.N+PN+VORN+anim(o)+gen(M)+style(colloq):deM:qeM

Капе,Капа.N+PN+VORN+anim(o)+gen(F)+style(colloq):deF:qeF

35

Капе,Капа.N+PN+VORN+anim(o)+gen(M)+style(colloq):deM:qeM

Капе: word form
Капа: lemma
N: noun
PN: proper noun
VORN: given name
anim(o): animate
gen(M): masculine gender
style(colloq): colloquial style
deM:qeM: dative or prepositional case, singular (e), masculine

36

Dictionaries for other languages

Polish (Z. Vetulani, Adam Mickiewicz University, 1996)

marcu,marzec.N+Gi+Ns+Cl
marcu,marzec.N+Gi+Ns+Cv
marcu,marzec.N+month:L
marynarka,.N+Gf+Ns+Cn
masową,masowy.ADJ+Dp+Ns+Cai+Gf
masowe,masowy.ADJ+Dp+Np+Cnav+Gaifn

37

Dictionaries for other languages

Latin dictionary derived from the Perseus project, based on the Lewis & Short dictionary (1879)

abaculus,.N:Nms
abacum,abax.N:Gmp
abacum,abacus.N:Ams
abacum,abacus.N+poet:Gmp
abaddon,ab-addo.V:1siPC
abagmentum,.N:Vns
abagmentum,.N:Nns
abagmentum,.N:Ans

38

Comparison between the L&S, Georges and Whitaker dictionaries

All three dictionaries are available in electronic form.

L&S supports processing on the Perseus site.
Georges is available on-line.
Whitaker’s Words is an application that performs morphological analysis.

But their content is different. E.g. abacinus exists in Georges and Whitaker, but not in L&S.

39

What else should be known?

The use of upper-case and lower-case letters in a dictionary:

Canonic forms written with lower-case letters can match both lower-case and upper-case occurrences in a text.
Canonic forms written with (some) upper-case letters can match only occurrences in a text that use upper-case letters in the same position(s).

For instance,

vlada,N600
Vlada,N1741+NProp+Hum+First

Some results from the corpus “Politika”: vlada and Vlada
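The upper-case/lower-case rule can be stated as a character-by-character comparison. The function below is a sketch of the rule as worded on this slide, not Unitex's actual implementation; `entry_matches` is a hypothetical name, and the sketch assumes that the dictionary form and the text token have the same length.

```python
def entry_matches(entry_form, token):
    """Check whether a text token may be matched by a dictionary form.

    A lower-case letter in the dictionary form accepts either case in the text;
    an upper-case letter accepts only the same upper-case letter.
    """
    if len(entry_form) != len(token):
        return False
    for dict_char, text_char in zip(entry_form, token):
        if dict_char.isupper():
            if text_char != dict_char:
                return False
        elif text_char.lower() != dict_char:
            return False
    return True


# The common noun entry matches any casing; the proper name requires the capital V.
noun_hits = [t for t in ("vlada", "Vlada", "VLADA") if entry_matches("vlada", t)]
name_hits = [t for t in ("vlada", "Vlada", "VLADA") if entry_matches("Vlada", t)]
```

Under this rule the token Vlada is ambiguous between the two entries, while the token vlada can only be the common noun.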

40

What else should be known, or not?

A user who will not produce a new dictionary (e.g. for a new language, or a dictionary for some sub-domain) need not know the format of DELAS dictionaries, nor does he/she have to know what inflectional transducers are or what they look like.

A user who wants to use dictionaries for text processing needs to know the content of the DELAF dictionaries he/she plans to use and what the different codes and markers mean.

The dictionaries that he/she uses are compiled dictionaries (two files with the extensions .bin and .inf), and Unitex uses them very efficiently.

These dictionaries cannot be “seen”.

41

E-dictionary as statistical tagger

Filtering the results of word-form tagging by TnT, TreeTagger, etc. with e-dictionaries transforms the results into “real” lemmas (part of the ambiguity is lost, but the result is >95% correct :-)
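One possible reading of this filtering step, sketched under strong assumptions: the tagger and the e-dictionary use the same PoS codes, the tagger output is a list of (form, PoS, guessed lemma) triples, and the dictionary maps a lower-case form to its (lemma, PoS) analyses. The function `relemmatize`, the data layout and the tagger guesses shown are hypothetical; this is not the procedure actually used with TnT or TreeTagger.

```python
def relemmatize(tagged, dictionary):
    """Replace guessed lemmas with dictionary lemmas where the choice is unambiguous.

    tagged: iterable of (form, pos, guessed_lemma) triples from a statistical tagger.
    dictionary: mapping form -> list of (lemma, pos) pairs taken from DELAF entries.
    """
    result = []
    for form, pos, guessed in tagged:
        candidates = {
            lemma
            for lemma, entry_pos in dictionary.get(form.lower(), [])
            if entry_pos == pos
        }
        # Exactly one dictionary lemma agrees with the tagger's PoS: trust it.
        lemma = candidates.pop() if len(candidates) == 1 else guessed
        result.append((form, pos, lemma))
    return result


DICTIONARY = {
    "učiteljicu": [("učiteljica", "N")],
    "летом": [("летом", "ADV"), ("лето", "N"), ("лет", "N")],
}
# Hypothetical tagger output; the guessed lemmas are invented for illustration.
TAGGED = [
    ("učiteljicu", "N", "učiteljicu"),
    ("летом", "ADV", "лето"),
    ("летом", "N", "лето"),
]
filtered = relemmatize(TAGGED, DICTIONARY)
```

In the last triple two noun lemmas remain possible, so the sketch keeps the tagger's guess; how such residual ambiguity should be handled is left open here.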

42

Numbers that illustrate the content of Serbian e-dictionaries

The number of inflection transducers (April 2010):

for nouns: 369
for verbs: 371
for adjectives: 66