Slide1
Electronic dictionaries
Duško Vitas
University of Belgrade, Faculty of Mathematics
Slide2
One definition of a (traditional) dictionary
A dictionary is a book in which the words and phrases of a language are listed alphabetically, together with their meanings or their translations in another language. (Collins Cobuild, English Dictionary for Advanced Users)
Slide3
From Dictionary.com
... a book, optical disc, mobile device, or online lexical resource containing a selection of the words of a language, giving information about their meanings, pronunciations, etymologies, inflected forms, derived forms, etc., expressed in either the same or another language...
Print dictionaries of various sizes, ranging from small pocket dictionaries to multivolume books, usually sort entries alphabetically...
All electronic dictionaries, whether online or installed on a device, can provide immediate, direct access to a search term, its meanings...
Slide4
Some definitions of e-dictionaries
According to one definition (Schryver, 2003), an e-dictionary is “any dictionary that can be used in an automated environment”.
Another definition (Jacquet-Pfau, 2002) says that “electronic dictionaries intended for automated processing of texts (corpora) differ from machine-readable dictionaries that are intended for human users”.
Slide5
Characteristics of e-dictionaries
E-dictionaries have to fulfill two basic criteria:
They have to be formally established so that computer programs can process them; in addition, e-dictionaries complement grammars, since all exceptions are listed in them.
They have to be exhaustive, since they have to cover 100% of the lexicon of the language in question; a parser that processes a text should not be impeded by unknown words. This aim is difficult to achieve.
As opposed to grammars, an e-dictionary tends to describe the lexical properties of lemmas extensively.
Slide6
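The exhaustiveness criterion can be checked mechanically by measuring how many word tokens of a text are found in a dictionary. The following is only a minimal sketch: the function name `coverage`, the regular-expression tokenizer and the tiny word list are assumptions made for illustration and are not part of Unitex.

```python
import re

def coverage(text, known_forms):
    """Return the share of word tokens found in known_forms and the unknown tokens."""
    # Assumption: a token is a maximal run of letters or digits, compared in lower case.
    tokens = re.findall(r"\w+", text.lower())
    unknown = [t for t in tokens if t not in known_forms]
    share = 1 - len(unknown) / len(tokens) if tokens else 1.0
    return share, unknown

# Hypothetical three-word "dictionary" of inflected forms.
known = {"pomerite", "miš", "miša"}
share, unknown = coverage("Pomerite miša, pomerite miš, pomerite kursor", known)
print(f"{share:.0%}", unknown)
```

Every token reported as unknown is a word that a parser would stumble over, which is why proper names, derived words and similar items have to be added to reach coverage close to 100%.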
Development of e-dictionaries
Can the development of an e-dictionary rely on some excellent traditional dictionary?
Traditional dictionaries are often limited in size (e.g. for commercial reasons);
Information in them is often implicit: they rely on the belief that a human will easily supply all missing data; for instance, a human will correctly deduce a whole paradigm if offered one or two inflectional endings.
Information is often partial (e.g. in Serbian the noun otac has two possible plural forms, očevi and oci; for automatic processing it is necessary to know explicitly whether it is possible to say ?očevi nacije ‘national founding fathers’, ?oci dece ‘fathers of the children’, or even Očevi i oci (title of a novel)).
Slide7
From a list of words to an e-dictionary
In the past, many computer scientists thought that a list of words taken from a traditional dictionary was a good starting point for the development of an e-dictionary.
This attitude was influenced by work done for English, which is not a typical European language from the point of view of automatic processing because of its modest inflection.
Before starting to develop an e-dictionary, it should be clear what its basic unit (lemma) is going to be, and how its other forms can be generated from it.
Slide8
Defining a basic unit of an e-dictionary
Automatic text processing usually begins with simple words as the basic units of texts. This is a natural starting point because they are formalized for most European languages. However, simple words are not always a natural unit of processing, because they are:
ambiguous (dictionaries offer several meanings for them);
pointless (many terms have several constituents, and the individual constituents do not contribute directly to the meaning of the term).
Because of that, dictionaries of simple words have to be complemented with other types of dictionaries and grammars that will provide natural units of processing.
Slide9
Types of e-dictionaries
E-dictionaries of simple words (dictionaries of simple graphemic units – these are usually entries in traditional dictionaries);
E-dictionaries of multi-word units (multi-word units that contain non-letter characters, terminology, collocations, phrases, ...);
Phonological e-dictionaries (pronunciation of simple words, with rules of how to pronounce inflected forms, words in contact, etc.);
Semantic e-dictionaries (simple words and multi-word units with encoded senses – network of senses?)
Slide10
A prerequisite for the development of an e-dictionary of simple words
A selection of lexical categories and a way to represent them. Traditional categories:
Part-of-Speech: noun, verb, adjective,...
subordinated categories: possessive, indefinite, definite,...
inflectional categories: masculine, feminine, neuter, nominative, genitive,...
syntactic categories: transitive, intransitive,...
semantic categories: human, abstraction, concrete object,...
Slide11
The selection of categories is not a straightforward task
A set of tags used to annotate the Brown corpus
A set of tags used to annotate the Penn Treebank
A set of tags used for the Multext-East project
Slide12
LADL format of electronic dictionaries
Unitex works with dictionaries that were developed by members of the Relex network.
Relex is an informal international network of laboratories that work on computational linguistics. It was established by Maurice Gross and his LADL team. (LADL is short for Laboratoire d'Automatique Documentaire et Linguistique.)
Members of the Relex network developed exhaustive e-dictionaries of simple words and compounds (http://infolingu.univ-mlv.fr/Relex/Relex.html)
Slide13
A selection of canonic forms
If a word has several surface forms, one of them is chosen as the canonic representative of the other, subordinate forms.
What are the canonic forms in e-dictionaries of French?
For nouns, as a rule, it is the masculine singular form;
For verbs, it is the infinitive form
...
Slide14
Is the selection of a canonic form unique?
It is neither simple nor unique. For instance:
In French, the gender of nouns is an inflectional category; that is, lecteur, lecteurs, lectrice, lectrices are four forms of the same word, and its canonic form is lecteur.
In Serbian, the gender of nouns is not an inflectional category; so učitelj and učiteljica are two canonic forms (traditionally, as well as in the Serbian e-dictionary), each with its own subordinate forms.
Similarly in Bulgarian: учител and учителка
Slide15
Why is the adequate selection of a canonic form so important?
A lot of information about a word is attached to its canonic form, and all subordinate forms share that information: učiteljica has the semantic features +Hum+Prof, and all its inflected forms have the same features: učiteljice, učiteljici, učiteljicu,...
Is this a rule that releases us from making further decisions?
No, in Serbian gender and animacy are features of subordinate forms, not of canonic forms.
Why?
Nouns can change gender in plural forms: vladika (m), vladike (f)
In order to treat the same category always in the same way (for nouns, adjectives, pronouns, numerals, etc.)
Slide16
More about categories attached to canonic and subordinate forms
First, there was a mouse
This mouse is alive
Its canonic form is miš,N+Zool
Then came a mouse
This mouse is not alive
Its canonic form is miš,N+Conc
What is the value of the grammatical category “animacy” for this new mouse?
Da biste se prebacili na sledeće poglavlje DVD-a, pomerite miša ‘To switch to the next chapter of the DVD, move the mouse’
Da biste kontrolisali reprodukciju televizije uživo, pomerite miš kako bi se prikazale kontrole za reprodukciju ‘To control the playback of live TV, move the mouse so that the playback controls are displayed’
Google: 28,300
Google: 19,000
Slide17
More on the selection of a canonic form
Passive past participles are not separate entries in Serbian traditional dictionaries; these forms belong to the verb paradigm.
What about passive past participles that are used as adjectives?
A program for automatic text processing has to recognize them somehow and tag them appropriately.
For instance, a sample of “Politika” with 582,000 simple word tokens contains 228 adjectives derived from the past participle in the feminine gender accusative alone (they are not all correct)
Slide18
And what about...
Present past participle (functioning as an adjective) – “Politika” – occurrences in the feminine accusative singular forms;
Present gerund (functioning as an adjective) – “Politika” – occurrences in the feminine accusative singular forms
Derivational forms
Possessive adjectives – dečakov, partizanov,...
Diminutives – tkaninica, futrolica, telefonče,...
Gender motion – druidica, gutačica, gudačica, guvernerka,...
Šezedestogodišnjakinja, četvoroipomesečni, dvestopedestogodišnjica,...
In the Serbian e-dictionary they all have separate canonic forms, each with its own subordinate forms.
Slide19
In order to obtain (close to) 100% coverage of a text, it is necessary to include:
colors – skerletnocrven, bledoplav, mlečnožut,...
Proper names – personal names, geopolitical names
organizations (Ozna, Gestapo, Metropoliten,...)
objects – trademarks (lajka, spitfajer, mercedes,...)
Titles and characters of novels, films, operas… (Dezdemona, Asteriks, Plavobradi...)
events (Anšlus,...)
And then also – donžuanstvo, arsenlupenovski, neotitoizam, nedićevština,...
Slide20
Details of the LADL format
There are two dictionaries (or lists) of simple words in the LADL format:
First dictionary – DELAS – is a dictionary of canonic forms (lemmas). This dictionary is used to generate the second dictionary.
Second dictionary – DELAF – is a dictionary of subordinate (or inflected) forms. Only this dictionary is used in automatic text processing.
Slide21
An entry in a DELAS dictionary
lemma,Kn+Prop
K: A Part-of-Speech code; usually a code consisting of one or more upper-case letters.
n: A relation with subordinate forms, if they exist; usually an alphanumeric code that, together with the PoS code, enables the generation of all subordinate forms for a DELAF dictionary.
Prop: Syntactic, semantic, dialect, usage, domain,… markers; markers that can be freely attached to any canonic form, in the form of alphanumeric codes.
Slide22
An example of a DELAS entry from the Serbian e-dictionary
učiteljica,N651+Hum+GM
učiteljica: canonic form (lemma)
N: Part-of-Speech (noun)
(N)651: inflection class code used to generate all inflected forms
+Hum: human
+GM: feminine gender noun derived from the corresponding masculine gender noun učitelj
Slide23
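A DELAS line of the shape lemma,Kn+Prop can be taken apart with a few string operations. This is a minimal sketch and not Unitex code: the names `DelasEntry` and `parse_delas` are hypothetical, the PoS code is assumed to be written in upper-case Latin letters (as in the Serbian entries), and lemmas that themselves contain a comma are not handled.

```python
import re
from typing import NamedTuple

class DelasEntry(NamedTuple):
    lemma: str
    pos: str               # K: Part-of-Speech code
    inflection_class: str  # n: may be empty for uninflected words
    markers: list          # Prop: +Hum, +GM, ...

# Assumption: K is the leading run of upper-case Latin letters, n is whatever follows it.
CODE_RE = re.compile(r"^([A-Z]+)(\w*)$")

def parse_delas(line):
    lemma, _, info = line.strip().partition(",")
    code, *markers = info.split("+")
    match = CODE_RE.match(code)
    if not lemma or match is None:
        raise ValueError(f"not a DELAS entry: {line!r}")
    return DelasEntry(lemma, match.group(1), match.group(2), markers)

print(parse_delas("učiteljica,N651+Hum+GM"))
print(parse_delas("deset,NUM+v5"))
```

For uninflected words such as ali,CONJ the inflection class is simply the empty string, since there are no subordinate forms to generate.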
Examples of Serbian DELAS entries for various PoS
učiteljica,N651+Hum+GM
zagasitocrven,A6+Col
smejati,V516+Imperf+It+Ref+Ek
ćutke,ADV
deset,NUM+v5
poneko,PRO+ProN+Indef+Sr
ali,CONJ
od,PREP+p2
jaoj,INT
naime,PAR
Slide24
An example of a DELAS entry from the Bulgarian e-dictionary
глава,С600+Ж
глава: canonic form (lemma)
С: Part-of-Speech (noun)
(С)600: inflection class code used to generate all inflected forms
+Ж: feminine
Slide25
Examples of Bulgarian DELAS entries for various PoS
глава,С(600)+Ж
червен,ПРИ(3)
абе,абе.МЕЖ
ако,ако.СЮ+П
вместо,вместо.ПРЕД
вредно,вредно.НАР
даже,даже.ЧА
дам.Г+С+Т
…
Slide26
An entry in a DELAF dictionary:
word form,lemma.K+Prop(:gc)*
lemma: Canonic form (or lemma);
K: A Part-of-Speech code (inherited from its lemma)
Prop: Syntactic, semantic, dialect, usage, domain,… markers (inherited from its lemma)
gc: A set of codes that represent values of grammatical categories describing a form: grammatical categories depend on the PoS; these are one-character alphanumeric codes.
Slide27
An example of a DELAF entry from the Serbian e-dictionary
učiteljicu,učiteljica.N+Hum+GM:fs4v
učiteljicu: subordinate form (realization)
učiteljica: canonic form (lemma)
N: PoS (inherited from the canonic form)
+Hum+GM: markers (inherited from the canonic form)
fs4v: values of grammatical categories:
f: category gender (value feminine)
s: category number (value singular)
4: category case (value accusative)
v: category animacy (value animate)
Slide28
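A DELAF line of the shape word form,lemma.K+Prop(:gc)* can be read in the same spirit. The sketch below is illustrative only: `DelafEntry`, `parse_delaf` and `decode` are hypothetical names, only the four code values explained for učiteljicu are decoded, and lemmas containing a full stop are not handled. Treating an empty lemma field as "same as the word form" follows the DELAF convention visible in entries such as летом,.ADV.

```python
from typing import NamedTuple

class DelafEntry(NamedTuple):
    form: str
    lemma: str
    pos: str
    markers: list
    codes: list  # one string of grammatical codes per analysis, e.g. "fs4v"

# Only the values documented for the Serbian example; other characters stay undecoded.
KNOWN_VALUES = {"f": "feminine", "s": "singular", "4": "accusative", "v": "animate"}

def parse_delaf(line):
    form, _, rest = line.strip().partition(",")
    lemma, _, info = rest.partition(".")
    head, *codes = info.split(":")
    pos, *markers = head.split("+")
    # An empty lemma field means that the lemma is identical to the word form.
    return DelafEntry(form, lemma or form, pos, markers, codes)

def decode(code):
    """Spell out each one-character code that is known, keep the rest as is."""
    return [KNOWN_VALUES.get(ch, ch) for ch in code]

entry = parse_delaf("učiteljicu,učiteljica.N+Hum+GM:fs4v")
print(entry)
print(decode(entry.codes[0]))
```

Because (:gc)* may repeat, a single line can carry several analyses; a line with two code groups yields two strings in `codes`.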
The whole paradigm of the lemma učiteljica
učiteljica,učiteljica.N:fp2v
učiteljica,učiteljica.N:fs1v
učiteljice,učiteljica.N:fp5v
učiteljice,učiteljica.N:fp4v
učiteljice,učiteljica.N:fp1v
učiteljice,učiteljica.N:fs5v
učiteljice,učiteljica.N:fw4v
učiteljice,učiteljica.N:fw2v
učiteljice,učiteljica.N:fs2v
učiteljici,učiteljica.N:fs7v
učiteljici,učiteljica.N:fs3v
učiteljicu,učiteljica.N:fs4v
učiteljicom,učiteljica.N:fs6v
učiteljicama,učiteljica.N:fp7v
učiteljicama,učiteljica.N:fp6v
učiteljicama,učiteljica.N:fp3v
The numeric code 651 that connects a canonic form with all of its subordinate forms is deleted because it is of no use anymore.
Slide29
The whole paradigm of the lemma глава
глава,глава.С+Ж:s0
главата,глава.С+Ж:sd
глави,глава.С+Ж:p0
главите,глава.С+Ж:pd
Slide30
The whole paradigm of the lemma
дам
дадете,дам.Г+С+Т:R2p
дадеш,дам.Г+С+Т:R2s
дадеше,дам.Г+С+Т:D2s:D3s
дадох,дам.Г+С+Т:E1s
дадоха,дам.Г+С+Т:E3p
дадохме,дам.Г+С+Т:E1p
дадохте,дам.Г+С+Т:E2p
дадял,дам.Г+С+Т:Wsm
дадяла,дам.Г+С+Т:Wsf
дадяло,дам.Г+С+Т:Wsn
дадях,дам.Г+С+Т:D1s
дадяха,дам.Г+С+Т:D3p
дадяхме,дам.Г+С+Т:D1p
дадяхте,дам.Г+С+Т:D2p
дай,дам.Г+С+Т:I2s
дайте,дам.Г+С+Т:I2p
дал,дам.Г+С+Т:Xsm0
дала,дам.Г+С+Т:Xsf0
далата,дам.Г+С+Т:Xsfd
дадат,дам.Г+С+Т:R3p
даде,дам.Г+С+Т:E2s:E3s
даде,дам.Г+С+Т:R3s
дадели,дам.Г+С+Т:Wp
дадем,дам.Г+С+Т:R1p
даден,дам.Г+С+Т:Qsm0
дадена,дам.Г+С+Т:Qsf0
дадената,дам.Г+С+Т:Qsfd
дадени,дам.Г+С+Т:Qp0
дадените,дам.Г+С+Т:Qpd
дадения,дам.Г+С+Т:Qsmh
даденият,дам.Г+С+Т:Qsml
дадено,дам.Г+С+Т:Qsn0
даденото,дам.Г+С+Т:Qsnd
дали,дам.Г+С+Т:Xp0
далите,дам.Г+С+Т:Xpd
далия,дам.Г+С+Т:Xsmh
далият,дам.Г+С+Т:Xsml
дало,дам.Г+С+Т:Xsn0
далото,дам.Г+С+Т:Xsnd
дам,дам.Г+С+Т:R1s
Slide31
How is the relation between a canonic form and its subordinate (inflected) forms established?
In the Unitex system, Finite State Transducers (FSTs) are used for this.
The inflection class code that follows the PoS code in DELAS (the dictionary of lemmas) is used to generate all inflected forms.
One transducer is usually used to generate forms for many lemmas. For instance, the transducer N2 generates inflected forms for: emir, evrofil, dijetetičar, forenzičar, leptir, šegrt, and many other lemmas.
Slide32
FST defines classes (BG)
In most languages the relation between a lemma and its forms is an intuitive relation of equivalence that is formalized, in the case of the LADL format, by FSTs:
син/sg,indef
синове/pl,indef
сине/sg,voc
сина/pl,count
син(<E>/sg,indef + ове/pl,indef + е/sg,voc + а/pl,count)
(...?...)(<E>/sg,indef + ове/pl,indef + е/sg,voc + а/pl,count)
N01: (<E>/sg,indef + ове/pl,indef + е/sg,voc + а/pl,count)
Slide33
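The class N01, (<E>/sg,indef + ове/pl,indef + е/sg,voc + а/pl,count), can be imitated with a plain list of suffix rules. This is only a sketch of the idea: the list-of-pairs representation and the function `inflect` are assumptions made for illustration and are not the Unitex transducer format. Real inflection transducers may also remove or change letters of the lemma, which pure suffixation does not cover.

```python
# Hypothetical encoding of an inflection class as (suffix, tags) pairs.
# The empty string plays the role of <E> in the slide's notation.
N01 = [
    ("", "sg,indef"),
    ("ове", "pl,indef"),
    ("е", "sg,voc"),
    ("а", "pl,count"),
]

def inflect(lemma, rules):
    """Generate (form, tags) pairs by appending each suffix of the class to the lemma."""
    return [(lemma + suffix, tags) for suffix, tags in rules]

for form, tags in inflect("син", N01):
    print(f"{form}/{tags}")
```

The same rule list can be applied to every lemma that belongs to the class, which is the sense in which one transducer serves many lemmas.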
Dictionaries for other languages
Russian: developed at CIS, Munich (CISLEX-RU), derived mostly from Zaliznyak, A., Grammaticheskij slovar' russkogo jazyka; it contains approximately 44,000 lemmas (930,000 forms)
Slide34
разговориться,.V+intr+sv:AI
при,пря.N+anim(j)+gen(F):geF:nm:ajm
при,.PRAEP+gov(q)
при,переть.V+nsv+tr:A2eb
первой,первый.A+Ord:geF:deF:teF:qeF
встрече,встреча.N+anim(j)+gen(F):deF:qeF
объявил,объявить.V+sv+tr:AeMVi
нынешним,нынешний.A:teM:teN:dm
летом,.ADV
летом,лето.N+anim(j)+gen(N):teN
летом,лет.N+anim(j)+gen(M):teM
Капе,Капа.N+PN+VORN+anim(o)+gen(M)+style(colloq):deM:qeM
Капе,Капа.N+PN+VORN+anim(o)+gen(F)+style(colloq):deF:qeF
Slide35
Капе,Капа.N+PN+VORN+anim(o)+gen(M)+style(colloq):deM:qeM
Капе – word form
Капа – lemma
N – noun
PN – proper noun
VORN – given name
anim(o) – animate
gen(M) – masculine gender
style(colloq)
deM:qeM – dative or prepositional case, singular (e), masculine
Slide36
Dictionaries for other languages
Polish (Z. Vetulani, Adam Mickiewicz University, 1996)
marcu,marzec.N+Gi+Ns+Cl
marcu,marzec.N+Gi+Ns+Cv
marcu,marzec.N+month:L
marynarka,.N+Gf+Ns+Cn
masową,masowy.ADJ+Dp+Ns+Cai+Gf
masowe,masowy.ADJ+Dp+Np+Cnav+Gaifn
Slide37
Dictionaries for other languages
Latin dictionary derived from the Perseus project, based on the Lewis & Short dictionary (1879)
abaculus,.N:Nms
abacum,abax.N:Gmp
abacum,abacus.N:Ams
abacum,abacus.N+poet:Gmp
abaddon,ab-addo.V:1siPC
abagmentum,.N:Vns
abagmentum,.N:Nns
abagmentum,.N:Ans
Slide38
Comparison between the L&S, Georges and Whitaker dictionaries
All three dictionaries are available in electronic form.
L&S supports processing on the Perseus site
Georges is available on-line
Whitaker’s Words is an application that performs morphological analysis
But their content is different.
E.g. abacinus exists in Georges and Whitaker, but not in L&S
Slide39
What else should be known?
The use of upper-case and lower-case letters in a dictionary:
Canonic forms written in lower-case letters can match both lower-case and upper-case occurrences in a text.
Canonic forms written with (some) upper-case letters can match only occurrences in a text that use upper-case letters in the same position(s).
For instance,
vlada,N600
Vlada,N1741+NProp+Hum+First
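The two matching rules can be written as a character-by-character comparison. The function `entry_matches` below is a hypothetical helper written for illustration; it restates the rule given here and is not taken from Unitex.

```python
def entry_matches(entry, token):
    """Check whether a dictionary form may match a token found in a text.

    A lower-case letter in the entry accepts either case in the token;
    an upper-case letter in the entry accepts only that same upper-case letter.
    """
    if len(entry) != len(token):
        return False
    for e, t in zip(entry, token):
        if e.isupper():
            if t != e:
                return False
        elif t.lower() != e:
            return False
    return True

for token in ("vlada", "Vlada", "VLADA"):
    print(token, entry_matches("vlada", token), entry_matches("Vlada", token))
```

With these rules the token vlada is analysed only as the common noun, while Vlada (for example at the beginning of a sentence) remains ambiguous between the common noun and the first name.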
some results from the corpus “Politika”: vlada and Vlada
Slide40
What else should be known, or not?
A user who will not produce a new dictionary (e.g. for a new language, or a dictionary for some sub-domain) need not know the format of DELAS dictionaries, nor does he/she have to know what inflectional transducers are or what they look like.
A user who wants to use dictionaries for text processing needs to know the content of the DELAF dictionaries he/she plans to use and what the different codes and markers mean.
The dictionaries that he/she uses are compiled dictionaries (two files with the extensions .bin and .inf), and Unitex uses them very efficiently.
These dictionaries cannot be “seen”.
Slide41
E-dictionary as statistical tagger
Filtering the results of word form tagging by TnT, TreeTagger, etc. with e-dictionaries transforms the results into “real” lemmas (a part of the ambiguity is lost, but the result is >95% correct :-)
Slide42
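Filtering tagger output with an e-dictionary can be pictured as follows: for each word form, keep only the dictionary lemmas whose PoS agrees with the tag chosen by the tagger. This is a minimal sketch under stated assumptions: the lookup table `DELAF` and the function `real_lemmas` are hypothetical, and the tagger's tags are assumed to be already mapped to the dictionary's PoS codes (in practice a tagset mapping would be needed). The analyses of летом are those of the Russian sample entries.

```python
# Hypothetical DELAF lookup: word form -> list of (lemma, PoS) analyses.
DELAF = {
    "летом": [("летом", "ADV"), ("лето", "N"), ("лет", "N")],
}

def real_lemmas(form, tagger_pos, delaf):
    """Return the dictionary lemmas of a form whose PoS agrees with the tagger's tag."""
    lemmas = []
    for lemma, pos in delaf.get(form.lower(), []):
        if pos == tagger_pos and lemma not in lemmas:
            lemmas.append(lemma)
    return lemmas

print(real_lemmas("летом", "ADV", DELAF))
print(real_lemmas("летом", "N", DELAF))
```

When the tagger chooses ADV a single lemma remains; when it chooses N, two candidate lemmas survive, so some ambiguity can remain even after filtering. An empty result signals a form that the dictionary does not know with that PoS.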
Numbers that illustrate the content of Serbian e-dictionaries
The number of inflection transducers (April 2010):
for nouns: 369
for verbs: 371
for adjectives: 66