Slide1
Utilizing vector models
for automatic text lemmatization
Ladislav Gallay
Supervisor: Ing. Marián Šimko, PhD.
Slovak University of Technology
Faculty of Informatics and Information Technologies
Slide2
Lemmatization
basic form of a word: houses > house, best > good
a basic task in natural language processing
understanding of text > sentence > word
unifies different forms of the same word
similar to stemming
highly inflected languages are the most problematic
multiple lemmas depending on context: axes -> axe or axis?
Slide3
Related work
dictionary approach
  list of pairs: waiting – wait | waited – wait | houses – house
  for Slovak: 2.5 million tokens, 77 000 unique words (Garabík, 2004)
rule-based approach
  based on a given set of rules: houses – house | waited – wait
  the Slovak language has many conditions and exceptions
  cti – česť (honor)
combination – Tvaroslovník (Krajči, 2006)
  new pairs are created from the existing ones
  find a similar word based on the longest suffix
  ponúk – ponuka, rúk – ruka (offer, hand)
Slide4
Vector models of words
capture relationships between words
  semantic and grammatical relationships (Mikolov, 2013)
faster and more accurate training by utilizing neural networks
arithmetic operations
every word is represented by an N-dimensional latent vector (N between 100 and 300)
word2vec tool by Google
Slide5
Vector models of words
vector(“king”) – vector(“man”) + vector(“woman”) ≈ vector(“queen”)
man is to woman as king is to queen
https://code.google.com/p/word2vec/
[Figure: word vectors KING, QUEEN, MAN, WOMAN]
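The analogy above can be tried out on a toy example. This is a minimal sketch, not the word2vec tool itself: the 2-D vectors in `VECTORS` are invented for illustration (real models use 100 to 300 dimensions), and the helper names `cosine` and `analogy` are assumptions.

```python
import math

# Toy 2-D vectors invented for illustration; not taken from a trained model.
VECTORS = {
    "king": (0.9, 0.8),
    "queen": (0.9, 0.2),
    "man": (0.1, 0.8),
    "woman": (0.1, 0.2),
    "house": (0.5, 0.5),
}


def cosine(u, v):
    """Cosine similarity of two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Word closest to vector(a) - vector(b) + vector(c), excluding a, b and c."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    pool = (w for w in vectors if w not in (a, b, c))
    return max(pool, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman", VECTORS))
```

With these toy vectors the nearest remaining word is "queen"; on a real model the result depends on the training corpus.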
Slide6
Our approach - idea
vodník – vodyanoy
vodníkom – with vodyanoy
rybník – pond / lake
rybníkom – with pond
Slide7
Lemmatization utilizing vector models
we need to know the correct vector shift
we can observe similar known pairs and their vector shifts
auto-select several similar pairs from the reference dictionary
for each pair and the input word, calculate lemma candidates
choose the best candidate based on the given weights
[Diagram labels: Reference pairs, Input word, Vector model, Algorithm, Lemma]
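The steps on this slide can be sketched as follows. This is an illustrative reading, not the authors' implementation: the 4-D vectors are invented, the function name `lemma_candidates` is an assumption, and every candidate is counted with weight 1.

```python
import math
from collections import Counter

# Toy 4-D vectors invented for illustration; the last axis loosely stands for
# "instrumental case". A real model would be trained on a corpus.
VECTORS = {
    "rybník": (0.0, 1.0, 0.0, 0.0),
    "rybníkom": (0.0, 1.0, 0.0, 1.0),
    "dub": (0.0, 0.0, 1.0, 0.0),
    "dubom": (0.0, 0.0, 1.0, 0.9),
    "vodník": (1.0, 0.0, 0.0, 0.0),
    "vodníkom": (1.0, 0.1, 0.0, 0.9),
}


def cosine(u, v):
    """Cosine similarity of two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def lemma_candidates(word, reference_pairs, vectors):
    """One candidate per (form, lemma) reference pair: shift the input vector
    by vector(lemma) - vector(form) and take the nearest other word."""
    candidates = []
    for form, lemma in reference_pairs:
        shifted = tuple(
            w + l - f for w, l, f in zip(vectors[word], vectors[lemma], vectors[form])
        )
        pool = (c for c in vectors if c != word)
        candidates.append(max(pool, key=lambda c: cosine(shifted, vectors[c])))
    return candidates


pairs = [("rybníkom", "rybník"), ("dubom", "dub")]
votes = Counter(lemma_candidates("vodníkom", pairs, VECTORS))
print(votes.most_common(1)[0][0])
```

Here both reference pairs point to "vodník", so it wins the vote; with real vectors the candidates usually disagree, which is why the weighting step matters.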
Slide8
Relevant reference pairs selection
R1 – Suffix length
  autom: putom, plotom, vlakom
R2 – Cosine similarity
  semantically closest words
  obtained from the vector model
  autom: autobusom, šoférom, kolese
R3 – Grammatical categories
  reference pairs are grouped into categories
  for the input word we know which category to select from
  e.g. singular, locative case, gender “mesto”
  autom: mestom, hniezdom, cestom
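Variant R1 can be illustrated with a short sketch that ranks reference pairs by the length of the suffix they share with the input word. The helper names are assumptions, and the lemmas paired with the forms from the slide (puto, plot, vlak) were added for the example rather than taken from the source.

```python
def common_suffix_len(a, b):
    """Length of the longest common suffix of two strings."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def select_by_suffix(word, reference_pairs, k=3):
    """Return the k (form, lemma) pairs whose form shares the longest suffix with word."""
    ranked = sorted(
        reference_pairs,
        key=lambda pair: common_suffix_len(word, pair[0]),
        reverse=True,
    )
    return ranked[:k]


# Forms come from the slide; the lemmas are assumed for illustration.
REFERENCE_PAIRS = [
    ("steny", "stena"),
    ("vlakom", "vlak"),
    ("putom", "puto"),
    ("plotom", "plot"),
]

for form, lemma in select_by_suffix("autom", REFERENCE_PAIRS, k=3):
    print(form, lemma, common_suffix_len("autom", form))
```

The ranking only looks at surface strings, so it needs no vector model; R2 and R3 would replace the sort key with cosine similarity or a category filter.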
Slide9
Lemma candidate weight computation
every candidate is given a weight based on its similarity to the input word
  the lemma is similar to the input word
  for the input word autom (with car) it is obvious that slon (elephant) is not a correct lemma
DM0: Ignored (weight = 1)
DM1: Levenshtein distance
DM2: Jaro-Winkler distance
DM3: Relative prefix length
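Two of the measures above, DM1 and DM3, can be sketched in a few lines. The slides do not say how a distance is turned into a weight, so the normalisations used here (1 minus distance over the longer length, and common prefix length over the longer length) are assumptions, as are the function names.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def weight_levenshtein(word, candidate):
    """DM1-style weight in [0, 1]; the normalisation is an assumption."""
    longest = max(len(word), len(candidate)) or 1
    return 1 - levenshtein(word, candidate) / longest


def weight_prefix(word, candidate):
    """DM3-style weight: common prefix length relative to the longer string (assumed)."""
    common = 0
    for x, y in zip(word, candidate):
        if x != y:
            break
        common += 1
    return common / (max(len(word), len(candidate)) or 1)


for cand in ("auto", "slon"):
    print(cand, weight_levenshtein("autom", cand), weight_prefix("autom", cand))
```

Under both measures "auto" scores far higher than "slon" for the input "autom", which matches the intuition stated on the slide.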
Slide10
Input / Output example
Input word: stromoch (about trees)
Expected lemma: strom (tree)
Reference pairs
SSip6 duboch dub
SSfs2 ženy žene
SSip6 zuboch zub
SSip6 koláčoch koláč
SSfp1 steny stena
…
Slide11
Evaluation
latent vector model trained on the Slovak National Corpus
  655 572 511 words
reference pairs and input words extracted from the annotated lexicon by the Ľudovít Štúr Institute of Linguistics (dictionary-based approach)
nouns only
output is an ordered list of candidates
  the first one is the expected lemma
results are evaluated as true or false
Slide12
Evaluation – results (1)
comparison of reference pair selection variants
random input words from the corpus
@k represents whether the correct lemma occurs in the top k positions
R1 – suffix length
R2 – cosine similarity
R3 – grammatical categories
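The @k measure can be computed as the share of input words whose correct lemma appears among the first k candidates. The sketch below assumes that reading; the function name `accuracy_at_k` and the candidate lists are made up for illustration and are not results from the evaluation.

```python
def accuracy_at_k(ranked_candidates, expected, k):
    """Fraction of inputs whose expected lemma is among the top k candidates."""
    hits = sum(
        1 for cands, gold in zip(ranked_candidates, expected) if gold in cands[:k]
    )
    return hits / len(expected)


# Made-up candidate lists for three input words, best candidate first.
RANKED = [
    ["strom", "stromy"],
    ["dubo", "dub"],
    ["zubo", "zuba"],
]
EXPECTED = ["strom", "dub", "zub"]

for k in (1, 2):
    print(k, accuracy_at_k(RANKED, EXPECTED, k))
```

Because the top k lists are nested, the value can only stay the same or grow as k increases.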
Slide13
Evaluation – results (1)
comparison of reference pair selection variants
the most frequent words from the corpus
@k represents whether the correct lemma occurs in the top k positions
R1 – suffix length
R2 – cosine similarity
R3 – grammatical categories
Slide14
Evaluation – results (2)
comparison of weight computation methods
DM0 - Ignored (weight = 1)
DM1 - Levenshtein distance
DM2 - Jaro-Winkler distance
DM3 - Relative prefix length
Slide15
Evaluation – results (3)
correlation between correctness and coverage
searching for the threshold D
Slide16
Conclusion
vector models are promising for automatic lemmatization
  minimal human input
  language independent
  viable for languages with a small knowledge base
strong dependency on the corpus used for training
further work
  evaluation on other parts of speech (besides nouns)
  other variants of reference pair selection
  lemma candidate weighting utilizing morphological or language-specific regularities
  lemmatization including context