Utilizing vector models for automatic text lemmatization

Uploaded On 2020-11-06

Presentation Transcript

Slide1

Utilizing vector models for automatic text lemmatization

Ladislav Gallay

Supervisor: Ing. Marián Šimko, PhD.

Slovak University of Technology

Faculty of Informatics and Information Technologies

Slide2

Lemmatization

basic form of a word: houses > house, best > good
the basic task in natural language processing
understanding of text > sentence > word
unify different forms of the same word
similar to stemming
highly inflected languages are the most problematic
multiple lemmas based on context: axes -> axe or axis?

Slide3

Related work

dictionary approach
  list of pairs: waiting – wait | waited – wait | houses – house
  for Slovak: 2.5 million tokens, 77 000 unique words (Garabík, 2004)
rule-based approach
  based on given set of rules: houses – house | waited – wait
  Slovak language has many conditions and exceptions
  cti – česť (honor)
combination – Tvaroslovník (Krajči, 2006)
  based on the existing pairs new ones are created
  find similar word based on the longest suffix
  ponúk – ponuka | rúk – ruka (offer, hand)

Slide4

Vector models of words

captures relationships between words
  semantic and grammatical relationships (Mikolov, 2013)
faster and more accurate training by utilizing neural networks
arithmetic operations
every word is represented by N-dimensional latent vector (N between 100 and 300)
word2vec tool by Google

Slide5

Vector models of words

vector(“king”) – vector(“man”) + vector(“woman”) = vector(“queen”)

man is to woman as king is to queen

https://code.google.com/p/word2vec/

[Figure: word vectors labelled KING, QUEEN, MAN, WOMAN]
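
The analogy above can be tried on a toy scale. The following is a minimal sketch, assuming hand-made 2-D vectors invented purely for illustration (a trained model would supply 100 to 300 dimensions); the names `VECTORS`, `cosine` and `analogy` are hypothetical and are not part of the word2vec tool.

```python
import math

# Toy 2-D vectors invented for illustration only; real vectors would be
# learned from a corpus and have 100-300 dimensions.
VECTORS = {
    "king": [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man": [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [0.5, -0.7],
}


def cosine(a, b):
    """Cosine similarity of two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words themselves are excluded from the answer.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman", VECTORS))
```

With these toy values the nearest remaining word is "queen"; with a real model the result is only approximately equal to the target vector, which is why a nearest-neighbour search is used.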

Slide6

Our approach - idea

vodník – vodyanoy

vodníkom – with vodyanoy

rybník – pond / lake

rybníkom – with pond

Slide7

Lemmatization utilizing vector models

we need to know correct vector shift
we can observe similar known pairs and their vector shifts
auto-select several similar pairs from reference dictionary
for each pair and the input word calculate lemma candidates
choose the best candidate based on given weights

[Diagram labels: Reference pairs, Input word, Vector model, Algorithm, Lemma]
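
The steps on this slide can be sketched roughly as follows. This is not the authors' implementation: the 4-D vectors are invented so that the last dimension plays the role of the case shift, and summing cosine similarities over all reference pairs is an assumed way of combining candidates. `VECTORS` and `lemma_candidates` are hypothetical names.

```python
import math
from collections import defaultdict

# Toy 4-D vectors invented for illustration; the last dimension stands in
# for the "instrumental case" shift that a real model would encode less cleanly.
VECTORS = {
    "vodník": [1.0, 0.0, 0.0, 0.0],
    "vodníkom": [1.0, 0.0, 0.0, 1.0],
    "rybník": [0.0, 1.0, 0.0, 0.0],
    "rybníkom": [0.0, 1.0, 0.0, 1.0],
    "auto": [0.0, 0.0, 1.0, 0.0],
    "autom": [0.0, 0.0, 1.0, 1.0],
    "slon": [0.5, 0.5, 0.5, 0.0],
}


def cosine(a, b):
    """Cosine similarity of two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lemma_candidates(word, reference_pairs, vectors):
    """Rank vocabulary words as lemma candidates for `word`.

    For every reference pair (form, lemma) the shift lemma - form is applied
    to the input word, and each vocabulary word is scored by its cosine
    similarity to the shifted vector. Scores are summed over all pairs
    (an assumed aggregation, chosen for simplicity).
    """
    scores = defaultdict(float)
    for form, lemma in reference_pairs:
        shift = [l - f for l, f in zip(vectors[lemma], vectors[form])]
        target = [w + s for w, s in zip(vectors[word], shift)]
        for other, vec in vectors.items():
            if other != word:
                scores[other] += cosine(target, vec)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


pairs = [("vodníkom", "vodník"), ("autom", "auto")]
ranked = lemma_candidates("rybníkom", pairs, VECTORS)
print(ranked[0][0])
```

With these toy values the top candidate for "rybníkom" is "rybník". In practice the ranking would be combined with the candidate weights described later in the deck.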

Slide8

Relevant reference pairs selection

R1 – Suffix length
  autom: putom, plotom, vlakom
R2 – Cosine similarity
  semantically closest words
  obtained from the vector model
  autom: autobusom, šoférom, kolese
R3 – Grammatical categories
  reference pairs are grouped into categories
  for the input word we know which category to select from
  e.g. singular, locative case, gender “mesto”
  autom: mestom, hniezdom, cestom
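
A minimal sketch of the first two selection variants, under stated assumptions: R1 is read here as "longest common suffix with the input word", R2 as "highest cosine similarity in the vector model", and the 2-D vectors are invented so that the slide's example words come out on top. All function names are hypothetical.

```python
import math


def common_suffix_length(a, b):
    """Number of trailing characters shared by two words."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n


def select_by_suffix(word, forms, k=3):
    """R1 (assumed reading): the k forms sharing the longest suffix with `word`."""
    return sorted(forms, key=lambda f: common_suffix_length(word, f), reverse=True)[:k]


def cosine(a, b):
    """Cosine similarity of two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_by_cosine(word, forms, vectors, k=3):
    """R2 (assumed reading): the k forms closest to `word` in the vector model."""
    return sorted(forms, key=lambda f: cosine(vectors[word], vectors[f]), reverse=True)[:k]


# Toy vectors invented for illustration only.
VECTORS = {
    "autom": [1.0, 0.1],
    "autobusom": [0.9, 0.2],
    "šoférom": [0.8, 0.3],
    "kolese": [0.7, 0.4],
    "putom": [0.0, 1.0],
    "plotom": [0.1, 0.9],
}

r1 = select_by_suffix("autom", ["vlakom", "kolese", "putom", "ženy", "plotom"])
r2 = select_by_cosine("autom", [w for w in VECTORS if w != "autom"], VECTORS)
print(r1)
print(r2)
```

R1 needs no vector model at all, while R2 depends entirely on it, which is one reason the two variants can select very different reference pairs for the same input word.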

Slide9

Lemma candidate weight computation

every candidate is given a weight based on the similarity with the input word
  lemma is similar to the input word
  for the input word autom (with car) it is obvious that slon (elephant) is not a correct lemma

DM0: Ignored (weight = 1)
DM1: Levenshtein distance
DM2: Jaro-Winkler distance
DM3: Relative prefix length
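
Two of these measures can be sketched as below. The slides do not say how a distance is turned into a weight, so the normalisations here are assumptions: DM1 is taken as 1 minus the Levenshtein distance divided by the longer word's length, and DM3 as the common prefix length divided by the longer word's length. Jaro-Winkler (DM2) is omitted for brevity.

```python
def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def weight_levenshtein(word, candidate):
    """DM1 as a weight in [0, 1]; the normalisation is an assumption."""
    longest = max(len(word), len(candidate))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(word, candidate) / longest


def weight_prefix(word, candidate):
    """DM3 as common prefix length over the longer length (assumed definition)."""
    longest = max(len(word), len(candidate))
    if longest == 0:
        return 1.0
    common = 0
    for x, y in zip(word, candidate):
        if x != y:
            break
        common += 1
    return common / longest


for candidate in ("auto", "slon"):
    print(candidate,
          weight_levenshtein("autom", candidate),
          weight_prefix("autom", candidate))
```

Both weights rank auto well above slon for the input word autom, which matches the intuition stated on the slide.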

Slide10

Input / Output example

Input word: stromoch (about trees)
Expected lemma: strom (tree)

Reference pairs

SSip6  duboch    dub
SSfs2  ženy      žene
SSip6  zuboch    zub
SSip6  koláčoch  koláč
SSfp1  steny     stena
…
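
Selecting reference pairs by grammatical category (R3) might look like the sketch below. It uses only the pairs listed on this slide, copied as they appear; the tag SSip6 for the input word stromoch is an assumption, since the slide does not state it, and `group_by_tag` and `pairs_for` are hypothetical names.

```python
from collections import defaultdict

# Reference pairs exactly as listed on the slide: (tag, word form, lemma).
REFERENCE_PAIRS = [
    ("SSip6", "duboch", "dub"),
    ("SSfs2", "ženy", "žene"),
    ("SSip6", "zuboch", "zub"),
    ("SSip6", "koláčoch", "koláč"),
    ("SSfp1", "steny", "stena"),
]


def group_by_tag(pairs):
    """Group (form, lemma) pairs under their morphological tag."""
    groups = defaultdict(list)
    for tag, form, lemma in pairs:
        groups[tag].append((form, lemma))
    return groups


def pairs_for(tag, pairs):
    """Return the reference pairs that share the given tag."""
    return group_by_tag(pairs).get(tag, [])


# Assumption: the input word "stromoch" carries the tag SSip6.
selected = pairs_for("SSip6", REFERENCE_PAIRS)
print(selected)
```

Only the three SSip6 pairs would then be used to compute vector shifts for stromoch.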

Slide11

Evaluation

latent vector model trained on Slovak National Corpus
  655 572 511 words
reference pairs and input words extracted from the annotated lexicon by Ľudovít Štúr Institute of Linguistics (dictionary-based approach)
nouns only
output is ordered list of candidates
  the first one is expected lemma
results are evaluated as true or false

Slide12

Evaluation – results (1)

comparison of reference pairs selection variants
random input words from corpus
@k represents whether the correct lemma occurs in the top k positions

R1 – suffix length
R2 – cosine similarity
R3 – grammatical categories
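
The @k measure can be computed as in this small sketch. The ranked lists in `RESULTS` are hypothetical and are not taken from the evaluation; they only show how the numbers are derived.

```python
def hit_at_k(ranked, expected, k):
    """True if the expected lemma is among the first k candidates."""
    return expected in ranked[:k]


def accuracy_at_k(results, k):
    """Share of input words whose correct lemma is in the top k positions."""
    if not results:
        return 0.0
    hits = sum(1 for ranked, expected in results if hit_at_k(ranked, expected, k))
    return hits / len(results)


# Hypothetical ranked outputs paired with the expected lemma.
RESULTS = [
    (["strom", "stromy", "dub"], "strom"),
    (["autobus", "auto", "autom"], "auto"),
    (["slon", "mestá", "cesta"], "mesto"),
]

for k in (1, 2, 3):
    print(k, accuracy_at_k(RESULTS, k))
```

The value can only stay the same or grow as k increases, so @1 is the strictest setting and corresponds to the true/false evaluation of the first candidate.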

Slide13

Evaluation – results (1)

comparison of reference pairs selection variants
the most frequent words from corpus
@k represents whether the correct lemma occurs in the top k positions

R1 – suffix length
R2 – cosine similarity
R3 – grammatical categories

Slide14

Evaluation – results (2)

comparison of weight computation methods

DM0 – Ignored (weight = 1)
DM1 – Levenshtein distance
DM2 – Jaro-Winkler distance
DM3 – Relative prefix length

Slide15

Evaluation – results (3)

correlation between correctness and coverage
searching for the threshold D
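
The slide does not define the threshold D, so the following sketch rests on an assumption: D is a minimum weight of the top candidate, below which no lemma is returned. Coverage is then the share of input words that receive an answer, and correctness the share of those answers that are right. The `PREDICTIONS` data is hypothetical.

```python
def coverage_and_correctness(predictions, threshold):
    """Return (coverage, correctness) for a given threshold.

    `predictions` holds (predicted lemma, weight, expected lemma) triples.
    A prediction counts as answered only if its weight reaches the
    threshold (assumed meaning of D).
    """
    if not predictions:
        return 0.0, 0.0
    answered = [(lemma, expected)
                for lemma, weight, expected in predictions
                if weight >= threshold]
    coverage = len(answered) / len(predictions)
    if not answered:
        return coverage, 0.0
    correct = sum(1 for lemma, expected in answered if lemma == expected)
    return coverage, correct / len(answered)


# Hypothetical predictions: (predicted lemma, weight, expected lemma).
PREDICTIONS = [
    ("strom", 0.9, "strom"),
    ("auto", 0.8, "auto"),
    ("slon", 0.3, "mesto"),
    ("dub", 0.6, "dub"),
]

for d in (0.0, 0.5, 0.85):
    print(d, coverage_and_correctness(PREDICTIONS, d))
```

Raising the threshold in this toy data trades coverage for correctness, which is the kind of trade-off a search for D would explore.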

Slide16

Conclusion

vector models are promising for automatic lemmatization
  minimal human input
  language independent
  viable for languages with small knowledge base
  strong dependency on corpus used for training
further work
  evaluation on other parts of speech (beside nouns)
  other variants for reference pair selection
  lemma candidates weighting utilizing morphological or language-specific regularities
  lemmatization including context