Slide1
ISM@FIRE 2012: Adhoc Retrieval Task & Morpheme Extraction Task
Avinash Yadav, Robins Yadav, Sukomal Pal
Department of Computer Science & Engineering
Indian School of Mines, Dhanbad, India
Slide2
Contents
Introduction
Adhoc retrieval task participation
Morpheme Extraction Task participation
Conclusion
Slide3
Introduction
Stemmer
ISMstemmer
Evaluation
Slide4
Stemmer
A stemmer attempts to reduce word variants to their stem or root form.
Example: education, educating, educative will all reduce to educat
Approaches for Stemming
Language based approach
Statistical approach
Slide5
ISMstemmer
statistical stemmer
based on suffix extraction
suffix frequency algorithm
Slide6
Data Preprocessing
Convert the corpus into a single file
File 1, File 2, …, File n are merged into a Single File
Cleaning of data
Before: John asked a girl with an apple of Kashmir, “do you have the time”. She said, “yes”.
After: John asked a girl with an apple of Kashmir do you have the time she said yes
Removing stop words
Before: John asked a girl with an apple of Kashmir do you have the time she said yes
After: John asked girl with apple Kashmir you time she said yes
Convert file into single column
John
asked
girl
with
apple
Kashmir
you
time
she
said
yes
Slide7
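The preprocessing steps on the preceding slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is hypothetical (just large enough to reproduce the slide's example), documents are passed in as strings rather than read from files, and the function names are invented for this sketch.

```python
import re

# Hypothetical stop-word list: just enough to reproduce the slide's example.
# The list actually used with ISMstemmer is not given in the slides.
STOP_WORDS = {"a", "an", "of", "do", "have", "the"}


def merge_documents(documents):
    """Concatenate all documents into one string (the "single file")."""
    return "\n".join(documents)


def clean(text):
    """Replace punctuation with spaces and collapse runs of whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text).split())


def remove_stop_words(text):
    """Return the words of `text` that are not in STOP_WORDS."""
    return [word for word in text.split() if word.lower() not in STOP_WORDS]


def to_single_column(words):
    """Lay the words out one per line."""
    return "\n".join(words)


def preprocess(documents):
    """Run the four steps in order: merge, clean, remove stop words, one word per line."""
    merged = merge_documents(documents)
    return to_single_column(remove_stop_words(clean(merged)))
```

Unlike the slide, this sketch does not change letter case, so "She" stays capitalized; case folding would be a separate step.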
Data preprocessing (contd.)
Unique words extracted
Hindi: 4,90,391
English: 7,95,144
Slide8
Find valid suffixes
Reverse the words of the single-column file
Words: aborning, absolution, absorption, abuilding, acquisition, activation, added, addition, admiration, admitted, admitting, agreed, agreeing, allotted, allotting, ambling, angling
Reversed: gninroba, noitulosba, noitprosba, gnidliuba, noitisiuqca, noitavitca, dedda, noitidda, noitarimda, dettimda, gnittimda, deerga, gnieerga, dettolla, gnittolla, gnilbma, gnilgna
Sort the reversed list
Sorted: dedda, deerga, dettimda, dettolla, gnidliuba, gnieerga, gnilbma, gnilgna, gninroba, gnittimda, gnittolla, noitarimda, noitavitca, noitidda, noitisiuqca, noitprosba, noitulosba
Find suffixes according to the threshold
Reversed suffixes marked on the sorted list: de, gni, noit
Frequencies shown: 17%, 40%
Slide9
Threshold used
English: 0.01 - 0.1%
Hindi: 0.1 - 1.0%
Slide10
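The suffix-finding step from the two preceding slides can be sketched as follows. The slides do not say exactly how a suffix's frequency is computed or how the threshold is applied, so this sketch assumes that frequency means the percentage of unique words ending in the suffix and that a suffix is valid when this percentage reaches the threshold. The function names and the `min_len` and `max_len` limits are also assumptions.

```python
from collections import Counter


def suffix_frequencies(words, min_len=2, max_len=4):
    """Map each candidate suffix to the percentage of words ending in it."""
    # Reversing and sorting mirrors the slides: words that share a suffix
    # become neighbours. The counts themselves do not depend on the order.
    reversed_sorted = sorted(word[::-1] for word in words)
    counts = Counter()
    for rword in reversed_sorted:
        # Keep at least one character of the word outside the suffix.
        longest = min(max_len, len(rword) - 1)
        for n in range(min_len, longest + 1):
            counts[rword[:n]] += 1
    total = len(reversed_sorted)
    return {rsuffix[::-1]: 100.0 * count / total for rsuffix, count in counts.items()}


def valid_suffixes(words, threshold_percent, min_len=2, max_len=4):
    """Assumed rule: keep suffixes whose frequency is at least the threshold."""
    frequencies = suffix_frequencies(words, min_len, max_len)
    return {suffix for suffix, freq in frequencies.items() if freq >= threshold_percent}
```

On the 17-word example list a much higher threshold than the corpus-level values is needed, because the list is tiny; for instance `valid_suffixes(words, 20.0)` keeps "ed", "ing" and "tion" but rejects "ted".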
Stemming of corpus
Stem the reversed words with the reversed valid suffixes
Reversed words: dedda, deerga, dettimda, dettolla, gnidliuba, gnieerga, gnilbma, gnilgna, gninroba, gnittimda, gnittolla, noitarimda, noitavitca, noitidda, noitisiuqca, noitprosba, noitulosba
After stemming: dda, erga, ttimda, ttolla, dliuba, eerga, lbma, lgna, nroba, ttimda, ttolla, arimda, avitca, idda, isiuqca, prosba, ulosba
Reverse the stemmed words to obtain the stems
add, agre, admitt, allott, abuild, agree, ambl, angl, aborn, admitt, allott, admira, activa, addi, acquisi, absorp, absolu
Slide11
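The stemming step on the preceding slide (strip a reversed suffix from the front of the reversed word, then reverse back) can be sketched like this. Trying the longest suffix first is an assumption of this sketch, as is the function name; the slides do not say how competing suffixes are resolved.

```python
def stem_via_reversal(word, suffixes):
    """Strip one valid suffix by working on the reversed word, as on the slide."""
    reversed_word = word[::-1]
    # Assumption: when several suffixes match, the longest one wins.
    reversed_suffixes = sorted((s[::-1] for s in suffixes if s), key=len, reverse=True)
    for reversed_suffix in reversed_suffixes:
        if reversed_word.startswith(reversed_suffix):
            # Drop the reversed suffix, then reverse back to get the stem.
            return reversed_word[len(reversed_suffix):][::-1]
    return word  # no valid suffix matched
```

With the suffixes "ed", "ing" and "tion", this maps "absolution" to "absolu" and "agreed" to "agre", matching the stems listed on the slide.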
Note:
If a word would be shorter than 3 letters after stemming, it is not stemmed.
Example: aging would become ag and king would become k, so both are left unchanged.
Slide12
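The minimum-length rule from the note on the preceding slide can be added as a guard around suffix stripping. This is a sketch only: the names are hypothetical, and leaving the word completely unstemmed (rather than retrying with a shorter suffix) is an assumption.

```python
MIN_STEM_LENGTH = 3  # from the note: stems shorter than 3 letters are rejected


def stem_with_length_guard(word, suffixes, min_len=MIN_STEM_LENGTH):
    """Strip the longest matching suffix unless the stem would be too short."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if suffix and word.endswith(suffix):
            candidate = word[:-len(suffix)]
            # Assumption: a too-short stem means the word is returned unchanged.
            return candidate if len(candidate) >= min_len else word
    return word  # no valid suffix matched
```

For example, with "ing" as the only valid suffix, "aging" and "king" come back unchanged, while "angling" is reduced to "angl".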
Evaluation of ISMstemmer
To evaluate ISMstemmer, we participated in:
Monolingual Adhoc retrieval task in English and Hindi
Morpheme Extraction Task (MET) of FIRE-2012
Slide13
Adhoc Retrieval Task (ART) Participation
Monolingual task
Languages chosen:
English
  Approach
  Results
Hindi
  Approach
  Results
Slide14
ART: English
Approach:
Indexing:
  Search engine used: Indri (IndriBuildIndex)
Retrieval:
  Search engine used: Lemur (RetEval)
Data provided:
  Corpus from The Telegraph and BD News
  Set of 50 queries
Slide15
ART: English (contd.)
Results:
Run id                 No. of queries  No. of results  No. of relevant docs.  No. of rel. docs ret.  MAP value
EE.ism.unstemmed       50              50000           3539                   2503                   0.2264
EE.ism.krovetzstemmer  50              50000           3539                   2504                   0.2255
EE.ism.ismstemmer      50              50000           3539                   2415                   0.2096
Slide16
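For readers unfamiliar with the MAP column in the English results on the preceding slide, the following sketch shows the standard definition of mean average precision. It is a generic illustration, not the evaluation tool used for the reported figures; the function names and the input layout (a ranked list of document ids per query, a set of relevant ids per query) are assumptions.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k over the ranks k where a relevant doc appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    # Assumes each document id appears at most once in the ranking.
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of average precision over every query that has judgments."""
    if not qrels:
        return 0.0
    total = sum(average_precision(runs.get(query, []), relevant)
                for query, relevant in qrels.items())
    return total / len(qrels)
```

Because unretrieved relevant documents count as zero, a run that retrieves fewer relevant documents (such as 2415 versus 2503 out of 3539) will tend to score a lower MAP, although the ranks at which documents are retrieved matter as well.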
ART: Hindi
Approach:
Indexing:
  Search engine used: Indri (IndriBuildIndex)
Retrieval:
  Search engine used: Indri (IndriRunQuery)
Data provided:
  Corpus from Navbharat Times and Amar Ujala
  Set of 50 queries
Slide17
ART: Hindi (contd.)
Results:
Run id                            No. of queries  No. of results  No. of relevant docs  No. of rel. docs ret.  MAP value
HH.ism.unstemmed.indri            50              50000           2309                  222                    0.0173
HH.stemmmedcorpus.unstemmedquery  50              50000           2309                  98                     0.0026
HH.stemmmedcorpus.stemmedquery    50              50000           2309                  209                    0.0137
Slide18
Morpheme Extraction Task Participation
Tool submitted
Results
Slide19
MET Tool Submission
ISMstemmer submitted
evaluated at IR Labs: DAIICT, Gujarat
tested on 6 languages of South Asian origin
scored above the baseline for 3 of them (Bengali, Gujarati and Marathi)
Slide20
MET Results:
1. BENGALI
Institute    Language  MAP Obtained
Baseline     Bengali   0.2740
JU           Bengali   0.3307
DCU          Bengali   0.3300
IIT-KGP      Bengali   0.3225
CVPR-Team1   Bengali   0.3159
ISM          Bengali   0.3103
CVPR-Team2+  Bengali   NA
Slide21
MET Results (contd.)
2. GUJARATI
Institute  Language  MAP Obtained
Baseline   Gujarati  0.2677
ISM        Gujarati  0.2824
3. MARATHI
Institute  Language  MAP Obtained
Baseline   Marathi   0.2320
ISM        Marathi   0.2797
IIT-B      Marathi   0.2684
Slide22
MET Results (contd.)
4. ODIA
Institute  Language  MAP Obtained
Baseline   Odia      0.1537
IIIT-Bh    Odia      0.1537
ISM        Odia      0.1537
5. HINDI
Institute  Language  MAP Obtained
Baseline   Hindi     0.2821
DCU        Hindi     0.2963
ISM        Hindi     0.2793
Slide23
MET Results (contd.)
6. TAMIL
Institute  Language  MAP Obtained
Baseline   Tamil     NA
AUCEG      Tamil     NA
ISM        Tamil     NA
NA: results are not available because the qrels were not available
Slide24
Reasons for Underperformance with Hindi
overstemming
undesired stemming of proper nouns
Slide25
Overstemming
Overstemming occurs when words that should not be grouped together are reduced to the same stem.
Example:
accent, accentual, accentuate (stem word: accent)
accept, acceptant, acceptor (stem word: accept)
access, accessible, accession (stem word: access)
Due to overstemming, all of these words may be grouped under the wrong stem: acce
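One way to spot overstemming of this kind is to group words by the stem they receive and flag any stem shared by words with different expected roots. The sketch below is illustrative only: `truncate_stem` is a deliberately crude stand-in (it cuts every word to four letters) used to reproduce the "acce" collision, not ISMstemmer, and all names are hypothetical.

```python
from collections import defaultdict


def truncate_stem(word, keep=4):
    """Crude stand-in stemmer: keep only the first `keep` letters."""
    return word[:keep]


def overstemmed_groups(expected_root, stem):
    """Return stems that merge words belonging to different expected roots."""
    roots_by_stem = defaultdict(set)
    for word, root in expected_root.items():
        roots_by_stem[stem(word)].add(root)
    # A stem covering more than one expected root signals overstemming.
    return {s: roots for s, roots in roots_by_stem.items() if len(roots) > 1}


# The slide's example: each word mapped to the stem it should receive.
EXPECTED = {
    "accent": "accent", "accentual": "accent", "accentuate": "accent",
    "accept": "accept", "acceptant": "accept", "acceptor": "accept",
    "access": "access", "accessible": "access", "accession": "access",
}
```

Calling `overstemmed_groups(EXPECTED, truncate_stem)` reports the single stem "acce" as covering all three expected roots.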
Slide26
Undesired stemming of proper nouns
Proper nouns should not be stemmed, as they are not inflected.
Example: Beijing will get stemmed to Beij
Slide27
Conclusion
ART:
  English: not satisfactory
  Hindi: poor
  Reasons: overstemming, undesired stemming of proper nouns
MET:
  performed above the baseline with Bengali, Gujarati and Marathi
  matched the baseline with Odia
  underperformed with Hindi
Slide28
THANK YOU!!