ISM@FIRE 2012: - PowerPoint Presentation

Uploaded On 2016-07-19


Presentation Transcript

Slide1

ISM@FIRE 2012: Adhoc Retrieval Task & Morpheme Extraction Task

Avinash Yadav, Robins Yadav, Sukomal Pal

Department of Computer Science & Engineering

Indian School of Mines, Dhanbad, India

Slide2

Contents

Introduction

Adhoc retrieval task participation

Morpheme Extraction Task participation

Conclusion

Slide3

Introduction

Stemmer

ISMstemmer

Evaluation

Slide4

Stemmer

A stemmer attempts to reduce word variants to their stem or root form.

Example: education, educating, educative will all reduce to educat

Approaches for stemming

Language-based approach

Statistical approach

Slide5

ISMstemmer

statistical stemmer

based on: suffix extraction, suffix frequency

algorithm

Slide6

Data Preprocessing

Convert the corpus into a single file

File 1, File 2, …, File n → Single File

Cleaning of data

Before: John asked a girl with an apple of Kashmir, “do you have the time”. She said “yes”.

After: John asked a girl with an apple of Kashmir do you have the time she said yes

Removing stop words

Before: John asked a girl with an apple of Kashmir do you have the time she said yes

After: John asked girl with apple Kashmir you time she said yes

Convert file into single column

John
asked
girl
with
apple
Kashmir
you
time
she
said
yes

Slide7
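The preprocessing steps above (cleaning, stop-word removal, single-column conversion) can be sketched in Python roughly as follows. This is only an illustration: the stop-word list is a hypothetical one chosen to match the slide's example, the function names are made up, and the regular expression is meant for the English example only (Python's `\w` does not match combining vowel signs, so Hindi text would need different handling).

```python
import re

# Hypothetical stop-word list picked to match the slide's example;
# the list actually used by the authors is not given in the slides.
STOP_WORDS = {"a", "an", "of", "do", "have", "the"}


def clean(text):
    """Replace punctuation (including curly quotes) by spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text).split())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


def to_single_column(tokens):
    """Write one token per line, as in the single-column file."""
    return "\n".join(tokens)


raw = 'John asked a girl with an apple of Kashmir, “do you have the time”. She said “yes”.'
tokens = remove_stop_words(clean(raw).split())
column = to_single_column(tokens)

# The unique-word list is what the later suffix-finding step works on.
unique_words = sorted({t.lower() for t in tokens})
```

Unlike the slide's example, this sketch leaves letter case untouched during cleaning ("She" stays capitalised); case is folded only when the unique-word list is built.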

Data preprocessing (contd….)

Unique words extracted:

Hindi: 4,90,391

English: 7,95,144

Slide8

Find valid suffixes

Reverse the words of the single-column file

aborning, absolution, absorption, abuilding, acquisition, activation, added, addition, admiration, admitted, admitting, agreed, agreeing, allotted, allotting, ambling, angling

→ gninroba, noitulosba, noitprosba, gnidliuba, noitisiuqca, noitavitca, dedda, noitidda, noitarimda, dettimda, gnittimda, deerga, gnieerga, dettolla, gnittolla, gnilbma, gnilgna

Sort the reversed list

dedda, deerga, dettimda, dettolla, gnidliuba, gnieerga, gnilbma, gnilgna, gninroba, gnittimda, gnittolla, noitarimda, noitavitca, noitidda, noitisiuqca, noitprosba, noitulosba

Find suffix according to threshold

[Figure: the sorted reversed list with frequent prefixes marked. Labels recoverable from the slide: "degniniot" (apparently the reversed suffixes de, gni and noit run together), "gni", 17%, 40%.]

Slide9
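One way to read the suffix-finding step is sketched below in Python, under stated assumptions: every prefix of a reversed word (that is, every suffix of the original word) within an illustrative length range is counted, and a suffix is kept when the share of vocabulary words carrying it reaches the threshold. The function name, the `min_len`/`max_len` bounds and the toy threshold of 0.2 are not from the slides, and the slides do not say how overlapping candidates such as "ng" and "ing" are resolved.

```python
from collections import Counter


def find_valid_suffixes(words, threshold, min_len=2, max_len=4):
    """Return {suffix: share of vocabulary} for suffixes meeting `threshold`.

    Words are reversed and sorted so that words sharing a suffix become
    neighbours sharing a prefix; prefixes of the reversed words are counted.
    `min_len` and `max_len` are illustrative bounds, not from the slides.
    """
    reversed_sorted = sorted(w[::-1] for w in set(words))
    counts = Counter()
    for rw in reversed_sorted:
        # Leave at least one character of the word outside the candidate suffix.
        for n in range(min_len, min(max_len, len(rw) - 1) + 1):
            counts[rw[:n]] += 1
    total = len(reversed_sorted)
    return {p[::-1]: c / total for p, c in counts.items() if c / total >= threshold}


words = """aborning absolution absorption abuilding acquisition activation added
addition admiration admitted admitting agreed agreeing allotted allotting
ambling angling""".split()

# Toy threshold for a 17-word list; on a full vocabulary a far smaller
# fraction would be used.
suffixes = find_valid_suffixes(words, threshold=0.2)
```

On this 17-word list the sketch keeps "ed", "ing" and "tion" along with their shorter tails "ng", "on" and "ion", so a real implementation would need an extra rule to choose among nested candidates.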

Threshold used

English: 0.01 – 0.1%

Hindi: 0.1 – 1.0%

Slide10

Stemming of corpus

Stem the reversed words with the reversed valid suffixes

dedda, deerga, dettimda, dettolla, gnidliuba, gnieerga, gnilbma, gnilgna, gninroba, gnittimda, gnittolla, noitarimda, noitavitca, noitidda, noitisiuqca, noitprosba, noitulosba

→ dda, erga, ttimda, ttolla, dliuba, eerga, lbma, lgna, nroba, ttimda, ttolla, arimda, avitca, idda, isiuqca, prosba, ulosba

Reverse the stemmed words to restore the original letter order

→ add, agre, admitt, allott, abuild, agree, ambl, angl, aborn, admitt, allott, admira, activa, addi, acquisi, absorp, absolu

Slide11

Note:

If the stem left after removing a suffix would be shorter than 3 letters, the word is not stemmed.

Examples: aging, king (stripping the suffix from "king" would leave only "k")

Slide12
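The stemming step and the minimum-length rule can be sketched in Python as below. The sketch strips suffixes from the end of the unreversed word, which is equivalent to stripping reversed suffixes from the front of the reversed word. Trying the longest suffix first, and leaving the word untouched when that suffix would produce a too-short stem, are assumptions; the slides do not specify either. The suffix set is the one implied by the slide's worked example.

```python
def stem(word, suffixes, min_stem_len=3):
    """Strip the longest matching suffix unless the stem would be too short."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            # Rule from the slides: a stem shorter than 3 letters is rejected.
            return candidate if len(candidate) >= min_stem_len else word
    return word


# Suffixes implied by the slide's worked example (reversed there: de, gni, noit).
SUFFIXES = {"ed", "ing", "tion"}

examples = ["added", "agreed", "abuilding", "admiration", "aging", "king"]
stems = {w: stem(w, SUFFIXES) for w in examples}
# "added" -> "add", "admiration" -> "admira", while "aging" and "king"
# are returned unchanged because their stems would be "ag" and "k".
```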

Evaluation of ISMstemmer

To evaluate ISMstemmer, we participated in:

Monolingual Adhoc retrieval task in English and Hindi

Morpheme Extraction Task (MET) of FIRE-2012

Slide13

Adhoc Retrieval Task (ART) Participation

Monolingual task

Languages chosen:

English: approach, results

Hindi: approach, results

Slide14

ART: English

Approach:

Indexing: search engine used: Indri (IndriBuildIndex)

Retrieval: search engine used: Lemur (RetEval)

Data provided:

Corpus from The Telegraph and BD News

50 query set

Slide15

ART: English (contd….)

Results:

Run id                 No. of queries  No. of results  No. of relevant docs  No. of rel. docs ret.  MAP value
EE.ism.unstemmed       50              50000           3539                  2503                   0.2264
EE.ism.krovetzstemmer  50              50000           3539                  2504                   0.2255
EE.ism.ismstemmer      50              50000           3539                  2415                   0.2096

Slide16
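The MAP values reported for the runs above are mean average precision scores. As a reminder of what that number measures, here is a minimal Python sketch of the standard definition; it is not the evaluation tool used for the task, and the document ids and judgments in the example are made up.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average of precision@k over the ranks k where a relevant doc appears,
    divided by the total number of relevant docs for the query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of average precision over all judged queries."""
    return sum(average_precision(runs.get(q, []), qrels[q]) for q in qrels) / len(qrels)


# Made-up rankings and relevance judgments, for illustration only.
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d5"}}
map_score = mean_average_precision(runs, qrels)  # (5/6 + 1/2) / 2 = 2/3
```

Because the sum is divided by the number of relevant documents, relevant documents that are never retrieved pull the score down, which is why a run that retrieves fewer relevant documents tends to have a lower MAP.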

ART: Hindi

Approach:

Indexing: search engine used: Indri (IndriBuildIndex)

Retrieval: search engine used: Indri (IndriRunQuery)

Data provided:

Corpus from Navbharat Times and Amar Ujala

50 query set

Slide17

ART: Hindi (contd….)

Results:

Run id                             No. of queries  No. of results  No. of relevant docs  No. of rel. docs ret.  MAP value
HH.ism.unstemmed.indri             50              50000           2309                  222                    0.0173
HH.stemmmedcorpus.unstemmedquery   50              50000           2309                  98                     0.0026
HH.stemmmedcorpus.stemmedquery     50              50000           2309                  209                    0.0137

Slide18

Morpheme Extraction Task Participation

Tool submitted

Results

Slide19

MET Tool Submission

ISMstemmer submitted

evaluated at IR Labs, DAIICT, Gujarat

tested on 6 languages of South Asian origin

gave good results for 3 of the languages

Slide20

MET Results:

BENGALI

Institute    Language  MAP Obtained
Baseline     Bengali   0.2740
JU           Bengali   0.3307
DCU          Bengali   0.3300
IIT-KGP      Bengali   0.3225
CVPR-Team1   Bengali   0.3159
ISM          Bengali   0.3103
CVPR-Team2   Bengali   NA

Slide21

MET Results (contd….)

GUJARATI

Institute    Language  MAP Obtained
Baseline     Gujarati  0.2677
ISM          Gujarati  0.2824

MARATHI

Institute    Language  MAP Obtained
Baseline     Marathi   0.2320
ISM          Marathi   0.2797
IIT-B        Marathi   0.2684

Slide22

MET Results (contd….)

ODIA

Institute    Language  MAP Obtained
Baseline     Odia      0.1537
IIIT-        Odia      0.1537
ISM          Odia      0.1537

HINDI

Institute    Language  MAP Obtained
Baseline     Hindi     0.2821
DCU          Hindi     0.2963
ISM          Hindi     0.2793

Slide23

MET Results (contd….)

TAMIL

Institute    Language  MAP Obtained
Baseline     Tamil     NA
AUCEG        Tamil     NA
ISM          Tamil     NA

NA: results are not available due to non-availability of qrels

Slide24

Reasons for Underperformance with Hindi

overstemming

undesired stemming of proper nouns

Slide25

Overstemming

This refers to words that shouldn’t be grouped together by stemming, but are.

Example –

accent, accentual, accentuate

Stem word – accent

accept, acceptant, acceptor

Stem word – accept

access, accessible, accession

Stem word – access

Due to overstemming, all of these words may end up grouped under one wrong stem: acce
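A small Python sketch of how such overstemming could be detected: given groups of words that should stay separate, it reports every stem shared by more than one group. The four-letter truncation used here is only a stand-in for an over-aggressive stemmer, not ISMstemmer itself, and the function name is made up for illustration.

```python
from collections import defaultdict


def overstemmed(groups, stem_fn):
    """Return {stem: [group names]} for stems shared by more than one group."""
    stem_to_groups = defaultdict(set)
    for name, words in groups.items():
        for word in words:
            stem_to_groups[stem_fn(word)].add(name)
    return {s: sorted(g) for s, g in stem_to_groups.items() if len(g) > 1}


# Word groups from the slide; each group should keep its own stem.
groups = {
    "accent": ["accent", "accentual", "accentuate"],
    "accept": ["accept", "acceptant", "acceptor"],
    "access": ["access", "accessible", "accession"],
}


def aggressive(word):
    """Stand-in for an over-aggressive stemmer: keep only the first 4 letters."""
    return word[:4]


conflated = overstemmed(groups, aggressive)
# -> {"acce": ["accent", "accept", "access"]}
```

A check of this kind needs hand-built word groups, so it is practical only on a small sample of the vocabulary.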

Slide26

Undesired stemming of proper nouns

Proper nouns should not be stemmed, as they are not inflected.

Example: Beijing gets stemmed to Beij

Slide27

Conclusion

ART:

English: not satisfactory

Hindi: poor

Reasons: overstemming, undesired stemming of proper nouns

MET:

performed well with Bengali, Gujarati and Marathi

performed up to the mark with Odia

underperformed with Hindi

Slide28

References

1. Banerjee R. and Pal S. 2011. ISM@FIRE-2011 Bengali Monolingual Task: A frequency based stemmer. Forum for Information Retrieval Evaluation 2011, ISI Kolkata.

2. www.isical.ac.in/~fire/ (as on 06.12.2012)

3. Christopher D. Manning, Hinrich Schütze: Foundations of Statistical Natural Language Processing, MIT Press (1999), ISBN 978-0-262-13360-9.

4. http://en.wikipedia.org/wiki/Information_retrieval (as on 06.12.2012)

5. http://sourceforge.net/p/lemur/wiki/Indri%20query%20Language%20Reference/ (as on 06.12.2012)

6. www.lemurproject.org (as on 06.12.2012)

7. Paik, J. H., Mitra, M., Parui, S. K., and Järvelin, K. 2011. GRAS: An effective and efficient stemming algorithm for information retrieval. ACM Trans. Inf. Syst. 29, 4, Article 19 (November 2011).

Slide29

References (contd…)

8. Paik, J. H. and Parui, S. K. 2011. A fast corpus-based stemmer. ACM Trans. Asian Lang. Inform. Process. 10, 2, Article 8 (June 2011).

9. Paik J. H., Pal Dipasree, Parui S. K. A Novel Corpus-Based Stemming Algorithm using Co-occurrence Statistics. SIGIR’11, July 24–28, 2011, Beijing, China.

10. Xu, J. and Croft, W. B. 1998. Corpus-based stemming using co-occurrence of word variants. ACM Trans. Inf. Syst. 16, 1, 61–81.

11. http://en.wikipedia.org/wiki/Stemming (as on 06.12.2012)

12. Donna Harman. How Effective Is Suffixing? Lister Hill Center for Biomedical Communications, National Library of Medicine, Bethesda, MD 20209.

Slide30

THANK YOU!!