
Marathi – Marathi Monolingual Information Retrieval

Mr. Ashish Almeida
Prof. Pushpak Bhattacharyya

Overview

• Morphological analyzer
• Suffix processing
• Stop-words
• Future work

Present work

• A search for “भारत” (bhaarat, Bharat) will not match pages that contain only inflected forms such as
  – भारताचा (bhaarataachaa): of Bharat
  – भारतात (bhaarataat): in Bharat
• Lack of a large corpus
• Unavailability of tools

Corpus Statistics- Marathi

• 99,275 documents (510 MB)
  – Maharashtra Times
  – Sakal News
• April 2004 to September 2007
• UTF-8 encoding
• XML tags
  – DOC: document
  – DOCNO: document identifier
  – TEXT: article

Document: example

<DOC>
<DOCNO>MaharashtraC06E811C6B.htm.txt</DOCNO>
<TEXT>
मोहफूल वेचण्यास गेलेल्या तरुणावर बिबट्याचा हल्ला
(A leopard attacks a young man who had gone to collect Moha flowers)
इस्लापूर, ता. २२ - चारोळी आणि मोहफूल वेचण्यासाठी जंगलात गेलेल्या एका आदिवासी तरुणावर बिबट्याने अचानक हल्ला केल्याने तो तरुण गंभीर जखमी झाला आहे. ही घटना शुक्रवारी (ता. २०) मुळझरा (ता. किनवट) या गावाच्या जंगलात घडली. .......
इस्लापूर वन परिक्षेत्र कार्यालयाअंतर्गत येणाऱ्या मुळझरा येथील आदिवासी तरुण मनोहर . . .. . .
</TEXT>
</DOC>
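A record in this layout can be read with Python's standard library. The following is a minimal sketch, assuming each record is well-formed XML on its own; the function name `parse_doc` is hypothetical and the sample text is abridged from the example above.

```python
import xml.etree.ElementTree as ET

def parse_doc(record: str) -> dict:
    """Return the identifier and article text of one <DOC> record."""
    doc = ET.fromstring(record)
    return {
        "docno": doc.findtext("DOCNO").strip(),
        "text": doc.findtext("TEXT").strip(),
    }

# Abridged sample record (headline only).
sample = (
    "<DOC> <DOCNO>MaharashtraC06E811C6B.htm.txt</DOCNO> "
    "<TEXT> मोहफूल वेचण्यास गेलेल्या तरुणावर बिबट्याचा हल्ला </TEXT> </DOC>"
)
parsed = parse_doc(sample)
print(parsed["docno"])
```

A corpus file that concatenates many records without a single root element would need to be split into records first; that step is not shown.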

Topics

• 100 topics
• Aligned with English topics
• XML tags
  – num: query identifier
  – title: title of the query
  – desc: description
  – narr: additional information about the query
• Cover all issues: local and international

Topic example

<top>
<num>1</num>
<title>ट्वेंटी - २० विश्वचषकातील भारताचे क्रीडापटुत्व
(India’s championship in the Twenty20 World Cup)</title>
<desc>पहिल्या आयसीसी विश्व ट्वेंटी - २० सर्वोत्कृष्ट-विजेता- स्पर्धेतील भारताच्या विजयाचे वृत्त देणारे लेख शोधा.</desc>
<narr>ट्वेंटी - २० विश्वचषक स्पर्धेमधील पाकिस्तान विरूद्ध भारताचा विजय, ह्या ऐतिहासिक विजया निमित्त खेळाडूंनी केलेले विक्रम त्यांनी मिळविलेली बक्षिसे व पुरस्काराची रक्कम सामनावीराचे तसेच मालिकावीराचे नाव, माजी खेळाडूंनी आणि जगभरातील लोकांनी केलेली प्रशंसा यासंदर्भात आम्ही उचित माहिती मिळवत आहोत.</narr>
</top>

Tools

• Terrier
  – Open source IR system
  – Models
    • TF-IDF (vector space model)
    • DFR-BM25 (probabilistic)
  – Both models available in Terrier
• Evaluation against relevance-judged documents for 25 queries
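To give a feel for what a probabilistic ranking model computes, here is a minimal sketch of textbook Okapi BM25. It is not Terrier's code: Terrier's DFR-BM25 is derived within the divergence-from-randomness framework and differs in detail. The parameter values `k1=1.2` and `b=0.75` are common defaults assumed for illustration, and the idf variant with `+ 1.0` is one of several in use.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document against a tokenised query (textbook BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalisation penalises long documents.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy collection of already-lemmatized tokens (illustrative only).
docs = [["भारत", "विजय"], ["जंगल", "तरुण"], ["भारत", "भारत", "स्पर्धा"]]
scores = bm25_scores(["भारत"], docs)
print(scores)
```

In this toy collection the third document ranks first because the query term occurs twice in it, and the second document scores zero.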

Lemmatizer vs. stemmer

• Example word forms
  – भारताला (bhaarataalaa): for Bharat
  – भारताचा (bhaarataachaa): of Bharat
  – भारतात (bhaarataat): in Bharat
  – भारतावर (bhaarataavar): on Bharat
• Lemmatizer finds the lemma
  – भारत
• Stemmer finds the stem: the longest unchanging word prefix
  – भारता
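The stem definition above can be checked directly on the four example forms. This is a small sketch that compares strings code point by code point; the function name is hypothetical.

```python
def longest_common_prefix(words):
    """Return the longest prefix shared by every word, compared by code point."""
    if not words:
        return ""
    prefix = words[0]
    for word in words[1:]:
        i = 0
        while i < min(len(prefix), len(word)) and prefix[i] == word[i]:
            i += 1
        prefix = prefix[:i]
    return prefix

forms = ["भारताला", "भारताचा", "भारतात", "भारतावर"]
stem = longest_common_prefix(forms)
print(stem)
```

The shared prefix keeps the vowel sign after त, so the stem differs from the lemma भारत that a query would normally contain. In general a code-point comparison can cut inside a Devanagari syllable, so a practical stemmer would need extra care at that boundary.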

Marathi suffixes

• Suffixes include case markers, postposition markers, etc.
• Suffixes may attach after another suffix
• Example:
  – घरासमोरचादेखिल
  – घरा-समोर-चा-देखिल
  – gharaa-samor-chaa-dekhil
  – house-front-of-also
  – Root word: घर (ghar, house)
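Stacked suffixes like these can be peeled off one at a time from the end of the word. The sketch below assumes a tiny, hypothetical suffix list containing only the three suffixes from the example; it is not the analyzer used in this work.

```python
# Hypothetical toy suffix list; a real system would use a full inventory.
SUFFIXES = ["देखिल", "समोर", "चा"]

def strip_suffixes(word, suffixes=SUFFIXES, min_stem=2):
    """Peel known suffixes off the end of a word, trying longer ones first."""
    found = []
    ordered = sorted(suffixes, key=len, reverse=True)
    while True:
        for suffix in ordered:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                found.insert(0, suffix)  # keep suffixes in surface order
                break
        else:
            # No suffix matched in this pass: stripping is finished.
            return word, found

stem, chain = strip_suffixes("घरासमोरचादेखिल")
print(stem, chain)
```

Stripping leaves the oblique form घरा rather than the root घर; mapping one to the other is the job of the morphological analyzer.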

Morphological analyzer

• Use of a Marathi morphological analyzer
  – Better matching of words
    • राम versus रामा
• Gives all possible roots
  – Selects the first root, which is the most frequent
• Used at both the indexing and the query-processing ends
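The effect of applying the same normalization at both ends can be sketched as follows. The lookup table `TOY_ANALYSES` is a hypothetical stand-in for the analyzer's output, with candidate roots assumed to be ordered from most to least frequent.

```python
# Hypothetical analyzer output: surface form -> candidate roots,
# ordered from most to least frequent. Real output comes from the analyzer.
TOY_ANALYSES = {
    "भारताचा": ["भारत"],
    "भारतात": ["भारत"],
    "रामा": ["राम"],
}

def lemmatize(token):
    """Return the first (most frequent) root, or the token itself if unknown."""
    roots = TOY_ANALYSES.get(token)
    return roots[0] if roots else token

def normalize(tokens):
    """Apply the same lemmatization to document tokens and query tokens."""
    return [lemmatize(t) for t in tokens]

document = normalize(["भारतात", "विजय"])
query = normalize(["भारत"])
matches = bool(set(query) & set(document))
print(matches)
```

Because both sides are reduced to the same root, the query भारत now matches a document that contains only भारतात.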

Lemmatizer Results

Run                                      MAP     R-precision  Precision at 5  Precision at 10  Recall
TF-IDF without lemmatizer                0.3366  0.2944       0.3167          0.2583           0.8724
TF-IDF + lemmatizer                      0.4003  0.3551       0.3417          0.2917           0.9686
DFR-BM25 without lemmatizer              0.3455  0.3209       0.3500          0.2667           0.8744
DFR-BM25 + lemmatizer                    0.4140  0.3686       0.3833          0.3083           0.9619
DFR-BM25 + lemmatizer (FIRE submission)  0.3625  0.3797       0.4600          0.3960           0.9178
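For reference, the measures reported here have standard definitions that can be sketched in a few lines. The slides do not say which evaluation tool produced the figures, so this is only an illustration with made-up document identifiers.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at_k(ranked, relevant, len(relevant))

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def recall(ranked, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(set(ranked) & relevant) / len(relevant)

def mean_average_precision(runs):
    """MAP: average precision averaged over (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Made-up ranking and judgments for a single query.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
ap = average_precision(ranked, relevant)
print(ap, precision_at_k(ranked, relevant, 5), r_precision(ranked, relevant))
```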

Suffixes

• Usually ignored
• Indexing suffixes: not studied
• Index selected suffixes
  – Suffixes of space and time
    • वर (var): on
    • समोर (samor): in front of
    • मध्ये (madhye): in
    • नंतर (nantar): after
• Created manually
  – List of 66 words
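One plausible reading of "index selected suffixes" is that the indexer emits the lemma together with any suffix that appears on the manual list, and drops the rest. The sketch below follows that reading; the set holds only the four suffixes named above, and the function name is hypothetical.

```python
# Subset of the manually created list: only the four suffixes named above.
INDEXED_SUFFIXES = {"वर", "समोर", "मध्ये", "नंतर"}

def index_terms(lemma, suffixes):
    """Emit the lemma plus any suffix that is on the selected list."""
    return [lemma] + [s for s in suffixes if s in INDEXED_SUFFIXES]

# घरासमोरचा: root घर with the suffixes समोर and चा.
terms = index_terms("घर", ["समोर", "चा"])
print(terms)
```

Here समोर is kept as an index term because it carries spatial meaning, while the case marker चा is discarded.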

Stop-words

• Most frequently occurring words
• Little discriminatory value
• Occur in 80% or more of the documents
• Selected stop-words

– त+, त�, या�, ू ने, अस, आह, या�, ह, कार, त
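The 80% criterion can be expressed as a document-frequency threshold. This is a minimal sketch over a toy collection; the tokens are illustrative and the function name is hypothetical.

```python
from collections import Counter

def frequent_terms(docs, threshold=0.8):
    """Return terms that occur in at least `threshold` of the documents."""
    # Count each term once per document it appears in.
    df = Counter(term for doc in docs for term in set(doc))
    return {term for term, count in df.items() if count / len(docs) >= threshold}

# Toy collection of tokenised documents (illustrative only).
docs = [
    ["ती", "घर", "कर"],
    ["ती", "जंगल", "कर"],
    ["ती", "विजय", "कर"],
    ["ती", "तरुण", "कर"],
    ["घर", "विजय", "कर"],
]
stop_words = frequent_terms(docs)
print(stop_words)
```

Terms found in four or five of the five documents pass the threshold; content words such as घर do not.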

Results suffix indexing and stop-words

Run                                                      MAP     R-precision  Precision at 5  Precision at 10  Recall
DFR-BM25 + lemmatization + suffix indexing               0.4381  0.3846       0.3917          0.3167           0.97085
DFR-BM25 + lemmatization + suffix indexing + stop-words  0.4433  0.3798       0.4000          0.3208           0.9731

P-R graph

• Precision-recall graph for all four cases is shown below

[Figure: precision (0 to 0.8) against recall % (0 to 100) for four runs: base-line; lemmatization; lemmatization and indexing suffixes; lemmatization, indexing suffixes and stopwords]
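A curve with recall points at 0, 10, ..., 100% is commonly produced by 11-point interpolation. The slides do not state how the graph was computed, so the sketch below only assumes that convention, with made-up document identifiers.

```python
def interpolated_precision(ranked, relevant):
    """Return interpolated precision at recall 0%, 10%, ..., 100%."""
    points = []  # (recall, precision) each time a relevant document is retrieved
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    curve = []
    for level in range(11):
        r = level / 10
        # Interpolated precision: best precision at any recall >= r.
        candidates = [p for rec, p in points if rec >= r]
        curve.append(max(candidates) if candidates else 0.0)
    return curve

# Made-up ranking and judgments for a single query.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d3", "d2"}
curve = interpolated_precision(ranked, relevant)
print(curve)
```

Averaging such curves over all queries gives one line per run on the graph.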

Future work

• Morphological analyzer
  – Accuracy 94.5%
    • Needs to be improved
• Heuristic suffix stripping for unknown words
• Handle derivational morphology
• Spelling variations and common spelling mistakes

Acknowledgement

• “Cross Lingual Information Access” Project
• Maharashtra Times: Times Media Group
  – http://in.indiatimes.com/aboutus.cms
• Sakal: Sakal Media Group
  – http://www.sakaal.in/

References

• http://ir.dcs.gla.ac.uk/terrier/
• Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval
• Jacques Savoy, Searching strategies for the Bulgarian language
• Morphological Analyzer, CFILT

Thank you
