Marathi – Marathi Monolingual Information Retrieval
Mr. Ashish Almeida, Prof. Pushpak Bhattacharyya
Overview
• Morphological analyzer
• Suffix processing
• Stop-words
• Future work
Present work
• A search for “भारत” (bhaarat, Bharat) will not match pages that contain inflected forms such as
– भारताचा – bharataachaa – of Bharat
– भारतात – bharataat – in Bharat
• Lack of a large corpus
• Unavailability of tools
Corpus Statistics - Marathi
• 99,275 documents (510 MB)
– Maharashtra Times
– Sakal News
• April 2004 to September 2007
• UTF-8 encoding
• XML tags
– DOC – document
– DOCNO – document identifier
– TEXT – article
Document: example
<DOC>
<DOCNO>MaharashtraC06E811C6B.htm.txt</DOCNO>
<TEXT>
मोहफूल वेचण्यास गेलेल्या तरुणावर बिबट्याचा हल्ला
(attack of a leopard on a young man who had gone to collect flowers of Moha)
इस्लापूर, ता. २२ - चारोळी आणि मोहफूल वेचण्यासाठी जंगलात गेलेल्या एका आदिवासी तरुणावर बिबट्याने अचानक हल्ला केल्याने तो तरुण गंभीर जखमी झाला आहे. ही घटना शुक्रवारी (ता. २०) मुळझरा (ता. किनवट) या गावाच्या जंगलात घडली. .......
इस्लापूर वन परिक्षेत्र कार्यालयाअंतर्गत येणाऱ्या मुळझरा येथील आदिवासी तरुण मनोहर . . .
</TEXT>
</DOC>
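A minimal sketch of reading one corpus document in the DOC / DOCNO / TEXT layout shown above. It assumes each document is well-formed XML in UTF-8; the function name `parse_doc` and the shortened sample are illustrative, not part of the original tooling.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the corpus layout (headline only).
sample = """<DOC>
<DOCNO>MaharashtraC06E811C6B.htm.txt</DOCNO>
<TEXT>
मोहफूल वेचण्यास गेलेल्या तरुणावर बिबट्याचा हल्ला
</TEXT>
</DOC>"""

def parse_doc(xml_text):
    """Return (document identifier, article text) for one <DOC> element."""
    doc = ET.fromstring(xml_text)
    docno = doc.findtext("DOCNO").strip()
    text = doc.findtext("TEXT").strip()
    return docno, text

docno, text = parse_doc(sample)
print(docno)
print(text)
```

Real news pages may contain stray markup that breaks a strict XML parser, in which case a more tolerant reader would be needed.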
Topics
• 100 topics
• Aligned with English topics
• XML tags
– num: query identifier
– title: title of the query
– desc: description
– narr: additional information about the query
• Cover all issues – local, international
Topic example
<top>
<num>1</num>
<title>ट्वेंटी - २० विश्वचषकातील भारताचे क्रीडापटुत्व</title>
(India’s championship in Twenty-20 World Cup)
<desc>पहिल्या आयसीसी विश्व ट्वेंटी - २० सर्वोत्कृष्ट-विजेता- स्पर्धेतील भारताच्या विजयाचे वृत्त देणारे लेख शोधा.</desc>
<narr>ट्वेंटी - २० विश्वचषक स्पर्धेमधील पाकिस्तान विरूद्ध भारताचा विजय, ह्या ऐतिहासिक विजया निमित्त खेळाडूंनी केलेले विक्रम त्यांनी मिळविलेली बक्षिसे व पुरस्काराची रक्कम सामनावीराचे तसेच मालिकावीराचे नाव, माजी खेळाडूंनी आणि जगभरातील लोकांनी केलेली प्रशंसा यासंदर्भात आम्ही उचित माहिती मिळवत आहोत.</narr>
</top>
Tools
• Terrier
– Open source IR system
– Models
  • TF-IDF (Vector space model)
  • DFR-BM25 (Probabilistic)
– Both models available in Terrier
• Evaluation against relevance-judged documents for 25 queries
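To make the probabilistic ranking idea concrete, here is a small sketch of textbook BM25 scoring over tokenised documents. It is not Terrier's code and does not reproduce Terrier's DFR-BM25 weighting; the parameter values `k1=1.2` and `b=0.75`, the smoothed idf form, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenised document in docs against the query tokens."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Smoothed idf that never goes negative (one common variant).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [["भारत", "विजय"], ["भारत", "भारत", "स्पर्धा"], ["जंगल", "तरुण"]]
scores = bm25_scores(["भारत"], docs)
print(scores)
```

The third document does not contain the query term and scores zero; the second scores highest because the term occurs twice, even after length normalisation.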
Lemmatizer vs. stemmer
– भारताला – bhaarataalaa – for Bharat
– भारताचा – bhaarataachaa – of Bharat
– भारतात – bhaarataat – in Bharat
– भारतावर – bhaarataavar – on Bharat
• Lemmatizer finds lemma
– भारत
• Stemmer finds stem: longest unchangeable word prefix
– भारता
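The difference can be sketched in a few lines. The stem is computed here as the longest prefix shared by the four inflected forms, while the lemma comes from a lookup table. The table `LEMMAS` is a hypothetical stand-in for a morphological analyzer, and the prefix is taken over Unicode code points.

```python
import os

forms = ["भारताला", "भारताचा", "भारतात", "भारतावर"]

# Stem: the longest prefix that never changes across the word forms.
# os.path.commonprefix compares strings character by character.
stem = os.path.commonprefix(forms)

# Lemma: the dictionary form. Hypothetical lookup table for illustration.
LEMMAS = {form: "भारत" for form in forms}
lemma = LEMMAS["भारतात"]

print(stem)   # भारता
print(lemma)  # भारत
```

The stem keeps the oblique-form vowel, so a query for भारत would still not match an index built from stems alone.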
Marathi suffixes
• Suffixes include case markers, postposition markers etc.
• Suffixes may get attached after another suffix
• Example:
– घरासमोरचादेखिल
– घरा-समोर-चा-देखिल
– gharaa-samor-chaa-dekhil
– house-front-of-also
– Root word: घर (ghar) (house)
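Stacked suffixes can be peeled off from the right, one at a time. The sketch below uses a three-entry suffix list taken from the example word; the list `SUFFIXES` and the function `split_suffixes` are illustrative assumptions, not the analyzer used in this work.

```python
SUFFIXES = ["देखिल", "समोर", "चा"]  # hypothetical, tiny suffix list

def split_suffixes(word, suffixes=SUFFIXES):
    """Peel known suffixes off the end of word, longest match first."""
    found = []
    changed = True
    while changed:
        changed = False
        for suf in sorted(suffixes, key=len, reverse=True):
            # Never strip the whole word away.
            if word.endswith(suf) and len(word) > len(suf):
                found.insert(0, suf)
                word = word[: -len(suf)]
                changed = True
                break
    return word, found

stem, parts = split_suffixes("घरासमोरचादेखिल")
print(stem, parts)  # घरा ['समोर', 'चा', 'देखिल']
```

What remains is the oblique form घरा, not the root घर; mapping it back to the root is the job of the morphological analyzer.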
Morphological analyzer
• Use of Marathi morphological analyzer
– Better matching of words
  • राम versus रामा
• Gives all possible roots
– Selects the first root, which is the most frequent
• Used at both the indexing and the query-processing end
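A sketch of the first-root rule, applied identically to document tokens and query tokens. The dictionary `ANALYSES` is a hypothetical stand-in for the analyzer's output, and the ordering of candidate roots in it is invented for illustration.

```python
# Hypothetical analyzer output: candidate roots, most frequent first.
ANALYSES = {
    "रामा": ["राम", "रामा"],
    "भारतात": ["भारत"],
}

def lemmatize(token, analyses=ANALYSES):
    """Return the first candidate root, or the token itself if unknown."""
    roots = analyses.get(token)
    return roots[0] if roots else token

def normalize(tokens):
    """Same normalisation for indexing and for query processing."""
    return [lemmatize(t) for t in tokens]

doc_terms = normalize(["रामा", "भारतात", "xyz"])
query_terms = normalize(["राम"])
print(doc_terms, query_terms)
```

Because both sides pass through `normalize`, the query राम now matches a document that only contains रामा.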
Lemmatizer Results
Run                                      MAP     R-precision  Precision at 5  Precision at 10  Recall
TF-IDF without lemmatizer                0.3366  0.2944       0.3167          0.2583           0.8724
TF-IDF + lemmatizer                      0.4003  0.3551       0.3417          0.2917           0.9686
DFR-BM25 without lemmatizer              0.3455  0.3209       0.3500          0.2667           0.8744
DFR-BM25 + lemmatizer                    0.4140  0.3686       0.3833          0.3083           0.9619
DFR-BM25 + lemmatizer (FIRE submission)  0.3625  0.3797       0.4600          0.3960           0.9178
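For reference, the measures in the table can be computed per query as sketched below. These are the standard textbook definitions; the ranked list and the relevance set are made-up toy data, and MAP is the mean of `average_precision` over all queries.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at_k(ranked, relevant, len(relevant))

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def recall(ranked, relevant):
    """Fraction of the relevant documents that were retrieved at all."""
    return len(set(ranked) & relevant) / len(relevant)

ranked = ["d3", "d1", "d7", "d2", "d9"]   # toy system output
relevant = {"d1", "d2", "d5"}             # toy relevance judgments

print(average_precision(ranked, relevant))
print(r_precision(ranked, relevant))
print(precision_at_k(ranked, relevant, 5))
print(recall(ranked, relevant))
```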
Suffixes
• Usually ignored
• Indexing suffixes - not studied
• Index selected suffixes
– Suffixes of space and time
  • वर – var – on
  • समोर – samor – in front of
  • मध्ये – madhye – in
  • नंतर – nantar – after
• Created manually
– List of 66 words
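One way to picture suffix indexing: after analysis, the lemma is indexed together with any suffix that appears in the selected list, and all other suffixes are dropped. The sketch assumes the analyzer returns a `(lemma, suffixes)` pair; that interface and the function `index_terms` are hypothetical, and only four of the 66 selected suffixes are listed.

```python
# Four of the manually selected suffixes of space and time.
INDEXED_SUFFIXES = {"वर", "समोर", "मध्ये", "नंतर"}

def index_terms(analysis, indexed=INDEXED_SUFFIXES):
    """Turn a (lemma, suffixes) analysis into the terms to be indexed."""
    lemma, suffixes = analysis
    return [lemma] + [s for s in suffixes if s in indexed]

terms = index_terms(("घर", ["समोर", "चा", "देखिल"]))
print(terms)  # ['घर', 'समोर']
```

The spatial suffix समोर survives as a term of its own, while चा and देखिल are discarded as in plain lemmatization.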
Stop-words
• Most frequently occurring words
• Little discriminatory value
• Occur in 80 % or more documents
• Selected stop-words
– त+, त�, या�, ू ने, अस, आह, या�, ह, कार, त
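The 80 % criterion can be sketched as a document-frequency threshold. The toy documents below are invented, and the function `stop_words` is an illustration of the selection rule, not the procedure actually used to build the list.

```python
from collections import Counter

def stop_words(docs, threshold=0.8):
    """Terms that occur in at least `threshold` of the documents."""
    # Count each term once per document.
    df = Counter(t for d in docs for t in set(d))
    return {t for t, n in df.items() if n / len(docs) >= threshold}

docs = [
    ["आह", "भारत"],
    ["आह", "घर"],
    ["आह", "कर", "घर"],
    ["आह", "कर"],
    ["कर", "आह"],
]
selected = stop_words(docs)
print(selected)  # {'आह'}
```

Here आह occurs in every document and is selected, while कर occurs in only 60 % of them and is kept as a content term.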
Results: suffix indexing and stop-words
Run                                                      MAP     R-precision  Precision at 5  Precision at 10  Recall
DFR-BM25 + lemmatization + suffix indexing               0.4381  0.3846       0.3917          0.3167           0.97085
DFR-BM25 + lemmatization + suffix indexing + stop-words  0.4433  0.3798       0.4000          0.3208           0.9731
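The cumulative effect of each step is easier to read as a relative MAP gain. The sketch below uses the MAP values reported in the two results tables and assumes the run with MAP 0.3455 (DFR weighting without the lemmatizer) is the appropriate baseline; the variable names are illustrative.

```python
baseline_map = 0.3455  # assumed baseline: DFR run without lemmatizer

runs = [
    ("+ lemmatization", 0.4140),
    ("+ suffix indexing", 0.4381),
    ("+ stop-words", 0.4433),
]

# Relative improvement over the baseline, in percent, one decimal place.
gains = [round((m - baseline_map) / baseline_map * 100, 1) for _, m in runs]

for (name, _), gain in zip(runs, gains):
    print(f"{name}: {gain} % over baseline")
```

On these figures most of the gain comes from lemmatization, with suffix indexing and stop-word removal adding smaller increments.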
P-R graph
• Precision-recall graph for all four cases is shown below
[Figure: precision (0 to 0.8) plotted against recall % (0 to 100) for four runs: base-line; lemmatization; lemmatization and indexing suffixes; lemmatization, indexing suffixes and stopwords]
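A precision-recall curve of this kind is usually drawn from interpolated precision at fixed recall levels. The sketch below computes the standard 11-point interpolation for a single query; whether the graph here used exactly these levels is an assumption, and the ranked list and relevance set are toy data.

```python
def interpolated_precision(ranked, relevant, levels=None):
    """Interpolated precision at each recall level (default 0.0, 0.1, ..., 1.0)."""
    if levels is None:
        levels = [i / 10 for i in range(11)]
    # (recall, precision) observed at each rank where a relevant doc appears.
    points = []
    hits = 0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    # Interpolation: best precision at any recall >= the requested level.
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]

ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
curve = interpolated_precision(ranked, relevant)
print(curve)
```

Averaging these per-query curves over all queries gives one line of the graph for each run.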
Future work
• Morphological analyzer
– Accuracy 94.5 %, needs to be improved
• Heuristic suffix stripping: unknown words
• Handle derivational morphology
• Spelling variations, common spelling mistakes
Acknowledgement
• “Cross Lingual Information Access” Project
• Maharashtra Times: Times Media Group
– http://in.indiatimes.com/aboutus.cms
• Sakal: Sakal Media Group
– http://www.sakaal.in/
References
• http://ir.dcs.gla.ac.uk/terrier/
• Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval
• Jacques Savoy, Searching strategies for the Bulgarian language
• Morphological Analyzer, CFILT
Thank you