International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 1, January 2013)
Study of Indexing Techniques to Improve the Performance of Information Retrieval in Telugu Language
Kolikipogu Ramakrishna 1, Dr. B. Padmaja Rani 2
1 Department of Information Technology, Sridevi Women's Engineering College, Hyderabad, India
2 Department of Computer Science and Engineering, JNTUH, Hyderabad, India
Abstract—Information Retrieval Systems (IRS) have become popular through the World Wide Web. The amount of text information describing all types of objects, such as documents, web pages, images, videos and audio files, grows on the web day by day in an exponential manner. When a text repository grows toward the memory limits of the server, finding a particular text unit, whether a word or a document, becomes a tedious task. Representing these objects by their text information gives summarized features that help decide at first look whether to access the identified unit. Instead of matching the exact query against the document, a set of keywords is used to judge the relevance of the document. If a set of keywords represents a document, it is easy to match a couple of keywords from the query against the keywords of the document and decide its relevance. Finding keywords to represent a complete unit is called indexing. Keywords are designated units that make it easy to locate a document with any search engine; a keyword maps to all the documents containing that indexed word. This problem is addressed by identifying the indexed words or phrases of a document; together, the indexing terms represent the whole document and act as ambassadors of the unit. In this paper we study the effect of various indexing techniques, namely manual, automatic and semi-automatic, on 10,000 Telugu text documents. Statistical indexing is taken as the baseline approach and its results are compared with the other techniques. We observed that the results improve while moving from statistical representations to semantic representations.
Keywords—Keywords, Indexing Terms, Manual Indexing,
Automatic Indexing, Statistical Indexing, Semantic based
Indexing, Telugu Text Corpus, N-gram, Inverted File
Structure.
I. INTRODUCTION
Information Retrieval is the process of retrieving and presenting to the user, from a standardized collection of objects drawn from different sources or repositories, the content objects relevant to his/her query. The Web is the best resource for information retrieval processes, where different techniques are used to give users exactly the information they need.
Naive users are not very familiar with structured queries. Users submit short queries that do not consider the variety of terms used to describe a topic, resulting in poor recall [5]. Searching a non-standardized store of bulk documents is highly difficult; indexing reduces the complexity of the search process. Indexing is the process of identifying keywords that represent a document based on its contents. Indexing is a very important phase of an Information Retrieval System, as it creates a searchable unit for the given query.
Basically, indexing is performed by assigning each document keywords or descriptive terms representing the document [1]. The assigned terms must reflect the content of the document to allow effective keyword searching. In manual indexing, a couple of trained people who are well versed in the concepts of the document participate in the indexing process. Manual indexing is a time-taking process, and it requires huge man-hours to index a repository that grows day by day. Automatic text indexing, which is much faster and less error-prone, has become common practice on big corpora. Research on English texts has shown that the retrieval effectiveness of automatic indexing is comparable to that of manual indexing [2][3]. A natural language query specifies the user's information need in one or more natural language sentences. A phrase query contains phrases representing concepts of interest to the user [4]; it then becomes necessary to mind the language features before selecting indexing terms, and language processing tools help to identify better indexing terms to represent the whole object. In this case study the effect of various indexing techniques is observed on a fixed-length Telugu corpus. The majority of related work on various language corpora is examined through the literature survey in the next section. In this paper we concentrate on how indexing improves retrieval performance. Various methods are adopted to index the items in this research; indexing items using semantic concepts gives better representations.
The performance of this task is measured with the standard Information Retrieval performance measures, namely Precision, Recall and F-Measure, along with other similarity measures used while indexing the items of the corpus.
II. INDIAN LANGUAGES
A. Dravidian Languages
The languages spoken by Indians come from different language families, such as Indo-Aryan (a subset of the Indo-European languages), Dravidian, Austro-Asiatic and Tibeto-Burmese. There are many languages in India, but 22 languages are given official status by the Govt. of India in the 8th Schedule, out of which 15 are Indo-Aryan languages, 4 are Dravidian languages, one is Austro-Asiatic and two are Tibeto-Burmese languages. Kannada, Malayalam and Tamil are the most spoken Dravidian languages in South India.
B. Telugu Language
Telugu is the third most spoken language in India and one of the fifteen most spoken languages in the world. Due to the high complexity of the Telugu language, it is difficult to search and retrieve the required documents from a repository.
C. Language Representation and Encoding Standards
There is no upper- and lower-case representation in Indian languages. Vowels are free to occur at the beginning of a word, unlike in English. The rules for recognizing language features differ from language to language; hence language processing tools must be built for each language with its unique features. It is very important to know the internal representation of a language before processing it. Telugu letters are not single alphabets like English letters. The entire study and the implementation results are shown in WX/UTF notation, whereas the internal encoding is always Unicode. The Unicode Standard, Version 6.2, assigns hexadecimal code points for the Telugu script in the range 0C00-0C7F. Table I shows the WX notation for the Telugu alphabet. Unicode Transformation Format (UTF) is the universal character code standard for representing character sets; UTF-8 is an alternative coded representation form for all the characters in Unicode that maintains compatibility with ASCII [15]. WX notation is a transliteration scheme into Roman script, used to denote the Dravidian and Devanagari scripts of Indian languages; these standards aim at providing a unique representation of Indian languages in the Roman alphabet [16]. The examples given throughout this paper are represented in WX notation and converted from WX to UTF for display in Telugu script.
TABLE I
UTF[WX]-NOTATION FOR TELUGU SCRIPTS
ఄ [a] అ[A] ఆ[i] ఇ[I] ఈ[u]
ఉ[U] ఊ[q] ఎ[e] ఏ[eV] ఐ[E]
[o] ఔ[oV] ఄం[aM] ఄః[aH] ర[rY]
క[ka] ఖ[Ka] గ[ga] ఘ[G] ఙ[fa]
చ[ca] ఛ[Ca] జ[ja] ఝ[Ja] ఞ[Fa]
ట[ta] ఠ[Ta] డ[da] ఢ[Da] ణ[Na]
త[wa] థ[Wa] ద[xa] ధ[Xa] న[na]
[pa] ప[Pa] ఫ[ba] బ[Ba] భ[ma]
మ[ya] య[ra] ఱ[la] ళ[va] శ[sa]
ఴ[Sa] వ[Ra] ష[ha] ల[lYa] క్ష[kRa]
III. RELATED STUDY ON INDEXING
Indexing an item is the process of creating a searchable data structure from the received items. This transformation requires finding the important keywords that represent the complete item, both meaning-wise and in statistical matching ratio. Indexing is similar to cataloguing books in a library: cataloguing creates access points on an item collection that are expected and most useful to the users of information retrieval [9]. Behind the user interface, the search engine collects all the data and builds an index to store the data so that a user can access it quickly by posing a simple query. Different search engines use different methods to index the data.
This is the main reason for getting different outcomes when searching different search engines (Google, Bing, Yahoo, etc.) with the same query. Google, for example, ranks a page higher if a larger number of pages vote for (link to) that particular page. Basically, indexing can be done in two ways: manual indexing and automatic indexing. Sometimes semi-automatic indexing is also used, to represent complex documents having more content and fewer keywords.
A. Manual Indexing
Automatic indexing follows set processes of analysing frequencies of word patterns and comparing the results with other documents in order to assign subject categories. This requires no understanding of the material being indexed and therefore leads to more uniform indexing, but at the expense of the true meaning being interpreted. A computer program will not understand the meaning of statements and may therefore fail to assign some relevant terms, or assign them incorrectly. Human indexers focus their attention on certain parts of the document, such as the title, abstract, summary and conclusions, as analysing the full text in depth is costly and time-consuming [6]. According to Jakob [8], manual indexing is called tagging, with index terms referred to as tags. There is a renaissance of manual subject indexing and analysis: structured metadata is published with techniques like RDF, the Dewey Decimal System, MARC (MAchine-Readable Cataloging), RSS, OAI-PMH and OpenSearch, and browser search plug-ins allow specialised search engines to aggregate it [8]. The full-text searchable data structure for items in the document file provides a new class of indexing called total document indexing [9]. Most of the search websites follow controlled vocabulary tagging for indexing web pages. There has been a debate for a number of years about which method is better, human indexing or automatic indexing [8]. Manual indexing is always better in concept representation, and it still has more importance in expressing the meaning of a whole document with a limited number of indexing terms; but manual indexing is a critical task to carry out. When all terms of an item are used to represent the complete item, what is the use of selecting terms as indexing units to represent it? If one can search on any of the words in a document, why does one need to add additional index terms? [9]. Use of a controlled vocabulary helps the indexer limit the indexing terms that represent a document to its major concepts. A controlled vocabulary is a finite set of index terms from which all index terms must be selected.
This kind of indexing with a controlled vocabulary takes more time when the repository is big. The extra processing time comes from the indexer trying to determine the appropriate index terms for concepts that are not specifically in the controlled vocabulary set. Indexing becomes slower, but the search process becomes easier and faster. Controlled vocabularies give the indexer the understanding needed to select the proper terms that describe the information needs. Uncontrolled vocabularies have the opposite effect, making indexing faster but the search process much more difficult [9]. In the case of the Telugu language, the vocabulary is not completely controlled and most of the available vocabulary is not in use, so building index terms for a Telugu item set is a difficult task, whether by a manual process or an automatic one. Most of the text-book (గసర ంధికభు - grAMXikamu) terminology is not in practice (ళయళహారికభు - vyavahArikamu); in this scenario, manually deciding all the domain concepts in an item badly requires alternate vocabulary resources like dictionaries. The World Wide Web is drastically increasing in size, and manual indexing is not possible on a dynamically growing repository. In manual indexing, considering the entire item when selecting terms is a tedious job, so the manual indexer limits attention to a few sections of an item, such as the title, sub-headlines, abstract, summary and conclusion. When indexing any item, one should mind specificity and exhaustivity.
i) Specificity:
The selected indexing terms must represent the concept of an item closely: every term selected as an index term should relate to a topic of the item being matched against the query terms. This is what is known as the specificity of a topic from an index term. Specificity describes how closely the index terms match the topics they represent [10]. An index is said to be specific if the indexer uses descriptors parallel to the concepts of the document and reflects those concepts precisely. Specificity tends to increase with exhaustivity, as the more terms you include, the narrower those terms will be [11]. Specificity is characterised as a semantic property of index terms (i.e. a term is more or less specific as its meaning is more or less detailed and precise). It has also been suggested that specificity should be interpreted statistically, as a function of term use rather than of term meaning [12]. Simply put, specificity is the right selection of keywords to represent a topic of an item set (i.e. the number of documents represented by an indexed term is its specificity).
ii) Exhaustivity:
The exhaustivity of a document description is the coverage of its various topics given by the terms assigned to it; the number of indexed terms used to represent a document is known as its exhaustivity. Exhaustivity of indexing is the extent to which the concepts are covered by the descriptors assigned to the entity. Exhaustivity has two components: viewpoint exhaustivity and importance exhaustivity [13]. Exhaustivity must be limited; otherwise the search process becomes time-consuming. On the other hand, if concepts are not represented by indexing terms, it is difficult to locate the items in the search space. Viewpoint exhaustivity asks how many features are covered by the indexing so that the item can be found in the retrieval process; the degree to which this question can be answered with "yes" is the viewpoint exhaustivity. Importance exhaustivity addresses the question: what is the importance threshold for the assignment of descriptors as prescribed in the indexing rules? [13]. Incorrect descriptors increase the count without contributing to exhaustivity.
B. Automatic Indexing
Automatic indexing is a process whereby a computer processes a text in natural language that is in machine-readable form [26]. Automatic indexing is done by a machine according to the rules framed in the program. In fact, automatic indexing is claimed to be the better indexing approach, as it removes the limits of time, cost, exhaustivity, specificity, vocabulary, searching and browsing, and allows the entire document to be analysed, while still offering the option of being directed to particular parts of the document. The major limitation of this approach is that it is not in a position to resolve the ambiguous selection of terms. To overcome this, one has to write complex rules for selecting indexed terms and disambiguate the concept with different algorithms. Linking many tools and applying complex rules to select an index term that describes the correct concept gives slower response. Research into ways to apply automatic techniques to images, sound, and other non-textual material is still in its infancy, compared to the half century of work on automatic indexing of language text [7]. Query expansion and document expansion techniques are used to select better indexing terms that describe the correct concept of the units [5]. Examples of automatic indexing are KWIC (Key Word In Context), KWAC (Key Word Alongside Context) and KWOC (Key Word Out of Context). Automatic indexing is the method preferred by real-time search engine developers.
C. Relatedness of Indexing terms
Relationships among a set of indexing terms are provided in two places: at indexing time, called pre-coordination, and at search time, called post-coordination. The establishment of a relationship between multiple indexing terms is also called linkage. Linkages are used to correlate related attributes associated with concepts discussed in an item. The process of creating term linkages at index creation time is pre-coordination. Post-coordination can be implemented by connecting terms with the Boolean "AND" operator, which only finds indexes that contain all the search terms [9]. During linkage, the indexer should mind the number of indexing terms to be related, the order of linkage, and any additional descriptors associated with the index terms [14]. The examples show how indexing terms are related during indexing and/or during the search process on a Telugu text item.
D. Stop list
Stop words are common words which carry no information on their own and frequently occur in most items of the corpus. These stop words need to be filtered out before or after processing the text. There is no predefined stop list for Telugu as there is for English. Even in English, if the search is a phrase search, then no functional word can be treated as a stop word in the search. For Telugu, the stop words depend on the context of the search. While indexing, removing stop words is very important to limit the index size and speed up matching; hence deciding the stop-word list is a key task during the indexing phase. For this study we prepared a list of possible stop words based on the available text corpus, and these are eliminated before indexing the items.
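Stop-word elimination can be sketched in a few lines. The stop list below is an assumption for illustration, taken (in WX notation) from the stop words shown later in Example 4:

```python
# A minimal sketch of stop-word elimination before indexing; the stop
# list is an illustrative assumption in WX notation.
STOP_WORDS = {"guriMci", "weVliyani", "vAruMtArA", "wvaralo", "uMxi",
              "ceVMxina", "I", "kiMxa", "vacce"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [t for t in tokens if t not in stop_words]
```

The filter is applied to the token stream of each item before index terms are selected.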
E. Web Crawler and Text Corpus
Before discussing the data structures used to store the indexing terms of an item, it is necessary to know the flow of steps by which documents are collected and cleaned through a standardization process. Figure I gives a complete picture of how indexing is carried out in the information retrieval process.
i) Crawlers:
A crawler is a computer program that starts visiting from a seed URL and iterates by selecting new URLs from which to collect web pages. Once the crawler fetches a web page, the server begins the indexing process and stores the indexed terms in the specified data structures.
Figure I. Crawling of Web pages based on URLs to create a standardized corpus for a search engine.
Web crawlers visit, or crawl, web pages and download them, starting from one page and determining which page to go to next. Crawling efficiency depends on various policies: selection of pages by considering page rank; revisiting by checking freshness and age; politeness, so as not to burden the web servers; and download rate. Figure II illustrates how Nutch, an open-source web crawler built on Lucene, works [17].
Figure II. Open Source web crawler: Nutch [17]
We have not used any web crawler to collect the Telugu corpus; instead we downloaded it from the Wikipedia corpus collection [18].
F. Indexing Data Structures
After collecting the items for indexing, they must be stored in a normalized form. These data structures store the terms and the associated information needed to support the search process.
i) Stemming:
A series of rules is applied to remove a part of a word (a suffix, affix or prefix) to generate the root word. Received items are preprocessed by stemming. Suffix removal is the most suggested method of stemming, proposed by Porter [20] and called the Porter stemmer. In stemming, suffixes, prefixes or infixes are blindly removed or replaced with possible rule-based substrings. Stemming reduces the diversity of representations of a concept, its morphological variants or word forms, to a canonical morphological representation. Stemming improves recall [19], but it may sometimes cause loss of information due to the blind stripping of a suffix or prefix of a term. It is highly difficult to frame the rules for suffix stripping for the Telugu language; suffix stripping is not suggested for highly complex languages with a large number of variant word forms, like Telugu.
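Rule-based suffix stripping can be sketched as follows. These toy English rules are assumptions for demonstration, not the actual Porter rule set; as noted above, such blind stripping suits English far better than a morphologically rich language like Telugu:

```python
# Illustrative suffix-stripping stemmer in the spirit of the Porter
# approach; the rule list is a toy assumption, not Porter's real rules.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    """Apply the first matching rule; longer suffixes are listed first."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word  # no rule matched: leave the word unchanged
```

For example, "indexing" is reduced to "index" and "caresses" to "caress"; the length guard prevents very short words from being stripped blindly.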
ii) N-gram Model:
A variant of the searchable data structure is the n-gram structure, which breaks processing tokens into smaller string units [21]. The n-gram model is an alternate technique that generates all possible substrings of length 1 to n as conflational forms of the indexing terms. N-grams are fixed-length consecutive series of "n" characters used to determine the stem of a word that represents the concept of the word, but n-grams do not care about semantics [9]. Trigrams (3-character units) were found to be optimal for the English language [22]. In English the alphabet maps readily onto syllables, so it is easy to find syllables with the n-gram technique, whereas in Telugu a syllable is represented by a combination of one or more alphabets. A syllable is a unit of pronunciation having one vowel sound, with or without surrounding consonants, forming the whole or a part of a word. The syllable n-gram model gave good performance [21]: the maximum word length n is considered, and all syllable n-grams down to minimum length 1 are generated.
Example 1: the word తియుతి (wirupawi) with the syllable n-gram model, n=4 (WX notation).
TABLE II
EXAMPLE 1 WITH N-GRAM AND SYLLABLE N-GRAM MODELS
N-gram model, n=1..8: w, wi, wir, wiru, wirup, wirupa, wirupaw, wirupawi
Syllable n-gram model, n=1..4: తి (wi), తియు (wiru), తియు (wirupa), తియుతి (wirupawi)
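The two conflation schemes of Example 1 can be sketched as prefix generators: plain character n-grams over the WX string, and syllable n-grams over a pre-divided syllable list. The syllable division itself requires the linguistic rules of the language and is assumed as input here:

```python
# Character-prefix n-grams and syllable-prefix n-grams, as in Example 1.
def prefix_ngrams(word):
    """All character prefixes of length 1..len(word)."""
    return [word[:i] for i in range(1, len(word) + 1)]

def syllable_ngrams(syllables):
    """Prefixes measured in syllables rather than characters."""
    return ["".join(syllables[:i]) for i in range(1, len(syllables) + 1)]
```

For "wirupawi" the character scheme yields eight entries (n=1..8), while the syllable scheme over wi-ru-pa-wi yields only four, matching the two columns of the example.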
iii) Tries (Re"trie"ve):
The word trie is taken from "retrieve"; a trie represents a try at matching syllables from an internal node point. A trie represents all the words in a single data structure. The time cost of this data structure is optimal compared to other data structures, and it has the advantage that the search cost for a word depends on the number of syllables in the word. The difficulty of the n-gram model in the context of the Telugu language can be solved by identifying the syllables of the word using the linguistic rules of the language [23].
Example 2: the word wirupawi (తియుతి) can be represented as syllables according to the language rules [23]. After dividing "wirupawi" (తియుతి) into syllables, the syllables are wi-ru-pa-wi (తి-యు--తి). This pre-conversion saves search time in the trie data structure.
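A syllable trie can be sketched minimally as follows: each node maps a syllable to a child node, so the cost of a lookup depends only on the number of syllables in the word, as noted above. Words are assumed to be pre-divided into syllables:

```python
# A minimal syllable trie: children maps syllable -> child node.
class Trie:
    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, syllables):
        """Walk/create one node per syllable and mark the end as a word."""
        node = self
        for s in syllables:
            node = node.children.setdefault(s, Trie())
        node.is_word = True

    def contains(self, syllables):
        """Follow the syllable path; True only if a whole word ends there."""
        node = self
        for s in syllables:
            if s not in node.children:
                return False
            node = node.children[s]
        return node.is_word
```

Inserting wi-ru-pa-wi and looking it up touches exactly four nodes, one per syllable.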
iv) Successor Stemmer:
The successor stemmer finds word and morpheme boundaries based on the distribution of phonemes that distinguishes one word from another. The process determines the successor variety for a word, uses this information to divide the word into segments, and selects one of the segments as the stem. The successor variety of a segment of a word, within a set of words, is the number of distinct syllables that occupy the position immediately following that segment.
Example 3: తియుతి, తియుభఱ, తియువీధ,ి తియుతిరసళు. To verify the word తియుతిరసళు among this 4-word collection, the successor stemmer works as shown in Table III:
TABLE III
SUCCESSOR STEMMER FOR EXAMPLE 3
Prefix: Successor Variety (Successor Syllables)
తి: 1 (యు)
తియు: 3 (తి, భ, వీ)
తియుతి: 1 (రస)
తియుతిరస: 1 (ళు)
తియుతిరసళు: 1 (blank space)
The successor variety of any prefix of a word is the number of children associated with the node representing that prefix in the symbol tree [24]. In less formal terms, the successor variety of a string is the number of different characters that follow it in words in some body of text [25].
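Successor variety over a word collection can be sketched as below. The WX syllable tuples are an assumed pre-segmentation of the four words of Example 3 (the transliterations of the last two words are assumptions); end-of-word, which the table counts as a blank-space successor, is ignored here for simplicity:

```python
# Successor variety over an assumed syllable-segmented word collection.
WORDS = [("wi", "ru", "pa", "wi"),
         ("wi", "ru", "ma", "la"),
         ("wi", "ru", "vI", "Xi"),
         ("wi", "ru", "pa", "wi", "rA", "lYu")]

def successor_variety(prefix, words=WORDS):
    """Number of distinct syllables following `prefix` in the collection."""
    n = len(prefix)
    return len({w[n] for w in words if w[:n] == prefix and len(w) > n})
```

The peak at the two-syllable prefix (variety 3) marks the stem boundary, matching the table.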
v) Inverted File (IF) Structure:
In general, the importance of a word is decided by the number of occurrences of the word in an item, called its term frequency (tf). Each term maintains the list of items in which it occurs, called its inverted list; the way to locate the inversion list for a particular word is a dictionary lookup. The inverted file structure maintains the list of items, the inverted lists and the dictionary; the dictionary stores all the unique words of the entire corpus. Most database and information retrieval applications use an inverted file structure to represent the indexing of an item. Unfortunately, it is not possible to achieve an efficient adaptation of an inverted file to the matching of more elaborate document and query descriptions, such as weighted keywords [9]. The inversion list maintains the document ID, the term frequency, and the term positions in the document, to support proximity search and continuous word-phrase search. The inverted file structure is an efficient data structure for storing big collections of data items; for a Telugu text corpus, IFS is one of the best approaches for indexing.
Steps to prepare an IFS:
1. Tokenize the item.
2. Eliminate the stop words.
3. Apply the syllable n-gram model to identify the stems, or use a morphological analyzer to get root words.
4. Maintain the Doc-ID and the indexing terms of each item.
5. Prepare the dictionary of all unique words with their total occurrences in all the documents.
6. Create the inverted list with <Doc-ID, term position, term frequency in the same document> against each word's occurrences in the various documents.
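The steps above can be sketched as a small builder: tokenize, drop stop words, and accumulate the dictionary (word to total frequency) plus the inverted lists (word to Doc-ID to term positions); the per-document term frequency is the length of the position list. Stemming (step 3) is omitted for brevity, and the stop list is an assumption:

```python
# A sketch of IFS construction following steps 1-6 (stemming omitted).
from collections import defaultdict

STOP_WORDS = {"guriMci", "weVliyani"}  # assumed stop list

def build_ifs(docs):
    """docs maps Doc-ID to text; returns (dictionary, inverted_index)."""
    dictionary = defaultdict(int)   # word -> total frequency in all docs
    inverted = {}                   # word -> {Doc-ID: [term positions]}
    for doc_id, text in docs.items():
        tokens = [t for t in text.split() if t not in STOP_WORDS]
        for pos, term in enumerate(tokens):
            dictionary[term] += 1
            inverted.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return dict(dictionary), inverted
```

A query term is then resolved by one dictionary lookup followed by a walk of its inverted list.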
Example 4: Document set
Doc-1:. తియుతి ఱడడూ గురించి తెలిమని వసయుంటారస?.
Doc-2:. రంచవసయతంగస ఎంతో పేయు రఖయయతుఱు ప ందిన. తియుతి
ఱడడూ క ితవయఱ ోపేట ంట్ ళచ్చే ఄళకసఴం ఈంది.
Doc-3:. దచఴ, విదచశసఱకు చె్ందిన బకుత ఱు బకిత ఴరదధఱతో ఎంతో ఆవటంగస స్వవకరించ్చ ఇ ఱడడూ క ిభౌగోళిక రతచయకత కింద పేట ంట్ షకుుఱు ప ందాఱని తియుభఱ తియుతి దచళస్సథ నం (టిటిడి) నియణయంచింది.
Before preprocessing, the documents are converted into WX notation and tokenized into processable units. Stop words are manually identified and separated from the term list of each item.
Doc-1: wirupawi laddU guriMci weVliyani?-
vAruMtArA?
Doc-2: prapaMcavyApwaMgA eVMwo peru
praKyAwulu poVMxina wirupawi laddUki wvaralo
peteVMt vacce avakASaM uMxi.
Doc-3: xeSa, vixeSAlaku ceVMxina Bakwulu Bakwi SraxXalawo eVMwo iRtaMgA svIkariMce I laddUki
BOgolYika prawyekawa kiMxa peteVMt hakkulu
poVMxAlani wirumala wirupawi xevasWAnaM (titidi)
nirNayiMciMxi.
Doc-1 wirupawi,laddU
Doc-2 prapaMcavyApwaM,eVMwo, peru,
praKyAwulu, poVMxina, wirupawi, laddU,
wvaralo, peteVMt, vacce, avakASaM
Doc-3 xeSamu, vixeSamu, Bakwulu, Bakwi , SraxXa,
eVMwo, iRtaM, svIkariMcu, I laddU,
BOgolYikaM, prawyekaM, peteVMt, hakkulu,
poVMxuta, wirumala, wirupawi,
xevasWAnaM, titidi, nirNayiMciMxi
After converting from WX to UTF:
Doc-1 తియుతి, ఱడడూ
Doc-2 రంచవసయత ం, ఎంతో , పేయు , రఖయయతుఱు , ప ందుట, తియుతి ,
ఱడడూ , పేట ంట్, ఄళకసఴం
Doc-3 దచఴభు, విదచఴభు, బకుత ఱు, బకిత, ఴరదధ , ఎంతో, ఆవటం, స్వవకరించు,
ఱడడూ , భౌగోళికం, రతచయకం, పేట ంట్, షకుుఱు, ప ందుట, తియుభఱ,
తియుతి, దచళస్సథ నం,టిటిడి,నియణయంచింది
Stop words : గురించి, తెలిమని, వసయుంటారస, తవయఱో, ఈంది, చె్ందిన, ఇ, కింద, ళచ్చే,
Dictionary creation in sorted order; F is the frequency of a word over all the documents, called the total frequency.
TABLE IV
WORD DICTIONARY WITH FREQUENCY OF TERMS FOR EXAMPLE 4
Unique Word (F):
avakASaM (ఄళకసఴం): 1
iRtaM (ఆవటం): 1
eVMwo (ఎంతో): 2
titidi (టిటిడి): 1
wirupawi (తియుతి): 3
wirumala (తియుభఱ): 1
xevasWAnaM (దచళస్సథ నం): 1
xeSamu (దచఴభు): 1
nirNayiMciMxi (నియణయంచింది): 1
peteVMt (పేట ంట్): 2
peru (పేయు): 1
poVMxuta (ప ందుట): 2
prawyekaM (రతచయకం): 1
praKyAwulu (రఖయయతుఱు): 1
prapaMcavyApwaM (రంచవసయత ం): 1
Bakwi (బకిత): 1
Bakwulu (బకుత ఱు): 1
laddU (ఱడడూ): 3
svIkariMcu (స్వవకరించు): 1
hakkulu (షకుుఱు): 1
SraxXa (ఴరదధ): 1
Indexing terms are manually identified by a language expert to represent the concept of the whole document as well as of each sentence.
TABLE V
INVERTED LIST FOR EXAMPLE 4
wirupawi (తియుతి): Doc-1, Doc-2, Doc-3
laddU (ఱడడూ): Doc-1, Doc-2, Doc-3
peteVMt (పేట ంట్): Doc-2, Doc-3
poVMxuta (ప ందుట): Doc-2, Doc-3
G. Index Term Weighting
The effectiveness of indexing depends on specificity and exhaustivity [27]. It has been recognized that a high level of exhaustivity of indexing leads to high recall and low precision; conversely, a low level of exhaustivity leads to low recall and high precision [28].
i) Precision:
Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved. It measures the rate of error and the rate of success among the retrieved documents.
ii) Recall:
Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents.
iii) Fallout:
Sometimes a negative ratio of relevance is useful to judge the inefficiency of the retrieval. Fallout is the ratio of the number of non-relevant documents retrieved to the total number of non-relevant documents.
iv) F-Measure:
The F-Measure, also called the F-score, measures the accuracy of the retrieval outcome as a weighted combination of precision and recall. The parameter β weights recall relative to precision; when β=1, the F-Measure is the harmonic mean of precision and recall, known as the traditional F-score or balanced F-Measure [29].
H. Latent Semantic Indexing
In statistical indexing, the indexing terms are selected based on term frequency and co-occurrence in an item. The Vector Space Model represents the terms as rows and the items as columns of a vector space [32]. Different measures are used to find the similarity of the indexed terms in an item with the query terms. In cosine similarity, a large angle (small cosine) represents dissimilar items and a small angle (large cosine) represents similar items. When concepts are used in the vector space model, there is an inverse effect on recall and precision. The basis for concept indexing is that there are many ways to express the same idea, and increased retrieval performance comes from using a single representation [9]. Concepts can be extracted from language resources like dictionaries, thesauri, WordNet and ontologies. Synonyms (రసయమదాఱు - paryAyapaxAlu) of a term give many ways to refer to the same indexing term (i.e. different terms with the same meaning). For example, mother in Telugu can be expressed in many ways without losing meaning: ఄభమ - amma: జనని, తలిి, భయత. Using any one of the synonyms instead of a single fixed term to represent the same concept increases the precision of the search. Similarly, when a term has more than one meaning, it misleads the concept representation; this increases recall and has an adverse effect on precision. For example, "Apple" may be a 'Computer' or a 'Fruit'.
The measures defined above are computed as:
Precision = (# relevant items retrieved) / (# items retrieved in total)
Recall = (# relevant items retrieved) / (# relevant items in total)
Fallout = (# non-relevant items retrieved) / (# non-relevant items in total)
F-Score = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)
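Precision, Recall, Fallout and the F-Measure can be computed directly from counts of relevant and retrieved items; a minimal sketch, where the numeric counts used in the assertions are illustrative assumptions from a relevance-judged retrieval run:

```python
# The four standard retrieval measures as simple ratio functions.
def precision(relevant_retrieved, total_retrieved):
    """Fraction of retrieved items that are relevant."""
    return relevant_retrieved / total_retrieved

def recall(relevant_retrieved, total_relevant):
    """Fraction of relevant items that were retrieved."""
    return relevant_retrieved / total_relevant

def fallout(nonrelevant_retrieved, total_nonrelevant):
    """Fraction of non-relevant items that were (wrongly) retrieved."""
    return nonrelevant_retrieved / total_nonrelevant

def f_score(p, r, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

With beta=1 the F-score reduces to the harmonic mean 2PR/(P+R).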
Statistical indexing alone is not adequate to handle this
kind of problem; an alternate method based on the Vector
Space Model, called Latent Semantic Indexing (LSI), addresses both problems [31]. Document expansion and query
expansion techniques help to extract the concepts of an
item and of a query [30]. Concept indexing determines a
canonical set of concepts based upon a test set of terms and
uses them as a basis for indexing all items [9]. The
determined concepts do not have a label associated
with them (i.e., a word or set of words that could be
used to describe them); each is a mathematical representation,
a vector. LSI works on the principle that words that
are used in the same contexts tend to have similar
meanings.
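LSI is commonly realized as a truncated singular value decomposition of the term-by-item matrix, projecting items onto a small number of unlabelled concept dimensions. The matrix below is a toy example with made-up counts, and keeping k = 2 concepts is an arbitrary illustrative choice:

```python
import numpy as np

# Toy term-by-item matrix (rows: terms, columns: items), as in the
# vector space model; the counts are illustrative only.
A = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
], dtype=float)

# Truncated SVD: keep k latent "concepts". Each concept is just a
# direction in the vector space, with no word label attached.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
items_in_concept_space = (np.diag(s[:k]) @ Vt[:k]).T  # one row per item

print(items_in_concept_space.shape)  # (4, 2): 4 items, 2 concepts
```

Queries can be projected into the same k-dimensional space and compared with items by cosine similarity, so terms that co-occur in similar contexts end up close together even when they never appear in the same item.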
IV. CONCLUSION
In this paper we studied the effect of indexing on the
Information Retrieval process for Telugu documents. We
observed that manual indexing works quite well in terms of
accuracy, but it is time-consuming, and it is highly
difficult for a few indexers to cover all subject areas.
In most cases a human indexer limits attention to a particular part of the
item when extracting its concept, which can have an adverse effect.
As the corpus grows, manual indexing is not advisable.
Automatic indexing has many advantages when compared
to manual indexing. Statistical indexing using the Boolean
approach and the Vector Space Model are blind methods
that find descriptors of items through the
frequency counts of terms in the corresponding items. The
concept of an item can be best represented by a few
indexing terms with related meanings using Latent Semantic
Indexing techniques. Using language-processing tools
such as a Morphological Analyzer and a POS Tagger to find base
terms may simplify the indexing process. While indexing,
extracting the concept of items using language resources such as
synsets, WordNet or ontologies will improve indexing
accuracy in terms of precision and recall, although overuse
of such resources may lead to poor
results.
REFERENCES
[1]. W. Bruce Croft, Mirna Adriani, 1997. Retrieval Effectiveness of Various Indexing Techniques on Indonesian News Articles, 1-7.
[2]. Mirna Adriani, W. Bruce Croft, 1997. Retrieval Effectiveness of Various Indexing Techniques on Indonesian News Articles.
[3]. Salton, Gerard.,1986 . Another Look at Automatic Text-Retrieval Systems. Communications of the ACM 29(7), 648-656.
[4]. Sparck Jones, Karen, 1974. Automatic Indexing. Journal of Documentation: 30(4), 393-432.
[5]. Ramakrishna Kolikipogu, Padmaja Rani B, 2011. WordNet based
Term selection for Pseudo Relevance Feedback Query Expansion Model, ICCMS, IEEE, Vol 2.,
[6]. F. W. Lancaster, 2003. Indexing and abstracting in theory and practice. Third edition. London, Facet ISBN 1-85604-482-3. pp 24.
[7]. James D. Anderson, Jos Prez-Carballo , 2001. The nature of indexing: how humans and machines analyze messages and texts for retrieval. Part II: Machine indexing, and the allocation of human versus machine effort, Information Processing and Management 37 : 255-277 .
[8]. Ginger Shields , 2005. What are the main differences between human indexing and automatic Indexing? , LI-842 Automatic Indexing Assignment , pp:1-4.
[9]. Kowalski and Maybury, 2001. Information Storage and Retrieval System : Theory and Implementations.
[10]. J.D. Anderson, 1997. Guidelines for indexes and related information retrieval devices. Bethesda, Maryland, Niso Press. 10 December 2008.
[11]. D.B. Cleveland and A.D. Cleveland,2001. Introduction to indexing and abstracting. 3rd Ed. Englewood, libraries Unlimited, Inc. Page 106.
[12]. Karen Spärck Jones, 2004. A statistical interpretation of term specificity and its application in retrieval , Journal of Documentation Volume 60 Number 5 pp. 493-502 .
[13]. Dagobert Soergel , 1994. Indexing and retrieval performance: The logical evidence , Soergel, Indexing and
retrieval performance , Journal of the American Society for Information Science , 1-22.
[14]. Vickery, B. C., 1970. "Techniques of Information Retrieval", Archon Books, Hamden, Conn.
[15]. Unicode Standard Version 6.2, 2012. http://www.unicode.org
[16]. Akshar Bharati, Vineet Chaitanya, and Rajeev Sangal.1995. Natural Language Processing : A Paninian Perspective. PHI, 2010.
[17]. http://lucene.apache.org/nutch/
[18]. http://lucene.apache.org/nutch/
[19]. Ricardo Baeza-Yates, 2011. Modern Information Retrieval , Pearson 5th Edition, Pp:168-169.
[20]. M.F.Porter, 1980, An algorithm for suffix stripping, pp 130-137.
[21]. Dr.B.Padmaja Rani and Dr.A.Vinay Babu, 2010. Novel Implementation of Search Engine for Telugu Documents with Syllable N-Gram Model, International Journal of Engineering Science and Technology, Vol. 2(8), 2010, 3712-3720.
[22]. Yochum and Yochum, J., 1985. A High-Speed Text Scanning Algorithm Utilizing Least Frequent Trigraphs, IEEE Proceedings New Directions in Computing Symposium, Trondheim, Norway, 1985, pages 114-121.
[23]. K.V.N.Sunitha and N.Kalyani, 2012. Isolated Word Recognition using Morph – Knowledge for Telugu Language, International Journal of Computer Applications,Vol-38,No-12,Pp:47-54.
[24]. M. Hafer and S. Weiss, 1974. Word Segmentation by Letter Successor Varieties, Information Storage and Retrieval, 10, 371-85.
[25]. Deepika Sharma, 2012. Stemming Algorithms: A Comparative Study and their Analysis, International Journal
of Applied Information Systems, Foundation of Computer Science FCS, Volume 4– No.3,Pp:7-12.
[26]. Martin Tulic, 2005. Automatic Indexing (http://anindexer.com/about/auto/autoindex.html ).
[27]. KEEN, E.M. and DIGGER, J.A.,1972. Report of an
Information Science Index Languages Test,
Aberystwyth College of Librarianship, Wales.
[28]. Lancaster F.W.,1968 Information Retrieval
Systems: Characteristics, Testing and Evaluation,
Wiley, New York.
[29]. Beitzel., Steven M., 2006. On Understanding and Classifying Web Queries –(Ph.D. thesis).
[30]. Ramakrishna Kolikipogu, Padmaja Rani.B,2012, Reformulation of Web Query terms using Semantic Relationships, ICACCI-2012,ACM-Proceedings.
[31]. Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S., 1988. Using latent semantic analysis to improve information retrieval. In Proceedings of CHI'88:
Conference on Human Factors in Computing, New York: ACM, 281-285.
[32]. Christopher D. Manning, 2006. An Introduction to Information Retrieval, Preliminary Draft, Cambridge UP.
AUTHOR – 1:
Mr. Kolikipogu Ramakrishna holds a B.Tech in Computer Science and Information Technology from JNTU
Hyderabad and an M.Tech in Computer Science and Engineering with a specialization in Software Engineering, and is pursuing a Ph.D in Computer Science and Engineering from JNTU Hyderabad. To his credit he has published 20 research papers in various International/National Conferences and Journals. He acts as a reviewer for several journals including IJECCE, IJEIT, IJCL, IJCCT, IJCSI and a few more. He is a member of various professional bodies: ACM-SIGIR, ISTE, IACSIT,
IAENG, ACM-CSTA etc. At present he is working as Associate Professor and Head, Department of Information Technology, Sridevi Women's Engineering College, Hyderabad.
AUTHOR - 2 :
Dr. B. Padmaja Rani holds a B.E in Electronics and Communication Engineering from Osmania University, Hyderabad, an M.Tech in Computer Science from JNTU Hyderabad, and a Doctoral Degree (Ph.D) in
Computer Science from JNTU Hyderabad. At present she is working as Professor & Head, Department of Computer Science and Engineering, JNTUH College of Engineering, JNTUH University, Hyderabad. She is guiding several Ph.D scholars in the areas of Information Retrieval, Natural Language Processing and Information Security. Her research interests include Information Retrieval, Natural Language Processing, Information Security, Data Mining
and Embedded Systems. To her credit she has published 40+ research papers in various International/National Conferences and Journals. She is a member of various professional bodies including CSI, IEEE, ISTE etc.