

International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 1, January 2013)

Study of Indexing Techniques to Improve the Performance of Information Retrieval in Telugu Language

Kolikipogu Ramakrishna 1, Dr. B. Padmaja Rani 2

1 Department of Information Technology, Sridevi Women's Engineering College, Hyderabad, India
2 Department of Computer Science and Engineering, JNTUH, Hyderabad, India

Abstract: Information Retrieval Systems (IRS) have become popular through the World Wide Web. The text information describing all types of objects on the web, such as documents, web pages, images, videos and audio files, is increasing exponentially day by day. When the text repository grows to the limit of the server's memory, finding a particular text unit, whether a word or a document, becomes a tedious task. Representing these objects by text information gives summarized features that let the user decide at first look whether to access the identified unit. Instead of matching the exact query against the document, a set of keywords is used to find the relevance of the document. If a set of keywords represents a document, it is easy to match the keywords of the query against the keywords of the document and decide relevance. Finding keywords to represent a complete unit is called indexing. Keywords are designated units that make a document easy to locate with any search engine. A keyword maps to all the documents containing that indexed word. The problem is addressed by identifying the index words or phrases of a document. Together, the indexing terms represent the whole document and act as ambassadors of the unit. In this paper we study the effect of various indexing techniques, namely manual, automatic and semi-automatic, on 10,000 Telugu text documents. Statistical indexing is taken as the baseline approach and its results are compared with those of the other techniques. We observe that the results improve when moving from statistical representations to semantic representations.

Keywords: Keywords, Indexing Terms, Manual Indexing, Automatic Indexing, Statistical Indexing, Semantic based Indexing, Telugu Text Corpus, N-gram, Inverted File Structure.

I. INTRODUCTION

Information Retrieval is the process of retrieving and presenting content objects relevant to a user's query from a standardised collection of objects drawn from different sources or repositories. The Web is the largest resource for Information Retrieval processes, where different techniques are used to give users the exact information they need.

Naive users are not very familiar with structured queries. Users submit short queries that do not consider the variety of terms used to describe a topic, resulting in poor recall [5]. Searching a non-standardised store of bulk documents is highly difficult, and indexing reduces the complexity of the search process. Indexing is the process of identifying keywords that represent a document based on its contents. It is a very important phase of an Information Retrieval System because it creates a searchable unit for a given query. Basically, indexing is performed by assigning to each document keywords or descriptive terms that represent it [1]. The assigned terms must reflect the content of the document to allow effective keyword searching. In manual indexing, a few trained people who know the concepts of the document well carry out the indexing process. Manual indexing is time-consuming and requires a huge number of person-hours to index a repository that grows day by day. Automatic text indexing, which is much faster and less error-prone, has become common practice on big corpora. Research on English texts has shown that the retrieval effectiveness of automatic indexing is comparable to that of manual indexing [2][3]. A natural language query specifies the user's information need in one or more natural language sentences. A phrase query contains phrases representing concepts of interest to the user [4], so language features must be considered before selecting indexing terms. Language processing tools help to identify better indexing terms to represent the whole object. In this case study the effect of various indexing techniques is observed on a fixed-size Telugu corpus. Related work on corpora in various languages is examined through a literature survey in the next section. In this paper we concentrate on how indexing improves retrieval performance. Various methods are adopted to index the items in this research. Indexing items using semantic concepts gives better representations.


The performance of this task is measured with the standard Information Retrieval performance measures, namely Precision, Recall and F-Measure, along with other similarity measures used while indexing the items of the corpus.
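As an illustration of these measures, the following minimal sketch computes precision, recall and the balanced F-measure for a single query. The function name and the document identifiers are hypothetical and are not taken from the experiments in this paper.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical query: four documents retrieved, three documents relevant
p, r, f = precision_recall_f(
    ["Doc-1", "Doc-2", "Doc-3", "Doc-7"],
    ["Doc-1", "Doc-2", "Doc-5"],
)
print(p, r, f)
```

Two of the four retrieved documents are relevant, and two of the three relevant documents were found, so precision and recall differ even though the number of hits is the same.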

II. INDIAN LANGUAGES

A. Dravidian Languages

The languages spoken in India come from different language families: Indo-Aryan (a subset of the Indo-European languages), Dravidian, Austro-Asiatic and Tibeto-Burman. There are many languages in India, but 22 languages are given official status by the Government of India in the Eighth Schedule, of which 15 are Indo-Aryan languages, 4 are Dravidian languages, one is Austro-Asiatic and two are Tibeto-Burman languages. Alongside Telugu, Kannada, Malayalam and Tamil are the most widely spoken Dravidian languages in South India.

B. Telugu Language

Telugu is the third most spoken language in India and one of the fifteen most spoken languages in the world. Because of the high complexity of the Telugu language, it is difficult to search and retrieve the required documents from a repository.

C. Language Representation and Encoding Standards

Indian languages have no distinction between upper and lower case. Vowels are free to occur at the beginning of a word, unlike in English, where they occur within a word. The rules for recognizing language features differ from language to language, so language processing tools must be built for each language according to its unique features. It is very important to know the internal representation of a language before processing it. Telugu letters are not single alphabetic characters as in English. The entire study and the implementation results are shown in WX and UTF notations, whereas the internal encoding is always Unicode. The Unicode Standard, Version 6.2, assigns the hexadecimal code point range 0C00-0C7F to the Telugu script. Table I shows the WX notation for the Telugu alphabet. The Unicode Transformation Formats (UTF) are the standard encodings of the Unicode character set. UTF-8 is an encoded representation of all Unicode characters that maintains compatibility with ASCII [15]. WX notation is a transliteration scheme that represents the Dravidian and Devanagari scripts of Indian languages in Roman script. It aims at providing a unique representation of Indian languages in the Roman alphabet [16]. The examples given throughout this paper are written in WX notation and converted from WX to UTF for display in Telugu script.
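The encoding facts above can be checked with a short sketch. The helper name `is_telugu` is an assumption made for illustration; the code only relies on the Telugu block range stated above and on Python's built-in UTF-8 codec.

```python
TELUGU_BLOCK = range(0x0C00, 0x0C80)  # code points U+0C00 to U+0C7F


def is_telugu(ch):
    """True if the character lies in the Unicode Telugu block."""
    return ord(ch) in TELUGU_BLOCK


word = "\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41"  # the word "Telugu" in Telugu script
utf8_bytes = word.encode("utf-8")

print(all(is_telugu(ch) for ch in word))  # every character is in the Telugu block
print(len(word), len(utf8_bytes))         # 6 code points, 3 bytes each in UTF-8
print("abc".encode("utf-8") == b"abc")    # ASCII text is unchanged by UTF-8
```

Each Telugu code point occupies three bytes in UTF-8, while ASCII characters keep their single-byte values, which is the compatibility property mentioned above.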

TABLE I

UTF[WX]-NOTATION FOR TELUGU SCRIPTS

అ[a] ఆ[A] ఇ[i] ఈ[I] ఉ[u]

ఊ[U] ఋ[q] ఏ[e] ఎ[eV] ఐ[E]

ఓ[o] ఒ[oV] అం[aM] అః[aH] ఱ[rYa]

క[ka] ఖ[Ka] గ[ga] ఘ[Ga] ఙ[fa]

చ[ca] ఛ[Ca] జ[ja] ఝ[Ja] ఞ[Fa]

ట[ta] ఠ[Ta] డ[da] ఢ[Da] ణ[Na]

త[wa] థ[Wa] ద[xa] ధ[Xa] న[na]

ప[pa] ఫ[Pa] బ[ba] భ[Ba] మ[ma]

య[ya] ర[ra] ల[la] వ[va] స[sa]

శ[Sa] ష[Ra] హ[ha] ళ[lYa] క్ష[kRa]
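The WX to UTF conversion mentioned above can be sketched as follows. This is a minimal illustration and not the converter used in this study: it assumes the standard WX conventions visible in the paper's examples (for instance eV for the short vowel in weVliyani), covers only the letters needed for ordinary words, and all names in it are hypothetical.

```python
VIRAMA, ANUSVARA, VISARGA = "\u0c4d", "\u0c02", "\u0c03"

CONSONANTS = {
    "k": "\u0c15", "K": "\u0c16", "g": "\u0c17", "G": "\u0c18", "f": "\u0c19",
    "c": "\u0c1a", "C": "\u0c1b", "j": "\u0c1c", "J": "\u0c1d", "F": "\u0c1e",
    "t": "\u0c1f", "T": "\u0c20", "d": "\u0c21", "D": "\u0c22", "N": "\u0c23",
    "w": "\u0c24", "W": "\u0c25", "x": "\u0c26", "X": "\u0c27", "n": "\u0c28",
    "p": "\u0c2a", "P": "\u0c2b", "b": "\u0c2c", "B": "\u0c2d", "m": "\u0c2e",
    "y": "\u0c2f", "r": "\u0c30", "rY": "\u0c31", "l": "\u0c32", "lY": "\u0c33",
    "v": "\u0c35", "S": "\u0c36", "R": "\u0c37", "s": "\u0c38", "h": "\u0c39",
}

# WX vowel -> (independent letter, dependent vowel sign); "a" is inherent
VOWELS = {
    "a": ("\u0c05", ""), "A": ("\u0c06", "\u0c3e"),
    "i": ("\u0c07", "\u0c3f"), "I": ("\u0c08", "\u0c40"),
    "u": ("\u0c09", "\u0c41"), "U": ("\u0c0a", "\u0c42"),
    "q": ("\u0c0b", "\u0c43"),
    "eV": ("\u0c0e", "\u0c46"), "e": ("\u0c0f", "\u0c47"), "E": ("\u0c10", "\u0c48"),
    "oV": ("\u0c12", "\u0c4a"), "o": ("\u0c13", "\u0c4b"), "O": ("\u0c14", "\u0c4c"),
}


def tokenize(wx):
    """Split a WX string into symbols, preferring two-letter ones (eV, oV, lY, rY)."""
    tokens, i = [], 0
    while i < len(wx):
        two = wx[i:i + 2]
        if two in CONSONANTS or two in VOWELS:
            tokens.append(two)
            i += 2
        else:
            tokens.append(wx[i])
            i += 1
    return tokens


def wx_to_telugu(wx):
    out = []
    pending = False  # True while a consonant is still waiting for its vowel
    for tok in tokenize(wx):
        if tok in CONSONANTS:
            if pending:
                out.append(VIRAMA)  # conjunct: suppress the inherent vowel
            out.append(CONSONANTS[tok])
            pending = True
        elif tok in VOWELS:
            independent, sign = VOWELS[tok]
            out.append(sign if pending else independent)
            pending = False
        else:
            if pending:
                out.append(VIRAMA)  # bare consonant before M, H, space or punctuation
            pending = False
            out.append({"M": ANUSVARA, "H": VISARGA}.get(tok, tok))
    if pending:
        out.append(VIRAMA)  # word-final bare consonant
    return "".join(out)


print(wx_to_telugu("wirupawi"), wx_to_telugu("laddU"), wx_to_telugu("peteVMt"))
```

The converter treats a consonant followed by another consonant as a conjunct, which is how laddU yields a doubled consonant with a single long vowel sign.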

III. RELATED STUDY ON INDEXING

Indexing an item is the process of creating a searchable data structure from the received items. This transformation requires finding the important keywords that represent a complete item both in meaning and in statistical matching ratio. Indexing is similar to cataloguing books in a library. Cataloguing is used to create access points on an item collection that are expected by and most useful to the users of the information retrieval system [9]. Behind the user interface, the search engine collects all the data and builds an index to store it so that a user can access it quickly by posing a simple query. Different search engines use different methods to index the data.


This is the main reason for getting different results when searching for the same query on different search engines such as Google, Bing and Yahoo. Google ranks a page higher if more pages vote for it by linking to it.

Basically, indexing can be done in two ways: manual indexing and automatic indexing. Sometimes semi-automatic indexing is also used to represent complex documents that have more content and fewer keywords.

A. Manual Indexing

Automatic indexing follows set processes of analysing the frequencies of word patterns and comparing the results with other documents in order to assign subject categories. This requires no understanding of the material being indexed and therefore leads to more uniform indexing, but at the expense of interpreting the true meaning. A computer program does not understand the meaning of statements and may therefore fail to assign some relevant terms or assign terms incorrectly. Human indexers focus their attention on certain parts of the document, such as the title, abstract, summary and conclusions, because analysing the full text in depth is costly and time-consuming [6]. According to Jakob [8], manual indexing is called tagging, with index terms referred to as tags. There is a renaissance of manual subject indexing and analysis: structured metadata is published with techniques such as RDF, the Dewey Decimal System, MARC (MAchine-Readable Cataloging), RSS and OAI-PMH, while OpenSearch and browser search plug-ins allow specialised search engines to be aggregated [8]. The full-text searchable data structure for items in the Document File provides a new class of indexing called total document indexing [9]. Most search websites follow controlled-vocabulary tagging for indexing web pages. There has been a debate for a number of years about which method is better, human indexing or automatic indexing [8]. Manual indexing is always better at concept representation, and it remains important for expressing the meaning of a whole document with a limited number of indexing terms. However, manual indexing is a demanding task to carry out. When all the terms of an item are used to represent the complete item, what is the use of selecting terms as indexing units to represent it? If one can search on any of the words in a document, why does one need to add additional index terms? [9]. Use of a controlled vocabulary helps the indexer to limit the number of indexing terms that represent the major concepts of a document. A controlled vocabulary is a finite set of index terms from which all index terms must be selected.

Indexing with a controlled vocabulary takes more time when the repository is large. The extra processing time comes from the indexer trying to determine the appropriate index terms for concepts that are not specifically in the controlled vocabulary set. Indexing becomes slower, but the search process becomes easier and faster. Controlled vocabularies help the indexer to select proper terms that describe information needs. Uncontrolled vocabularies have the opposite effect, making indexing faster but the search process much more difficult [9]. In the case of Telugu, the vocabulary is not completely controlled and much of the available vocabulary is not in use, so building index terms for a Telugu item set is a difficult task in either the manual or the automatic process. Most textbook (గ్రాంధికము, grAMXikamu) terminology is not used in colloquial practice (వ్యవహారికము, vyavahArikamu); in this scenario, manually deciding all the domain concepts in an item badly requires alternative vocabulary resources such as dictionaries. The World Wide Web is growing drastically in size, and manual indexing is not possible on such a dynamically growing repository. In manual indexing, considering the entire item when selecting terms is a tedious job, so the manual indexer is limited to a few sections of an item such as the title, subheadings, abstract, summary and conclusion. When indexing any item, one should keep specificity and exhaustivity in mind.

i) Specificity:

The selected indexing terms must represent the concept of an item closely; that is, every term selected as an index term should relate to a topic of the item when it is matched against the query terms. This is known as the specificity of the index term with respect to the topic. Specificity describes how closely the index terms match the topics they represent [10]. An index is said to be specific if the indexer uses descriptors that parallel the concepts of the document and reflect them precisely. Specificity tends to increase with exhaustivity, since the more terms are included, the narrower those terms will be [11]. Specificity is characterised as a semantic property of index terms (i.e. a term is more or less specific as its meaning is more or less detailed and precise). It has been suggested that specificity should be interpreted statistically, as a function of term use rather than of term meaning [12]. Put simply, specificity is the right selection of keywords to represent a topic of an item set (i.e. specificity is measured by the number of documents represented by an indexed term).


ii) Exhaustivity:

The exhaustivity of a document description is the coverage of its various topics given by the terms assigned to it. The number of index terms used to represent a document is known as its exhaustivity. Exhaustivity of indexing is the extent to which the concepts of an entity are covered by the descriptors assigned to it. Exhaustivity has two components: viewpoint exhaustivity and importance exhaustivity [13]. Exhaustivity must be limited; otherwise the search process becomes time-consuming. If the concepts are not represented by indexing terms, it is difficult to locate the item in the search space. Viewpoint exhaustivity addresses the question: are the features needed to find the item in the retrieval process covered by the indexing? The degree to which this question can be answered with "yes" is viewpoint exhaustivity. Importance exhaustivity addresses the question: what is the importance threshold for the assignment of descriptors as prescribed in the indexing rules? [13]. Incorrect descriptors increase the count without contributing to exhaustivity.
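Under the statistical reading given above, both quantities reduce to simple counts over an index. The sketch below assumes a small hypothetical index in WX notation: exhaustivity is taken as the number of index terms per document, and the number of documents per term is used as the statistical indicator of specificity (a term that represents fewer documents is more specific).

```python
from collections import Counter

# Hypothetical index: document ID -> index terms assigned to it (WX notation)
index_terms = {
    "Doc-1": {"wirupawi", "laddU"},
    "Doc-2": {"wirupawi", "laddU", "peteVMt"},
    "Doc-3": {"laddU", "peteVMt", "hakkulu"},
}

# Exhaustivity: how many index terms describe each document
exhaustivity = {doc: len(terms) for doc, terms in index_terms.items()}

# Statistical specificity indicator: how many documents each term represents
doc_frequency = Counter(term for terms in index_terms.values() for term in terms)

print(exhaustivity)
print(doc_frequency.most_common())
```

In this toy index, laddU appears in every document and so discriminates least, while hakkulu represents a single document and is the most specific term.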

B. Automatic Indexing

Automatic indexing is a process whereby a computer processes a natural language text that is in machine-readable form [26]. It is done by a machine according to the rules framed in the program. Automatic indexing is claimed to be the better indexing approach because it removes the limits of time, cost, exhaustivity, specificity, vocabulary, searching and browsing, and allows the entire document to be analysed while retaining the option of being directed to particular parts of the document. The major limitation of this approach is that it cannot resolve ambiguity when selecting terms. To overcome this, one has to write complex rules and use different algorithms to disambiguate concepts when selecting index terms. Linking many tools and applying complex rules to select an index term that describes the correct concept slows the response. Research into ways to apply automatic techniques to image, sound, and other types of text is still in its infancy, compared to the half century of work on automatic indexing of language text [7]. Query expansion and document expansion techniques are used to select better indexing terms that describe the correct concept of the units [5]. Examples of automatic indexing are KWIC (Key Word In Context), KWAC (Key Word Alongside Context) and KWOC (Key Word Out of Context).

Automatic indexing is the method preferred by developers of real-time search engines.
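To make the KWIC idea concrete, the following minimal sketch lists every non-stop word of a title together with a few words of context on either side, sorted by keyword. The function name, the context width and the sample titles are assumptions chosen for illustration.

```python
def kwic(titles, stop_words=frozenset(), width=2):
    """Return sorted (keyword, left context, word as written, right context) entries."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in stop_words:
                continue  # stop words never become index keywords
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            entries.append((word.lower(), left, word, right))
    return sorted(entries)


result = kwic(
    ["Study of Indexing Techniques", "Indexing of Telugu Text"],
    stop_words={"of"},
)
for keyword, left, word, right in result:
    print(f"{left:>12} [{word}] {right}")
```

Each keyword stays inside its surrounding words, which is what distinguishes KWIC from KWOC, where the keyword is pulled out in front of the full title.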

C. Relatedness of Indexing terms

Relationships among the set of indexing terms can be established in two places: at the time of indexing, called pre-coordination, and at the time of searching, called post-coordination. The establishment of relationships between multiple indexing terms is also called linkage. Linkages are used to correlate related attributes associated with the concepts discussed in an item. Creating term linkages at index creation time is pre-coordination. Post-coordination can be implemented by connecting terms with the Boolean AND operator, which finds only the index entries that contain all the search terms [9]. During linkage, the indexer should consider the number of indexing terms to be related, the order of linkage, and any additional descriptors associated with the index terms [14]. Examples show how indexing terms are related during indexing and/or during the search process on a Telugu text item.
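Post-coordination with the Boolean AND operator amounts to intersecting the document sets of the query terms at search time. The sketch below assumes a small hypothetical postings table in WX notation; the function name is likewise an assumption.

```python
# Hypothetical postings: index term -> set of document IDs containing it
postings = {
    "wirupawi": {1, 2, 3},
    "laddU": {1, 2, 3},
    "peteVMt": {2, 3},
    "hakkulu": {3},
}


def boolean_and(postings, query_terms):
    """Return the documents that contain every query term."""
    doc_sets = [postings.get(term, set()) for term in query_terms]
    return set.intersection(*doc_sets) if doc_sets else set()


print(boolean_and(postings, ["wirupawi", "peteVMt"]))
print(boolean_and(postings, ["wirupawi", "peteVMt", "hakkulu"]))
```

Adding a term to the query can only shrink the result set, and a term missing from the index makes the result empty.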

D. Stop list

Stop words are common words that carry no information on their own and occur frequently in most items of the corpus. These stop words need to be filtered out before or after processing the text. Unlike English, Telugu has no predefined stop list. Even in English, if the search is a phrase search, no function word can be treated as a stop word. In Telugu, stop words depend on the context of the search. During indexing, removing stop words is important for limiting the index size and speeding up matching. Hence deciding the stop-word list is a key task in the indexing phase. For this study we prepared a stop list based on the available text corpus, and those words were eliminated before indexing the items.
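One way to shortlist stop-word candidates from a corpus is to look for terms that occur in a large fraction of the documents. The sketch below illustrates that heuristic only; it is not the procedure used in this study, where the stop list was prepared by hand, and the threshold and toy documents are assumptions. Candidates produced this way would still need manual review.

```python
from collections import Counter


def candidate_stop_words(docs, min_fraction=0.6):
    """Terms that occur in at least min_fraction of the documents."""
    doc_frequency = Counter()
    for tokens in docs:
        doc_frequency.update(set(tokens))  # count each term once per document
    return {t for t, n in doc_frequency.items() if n / len(docs) >= min_fraction}


def remove_stop_words(tokens, stop_words):
    return [t for t in tokens if t not in stop_words]


# Hypothetical tokenized documents in WX notation
docs = [
    ["wirupawi", "mariyu", "wirumala"],
    ["I", "laddU", "mariyu", "peteVMt"],
    ["I", "hakkulu", "mariyu", "xevasWAnaM"],
]
stop_words = candidate_stop_words(docs)
print(stop_words)
print(remove_stop_words(docs[1], stop_words))
```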

E. Web Crawler and Text Corpus

Before discussing the data structures used to store the indexing terms of an item, it is necessary to know the flow of steps by which documents are collected and cleaned through a standardization process. Figure I gives an overview of how indexing is carried out in the Information Retrieval process.


i) Crawlers :

A crawler is a computer program that starts by visiting a seed URL and iterates by selecting new URLs from which to collect web pages. Once the crawler fetches a web page, the server begins the indexing process and stores the indexed terms in the specified data structures.

Figure I. Crawling of Web pages based on URLs to create a

standardized corpus for search engine.

Web crawlers visit (crawl) web pages and download them, starting from one page and determining which page to visit next. Crawling efficiency depends on various policies: selection of pages by considering page rank, revisiting by checking freshness and age, politeness so as not to burden web servers, and download rate. Figure II shows how Nutch, an open source web crawler based on Lucene, works [17].

Figure II. Open Source web crawler: Nutch [17]

We did not use a web crawler to collect the Telugu corpus; instead we downloaded it from the Wikipedia corpus collection [18].

F. Indexing Data Structures

In indexing, the collected items must be stored in a normalized form. These data structures store the terms and their associated information to support the search process.

i)Stemming :

A series of rules is applied to remove a part of the word (a suffix, prefix or other affix) to generate the root word. Received items are preprocessed by stemming. Suffix removal is the most commonly suggested method of stemming; the algorithm proposed by Porter [20] is known as the Porter stemmer. In stemming, suffixes, prefixes or infixes are blindly removed or replaced with rule-based substrings. Stemming reduces the diverse representations of a concept (its morphological variants or word forms) to a canonical morphological representation. Stemming improves recall [19], but it may sometimes cause loss of information because of the blind stripping of the suffix or prefix of a term. It is highly difficult to frame suffix-stripping rules for Telugu. Suffix stripping is not recommended for highly complex languages with a large number of variant word forms, such as Telugu.

ii) N-gram Model:

A variant of the searchable data structure is the N-gram structure, which breaks processing tokens into smaller string units [21]. The N-gram model is an alternative technique that generates all possible strings of length 1 to n to produce conflated forms of indexing terms. N-grams are fixed-length consecutive series of "n" characters used to determine the stem of a word that represents the concept of the word, but n-grams ignore semantics [9]. Trigrams, with a length of three characters, were found to be optimal for English [22]. In English the letters themselves serve as the units, so the N-gram technique is easy to apply, whereas in Telugu a syllable is represented by a combination of one or more letters. A syllable is a unit of pronunciation having one vowel sound, with or without surrounding consonants, forming the whole or a part of a word. The syllable N-gram model gave good performance [21]; the maximum word length n in syllables is considered, and all syllable sequences down to the minimum length 1 are generated.

[Figure I diagram labels: World Wide Web, Crawling, Fetching, Server, Indexing, Indexed File Store, Searching]


Example 1: The word తిరుపతి (wirupawi in WX notation) under the syllable n-gram model with n = 4.

TABLE II

EXAMPLE 1 WITH N-GRAM AND SYLLABLE N-GRAM MODELS

N-gram Model, n = 1, 2, 3, 4, 5, 6, 7, 8:
w, wi, wir, wiru, wirup, wirupa, wirupaw, wirupawi

Syllable N-gram Model, n = 1, 2, 3, 4:
తి-wi, తిరు-wiru, తిరుప-wirupa, తిరుపతి-wirupawi
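The two models in Example 1 can be reproduced with a short sketch that works directly on WX strings. The syllable splitter below is a simplifying assumption (any run of non-vowel symbols followed by one vowel and an optional M or H), not the linguistic rule set of [23], and all names are hypothetical.

```python
import re

WX_VOWEL = r"(?:eV|oV|[aAiIuUqeEoO])"
# A syllable: optional consonant cluster + one vowel + optional M/H,
# or a leftover consonant cluster at the end of the word.
SYLLABLE = re.compile(rf"[^aAiIuUqeEoO]*{WX_VOWEL}[MH]?|[^aAiIuUqeEoO]+$")


def syllables(wx_word):
    return SYLLABLE.findall(wx_word)


def char_ngrams(wx_word):
    """Character prefixes of length 1..len(word), as in the N-gram model row."""
    return [wx_word[:n] for n in range(1, len(wx_word) + 1)]


def syllable_ngrams(wx_word):
    """Syllable prefixes of length 1..number of syllables."""
    parts = syllables(wx_word)
    return ["".join(parts[:n]) for n in range(1, len(parts) + 1)]


print(char_ngrams("wirupawi"))
print(syllable_ngrams("wirupawi"))
```

The character model produces eight candidates, several of which (such as wir or wirup) end in the middle of a syllable, whereas the syllable model produces only the four well-formed candidates.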

iii) Tries (Re"trie"ve), Simple Trie:

The name trie is taken from "retrieve"; a trie tries to match syllables starting from an internal node. A trie represents all the words in a single data structure. The time cost of this data structure is optimal when compared to other data structures. It has the advantage that the search cost for a word depends on the number of syllables in the word. The difficulty of the n-gram model in the context of Telugu can be solved by identifying the syllables of the word using the linguistic rules of the language [23].

Example 2: The word wirupawi (తిరుపతి) can be divided into syllables according to the language rules [23]. The syllables of "wirupawi" (తిరుపతి) are wi-ru-pa-wi (తి-రు-ప-తి). This pre-conversion saves search time in the trie data structure.
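A syllable trie of this kind can be sketched as follows. The class and method names are assumptions; words are supplied already split into syllables, as in Example 2, so the lookup cost depends on the number of syllables in the word rather than on the number of words stored.

```python
class SyllableTrie:
    """Trie whose edges are syllables instead of single characters."""

    def __init__(self):
        self.children = {}    # syllable -> child node
        self.is_word = False  # True if a stored word ends at this node

    def insert(self, syllables):
        node = self
        for syllable in syllables:
            node = node.children.setdefault(syllable, SyllableTrie())
        node.is_word = True

    def contains(self, syllables):
        node = self
        for syllable in syllables:
            node = node.children.get(syllable)
            if node is None:
                return False
        return node.is_word


trie = SyllableTrie()
trie.insert(["wi", "ru", "pa", "wi"])  # wirupawi
trie.insert(["wi", "ru", "ma", "la"])  # wirumala (hypothetical second entry)

print(trie.contains(["wi", "ru", "pa", "wi"]))  # stored word
print(trie.contains(["wi", "ru"]))              # shared prefix only, not a word
```

Both words share the path wi-ru, so the common prefix is stored once and the lookup for wirupawi follows four edges.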

iv) Successor Stemmer :

The successor stemmer finds word and morpheme boundaries based on the distribution of phonemes that distinguishes one word from another. The process determines the successor variety for a word, uses this information to divide the word into segments, and selects one of the segments as the stem. The successor variety of a segment of a word in a set of words is the number of distinct syllables that occupy the position immediately after the segment (segment length plus one syllable).

Example 3: తిరుపతి, తిరుమల, తిరువీధి, తిరుపతిరావు. To verify the word తిరుపతిరావు among this 4-word collection, the successor stemmer works as shown in Table III:

TABLE III

SUCCESSOR STEMMER FOR EXAMPLE 3

Prefix          Successor Variety   Syllable
తి              1                   రు
తిరు            3                   ప, మ, వీ
తిరుప           1                   తి
తిరుపతి         1                   రా
తిరుపతిరా       1                   వు
తిరుపతిరావు     1                   Blank space

The successor variety of any prefix of a word is the number of children associated with the node in the symbol tree representing that prefix [24]. In less formal terms, the successor variety of a string is the number of different characters that follow it in words in some body of text [25].
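The successor varieties of Example 3 can be computed at syllable level with the sketch below. The WX transliterations and syllable splits of the four words are assumptions made for illustration, and the end of a word is not counted as a successor here, so a complete word has no entry of its own.

```python
def successor_variety(words):
    """Map each syllable prefix to the set of syllables that follow it."""
    successors = {}
    for syllables in words:
        for i in range(1, len(syllables)):
            prefix = tuple(syllables[:i])
            successors.setdefault(prefix, set()).add(syllables[i])
    return successors


# The four words of Example 3, pre-split into syllables (assumed WX forms)
words = [
    ["wi", "ru", "pa", "wi"],              # wirupawi
    ["wi", "ru", "ma", "la"],              # wirumala
    ["wi", "ru", "vI", "Xi"],              # wiruvIXi
    ["wi", "ru", "pa", "wi", "rA", "vu"],  # wirupawirAvu
]
table = successor_variety(words)

for prefix in [("wi",), ("wi", "ru"), ("wi", "ru", "pa"), ("wi", "ru", "pa", "wi")]:
    print("".join(prefix), len(table[prefix]), sorted(table[prefix]))
```

The variety jumps from 1 to 3 after the prefix wiru, which marks it as a likely segment boundary and stem candidate for this word set.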

v) Inverted File(IF) Structure :

In general, the importance of a word is decided by the number of occurrences of the word in an item, called the term frequency (tf). For each term, the list of items in which the term occurs is maintained in a list called the inverted list (or inversion list). The inversion list for a particular word is located by a dictionary lookup. The inverted file structure maintains the list of items, the inverted lists and the dictionary. The dictionary stores all the unique words of the entire corpus. Most database and Information Retrieval applications use an inverted file structure to represent the indexing of items. Unfortunately, it is not possible to achieve an efficient adaptation of an inverted file to deal with the matching of more elaborate document and query descriptions such as weighted keywords [9]. The inversion list maintains the document ID and term frequency, along with the term positions in the document, to support proximity search and contiguous word phrase search. The inverted file structure (IFS) is an efficient data structure for storing large collections of data items. For a Telugu text corpus, IFS is one of the best approaches for indexing.


Steps to prepare IFS:
1. Tokenize the item.
2. Eliminate the stop words.
3. Apply the Syllable N-gram Model to identify the stems, or use a morphological analyzer to get the root words.
4. Maintain the Doc-ID and the indexing terms of each item.
5. Prepare the Dictionary with all unique words and their total occurrence across all the Documents.
6. Create the Inverted list with <Doc-ID, term-position, term-frequency in the same document> for every document in which each word occurs.
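The six steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the input is already in WX notation, tokenization is plain whitespace splitting, `stem` is a hypothetical placeholder for the syllable n-gram stemmer or morphological analyzer, and the two toy documents and the stop-word set are invented for the demonstration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

# One posting: (doc_id, term frequency in that doc, positions in that doc)
Posting = Tuple[str, int, List[int]]


def stem(token: str) -> str:
    """Hypothetical stand-in for step 3 (stemmer or morphological analyzer)."""
    return token


def build_index(docs: Dict[str, str], stop_words: Set[str]):
    doc_terms: Dict[str, List[str]] = {}   # step 4: Doc-ID -> indexing terms
    dictionary: Counter = Counter()        # step 5: term -> total frequency
    positions = defaultdict(dict)          # term -> {doc_id: [positions]}
    for doc_id, text in docs.items():
        # Step 1: tokenize on whitespace and trim simple punctuation.
        tokens = [t.strip(".,?()") for t in text.split()]
        tokens = [t for t in tokens if t]
        terms = []
        for pos, token in enumerate(tokens):
            if token in stop_words:        # step 2: drop stop words
                continue
            term = stem(token)             # step 3: reduce to stem/root
            terms.append(term)
            positions[term].setdefault(doc_id, []).append(pos)
        doc_terms[doc_id] = terms
        dictionary.update(terms)
    # Step 6: inverted list of <Doc-ID, term-frequency, term-positions>.
    inverted: Dict[str, List[Posting]] = {
        term: [(d, len(p), p) for d, p in by_doc.items()]
        for term, by_doc in positions.items()
    }
    return doc_terms, dictionary, inverted


# Toy input, invented for illustration only.
docs = {
    "Doc-1": "wirupawi laddU guriMci.",
    "Doc-2": "wirupawi laddU peteVMt laddU",
}
stop_words = {"guriMci"}
doc_terms, dictionary, inverted = build_index(docs, stop_words)
print(dictionary["laddU"], inverted["laddU"])
```

Positions are kept as token offsets in the original item (before stop-word removal), so that proximity and phrase queries can still measure real distances between terms.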

Example 4: Document set

Doc-1: తిరుపతి లడ్డూ గురించి తెలియని వారుంటారా?

Doc-2: ప్రపంచవ్యాప్తంగా ఎంతో పేరు ప్రఖ్యాతులు పొందిన తిరుపతి లడ్డూకి త్వరలో పేటెంట్ వచ్చే అవకాశం ఉంది.

Doc-3: దేశ, విదేశాలకు చెందిన భక్తులు భక్తి శ్రద్ధలతో ఎంతో ఇష్టంగా స్వీకరించే ఈ లడ్డూకి భౌగోళిక ప్రత్యేకత కింద పేటెంట్ హక్కులు పొందాలని తిరుమల తిరుపతి దేవస్థానం (టిటిడి) నిర్ణయించింది.

Before preprocessing, the documents are converted into WX notation and tokenized into processable units. Stop words are manually identified and separated from the term list of each item.

Doc-1: wirupawi laddU guriMci weVliyani vAruMtArA?

Doc-2: prapaMcavyApwaMgA eVMwo peru praKyAwulu poVMxina wirupawi laddUki wvaralo peteVMt vacce avakASaM uMxi.

Doc-3: xeSa, vixeSAlaku ceVMxina Bakwulu Bakwi SraxXalawo eVMwo iRtaMgA svIkariMce I laddUki BOgolYika prawyekawa kiMxa peteVMt hakkulu poVMxAlani wirumala wirupawi xevasWAnaM (titidi) nirNayiMciMxi.

Doc-1: wirupawi, laddU

Doc-2: prapaMcavyApwaM, eVMwo, peru, praKyAwulu, poVMxina, wirupawi, laddU, wvaralo, peteVMt, vacce, avakASaM

Doc-3: xeSamu, vixeSamu, Bakwulu, Bakwi, SraxXa, eVMwo, iRtaM, svIkariMcu, I laddU, BOgolYikaM, prawyekaM, peteVMt, hakkulu, poVMxuta, wirumala, wirupawi, xevasWAnaM, titidi, nirNayiMciMxi

After converting from WX to UTF:

Doc-1: తిరుపతి, లడ్డూ

Doc-2: ప్రపంచవ్యాప్తం, ఎంతో, పేరు, ప్రఖ్యాతులు, పొందుట, తిరుపతి, లడ్డూ, పేటెంట్, అవకాశం

Doc-3: దేశము, విదేశము, భక్తులు, భక్తి, శ్రద్ధ, ఎంతో, ఇష్టం, స్వీకరించు, లడ్డూ, భౌగోళికం, ప్రత్యేకం, పేటెంట్, హక్కులు, పొందుట, తిరుమల, తిరుపతి, దేవస్థానం, టిటిడి, నిర్ణయించింది

Stop words: గురించి, తెలియని, వారుంటారా, త్వరలో, ఉంది, చెందిన, ఈ, కింద, వచ్చే

Dictionary creation in sorted order: F is the frequency of a word across all the Documents, called the Total Frequency.

TABLE IV
WORD DICTIONARY WITH FREQUENCY OF TERMS FOR EXAMPLE 4

Unique Word | F | Unique Word | F
avakASaM - అవకాశం | 1 | poVMxuta - పొందుట | 2
iRtaM - ఇష్టం | 1 | prawyekaM - ప్రత్యేకం | 1
eVMwo - ఎంతో | 2 | praKyAwulu - ప్రఖ్యాతులు | 1
titidi - టిటిడి | 1 | prapaMcavyApwaM - ప్రపంచవ్యాప్తం | 1
wirupawi - తిరుపతి | 3 | Bakwi - భక్తి | 1
wirumala - తిరుమల | 1 | Bakwulu - భక్తులు | 1
xevasWAnaM - దేవస్థానం | 1 | laddU - లడ్డూ | 3
xeSamu - దేశము | 1 | svIkariMcu - స్వీకరించు | 1
nirNayiMciMxi - నిర్ణయించింది | 1 | hakkulu - హక్కులు | 1
peteVMt - పేటెంట్ | 2 | SraxXa - శ్రద్ధ | 1
peru - పేరు | 1 | |


Indexing terms are manually identified by a language expert to represent the concept of the whole document as well as of each sentence.

TABLE V
INVERTED LIST FOR EXAMPLE 4

Term | Documents
wirupawi - తిరుపతి | Doc-1, Doc-2, Doc-3
laddU - లడ్డూ | Doc-1, Doc-2, Doc-3
peteVMt - పేటెంట్ | Doc-2, Doc-3
poVMxuta - పొందుట | Doc-2, Doc-3
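As a cross-check, the total frequencies and the document lists of Example 4 can be recomputed from the per-document index terms. The sketch below is illustrative only: it uses the WX forms of the terms as they appear after conversion (with poVMxuta as the normalized form in both documents), and `boolean_and` is a hypothetical helper showing how a conjunctive query would use the inverted list.

```python
from collections import Counter
from typing import Dict, List

# Index terms per document, taken from Example 4 (WX notation).
doc_terms: Dict[str, List[str]] = {
    "Doc-1": ["wirupawi", "laddU"],
    "Doc-2": ["prapaMcavyApwaM", "eVMwo", "peru", "praKyAwulu", "poVMxuta",
              "wirupawi", "laddU", "peteVMt", "avakASaM"],
    "Doc-3": ["xeSamu", "vixeSamu", "Bakwulu", "Bakwi", "SraxXa", "eVMwo",
              "iRtaM", "svIkariMcu", "laddU", "BOgolYikaM", "prawyekaM",
              "peteVMt", "hakkulu", "poVMxuta", "wirumala", "wirupawi",
              "xevasWAnaM", "titidi", "nirNayiMciMxi"],
}

# Dictionary: total frequency F of each unique term over all documents.
dictionary = Counter(t for terms in doc_terms.values() for t in terms)

# Inverted list: term -> documents containing it, in document order.
inverted: Dict[str, List[str]] = {}
for doc_id, terms in doc_terms.items():
    for term in dict.fromkeys(terms):      # each term once per document
        inverted.setdefault(term, []).append(doc_id)


def boolean_and(*terms: str) -> List[str]:
    """Documents that contain every query term (hypothetical helper)."""
    sets = [set(inverted.get(t, [])) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []


print(dictionary["wirupawi"], inverted["peteVMt"])
print(boolean_and("wirupawi", "peteVMt"))
```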

G. Index Term Weighting

The effectiveness of indexing depends on Specificity and Exhaustivity [27]. It has been recognized that a high level of exhaustivity of indexing leads to high recall and low precision. Conversely, a low level of exhaustivity leads to low recall and high precision [28].

i) Precision:
Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved. It is used to find the rates of success and error among the retrieved documents.

ii) Recall:
Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents.

iii) Fallout:
Sometimes a negative ratio of relevance is useful to judge the inefficiency of the retrieval. Fallout is the ratio of the number of non-relevant documents retrieved to the total number of non-relevant documents.

iv) F-Measure:
The F-Measure, also called the F-Score, measures the accuracy of the retrieval outcome by combining precision and recall. The parameter β ≥ 0 sets the weight of recall relative to precision. When β = 1, the measure is the harmonic mean of precision and recall, known as the traditional F-score or balanced F-Measure [29].
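The four measures can be computed directly from sets of document identifiers. The following is a minimal sketch; the function names and the ten-document toy collection are assumptions made for illustration, and the F-score uses the standard weighted form (1 + β²)·P·R / (β²·P + R).

```python
from typing import Set


def precision(retrieved: Set[str], relevant: Set[str]) -> float:
    """Relevant documents retrieved / all documents retrieved."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved: Set[str], relevant: Set[str]) -> float:
    """Relevant documents retrieved / all relevant documents."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def fallout(retrieved: Set[str], relevant: Set[str],
            collection: Set[str]) -> float:
    """Non-relevant documents retrieved / all non-relevant documents."""
    non_relevant = collection - relevant
    if not non_relevant:
        return 0.0
    return len(retrieved & non_relevant) / len(non_relevant)


def f_score(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted F-score; beta = 1 gives the harmonic mean of p and r."""
    denominator = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denominator if denominator else 0.0


# Toy collection of ten documents, invented for illustration.
collection = {f"d{i}" for i in range(1, 11)}
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d5"}

p = precision(retrieved, relevant)           # 2 of 3 retrieved are relevant
r = recall(retrieved, relevant)              # 2 of 4 relevant were retrieved
fo = fallout(retrieved, relevant, collection)  # 1 of 6 non-relevant retrieved
print(p, r, fo, f_score(p, r))
```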

H. Latent Semantic Indexing

In statistical indexing, the indexing terms are selected based on term frequency and co-occurrence in an item. The Vector Space Model represents terms as rows and items as columns of a vector space [32]. Different measures are used to find the similarity between the indexed terms of an item and the query terms. In cosine similarity, a large angle (small cosine) indicates dissimilar items and a small angle (large cosine) indicates similar items.

When concepts are used in the vector space model, they have an inverse effect on recall and precision. The basis for concept indexing is that there are many ways to express the same idea, and increased retrieval performance comes from using a single representation [9]. Concepts can be extracted from language resources such as Dictionaries, Thesauri, WordNet and Ontologies. Synonyms (పర్యాయపదాలు - paryAyapaxAlu) of a term give many ways to refer to the same indexing term (i.e., different terms with the same meaning). For example, mother (అమ్మ - amma) can be expressed in Telugu in several ways without losing meaning: జనని, తల్లి, మాత. Instead of a single term, any one of its synonyms can be used to represent the same concept. This will increase the precision of the search. Similarly, when a term has more than one meaning, it misleads the concept representation; it will increase recall but has an adverse effect on precision. For example, "Apple" may be a 'Computer' or a 'Fruit'.
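The cosine measure mentioned above can be illustrated with raw term-frequency vectors. This is a minimal sketch, assuming unweighted term counts (no tf-idf) and WX-notation terms chosen only for the demonstration; it shows the term-matching baseline, not Latent Semantic Indexing itself.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


query = Counter(["wirupawi", "laddU"])
doc_same = Counter(["wirupawi", "laddU"])                # identical terms
doc_partial = Counter(["wirupawi", "peteVMt", "hakkulu"])  # one shared term
doc_other = Counter(["Bakwi", "SraxXa"])                 # no shared terms

print(cosine_similarity(query, doc_same))      # close to 1: small angle
print(cosine_similarity(query, doc_partial))   # between 0 and 1
print(cosine_similarity(query, doc_other))     # 0: orthogonal vectors
```

Because the comparison is purely term-based, a document that uses only a synonym of a query term scores zero, which is the limitation that concept indexing is meant to address.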

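The measures defined in Section G are computed as:

\[ \text{Precision} = \frac{\#\ \text{relevant items retrieved}}{\#\ \text{total items retrieved}} \]

\[ \text{Recall} = \frac{\#\ \text{relevant items retrieved}}{\#\ \text{total relevant items}} \]

\[ \text{Fallout} = \frac{\#\ \text{non-relevant items retrieved}}{\#\ \text{total non-relevant items}} \]

\[ \text{F-Score} = \frac{(1+\beta^{2}) \cdot (\text{Precision} \cdot \text{Recall})}{\beta^{2} \cdot \text{Precision} + \text{Recall}} \]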


Statistical indexing is not adequate to handle this kind of problem. An alternative method that addresses these two problems (synonymy and polysemy) using the Vector Space Model is Latent Semantic Indexing (LSI) [31]. Document expansion and query expansion techniques help to extract the concepts of an item and a query [30]. Concept indexing determines a canonical set of concepts based upon a test set of terms and uses them as a basis for indexing all items [9]. The determined set of concepts does not have a label associated with each concept (i.e., a word or set of words that can be used to describe it), but is a mathematical representation as a vector. LSI works on the principle that words that are used in the same contexts tend to have similar meanings.

IV. CONCLUSION

In this paper we studied the effect of indexing on the Information Retrieval process for Telugu documents. We observed that manual indexing works quite well in terms of accuracy, but it is a time-consuming process, and it is very difficult for a few indexers to cover all subject areas. In most cases a human indexer restricts attention to a particular part of the item to extract the concept, which may have an adverse effect. As the corpus grows, manual indexing is not advisable. Automatic indexing has many advantages when compared to manual indexing. In automatic indexing, statistical indexing using the Boolean approach and the Vector Space Model are blind methods that find the descriptors of items through the frequency count of terms in the corresponding items. The concept of an item can be best represented by a few indexing terms with related meanings using Latent Semantic Indexing techniques. Use of Language Processing tools such as a Morphological Analyzer or POS Tagger to find the base terms may simplify the indexing process. While indexing, extracting the concepts of items using Language Resources such as Synsets, WordNet or an Ontology will improve the indexing accuracy in terms of precision and recall. Excessive use of resources such as WordNet may lead to poor results.

REFERENCES

[1]. W. Bruce Croft, Mirna Adriani, 1997. Retrieval Effectiveness of Various Indexing Techniques on Indonesian News Articles, 1-7.

[2]. Mirna Adriani, W. Bruce Croft, 1997. Retrieval Effectiveness of Various Indexing Techniques on Indonesian News Articles.

[3]. Salton, Gerard, 1986. Another Look at Automatic Text-Retrieval Systems. Communications of the ACM 29(7), 648-656.

[4]. Sparck Jones, Karen, 1974. Automatic Indexing. Journal of Documentation 30(4), 393-432.

[5]. Ramakrishna Kolikipogu, Padmaja Rani B, 2011. WordNet based Term selection for Pseudo Relevance Feedback Query Expansion Model, ICCMS, IEEE, Vol 2.

[6]. F. W. Lancaster, 2003. Indexing and abstracting in theory and practice. Third edition. London, Facet. ISBN 1-85604-482-3. pp 24.

[7]. James D. Anderson, José Pérez-Carballo, 2001. The nature of indexing: how humans and machines analyze messages and texts for retrieval. Part II: Machine indexing, and the allocation of human versus machine effort, Information Processing and Management 37: 255-277.

[8]. Ginger Shields, 2005. What are the main differences between human indexing and automatic indexing?, LI-842 Automatic Indexing Assignment, pp: 1-4.

[9]. Kowalski and Maybury, 2001. Information Storage and Retrieval Systems: Theory and Implementation.

[10]. J.D. Anderson, 1997. Guidelines for indexes and related information retrieval devices. Bethesda, Maryland, NISO Press. 10 December 2008.

[11]. D.B. Cleveland and A.D. Cleveland, 2001. Introduction to indexing and abstracting. 3rd Ed. Englewood, Libraries Unlimited, Inc. Page 106.

[12]. Karen Spärck Jones, 2004. A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation, Volume 60, Number 5, pp. 493-502.

[13]. Dagobert Soergel, 1994. Indexing and retrieval performance: The logical evidence, Journal of the American Society for Information Science, 1-22.

[14]. Vickery, B. C., 1970. "Techniques of Information Retrieval", Archon Books, Hamden, Conn.

[15]. Unicode Standard Version 6.2, 2012. http://www.unicode.org

[16]. Akshar Bharati, Vineet Chaitanya, and Rajeev Sangal, 1995. Natural Language Processing: A Paninian Perspective. PHI, 2010.

[17]. http://lucene.apache.org/nutch/

[18]. http://lucene.apache.org/nutch/

[19]. Ricardo Baeza-Yates, 2011. Modern Information Retrieval, Pearson 5th Edition, pp: 168-169.

[20]. M.F. Porter, 1980. An algorithm for suffix stripping, pp 130-137.


[21]. Dr. B. Padmaja Rani and Dr. A. Vinay Babu, 2010. Novel Implementation of Search Engine for Telugu Documents with Syllable N-Gram Model, International Journal of Engineering Science and Technology, Vol. 2(8), 3712-3720.

[22]. Yochum and Yochum, J., 1985. A High-Speed Text Scanning Algorithm Utilizing Least Frequent Trigraphs, IEEE Proceedings New Directions in Computing Symposium, Trondheim, Norway, pages 114-121.

[23]. K.V.N. Sunitha and N. Kalyani, 2012. Isolated Word Recognition using Morph-Knowledge for Telugu Language, International Journal of Computer Applications, Vol. 38, No. 12, pp: 47-54.

[24]. M. Hafer and S. Weiss, 1974. Word Segmentation by Letter Successor Varieties, Information Storage and Retrieval, 10, 371-385.

[25]. Deepika Sharma, 2012. Stemming Algorithms: A Comparative Study and their Analysis, International Journal of Applied Information Systems, Foundation of Computer Science FCS, Volume 4, No. 3, pp: 7-12.

[26]. Martin Tulic, 2005. Automatic Indexing (http://anindexer.com/about/auto/autoindex.html).

[27]. Keen, E.M. and Digger, J.A., 1972. Report of an Information Science Index Languages Test, Aberystwyth College of Librarianship, Wales.

[28]. Lancaster, F.W., 1968. Information Retrieval Systems: Characteristics, Testing and Evaluation, Wiley, New York.

[29]. Beitzel, Steven M., 2006. On Understanding and Classifying Web Queries (Ph.D. thesis).

[30]. Ramakrishna Kolikipogu, Padmaja Rani B, 2012. Reformulation of Web Query terms using Semantic Relationships, ICACCI-2012, ACM Proceedings.

[31]. Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S., 1988. Using latent semantic analysis to improve information retrieval. In Proceedings of CHI'88: Conference on Human Factors in Computing, New York: ACM, 281-285.

[32]. Christopher D. Manning, 2006. An Introduction to Information Retrieval, Preliminary Draft, Cambridge UP.

AUTHOR 1:

Mr. Kolikipogu Ramakrishna holds a B.Tech in Computer Science and Information Technology from JNTU Hyderabad and an M.Tech in Computer Science and Engineering with specialization in Software Engineering, and is pursuing a Ph.D in Computer Science and Engineering from JNTU Hyderabad. He has published 20 research papers in various International/National Conferences and Journals. He acts as a reviewer for several journals including IJECCE, IJEIT, IJCL, IJCCT and IJCSI. He is a member of various professional bodies including ACM-SIGIR, ISTE, IACSIT, IAENG and ACM-CSTA. At present he is working as Associate Professor and Head, Department of Information Technology, Sridevi Women's Engineering College, Hyderabad.

AUTHOR 2:

Dr. B. Padmaja Rani holds a B.E in Electronics and Communication Engineering from Osmania University, Hyderabad, and an M.Tech in Computer Science from JNTU Hyderabad, and received a Doctoral Degree (Ph.D) in Computer Science from JNTU Hyderabad. At present she is working as Professor & Head, Department of Computer Science and Engineering, JNTUH College of Engineering, JNTUH University, Hyderabad. She is guiding several Ph.D scholars in the areas of Information Retrieval, Natural Language Processing and Information Security. Her research interests include Information Retrieval, Natural Language Processing, Information Security, Data Mining and Embedded Systems. She has published more than 40 research papers in various International/National Conferences and Journals. She is a member of various professional bodies including CSI, IEEE and ISTE.