[ijet-v1i6p17] authors : mrs.r.kalpana, mrs.p.padmapriya

International Journal of Engineering and Techniques - Volume 1 Issue 6, Nov – Dec 2015

ISSN: 2395-1303 http://www.ijetjournal.org Page 82

A Capable Text Data Mining Using in Artificial Neural Network

Mrs.R.Kalpana, Mrs.P.Padmapriya 1,2

(HEAD, Computer Science Department, Annai Vailankanni Arts and Science College, Thanjavur-7.)

I. INTRODUCTION

Artificial neural networks (ANNs) are processing devices, implemented as algorithms or in hardware, that are loosely modeled on the neuronal structure of the mammalian brain but on a much smaller scale. A large ANN might have a large number of processor units, whereas a mammalian brain has a vastly larger number of neurons, with a corresponding increase in their overall interaction and emergent behavior. In neural networks that address classification problems, the training set, the testing set, and the learning rate are key considerations: respectively, the collection of input/output patterns used to train the network, the collection used to assess network performance, and the rate at which the weights are adjusted. This paper describes a proposed back-propagation neural network classifier that performs cross-validation on the original neural network in order to optimize classification accuracy and reduce training time. The algorithm is independent of specific data sets, so many of its ideas and solutions can be transferred to other classifier paradigms. We propose applying this artificial neural network to text data mining.
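As a rough illustration of the cross-validation idea mentioned above, the following sketch splits sample indices into k folds, each used once for testing while the rest are used for training. The function name `k_fold_splits` and the choice of contiguous, unshuffled folds are assumptions made here for brevity, not details taken from the proposed classifier.

```python
def k_fold_splits(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs for k-fold cross-validation.

    Assumption: folds are contiguous and unshuffled; real experiments
    usually shuffle (and often stratify) the samples first.
    """
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits


for train, test in k_fold_splits(5, 2):
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold, so averaging the per-fold test error gives an estimate that does not depend on a single train/test split.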

Clustering, or cluster analysis, is a data mining concept: an unsupervised technique that tries to identify the intrinsic groupings of a set of text documents, so that the resulting clusters exhibit high intra-cluster similarity and low inter-cluster similarity [1]. Text clustering methods commonly attempt to separate documents into groups, where each group represents a theme that differs from the themes represented by the other groups.

Most current text clustering methods are based on the Vector Space Model (VSM), a widely used data representation for text classification and clustering. Methods used for text mining include decision trees [2], conceptual clustering [3], statistical analysis [4], and clustering based on data summarization [5].
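To make the Vector Space Model concrete, the sketch below represents each document as a sparse term-count vector and compares two documents by the cosine of the angle between their vectors. The helper names `to_vector` and `cosine_similarity`, the whitespace tokenizer, and the sample documents are illustrative assumptions; the paper does not specify a weighting scheme.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a document as a bag-of-words term-count vector (whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = to_vector("neural network text mining")
doc2 = to_vector("text mining with clustering")
doc3 = to_vector("stock market forecast")

print(cosine_similarity(doc1, doc2))  # two of four terms shared: 0.5
print(cosine_similarity(doc1, doc3))  # no shared terms: 0.0
```

A clustering method built on this representation would group documents whose pairwise similarity is high and separate those whose similarity is low.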

Usually, in text data mining techniques, the frequency of a term (a word or a phrase) is computed to assess the importance of that term in the document. However, two terms can have the same frequency in their documents while one of them contributes more to the meaning of its sentences than the other.
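The limitation described above can be seen in a small term-frequency computation. This is a minimal sketch assuming whitespace tokenization and relative frequency (count divided by document length); the function name `term_frequencies` and the sample sentence are hypothetical.

```python
from collections import Counter


def term_frequencies(document):
    """Relative frequency of each whitespace-separated token in a document."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


tf = term_frequencies("the network learns the weights")
print(tf["the"])                       # 0.4
print(tf["network"] == tf["weights"])  # True: equal frequency, whatever their semantic role
```

Frequency alone ranks "the" highest and cannot distinguish "network" from "weights", which is the gap that concept-based analysis is meant to address.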

Abstract:

Text mining attempts to discover new, previously unknown, or hidden information by automatically extracting it from various written resources. Applying knowledge discovery methods to unstructured text is known as Knowledge Discovery in Text, text data mining, or simply text mining. Most techniques used in text mining are founded on the statistical analysis of a term, either a word or a phrase. Earlier work has used a variety of text mining algorithms. For example, the Single-Link Algorithm and the Self-Organizing Map (SOM) offer a projection-based approach to visualizing high-dimensional data and are very useful tools for processing textual data. Genetic and sequential algorithms provide multiscale representations of datasets and are fast to compute, requiring less CPU time on reduced subsets of the Isolet data in unsupervised feature selection. We propose a Vector Space Model and concept-based analysis algorithm that improves text clustering quality, so that a better text clustering result may be achieved. We believe the proposed algorithm behaves well in terms of robustness and stability with respect to the formation of the neural network.

Keywords — Concept analysis, document clustering, k-Nearest Neighbor (k-NN), data visualization,

Self-Organizing Map (SOM).

RESEARCH ARTICLE OPEN ACCESS



II. Concept-based mining model

The proposed concept-based mining model consists of sentence-based concept analysis, document-based concept analysis, corpus-based concept analysis, and a concept-based similarity measure. A raw text document is the input to the proposed model. Each document has well-defined sentence boundaries. Each sentence in the document is labeled automatically by a parser. After the semantic role labeler has been run, each sentence in the document may have one or more labeled verb-argument structures. In this model, both the verb and the argument are considered terms. One term can be an argument to more than one verb in the same sentence, which means that the term can have more than one semantic role in that sentence. In such cases, the term plays important semantic roles that contribute to the meaning of the sentence. In the concept-based mining model, a labeled term, either a word or a phrase, is considered a concept. The system architecture consists of the following main modules:

• Text preprocessing

• Concept analysis

• Concept-based similarity measure

Fig. 1 shows the architecture of the concept-based model, which consists of sentence-based concept analysis, document-based concept analysis, and a concept-based similarity measure.

Fig.1 Architecture of Concept Based Model

A. Text Preprocessing

1) Label Terms

A raw text document is the input to the proposed model. Each document has well-defined sentence boundaries. Each sentence in the document is labeled automatically by the parser. After the semantic role labeler has been run, each sentence in the document may have one or more labeled verb-argument structures. The labeled verb-argument structures, which are the output of the role labeling task, are captured and analyzed by the concept-based mining model at the sentence and document levels. In this model, both the verb and the argument are considered terms. One term can be an argument to more than one verb in the same sentence, which means that the term can have more than one semantic role in that sentence. In such cases, the term plays important semantic roles that contribute to the meaning of the sentence. In the concept-based mining model, a labeled term, either a word or a phrase, is considered a concept.

2) Removing stop words

In computing, stop words are words that are filtered out before or after the processing of natural language data (text). The list of stop words is controlled by human input rather than generated automatically. There is no single definitive list of stop words that all tools use, and some tools use none at all. Some tools specifically avoid removing stop words in order to support phrase search.
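A minimal sketch of stop-word filtering is shown below. Since there is no single definitive list, the `STOP_WORDS` set here is a small hypothetical list chosen for illustration, and the tokenizer simply splits on whitespace and strips surrounding punctuation.

```python
# Hypothetical stop-word list for illustration; real tools ship their own lists.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "that",
              "have", "from", "could", "is", "in"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lower-case the text, strip surrounding punctuation, and drop stop words."""
    tokens = [t.strip('.,;:"').lower() for t in text.split()]
    return [t for t in tokens if t and t not in stop_words]


filtered = remove_stop_words(
    "Texas and Australia researchers have created sheets of materials")
print(filtered)
# ['texas', 'australia', 'researchers', 'created', 'sheets', 'materials']
```

Passing a different `stop_words` set changes the result, which reflects the point that the list is a human-controlled choice.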

3) Stem words

In linguistic morphology, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since 1968. Many search engines treat words with the same stem as synonyms as a kind of query broadening, a process called conflation. Stemming programs are commonly referred to as stemming algorithms or stemmers.
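The idea that related words only need to map to the same stem can be sketched with a deliberately naive suffix stripper. This is not a published stemming algorithm; the suffix list, the `min_stem` length guard, and the function name `naive_stem` are assumptions made for illustration.

```python
# Hypothetical suffix list, checked in this order; real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ("created", "creating", "creates", "sheets"):
    print(w, "->", naive_stem(w))
# created -> creat
# creating -> creat
# creates -> creat
# sheets -> sheet
```

The three forms of "create" conflate to "creat", which is not a valid root but is sufficient for matching related words.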

B. Concept Analysis

Analyzing each concept at the sentence level is called sentence-based concept analysis.

Consider the following sentence:

“Texas and Australia researchers have created

industry-ready sheets of materials made from

nanotubes that could lead to the development of

artificial muscles”.

[Fig. 1 panels: Text Preprocessing (separate sentences, label terms, remove stop words); Concept Analysis (sentence based, document based, corpus based); Concept-based similarity]



In this example, stop words are removed and

concepts are shown without stemming for better

readability as follows:

1. Concepts in the first verb-argument structure of

the verb created:

• Texas Australia researchers

• created

• industry-ready sheets of material nanotubes lead

development of artificial muscles

2. Concepts in the second verb-argument structure

of the verb made:

• materials

• nanotubes lead development artificial muscles

3. Concepts in the third verb-argument structure of

the verb lead:

• nanotubes

• lead

• development artificial muscles.

It is important to note that these concepts are extracted from the same sentence. Thus, the concepts mentioned in this example sentence are:

• Texas

• Australia

• researchers

• created

• industry

• ready

• sheets

• materials

• nanotubes

• lead

• development

• artificial

• muscles

After the concepts have been found at the sentence level, they are also found at the document level.
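The sentence-level step in the example above can be sketched as follows. The verb-argument structures are entered by hand from the example (in practice they would come from a semantic role labeler), and the function names, the splitting on spaces and hyphens, and the `LEFTOVER_STOP_WORDS` set are assumptions for illustration rather than the authors' implementation.

```python
import re
from collections import Counter

# Verb-argument structures copied from the worked example (hand-entered here;
# a semantic role labeler would normally produce them).
structures = {
    "created": ["Texas Australia researchers", "created",
                "industry-ready sheets of materials nanotubes lead "
                "development of artificial muscles"],
    "made": ["materials", "nanotubes lead development artificial muscles"],
    "lead": ["nanotubes", "lead", "development artificial muscles"],
}
LEFTOVER_STOP_WORDS = {"of"}  # hypothetical: stop words still present in the labels


def words_of(term):
    """Split a labeled term into single words on whitespace and hyphens."""
    return [w for w in re.split(r"[\s-]+", term)
            if w and w not in LEFTOVER_STOP_WORDS]


def sentence_concepts(structures):
    """Unique single-word concepts of the sentence, in order of first appearance."""
    seen = []
    for terms in structures.values():
        for term in terms:
            for word in words_of(term):
                if word not in seen:
                    seen.append(word)
    return seen


def structure_counts(structures):
    """Number of verb-argument structures in which each concept appears."""
    counts = Counter()
    for terms in structures.values():
        counts.update({w for term in terms for w in words_of(term)})
    return counts


concepts = sentence_concepts(structures)
counts = structure_counts(structures)
print(concepts)
print(counts["nanotubes"], counts["Texas"])  # 3 1
```

A concept such as "nanotubes" that appears in all three structures takes part in more semantic roles than "Texas", which appears in only one; a document-level pass could apply the same functions to every sentence and aggregate the counts.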

III. Performances of Neural Network Systems

One concern in the machine learning community is that a system trained on small samples may not perform well on test data. On the other hand, if the training data set is too large, the concern is how well and how efficiently a system can learn. The objective of this study [6] is to determine which neural network systems are better suited to applications that have small or large training data. To study neural learning from small training data, we chose five data sets: contact-lenses, cpu, weather-symbolic, weather, and labor-neg-data. All five collections have a fairly balanced distribution among the classes, and the number of pattern classes is not too large. First, we applied our text mining algorithms, including text mining techniques based on classification, to several data collections. After that, we used an existing neural network to measure the training time on the five data sets.

Experimental results show that the accuracy was the same for all data sets except contact-lenses, which is the only one with missing attributes. For contact-lenses, the accuracy of the proposed neural network was on average about 0.3% lower than that of the original neural network. The larger the data set, the greater the improvement in speed. Other informal experiments with larger data sets showed that the proposed neural network can be more than ten times faster when the data set is larger than cpu or the network has many hidden units.

IV. Advantages and Disadvantages of Neural Networks

The calculated output [7] is compared to the known output. If the calculated output is correct, nothing more is necessary. If the computed output is incorrect, the weights are adjusted so as to bring the computed output closer to the known output. This process continues for a large number of cases, or time series, until the network gives the correct output for a given input. The entire collection of cases learned is called a “training sample” (Connor, Martin and Atlas, 1994). In most real-world problems, the neural network is never 100% correct. Neural networks are programmed to learn up to a given threshold of error. After the neural network has learned up to the error threshold, the weight adaptation mechanism is turned off and the network is tested on known cases it has not seen before. Applying the neural network to unseen cases gives the true error rate (Baets, 1994).
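The training procedure described above (compare the computed output with the known output, adjust the weights only when they differ, stop at an error threshold, then test on unseen cases) can be sketched with a single-layer perceptron. The perceptron rule is used here only because it is short; it is not the back-propagation classifier proposed in this paper, and the function names, data, and parameter values are assumptions.

```python
def predict(weights, bias, x):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def train(samples, learning_rate=0.1, error_threshold=0.0, max_epochs=100):
    """Adjust weights on each misclassified case until the epoch error rate
    falls to the threshold (or max_epochs is reached)."""
    weights, bias = [0.0] * len(samples[0][0]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in samples:
            delta = target - predict(weights, bias, x)
            if delta != 0:  # a correct output needs no adjustment
                errors += 1
                weights = [w + learning_rate * delta * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * delta
        if errors / len(samples) <= error_threshold:
            break  # weight adaptation is turned off
    return weights, bias


def error_rate(weights, bias, samples):
    """Fraction of cases for which the computed output differs from the known one."""
    return sum(predict(weights, bias, x) != t for x, t in samples) / len(samples)


# Training sample: the logical AND of two inputs (linearly separable).
training = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
# Hypothetical unseen cases used to estimate the true error rate.
unseen = [((0.9, 0.9), 1), ((0.1, 0.2), 0)]

weights, bias = train(training)
print("training error:", error_rate(weights, bias, training))
print("unseen error:", error_rate(weights, bias, unseen))
```

Raising `error_threshold` stops training earlier at the cost of accepting some misclassified training cases, which mirrors the observation that a network is rarely 100% correct on real-world problems.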

Artificial neural networks present a number of

advantages over conventional methods of analysis.



First, artificial neural networks make no assumptions about the nature of the distribution of the data and are therefore not biased in their analysis. Instead of making assumptions about the underlying population, neural networks with at least one hidden layer use the data to develop an internal representation of the relationship between the variables (White, 1992). Second, since time-series data are dynamic in nature, non-linear tools are needed to discern relationships among them, and neural networks are well suited to discovering non-linear relationships (Wasserman, 1989; Hoptroff, 1993; Moshiri, Cameron, and Scuse, 1999; Shtub and Versano, 1999; Garcia and Gencay, 2000; and Hamm and Brorsen, 2000). Third, neural networks perform well with missing or incomplete data. Whereas traditional regression analysis is not adaptive, typically processing all older data together with new data, neural networks adapt their weights as new input data become available (Kuo and Reitch, 1994). Fourth, it is relatively easy to obtain a forecast in a short period of time compared with an econometric model.

However, there are some problems connected with the use of artificial neural networks. No estimation or prediction errors are calculated with an artificial neural network (Caporaletti, Dorsey, Johnson, and Powell, 1994). Also, artificial neural networks are “black boxes,” for it is impossible to figure out how relations in hidden layers are estimated (Li, 1994). In addition, a network may become a bit overzealous and try to fit a curve to some data even when there is no relationship. Another problem is that neural networks have long training times. Reducing training time is crucial because building a neural network forecasting system is a process of trial and error. Therefore, the more experiments a researcher can run in a finite period of time, the more confident the researcher can be of the result.

V. CONCLUSION

This work bridges the gap between artificial neural network processing and text data mining. A new concept-based mining model composed of four components (sentence-based concept analysis, document-based concept analysis, corpus-based concept analysis, and a concept-based similarity measure) is proposed to improve text clustering quality. By exploiting the semantic structure of the sentences in documents, a better text clustering result is achieved. By combining the factors that affect the weights of concepts at the sentence, document, and corpus levels, a concept-based similarity measure capable of accurately calculating pairwise document similarity is devised. This allows concept matching and concept-based similarity calculations among documents to be performed in a robust and accurate way. The quality of text clustering achieved by this model is considerably better than that of traditional single-term-based approaches. There are a number of possibilities for extending this work. One direction is to link it to Web document clustering. Another direction is to apply the same model to text data classification.

REFERENCES

[1] Shady Shehata, Fakhri Karray, and Mohamed S. Kamel, “An Efficient Concept-Based Mining Model for Enhancing Text Clustering”, IEEE Transactions on Knowledge and Data Engineering, Vol. 22, No. 10, pp. 1360-1371, October 2010.

[2] U.Y. Nahm and R.J. Mooney, “A Mutually Beneficial Integration of Data Mining and Information Extraction”, Proc. 17th Nat’l Conf. Artificial Intelligence (AAAI ’00), pp. 627-632, 2000.

[3] L. Talavera and J. Bejar, “Generality-Based Conceptual Clustering with Probabilistic Concepts”, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 23, No. 2, pp. 196-206, Feb. 2001.

[4] T. Hofmann, “The Cluster-Abstraction Model: Unsupervised Learning of Topic Hierarchies from Text Data”, Proc. 16th Int’l Joint Conf. Artificial Intelligence (IJCAI ’99), pp. 682-687, 1999.

[5] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, “WEBSOM – Self Organizing Maps of Document Collections”, Proc. Workshop Self Organizing Maps (WSOM ’97), 1997.

[6] Guobin Ou and Yi Lu Murphey, “Multi-class pattern classification using neural networks”, Pattern Recognition 40 (2007).

[7] Yochanan Shachmurove (Department of Economics, The City College of the City University of New York and The University of Pennsylvania) and Dorota Witkowska (Department of Management, Technical University of Lodz), “Utilizing Artificial Neural Network Model to Predict Stock Markets”, CARESS Working Paper #00-11, September 2000.

[8] M. Steinbach, G. Karypis, and V. Kumar, “A Comparison of Document Clustering Techniques”, Proc. Knowledge Discovery and Data Mining (KDD) Workshop Text Mining, Aug. 2000.

[9] C. Fillmore, “The Case for Case”, Universals in Linguistic Theory, Holt, Rinehart and Winston, 1968.

[10] S.Y. Lu and K.S. Fu, “A Sentence-to-Sentence Clustering Procedure for Pattern Analysis”, IEEE Trans. Systems, Man, and Cybernetics, Vol. 8, No. 5, pp. 381-389, May 1978.

[11] S. Pradhan, W. Ward, K. Hacioglu, J. Martin, and D. Jurafsky, “Shallow Semantic Parsing Using Support Vector Machines”, Proc. Human Language Technology/North Am. Assoc. for Computational Linguistics (HLT/NAACL), 2004.