final report(suddhasatwasatpathy)

Page 1: Final Report(SuddhasatwaSatpathy)

A REPORT

ON

TEXT ANALYTICS: SMS SPAM FILTERING CLASSIFICATION MODEL

By

(SUDDHASATWA SATPATHY)

Enrolment No.14BSP1513

(SKYBITS TECHNOLOGY PVT. LTD)

INTERIM REPORT

ON

Text analytics: SMS SPAM Filtering of Classification model

By

SUDDHASATWA SATPATHY

Enrolment no.

14BSP1513

SKYBITS TECHNOLOGY PVT. LTD

A report submitted in partial fulfillment of the requirements of the PGPM Program of IBS KOLKATA

2014 – 2016 BATCH

Faculty Guide: Prof. Parthana Banerjee

Company Guide: Mr. Chandramouli Banerjee

Date of submission: 12/05/2015

DECLARATION

I am grateful to Professor Parthana Banerjee, Professor Nirendu Konar, Mr. Chandramouli Banerjee, Mr. Arup Banerjee, and Mr. Debabrata Dutta of Sky Bits Technology for their active guidance in the preparation of the project. Authentic and genuine information was collected for the preparation of the project. This interim report is being submitted towards the partial fulfillment of the PGPM programme of IBS KOLKATA.

Date: 05/06/2015 Name: Suddhasatwa Satpathy

Acknowledgements:

I, Suddhasatwa Satpathy, a student of I.B.S. (Kolkata), am extremely grateful to “Sky Bits Technology Pvt. Ltd” for the confidence bestowed on me in entrusting me with this project.

At this juncture I feel deeply honored to express my sincere thanks to Professor Parthana Banerjee, Professor Nirendu Konar, and Professor Samprit Chakrabarti for making the resources available at the right time and providing valuable insights leading to the successful completion of my project.

I express my gratitude to College Director Dr. AJAY PATHAK for arranging the summer training on a good schedule. I also extend my gratitude to my Project Guide, Mr. Chandramouli Banerjee, who assisted me in compiling the project.

I would also like to thank all the faculty members of I.B.S. for their critical advice and guidance, without which this project would not have been possible.

Last but not least, I express my deep sense of gratitude to my family members and friends, who have been a constant source of inspiration during the preparation of this project work.

Table of Contents

Internship Objective Architecture
Executive Summary
Abstract
1 Introduction: Basic Concepts
1.1 Classification of SPAM/HAM SMS
1.2 About the company: SkyBits Technology Private Ltd
1.3 Michael Porter Analysis for Analytics Industry
1.3.1 Supplier Power
1.3.2 Buyer Power
1.3.3 Threat of Substitutes
1.3.4 Threat of new Entrants
1.3.5 Competitive rivalry
1.4 SWOT Analysis
1.4.1 Strength
1.4.2 Weakness
1.4.3 Opportunities
1.4.4 Threat
1.5 Analytics
1.6 Text Mining
1.6.1 Introduction
1.6.2 Steps Involved in Text Mining
1.6.2.1 Text pre-processing
1.6.2.2 Text transformation
1.6.2.3 Feature selection
1.6.2.4 Text mining methods
1.6.2.5 Interpretation/Evaluation
1.6.3 Areas of Text Mining
1.6.3.1 Information Extraction (IE)
1.6.3.2 Information Retrieval (IR)
1.6.3.3 Natural Language Processing (NLP)
1.6.3.4 Data Mining
1.6.4 Applications of Text Mining
1.7 Text Classification
1.8 Phase of the text classification
1.8.1 Training Phase
1.8.2 Validation/Test Phase
1.8.3 Application Phase
1.8.4 Learning Method Classifiers
1.8.5 Support Vector Machine (SVM)
1.8.6 Naive Bayes Method
1.8.7 Document Term Matrix/Term-document matrix
2 Scope & Objective of Study
2.1 Objectives of Study
2.2.1 Text Mining Objectives
2.3 Limitation of the Study
3 Methodology
3.1 PDCA Rule
4 Model Development
4.2 R Studio Process
4.2.1 R Studio Operators
4.2.2 Creating a classification model with Naive Bayes
4.2.2.1 Creating the training and testing datasets
5 Results
6 Conclusion
7 Recommendation
8 Syntax for the classification model for SPAM/HAM SMS
9 Reference

LIST OF FIGURES

1. The life cycle of analytics
2. Model to represent text mining
3-A. Diagram explaining SVM (NON Linear)
3-B. Diagram explaining SVM (NON Linear)
5. R Studio Process View

LIST OF TABLES

1. Document Term Matrix
2. Confusion Matrix

Internship Objective Architecture: The light blue color represents the primary focus of the internship project.

[Figure: internship objective architecture; recoverable labels: Text Mining, SMS, Website, R Pubs]

Executive Summary:

Student Name: Suddhasatwa Satpathy

Enrolment number: 14BSP1513

Organization: SKY BITS TECHNOLOGY Pvt. Ltd

Industry Type: SOFTWARE SERVICES & SOLUTIONS (Analytics, IT Services)

Report Title: TEXT ANALYTICS: SMS SPAM FILTERING CLASSIFICATION MODEL

Objective of the Internship:

The main objective of the project is to develop a text classification model using the R Studio analytics software package. The model automatically analyzes SMS messages and categorizes them as SPAM or HAM, which in turn helps telecom service providers. With large volumes of SMS entering the system, checking and classifying messages manually is practically impossible, so an automated model is very helpful.

Background: Nowadays the customer is king in the business world, and retaining existing customers is more troublesome than acquiring new ones.

Methodology Used: The SMS spam filtering classification model is built on the primary customer SMS data provided by the company. The PDCA (Plan, Do, Check, Act) rule was followed to complete the tasks.

Findings: The analysis shows that the model classifies the SMS data with an accuracy of 95%.

Recommendations: The SMS spam filtering classification model can be used extensively by telecom service providers to improve their service. It can also restrict the fraudulent use of SMS.

Abstract

Automatic text classification is a machine learning technique in which documents are assigned to predefined categories based on their textual content and extracted features. It has important applications in spam filtering and text mining. In recent years, interest in the automatic categorization of texts into predefined categories has boomed, owing to the growing availability of documents in digital form and the ensuing need to organize them. In the research community, the dominant approach to this problem is based on machine learning techniques. The advantages of this approach are efficiency, considerable savings in expert manpower, and straightforward portability to different domains. This report focuses on text categorization methods that fall within machine learning and shows how automatic text classification can be used to build an SMS spam filtering classification model. An analysis of the resulting model is also presented. The models are built with the R Studio software package.

TOPICS -1

INTRODUCTION

BASIC CONCEPTS

1. Introduction: Basic Concepts

1.1 Classification of SPAM/HAM SMS:

SPAM is unsolicited SMS, typically sent in bulk for advertising or fraud; such messages may also carry malicious content that can make a phone malfunction or corrupt its operating system. HAM is a legitimate, harmless SMS. Mobile phone spam is directed at the text messaging and other communication services of mobile phones. With the extensive use of mobile phones, advertising through SMS has increased rapidly. As a result, users often cannot tell SPAM from HAM and fall into the traps of fraudulent companies. Unlike with email, some recipients may be charged a fee for every message received, including spam. Mobile phone spam is generally less pervasive than email spam; in 2010 around 90% of email was spam. The amount of mobile spam varies widely from region to region. In North America, mobile spam increased steadily from 2008 through 2012 but remained below 1% as of December 2012. In parts of Asia, up to 30% of messages were spam in 2012. SMS spam is illegal under common law in many jurisdictions. In India it is punishable under Section 66A of the IT Act.

1.2 About the company: Sky Bits Technology Private Limited

Sky-Bits Technology Private Limited was established in November 2013. Sky-Bits is a solutions and service provider for various analytics products and for mobile application development. With its expert domain knowledge, Sky-Bits is well placed to grow in the coming years.

Sky-Bits Marketing Analytics:

Sky-Bits Customer Segment Solution

Segment customers based on behaviour, demography, and values

Sky-Bits Segmentation Analytics Solution

Customer reviews are taken into account

Sky-Bits Target Marketing Solution

Effectively plan targeted marketing

Sky-Bits Recommendation Engine

Enhance customer shopping experience in e-Tail and Retail, e-service.

Sky-Bits Analytics Service

Descriptive Analytics

Machine Learning – based Descriptive Analytics helps customers understand hidden and counter-intuitive patterns within their multi-dimensional data.

Predictive Analytics

While Descriptive Analytics helps customers understand the law of the land for their business as it stands, Predictive Analytics provides them with actionable insights.

Sky-Bits BIG DATA Infrastructure Services

Sky-Bits Hadoop

Remote and on-premise Hadoop installation and setup.

Sky-Bits Hadoop Enterprise

Quick and easy Hadoop Integration with existing infrastructure.

Hadoop Infrastructure Maintenance

Dedicated expert team available to be deployed.

Big Data Consulting Service

Helping businesses unearth the business potential hidden in data.

Sky-Bits BIG DATA Edge Services

Enterprise and Consumer Web

Seamless integration of Sky-Bits Hadoop and the Sky-Bits Predictive Analysis Suite with existing J2EE and LAMP systems.

Visualization and Reporting

Intuitive Web Visualization and Reporting tool.

Enterprise Backend Service

Adapted/connected development and integration for ERP/CRM/SCM suites like SAP, Oracle, PeopleSoft, SalesForce, JDA, etc.

Social Media Integration

Integration with various social media APIs like Facebook, Twitter, YouTube, etc.

eMobility and Android Platform Service

eMobility : Run the Business

Sky-Bits provides a wide range of consulting and implementation services aimed at enabling enterprises to better manage and improve their operations through the deployment of relevant mobile technologies.

eMobility : Change The Business

Sky-Bits helps organizations ride the crest of the mobility wave in business through focused solutions that mobile-enable workflows, processes, information access and customer outreach.

Android OS and Application Services

Sky-Bits provides OEMs a competitive advantage through a combination of Android OS customizations at both framework and kernel levels, as well as applications and customer-facing service frameworks that ensure a unique identity for OEM devices.

Embedded Systems Engineering

Sky-Bits leverages its extensive experience in the embedded engineering domain to partner with ODMs and OEMs to deliver solutions around bleeding-edge technologies like Linux-on-ARM, Raspberry Pi, Arduino and internet-of-Things.

1.3 Michael Porter Analysis for the Analytics Industry:

1.3.1 Supplier Power: Human resource suppliers and turnkey solution suppliers have a great impact on the IT and analytics industry today. Without good networking with suppliers and professionals, it is hard to survive in the competitive market.

Analytics is a growing industry, and open source software tools are making it grow at a faster pace. A company needs to upgrade its analytical software tools regularly in response to changes on the supplier side, and each upgrade requires extra manpower and expenditure.

Switching from one analytical software tool to another requires training, which is a cost to the organization.

Implementing, continuously running, and maintaining any analytics product or model needs skilled manpower, which requires the supplier or provider to train the staff.

Only a few companies, such as SAS, IBM, SAP, and RAPID, offer analytics software packages, so supplier power is high. However, with many open source analytics tools such as R, Rapid Miner, ELKI, ITLASSI, and Weka entering the fray, buyers will have more options in the future.

1.3.2 Buyer Power: Because there are many analytics firms, users can easily switch from one firm to another.

Unless a company is really big and provides an unmatched product experience, it is hard for it to prevent customer churn. To retain customers, firms sometimes reduce product rates and extend additional services.

Brand identity matters in the market. Big players are less dependent on individual buyers, whereas a new firm has to withstand market pressure.

From the provider's perspective, product differentiation is also advantageous.

Backward integration by buyers is a threat to suppliers: buyers can purchase analytics software products and recruit resources directly.

As the market is very competitive, sellers offer incentives to buyers.

1.3.3 Threat of Substitutes: As analytics is a rapidly growing field, innovation is what sustains firms in this industry. Technological advances in software tools pose a real threat, as they can make present products obsolete in the global market.

1.3.4 Threat of new Entrants: Analytics and IT professionals with many years of experience often explore the possibility of becoming entrepreneurs, so new entrants play a key role.

Innovation plays a key role in sustaining a firm in the analytics industry. A firm without innovative ideas cannot sustain itself in the market and has to shift its operations.

The entry barrier is low because little start-up capital is needed to open a firm. However, it is difficult to survive if the firm does not innovate.

Knowledgeable professionals are the key to a successful analytics venture.

1.3.5 Competitive rivalry: Sustainable competitive advantage through innovation: Analytics products bring in competition, and new innovative techniques make it stronger.

Brand value of the organization: Companies with greater brand value gain a competitive advantage over start-ups, as they hold a greater share of customers' minds.

Pricing policy: Firms use penetration pricing to attract new customers and retain old ones.

Bench strength: Start-up firms have lower bench strength than big ones.

1.4 SWOT Analysis of the company:

1.4.1 Strength: The top management team has real-world experience.

They have expertise in different IT and telecommunication domains. Most of the professionals hold master's degrees from renowned universities in India and abroad.

The organization has tied up with different sectors such as education, idea analytics, and information technology.

The top management, who have worked at different CMM level 5 companies and government companies such as Wipro, Defence R&D, and Lucent Inc., are assets.

1.4.2 Weakness: Being a start-up, the company has financial constraints, so it cannot pursue a wide range of marketing activities.

As a new entrant to the business, its resources are limited compared to those of a CMM level 5 company.

Bigger companies have better human resources, so they can more easily win projects from clients globally, which Sky-Bits cannot yet do.

1.4.3 Opportunities: The analytics industry is growing day by day in the corporate world. Sky-Bits has advanced tools like R and Hadoop, which it uses to analyze and solve different problems of big data infrastructure services.

With the experience of its founding members, the organisation can build contacts that will help it grow.

1.4.4 Threat: Because of open source software, there is a huge threat of competition among the peer companies existing in the market.

The initial start-up capital is low, so the company requires a lot of funds to grow and sustain itself in the industry.

SWOT Analysis of the Company

Strengths: Knowledgeable experts; better business sense; better domain knowledge

Weaknesses: Low financial capabilities; limited resources; low expenses target

Opportunities: Expansion of business with advanced tools; better business contacts

Threats: Open market; fund management

1.5 Analytics: Analytics is defined as the scientific process of transforming data into insight for making better decisions. Organisations use analytics every day to improve processes, save costs, and enhance revenues. Business needs in analytics range from wanting to know what it is and how it might serve organizational goals to applying it to a specific business problem. More broadly, analytics is the discovery and communication of meaningful patterns in data. Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming, and operations research to quantify performance. Companies commonly apply analytics to business data to describe, predict, and improve business performance. Specific areas within analytics include enterprise decision management, retail analytics, store assortment and stock-keeping unit optimization, marketing optimization and marketing mix analytics, text analytics in contact centres, web analytics, sales force sizing and optimization, price and promotion modelling, predictive science, credit risk analysis, and fraud analytics. Analytics requires extensive use of software and algorithms that draw on the most current methods in computer science, statistics, and mathematics.

Figure 1: The life cycle of analytics [Source: https://www.informs.org/About-INFORMS/What-is-Analytics]

1.6 Text Mining:

1.6.1 Introduction: In the contemporary world, text is the most common means of exchanging information. Computers are increasingly used to store documents of various types, and the volume of data stored as documents is increasing day by day. Documents can be divided into three types: structured, semi-structured, and unstructured. Data stored in a database is an example of a structured dataset.

Examples of semi-structured and unstructured data include emails, full-text documents, and HTML files. Text mining is defined as the process of discovering hidden, useful, and interesting patterns in unstructured text documents. Approximately 80% of the data in the corporate world is in unstructured format.

1.6.2 Steps Involved In Text Mining: The steps of the text mining process are discussed below.

1.6.2.1 Text preprocessing: The text pre-processing step can be further divided into two parts: a) tokenisation and b) stop word removal.

a) Tokenisation: Text documents contain a collection of statements. This step segments the whole text into words by removing blank spaces, commas, etc.

b) Stop word removal: This step first removes HTML and XML tags from web pages. It then removes stop words such as ‘a’, ‘is’, and ‘of’.
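The two pre-processing steps can be sketched as follows. This is a minimal illustration in Python rather than the R used in the report; the stop-word list, the function names, and the sample message are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "is", "of", "the", "to"}

def strip_tags(text):
    """Replace HTML/XML tags such as <b> or </p> with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenise(text):
    """Lower-case the text and split it into word tokens,
    discarding blank spaces and punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def preprocess(text):
    """Strip tags, tokenise, then drop stop words."""
    return [t for t in tokenise(strip_tags(text)) if t not in STOP_WORDS]

tokens = preprocess("<b>WINNER!</b> This is a free entry to the prize draw, call now")
print(tokens)
```

The order matters: tags are stripped before tokenisation so that tag names do not end up as tokens.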

1.6.2.2 Text transformation: A text document is represented by the words it contains and their occurrences. Two common approaches to document representation are a) bag of words and b) vector space.

1.6.2.3 Feature selection: This is also called variable selection. It selects the subset of important features to be used in model creation, and includes removing features that are irrelevant.

1.6.2.4 Text mining methods: Data mining methods such as clustering, classification, and information retrieval are generally used in text mining.

1.6.2.5 Interpretation/Evaluation: Analyzing the results is the evaluation phase of the text mining process.

1.6.3 Areas of Text Mining: The areas of text mining are broadly divided into four subcategories: i) information extraction, ii) information retrieval, iii) natural language processing (NLP), and iv) data mining.

1.6.3.1 Information Extraction (IE): This is the process of automatically extracting structured information from unstructured and/or semi-structured text documents. An IE system identifies entities such as names of people, companies, and locations, together with their attributes and the relationships between them.

1.6.3.2 Information Retrieval (IR): This is the process of collecting information in the form of text. Textual information can be collected in various ways, for example from newspapers or from product reviews on different websites. These are basically the primary and secondary media of business research.

1.6.3.3 Natural Language Processing (NLP): Natural language processing is one of the most challenging problems in the field of artificial intelligence. Its goal is for the computer to understand the language used by humans. It makes use of machine learning techniques.

1.6.3.4 Data Mining: This is the process of discovering knowledge in large amounts of data. Data mining attempts to discover statistical rules and patterns automatically.

1.6.4 Applications of Text Mining: Text mining is applied in the following sectors:

(i) Telecommunications, energy and other service industries.

(ii) Information Technology sector and Internet.

(iii) Publishing and media.

(iv) Banks, insurance and financial markets.

(v) Political institutions, political analysis, public administration and legal documents.

(vi) Pharmaceutical and research companies and health care.

(vii) Bio-Informatics, Business Intelligence and national security.

Figure 1.2: Model to represent text mining [Source: International Journal of Computer Applications, website: www.ijcaonline.org]

1.7 Text Classification (Text Categorization): Text classification is the task of automatically sorting a set of documents into categories (or classes, or topics) from a predefined set. It provides a conceptual view of document collections and has important applications in the real world. For example, in sentiment analysis, product reviews written by different people on different websites are classified as positive, neutral, or negative. In spam filtering, SMS texts are classified as spam or not spam. In marketing, when a product's sales have declined from the previous year, the factors behind the decline can be classified into categories such as competitor price, ingredients used, and so on. Text classification problems are characterized by the number of classes. If there are exactly two classes (example: spam/non-spam in the case of emails), it is called a ‘binary’ text classification problem. If there are more than two classes (example: positive/negative/neutral in the case of sentiment analysis of documents) and each document falls into exactly one class, it is a ‘multi-class’ problem. In many cases, however, a document may have more than one associated category in a classification scheme. This type of text classification task is called a ‘multi-label’ categorization problem.

1.8 Phases of text classification: Building a text classification model with machine learning techniques consists of the following phases:

1.8.1 Training phase: The data come from the SMS Spam Collection. 75% of the entire spam filter dataset is used as the training set, on which the two predictive models are trained.

1.8.2 Validation/Test phase: The remaining 25% of the SMS Spam Collection dataset is used for testing. The two models are used to predict the classification of each test message, and in each case we estimate how good the prediction is. A prediction function is used for this validation.

1.8.3 Application phase: In the application phase, the model is applied to different SMS spam datasets. The Naive Bayes model is good enough to test in this way, and it can be applied to independent and personal SMS spam filtering. To summarise the split, the larger part of the dataset (75%) is used for training and the smaller part (25%) is used for validation.
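The 75%/25% split described in the training and test phases can be sketched as below. This is an illustrative Python version, not the report's R code; the function name, the fixed seed, and the placeholder messages are assumptions.

```python
import random

def train_test_split(records, train_fraction=0.75, seed=42):
    """Shuffle the records reproducibly and cut them into
    a training part and a test part."""
    shuffled = list(records)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder data standing in for labelled SMS messages.
messages = [("ham", f"message {i}") for i in range(100)]
train, test = train_test_split(messages)
print(len(train), len(test))
```

Shuffling before cutting matters when the source file is ordered, for example with all spam messages grouped together; otherwise one class could be missing from the test set.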

1.8.4 Learning Method Classifiers: The basic learning method classifiers are machine learning algorithms, which include a number of advanced statistical methods for handling regression and classification tasks with multiple dependent and independent variables. The methods considered here are Support Vector Machines (SVM) for regression and classification and Naïve Bayes for classification, together with the document-term matrix (Dtm) used to represent the text.

1.8.5 Support Vector Machine (SVM): Support Vector Machines are based on the concept of decision planes that define decision boundaries. A decision plane separates a set of objects having different class memberships. An example is illustrated below.

In this example, each object belongs to either the RED or the GREEN class. The RED objects on the left and the GREEN objects on the right are separated by a line, which is the boundary. Any new object (white circle) falling to the right is labelled, i.e., classified, as GREEN (or classified as RED should it fall to the left of the separating line).

Figure 3-A. Diagram explaining SVM (NON Linear) [Source: http://www.statsoft.com/Textbook/Support-Vector-Machines]

The above is a classic example of a linear classifier, i.e., a classifier that separates a set of objects into their respective groups (GREEN and RED in this case) with a line. For more complex data, more complex structures are needed for an optimal separation. This situation is depicted in the illustration below. Compared to the previous schematic, it is clear that a full separation of the GREEN and RED objects would require a curve (which is more complex than a line). Classification tasks based on drawing separating lines to distinguish between objects of different class memberships are known as hyperplane classifiers. Support Vector Machines are particularly suited to handle such tasks.
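The decision rule of a linear classifier in the RED/GREEN example can be sketched in a few lines of Python. This shows only how a point is labelled once a separating line is known; it does not show how an SVM finds that line during training. The weights and bias below are hypothetical values that place a vertical boundary at x = 5.

```python
def classify(point, weights=(1.0, 0.0), bias=-5.0):
    """Label a 2-D point by the side of the line w.x + b = 0 it falls on.
    The default weights and bias are illustrative, not learned."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return "GREEN" if score > 0 else "RED"

print(classify((7.0, 2.0)))  # right of the line
print(classify((3.0, 9.0)))  # left of the line
```

A curved boundary, as in the second illustration, cannot be expressed by this rule directly, which is why more complex structures are needed for such data.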

Figure 3-B. Diagram explaining SVM (NON Linear) [Source: http://www.statsoft.com/Textbook/Support-Vector-Machines]

1.8.6 Naive Bayes method: The Naive Bayes classifier is based on Bayes' theorem and is particularly suited to cases where the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods.
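To make the idea concrete, here is a minimal multinomial Naive Bayes sketch in Python with add-one (Laplace) smoothing. It is an illustration under stated assumptions, not the model built in the report: the function names and the four toy messages are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(labelled_docs):
    """Count documents per class and words per class.
    labelled_docs is a list of (label, list_of_tokens) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log posterior,
    assuming words are independent given the class."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for t in tokens:
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for illustration.
toy = [
    ("spam", ["win", "free", "prize"]),
    ("spam", ["free", "call", "now"]),
    ("ham", ["see", "you", "at", "lunch"]),
    ("ham", ["call", "me", "at", "home"]),
]
model = train_nb(toy)
print(predict_nb(model, ["free", "prize"]))
print(predict_nb(model, ["lunch", "at", "home"]))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, which is relevant for the high-dimensional inputs mentioned above.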

1.8.7 Document-Term Matrix/Term-Document Matrix (Dtm): A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take; the simplest is the raw count of the term in the document. The matrix is used in the field of natural language processing. For example, consider the following two documents:

D1 = "I like databases"

D2 = "I hate databases"

The document-term matrix for these two documents is given in the table below.

      I    like    hate    databases
D1    1    1       0       1
D2    1    0       1       1

This shows the general structure: which documents contain which terms, and how many times each term appears.
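The D1/D2 example can be reproduced with a short Python sketch using raw term counts. The function name is an assumption, and the columns come out in order of first appearance (i, like, databases, hate), so they are ordered differently from the table although the counts are the same.

```python
from collections import Counter

def document_term_matrix(docs):
    """Build a raw-count document-term matrix.
    Returns the list of terms (columns) and one row of counts per document."""
    tokenised = [d.lower().split() for d in docs]
    terms = []
    for tokens in tokenised:
        for t in tokens:
            if t not in terms:   # keep terms in order of first appearance
                terms.append(t)
    rows = [[Counter(tokens)[t] for t in terms] for tokens in tokenised]
    return terms, rows

terms, rows = document_term_matrix(["I like databases", "I hate databases"])
print(terms)
for name, row in zip(["D1", "D2"], rows):
    print(name, row)
```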

TOPICS -2

SCOPE & OBJECTIVES OF STUDY

2. Scope & Objectives of Study:

2.1 Scope of Study: The scope of the project is to understand the basic concepts of the areas of text mining. It also involves the design of a text classification model to perform predictive analysis. Experimentation is key in modern practice, so the study makes use of modern analytics tools. These tools help build a working knowledge of the different techniques used in analytics and of the modern analytics applications that can be applied in real-time scenarios. The scope of every study depends on its data collection methods. Here the company provided the data for developing a SPAM/HAM SMS classification model using caret and Naive Bayes.

This model will help the company predict SPAM/HAM SMS in the text analytics market. The study identifies the SPAM and HAM SMS that people receive every now and then in real-time scenarios. Nowadays people get SMS from various sources that are unknown to them.

The model can be used extensively for the betterment of society. The company can benefit by applying the model through various telecom service providers in the worldwide market.

2.2 Objectives of the study:

2.2.1 Text Mining objectives:

Text mining is basically the process of structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output. Text mining includes not only text categorization and predictive analysis but also document summarization.

The basic objective of the project is to develop a text classification model using the R analytics software package. The model is used to classify messages as SPAM or HAM across the entire SMS Spam Collection.

After the classification model is developed, its implementation is divided into two parts: a training phase and a test phase.

2.3 Limitations of the study: The shortage of time is a big issue when analyzing huge data.

Data Availability: Identifying the text of spam messages in the claims is a very hard and time-consuming task. It also involved carefully scanning hundreds of web pages.

3. Methodology

TOPICS -3

METHODOLOGY

3.1 PDCA Rule:

The whole internship follows the common PDCA rule, i.e. Plan, Do, Check, Act, a process used in operations management.

PLAN: Establish the objectives and processes needed to deliver results in accordance with the expected output. Setting output expectations makes the completeness and accuracy of the specification part of the targeted improvement.

DO: Implement the plan, execute the process, and develop the product.

CHECK: Study the actual results and compare them with the expected results.

ACT: Correct the differences between the actual and the expected results.

3.2 The PDCA rule for the project:

Plan: Getting familiar with the R Studio software and extracting the SMS for text mining. With the help of R Studio, sentiment analysis is done using a CSV (comma-separated values) file of Galaxy S4 mobile reviews.

Do: Using real-time data, provided by the company, for the model development.

Check: Checking the model for accuracy, sensitivity and specificity.

Act: Acting on the outcome of the check.

TOPICS - 4

MODEL DEVELOPMENT

4. Model Development:

4.1 Data Source: For mobile phone spam research, a public set of labelled SMS messages has been taken from the SMS SPAM collection. It is a single collection of 5,574 real, non-encoded English messages, each tagged as legitimate (ham) or spam. First, a collection of 425 spam messages was extracted from the Grumbletext Web site. Second, a subset of 3,375 randomly chosen ham messages was taken from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. Third, a list of 450 ham messages was collected from Caroline Tagg's PhD thesis. Finally, the SMS Spam Corpus v.0.1 Big was incorporated; it has 1,002 ham messages and 322 spam messages and is publicly available.
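The counts quoted above can be cross-checked with a few lines of base R. This is only an arithmetic sketch: the data frame and its column names are made up for illustration, and the numbers are the ones stated in the paragraph.

```r
# Message counts per source, as quoted in the data source description.
# The data frame layout and column names are illustrative assumptions.
sources <- data.frame(
  source = c("Grumbletext", "NUS SMS Corpus", "Tagg PhD thesis",
             "SMS Spam Corpus v.0.1 Big (ham)",
             "SMS Spam Corpus v.0.1 Big (spam)"),
  type   = c("spam", "ham", "ham", "ham", "spam"),
  n      = c(425, 3375, 450, 1002, 322),
  stringsAsFactors = FALSE
)

total   <- sum(sources$n)                      # size of the whole collection
by_type <- tapply(sources$n, sources$type, sum) # messages per class
share   <- round(100 * by_type / total, 1)      # class shares in percent

print(total)
print(by_type)
print(share)
```

The total and the ham/spam shares computed this way can be compared with the 5,574 messages and the 86.6% / 13.4% split mentioned in the report.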

4.2 R studio Process: R studio has a script-writing pane where code can be written, with comment lines above it. The console pane is used for typing commands and for viewing the output. The environment pane shows the objects and variables in use, while the history pane displays all the code that has been run. Just below the history pane there are tabs for files, plots, packages and help. The Files tab has options to create a folder, rename a folder and change the working directory. The Plots tab displays the graphs that have been drawn. The Packages tab lists the default packages that R studio provides, along with any updated packages. The Help tab is very useful, since it documents and illustrates the different R functions.

Figure 4.1: R studio Process View

4.2.1 R Studio Operators: The R studio window is divided into the parts given below.

i) Bottom left: Console window, also known as the command window. R commands can be written here.

ii) Top left: Editor window, also known as the script window. Collections of commands and scripts can be saved and edited in this window.

iii) Top right: The workspace and history pane shows the data and values held in R studio. R studio has its own memory.

iv) Bottom right: This pane supports several activities, such as opening files and viewing plotted graphs.

4.2.2 Creating a classification model with Naive Bayes:

4.2.2.1 Generating the training and testing datasets:

The createDataPartition function is used to split the original dataset into a training set and a testing set, using the proportions from the SMS SPAM collections (75% training, 25% testing). The corresponding corpora and document term matrices are generated from the same split.

According to the documentation that accompanies the data file, 86.6% of the entries correspond to legitimate messages (“ham”) and 13.4% to spam messages. The aim is to check whether the partition procedure preserves those proportions in the training and testing sets.

It would seem that the procedure keeps the proportions perfectly.
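To illustrate why a class-wise (stratified) split keeps the proportions, here is a minimal base R sketch. It is not caret's implementation: `stratified_index` is a hypothetical helper, the labels are synthetic (4,827 ham and 747 spam, matching the documented mix), and rounding the per-class sample size up with `ceiling` is an assumption.

```r
set.seed(12358)

# Synthetic labels with the documented class mix (assumption: no real text needed).
type <- factor(c(rep("ham", 4827), rep("spam", 747)))

# Hypothetical helper: draw a fraction p of the row indices within each class.
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_index <- stratified_index(type, p = 0.75)
train <- type[train_index]
test  <- type[-train_index]

# Percentage frequency table, in the spirit of the report's frqtab utility.
pct <- function(x) round(100 * prop.table(table(x)), 1)
print(rbind(original = pct(type), training = pct(train), testing = pct(test)))
print(c(train = length(train), test = length(test)))
```

Because the sampling is done separately inside each class, the ham/spam shares in both subsets can differ from the original only by rounding.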

Following the strategy used in the SMS SPAM collections, we first pick the terms that appear at least 5 times in the training document term matrix. To do this, we create a dictionary of terms (using the function findFreqTerms) and use it to filter the cleaned-up training and testing corpora.
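The idea of a frequency-based dictionary can be sketched in base R without the tm package. The toy messages, the whitespace tokeniser and the threshold of 2 are assumptions chosen to keep the example small; the report itself uses findFreqTerms with a threshold of 5.

```r
# Toy corpus (assumption): already lower-cased and stripped of punctuation.
docs <- c("free prize call now", "call me later", "free free entry call",
          "are you free later", "prize entry now")

# Tokenise on single spaces and collect the vocabulary.
tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Document term matrix: one row per document, one column per term, cells are counts.
dtm <- t(vapply(tokens,
                function(tok) tabulate(match(tok, terms), nbins = length(terms)),
                integer(length(terms))))
colnames(dtm) <- terms

# Dictionary: terms whose total count reaches the threshold.
min_count <- 2
dict <- colnames(dtm)[colSums(dtm) >= min_count]

# Restrict the matrix to the dictionary terms only.
dtm_filtered <- dtm[, dict, drop = FALSE]
print(dict)
print(dtm_filtered)
```

Building the dictionary from the training data only, and then applying it to both sets, keeps the two matrices on the same columns.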

As a final step before using these sets, we convert the numeric entries in the term matrices into factors that indicate whether the term is present or not. For this we use a slightly modified version of the convert_counts function that appears in the SMS SPAM collections, and apply it to each column of the matrices.
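The effect of the conversion can be seen on a small count matrix. The function body follows the convert_counts listing in the attachment; the `counts` matrix and its column names are made up for illustration.

```r
# Map term counts to a two-level Absent/Present indicator.
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  factor(x, levels = c(0, 1), labels = c("Absent", "Present"))
}

# Toy count matrix (assumption): 3 messages, 2 terms, filled column by column.
counts <- matrix(c(0, 2, 1,
                   3, 0, 0),
                 nrow = 3,
                 dimnames = list(NULL, c("free", "call")))

# apply() runs the function on each column; the factor labels come back
# as a character matrix with the same shape as the input.
flags <- apply(counts, MARGIN = 2, FUN = convert_counts)
print(flags)
```

Any count above zero collapses to "Present", so the classifier only sees whether a term occurs in a message, not how often.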

Training the two prediction models

We use Naive Bayes to train a couple of prediction models. Both models are generated using 10-fold cross-validation, with the default parameters.
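As a rough illustration of what 10-fold cross-validation does with the training rows, the base R sketch below assigns each row to one of ten folds and reports how many rows are left for fitting when each fold is held out. It is a simplified, unstratified scheme and not necessarily how caret builds its folds; the row count of 4,182 is the training-set size reported in the results.

```r
set.seed(12358)
n <- 4182   # training samples, as reported in the results section
k <- 10     # number of folds

# Assign every row a fold label 1..k, then shuffle the labels.
fold_id <- sample(rep(seq_len(k), length.out = n))

# For fold f, the model is fit on all rows outside f and scored on the rows in f.
train_sizes <- vapply(seq_len(k), function(f) sum(fold_id != f), integer(1))
held_out    <- n - train_sizes

print(train_sizes)
print(held_out)
```

Each row is held out exactly once, and the ten per-fold scores are averaged to give the cross-validated accuracy.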

It must be kept in mind that the first model does not use the Laplace correction and lets the training procedure figure out whether or not to use a kernel density estimate, while the second one fixes the Laplace parameter to one (fL=1) and explicitly forbids the use of a kernel density estimate (usekernel=FALSE).
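To make the role of the Laplace parameter concrete, here is a small hand-written Naive Bayes for Absent/Present features in base R. It is a textbook sketch under stated assumptions, not the klaR or caret implementation: `cond_prob` and `nb_posterior` are hypothetical helpers, the six training messages are invented, and the correction is the usual additive form (count + fL) / (n + 2 * fL).

```r
# Toy training data (assumption): each column flags whether a term occurs.
x <- data.frame(
  free = c("Present", "Present", "Absent", "Absent", "Absent", "Present"),
  call = c("Present", "Absent", "Present", "Absent", "Present", "Present"),
  stringsAsFactors = FALSE
)
y <- factor(c("spam", "spam", "ham", "ham", "ham", "spam"))

# Hypothetical helper: P(term = Present | class), with additive correction fL.
cond_prob <- function(feature, y, fL = 0) {
  sapply(levels(y), function(cl) {
    in_class <- feature[y == cl]
    # 2 is the number of feature levels (Absent, Present)
    (sum(in_class == "Present") + fL) / (length(in_class) + 2 * fL)
  })
}

# Hypothetical helper: posterior class probabilities for one new message.
nb_posterior <- function(x, y, new, fL = 0) {
  score <- sapply(levels(y), function(cl) mean(y == cl))  # class priors
  for (term in names(x)) {
    p <- cond_prob(x[[term]], y, fL)
    score <- score * (if (new[[term]] == "Present") p else 1 - p)
  }
  score / sum(score)
}

new_sms <- list(free = "Present", call = "Absent")
post0 <- nb_posterior(x, y, new_sms, fL = 0)  # no Laplace correction
post1 <- nb_posterior(x, y, new_sms, fL = 1)  # Laplace correction of one
print(post0)
print(post1)
```

Without the correction, a term never seen in ham drives the ham probability to exactly zero; with fL = 1 the estimate is pulled away from zero, so a single unseen term no longer decides the outcome on its own.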

Testing the predictions

We will also use our sumpred function to extract the true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), the prediction accuracy, the sensitivity (also known as recall or true positive rate), and the specificity (also known as true negative rate).

Also, we will use the information from the similar models described in the SMS SPAM collections, in terms of TP, TN, FP, and FN, to estimate the rest of the parameters and compare them with the caret-derived models.

Accuracy gives us an overall sense of how good the models are. By that criterion, the models in the SMS SPAM collections and those calculated here are very similar in how well they classify an SMS. All of them do surprisingly well, taking into account the simplicity of the method.

The discussion in the SMS SPAM collections centered on the number of false positives (FP) predicted by the model, but I’d rather look at the sensitivity (related to Type II errors) and specificity (related to Type I errors) of the predictions (and the corresponding PPV and NPV).

In this example, the sensitivity gives us the probability of an SMS text being classified as SPAM when it really is SPAM. Looking at this parameter, we see that even though the models in the SMS SPAM collections do not differ much from the caret models in terms of accuracy, they do worse in terms of sensitivity. The text of the SMS SPAM collections argues that using the Laplace correction improves prediction, but with the cross-validated models generated using the caret package the opposite is true.

Of course, we gain in sensitivity, but we lose slightly in specificity, which in this example is the probability of a HAM message being classified as HAM. In other words, we increase (a bit) the misclassification of the regular SMS texts as SPAM. But the difference between the worst and the best specificity is of the order of 0.01 or ~1%.
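The link between sensitivity, specificity and the two error types can be worked through with the test-set counts that appear in the confusion matrix of the results section (positive class: spam). The variable names below are illustrative; the formulas are the standard definitions.

```r
# Test-set counts taken from the confusion matrix in the results section.
TP <- 161   # spam predicted as spam
FN <- 25    # spam predicted as ham
FP <- 2     # ham predicted as spam
TN <- 1204  # ham predicted as ham

rates <- c(
  sensitivity  = TP / (TP + FN),  # share of spam that is caught
  specificity  = TN / (TN + FP),  # share of ham that is kept
  type_I_rate  = FP / (FP + TN),  # ham flagged as spam, equals 1 - specificity
  type_II_rate = FN / (FN + TP),  # spam that slips through, equals 1 - sensitivity
  PPV          = TP / (TP + FP),  # flagged messages that really are spam
  NPV          = TN / (TN + FN)   # passed messages that really are ham
)
print(round(rates, 4))
```

Read this way, a small loss in specificity means a few more legitimate messages are flagged, while a gain in sensitivity means fewer spam messages reach the user.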

TOPICS – 5

RESULTS

CONCLUSIONS

RECOMMENDATION

5. Results:

$ V1: Factor w/ 2 levels "ham","spam": 1 1 2 1 1 2 1 1 2 2 ...

$ V2: Factor w/ 5171 levels "'An Amazing Quote'' - \"Sometimes in life its difficult to decide whats wrong!! a lie that brings a smile or the truth that bri"| __truncated__,..: 1147 3249 1046 4278 2895 1071 975 407 4765 1282 ...

> colnames(sms_raw)

[1] "V1" "V2"

Naive Bayes

4182 samples

1281 predictors

2 classes: 'ham', 'spam'

No pre-processing

Resampling: Cross-Validated (10 fold)

Summary of sample sizes: 3764, 3764, 3764, 3763, 3764, 3764, ...

Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa      Accuracy SD  Kappa SD
  FALSE      0.9803379  0.9110432  0.008921711  0.0413257
   TRUE      0.9803379  0.9110432  0.008921711  0.0413257

Tuning parameter 'fL' was held constant at a value of 0

Accuracy was used to select the optimal model using the largest value.

The final values used for the model were fL = 0 and usekernel = FALSE.

Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham  1204   25
      spam    2  161

Accuracy : 0.9806

95% CI : (0.9719, 0.9872)

No Information Rate : 0.8664

Sensitivity : 0.8656

Specificity : 0.9983

'Positive' Class : spam
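The summary statistics above can be re-derived from the four cells of the confusion matrix. The sketch below uses base R only; using `binom.test` for the interval is an assumption about how an exact binomial confidence interval for the accuracy would be obtained, and the object names are illustrative.

```r
# Confusion matrix from the output above: rows are predictions, columns the reference.
cm <- matrix(c(1204, 2, 25, 161), nrow = 2,
             dimnames = list(Prediction = c("ham", "spam"),
                             Reference  = c("ham", "spam")))

correct  <- sum(diag(cm))              # correctly classified messages
total    <- sum(cm)                    # size of the test set
accuracy <- correct / total
nir      <- max(colSums(cm)) / total   # accuracy of always guessing the majority class

# Exact binomial 95% interval for the accuracy (assumed method).
ci <- as.numeric(binom.test(correct, total)$conf.int)

print(cm)
print(round(c(accuracy = accuracy, no_information_rate = nir), 4))
print(round(ci, 4))
```

Comparing the accuracy with the no-information rate shows how much the model adds over simply labelling every message as ham.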

6. Conclusion: The classification model for SMS SPAM filtering reaches an accuracy of 98.06% on the test set (95% confidence interval: 97.19% to 98.72%), with spam as the positive class.

7. Recommendation:

This SMS SPAM filtering classification model can be used extensively by telecom service providers for the betterment of the telecom service. The model can also restrict the fraudulent use of SMS.

TOPICS – 6

ATTACHMENTS

8. Syntax for the classification model for SPAM/HAM SMS:

# libraries needed by caret
library(klaR)
library(MASS)
# for the Naive Bayes modelling
library(caret)
# to process the text into a corpus
library(tm)
# to get nice looking tables
library(pander)
# to simplify selections
library(dplyr)
##library(doMC)
##registerDoMC(cores=4)

# a utility function for % freq tables
frqtab <- function(x, caption) {
  round(100*prop.table(table(x)), 1)
}

# utility function to summarize model comparison results
# sumpred <- function(cm) {
#   summ <- list(TN=cm$table[1,1], # true negatives
#                TP=cm$table[2,2], # true positives
#                FN=cm$table[1,2], # false negatives
#                FP=cm$table[2,1], # false positives
#                acc=cm$overall["Accuracy"], # accuracy
#                sens=cm$byClass["Sensitivity"], # sensitivity
#                spec=cm$byClass["Specificity"]) # specificity
#   lapply(summ, FUN=round, 2)
# }

###########################################################################
## Reading and preparing the data
# if (!file.exists("smsspamcollection.zip")) {
#   download.file(url="http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip",
#                 destfile="smsspamcollection.zip", method="curl")
# }
#
# sms_raw <- read.table(unz("smsspamcollection.zip","SMSSpamCollection"),
#                       header=FALSE, sep="\t", quote="", stringsAsFactors=FALSE)

sms_raw <- read.csv("D:\\BI&A-Collections\\TextMining\\Sms-Spam\\spamdata.csv", header = FALSE, sep=",")
str(sms_raw)
colnames(sms_raw) <- c("type", "text")
sms_raw$type <- factor(sms_raw$type)
sms_raw$text <- as.character(sms_raw$text)

# randomize it a bit
set.seed(12358)
sms_raw <- sms_raw[sample(nrow(sms_raw)),]
dim(sms_raw)
str(sms_raw)
table(sms_raw$type)

### Preparing the data ####################################
# First transform the SMS text into a corpus that can later
# be used in the analysis, then convert all text to lowercase,
# remove numbers, remove some common stop words in english, remove
# punctuation and extra whitespace, and finally, generate the document
# term matrix that will be the basis for the classification task.
smsCorpus <- Corpus(VectorSource(sms_raw$text))

# earlier pipeline version, kept disabled: it referred to an undefined
# sms_corpus object and its first line was left active with a dangling %>%
# sms_corpus_clean <- sms_corpus %>%
#   tm_map(content_transformer(tolower)) %>%
#   tm_map(removeNumbers) %>%
#   tm_map(removeWords, stopwords(kind="en")) %>%
#   tm_map(removePunctuation) %>%
#   tm_map(stripWhitespace)

# tolower is wrapped in content_transformer() so that the corpus documents
# keep their structure after the transformation
smsCorpus <- tm_map(smsCorpus, content_transformer(tolower))
smsCorpus <- tm_map(smsCorpus, removePunctuation)
smsCorpus <- tm_map(smsCorpus, removeNumbers)
smsCorpus <- tm_map(smsCorpus, stripWhitespace)
smsCorpus <- tm_map(smsCorpus, removeWords, stopwords("english"))
##myDTM = TermDocumentMatrix(smsCorpus, control = list(minWordLength = 1))

# the original re-wrapped the documents with tm_map(smsCorpus, PlainTextDocument)
# to repair the corpus after a bare tolower; with content_transformer() this
# step is no longer needed
sms_corpus_clean <- smsCorpus
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)

# m = as.matrix(smsDtm)
# v = sort(rowSums(m), decreasing = TRUE)
# library(wordcloud)
# set.seed(4363)
# wordcloud(names(v), v, min.freq = 5)

# Creating a classification model with Naive Bayes
# Generating the training and testing datasets using caret
train_index <- createDataPartition(sms_raw$type, p=0.75, list=FALSE)
sms_raw_train <- sms_raw[train_index,]
sms_raw_test <- sms_raw[-train_index,]
sms_corpus_clean_train <- sms_corpus_clean[train_index]
sms_corpus_clean_test <- sms_corpus_clean[-train_index]
sms_dtm_train <- sms_dtm[train_index,]
sms_dtm_test <- sms_dtm[-train_index,]

# ft_orig <- frqtab(sms_raw$type)
# ft_train <- frqtab(sms_raw_train$type)
# ft_test <- frqtab(sms_raw_test$type)
# ft_df <- as.data.frame(cbind(ft_orig, ft_train, ft_test))
# colnames(ft_df) <- c("Original", "Training set", "Test set")
# pander(ft_df, style="rmarkdown",
#        caption=paste0("Comparison of SMS type frequencies among datasets"))

## pick terms that appear at least 5 times in the training document term matrix.
## To do this, we first create a dictionary of terms
## (using the function findFreqTerms) that we will use to filter the
## cleaned up training and testing corpora.
sms_dict <- findFreqTerms(sms_dtm_train, lowfreq=5)
sms_train <- DocumentTermMatrix(sms_corpus_clean_train, list(dictionary=sms_dict))
sms_test <- DocumentTermMatrix(sms_corpus_clean_test, list(dictionary=sms_dict))

## As a final step before using these sets, we will convert the numeric entries
## in the term matrices into factors that indicate whether the term is present
## or not.
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("Absent", "Present"))
}
sms_train <- sms_train %>% apply(MARGIN=2, FUN=convert_counts)
sms_test <- sms_test %>% apply(MARGIN=2, FUN=convert_counts)

## use Naive Bayes to train prediction models.
## using 10-fold cross validation

ctrl <- trainControl(method="cv", number=10)
set.seed(12358)
sms_model1 <- train(sms_train, sms_raw_train$type, method="nb",
                    trControl=ctrl)
sms_model1

## Model 2 with tuning parameter
# set.seed(12358)
# sms_model2 <- train(sms_train, sms_raw_train$type, method="nb",
#                     tuneGrid=data.frame(.fL=1, .usekernel=FALSE),
#                     trControl=ctrl)
# sms_model2

## Confusion matrix
sms_predict1 <- predict(sms_model1, sms_test)
cm1 <- confusionMatrix(sms_predict1, sms_raw_test$type, positive="spam")
cm1

# sms_predict2 <- predict(sms_model2, sms_test)
# cm2 <- confusionMatrix(sms_predict2, sms_raw_test$type, positive="spam")
# cm2

9. References

[1] http://en.wikipedia.org/wiki/Spamming

[2] http://cis-india.org/internet-governance/blog/breaking-down-section-66-a-of-the-it-act

[3] https://www.informs.org/About-INFORMS/What-is-Analytics

[4] http://sky-bits.com/wpsite/

[5] http://en.wikipedia.org/wiki/Analytics

[4] http://www.ijcaonline.org/

[5] http://www.ijcsi.org/

[6] https://rpubs.com/jesuscastagnetto/caret-naive-bayes-spam-ham-sms

[7] http://www.statsoft.com/Textbook/Support-Vector-Machines

[8] http://www.statsoft.com/Textbook/Naive-Bayes-Classifier

[9] http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

[10] https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

[11] http://www.grumbletext.co.uk/

[12] http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/

[13] http://etheses.bham.ac.uk/253/1/Tagg09PhD.pdf

[14] http://www.esp.uem.es/jmgomez/smsspamcorpus/
