

Text Mining and its Business Applications


Niladri_Biswas, 23 Sep 2014 CPOL



Introduction
Organizations today encounter textual data (both semi-structured and unstructured) while running their day-to-day
business. The sources of this data include electronic text, call center logs, social media, corporate documents, research
papers, application forms, service notes, emails, etc. This data may be accessible, but it often remains untapped, either
because the organization is unaware of the information wealth it possesses or because it lacks the methodology or
technology to analyze the data and extract useful insight.

Any form of information that an organization possesses, or can possess, is an asset, and the organization can gain insight
into its business by exploiting this information for decision making. This data may hold information about its customers,
partners and competitors. Data about customers can show how to provide better services and grow the customer base.
Data about partners can suggest how to maintain existing relationships and forge new and valuable ones. Data about
competitors can help the organization stay ahead of them. However, not all the data that an organization possesses is
tapped for these insights, because the major portion of it is not in structured form, and it is difficult to process with the
traditional methods used for structured data. Further, this sea of data, with its potential commercial, economic and societal
value, is expected to grow at a faster pace in the near future. It therefore becomes extremely important to use techniques
that can exploit this potential by uncovering the hidden value in this data. This is where text mining/analytics techniques
find their value and can help discover useful and interesting knowledge. Businesses use such techniques to analyze
customer and competitor data to improve competitiveness.

Abstract
Text mining is gaining importance due to the problem of discovering useful information in the data deluge that
organizations face today. This white paper presents a broad overview of text mining, its components and techniques, and
their use in various business applications. It describes text mining and the reasons for its increased importance over the
years, then presents a generic process framework for text mining, describes its different components and sub-components
and its business applications, and gives a brief description of the text mining tools available in the market.

Text mining is often considered to have originated from data mining; however, a few of its techniques have come from
various other disciplines like information science and information visualization. Text mining strives to solve the information
overload problem by using techniques from data mining, machine learning, natural language processing (NLP), information
retrieval (IR), information extraction (IE) and knowledge management (KM). Text mining involves the preprocessing of
document collections (text categorization, feature/term extraction, etc.), the storage of the intermediate representations,
techniques to analyze these intermediate representations (such as distribution analysis, clustering, trend analysis, and
association rules), and visualization of the results.

What is Text Mining?


Simply put, text mining is knowledge discovery from textual data: the exploration of textual data to uncover useful but
hidden information. However, many people have defined text mining slightly differently. The following are a few definitions:

The objective of Text Mining is to exploit information contained in textual documents in various ways, including discovery of
patterns and trends in data, associations among entities, predictive rules, etc. (Grobelnik et al., 2001).

Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown
information, or to answers for questions for which the answer is not currently known. (Hearst, 1999).

Text mining, also known as text data mining or text analytics, is the process of discovering high-quality information from
textual data sources. The application of text mining techniques to solve specific business problems is called business text
analytics, or simply text analytics. Text mining techniques can help organizations derive valuable business insight from
the wealth of textual information they possess.

Text mining transforms textual data into a structured format through the use of several techniques. It involves identification
and collection of the textual data sources; NLP techniques like part-of-speech tagging and syntactic parsing; entity/concept
extraction, which identifies named features like people, places, organizations, etc.; disambiguation; establishing relationships
between different entities/concepts; pattern and trend analysis; and visualization techniques.

Text Mining Framework


Figure 1 below depicts a generic text mining framework. The textual data is obtained from the various textual data sources.
Preprocessing techniques, centered on identification and extraction of features from the textual data, are then used to
transform the unstructured data into a more explicitly structured intermediate format. Text mining also uses techniques and
methodologies from other computer science disciplines concerned with managing natural language text, such as
information retrieval and information extraction. The knowledge discovery component generally includes the application of
pattern discovery and trend analysis algorithms to discover valuable information from the intermediate-format textual data.
The presentation layer component includes a GUI with pattern browsing facilities as well as tools for creating and viewing
trends and patterns.
Text Mining Framework Components
The different stages in the text mining framework are described below:

1. Textual Data Sources

Textual data is available in numerous internal and external data sources like electronic text, call center logs, social media,
corporate documents, research papers, application forms, service notes, emails, etc.

2. Preprocessing

Preprocessing begins with collecting data from the disparate data sources; this is the preliminary step of identifying the
textual information for mining and analysis. Preprocessing tasks then apply various feature extraction methods against the
data, using different types of techniques to transform the raw, unstructured data in its original format into a structured,
intermediate data format. Knowledge discovery operations are conducted against this structured intermediate data.

Preparing unstructured data in a structured data format requires different techniques than those of traditional data mining
systems, where knowledge discovery is performed against structured data sources. Various preprocessing techniques exist
and can be used in combination to create a structured data representation from raw textual data; different combinations of
techniques can be chosen based on the type of the raw text.

a. Text Cleansing

Text cleansing is the process of removing noise from text obtained from the textual sources. Noisy textual data can be
found in SMS messages, email, online chat, news articles, blogs and web pages. Such text may have spelling errors,
abbreviations, non-standard terminology, missing punctuation, misleading case information, as well as false starts,
repetitions, and special characters.

Noise can be defined as any kind of difference between the surface form of an electronic text and the original, intended or
actual text. The text used in short message service (SMS) messages and in online forums like Twitter, chat and discussion
boards and social networking sites is often distorted, mainly because recipients can understand the shorter forms of longer
words, and the shorthand also reduces the time and effort of the sender. Most text is created and stored so that humans
can understand it, and it is not always easy for a computer to process. With the increase in noisy text data generated in
various social communication media, cleansing of such text has become necessary, also because off-the-shelf NLP
techniques generally fail on such texts for several reasons, like sparsity, out-of-vocabulary words and irregular syntactic
structures.
A few of the cleaning techniques are:

Removing stop words (deleting very common words like "a", "the", "and", etc.).

Stemming (ways of combining words that have the same linguistic root or stem).

i. Removing stop words

Stop words are words which are filtered out before or after processing of textual data. There is no single definitive list of
stop words that all tools use, and some tools specifically avoid removing them in order to support phrase search. The most
common stop words found in text are the, is, at, which and on. Such stop words can sometimes cause problems when
searching for phrases that include them. Some search engines remove the most common words from the query in order to
improve performance.

ii. Stemming

Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form, generally a
written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related
words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in
computer science since 1968. Many search engines treat words with the same stem as synonyms as a kind of query
broadening, a process called conflation.

A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the
root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing",
"fished", "fish", and "fisher" to the root word, "fish". On the other hand, "argue", "argued", "argues", "arguing", and "argus"
reduce to the stem "argu" (illustrating the case where the stem is not itself a word or root) but "argument" and "arguments"
reduce to the stem "argument".

Stemming programs are commonly referred to as stemming algorithms or stemmers. There are several types of stemming
algorithms which differ in respect to performance and accuracy and how certain stemming obstacles are overcome.
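
A small stemming sketch with NLTK's Porter stemmer; the printed stems are what the Porter algorithm typically yields (different stemmers may disagree on borderline cases such as "fisher"):

```python
# Porter stemming sketch: reduce each word to its stem.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["fishing", "fished", "fish",
             "argue", "argued", "argues", "arguing", "argus",
             "argument", "arguments"]:
    print(word, "->", stemmer.stem(word))
# "fishing", "fished" and "fish" map to "fish"; the "argue"/"argus" family
# maps to the non-word stem "argu", while "argument(s)" maps to "argument".
```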

b. Tokenization

Tokenization is the process of breaking a piece of text into smaller pieces, such as words, phrases, symbols and other
elements, which are called tokens. Even a whole sentence can be considered a token. During the tokenization process some
characters, like punctuation marks, can be removed. The tokens then become input for other processes in text mining, like
parsing.

Tokenization relies mostly on simple heuristics in order to separate tokens by following a few steps:

1. Tokens or words are separated by whitespace, punctuation marks or line breaks.
2. Whitespace or punctuation marks may or may not be included, depending on the need.
3. All characters within contiguous strings are part of the token. Tokens can be made up of all alpha characters,
alphanumeric characters or numeric characters only.

Tokens themselves can also be separators. For example, in most programming languages, identifiers can be placed
together with arithmetic operators without white spaces. Although it seems that this would appear as a single word or
token, the grammar of the language actually considers the mathematical operator (a token) as a separator, so even when
multiple tokens are bunched up together, they can still be separated via the mathematical operator.

Tokenization is the first step in processing text; it is very difficult to extract useful high-level information from text without
identifying the tokens. Each token is an instance of a type, so the number of tokens is much higher than the number of
types. As an example, in the previous sentence there are two tokens spelled the. These are both instances of a type the,
which occurs twice in the sentence. Properly speaking, one should always refer to the frequency of occurrence of a type, but
loose usage also talks about the frequency of a token. It would be easy for a person familiar with the language to identify
the tokens in a stream of characters, but it is difficult for a computer to do so due to its lack of understanding of the
language, because some characters are sometimes considered token delimiters and sometimes not, depending on the
application.

The characters space, tab, and newline we assume are always delimiters and are not counted as tokens. They are often
collectively called white space. The characters ( ) < > ! ? " are always delimiters and may also be tokens. The characters
. , : - may or may not be delimiters, depending on their environment. A period, comma, or colon between numbers would
not normally be considered a delimiter but rather part of the number. Any other comma or colon is a delimiter and may be
a token. A period can be part of an abbreviation (e.g., if it has a capital letter on both sides). It can also be part of an
abbreviation when followed by a space (e.g., Dr.). However, some of these are really ends of sentences. The problem of
detecting when a period is an end of sentence and when it is not will be discussed later. For the purposes of tokenization, it
is probably best to treat any ambiguous period as a word delimiter and also as a token.

The apostrophe also has a number of uses. When preceded and followed by non-delimiters, it should be treated as part of
the current token (e.g., isn't or D'Angelo). When followed by an unambiguous terminator, it might be a closing internal
quote or might indicate a possessive (e.g., Tess'). An apostrophe preceded by a terminator is unambiguously the beginning
of an internal quote, so it is possible to distinguish the two cases by keeping track of opening and closing internal quotes.
A dash is a terminator and a token if preceded or followed by another dash. A dash between two numbers might be a
subtraction symbol or a separator (e.g., 555-1212 as a telephone number). It is probably best to treat a dash not adjacent
to another dash as a terminator and a token, but in some applications it might be better to treat the dash, except in the
double dash case, as simply a character.
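
The delimiter rules above can be approximated with a small regular-expression tokenizer. The sketch below is illustrative rather than exhaustive: it keeps numbers with internal periods, commas or dashes intact, treats apostrophes inside words as part of the token, and emits any other punctuation mark as a separate token.

```python
# Heuristic tokenizer sketch following the delimiter rules discussed above.
import re

TOKEN = re.compile(r"""
    \d+(?:[.,:\-]\d+)*   # numbers, including 555-1212 or 3.14
  | \w+(?:'\w+)?         # words, including isn't or D'Angelo
  | [^\w\s]              # any other single non-space character is a token
""", re.VERBOSE)

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Dr. Smith isn't at 555-1212, call after 3.14!"))
# ['Dr', '.', 'Smith', "isn't", 'at', '555-1212', ',', 'call', 'after', '3.14', '!']
```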

c. POS tagging

Part-of-speech tagging, also known as grammatical tagging or word-category disambiguation, is the process of assigning
to each word in a sentence a label corresponding to a particular part of speech, like noun, verb, pronoun, preposition,
adverb, adjective or another lexical class marker. The input to a tagging algorithm is a string of words of a natural language
sentence and a specified tagset (a finite list of part-of-speech tags). The output is a single best POS tag for each word.

Tags play an important role in natural language applications like speech recognition, natural language parsing and
information retrieval.

Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can
represent more than one part of speech at different times, and because some parts of speech are complex or unspoken.
In natural languages, unlike artificial languages, a large portion of word forms are ambiguous. For example, dogs, usually
thought of as a plural noun, can also be a verb: The sailor dogs the barmaid.

Performing grammatical tagging will indicate that "dogs" is a verb, and not the more common plural noun, since one of the
words must be the main verb, and the noun reading is less likely following "sailor". "Dogged", on the other hand, can be
either an adjective or a past-tense verb. Just which parts of speech a word can represent varies greatly.

Hidden Markov Models (HMMs) are among the earliest models used to disambiguate parts of speech.
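
A brief tagging sketch with NLTK's default tagger (the tokenizer and tagger models are assumed to have been downloaded, e.g. via nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')). Whether a statistical tagger actually resolves "dogs" as a verb here depends on its training data, so the output should be treated as illustrative:

```python
# POS tagging sketch: tag each token with its most likely part of speech.
import nltk

tokens = nltk.word_tokenize("The sailor dogs the barmaid.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('sailor', 'NN'), ('dogs', ...), ('the', 'DT'), ...]
# where the ideal tag for 'dogs' is VBZ (verb), not NNS (plural noun)
```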

d. Syntactical Parsing

Syntactical parsing is the process of performing syntactical analysis on a string of words, a phrase or a sentence according
to certain rules of grammar. Syntactical parsing discovers structure in the text and is used to determine whether a text
conforms to an expected format. It involves breaking the text into different elements and identifying the syntactical
relationships between them. The basic idea behind syntactical analysis is to create a syntactic structure, or parse tree, from
a sentence in a given natural language text, to determine how the sentence is broken down into phrases, how the phrases
are broken down into sub-phrases, and so on all the way down to the actual structure of the words used. In order to parse
natural language text, two basic kinds of grammar are used: constituency grammars and dependency grammars.

Constituency grammars help create the syntactical structure by breaking sentences into repetitive phrases or sequences of
syntactically grouped elements. Many constituency grammars make a distinction between noun phrases, verb phrases,
prepositional phrases, adjective phrases, and clauses. Each phrase may consist of zero or more smaller phrases or words
according to the rules of the grammar. Each phrase plays a different role in the syntactical structure of a sentence; for
example, a noun phrase may be labeled as the subject of the sentence.

Dependency grammars, on the other hand, help create the syntactical structure of a sentence based on direct one-to-one
relations between different elements or words. The dependency relation views the verb as the center of the syntactical
structure, with all other elements or words dependent on the verb directly or indirectly.
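
As a toy illustration of constituency parsing, the sketch below defines a tiny hand-written grammar of the kind just described and uses NLTK's chart parser to break a sentence into phrases; the grammar is invented for this example only:

```python
# Constituency parsing sketch with a toy context-free grammar.
import nltk

grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> Det N
  VP  -> V NP
  Det -> 'the'
  N   -> 'sailor' | 'barmaid'
  V   -> 'dogs'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the sailor dogs the barmaid".split()):
    tree.pretty_print()  # the sentence broken into NP and VP constituents
```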

e. Information Extraction
Information extraction identifies key phrases and relationships within textual data. This is done by a process called pattern
matching, which looks for predefined sequences in the text. Information extraction infers the relationships between all the
identified people, places and times in the text to extract meaningful information, and it can be very useful for handling
huge volumes of textual data. The meaningful information is collected and stored in data repositories for knowledge
discovery, mining and analysis. A few information extraction techniques are described below:

i. Topic tracking

A topic tracking system keeps track of users and their profiles and the documents a particular user views, and thereby finds
similar documents that may be of interest to the user. Such a system can let users identify particular categories they are
interested in, and can also infer users' interests from their reading history.

Topic tracking finds application in many business areas. With a topic tracking system in place, an organization can find
news related to its competitors and their products, which helps it keep track of competing products and market conditions
as well as its own business and products. In the medical industry, topic tracking can help medical professionals find new
treatments for illnesses and advances in medicine.

ii. Summarization

Text summarization, as the name suggests, is the creation of a summary of a detailed text. The most important part of
summarization is to reduce the size of the text without distorting its overall meaning and without eliminating the essential
points in the text. This lets a reader obtain the useful information from only the summarized portion of the text.

One of the most commonly used techniques in summarization is sentence extraction, which extracts the essential sentences
from the text by assigning a weight to each sentence, taking into account the position of the sentence and the key phrases
it contains.

Text summarization is very helpful when trying to figure out whether or not a lengthy document meets the user's needs and
is worth reading further. When humans summarize text, we generally read the entire selection to develop a full
understanding, and then write a summary highlighting its main points. With large texts, text summarization software can
process and summarize a document in the time it would take the user to read just the first paragraph.
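
A bare-bones sketch of sentence extraction: sentences are weighted by the frequency of their words and the top scorers are kept in their original order. The regex-based sentence splitter is a simplification; production summarizers also weight sentence position and key phrases, as noted above.

```python
# Frequency-based sentence extraction sketch.
from collections import Counter
import re

def summarize(text, max_sentences=2):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    # weight each sentence by the frequency of the words it contains
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in re.findall(r'\w+', s.lower())),
                    reverse=True)
    keep = set(scored[:max_sentences])
    return ' '.join(s for s in sentences if s in keep)  # original order
```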

iii. Categorization

Text categorization, also known as text classification, is the task of grouping a set of free-text documents into predefined
categories. This is done by identifying the main topics in the documents, which can be classified by subject and by other
attributes like document type, author, genre, etc.

Categorization does not process the actual information contained in the documents. Rather, it counts the words that
appear in the text and from these counts identifies the main topics that the document covers. Domain-specific dictionaries
are used in categorization to identify relationships by looking for synonyms and related terms. Categorization can also rank
documents by how much content they have on a particular topic.

Categorization can be applied in many business areas. For example, a company with a customer support unit that answers
customer queries on different topics can use categorization to classify documents by topic, and thereby access the relevant
information and answer user queries much more quickly.
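
As a sketch of this idea, the snippet below trains a scikit-learn classifier to route support texts into predefined topics from word counts; the tiny training set and the 'billing'/'login' categories are invented purely for illustration:

```python
# Text categorization sketch: word counts feed a Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["refund on my credit card charge", "card was charged twice",
        "cannot log in to my account", "password reset link is broken"]
topics = ["billing", "billing", "login", "login"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, topics)
print(model.predict(["I was double charged"]))  # likely ['billing']
```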

iv. Feature/Term Selection

A major difficulty of text categorization is the high dimensionality of the feature space, which consists of the unique terms
(words or phrases) that occur in the documents. Text document collections contain a great many such unique terms, often
tens or hundreds of thousands even for a moderately sized collection, and most of them are not useful for classification.
Feature selection methods, an essential part of text categorization, reduce the dimensionality of the datasets by removing
the features that are not required, which can make classification more effective and improve generalization error.

Thus, feature selection methods can be advantageous in reducing the size of the feature space and producing smaller
datasets, letting text classification algorithms work with lower computational requirements.
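
A minimal sketch of one such method, chi-squared feature selection with scikit-learn; docs (texts) and labels (their categories) are placeholders for a real labeled collection:

```python
# Feature selection sketch: keep only the k terms most associated
# with the class labels, shrinking the feature space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

X = TfidfVectorizer().fit_transform(docs)   # unique terms -> feature space
selector = SelectKBest(chi2, k=1000)        # keep the 1000 best terms
X_reduced = selector.fit_transform(X, labels)
```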
v. Entity Extraction

Entity extraction, also known as named entity recognition or named entity extraction, is a subtask of information extraction
that identifies and classifies atomic elements in text into predefined categories like people, places, organizations and
products. These are generally proper nouns and constitute the who and where. However, other named entities can also be
of interest, like dates, addresses, phone numbers and website URLs. The ability to extract these kinds of named entities can
be essential depending on what you are trying to achieve.

You can use a system with a statistical model to find the entities you are looking for, like people, places or organizations.
For example, organization names and individuals' names are proper nouns, and systems can make good guesses about
the type of a particular name: whether it is a place (Hilton Head), a person (Paris Hilton), or an organization (Hilton Hotels).
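
A short recognition sketch using NLTK's pre-trained chunker (model downloads such as nltk.download('maxent_ne_chunker') and nltk.download('words') are assumed). The printed labels are illustrative; pre-trained models can stumble on exactly the ambiguous "Hilton" cases mentioned above:

```python
# Named entity recognition sketch: tokenize, tag, then chunk entities.
import nltk

sentence = "Paris Hilton stayed at the Hilton Hotel on Hilton Head."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
for subtree in tree.subtrees():
    if subtree.label() != 'S':  # named entity chunks only
        print(subtree.label(), ' '.join(tok for tok, tag in subtree.leaves()))
# e.g. PERSON Paris Hilton / ORGANIZATION Hilton Hotel / GPE Hilton Head
```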

vi. Concept extraction

Concepts answer the question: what are the important concepts being used? A concept is a word or a phrase contained in
the text by which you can identify the context of the text collection. Identifying the concepts in the text is one way of
performing classification/categorization. Social media, technology and business are examples of concepts which can be
identified in text; for example, you can identify a conversation about technology, or a collection of texts discussing politics.
To determine whether a piece of text is actually about a particular concept or merely describes something related to that
concept, concept classifiers have scores associated with them.

There is a parent-child relationship between categories and concepts: a category can have many concepts associated with
it. For example, if Chemistry is a category, then atomic structure, chemical bonding, gases, etc., would be concepts
associated with it. So, by identifying concepts you can carry out an analysis of your company and find the broader context,
such as technology, in which your company is being talked about.

vii. Theme extraction

Themes are the main ideas in a document. Themes can be concrete concepts such as Oracle Corporation, jazz music,
football, England, or Nelson Mandela; themes can be abstract concepts such as success, happiness, motivation, or
unification. Themes can also be groupings commonly defined in the world, such as chemistry, botany, or fruit.

Themes are the noun phrases or words in the text, and theme extraction tells you the important words or phrases being
used; once extracted, themes are scored for contextual relevance. Themes differ from classifiers in that themes give you
the exact phrases or words being used, whereas classifiers identify broad topics.

Themes are useful for discovery purposes. Themes allow you to see that there is a new aspect to the conversation that may
be important to consider, one that your classifiers won't be able to catch.

Themes do a very good job of uncovering the actual context in the text. With the addition of contextual scoring information,
themes are even more useful for extracting important context from the text and for comparing similar pieces of text over a
period of time.
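
A rough theme extraction sketch: noun phrases are chunked with a simple NLTK tag-pattern grammar and ranked by frequency, a crude stand-in for the contextual relevance scoring described above:

```python
# Theme extraction sketch: chunk adjective+noun phrases and rank them.
import nltk
from collections import Counter

grammar = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")  # adjectives then nouns

def themes(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = grammar.parse(tagged)
    phrases = [' '.join(tok for tok, tag in st.leaves())
               for st in tree.subtrees() if st.label() == 'NP']
    return Counter(phrases).most_common()  # (phrase, frequency) pairs
```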

viii. Clustering

Clustering is the process of organizing objects into groups whose members are similar to each other in some way. A cluster
is therefore a collection of objects which are similar to one another and dissimilar to the objects in other clusters. Clustering
helps identify structure in a collection of unlabeled text.

The clustering technique is used to group similar documents in a collection, but it differs from categorization in that it
clusters documents on the fly rather than using predefined topics.

The clustering tools help users to narrow down the documents rapidly by identifying which documents are relevant and
which are not.

Clustering can be done with various algorithms that differ significantly in their notion of what constitutes a cluster and in
how to find clusters efficiently.
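
A compact clustering sketch with scikit-learn, where docs is a placeholder list of document strings and the cluster count is chosen arbitrarily; no predefined topics are supplied, the groups emerge from the data:

```python
# Document clustering sketch: TF-IDF vectors grouped with K-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

X = TfidfVectorizer(stop_words='english').fit_transform(docs)
kmeans = KMeans(n_clusters=5, random_state=0).fit(X)
print(kmeans.labels_)  # cluster id assigned to each document
```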
3. Knowledge Discovery (Mining and Analysis)
Preprocessing (information retrieval and information extraction), described in the previous section, is an essential
component of text mining for discovering knowledge. With information extraction we can uncover knowledge from the
identified entities and the relationships between them in the text collection with considerable accuracy. However, the
extracted information can be further analyzed using traditional mining techniques and algorithms to discover more useful
information. If the knowledge to be discovered is expressed directly in the text collection to be mined, then information
extraction alone can serve as an effective approach to knowledge discovery. However, if the text collection contains data
pertaining to reality rather than conceptual knowledge, then it may be useful to use information extraction to transform the
data into structured form, store it in a database, and then use traditional mining tools to identify the trends and patterns
in the extracted data.

While preprocessing tasks play an important part in transforming the raw unstructured textual data of a document
collection into a more manageable concept-level representation, the core functionality of a text mining system resides in
the analysis of concept co-occurrence patterns across the documents in a collection. Text mining systems rely on
algorithmic and heuristic approaches to consider distributions, frequent sets, and various associations of concepts at an
inter-document level, in an effort to enable a user to discover the nature and relationships of concepts as reflected in the
collection as a whole. For example, across various news articles you may find many articles on politician X and scandal.
This obviously indicates a negative image of politician X and alerts his managers, who can then launch a new public
relations campaign. As another example, you might encounter many articles on company Y and their product Z, which may
indicate a shift of focus in company Y's interests; this shift might be worth noting by its competitors. In another example, a
potential relationship between two proteins P1 and P2 can be identified from the pattern of:

a) several articles mentioning the protein P1 in relation to the enzyme E1,

b) a few articles describing functional similarities between enzymes E1 and E2 without referring to any protein names, and

c) several articles linking enzyme E2 to protein P2.

In all three of these examples, the information is provided not by any single document but by the totality of the collection.
Text mining methods of pattern analysis seek to discover co-occurrence relationships between concepts as reflected in the
totality of the corpus at hand.

In text mining, trend analysis relies on the date-and-time stamping of documents within a collection, so that comparisons
can be made between a subset of documents relating to one period and a subset relating to another. Trend analysis across
document subsets attempts to answer certain types of questions.

For instance,

1. What is the general trend of the news topics between two periods (as represented by two different document
subsets)?
2. Are the news topics nearly the same or are they widely divergent across the two periods?
3. Can emerging and disappearing topics be identified?
4. Did any topics maintain the same level of occurrence during the two periods?

As can be seen from the questions above, individual news topics are specific concepts in the document collection. Different
types of trend analytics attempt to compare the frequencies of such concepts (i.e., their number of occurrences) in
document subcollections from different time periods. Other types of analysis derived from data mining that can support
trend analysis are ephemeral association discovery and deviation detection.
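
A minimal sketch of such a frequency comparison between two date-stamped subsets; period_a and period_b are placeholder lists of tokenized documents, and the threshold is arbitrary:

```python
# Trend analysis sketch: compare concept frequencies across two periods.
from collections import Counter

freq_a = Counter(tok for doc in period_a for tok in doc)
freq_b = Counter(tok for doc in period_b for tok in doc)

for concept in set(freq_a) | set(freq_b):
    delta = freq_b[concept] - freq_a[concept]
    if abs(delta) > 10:  # illustrative threshold
        print(concept, "emerging" if delta > 0 else "disappearing", delta)
```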

The mining process in text mining systems is built around algorithms that facilitate the creation of queries for discovering
patterns in document collections. The mining component offers many ways of discovering patterns of concept occurrence
within a given document collection or subset of a collection. The three most common types of patterns encountered in text
mining are distributions (and proportions), frequent and near-frequent sets, and associations. Text mining systems can also
discover more than one type of pattern, so that users are able to toggle between displays of the different pattern types for
a given concept or set of concepts, providing the richest possible exploratory access to the underlying textual data.

4. Presentation/Visualization
Browsing is one of the key functionalities supported by a text mining system. Many text mining systems support dynamic,
content-based browsing, in which browsing is guided by the actual textual content of a particular document collection
rather than by anticipated or pre-specified structures. Browsing aids the user by presenting concept patterns graphically,
in the form of a hierarchy, to help organize concepts for investigation and analysis.

Browsing should also be navigational. Text mining systems present a user with extremely large sets of concepts extracted
from large document collections, so they must let the user move across these concepts, choosing either a big-picture view
of the collection or a drill-down into specific, possibly very sparsely identified concept relationships.

Text mining systems use visualization tools to facilitate navigation and exploration of concept patterns, and graphical
representations to express complex data relationships. Text mining systems today rely heavily on highly interactive graphic
representations of data that allow a user to drag, pull, click, or otherwise directly interact with the graphical representation
of concept patterns.

The presentation layer in a text mining system serves as a front end for executing knowledge discovery algorithms, and
therefore significant attention is given to providing a friendlier user interface with more powerful methods for executing
these algorithms. Such methods can necessitate developing dedicated query languages to support the efficient
parameterization and execution of specific types of pattern discovery queries.

Furthermore, text mining systems nowadays are designed to give users direct access to their query language interfaces.
Text mining front ends may also let a user cluster concepts with clustering tools, and create customized profiles for
concepts or concept relationships, in order to create a richer knowledge environment for interactive exploration.

Finally, some text mining systems let users create and manipulate refinement constraints, which aid in generating more
manageable and useful result sets for browsing, and in the creation, shaping, and parameterization of queries. The use of
such refinement constraints can be made much more user-friendly by incorporating graphical elements such as pull-downs,
radio boxes, or context- or query-sensitive pick lists.

5. Domains and Background Knowledge


Concepts in a text mining system belong not only to the descriptive attributes of a particular document but generally also
to domains. A domain, in relation to text mining, can be defined as a specialized area of interest for which dedicated
ontologies, lexicons, and taxonomies of information may be developed. Domains can cover very broad areas of subject
matter (e.g., biology) or more narrowly defined specialisms (e.g., genomics or proteomics). Other domain application areas
for text mining include financial services (with significant subdomains like corporate finance, securities trading, and
commodities), world affairs, international law, counterterrorism studies, patent research, and materials science.

Many text mining systems can use information from formal external knowledge sources for these domains to greatly
improve elements of their preprocessing, knowledge discovery, and presentation layer operations. In preprocessing tasks,
domain knowledge can be used to enhance concept extraction and validation activities, and access to background
knowledge can play an important role in developing more meaningful, consistent, and normalized concept hierarchies.
Advanced text mining applications, by relating features by way of lexicons and ontologies, can create fuller representations
of document collections in preprocessing operations and support enhanced query and refinement functionalities. In fact,
different components of a text mining system can make use of the information contained in the background knowledge.
Background knowledge is an important add-on to classification and concept-extraction methodologies and can also be
leveraged to enhance core mining algorithms and browsing operations. In addition, domain-oriented information serves as
one of the main bases for search refinement techniques; background knowledge may also be used to construct meaningful
constraints for knowledge discovery operations and to formulate constraints that allow users greater flexibility when
browsing large result sets.

Business Applications
Text mining can be used in the following business sectors:

1. Publishing and media.
2. Telecommunications, energy and other services industries.
3. Information technology sector and Internet.
4. Banks, insurance and financial markets.
5. Political institutions, political analysts, public administration and legal documents.
6. Pharmaceutical and research companies and healthcare.

We will describe a few of the business applications widely used in specific business areas.

a. Knowledge and Human Resource Management

The following are a few applications in this area:

i. Competitive Intelligence

Organizations today are very keen to know how they are performing in the market with respect to the products and services
they offer. They want to collect information about themselves in order to decide whether they need to reorganize and
restructure their strategies according to market demands and the opportunities the market presents. They are also
interested in collecting information about the market and their competitors, and they have to manage, process and analyze
huge collections of data to get useful insights and make new plans. The goal of competitive intelligence is to extract only
the relevant information from the various data sources: once the material is collected, it is classified into categories to build
a database, and the database is analyzed to answer specific and crucial questions for company strategy.

Typical queries concern products, competitors' sectors of investment, partnerships existing in markets, the relevant financial
indicators, and the names of employees of a company with a certain profile of competencies. Prior to having a text mining
system, an organization would have a department dedicated to the continuous monitoring of information (financial,
geopolitical, technical and economic) that answered queries from the different business areas manually. The process of
manually compiling documents according to a user's needs and preferences into actionable reports is very labor intensive,
and the effort is greatly amplified when reports need to be updated frequently. With the introduction of text mining
systems, the return on investment was evident when compared to the results previously achieved by manual operators.

ii. Human resource management

Text mining techniques are also used to manage human resources strategically, mainly in applications that analyze staff
opinions, monitor the level of employee satisfaction, and read and store CVs for the selection of new personnel. In the
context of human resource management, text mining techniques are often used to monitor the state of health of a
company by systematically analyzing informal documents.

b. Customer Relationship Management (CRM)

In the CRM domain, text mining is most widely used in areas related to the management and analysis of the contents of
clients' messages. This kind of analysis often aims at automatically rerouting specific requests to the appropriate service or
at supplying immediate answers to the most frequently asked questions. Services research has emerged as a greenfield
area for the application of advances in computer science and IT.

CRM practices, particularly contact centers (call centers) in our context, have emerged as hotbeds for the application of
innovations in the areas of knowledge management, analytics, and data mining. The sheer volume of unstructured text
documents produced from a variety of sources in today's contact centers has exploded, and companies are increasingly
looking to understand and analyze this content to derive operational and business insights. The customer, the end
consumer of products and services, is receiving increased attention.

Business analytics applications revolving around customers have led to the emergence of areas like customer experience
management, customer relationship management, and customer service quality. These are becoming critical to competitive
growth, and sometimes even survival. Applications with such customer focus are most evident in services companies,
especially CRM practices and contact centers.

c. Market Analysis

In market analysis, text mining is used mainly to monitor customers' opinions, identify new potential customers, analyze
competitors, and determine the organization's image by analyzing press reviews and other relevant sources. Most
organizations engage in telemarketing and e-mail activities to acquire new customers. With the introduction of text mining
systems, organizations are able to answer queries about more complex market scenarios.

Data mining technology has helped us extract useful information from various databases. Data warehouses turned out to
be successful for numerical information, but failed when it came to textual information. The 21st century has taken us
beyond a limited amount of information on the web; this is good in the sense that more information provides greater
awareness and better knowledge. Marketing knowledge is available on the web in the form of industry white papers,
academic publications relating to markets, trade journals, market news articles, reviews, and even public opinion when it
comes to customer requirements.

Text mining technology can help marketing professionals to use this information to get useful insights.

Market Analysis includes the following:

Where are the data sources for analysis?

Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.

Target marketing:

Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.

Determine customer purchasing patterns over time:

Conversion of single to a joint bank account: marriage, etc.

Cross-market analysis:

1. Associations/correlations between product sales
2. Prediction based on the association information
3. Financial planning and asset evaluation

d. Warranty or insurance claims, diagnostic medical interviews, etc.
In certain business areas, the bulk of the information available is in free textual form. For example, during a warranty or
insurance claim, the claimant is interviewed by an agent who notes all the details related to the claim in the form of a brief
description. Similarly, during a patient medical interview the attendant takes down a brief description of the patient's health
issues, and when you take your vehicle to the service station for repairs, the attendant notes the issues you highlight and
what needs to be repaired. These notes are then collected electronically and fed into text mining systems. This information
can be exploited to identify common clusters of problems and complaints on certain vehicles, for instance. Similarly, in the
medical field, useful information can be extracted from the collected open-ended descriptions of patients' disease
symptoms, which can be helpful in actual medical diagnosis.

e. Sentiment Analysis
Sentiment analysis, or opinion mining, is a natural language processing and information extraction task that helps extract
the positive or negative opinions and feelings expressed by a writer in a document collection. In general, the goal of
sentiment analysis is to determine the writer's outlook on various topics, or the overall contextual polarity of a document.
The writer's outlook may stem from the knowledge he or she possesses, his or her emotional state while writing, or the
emotional effect the writer intends to have on the reader.

Sentiments can be obtained at the document level, by classifying the polarity of the opinion expressed in the text of the
whole document, or at the sentence or entity/feature level, to find out whether the opinion expressed is positive, negative
or neutral. Sentiment classification can also be based on the emotional state expressed by the writer (glad, dejected,
annoyed) or on whether the opinions expressed are objective or subjective. Sentiment analysis identifies the phrases in a
text that bear some sentiment; since an author may state objective facts or subjective opinions, it is necessary to distinguish
between the two. It also finds the subject towards whom the sentiments are directed: a text may contain many entities, but
it is necessary to find the entity towards which the sentiments are directed. Finally, it identifies the polarity and degree of
the sentiment. Sentiments are classified as objective (facts), positive (denoting a state of happiness, bliss or satisfaction on
the part of the writer) or negative (denoting a state of sorrow, dejection or disappointment on the part of the writer).

Another way of capturing sentiments is a scoring method in which sentiments are given a score based on their degree of
positivity, negativity or objectivity. In this method a piece of text is analyzed, and a subsequent analysis of the concepts
contained in it is carried out to identify the sentiment-bearing words and how they relate to the concepts. Each concept is
then given a score based on the relation between the sentiment words and the associated concepts.
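
A toy lexicon-based scoring sketch; the word weights are invented for illustration, whereas real systems rely on curated lexicons or trained models:

```python
# Lexicon-based sentiment scoring sketch: sum per-word polarity weights.
LEXICON = {"good": 1, "great": 2, "happy": 1,
           "bad": -1, "terrible": -2, "disappointed": -2}

def sentiment_score(text):
    score = sum(LEXICON.get(w, 0) for w in text.lower().split())
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return label, score

print(sentiment_score("The food was great but the service was terrible"))
# ('neutral', 0) -- the positive and negative words cancel out
```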

Sentiment analysis, also called the voice of the customer, plays a major role in customer buying decisions. Internet usage
has risen exponentially in the past few years, and the fact that a large number of people share their opinions on the internet
is a motivating factor for using sentiment analysis commercially. Consumers share their attitudes, reactions and opinions
about businesses, products and services on social networking sites, and they are naturally influenced by the opinions
expressed in online resources like review sites, blogs, and social networks when making buying decisions. Sentiment
analysis can therefore be used in marketing to understand consumer attitudes and trends, in consumer markets for product
reviews, and in social media to gauge general opinion about recent hot topics.

Algorithms/Models for business applications


This section describes various algorithms/models used for some of the business applications.

a. Clustering algorithms

Clustering models can be used for customer segmentation: they analyze behavioral data, identify customer groups and
suggest a solution based on the data patterns. Clustering algorithms include:

i. K-means

This is an efficient, and perhaps the fastest, clustering algorithm that can handle both long (many records) and wide (many
data dimensions and input fields) datasets. It is a distance-based clustering technique in which the number of clusters to be
formed is predetermined and specified by the user in advance. Usually a number of different solutions should be tried and
evaluated before approving the most appropriate one. It is best suited to continuous clustering fields.
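
A short scikit-learn sketch of trying several cluster counts, as suggested above; X is a placeholder matrix of scaled, continuous behavioral fields (e.g., spend, frequency, recency):

```python
# K-means sketch: fit several values of k and compare cluster tightness.
from sklearn.cluster import KMeans

for k in range(2, 8):
    model = KMeans(n_clusters=k, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squares; lower means tighter
    print(k, model.inertia_)
```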

ii. TwoStep

As its name implies, this scalable and efficient clustering model processes records in two steps. The first step, pre-clustering,
makes a single pass through the data and assigns records to a limited set of initial sub-clusters. In the second step, the
initial sub-clusters are further grouped, through hierarchical clustering, into the final segments. It supports automatic
clustering: the optimal number of clusters can be determined automatically by the algorithm according to specific criteria.

iii. Kohonen networks / self-organizing maps

Kohonen networks are based on neural networks and typically produce a two-dimensional grid or map of the clusters,
hence the name self-organizing maps. Kohonen networks usually take a longer time to train than the K-means and
TwoStep algorithms, but they provide a different view on clustering that is worth trying.

b. Acquisition models

Acquisition models can be used to identify profitable prospective customers who have similar characteristics to those of the
already existing valuable customers.

c. Cross-sell and up-sell models

These models can be used to identify existing customers who have the purchasing potential to buy recommended similar
or upgraded products. Attrition models can be used to identify the customers who are highly likely to leave the
relationship.

d. Classification algorithms

Classification algorithms can be used for acquisition, cross-sell, up-sell and attrition models. They include:

i. Neural networks

Neural networks are powerful machine learning algorithms that use complex, nonlinear mapping functions for estimation
and classification.

These models estimate weights that connect predictors (the input layer) to the output. Input records with known outcomes
are presented to the network, and the model's predictions are evaluated against the observed results. The observed errors
are then used to adjust and optimize the initial weight estimates.

ii. Decision trees

Decision trees operate by recursively splitting the initial population. For each split they automatically select the most
significant predictor, the predictor that yields the best separation with respect to the target field. Through successive
partitions, their goal is to produce pure sub-segments, with homogeneous behavior in terms of the output. They are
perhaps the most popular classification technique. Part of their popularity is because they produce transparent results that
are easily interpretable, offering an insight into the event under study.
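
A brief scikit-learn sketch of the transparency just mentioned: the fitted tree's splits can be printed as human-readable rules. X (predictors) and y (target field) are placeholder training data:

```python
# Decision tree sketch: fit a shallow tree and print its splitting rules.
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree))  # e.g. "|--- feature_2 <= 0.50 ..."
```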

iii. Logistic regression

This is a powerful and well-established statistical technique that estimates the probabilities of the target categories. It is
analogous to simple linear regression but for categorical outcomes. It uses the generalized linear model and calculates
regression coefficients that represent the effect of predictors on the probabilities of the categories of the target field.
Logistic regression results are in the form of continuous functions that estimate the probability of membership in each
target outcome.
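
A minimal scikit-learn sketch of a logistic regression attrition (churn) model; X, churned and X_new are placeholder data:

```python
# Logistic regression sketch: estimate per-customer churn probabilities.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(X, churned)
# one probability per target category, in the order of model.classes_
print(model.predict_proba(X_new))
```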

iv. Bayesian networks

Bayesian models are probability models that can be used in classification problems to estimate the likelihood of
occurrences. They are graphical models that provide a visual representation of the attribute relationships, ensuring
transparency and offering an explanation of the model's rationale.

e. Association models
Association models can be used to identify related products that are typically purchased together, and to identify products
that can be sold together. Using association analysis, customers can be offered associated products when they buy a
particular product. Association algorithms include:

i. Apriori

Apriori is a classic algorithm for learning association rules. It is designed to operate on databases containing transactions
(for example, collections of items bought by customers, or the pages visited on a website). The purpose of the Apriori
algorithm is to find associations between different sets of data and to extract useful information from large amounts of
data. For example, the information that a customer who purchases a particular product also tends to buy an associated
product at the same time is captured in an association rule.
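
A toy Apriori-style sketch over an invented transaction set: frequent single items are found first, and only those items are considered when counting candidate pairs, which is the algorithm's key pruning idea:

```python
# Apriori-style sketch: frequent items, then frequent pairs built from them.
from itertools import combinations
from collections import Counter

transactions = [{"bread", "milk"}, {"bread", "butter", "milk"},
                {"beer", "bread"}, {"milk", "butter"}]
min_support = 2

items = Counter(i for t in transactions for i in t)
frequent_items = {i for i, c in items.items() if c >= min_support}

pairs = Counter(frozenset(p)
                for t in transactions
                for p in combinations(sorted(t & frequent_items), 2))
print({p for p, c in pairs.items() if c >= min_support})
# e.g. {frozenset({'bread', 'milk'}), frozenset({'butter', 'milk'})}
```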

ii. Sequence models

Sequence modeling techniques are used to identify associations of events/purchases/attributes over time. Sequence models
take into account the order of actions or purchases and can identify sequences of events, such as cases where certain things
happening in a specific order make a specific next event more probable. These techniques can also be used to predict the
next expected move of a customer.

Text mining tools


In this section we present the features, techniques and business applications of some of the commercial and open-source
text mining tools available in the market.

a. Commercial text mining tools

Text Mining
Features, Techniques and Applications
Tools
Angoss uses techniques such as entity and theme extraction, topic categorization, sentiment analysis and
document summarization. This tool merges the output of unstructured, text-based analysis with structured
data to provide additional predictive variables for improved predictive models and association analysis.
Angoss helps businesses discover valuable insight and intelligence from their data while providing clear and
Angoss
detailed recommendations on the best and most profitable opportunities to pursue to improve sales,
marketing and risk performance. Its application areas include: Customer Segmentation, Customer
Acquisition, Cross-Sell / Upsell Next-Best Offer, Channel performance, Churn / Loyalty to improve customer
retention and loyalty, Sales Productivity improvement etc.
This tool has the capability to extracts facts, relationships and sentiment from unstructured data and
provides social analytics and engagement applications for Social Customer Relationship Management. This
Attensity tool uses natural language processing technology to address collective intelligence in blogs, online forums
and social media, the voice of the customer in surveys and emails, Customer Experience Management,
e-services, research and e-discovery risk and compliance and intelligence analysis.
This tool uses clustering, categorization and pattern recognition (centered on Bayesian inference)
Autonomy
techniques. Application areas include enterprise search and knowledge management
This tool uses the techniques like words/tokens/phrases/entity search, entity extraction, entity translation
and NLP techniques for information retrieval, text mining and search engines. This tool uses artificial
Basis intelligence techniques to understand text written in different languages. Basis tools are widely used in
forensic analysis and help identify and extract clues from data storage devices like hard disks or flash cards,
as well as devices such as smart phones.
Clarabridge uses techniques like natural language (NLP), machine learning, clustering and categorization.
Clarabridge
This tool is widely used for CRM and sentiment analysis.
Cogito suite of products owned by Expert systems use techniques such as natural language search,
Cogito automatic categorization, data/metadata extraction and natural language processing. Application areas
include CRM, Product development, marketing etc.
IBM SPSS
IBM SPSS text analytics tool uses advanced NLP based techniques like multi-lingual sentiment, event and
fact extraction, categorization etc. SPSS is widely used for statistical analysis for social science. Its application
areas include market research, health research, surveys, marketing etc.
Inxight uses natural language processing, Information retrieval, categorization and summarization and
clustering techniques. This tool has the capability to indentify stems, parts of speech, and noun phrases. It
also identifies entities and grammatical patterns, such as facts, events, relations, and sentiment from text.
Inxight is used in the analysis of customer interactions in call centers and online customer chat sessions, This
Inxight(SAP)
analysis can uncover customer dissatisfaction and product and pricing issues earlier, resulting in faster,
proactive product changes and customer communications. Inxights text analytics is also being used to
uncover risk areas in email, such as private or sensitive data leaving an organization in violation of internal or
externally mandated policy.
Lexanlytics uses natural language processing techniques to extract entities (people, places, companies,
products, etc.), sentiment, quotes, opinions, and themes (generally noun phrases) from text. Lexalytics text
Lexalytics
analytics engine is used in Social Media Monitoring, Voice of Customer, Survey Analysis, pharmaceutical
research and development and other applications.
Megaputer
Megaputer provides techniques like linguistic and semantic information retrieval, clustering and categorization of documents, summarization, entity extraction, and visualization of patterns. Megaputer's application areas include: survey analysis, call center analysis, complaint analysis, competitive intelligence, market segmentation, cross-sell analysis, fraud detection, risk assessment, etc.
SAS Text Miner
SAS Text Miner is an add-on for the SAS Enterprise Miner environment. SAS uses information retrieval, information extraction, categorization and summarization techniques to extract useful information from text. SAS Text Miner's capabilities include: stemming; automatic recognition of multi-word terms; normalization of various entities such as dates, currencies, percentages, and years; part-of-speech tagging; extraction of entities such as organizations, products, Social Security numbers, times, titles, etc.; support for synonyms; and language-specific analysis. SAS Text Miner's application areas include: filtering e-mail; grouping documents by topic into predefined categories; routing news items; clustering analysis of research papers in a database, survey data, and customer complaints and comments; predicting stock market prices from business news announcements; predicting customer satisfaction from customer comments; and predicting costs based on call center logs.
VantagePoint
VantagePoint is desktop text mining software for discovering knowledge in virtually any structured text database. It uses natural language processing techniques to extract words/phrases and the relationships between them, and it uses co-word bibliometrics/co-occurrence statistics to find those relationships. VantagePoint enables you to quickly find WHO, WHAT, WHEN and WHERE, helping you clarify relationships and find critical patterns, turning your information into knowledge.
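The co-occurrence statistics that such relationship maps are built on are straightforward to sketch: count how often each pair of terms appears in the same record. The toy document set below is invented for illustration.

# A toy co-word/co-occurrence count over a tiny invented corpus,
# where each document is represented as its set of index terms.
from itertools import combinations
from collections import Counter

docs = [
    {"text", "mining", "clustering"},
    {"text", "mining", "sentiment"},
    {"sentiment", "clustering"},
]

cooc = Counter()
for terms in docs:
    # count every unordered pair of terms sharing a document
    for pair in combinations(sorted(terms), 2):
        cooc[pair] += 1

# pairs with the highest counts suggest the strongest relationships
for pair, count in cooc.most_common(3):
    print(pair, count)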
DiscoverText
DiscoverText is a cloud-based, collaborative text analytics solution with the capability to generate valuable insights about customers, products, employees, news, citizens, and more. With dozens of powerful text mining features, the DiscoverText software provides tools to quickly and accurately make better decisions. DiscoverText's concept extraction and unique active learning can handle a sea of social media, thousands of survey responses, streams of customer service requests, e-mail, or other electronic text.
Eaagle
Eaagle is a software company providing leading text mining technology to CRM, marketing and research professionals. Eaagle is an online service that automatically and objectively analyzes and categorizes verbatims, without any prerequisites, and creates automatic reports such as charts, word clouds, and an exclusive mobile-browser-compatible report that clients can view on an iPad or smartphone. Eaagle Full Text Mapper automatically maps data and enables you to analyze sets of full-text data by topic, as well as generate customized reports.

b. Open source text mining tools

GATE
GATE (General Architecture for Text Engineering) is an open-source toolbox for natural language processing and language engineering. GATE uses information extraction and machine learning techniques to extract useful information from text. GATE's information extraction component, called ANNIE, consists of a tokenizer, a gazetteer, a sentence splitter, a part-of-speech tagger, a named-entity transducer and a coreference tagger. A few of its application areas include drug research, cancer research, recruitment and decision support.
RapidMiner
RapidMiner's Text Extension adds all the operators necessary for statistical text analysis. You can load texts from different data sources or from your data sets, transform them with a large set of filtering techniques, and finally analyze your text data. The Text Extension supports several text formats, including plain text, HTML, and PDF. It also provides standard filters for tokenization, stemming, stopword filtering, and n-gram generation.
OpenNLP
The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.
Carrot2
Carrot2 is an open source search results clustering engine. It can automatically cluster small collections of documents, e.g. search results or document abstracts, into thematic categories. Apart from two specialized search results clustering algorithms, Carrot2 offers ready-to-use components for fetching search results from various sources.
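Carrot2 itself is a Java engine with its own specialized algorithms, but the underlying idea of grouping short snippets into thematic clusters can be sketched with generic TF-IDF vectors and k-means in scikit-learn; the snippets below are invented, and this illustrates the concept rather than Carrot2's own method.

# A generic snippet-clustering sketch (TF-IDF + k-means); the search
# snippets are invented and mix two senses of the word "jaguar".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

snippets = [
    "jaguar speed and habitat in the rainforest",
    "jaguar the big cat of south america",
    "jaguar car dealership and new models",
    "used jaguar cars for sale near you",
]

X = TfidfVectorizer(stop_words="english").fit_transform(snippets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for label, snippet in sorted(zip(labels, snippets)):
    print(label, snippet)  # the two senses usually land in separate clusters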
NLTK
NLTK (the Natural Language Toolkit) is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) in the Python programming language. NLTK includes graphical demonstrations and sample data. It is intended to support research and teaching in NLP and closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. NLTK has been used successfully as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems.
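A minimal NLTK session ties together several preprocessing steps discussed in this paper (tokenization, n-gram generation, stopword filtering, stemming, part-of-speech tagging and named-entity chunking); the sample sentence is invented, and the download package names may vary slightly between NLTK versions.

# A minimal NLTK preprocessing pipeline over an invented sentence.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# one-off downloads of the required models and corpora
for pkg in ["punkt", "stopwords", "averaged_perceptron_tagger",
            "maxent_ne_chunker", "words"]:
    nltk.download(pkg)

text = "Apple opened a new support center in Austin last year."

tokens = nltk.word_tokenize(text)                       # tokenization
bigrams = list(nltk.ngrams(tokens, 2))                  # n-gram generation
content = [t for t in tokens
           if t.lower() not in stopwords.words("english")]  # stopword filtering
stems = [PorterStemmer().stem(t) for t in content]      # stemming
tagged = nltk.pos_tag(tokens)                           # part-of-speech tagging
entities = nltk.ne_chunk(tagged)  # tree with e.g. ORGANIZATION/GPE nodes

print(stems)
print(entities)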
R
The programming language R provides a framework for text mining applications in the tm package.


Conclusion
Text mining is a growing technology area that is still in its early days and has its own inherent complexities, as is true of any emerging technology before its terms and concepts are standardized. There is no accepted, definitive description of what it should cover, because it spans many different techniques for handling different problems in the text under consideration. Likewise, the text mining tools available in the market vary widely and each takes a slightly different path. Some fundamental text mining techniques, such as entity extraction, extraction of relationships between entities, categorization, classification and summarization, have undergone plenty of research and study and are adept at uncovering useful information from plain text. However, the sheer volume of information available on the internet presents further challenges and opportunities, and more research and study needs to be done in this area. Since text mining is also considered a sibling of data mining, some of the major vendors that already have data mining capabilities are combining text mining with data mining to extend the value of knowledge discovery from data. Automatic text mining techniques have a long way to go before they equal the ability of people to discover knowledge from textual data, even without using any specific domain knowledge.

