Data Mining
Rajagopal Sukumar Cognizant Technology Solutions
Agenda
What is Data Mining ?
Data Mining Techniques
Data Mining Process
Our work in Data Mining
Tools available in the market
What is Data Mining ?
Data mining is the search for relationships and global patterns that exist in large databases but are 'hidden' among the vast amount of data. These relationships represent valuable knowledge about the database and the objects in it and, if the database is a faithful mirror, about the real world it records.
What is Data Mining ?
The analogy with the mining process is described as:

Data mining refers to "using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in the areas such as decision support, prediction, forecasting and estimation. The data is often voluminous, but as it stands of low value as no direct use can be made of it; it is the hidden information in the data that is useful"
Why do we need Data Mining ?
We need it because everybody needs it!
To uncover strategic competitive insight to drive market share and profits
What can we do with our data ?

Derive Quantitative Information
How many people bought our products last month ?
Explain Past Results

Why have the monthly sales of our products declined sharply ?
Discover Hidden Patterns

Households with a male head of household (HOH) are more likely to have both cats and dogs than those with a female HOH. The actual ratio is 7:3.
Predict Future Results

So the households in our customer base that have a male head of household are likely to have both cats and dogs. If we are a pet food supplier, think about the value of this prediction!
Transforming Data
Data -> Facts/Information -> Knowledge -> Recommendations/Decisions
OLAP Vs. Data Mining

                     OLAP               Data Mining
Focus                Summary data       Detail data
Dimensions           Limited            Lots
No. of attributes    Total in the tens  Hundreds
Size of datasets     Small to medium    Millions
Analysis             Deductive          Predictive
Technique            Slice and dice     Automatic discovery
State of technology  Mature             Mature in statistical analysis, emerging in knowledge discovery
Data Mining Methods
Decision Trees
Case Based Reasoning
Neural Networks
Genetic Algorithms
Linear and Non Linear Regression Analysis
Decision Tree
ToyType  Buyersex  Sales month  Location  Qty
Car      Boys      Jan          FL        50,000
Car      Boys      Jan          GA        10,000
Doll     Girls     Feb          FL        20,000
Doll     Girls     Feb          CA        15,000
Car      Boys      Mar          NY        20,000
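[Figure: decision tree built from the table above. Recoverable node labels: Jan, Feb (sales month); Boys, Girls (buyer sex); Car (toy type); FL 50,000 and GA 10,000 (location and quantity). Leaves are ordered from highest to lowest quantity.]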
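One common way to grow such a tree is to split first on the attribute whose groups have the most uniform sales quantities. The sketch below uses the five rows of the toy sales table and picks that attribute by weighted variance. The split criterion and the function names are assumptions made for illustration, not necessarily the method behind the tree on the slide.

```python
from statistics import pvariance

# Rows of the toy sales table: (toy type, buyer sex, month, location, quantity).
ROWS = [
    ("Car", "Boys", "Jan", "FL", 50_000),
    ("Car", "Boys", "Jan", "GA", 10_000),
    ("Doll", "Girls", "Feb", "FL", 20_000),
    ("Doll", "Girls", "Feb", "CA", 15_000),
    ("Car", "Boys", "Mar", "NY", 20_000),
]
# Column index of each candidate split attribute.
ATTRIBUTES = {"toy_type": 0, "buyer_sex": 1, "month": 2, "location": 3}


def weighted_variance(rows, column):
    """Variance of quantity inside each group, weighted by group size."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    return sum(len(g) * pvariance(g) for g in groups.values()) / len(rows)


def best_split(rows):
    """Return the attribute whose groups have the most uniform quantities."""
    return min(ATTRIBUTES, key=lambda name: weighted_variance(rows, ATTRIBUTES[name]))


ROOT_ATTRIBUTE = best_split(ROWS)
```

On this five-row table the chosen attribute is the location, largely because most locations occur only once. Practical tree builders penalize such many-valued attributes, and each resulting group would be split again in the same way.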
Case based Reasoning (CBR)
Finds the closest situation that occurred in the past and adopts the solution that worked then
Disadvantage: CBR systems do not create rules or models summarizing past experience
Example: help desk support systems
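The retrieval step of a help desk CBR system can be sketched as a nearest-case lookup. The case base, the keyword sets, and the use of Jaccard similarity below are all hypothetical choices for illustration.

```python
# Hypothetical help desk case base: (symptom keywords, solution that worked).
CASE_BASE = [
    ({"printer", "offline", "network"}, "Restart the print spooler"),
    ({"password", "expired", "login"}, "Reset the password"),
    ({"screen", "blank", "monitor"}, "Check the video cable"),
]


def similarity(a, b):
    """Jaccard similarity: shared keywords divided by all keywords."""
    return len(a & b) / len(a | b)


def solve(symptoms):
    """Reuse the solution of the most similar past case."""
    _, solution = max(CASE_BASE, key=lambda case: similarity(case[0], symptoms))
    return solution


ADVICE = solve({"login", "password", "locked"})
```

Note that nothing is generalized here: the system only stores and compares cases, which is the disadvantage named above.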
Neural Networks
Mimic the way learning occurs in the brain
Used extensively in the business world as predictive models
Each neuron takes many inputs and generates an output that is a non-linear function of the weighted sum of its inputs
Neural Networks
[Figure: neural network with inputs Toy Type, Buyer Sex, Quantity, Sale Month, and Location; nodes n1 to n4; outputs Good and Bad]
Neural Networks
y = Good or Bad
y = w1*n1 + w2*n2 + w3*n3 + w4*n4
The weights w1..w4 can be calculated by backpropagation: the net is trained on known values of y and the inputs
The trained net can then be used for predictions
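A minimal sketch of this training loop for a single output neuron follows. It assumes a sigmoid as the non-linear function, encodes Good as 1 and Bad as 0, and uses made-up node values n1..n4; with only one layer, backpropagation reduces to the plain gradient update shown.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, inputs):
    """Non-linear function of the weighted sum w1*n1 + ... + w4*n4."""
    return sigmoid(sum(w * n for w, n in zip(weights, inputs)))


def train(samples, epochs=500, rate=0.5):
    """Batch gradient updates on the weights, using known outputs y."""
    weights = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for inputs, target in samples:
            error = target - predict(weights, inputs)
            for i, n in enumerate(inputs):
                gradient[i] += error * n
        weights = [w + rate * g for w, g in zip(weights, gradient)]
    return weights


# Made-up training data: (n1, n2, n3, n4) and y, where 1 = Good and 0 = Bad.
SAMPLES = [
    ([1, 0, 0, 1], 1),
    ([1, 1, 0, 0], 1),
    ([0, 1, 1, 0], 0),
    ([0, 0, 1, 1], 0),
]
WEIGHTS = train(SAMPLES)
```

After training, `predict(WEIGHTS, inputs)` returns a value above 0.5 for inputs that resemble the Good samples and below 0.5 for those that resemble the Bad ones.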
Genetic Algorithms
Mimic the evolutionary process of natural selection
A fitness function determines which solutions are better fits
Genetic operations (mutation and mating) are then performed to generate more solutions
Currently in research mode rather than in practical applications
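The loop of fitness evaluation, mating, and mutation can be sketched on a toy problem. Everything below is an illustrative assumption: candidates are bit strings, the fitness function simply counts 1s, and the population size, mutation rate, and seed are arbitrary.

```python
import random


def fitness(bits):
    """Toy fitness function: the number of 1s in the candidate."""
    return sum(bits)


def evolve(length=20, population_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Return the fittest candidate after the given number of generations."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]  # selection: keep the fitter half
        children = [population[0]]                    # elitism: the best survives unchanged
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, length)            # mating: one-point crossover
            child = mother[:cut] + father[cut:]
            # mutation: flip each bit with a small probability
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children
    return max(population, key=fitness)


BEST = evolve()
```

Because the best candidate is always carried over, the best fitness never decreases from one generation to the next.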
Linear and Non-Linear Regression
Searches for a dependence of the target variable on other variables, expressed as a function of some predetermined polynomial form
Quantity = A*Buyer Sex + B*Location + C*Month (this is linear!)
Solving this equation for A, B, and C using the available data yields a predictive model
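Solving for A, B, and C can be sketched with ordinary least squares. The example assumes the three variables have already been encoded as numbers, and the data rows are made up so that they follow A = 2, B = 3, C = 5 exactly; the helper names are illustrative.

```python
def solve_linear_system(matrix, vector):
    """Gauss-Jordan elimination with partial pivoting for a small square system."""
    n = len(vector)
    rows = [list(matrix[i]) + [vector[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]


def fit(features, targets):
    """Least-squares coefficients from the normal equations (X^T X) w = X^T y."""
    k = len(features[0])
    xtx = [[sum(row[i] * row[j] for row in features) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(features, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up, numerically encoded data: (buyer sex, location code, month number).
FEATURES = [(1, 1, 1), (0, 2, 1), (1, 3, 2), (0, 1, 3), (1, 2, 3)]
QUANTITIES = [10, 11, 21, 18, 23]  # generated from 2*sex + 3*location + 5*month
A, B, C = fit(FEATURES, QUANTITIES)


def predict_quantity(sex, location, month):
    """Use the fitted coefficients as the predictive model."""
    return A * sex + B * location + C * month
```

Treating category codes such as location as plain numbers is a simplification; real studies usually encode each category as its own indicator variable.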
Usage
Clustering
Grouping data into disjoint sets that are similar in some respect. It also attempts to place dissimilar data in different clusters.
For example, with supermarket data, clustering sale items to organize shelf space effectively is a typical application.
Clustering algorithms typically use a distance function to separate data.
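The role of the distance function can be seen in a minimal k-means sketch. The algorithm choice, the item data (price, weekly units sold), and the starting centers are assumptions for illustration.

```python
def kmeans(points, centers, iterations=10):
    """Assign each point to its nearest center, then recompute the centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Squared Euclidean distance from the point to every center.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return clusters, centers


# Hypothetical supermarket items: (price, weekly units sold).
ITEMS = [(1.0, 90), (2.0, 85), (1.5, 95), (20.0, 5), (22.0, 8), (25.0, 4)]
CLUSTERS, CENTERS = kmeans(ITEMS, centers=[ITEMS[0], ITEMS[3]])
```

Here the cheap, fast-selling items end up in one cluster and the expensive, slow-selling items in the other.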
Usage
Classification
Classifies data into distinctive groups
For example, people can be classified as babies, children, teenagers, adults, or elderly. An age of two years or younger can be mapped to babies. Once the data is classified, the traits of each group can be summarized.
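The age example can be sketched as a simple rule-based classifier followed by a summary. Only the cutoff for babies (two years or younger) comes from the text; the other age limits and the sample ages are assumptions.

```python
from collections import Counter


def classify_age(age):
    """Map an age in years to a group; only the 'baby' cutoff is from the text."""
    if age <= 2:
        return "baby"
    if age <= 12:      # assumed cutoff
        return "child"
    if age <= 19:      # assumed cutoff
        return "teenager"
    if age <= 64:      # assumed cutoff
        return "adult"
    return "elderly"


# Made-up ages; once classified, each group can be summarized (here: counted).
AGES = [1, 2, 7, 15, 34, 41, 70]
GROUP_SIZES = Counter(classify_age(age) for age in AGES)
```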
Usage
Deviation Detection
Extracting anomalies or deviations in the data
An anomaly may reveal a new fact of great interest
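One simple way to flag deviations is to measure how many standard deviations each value lies from the mean. The z-score rule, the threshold of 2, and the sales figures below are assumptions for illustration.

```python
from statistics import mean, pstdev


def find_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)
    return [v for v in values if spread and abs(v - center) / spread > threshold]


# Made-up monthly sales with one unusual month.
MONTHLY_SALES = [100, 102, 98, 101, 99, 250]
OUTLIERS = find_deviations(MONTHLY_SALES)
```

The flagged value is only a candidate: whether it is an error or a new fact of interest still has to be judged by someone who knows the data.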
Usage
Association Rules
Extracting associations between data items. Can be used to predict the value of one object based on the value of another.
Question: which model identifies the most predictive characteristics of people who buy toy pickup trucks?
Answer: during summer vacation, single-parent families with certain income levels buy toy pickup trucks.
Association Rules
70% of customers who order pens and pencils also order writing tablets
If writing tablets are high-margin items, discover all associations that have writing tablets as a consequent
If pencils are low-margin items, discover all associations that have pencils as an antecedent, to determine the impact of discontinuing pencils
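A figure such as the 70% above is the confidence of a rule: among the orders that contain the antecedent, the share that also contain the consequent. The sketch below computes confidence and support for one rule over a small set of made-up orders.

```python
# Made-up orders; each order is the set of items it contains.
ORDERS = [
    {"pen", "pencil", "writing tablet"},
    {"pen", "pencil", "writing tablet"},
    {"pen", "pencil"},
    {"pen", "writing tablet"},
    {"pencil", "eraser"},
    {"pen", "pencil", "writing tablet", "eraser"},
]


def support(items, orders):
    """Share of all orders that contain every item in `items`."""
    return sum(items <= order for order in orders) / len(orders)


def confidence(antecedent, consequent, orders):
    """Share of orders with the antecedent that also contain the consequent."""
    return support(antecedent | consequent, orders) / support(antecedent, orders)


RULE_SUPPORT = support({"pen", "pencil", "writing tablet"}, ORDERS)
RULE_CONFIDENCE = confidence({"pen", "pencil"}, {"writing tablet"}, ORDERS)
```

Fixing the consequent (writing tablets) or the antecedent (pencils) and scanning all candidate rules gives the two searches described above.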
Data Mining Process
Data Preparation
Most important phase: garbage in, garbage out (GIGO)!
Defining a Study
Reading the data and building a model
Understanding the model
Prediction
Data Preparation
Data Cleansing
Inconsistencies
The toy types "soft" and "plush" mean the same thing
Stale Data
Address changes are not reflected correctly
Typographical Errors
Words are misspelled or typed incorrectly
Missing Values
Tough problem to address
Data Cleansing - Missing Values
Treatment of missing numeric values is more difficult

Artificial assignment changes the distribution and statistics of the field
Assign using average values
Segment the data using another variable and assign segment averages
Build a model and impute the missing values (the best method)
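The segment-average approach can be sketched as follows. The record layout, segment names, and values are made up, and the function name is illustrative.

```python
from statistics import mean


def impute_by_segment(records):
    """Replace each missing value (None) with the average of its segment."""
    known = {}
    for segment, value in records:
        if value is not None:
            known.setdefault(segment, []).append(value)
    averages = {segment: mean(values) for segment, values in known.items()}
    return [(s, averages[s] if v is None else v) for s, v in records]


# Made-up records: (segment, monthly spend); None marks a missing value.
RECORDS = [("urban", 40), ("urban", None), ("urban", 60), ("rural", 20), ("rural", None)]
FILLED = impute_by_segment(RECORDS)
```

A segment with no known values at all would raise a `KeyError` in this sketch and needs a fallback, such as the overall average.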
Data Transformation
Ratio Variables
Time derivatives
Discretization using quantiles
Discretization using other mathematical transforms
Ratio Variables
Customer  Beanie Baby profit  Virtual Pet profit  Toy Gun profit  Total profit  Duration in months  Profit/Duration for Beanie Baby
10        50                  30                  20              100           10                  5
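The ratio variable can be recomputed from the figures that can be read from the table above: profits of 50, 30, and 20 over a duration of 10 months. The dictionary keys are illustrative names.

```python
# Figures read from the slide's table; the key names are illustrative.
customer = {
    "beanie_baby_profit": 50,
    "virtual_pet_profit": 30,
    "toy_gun_profit": 20,
    "duration_months": 10,
}

total_profit = (
    customer["beanie_baby_profit"]
    + customer["virtual_pet_profit"]
    + customer["toy_gun_profit"]
)
# Ratio variable: Beanie Baby profit per month of the customer relationship.
beanie_profit_per_month = customer["beanie_baby_profit"] / customer["duration_months"]
```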
Time Derivatives
Variation of data over time is very important to understand
For example, toy sales time series = toy sales of current month - toy sales of previous month
Cyclic association rules can be identified
Monthly sales of goods may have different correlations depending on the season
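The month-over-month difference defined above can be computed in one pass. The sales figures are made up.

```python
def monthly_change(sales):
    """Sales of the current month minus sales of the previous month."""
    return [current - previous for previous, current in zip(sales, sales[1:])]


# Made-up toy sales for four consecutive months.
TOY_SALES = [120, 135, 150, 140]
CHANGES = monthly_change(TOY_SALES)
```

The result has one entry fewer than the input, because the first month has no previous month to compare against.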
Discretization using quantiles
Discretization of numeric data using quantiles is a very good way to normalize data. Makes the data easier to interpret.
For example, the quantile break points for toy sales quantity could be the 10th, 25th, 50th, 75th, and 90th percentiles.
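A sketch of this discretization follows, reading the break points 10, 25, 50, 75, and 90 as percentiles. It uses Python's `statistics.quantiles` with its default interpolation method; the sales quantities are made up.

```python
from bisect import bisect_right
from statistics import quantiles

PERCENTILES = (10, 25, 50, 75, 90)


def quantile_bins(values):
    """Replace each value with the index (0..5) of the percentile band it falls in."""
    cuts = quantiles(values, n=100)  # 99 cut points: cuts[p - 1] is the p-th percentile
    breaks = [cuts[p - 1] for p in PERCENTILES]
    return [bisect_right(breaks, v) for v in values]


# Made-up toy sales quantities: 100, 200, ..., 2000.
QUANTITIES = [100 * i for i in range(1, 21)]
BINS = quantile_bins(QUANTITIES)
```

Each value becomes a small integer, which is easier to interpret and less sensitive to extreme values than the raw quantity.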
Discretization using other mathematical transforms
Range transformations
Logarithmic transforms

Used for highly skewed distributions
Polynomial transforms

Used to linearize a variable if the data is continuously distributed
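The first two transforms can be sketched briefly. The example assumes a min-max rescaling as the range transformation and a base-10 logarithm for the skewed data; the values are made up.

```python
import math


def range_transform(values):
    """Rescale values linearly so that the smallest maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def log_transform(values):
    """Compress a highly skewed, strictly positive variable with log10."""
    return [math.log10(v) for v in values]


# Made-up, highly skewed sales quantities.
SKEWED = [1, 10, 100, 1000, 10000]
SCALED = range_transform(SKEWED)
LOGGED = log_transform(SKEWED)
```

After the logarithmic transform the values are evenly spaced, whereas the range transform alone leaves four of the five values crowded near zero.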
Data Mining Process
Choose the study

Classification/Clustering
Deviation Detection
Affinity Analysis

Run the algorithm on the prepared data
Analyze the outputs
Make decisions
Our Approach
Demystification of Data Mining
Built a Windows-based prototype to demonstrate decision trees
Working on adding a module to our ad hoc query generator, Extempore
Sample Study
Subject field: I want to understand what makes certain types of customers buy more
Associated fields:
Is it related to their salary levels ?
Or is it related to their age ?
Or is it related to their sex ?
Demonstration of the Prototype
What is Extempore ?
EXTract M204 and Process On REquest
Generates native M204 UL code
Reports generated on multiple M204 files without any M204 coding
Complex report formatting with the help of reporting tools like InfoMaker
Provides a user-friendly GUI
Dynamically generates customized reports
What is Extempore ?
Structured user interface
Point & click methodology
Limited M204 knowledge required to use
Quick access to M204 data
Reports can be copied/saved and reused
Data retrieved can be saved in formats like Excel, CSV, or HTML tables to be used by other systems
Online & batch modes of execution
Extempore Architecture
Sybase routes the client RPC to M204
RPC to Sybase, and results from the RPC to the client
Hidden connection from M204 to Sybase to read the report specification
[Figure: architecture diagram with components labeled CT LIB and JANUS]
Tools in the market
IBM Intelligent Miner
Data Mind Corp's Data Mind Professional Edition
Angoss Software's Knowledge Seeker
NeuralWare's NeuralWorks Predict
Pilot Software's Discovery Server
Red Brick Systems' Data Mine
Thinking Machines Corp's Darwin
Web sites
Excellent reference sites

http://www.thearling.com
http://www.kdnuggets.com
Source code sites

C4.5 Decision Tree Algorithm
http://ftp.cs.su.oz.au/pub/ml/
OC1 Decision Tree Algorithm

http://www.cs.jhu.edu/
Thank You !
