Data Mining

Data Mining - SESAP ZC425 - CS1
Module 1
Mrs. Preeti NG
Dept of Computer Science and Engg, BITS Pilani Bangalore
BITS Pilani, Pilani Campus
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Course Description – Module Structure
M1 Introduction to Data Mining
M2 Data Preprocessing:
To understand the need for data preprocessing and various techniques used
in the context of Data Mining
M3 Data Exploration:
A preliminary exploration of the data to better understand its characteristics
M4 Classification and prediction:

To learn different techniques and algorithms for classification, a major
predictive and supervised Data Mining task
M5 Association Analysis:
To understand the descriptive relation between the entities by identifying
associations among them and to learn various algorithms to find them
M6 Clustering:
To learn different techniques and algorithms for clustering, a major
descriptive and unsupervised Data Mining task
M7 Anomaly Detection:
Detecting outliers and noise in data sets is an important Data Mining task.
This module focuses on techniques needed for anomaly detection
M8 Data Mining on unstructured (Big) data:
Graph Mining, Social Network Analysis, Multimedia Data Mining, Text Mining,
Mining the World Wide Web
M9 Data Mining Applications:
Recommendation Systems, Fraud Detection, Sentiment Analysis
Prerequisites
• ?
BITS Pilani, Pilani Campus
Motivation
• ?
Books
Prescribed Text Books
T1- Tan P. N., Steinbach M. & Kumar V., “Introduction to Data Mining”, Pearson Education, 2006
T2- Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Second Edition, Morgan Kaufmann Publishers, 2006
Reference Books
1. Vijay Kotu and Bala Deshpande, “Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner”, Morgan Kaufmann Publishers, 2015
2. Gary Miner et al., “Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications”, Academic Press, 2012
3. Nikos Manouselis, Hendrik Drachsler, Katrien Verbert and Erik Duval, “Recommender Systems for Learning”, Springer, 2013
4. Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”
Lecture 1 Topics
1. Review of Data Mining basics
2. Examples of patterns that can be mined
3. Examples of technologies used in DM
4. Approaches to overcome challenges
Quiz
1) ...................... is an essential process where intelligent methods are applied to extract data patterns.
Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. (Source: Wikipedia)
2) Data mining can also be applied to other forms of data such as ................
i) Data streams
ii) Sequence data
iii) Networked data
iv) Text data
v) Spatial data
A) i, ii, iii and v only
B) ii, iii, iv and v only
C) i, iii, iv and v only
D) All i, ii, iii, iv and v
3) Which of the following is not a data mining
functionality?
A) Characterization and Discrimination
B) Classification and regression
C) Selection and Evaluation
D) Clustering and Analysis

4) The various aspects of data mining methodology include ...................
i) Mining various and new kinds of knowledge
ii) Mining knowledge in multidimensional space
iii) Pattern evaluation and pattern- or constraint-guided mining
iv) Handling uncertainty, noise, or incompleteness of data
A) i, ii and iv only
B) ii, iii and iv only
C) i, ii and iii only
D) All i, ii, iii and iv
5) _____________ is the application of data
mining techniques to discover patterns from the
Web.
A. Text Mining.
B. Multimedia Mining.
C. Web Mining.
D. Link Mining
6) __________________ refers to the process of
deriving high-quality information from text.
A. Text Mining.
B. Image Mining.
C. Database Mining.
D. Multimedia Mining
Introduction to Data Mining

Companies Using Big Data the Right Way
How Domino’s Used Data to Heat Up Their
Marketing- Case Study
Domino’s Pizza is the hottest pizza chain in the
country, but back in 2008, things were starting to
look stale for the company.
They had lost a significant amount of market share
and sales were slipping. Rather than fall back on
worn-out strategies, Domino’s embraced the
promise of the information age.
They made a conscious decision not to make
marketing decisions on a hunch anymore and
embraced Big Data, big time.
Sources: http://crimsonmarketing.com/trusting-gut-worst-marketing-move-make/
https://www.bernardmarr.com/default.asp?contentID=1264
The company has used Splunk Enterprise marketing
technology to monitor real-time sales around the
world, allowing it to target customers in real
time with promotions and unique campaigns.
For example, if a local sports team was about to
play a game, the company could provide a unique offer to
its fans, or if the weather got warmer, it could
offer discount coupons on drinks rather than pizza.
The Amazing Ways Coca Cola Uses Artificial
Intelligence And Big Data To Drive Success
• Coca Cola is known to have ploughed extensive research and
development resources into artificial intelligence (AI) to ensure it is
squeezing every drop of insight it can from the data it collects.
• Fruits of this research were unveiled when it was announced that the
decision to launch Cherry Sprite as a new flavour was based on
monitoring data collected from the latest generation of self-service soft
drinks fountains, which allow customers to mix their own drinks.
• Because the machines allow customers to add their own choice from a range
of flavour “shots” to their drinks while they are mixed, the company
was able to pick the most popular combination and launch it as
a ready-made, canned drink.
• As sales of sugary, fizzy drink products have declined in recent years,
Coca Cola has also hooked into data to help produce and market some
of its healthier options, such as orange juice, which the company sells
under a number of brands around the world (including Minute Maid
and Simply Orange).
Source: https://www.bernardmarr.com/default.asp?contentID=1171
How Barbie’s manufacturers are using Big Data
in practice
• Hello Barbie responds realistically to a child by using
natural language processing, machine learning and
advanced analytics to parse what the child says and
respond accordingly.
• And unlike Siri or Cortana, which use automated voices,
every line Barbie speaks is recorded by an actor, giving her
a more lifelike sound. Updates will be recorded periodically
so that Barbie is always up on the latest pop culture.
• Hello Barbie is just the first in a long line of AI-enabled toys
that will take advantage of deep learning algorithms to
interact with us in more and more realistic ways.
• Privacy issues? Ethics of data science?
Source: https://www.bernardmarr.com/default.asp?contentID=730
Why Mine Data? Commercial Viewpoint
• The Explosive Growth of Data: from terabytes to petabytes
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/Credit Card transactions
• Computers have become cheaper and more powerful
• Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in
Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
• Traditional techniques are infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in hypothesis formation
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the
automated analysis of massive data sets
Why Data Mining
A search engine (e.g., Google) receives hundreds of millions of queries every day. Each
query can be viewed as a transaction where the user describes her or his information
need.
What novel and useful knowledge can a search engine learn from such a huge
collection of queries collected from users over time?
Some patterns found in user search queries can disclose invaluable knowledge.
For example, Google's Flu Trends uses specific search terms as indicators of flu activity.
Using aggregated Google search data, Flu Trends can estimate flu activity up to two
weeks faster than traditional systems can.
This example shows how data mining can turn a large collection of data into
knowledge that can help meet a current global challenge.

What is Data Mining?
Many Definitions
– Non-trivial extraction of implicit, previously
unknown and potentially useful information from
data.
– Data mining is the process of automatically
discovering useful information in large data
repositories.
What is (not) Data Mining?
What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)

Origins of Data Mining
• Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems
• Traditional Techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]

Data Mining: On What Kinds of Data?
(What Kinds of Data Can Be Mined? )
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structure data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World-Wide Web
• Example: A relational database for
AllElectronics. The company is described by
the following relation tables: customer, item,
employee, and branch.
Data Mining Tasks
• Prediction Methods
– Use some variables to predict unknown or future
values of other variables.
• Description Methods
– Find human-interpretable patterns that describe
the data.
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Models and Tasks...
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]

Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the
attributes is the class.
• Find a model for the class attribute as a function of the
values of the other attributes.
• Goal: previously unseen records should be assigned a
class as accurately as possible.
– A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test
sets, with the training set used to build the model and the test set
used to validate it.
• Supervised Learning

Classification Example
Training Set (used to learn the classifier model)
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Test Set (class to be assigned by the model)
Refund   Marital Status   Taxable Income   Cheat
No       Single           75K              ?
Yes      Married          50K              ?
No       Married          150K             ?
Yes      Divorced         90K              ?
No       Single           40K              ?
No       Married          80K              ?

[Figure: Training Set -> Learn Classifier -> Model, which is then applied to the Test Set]
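
To make "a model for the class attribute as a function of the other attributes" concrete, here is a minimal Python sketch using the records from the slide's table. The decision tree in `predict_cheat` is hand-written for illustration, including its 80K income threshold; it is an assumption, not the output of a learning algorithm and not necessarily the model intended by the slide. Because the test records carry no labels, only training accuracy can be computed here, which is an optimistic estimate; a labeled held-out test set would be needed for a fair one.

```python
# Training records transcribed from the slide:
# (refund, marital_status, taxable_income_in_thousands, cheat)
TRAINING = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Unlabeled test records from the slide: (refund, marital_status, income)
TEST = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def predict_cheat(refund, marital_status, income_k):
    """Hand-written decision tree (illustrative assumption, not learned)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Refund = No and Single/Divorced: split on income (threshold assumed).
    return "Yes" if income_k >= 80 else "No"


def accuracy(labeled_records):
    """Fraction of labeled records whose class the model reproduces."""
    correct = sum(
        1
        for refund, marital, income, label in labeled_records
        if predict_cheat(refund, marital, income) == label
    )
    return correct / len(labeled_records)


training_accuracy = accuracy(TRAINING)
test_predictions = [predict_cheat(*record) for record in TEST]
print("training accuracy:", training_accuracy)
print("test predictions:", test_predictions)
```

In practice the tree would be induced from the training set by an algorithm such as those covered in the classification module, rather than written by hand.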

Classification: Application 1
• Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997

• Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
• Use credit card transactions and the information on the
account holder as attributes.
– When does a customer buy, what does the customer buy, how often
does the customer pay on time, etc.
• Label past transactions as fraud or fair transactions. This
forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.

• Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be
lost to a competitor.
– Approach:
• Use detailed record of transactions with each of the
past and present customers, to find attributes.
– How often the customer calls, where he calls, what time of
the day he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997

Clustering Definition
• Given a set of data points, each having a set of
attributes, and a similarity measure among them, find
clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one
another.
• Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
• Unsupervised Learning
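
The definition above can be illustrated with a small k-means sketch in Python that uses Euclidean distance as the (dis)similarity measure. k-means is one common clustering algorithm chosen here for illustration; the slide does not name a specific method. The sample points, `k=2`, and the naive "first k points" initialization are all assumptions.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    centroids = [points[i] for i in range(k)]  # naive init (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids


# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

No class labels are used at any point, which is what makes this an unsupervised task.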

Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances are minimized; intercluster distances are maximized.

Clustering: Application 1
• Document Clustering:
–Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
–Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
–Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
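
A similarity measure based on term frequencies, as described in the approach above, might look like the following Python sketch. Cosine similarity is one common choice and is an assumption here, as are the three sample documents and the whitespace tokenizer.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    shared_terms = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents that all mention "amazon" in different contexts.
docs = {
    "d1": "amazon rainforest river wildlife rainforest",
    "d2": "rainforest wildlife conservation amazon",
    "d3": "amazon online shopping delivery orders",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_d1_d2 = cosine_similarity(tf["d1"], tf["d2"])
sim_d1_d3 = cosine_similarity(tf["d1"], tf["d3"])
print(round(sim_d1_d2, 3), round(sim_d1_d3, 3))
```

The two rainforest documents score higher with each other than with the shopping document, so a clustering algorithm driven by this measure would tend to group them together.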

Association Rule Discovery: Definition
• Given a set of records, each of which contains some number of items
from a given collection:
– Produce dependency rules which will predict
occurrence of an item based on occurrences of other
items.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
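
How strongly the data backs a rule such as {Milk} --> {Coke} is usually quantified with support and confidence. These two measures are standard in association analysis but are not defined on this slide, so treat the Python sketch below as a supplementary illustration; the function names are hypothetical.

```python
# The five transactions from the slide.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    return count(antecedent | consequent) / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


rule_support = support({"Milk"}, {"Coke"})
rule_confidence = confidence({"Milk"}, {"Coke"})
print(rule_support, rule_confidence)
```

Milk and Coke appear together in transactions 1, 3 and 5, and Milk appears in transactions 1, 3, 4 and 5, so the rule has support 3/5 and confidence 3/4.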

Association Rule Discovery: Application 1
• Marketing and Sales Promotion:
– Let the rule discovered be
{Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to
determine what should be done to boost its sales.
– Bagels in the antecedent => Can be used to see which
products would be affected if the store discontinues
selling bagels.
– Bagels in antecedent and Potato chips in consequent
=> Can be used to see what products should be sold
with Bagels to promote sale of Potato chips!

Sequential Pattern Discovery: Definition
• Given a set of objects, with each object associated with its own timeline of
events, find rules that predict strong sequential dependencies among different
events.
(A B) (C) (D E)
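
One way to read the pattern (A B) (C) (D E) is as an ordered list of event sets: A and B occur together, then C at some later time, then D and E together later still. Under that assumed reading, the Python sketch below checks whether an object's timeline contains the pattern and computes the fraction of objects that do. The four timelines are hypothetical.

```python
def contains(timeline, pattern):
    """True if the pattern's event sets occur in order within the timeline,
    each one inside a distinct, later timeline element."""
    matched = 0
    for events in timeline:
        if matched < len(pattern) and pattern[matched] <= events:
            matched += 1
    return matched == len(pattern)


pattern = [{"A", "B"}, {"C"}, {"D", "E"}]

# Hypothetical timelines: each is a list of event sets in time order.
timelines = {
    "obj1": [{"A", "B"}, {"C"}, {"D", "E"}],
    "obj2": [{"A", "B", "F"}, {"G"}, {"C"}, {"D", "E"}],
    "obj3": [{"C"}, {"A", "B"}, {"D", "E"}],   # C comes too early
    "obj4": [{"A"}, {"B"}, {"C"}, {"D", "E"}],  # A and B never co-occur
}

matches = {name: contains(t, pattern) for name, t in timelines.items()}
pattern_support = sum(matches.values()) / len(timelines)
print(matches, pattern_support)
```

Real sequential pattern mining also applies timing constraints (for example a maximum gap between elements), which this sketch omits.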

Regression
• Predict a value of a given continuous-valued variable based on
the values of other variables, assuming a linear or nonlinear
model of dependency.
• Extensively studied in the statistics and neural network fields.
• Examples:
– Predicting sales amounts of new product based on
advertising expenditure.
– Predicting wind velocities as a function of temperature,
humidity, air pressure, etc.
– Time series prediction of stock market indices.
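
For the linear case, the first example (sales as a function of advertising expenditure) can be sketched in Python with ordinary least squares on a single predictor. The data values and units are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend and resulting sales (arbitrary units).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.9, 5.1, 7.0, 9.2, 10.8]

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 6.0 + intercept  # prediction for an unseen spend
print(slope, intercept, predicted_sales)
```

Unlike classification, the predicted quantity is a continuous value rather than a class label.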

Deviation/Anomaly Detection
(Outlier Analysis)
• Many data mining methods discard outliers as noise
or exceptions.
• However, in some applications (e.g., fraud
detection) the rare events can be more interesting
than the more regularly occurring ones.
• Detect significant deviations from normal behavior
• Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection
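
A very simple way to detect significant deviations from normal behavior is a z-score test: flag values that lie more than a chosen number of standard deviations from the mean. The Python sketch below assumes a threshold of 2.0 and uses invented transaction amounts; it is one basic statistical technique, not the only or necessarily the preferred one.

```python
import statistics


def z_score_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical card transaction amounts with one unusually large purchase.
amounts = [20, 22, 19, 21, 23, 20, 500]
outliers = z_score_outliers(amounts)
print(outliers)
```

Note that the extreme value itself inflates the mean and standard deviation, which is why small samples need a modest threshold; more robust measures are often used in practice.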
Which Technologies Are Used?
Major Issues in Data Mining (1)
• Mining Methodology
– Mining various and new kinds of knowledge
– Mining knowledge in multi-dimensional space
– Data mining: An interdisciplinary effort
– Handling noise, uncertainty, and incompleteness of data
– Pattern evaluation and pattern- or constraint-guided mining
• User Interaction
– Interactive mining
– Incorporation of background knowledge
– Presentation and visualization of data mining results
Major Issues in Data Mining (2)
• Efficiency and Scalability

– Efficiency and scalability of data mining algorithms
– Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
– Handling complex types of data
– Mining dynamic, networked, and global data repositories
• Data mining and society
– Social impacts of data mining
– Privacy-preserving data mining
– Invisible data mining
Homeplay
1. What is Bibliomining? Give examples.
2. What is Opinion Mining? Elaborate.
3. What is Bug Mining? How does it work?
Summary
• Data mining: Discovering interesting patterns and knowledge from massive
amounts of data
• A natural evolution of database technology, in great demand, with wide
applications
• A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
• Mining can be performed on a variety of data
• Data mining functionalities: characterization, discrimination, association,
classification, clustering, outlier and trend analysis, etc.
• Data mining technologies and applications
• Major issues in data mining
Virtual Lab - EDA
• Python
• Weka
Contact Session 2: Data Preprocessing
• RLs on Module 2
– Types of Data (Nominal, Categorical)
– Data Quality
– Data Preprocessing Tasks
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Data compression
• Data transformation and data discretization
– Normalization
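
Two of the tasks listed above, filling in missing values and normalization, can be sketched in a few lines of Python. Mean imputation and min-max scaling are common choices assumed here for illustration; the slide does not prescribe specific techniques, and the sample values are invented.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace None with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly into [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical attribute column with one missing entry.
raw = [10.0, None, 30.0, 50.0]
cleaned = fill_missing_with_mean(raw)
normalized = min_max_normalize(cleaned)
print(cleaned)
print(normalized)
```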
Thank You