
Data Mining - SESAP ZC425 - CS1
Module 1
Mrs. Preeti NG
Dept of Computer Science and Engg
BITS Pilani Bangalore
BITS Pilani, Pilani Campus

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Course Description – Module Structure
M1 Introduction to Data Mining
M2 Data Preprocessing:
To understand the need for data preprocessing and various techniques used
in the context of Data Mining
M3 Data Exploration:
A preliminary exploration of the data to better understand its characteristics

M4 Classification and prediction:


To learn different techniques and algorithms for classification, a major
predictive and supervised Data Mining task
M5 Association Analysis:
To understand the descriptive relation between the entities by identifying
associations among them and to learn various algorithms to find them
M6 Clustering:
To learn different techniques and algorithms for clustering, a major
descriptive and unsupervised Data Mining task
M7 Anomaly Detection:
Detecting outliers and noise in data sets is an important Data Mining task.
This module focuses on techniques needed for anomaly detection
M8 Data Mining on unstructured (Big) data:
Graph Mining, Social Network Analysis, Multimedia Data Mining, Text Mining,
Mining the World Wide Web
M9 Data Mining Applications:
Recommendation Systems, Fraud Detection, Sentiment Analysis
Prerequisites

• ?

Motivation

• ?

Books
Prescribed Text Book

T1- Tan P. N., Steinbach M. & Kumar V., “Introduction to Data Mining”,
Pearson Education, 2006
T2- Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”,
Second Edition, Morgan Kaufmann Publishers, 2006
Reference Books

1. Predictive Analytics and Data Mining: Concepts and Practice
with RapidMiner by Vijay Kotu and Bala Deshpande, Morgan
Kaufmann Publishers © 2015
2. Practical Text Mining and Statistical Analysis for Non-structured
Text Data Applications by Gary Miner et al., Academic Press © 2012
3. Recommender Systems for Learning by Nikos Manouselis,
Hendrik Drachsler, Katrien Verbert and Erik Duval, Springer © 2013
4. Data Mining: Introductory and Advanced Topics by Margaret H. Dunham
Lecture 1 Topics

1. Review of Data Mining basics.


2. Examples of patterns that can be mined
3. Examples of technologies used in DM
4. Approaches to overcome challenges.

Quiz

1) ...................... is an essential process where
intelligent methods are applied to extract
data patterns.

Data mining is the process of discovering patterns in large data
sets involving methods at the intersection of machine learning,
statistics, and database systems. (Source: Wikipedia)
2) Data mining can also be applied to other forms of data such as ................

i) Data streams
ii) Sequence data
iii) Networked data
iv) Text data
v) Spatial data

A) i, ii, iii and v only


B) ii, iii, iv and v only
C) i, iii, iv and v only
D) All i, ii, iii, iv and v

3) Which of the following is not a data mining
functionality?

A) Characterization and Discrimination

B) Classification and regression

C) Selection and Evaluation

D) Clustering and Analysis


4) The various aspects of data mining methodologies is/are
...................

i) Mining various and new kinds of knowledge


ii) Mining knowledge in multidimensional space
iii) Pattern evaluation and pattern or constraint-guided
mining.
iv) Handling uncertainty, noise, or incompleteness of data

A) i, ii and iv only
B) ii, iii and iv only
C) i, ii and iii only
D) All i, ii, iii and iv
5) _____________ is the application of data
mining techniques to discover patterns from the
Web.
A. Text Mining.
B. Multimedia Mining.
C. Web Mining.
D. Link Mining

6) __________________ refers to the process of
deriving high-quality information from text.
A. Text Mining.
B. Image Mining.
C. Database Mining.
D. Multimedia Mining

Introduction to Data Mining



Companies Using Big Data the Right Way
How Domino’s Used Data to Heat Up Their
Marketing- Case Study
Domino’s Pizza is the hottest pizza chain in the
country, but back in 2008, things were starting to
look stale for the company.
They had lost a significant amount of market share
and sales were slipping. Rather than fall back on
worn-out strategies, Domino’s embraced the
promise of the information age.
They made a conscious decision not to make
marketing decisions on a hunch anymore and
embraced Big Data, big time.
Source: http://crimsonmarketing.com/trusting-gut-worst-marketing-move-make/
https://www.bernardmarr.com/default.asp?contentID=1264
The company has used Splunk Enterprise marketing
technology to monitor real-time sales around the
world, allowing them to target customers in real
time with promotions and unique campaigns.

For example, if a local sports team was about to
play a game, they could provide a unique offer to
their fans, or if the weather got warmer, they could
offer discount coupons on drinks rather than pizza.
The Amazing Ways Coca Cola Uses Artificial
Intelligence And Big Data To Drive Success

• Coca Cola is known to have ploughed extensive research and
development resources into artificial intelligence (AI) to ensure it is
squeezing every drop of insight it can from the data it collects.
• Fruits of this research were unveiled when it was announced that the
decision to launch Cherry Sprite as a new flavour was based on
monitoring data collected from the latest generation of self-service soft
drinks fountains, which allow customers to mix their own drinks.
• As the machines allow customers to add their own choice from a range
of flavour “shots” to their drinks while they are mixed, this meant
they were able to pick the most popular combinations and launch it as
a ready-made, canned drink.
• As sales of sugary, fizzy drink products have declined in recent years
Coca Cola has also hooked into data to help produce and market some
of its healthier options, such as orange juice, which the company sells
under a number of brands around the world (including Minute Maid
and Simply Orange).
Source: https://www.bernardmarr.com/default.asp?contentID=1171
How Barbie’s manufacturers are using Big Data
in practice
• Hello Barbie responds realistically to a child by using
natural language processing, machine learning and
advanced analytics to parse what the child says and
respond accordingly.
• And unlike Siri or Cortana, who use automated voices,
every line Barbie speaks is recorded by an actor, giving her
a more lifelike sound. Updates will be recorded periodically
so that Barbie is always up on the latest pop culture.
• Hello Barbie is just the first in a long line of AI-enabled toys
that will take advantage of deep learning algorithms to
interact with us in more and more realistic ways

• Privacy issues? Ethics of data science?

Source: https://www.bernardmarr.com/default.asp?contentID=730
Why Mine Data? Commercial Viewpoint

• The Explosive Growth of Data: from terabytes to petabytes
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in
Customer Relationship Management)
Why Mine Data? Scientific Viewpoint

• Data collected and stored at enormous speeds (GB/hour)
– Remote sensors on a satellite
– Telescopes scanning the skies
– Microarrays generating gene expression data
– Scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in hypothesis formation
• We are drowning in data, but starving for
knowledge!
• “Necessity is the mother of invention”: data
mining is the automated analysis of massive data
sets
Why Data Mining
A search engine (e.g., Google) receives hundreds of millions of queries every day. Each
query can be viewed as a transaction where the user describes her or his information
need.

What novel and useful knowledge can a search engine learn from such a huge
collection of queries collected from users over time?

Some patterns found in user search queries can disclose invaluable knowledge

For example, Google's Flu Trends uses specific search terms as indicators of flu activity.

Using aggregated Google search data, Flu Trends can estimate flu activity up to two
weeks faster than traditional systems can.

This example shows how data mining can turn a large collection of data into
knowledge that can help meet a current global challenge.



What is Data Mining?

Many Definitions
– Non-trivial extraction of implicit, previously
unknown and potentially useful information from
data.
– Data mining is the process of automatically
discovering useful information in large data
repositories.
What is (not) Data Mining?

What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”

What is Data Mining?
– Certain names are more prevalent in certain US
locations (O’Brien, O’Rurke, O’Reilly… in Boston area)
– Group together similar documents returned by a search
engine according to their context (e.g. Amazon
rainforest, Amazon.com)


Origins of Data Mining

• Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data

[Figure: Data Mining at the intersection of Statistics, Machine Learning/AI, Pattern Recognition, and Database systems]


Data Mining: On What Kinds of Data?
(What Kinds of Data Can Be Mined? )
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structure data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World-Wide Web
• Example : A relational database for
AllElectronics. The company is described by
the following relation tables: customer, item,
employee, and branch.

Data Mining Tasks

• Prediction Methods
– Use some variables to predict unknown or future
values of other variables.

• Description Methods
– Find human-interpretable patterns that describe
the data.

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Models and Tasks...

• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]



Classification: Definition

• Given a collection of records (training set)
– Each record contains a set of attributes; one of the
attributes is the class.
• Find a model for the class attribute as a function of the
values of the other attributes.
• Goal: previously unseen records should be assigned a
class as accurately as possible.
– A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test
sets, with the training set used to build the model and the
test set used to validate it.
• Supervised Learning


Classification Example
Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: Training Set → Learn Classifier → Model; the Model is then applied to the Test Set]
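As a minimal sketch of the train-then-predict flow in this example, the Python snippet below applies a 1-nearest-neighbour classifier to the ten training records and the six unlabeled test records. The slide does not say which classifier is learned, so the choice of 1-NN, the `distance` function, and its income scaling (gap in thousands divided by 100) are assumptions made purely for illustration.

```python
# Training records from the slide: (refund, marital_status, income_in_K, cheat)
TRAIN = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test records from the slide; the class label ("Cheat") is unknown.
TEST = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def distance(a, b):
    """Assumed mixed distance: 1 per categorical mismatch plus a scaled income gap."""
    mismatches = (a[0] != b[0]) + (a[1] != b[1])
    return mismatches + abs(a[2] - b[2]) / 100.0


def predict(record):
    """Return the class of the closest training record (1-nearest neighbour)."""
    nearest = min(TRAIN, key=lambda row: distance(record, row))
    return nearest[3]


predictions = [predict(record) for record in TEST]
for record, label in zip(TEST, predictions):
    print(record, "->", label)
```

Because the test labels are unknown here, accuracy cannot be measured on this test set; in practice part of the labeled data would be held out for that purpose, as the definition slide describes.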


Classification: Application 1

• Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997


Classification: Application 2

• Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
• Use credit card transactions and the information on the
account-holder as attributes.
– When does a customer buy, what does he buy, how often he pays
on time, etc.
• Label past transactions as fraud or fair transactions. This
forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.

Classification: Application 3

• Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be
lost to a competitor.
– Approach:
• Use detailed records of transactions with each of the
past and present customers to find attributes.
– How often the customer calls, where he calls, what time of
the day he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997


Clustering Definition

• Given a set of data points, each having a set of
attributes, and a similarity measure among them, find
clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one
another.
• Similarity Measures:
– Euclidean distance if attributes are continuous.
– Other problem-specific measures.

• Unsupervised Learning


Illustrating Clustering
Euclidean distance based clustering in 3-D space.

[Figure: points in 3-D space grouped into clusters; intracluster distances are minimized, intercluster distances are maximized]
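The idea of Euclidean-distance-based clustering in 3-D can be sketched with a small k-means loop. The slide does not name a specific algorithm, so k-means, the six sample points, and the choice of initial centroids below are all assumptions for illustration only.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical 3-D points forming two well-separated groups.
points = [(1, 1, 1), (1, 2, 1), (2, 1, 2), (8, 8, 8), (9, 8, 9), (8, 9, 9)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

Points that end up with the same label are close to their shared centroid (small intracluster distance), while the two centroids are far apart (large intercluster distance).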


Clustering: Application 1

• Document Clustering:
–Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
–Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
–Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
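The approach above (count terms, then derive a similarity from the term frequencies) can be sketched as follows. Cosine similarity over raw term counts is one common choice and is an assumption here, as are the three sample documents; the slide only says that a similarity measure is formed from term frequencies.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a if term in tf_b)
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: the first two share vocabulary, the third does not.
docs = [
    "data mining finds patterns in data",
    "mining patterns from large data",
    "pizza delivery and pizza offers",
]
vectors = [term_frequencies(doc) for doc in docs]

sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print("doc0 vs doc1:", round(sim_01, 3))
print("doc0 vs doc2:", round(sim_02, 3))
```

A clustering algorithm would then group documents whose pairwise similarity is high, so the first two documents would fall in one cluster and the third in another.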



Association Rule Discovery: Definition

Given a set of records, each of which contains some number of items
from a given collection:
– Produce dependency rules which will predict the
occurrence of an item based on occurrences of other
items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
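To see why {Milk} --> {Coke} is a plausible rule for these five transactions, the sketch below computes its support and confidence. These two measures are the standard way to score association rules but are not defined on the slide, so the function names and the decision to use them here are assumptions.

```python
# The five market-basket transactions from the slide.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for transaction in TRANSACTIONS if items <= transaction)


def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(TRANSACTIONS)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(set(antecedent) | set(consequent)) / count(antecedent)


rule_support = support({"Milk", "Coke"})
rule_confidence = confidence({"Milk"}, {"Coke"})
print("support({Milk, Coke}) =", rule_support)
print("confidence({Milk} --> {Coke}) =", rule_confidence)
```

Milk appears in four of the five transactions and Coke accompanies it in three of those, so the rule holds in 3 of 5 transactions overall (support) and in 3 of the 4 that contain Milk (confidence).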


Association Rule Discovery: Application 1
• Marketing and Sales Promotion:
– Let the rule discovered be
{Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to
determine what should be done to boost its sales.
– Bagels in the antecedent => Can be used to see which
products would be affected if the store discontinues
selling bagels.
– Bagels in antecedent and Potato Chips in consequent
=> Can be used to see what products should be sold
with Bagels to promote sale of Potato Chips!
Sequential Pattern Discovery: Definition

• Given a set of objects, with each object associated with its own timeline of
events, find rules that predict strong sequential dependencies among different
events.

(A B) (C) (D E)

Regression

• Predict the value of a given continuous-valued variable based on
the values of other variables, assuming a linear or nonlinear
model of dependency.
• Widely studied in statistics and in the neural network field.
• Examples:
– Predicting sales amounts of new product based on
advertising expenditure.
– Predicting wind velocities as a function of temperature,
humidity, air pressure, etc.
– Time series prediction of stock market indices.
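For the linear case, the first example above (sales as a function of advertising expenditure) can be sketched with ordinary least squares on a single input variable. The data values are hypothetical and the closed-form fit is one standard way to estimate a linear model; nothing here comes from the slides beyond the task itself.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure and resulting sales (arbitrary units).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = intercept + slope * 6.0  # forecast for an unseen spend level
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("predicted sales at spend 6.0:", round(predicted_sales, 3))
```

Unlike classification, the predicted target here is a continuous number rather than a class label.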



Deviation/Anomaly Detection
(Outlier Analysis)

• Many data mining methods discard outliers as noise
or exceptions.
• However, in some applications (e.g., fraud
detection) the rare events can be more interesting
than the more regularly occurring ones
• Detect significant deviations from normal behavior

• Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection
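One simple way to "detect significant deviations from normal behavior" is a z-score test: flag values that lie more than a chosen number of standard deviations from the mean. The sketch below assumes this technique, a threshold of 2.0, and a hypothetical list of card transaction amounts; real fraud detection systems use far richer models.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold standard deviations."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical transaction amounts on one account; 500 is the unusual one.
amounts = [20, 22, 19, 21, 23, 20, 22, 21, 500, 20]
outliers = zscore_outliers(amounts)
print("flagged amounts:", outliers)
```

Note that a single extreme value also inflates the mean and standard deviation it is judged against, which is one reason robust alternatives (for example, median-based measures) are often preferred in practice.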
Which Technologies Are Used?

Major Issues in Data Mining (1)

• Mining Methodology
– Mining various and new kinds of knowledge
– Mining knowledge in multi-dimensional space
– Data mining: An interdisciplinary effort
– Handling noise, uncertainty, and incompleteness of data
– Pattern evaluation and pattern- or constraint-guided mining
• User Interaction
– Interactive mining
– Incorporation of background knowledge
– Presentation and visualization of data mining results

Major Issues in Data Mining (2)

• Efficiency and Scalability


– Efficiency and scalability of data mining algorithms
– Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
– Handling complex types of data
– Mining dynamic, networked, and global data repositories
• Data mining and society
– Social impacts of data mining
– Privacy-preserving data mining
– Invisible data mining

Homeplay

1. What is Bibliomining? Give examples.
2. What is Opinion Mining? Elaborate.
3. What is Bug Mining? How does it work?
Summary
• Data mining: Discovering interesting patterns and knowledge from massive
amounts of data
• A natural evolution of database technology, in great demand, with wide
applications
• A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
• Mining can be performed on a variety of data
• Data mining functionalities: characterization, discrimination, association,
classification, clustering, outlier and trend analysis, etc.
• Data mining technologies and applications
• Major issues in data mining
Virtual Lab - EDA

• Python
• Weka

Contact Session 2: Data Preprocessing

• RL’s on Module 2
– Types of Data (Nominal, Categorical)
– Data Quality
– Data Preprocessing Tasks
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Data compression
• Data transformation and data discretization
– Normalization
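Two of the preprocessing tasks listed above, filling in missing values (data cleaning) and normalization (data transformation), can be sketched in a few lines. Mean imputation and min-max scaling to [0, 1] are assumed here as representative techniques, and the income values are hypothetical; the slide does not prescribe specific methods.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column, no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical taxable incomes (in thousands) with one missing entry.
incomes = [70, None, 90, 220]
filled = fill_missing_with_mean(incomes)
normalized = min_max_normalize(filled)
print("filled:", filled)
print("normalized:", [round(v, 3) for v in normalized])
```

Cleaning is applied before normalization here because the minimum and maximum cannot be computed while values are still missing.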

Thank You

