
IME 672

Data Mining & Knowledge Discovery

Chapter 1
Course Structure
• 27 Sessions
– Lectures
– Labs

• Evaluation
– Mid Sem + End Sem: 70%
– Project: 20%
– Quizzes / Assignments / Presentations: 10%

• Attendance Policy
– 100% attendance compulsory
Course Materials
• Books
– Data Mining: Concepts and Techniques, 3rd ed.
• By Jiawei Han, Micheline Kamber & Jian Pei
– Introduction to Data Mining
• P. N. Tan, M. Steinbach & V. Kumar
– The Art of R Programming
• Norman Matloff
– R for Everyone
• Jared P. Lander
• Handouts and Case Studies
Course Outline
• Module 1 – DM & KD Concepts and Techniques

• Module 2 – DM & KD Applications

• Module 3 – Hands on with R


Module 1

Data Mining & Knowledge Discovery: Concepts and Techniques
Why Data Mining?
• Explosive growth of data: from terabytes to petabytes
– Automated data collection tools, database systems, Web, computerized society
– Growth of many application areas
• Major sources:
– Business: Web, e-commerce, transactions, stocks, …
– Science: Remote sensing, bioinformatics, scientific simulation, …
– Society and everyone: news, digital cameras, YouTube

• We are drowning in an ocean of data, but starving for knowledge

• Solution: DATA MINING
What is Data Mining?
An iterative (many steps, passes) and interactive (human intervention) process of discovering
- novel (non-trivial),
- valid (generalizable to future data),
- useful (action is possible),
- comprehensive, and
- understandable (leading to insight)
patterns and models in MASSIVE data sources
What is Data Mining?
• Data mining: a misnomer?
• Knowledge discovery in databases (KDD), knowledge
extraction, pattern analysis, data archeology,
information harvesting, business intelligence, etc.

• What is not data mining?
– Simple search and query processing
– Expert systems or small ML/statistical programs
Is DM = KD?
• Knowledge Discovery
– Overall process of extracting knowledge from data

• Data Mining
– A step in KD process, dealing with identifying
patterns in data
– Application of a specific algorithm based on the
overall goal of the KD process
Knowledge Discovery Process
[Figure: the knowledge discovery process. Labels shown: Raw Data, Integration, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding]
Steps of KD Process
1. Learning the application domain:
– relevant prior knowledge and goals of
application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing: (may take
60% of effort!)
4. Data reduction and transformation:
– find useful features, dimensionality/variable
reduction, invariant representation
Steps of KD Process
5. Choosing functions of data mining
– summarization, classification, regression,
association, clustering.
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant
patterns, etc.
9. Use of discovered knowledge
Evolution of DM
• 1980s: ERP
• 1990s: CRM
• 2000s: eCommerce
• 2010s: Data Mining / Big Data Analytics
Why has the new age emerged?
• Computing Storm
– Cheaper technology
– Mobile computing
– Social networking
– Cloud computing
• Data Storm
– Volume
– Velocity
– Variety
• Convergence Storm
– Traditional software and hardware technologies
What is Big Data?
• Data becomes large enough that it cannot be
processed using conventional methods
• It isn’t just a description of raw volume
• Real issue is usability / accessibility
• Challenge is to develop cost-effective and reliable
methods for extracting value from large and complex
sets of data in real time
• Big Data analytics vs. Traditional analytics
– Speed
– Scale
– Complexity
Big Data Examples
• Europe's Very Long Baseline Interferometry (VLBI)
has 16 telescopes, each of which produces
1 Gigabit/second of astronomical data over a 25-day
observation session
– storage and analysis a big problem
• AT&T handles billions of calls per day
– so much data that it cannot all be stored; analysis has to be
done “on the fly”, on streaming data

• Knowledge Discovery is needed to make sense and use of data
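To get a feel for the scale in the VLBI example, the figures on the slide can be multiplied out. This is only a back-of-the-envelope sketch: it assumes every telescope streams continuously for the whole session and uses decimal units (1 Gigabit = 10^9 bits, 1 PB = 10^15 bytes), neither of which the slide states.

```python
# Figures taken from the slide; continuous streaming and decimal units are assumptions.
TELESCOPES = 16
RATE_BITS_PER_SEC = 10**9        # 1 Gigabit/second per telescope
SESSION_DAYS = 25
SECONDS_PER_DAY = 24 * 60 * 60


def session_volume_bytes(telescopes, rate_bits_per_sec, days):
    """Total bytes produced if every telescope streams for the whole session."""
    total_bits = telescopes * rate_bits_per_sec * days * SECONDS_PER_DAY
    return total_bits / 8


volume = session_volume_bytes(TELESCOPES, RATE_BITS_PER_SEC, SESSION_DAYS)
print(f"{volume / 1e15:.2f} PB per observation session")
```

Under these assumptions a single session lands in the petabyte range, which is why storage and analysis are described as a big problem.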
The 3 V’s
• Volume
– Quantity of transactions, events, or amount of history
– Attributes, dimensions, or predictive variables

• Variety
– Assortment of data
– Traditional data, especially operational data, is “structured”
– Recently data has become increasingly “unstructured”
– Data does not have a predefined data model and/or does
not fit well into a relational database
– Text, audio, video, image, geospatial, Internet data (click
streams and log files)
The 3 V’s (contd.)
• Variety
– Unstructured data
– Amount of data is doubling every two years
– Most new data is unstructured (~95%)
– Unstructured data is vastly underutilized

• Velocity
– Speed at which data is created, accumulated, ingested, and
processed
Is Big Data analytics worth the effort?
• Competitive advantage in an ultracompetitive global economy
• Nucleus Research (2011) concluded that analytics pays back
$10.66 for every dollar spent
• Media Math Co. achieved a 212% ROI in five months with an
annual revenue lift of $2.2M
• Drive the top line while simultaneously minimizing operational cost
• Big Data analytics isn’t constrained by a predefined set of
questions
• “You don’t know what you don’t know”
• You don’t have to guess
• Fact-based decisions: use data to find answers that are more
specific and significantly more useful
Data Mining in Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases from bottom to top. Layers from top to bottom, with the role shown next to each layer in parentheses]
• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
DM Application Areas
• Science
– astronomy, bioinformatics, drug discovery, …
• Business
– advertising, CRM (Customer Relationship
Management), investments, manufacturing,
sports/entertainment, telecom, e-Commerce,
targeted marketing, health care, …
• Web
– search engines, bots, …
• Government
– law enforcement, profiling tax cheaters, anti-terror, …
DM for Customer Modeling
• Customer Tasks:
– attrition prediction
– targeted marketing:
• cross-sell, customer acquisition
– credit-risk
– fraud detection
• Industries
– banking, telecom, retail sales, …
Customer Attrition Case
• Situation: Attrition rate of mobile phone
customers is around 25-30% a year!

• Task:
– Given customer information for the past N
months, predict who is likely to attrite next month
– Also, estimate customer value and the most
cost-effective offer to make to this customer
Credit Risk Assessment Case
• Situation: Person applies for a loan
• Task:
– Should a bank approve the loan?
• Note:
– People who have the best credit don’t need the loans, and people
with the worst credit are not likely to repay
– Bank’s best customers are in the middle

• Banks develop credit models using a variety of machine
learning methods
• Mortgage and credit card proliferation is the result of
being able to successfully predict if a person is likely to
default on a loan
E-commerce Case
• A person buys a book (product) at Amazon.com
• Task: Recommend other books (products) this
person is likely to buy
• Amazon does clustering based on books bought:
– customers who bought “Advances in Knowledge Discovery
and Data Mining”, also bought “Data Mining: Practical
Machine Learning Tools and Techniques with Java
Implementations”
• Recommendation program is quite successful
Genomic Microarrays Case
• Given microarray (medical) data for a number
of patients, can we
– Accurately diagnose the disease?
– Predict outcome for given treatment?
– Recommend best treatment?
Example: ALL/AML data
• 38 training cases, 34 test, ~ 7,000 genes
• 2 Classes: Acute Lymphoblastic Leukemia (ALL)
vs. Acute Myeloid Leukemia (AML)
• Use the training data to build a diagnostic model
• Results on test data:
– 33/34 correct, 1 error
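The test result above can be restated as an accuracy and an error rate. The helper below is a small illustrative sketch (the function name is ours, not from the course material); it uses only the 33-of-34 figure given on the slide.

```python
def accuracy(correct, total):
    """Fraction of test cases classified correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


# 33 of the 34 held-out test cases were diagnosed correctly (figure from the slide).
test_accuracy = accuracy(33, 34)
error_rate = 1 - test_accuracy
print(f"accuracy = {test_accuracy:.1%}, error rate = {error_rate:.1%}")
```

Note that the model is scored on the 34 test cases, not on the 38 cases used to build it, which is what makes the figure a statement about validity on new data.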
Security & Fraud Detection Case
• Credit Card Fraud Detection
• Detection of money laundering
– FAIS (US Treasury)
• Securities Fraud
– NASDAQ KDD system
• Phone fraud
– AT&T, Bell Atlantic, British Telecom/MCI
• Bio-terrorism detection at the 2002 Salt Lake Olympics
Disaster Management
• Optimization Analytics used to direct the correct supplies
of recovery/food items to areas where they are needed
most
• Does a village need bottled water or boats, rice or wheat,
shelter or toilets?
• Hurricane Frances was on its way to hit Florida’s Atlantic
coast (2004)
– Wal-Mart wanted to predict which items would sell most in the
path of the hurricane
– Obvious items: bottled water, flashlights
– Mined shopper history from when Hurricane Charley struck several
weeks earlier
– Finding: ahead of a hurricane, strawberry Pop-Tarts sold at about
seven times their normal rate, and beer was the top-selling item
Data Mining Functionalities
• Specify the kinds of patterns to be found in data
mining tasks

• Class/Concept description: Characterization and discrimination
– Data characterization - summarization of the general
characteristics or features of a target class of data
– Data discrimination - comparison of the general features of
the target class data objects against objects from one or
multiple contrasting classes
– Output - pie charts, bar charts, curves, multidimensional
data cubes, and multidimensional tables
Example
• AllElectronics is a successful international company with
branches around the world
• Each branch has its own set of databases
• The database has the following relational tables:
– customer – (cust_ID, name, address, age, occupation,
annual_income, credit_information, category,…)
– item – (item_ID, brand, category, type, price, place_made, supplier,
cost,...)
– branch – (branch_ID, name, address,...)
– purchases – (trans_ID, cust_ID, empl_ID, date, time, method_paid,
amount)
– items_sold – (trans_ID, item_ID, qty)
Example
• Data characterization
– Summarize the characteristics of customers who spend more
than $5000 a year at AllElectronics
– Result – a general profile of these customers, such as that they
are 40 to 50 years old, employed, and have excellent credit
ratings
• Data discrimination
– Compare two groups of customers—those who shop for
computer products regularly (e.g., more than twice a month)
and those who rarely shop for such products (e.g., less than
three times a year)
– Result - 80% of the customers who frequently purchase
computer products are 20-40 years old and have a university
education, whereas 60% of the customers who infrequently buy
such products are either seniors or youths, and have no
university degree
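Data characterization as described above amounts to selecting the target class and summarizing it. The following is a minimal sketch in plain Python; the records, the yearly_spend and employed fields, and the characterize helper are all hypothetical and only loosely follow the customer relation of the AllElectronics example.

```python
from collections import Counter

# Hypothetical customer records; values are made up for illustration.
customers = [
    {"cust_ID": 1, "age": 45, "employed": True,  "credit": "excellent", "yearly_spend": 7200},
    {"cust_ID": 2, "age": 23, "employed": False, "credit": "fair",      "yearly_spend": 800},
    {"cust_ID": 3, "age": 48, "employed": True,  "credit": "excellent", "yearly_spend": 5600},
    {"cust_ID": 4, "age": 41, "employed": True,  "credit": "good",      "yearly_spend": 9100},
    {"cust_ID": 5, "age": 67, "employed": False, "credit": "good",      "yearly_spend": 1200},
]


def characterize(records, min_spend):
    """Summarize the target class: customers spending more than min_spend a year."""
    target = [r for r in records if r["yearly_spend"] > min_spend]
    if not target:
        return None
    ages = [r["age"] for r in target]
    credit_counts = Counter(r["credit"] for r in target)
    return {
        "count": len(target),
        "age_range": (min(ages), max(ages)),
        "share_employed": sum(r["employed"] for r in target) / len(target),
        "most_common_credit": credit_counts.most_common(1)[0][0],
    }


profile = characterize(customers, 5000)
print(profile)
```

Data discrimination would run the same summary on two contrasting groups (for example frequent and infrequent buyers) and compare the two profiles side by side.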
Data Mining Functionalities
• Mining Frequent Patterns, Associations, and
Correlations
– Patterns that occur frequently in data
– Frequent itemset – a set of items that often appear
together in a transactional data set
– What items are frequently purchased together in
Walmart?
– Association analysis
• buys(X, “computer”) -> buys(X, “software”) [support = 2%,
confidence = 60%]
– Frequent sequential pattern
– Output – Association Rules
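Support and confidence for a rule such as buys(X, computer) -> buys(X, software) can be computed directly from a transaction list. The sketch below uses five made-up transactions, so its numbers will not match the 2% / 60% quoted on the slide; the function names are illustrative.

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "printer"},
    {"software"},
    {"computer", "software", "mouse"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return support(set(antecedent) | set(consequent), transactions) / base


rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Support measures how often the whole rule applies across all transactions, while confidence measures how reliable the rule is once the left-hand side is present.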
Data Mining Functionalities
• Classification and Prediction
– Finding models that describe and distinguish classes or
concepts for prediction
– Typical methods:
• Decision trees, naïve Bayesian classification, support vector
machines, neural networks, logistic regression, …
– Typical applications:
• Credit card fraud detection, direct marketing, classifying
stars, diseases, web-pages, …
– Output: Classification Rules (i.e., IF-THEN rules), Decision
Trees, Neural Networks
• age(X, “youth”) AND income(X, “high”) -> class(X, “A”)
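An IF-THEN classification rule like the one above maps directly onto a conditional. The sketch below encodes only that single rule; the function name and the choice to return None when the rule does not fire are assumptions made for illustration.

```python
def classify(age_group, income):
    """Apply the single rule: age = youth AND income = high -> class A.

    Returns None when the rule does not fire; a full classifier would
    carry more rules or a default class.
    """
    if age_group == "youth" and income == "high":
        return "A"
    return None


print(classify("youth", "high"))
print(classify("senior", "high"))
```

A decision tree can be read as a set of such rules, one per path from the root to a leaf.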
Data Mining Functionalities
• Cluster analysis
– Class label is unknown: Group data to form new classes
– Unsupervised learning
– Principle: maximize intra-class similarity and minimize
inter-class similarity

• Outlier analysis
– Outlier: a data object that does not comply with the
general behavior of the data
– Noise or exception?
– Methods: by-product of clustering or regression analysis, …
– useful in fraud detection, rare events analysis
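As one concrete instance of grouping unlabeled data, the sketch below runs k-means (Lloyd's algorithm) on one-dimensional values. The slide does not name a specific clustering method, so k-means is our choice for illustration; the data and the fixed starting centroids are hypothetical and chosen to keep the run deterministic.

```python
def assign(values, centroids):
    """Put each value into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for v in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)
    return clusters


def kmeans_1d(values, initial_centroids, max_iterations=20):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        clusters = assign(values, centroids)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(values, centroids)


# Hypothetical monthly spend values with two visible groups.
spend = [10, 12, 11, 50, 52, 49]
centroids, clusters = kmeans_1d(spend, initial_centroids=[10, 50])
print(centroids, clusters)
```

Values that end up far from every centroid are candidates for the outlier analysis mentioned above, which is one sense in which outliers are a by-product of clustering.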
Data Mining Functionalities
• Trend and evolution analysis
– Trend, time series, and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
• e.g., first buy a digital camera, then buy large SD memory cards
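Trend analysis by regression can be illustrated with an ordinary least-squares line fitted to a short time series. This is a minimal sketch: the monthly sales figures are invented, time is simply indexed 0, 1, 2, … and the function name is ours.

```python
def linear_trend(ys):
    """Least-squares slope and intercept for ys observed at times 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly sales figures.
monthly_sales = [100, 110, 120, 130, 140]
slope, intercept = linear_trend(monthly_sales)
print(f"trend: {slope:+.1f} units per month, starting near {intercept:.1f}")
```

Deviation analysis then looks at how far individual observations fall from the fitted line.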
Are All Patterns Interesting?
• A data mining system has the potential to generate thousands
or even millions of patterns, or rules
• Only a small fraction of the patterns potentially generated
would actually be of interest
• What makes a pattern interesting?
– easily understood
– valid on new or test data with some degree of certainty
– potentially useful, and
– novel
• An interesting pattern represents knowledge
• Measures of pattern interestingness
– Support, confidence, accuracy, coverage, unexpectedness, actionability
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by the contributing disciplines: Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, High-Performance Computing]
Major Issues in Data Mining
• Mining Methodology
– Mining various and new kinds of knowledge
– Mining knowledge in multi-dimensional space
– Data mining: An interdisciplinary effort
– Handling noise, uncertainty, and incompleteness of data

• User Interaction
– Interactive mining
– Incorporation of background knowledge
– Presentation and visualization of data mining results
Major Issues in Data Mining
• Efficiency and Scalability
– Parallel, distributed, stream, and incremental mining
methods

• Diversity of data types
– Mining dynamic, networked, and global data repositories

• Data mining and Society
– Privacy-preserving data mining
Conferences and Journals on Data Mining
• KDD Conferences
– ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
– SIAM Data Mining Conf. (SDM)
– (IEEE) Int. Conf. on Data Mining (ICDM)
– European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
– Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
– Int. Conf. on Web Search and Data Mining (WSDM)
• Other related conferences
– DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
– Web and IR conferences: WWW, SIGIR, WSDM
– ML conferences: ICML, NIPS
– PR conferences: CVPR
• Journals
– Data Mining and Knowledge Discovery (DAMI or DMKD)
– IEEE Trans. on Knowledge and Data Eng. (TKDE)
– KDD Explorations
– ACM Trans. on KDD
