July 2010
Business: Web, e-commerce, transactions, stocks
Science: remote sensing, bioinformatics, scientific simulation
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
Necessity is the mother of invention: data mining, the automated analysis of massive data sets.
Syllabus
MCOMP-502: Data Warehousing & Data Mining
Unit-I
Introduction:
Motivations; Data Mining on different kinds of data; Data Mining Functionalities; Data Mining Task Primitives; Classifications of Data Mining Systems; Major issues in Data Mining.
Data Preprocessing:
Need for data preprocessing; Descriptive Data Summarization; Data Cleaning; Data Integration and Transformation; Data Reduction; Data Discretization and Concept Hierarchy Generation.
July 2010 Data Mining: Concepts and Techniques
Syllabus (continued)
Unit-II
Data Warehouse and OLAP technology for data Mining:
Definition of data warehouse; A Multidimensional Data Model; Data warehouse architecture; Data warehouse implementation; From data warehousing to data mining.
Syllabus (continued)
Unit-III
Mining Frequent Patterns, Associations and Correlations:
Basic Concepts; Efficient and Scalable Frequent Itemset Mining Methods; Mining various kinds of Association rules; From Association Mining to Correlation Analysis; Constraint-Based Association Mining.
Syllabus (continued)
Unit-IV
Classification and Prediction:
Definition of Classification and Prediction; Issues regarding classification and Prediction; Classification by decision tree induction; Bayesian Classification; Rule-based Classification; Classification by Backpropagation; Classification by association rule analysis; Lazy learners; Other classification methods; Prediction; Classification accuracy and error measures.
Syllabus (continued)
Unit-IV (continued)
Cluster Analysis:
Definition of Cluster; Types of data in cluster analysis; A categorization of major clustering methods; Partitioning methods; Hierarchical methods; Density-based methods; Grid-based methods; Model-based clustering methods; Outlier analysis.
Syllabus (continued)
Unit-V
Applications and Trends in Data Mining
Mining Data Streams; Mining Time-Series Data; Mining Sequence Patterns in Transactional Databases; Mining Sequence Patterns in Biological Data; Graph Mining; Spatial Data Mining; Multimedia Data Mining; Text Mining; Mining the World Wide Web; Data Mining Applications and Trends.
Chapter 1. Introduction
Motivation: Why Data Mining?
What Is Data Mining?
Data Mining: On What Kind of Data?
A Multi-Dimensional View of Data Mining
Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
Evaluation of Knowledge
Data Mining Task Primitives
Classification of Data Mining Systems
Major Issues in Data Mining
Applications of Data Mining
Major Challenges in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
Data mining: extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Not everything is data mining: simple search and query processing, and (deductive) expert systems, are not.
Evolution of Sciences
Theoretical science: each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
Computational science: over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics). Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
Data science: driven by the flood of data from new scientific instruments and simulations, the ability to economically store and manage petabytes of data online, and the Internet and computing Grid that make all these archives universally accessible. Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
Data collection, database creation, hierarchical and network database systems
Relational data model, relational DBMS implementation; relational database systems, ER model, indexing, accessing, query languages, forms, reports, and OLTP
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s: Data mining, data warehousing, multimedia databases, and Web databases
2000s: Stream data management and mining; data mining and its applications
This is a view from typical database systems and data warehousing communities.
Data mining plays an essential role in the knowledge discovery process.
[Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection and Transformation -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
[Figure: Input Data -> Data Preprocessing -> Data Mining -> Post-Processing]
Data cleaning
Data integration from multiple sources
Warehousing the data
Data cube construction
Data selection for data mining
Data mining
Presentation of the mining results
Patterns and knowledge to be used or stored into a knowledge base
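The steps above can be sketched as a chain of small functions. This is a minimal illustration rather than a real system: the record layout, the two source lists, and every function name are assumptions made for the example, and the "mining" step is only a frequency count.

```python
from collections import Counter

# Hypothetical raw records from two sources (illustrative data only).
store_a = [{"customer": "c1", "item": "milk"}, {"customer": "c2", "item": None}]
store_b = [{"customer": "c1", "item": "bread"}, {"customer": "c3", "item": "milk"}]

def clean(records):
    """Data cleaning: drop records with a missing item."""
    return [r for r in records if r["item"] is not None]

def integrate(*sources):
    """Data integration: merge records from multiple sources."""
    return [r for source in sources for r in source]

def select(records, items_of_interest):
    """Data selection: keep only task-relevant records."""
    return [r for r in records if r["item"] in items_of_interest]

def mine(records):
    """Data mining (toy stand-in): count how often each item occurs."""
    return Counter(r["item"] for r in records)

def present(patterns):
    """Presentation: format patterns, most frequent first."""
    return [f"{item}: {count}" for item, count in patterns.most_common()]

warehouse = integrate(clean(store_a), clean(store_b))
relevant = select(warehouse, {"milk", "bread"})
report = present(mine(relevant))
print(report)
```

Each function corresponds to one stage, so a stage can be swapped (for example, a real mining algorithm in place of `mine`) without touching the others.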
[Figure: layers from top to bottom]
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Business objects vs. data mining tools
Supply chain example: tools
Data presentation
Exploration
Health care and medical data mining has often adopted such a view in statistics and machine learning:
Preprocessing of the data (including feature extraction and dimension reduction)
Classification and/or clustering processes
Post-processing for presentation
[Figure: Data Mining and related disciplines: Statistics, Applications, Visualization, Algorithm, Database Technology, High-Performance Computing]
Algorithms must be highly scalable to handle data volumes such as terabytes
High dimensionality of data: a micro-array may have tens of thousands of dimensions
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks, and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text, and Web data
Software programs, scientific simulations
The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm. British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
Retail
Web page analysis: from web page classification and clustering to PageRank & HITS algorithms
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
From major dedicated data mining systems/tools (e.g., SAS, MS SQL Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks, and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia databases
Text databases
The World-Wide Web
Knowledge to be mined (or: data mining functions): characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.; descriptive vs. predictive data mining; multiple/integrated functions and mining at multiple levels
Data to be mined: relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Techniques utilized: database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted: retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Data cleaning, transformation, integration, and multidimensional data model
Scalable methods for computing (i.e., materializing) multidimensional aggregates
OLAP (online analytical processing)
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
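To make "materializing multidimensional aggregates" concrete, the sketch below computes a SUM for every subset of the dimensions (every cuboid of a tiny cube). It is a naive illustration, not one of the scalable methods the slide refers to; the fact rows, the dimension names, and the function `materialize_cube` are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, season, rainfall_mm).
rows = [
    ("north", "summer", 10),
    ("north", "winter", 40),
    ("south", "summer", 5),
    ("south", "winter", 15),
]
dimensions = ("region", "season")

def materialize_cube(rows, dimensions):
    """Compute SUM(measure) for every subset of dimensions (every cuboid)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in dims)
                totals[key] += row[-1]  # the measure is the last field
            cube[tuple(dimensions[i] for i in dims)] = dict(totals)
    return cube

cube = materialize_cube(rows, dimensions)
print(cube[()])            # grand total (apex cuboid)
print(cube[("region",)])   # rainfall rolled up over season
```

Comparing the `north` and `south` totals in `cube[("region",)]` is a small instance of contrasting data characteristics such as dry vs. wet regions. The number of cuboids grows exponentially with the number of dimensions, which is why scalable methods matter.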
What items are frequently purchased together in your Walmart?
A typical association rule relates an antecedent item, such as Computer, to the items frequently purchased with it.
How to use such patterns for classification, clustering, and other applications?
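An association rule is usually reported with two numbers, support and confidence. The sketch below computes both for one rule over a handful of transactions. The transactions and the rule {computer} -> {software} are hypothetical and chosen only for illustration; they are not taken from the slide.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "printer"},
    {"software"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

s = support({"computer", "software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)
print(f"support={s:.2f} confidence={c:.2f}")
```

Here two of four transactions contain both items, and two of the three computer purchases also include software.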
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas mileage
Predict some unknown class labels
Typical methods: decision trees, naive Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, etc.
Typical applications: credit card fraud detection, direct marketing, classifying stars, diseases, web pages, etc.
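As a small instance of constructing a model from training examples, the sketch below learns a one-level decision tree (a decision stump) that classifies cars by gas mileage. The training data, the labels "low" and "high", and the assumption that "high" lies above the threshold are all made up for the example.

```python
# Hypothetical training examples: (gas mileage in mpg, class label).
training = [(15, "low"), (18, "low"), (22, "low"),
            (30, "high"), (35, "high"), (40, "high")]

def predict(threshold, mileage):
    """Assumed rule shape: 'high' above the threshold, 'low' otherwise."""
    return "high" if mileage > threshold else "low"

def train_stump(examples):
    """Pick the mileage threshold with the fewest training errors."""
    values = sorted(x for x, _ in examples)
    # Candidate thresholds: midpoints between consecutive sorted values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(predict(threshold, x) != y for x, y in examples)

    return min(candidates, key=errors)

threshold = train_stump(training)
print(threshold, predict(threshold, 33), predict(threshold, 17))
```

The learned threshold is the model; `predict` then assigns a class label to cars that were not in the training set.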
Unsupervised learning (i.e., class label is unknown)
Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
Principle: maximizing intra-class similarity and minimizing inter-class similarity
Many methods and applications
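One of the many methods is k-means, a partitioning method. The sketch below runs it on a few 2-D house locations. The coordinates and the two starting centers are assumptions for illustration, and the fixed starting centers replace the random initialization that is common in practice.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points, starting from the given centers."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical house locations (illustrative data only).
houses = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(houses, centers=[(0, 0), (10, 10)])
print(centers)
print(clusters)
```

Houses within a cluster end up close to their center (high intra-class similarity), while the two centers stay far apart (low inter-class similarity).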
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
Noise or exception? One person's garbage could be another person's treasure
Methods: by-product of clustering or regression analysis
Useful in fraud detection, rare events analysis
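The sketch below shows outliers falling out as a by-product of regression analysis: fit a least-squares line, then flag points whose residual is unusually large. The data, the function names, and the cutoff of twice the root-mean-square residual are assumptions chosen for the example, not a recommended standard.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def residual_outliers(xs, ys, factor=2.0):
    """Indices whose absolute residual exceeds factor * RMS residual."""
    a, b = fit_line(xs, ys)
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    rms = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
    return [i for i, r in enumerate(residuals) if abs(r) > factor * rms]

# Hypothetical data: y = 2x everywhere except one corrupted point at index 5.
xs = list(range(10))
ys = [2 * x for x in xs]
ys[5] = 40
print(residual_outliers(xs, ys))
```

Whether the flagged point is noise to discard or an exception worth investigating (for example, a fraud case) is a decision the analysis itself cannot make.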
Evaluation of Knowledge
One can mine a tremendous amount of patterns and knowledge
Some may fit only a certain dimension space (time, location, etc.)
Some may not be representative, may be transient, etc.
Interestingness measures:
A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.
Descriptive vs. predictive; coverage; typicality vs. novelty; accuracy; timeliness; etc.
Example:
Suppose, as a marketing manager of AllElectronics, you would like to classify customers based on their buying patterns. You are especially interested in those customers whose salary is no less than $40,000, and who have bought more than $1,000 worth of items, each of which is priced at no less than $100. In particular, you are interested in the customer's age, income, the types of items purchased, the purchase location, and where the items were made. You would like to view the resulting classification in the form of rules. This data mining query is expressed in DMQL as follows, where each line of the query has been enumerated to aid in our discussion.
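Independently of DMQL syntax, the task-relevant data described in this example can be sketched in Python. This is not the DMQL query and does not reproduce it. The record layout, field names, and sample values are hypothetical, "salary" is read as the `income` field, and the spending condition is read as the total over items priced at no less than $100.

```python
# Hypothetical customer and purchase records (illustrative data only).
customers = {
    "c1": {"age": 34, "income": 52000},
    "c2": {"age": 45, "income": 38000},
    "c3": {"age": 29, "income": 61000},
}
purchases = [
    {"customer": "c1", "item_type": "laptop", "price": 900,
     "place_purchased": "Chicago", "place_made": "China"},
    {"customer": "c1", "item_type": "camera", "price": 300,
     "place_purchased": "Chicago", "place_made": "Japan"},
    {"customer": "c2", "item_type": "tv", "price": 1500,
     "place_purchased": "Boston", "place_made": "Korea"},
    {"customer": "c3", "item_type": "cable", "price": 20,
     "place_purchased": "Denver", "place_made": "USA"},
    {"customer": "c3", "item_type": "phone", "price": 700,
     "place_purchased": "Denver", "place_made": "China"},
]

def task_relevant(customers, purchases, min_income=40000, min_total=1000, min_price=100):
    """Select purchase rows, joined with customer attributes, that meet the criteria."""
    # Keep only items priced at no less than min_price.
    qualifying = [p for p in purchases if p["price"] >= min_price]
    # Total spent per customer on qualifying items.
    totals = {}
    for p in qualifying:
        totals[p["customer"]] = totals.get(p["customer"], 0) + p["price"]
    selected = {
        cid for cid, total in totals.items()
        if total > min_total and customers[cid]["income"] >= min_income
    }
    # Attach age and income so a classifier could use them as attributes.
    return [dict(p, **customers[p["customer"]]) for p in qualifying if p["customer"] in selected]

relevant_rows = task_relevant(customers, purchases)
for row in relevant_rows:
    print(row)
```

The resulting rows carry the attributes named in the example (age, income, item type, purchase location, place made) and would be the input to a rule-producing classifier.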
General functionality
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted
Handling relational and complex types of data
Mining information from heterogeneous databases and global information systems (WWW)
Application of discovered knowledge: domain-specific data mining tools; intelligent query answering; process control and decision making
Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
Protection of data security, integrity, and privacy
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods
Handling high dimensionality
Handling noise, uncertainty, and incompleteness of data
Incorporation of constraints, expert knowledge, and background knowledge in data mining
Pattern evaluation and knowledge integration
Mining diverse and heterogeneous kinds of data, e.g., bioinformatics, Web, software/system engineering, information networks
Application-oriented and domain-specific data mining
Invisible data mining (embedded in other functional modules)
Protection of security, integrity, and privacy in data mining
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998: International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining: PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
KDD Conferences
ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
SIAM Data Mining Conf. (SDM)
(IEEE) Int. Conf. on Data Mining (ICDM)
Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Other related conferences
ACM SIGMOD; VLDB; (IEEE) ICDE; WWW, SIGIR; ICML, CVPR, NIPS
Journals
Data Mining and Knowledge Discovery (DAMI or DMKD)
IEEE Trans. on Knowledge and Data Eng. (TKDE)
KDD Explorations
ACM Trans. on KDD
Summary
Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining
What are the motivations for data mining?
What are the challenges in data mining? Explain in detail.
Discuss and explain the terms discrimination, generalization, and characterization.
Explain the architecture of a typical DM system with a neat diagram.
Explain the taxonomy of data mining tasks.
Define each of the following data mining functionalities: characterization, discrimination, association and correlation analysis, classification, prediction, clustering, and evolution analysis.
What are the various data mining functionalities?
What are the measures of pattern interestingness?
Explain the various data mining task primitives, OR explain the different ways of user interaction with the data mining system.
Discuss the issues related to integration of a data mining system with a database or data warehouse system.
What is a data warehouse?
How is a data warehouse different from a database? How are they similar?
Briefly describe the following advanced database systems and applications: relational databases, transactional databases, object-relational databases, spatial databases, text databases, multimedia databases, stream data, and the World Wide Web.
Describe why concept hierarchies are important and useful in data mining.
Discuss the differences between the following approaches: no coupling, loose coupling, semi-tight coupling, and tight coupling.