
Data Mining

Graduate Program, Faculty of Engineering

JTETI - UGM

Indriana Hidayah
References
1. Witten, Ian H. and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition. Morgan Kaufmann Publishers. 2005.
2. Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann Publishers. 2012.
3. Liu, Bing. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer. 2007.
Lecture plan
RPKPS (Rencana Program Kegiatan Pembelajaran Semester: Semester Learning Activity Program Plan)

• General Instructional Objectives
– To introduce students to the basic concepts and techniques of data mining.
– To develop skills in using recent data mining software to solve practical problems.
– To gain experience in independent study and research.
Lecture plan
RPKPS (Rencana Program Kegiatan Pembelajaran Semester: Semester Learning Activity Program Plan)

• Specific Instructional Objectives for Each Topic (Subject Matter)

Understand the elements detailed below.
– What is data mining
– Input: Concepts, instances, attributes
– Output: Knowledge representation
– Algorithms: The basic method
– Credibility: Evaluating what has been learned
– Advanced Data Mining: Implementation
– Data Transformation
– WEKA Data Mining Implementation.
Today’s topic
• What is data mining:
(1) data mining and machine learning;
(2) simple examples;
(3) machine learning and statistics;
(4) generalization as search.
What Is Data Mining?
• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases
• Alternative names:
– Data mining: a misnomer?
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Source: Jiawei Han's slide

9/13/2013 Data Mining: Concepts and Techniques 7


Why data mining?
• The motivation:
– Data explosion problem: automated data collection tools and mature database technology lead to big data in databases and other repositories
– Data rich but information poor!
• Solution: data warehousing and data mining
– Data warehousing and on-line analytical processing (OLAP)
– Data mining: extraction of interesting knowledge (rules, patterns, constraints) from data in large databases



Evolution of Database Technology
• 1960s:
– Data collection, change from primitive file processing to database systems
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s-2000s:
– Data mining and data warehousing, multimedia databases, and Web databases



How about machine learning?
• Data mining is defined as the process of discovering useful patterns, automatically or semi-automatically, in large quantities of data.
• Whereas machine learning is…
– Learning (noun): the cognitive process of acquiring skill or knowledge (WordWeb 6.6)
– Thus, machine learning can be thought of as the machine (i.e., the computer) going through a process of acquiring skill or knowledge.
• So…
– What is the relationship between data mining and machine learning?
Potential Applications (1)
• Data mining is applied across a multidisciplinary field, involving:
– machine learning,
– statistics,
– databases,
– artificial intelligence, and
– pattern recognition
• Web usage mining
• Text mining
Simple example
Contact lens prescription

• The patterns can be:
– Classification
– Presented in a decision tree
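
A decision tree can be read as a series of nested tests on attribute values. The sketch below hard-codes a small tree of the shape commonly shown for contact lens data. The attribute names, their values, and the branch outcomes are assumptions for illustration, not the output of a learning algorithm run on the slide's data.

```python
def recommend_lens(tear_rate: str, astigmatism: str, prescription: str) -> str:
    """Walk a hand-written decision tree and return a lens recommendation.

    All attribute names and values here are illustrative assumptions.
    """
    # Root test: tear production rate.
    if tear_rate == "reduced":
        return "none"
    # Second test: astigmatism.
    if astigmatism == "no":
        return "soft"
    # Third test: spectacle prescription.
    if prescription == "myope":
        return "hard"
    return "none"


print(recommend_lens("normal", "no", "myope"))
```

Each path from the root to a leaf corresponds to one classification rule, which is why the same pattern can be presented either as a tree or as a rule set.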



More realistic example: vertebral column

pelvic_incidence  pelvic_tilt  lumbar_lordosis_angle  sacral_slope  pelvic_radius  degree_spondylolisthesis  Class attribute
63.0278175        22.55258597  39.60911701            40.47523153   98.67291675    -0.254399986              Hernia
39.05695098       10.06099147  25.01537822            28.99595951   114.4054254    4.564258645               Hernia
68.83202098       22.21848205  50.09219357            46.61353893   105.9851355    -3.530317314              Hernia
69.29700807       24.65287791  44.31123813            44.64413017   101.8684951    11.21152344               Hernia
49.71285934       9.652074879  28.317406              40.06078446   108.1687249    7.918500615               Hernia
40.25019968       13.92190658  25.1249496             26.32829311   130.3278713    2.230651729               Hernia
53.43292815       15.86433612  37.16593387            37.56859203   120.5675233    5.988550702               Hernia
45.36675362       10.75561143  29.03834896            34.61114218   117.2700675    -10.67587083              Hernia
43.79019026       13.5337531   42.69081398            30.25643716   125.0028927    13.28901817               Hernia
36.68635286       5.010884121  41.9487509             31.67546874   84.24141517    0.664437117               Hernia
49.70660953       13.04097405  31.33450009            36.66563548   108.6482654    -7.825985755              Hernia
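
Each row of the table is one instance: six numeric attributes followed by a class attribute. A minimal sketch of that representation follows, using the first two rows of the table. The `Instance` type and the `parse_row` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Attribute names taken from the table header.
ATTRIBUTES = (
    "pelvic_incidence", "pelvic_tilt", "lumbar_lordosis_angle",
    "sacral_slope", "pelvic_radius", "degree_spondylolisthesis",
)


@dataclass(frozen=True)
class Instance:
    values: tuple  # one float per entry in ATTRIBUTES
    label: str     # the class attribute


def parse_row(row: str) -> Instance:
    """Split a whitespace-separated row into numeric attributes and a label."""
    *numbers, label = row.split()
    return Instance(tuple(float(x) for x in numbers), label)


ROWS = [
    "63.0278175 22.55258597 39.60911701 40.47523153 98.67291675 -0.254399986 Hernia",
    "39.05695098 10.06099147 25.01537822 28.99595951 114.4054254 4.564258645 Hernia",
]
instances = [parse_row(r) for r in ROWS]

for inst in instances:
    print(dict(zip(ATTRIBUTES, inst.values)), "->", inst.label)
```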


Data mining process
• As a process, data mining encompasses three main steps:
– Pre-processing → dealing with unsuitable raw data
– Data mining → applying a data mining method
– Post-processing → interpreting mined patterns
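
The three steps above can be sketched as a small pipeline. Everything in this example is an assumption made for illustration: the toy records, the attribute names `outlook` and `play`, and the choice of a majority-class rule per attribute value as the mining method.

```python
from collections import Counter


def preprocess(rows):
    """Pre-processing: drop records that contain a missing value (None)."""
    return [r for r in rows if None not in r]


def mine(rows):
    """Data mining: find the majority class for each attribute value."""
    counts = {}
    for value, label in rows:
        counts.setdefault(value, Counter())[label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def postprocess(rules):
    """Post-processing: render the mined patterns as readable if-then rules."""
    return [f"if outlook = {v} then play = {label}"
            for v, label in sorted(rules.items())]


# Hypothetical raw data: (outlook, play) pairs, one with a missing value.
raw = [("sunny", "no"), ("sunny", "no"), ("rainy", "yes"),
       (None, "yes"), ("rainy", "yes"), ("sunny", "yes")]

for rule in postprocess(mine(preprocess(raw))):
    print(rule)
```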
Architecture of a Typical Data Mining System
[Figure: layered architecture diagram. Components, from top to bottom: graphical user interface; pattern evaluation; data mining engine; database or data warehouse server; data cleaning, data integration, and filtering; databases and data warehouse; plus a knowledge base.]
Another example: Directed marketing
(S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology.)

• Problem:
– A vast and increasing number of marketing campaigns
– A globally competitive world
– Mass campaigns are ineffective
• Solution:
– Directed campaigns with a strict and rigorous selection of contacts.
• Focus on targets that presumably will be keener on that specific product/service
• More efficient, with reduced costs and time
• The dataset:
– A Portuguese marketing campaign for bank deposit subscriptions.
– The dataset covers 17 campaigns that occurred between May 2008 and November 2010, corresponding to a total of 79354 contacts.
– For each contact, the following were recorded:
• a large number of attributes
• the target variable (class attribute)
– There were 6499 successes (8% success rate).
Steps
1. Goal definition
– To predict whether a client will subscribe to the deposit
– Classification task
2. Simple data pre-processing (Data Preparation phase)
– Non-conclusive instances were discarded, leaving a total of 55817 contacts.
– Attribute reduction, leading to 29 attributes and 1 class attribute
– Instances that contained missing values were discarded, leaving 45211 instances (5289 of which were successful, an 11.7% success rate).
3. Data mining step (Modeling phase), using Naive Bayes (NB), decision trees (DT), and support vector machines (SVM)
– The dataset was randomly divided into training (2/3) and test (1/3) sets
4. Evaluation of the model
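
The success rates and the random 2/3 to 1/3 split quoted in these steps can be checked with a few lines of arithmetic. This is a sketch only: the integer indices stand in for the real instances, and the seed and the rounding of the training size are assumptions, since the slides do not state how the authors performed the split.

```python
import random


def success_rate(successes: int, contacts: int) -> float:
    """Return the percentage of contacts that ended in a subscription."""
    return 100.0 * successes / contacts


def train_test_split(instances, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the instances and cut it into training and test sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


raw_rate = success_rate(6499, 79354)    # before pre-processing
clean_rate = success_rate(5289, 45211)  # after pre-processing

# Indices 0..45210 stand in for the 45211 cleaned instances.
train, test = train_test_split(range(45211))

print(f"{raw_rate:.1f}% -> {clean_rate:.1f}%")
print(len(train), len(test))
```

The success rate rises after pre-processing because the discarded instances were mostly unsuccessful contacts, not because the campaign changed.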
Conclusion
• Call duration is the most relevant feature, meaning that longer calls tend to increase successes.
• In second place comes the month of contact.
• Success is most likely to occur in the last month of each trimester (March, June, September and December).
• Such knowledge can be used to shift campaigns to occur in those months.
Data Mining: On What Kind of Data?

• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
Functionality
Knowledge produced by data mining
• In data mining terms, knowledge means a useful pattern
• The pattern should be
– Useful
– Valid
– Understandable
• Pattern types that can be produced by data mining methods:
– Frequent patterns, associations, correlations
– Data characterization and discrimination
– Classification and prediction
– Clusters
Frequent pattern, association, correlation

• Patterns that occur frequently in data
– Frequent itemsets
– Frequent subsequences
– Frequent substructures
• Leading to associations and correlations within data
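
How often an itemset occurs, and how strongly one itemset implies another, are usually quantified by support and confidence. The sketch below computes both on a hypothetical list of shopping transactions. The items and the transactions are invented for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical market-basket data.
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

print(support({"bread", "milk"}, transactions))       # itemset frequency
print(confidence({"bread"}, {"milk"}, transactions))  # rule bread -> milk
```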
Classification and prediction
Cluster analysis
Are All the “Discovered” Patterns Interesting?
• A data mining system may generate thousands of patterns; not all of them are interesting.
– Suggested approach: human-centered, query-based, focused mining
• Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures:
– Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
– Subjective: based on the user’s beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.



Can We Find All and Only Interesting Patterns?

• Search for only interesting patterns: optimization
– Can a data mining system find only the interesting patterns?
– Approaches
• First generate all the patterns and then filter out the uninteresting ones.
• Generate only the interesting patterns: mining query optimization
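
The two approaches can be contrasted on frequent itemsets, under the assumption that "interesting" simply means "occurs in at least `min_count` transactions". The first function enumerates every possible itemset and filters afterwards. The second uses level-wise candidate pruning in the style of Apriori, swapped in here as one possible way to avoid generating uninteresting candidates. The transactions are hypothetical.

```python
from itertools import combinations


def count(itemset, transactions):
    """Number of transactions that contain the whole itemset."""
    return sum(itemset <= t for t in transactions)


def generate_then_filter(transactions, min_count):
    """Approach 1: enumerate all itemsets, then keep the frequent ones."""
    items = sorted(set().union(*transactions))
    candidates = (frozenset(c)
                  for k in range(1, len(items) + 1)
                  for c in combinations(items, k))
    return {c for c in candidates if count(c, transactions) >= min_count}


def levelwise(transactions, min_count):
    """Approach 2: grow itemsets level by level, pruning hopeless candidates."""
    items = sorted(set().union(*transactions))
    level = {frozenset([i]) for i in items
             if count(frozenset([i]), transactions) >= min_count}
    frequent = set(level)
    while level:
        size = len(next(iter(level))) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # A candidate can only be frequent if all its smaller subsets are.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, size - 1))}
        level = {c for c in candidates if count(c, transactions) >= min_count}
        frequent |= level
    return frequent


transactions = [frozenset(t) for t in (
    {"bread", "milk"}, {"bread", "butter"},
    {"bread", "milk", "butter"}, {"milk"},
)]
print(levelwise(transactions, 2) == generate_then_filter(transactions, 2))
```

Both functions return the same itemsets. The difference is how many candidates each has to count along the way, which grows exponentially with the number of items in the first approach.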



Machine learning and statistics
• Both lie on a continuum of data analysis techniques
– Some derive from the skills taught in standard statistics courses,
– others are more closely associated with algorithms that have arisen out of computer science.
Generalization as search
• One way of visualizing the problem of learning—
and one that distinguishes it from statistical
approaches—is to imagine a search through a
space of possible concept descriptions for one
that fits the data.
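
A minimal sketch of this view: enumerate every concept description in a small space and keep those that fit the labelled examples. The attributes, their values, the `?` wildcard meaning "any value", and the examples are all assumptions chosen to keep the space tiny (nine descriptions).

```python
from itertools import product

# Hypothetical attributes and their possible values.
ATTRIBUTE_VALUES = {
    "outlook": ("sunny", "rainy"),
    "windy": ("yes", "no"),
}
WILDCARD = "?"  # matches any value of the attribute


def hypothesis_space():
    """Yield every conjunctive description: one value or a wildcard per attribute."""
    names = list(ATTRIBUTE_VALUES)
    choices = [ATTRIBUTE_VALUES[n] + (WILDCARD,) for n in names]
    for combo in product(*choices):
        yield dict(zip(names, combo))


def covers(hypothesis, instance):
    """True if the instance satisfies every condition in the description."""
    return all(v == WILDCARD or instance[a] == v for a, v in hypothesis.items())


def search(examples):
    """Return the descriptions that cover all positives and no negatives."""
    return [h for h in hypothesis_space()
            if all(covers(h, x) == label for x, label in examples)]


examples = [
    ({"outlook": "sunny", "windy": "no"}, True),
    ({"outlook": "sunny", "windy": "yes"}, False),
    ({"outlook": "rainy", "windy": "no"}, True),
]
print(search(examples))
```

Exhaustive enumeration only works for toy spaces like this one. Practical learners search the same kind of space heuristically instead of listing every description.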
