
B. TECH VI Semester
COMPUTER SCIENCE AND ENGINEERING
VCE-R15 2018 – 2019
DATA WAREHOUSING AND
DATA MINING (A3522)
UNIT – I
INTRODUCTION TO DATA MINING,
PREPROCESSING
A. BHANU PRASAD
Associate Professor, Dept. of CSE
9885990509
andrajub4u@gmail.com
http://andrajub4u.blogspot.com

VARDHAMAN COLLEGE OF ENGINEERING


(AUTONOMOUS)
Shamshabad – 501218, Hyderabad, Telangana
UNIT – I CONTENTS
1. INTRODUCTION TO DATA MINING
1.1 Motivation, Importance, Definition of Data Mining
1.2 KDD Process, Kinds of Data
1.3 Data Mining Functionalities, Kinds of Patterns
1.4 Data Mining Technologies, Classification of Data Mining Systems
1.5 Data Mining Task Primitives
1.6 Integration of A Data Mining System with A Database or Data Warehouse
System, Major Issues In Data Mining
1.7 Data Objects and Attribute Types, Basic Statistical Descriptions of Data
1.8 Data Visualization
1.9 Measuring Data Similarity and Dissimilarity
DATA PREPROCESSING
1.10 Data Quality, Major Tasks in Data Preprocessing
1.11 Data Cleaning
1.12 Data Integration
1.13 Data Reduction
1.14 Data Transformation and Data Discretization

Text Books / References / Websites
TEXT BOOKS:
1. Jiawei Han, Micheline Kamber, Jian Pei (2012), Data Mining: Concepts
and Techniques, 3rd edition, Elsevier, United States of America.

REFERENCE BOOKS:
1. Margaret H Dunham (2006), Data Mining: Introductory and
Advanced Topics, 2nd edition, Pearson Education, New Delhi, India.
2. Amitesh Sinha (2007), Data Warehousing, Thomson Learning, India.
3. Xindong Wu, Vipin Kumar (2009), The Top Ten Algorithms in Data
Mining, CRC Press, UK.
4. Max Bramer (2007), Principles of Data Mining, Springer, USA.

1. INTRODUCTION TO DATA MINING
1.1. Motivation
Evolution of Sciences
 Before 1600, Empirical science
 Empirical data is produced by experiment and observation.
 1600-1950s, Theoretical science
 Each discipline has grown a theoretical component. Theoretical models
often motivate experiments and generalize our understanding.
 1950s-1990s, Computational science
 Over the last 50 years, most disciplines have grown a third, computational
branch (e.g. empirical, theoretical, and computational ecology, or physics,
or linguistics.)
 Computational Science traditionally meant simulation. It grew out of our
inability to find closed-form solutions for complex mathematical models.
 1990-now, Data science
 The flood of data from new scientific instruments and simulations.
 The ability to economically store and manage petabytes of data online.
 The Internet and computing Grid that makes all these archives universally
accessible.
 Scientific information management, acquisition, organization, query,
and visualization tasks scale almost linearly with data volumes. Data
mining is a major new challenge!
Data Collection Sources
 Banking & Transactions
 Social media
 Database systems
 Search Engines
Data Mining as the Evolution of
Information Technology
 Data mining can be viewed as a result of the natural evolution of
information technology. The database and data management industry
evolved in the development of several critical functionalities, data
collection and database creation, data management (including data
storage and retrieval and database transaction processing), and
advanced data analysis (involving data warehousing and data mining).
 1960s and earlier:
 Data collection, database creation, Information Management
System (IMS) and network DBMS
 Primitive file processing
 1970s:
 Database Management Systems, Relational data model,
 Hierarchical and network database systems
 Relational DBMS implementation
 Query languages: SQL
 Transactions, concurrency control and recovery.
 On-line Transaction Processing (OLTP)
Contd..
 1980s:
 Advanced data models : Extended relational, object-relational
 Advanced application-oriented DBMS :spatial, scientific,
engineering, temporal, multimedia, active, stream and sensor,
knowledge-based
 Advanced Data Analysis : Data warehouse and OLAP, Data mining
and knowledge discovery, Advanced data mining applications, Data
mining and society
 1990s—2000s:
 Data mining and data warehousing, multimedia databases, and
Web databases
 XML-based database systems
 Integration with information retrieval
 Data and information integration
 Present – future:
 New generation of integrated data and information system.

Information Hierarchy (DIKW Pyramid)
 The DIKW pyramid / hierarchy refers to a class of models for representing
structural and/or functional relationships between Data, Information,
Knowledge, and Wisdom.
 Data: raw facts, signals, symbols, or pictures, structured or
unstructured, that can be processed by a computer.
Ex: 98.6ºF, 99.5ºF, 100.3ºF, 101ºF, …
 Information: data processed and presented in a particular manner.
Provides answers to "who", "what", "where", and "when" questions.
Ex: Hourly body temperature: 98.6ºF, 99.5ºF, 100.3ºF, 101ºF…
 Knowledge: information read, heard, or seen and understood.
Answers "how" questions and gives conclusions.
Ex: If you have a temperature above 100ºF, you most likely have a fever.
 Wisdom: evaluated understanding of "why" that supports decisions.
Ex: If you don’t feel well, go see a doctor.
(Pyramid, bottom to top: Data → Information → Knowledge → Wisdom;
each level is more refined and abstract than the one below.)
Definition of Data Mining
Data mining (knowledge discovery in databases):
 Extraction or mining of interesting (implicit, previously unknown and
potentially useful) information / knowledge or patterns from large
amounts of databases.
 Process of analyzing data from different perspectives and
summarizing it into useful information that can be used to
increase revenue, cut costs, or both.
Alternative names:
 Knowledge Discovery (mining) in Databases (KDD)
 knowledge extraction
 data/pattern analysis
 data dredging
 information harvesting
 business intelligence, etc.
What is not data mining?
 Not (Deductive) query processing.
 Not Expert systems or statistical programs
Data Mining — Potential Applications
Market analysis and management
Where are the data sources for analysis?
 Credit card transactions, loyalty cards, discount coupons, customer
complaint calls, plus (public) lifestyle studies
Target marketing
 Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
 Conversion of single to a joint bank account: marriage, etc.
Cross-market analysis
 Associations/co-relations between product sales
 Prediction based on the association information
Customer profiling
 data mining can tell you what types of customers buy what
products (clustering or classification)
Contd..
Identifying customer requirements
 identifying the best products for different customers
 use prediction to find what factors will attract new customers
Provides summary information
 various multidimensional summary reports
 statistical summary information (data central tendency and
variation)
Corporate Analysis and Risk Management
Finance planning and asset evaluation
 cash flow analysis and prediction
 contingent claim analysis to evaluate assets
 cross-sectional and time series analysis (financial-ratio, trend
analysis, etc.)
Resource planning:
 summarize and compare the resources and spending
Competition:
 monitor competitors and market directions
 group customers into classes and a class-based pricing procedure
 set pricing strategy in a highly competitive market
Fraud Detection and Management
Approach
 use historical data to build models of fraudulent behavior and use data
mining to help identify similar instances
Examples
 Auto insurance: detect a group of people who stage accidents to collect
on insurance
 Money laundering: detect suspicious money transactions
(US Treasury's Financial Crimes Enforcement Network)
 Medical insurance: detect professional patients and ring of doctors
and ring of references
Detecting inappropriate medical treatment
 Australian Health Insurance Commission identified that in many cases
blanket screening tests were requested (saving Australian $1M/yr).
Detecting telephone fraud
 Telephone call model: destination of the call, duration, time
of day or week. Analyze patterns that deviate from an expected norm.
 British Telecom identified discrete groups of callers with frequent
intra-group calls and broke a multimillion dollar fraud.
Other Applications
Sports
 IBM Advanced Scout analyzed NBA game statistics (shots blocked,
assists, and fouls) to gain competitive advantage for New York Knicks
and Miami Heat
Astronomy
 JPL and the Palomar Observatory discovered 22 quasars with the help
of data mining
Internet Web Surf-Aid
 IBM Surf-Aid applies data mining algorithms to Web access logs for
market-related pages to discover customer preference and behavior
pages, analyzing effectiveness of Web marketing, improving Web site
organization, etc.
Text mining (news group, email, documents) and Web analysis.
Intelligent query answering
Database Processing Vs. Data Mining
Query
– Database processing: well defined; expressed in SQL
– Data mining: poorly defined; no precise query language
Data
– Database processing: operational / transactional / current data
– Data mining: not operational data; historical data
Output
– Database processing: precise; a subset of the database
– Data mining: fuzzy; not a subset of the database
Query Examples
– Database processing: Find all credit card applicants with last name
of Ramesh. Find all customers who have purchased Data Mining
book. Identify customers who have purchased more than Rs 10,000
in the last month.
– Data mining: Find all credit card applicants who are poor credit
risks. (Classification) Find all books which are frequently purchased
with Data Mining book. (Association rules) Identify customers with
similar buying habits. (Clustering)
1.2 Knowledge Discovery in Databases (KDD)
Process
The KDD process is shown in Figure 1.2(a) as an iterative sequence of the
following steps:
1) Data cleaning
2) Data integration
3) Data selection
4) Data transformation
5) Data mining
6) Pattern evaluation
7) Knowledge presentation
 The ultimate goal is to extract high-level knowledge from low-level
data.
 Major KDD application areas include:
 marketing, fraud detection, telecommunication and manufacturing.
KDD Process
Fig 1.2 (a): Data mining as a step in the process of knowledge discovery.
(Databases with initial data → target data → transformed / task-relevant
data → patterns / model → knowledge.)
Steps of a KDD Process
Firstly identify the goal of the KDD process from the customer’s
perspective. Understand application domains and relevant prior
knowledge. Then select a target data set or subset of data samples
on which discovery is to be performed.
1) Data cleaning to remove noise and inconsistent data.
2) Data integration where multiple data sources may be combined.
3) Data selection where data relevant to the analysis task are
retrieved from the database.
4) Data transformation where data are transformed (converted)
and consolidated into new and common format(s) appropriate
for mining.
 Steps 1 through 4 are different forms of data preprocessing,
where data are prepared for mining.
KDD Process Contd..
5) Data mining an essential process where intelligent methods are
applied to extract data patterns or desired results.
• The data mining step may interact with the user or a
knowledge base.
6) Pattern evaluation to identify the truly interesting patterns
representing knowledge based on interestingness measures.
7) Knowledge presentation where visualization and knowledge
representation techniques are used to present mined knowledge
(results) to users in meaningful manner.
 The interesting patterns are presented to the user and may be
stored as new knowledge in the knowledge base.
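To make preprocessing steps 1–4 concrete, here is a minimal pandas sketch; the
file names and columns (sales.csv, customers.csv, customer_id, amount, age) are
hypothetical illustrations, not from the slides:

```python
import pandas as pd

# 1) Data cleaning: remove duplicate rows and fill missing values
sales = pd.read_csv("sales.csv").drop_duplicates()
sales["amount"] = sales["amount"].fillna(sales["amount"].median())

# 2) Data integration: combine two sources on a shared key
customers = pd.read_csv("customers.csv")
data = sales.merge(customers, on="customer_id", how="inner")

# 3) Data selection: retrieve only the task-relevant attributes
data = data[["customer_id", "age", "amount"]]

# 4) Data transformation: consolidate into a common format
# (here, min-max normalization of the amount attribute)
rng = data["amount"].max() - data["amount"].min()
data["amount_norm"] = (data["amount"] - data["amount"].min()) / rng

# Steps 5-7 (mining, pattern evaluation, presentation) would follow on `data`.
print(data.head())
```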
Data Mining and Business Intelligence (BI)

Fig 1.2 (b): Data mining and Business Intelligence. From bottom to top,
each layer offers increasing potential to support business decisions:
 Data Sources: paper, files, information providers, database systems,
OLTP
 Data Warehouses / Data Marts: OLAP, MDA (Database Administrator)
 Data Exploration: statistical analysis, querying and reporting
 Data Mining: information discovery (Data Analyst)
 Data Presentation: visualization techniques (Business Analyst)
 Making Decisions (End User)


Architecture of a Typical Data Mining System

Fig 1.2 (c): Architecture of a Typical Data Mining System. Components,
from top to bottom: Graphical User Interface → Pattern Evaluation →
Data Mining Engine (supported by a Knowledge Base) → Database or Data
Warehouse Server → data cleaning, integration and selection → databases,
data warehouse, World Wide Web, and other information repositories.


Contd..
 Database, Data Warehouse, World Wide Web, or Other
Information Repository: Data cleaning and data integration
techniques may be performed on this one or a set of databases.
 Database or Data Warehouse Server is responsible for fetching
the relevant data, based on the user’s data mining request.
 Knowledge base is the domain knowledge that is used to guide
the search or evaluate the interestingness of resulting patterns.
Such knowledge can include concept hierarchies, user beliefs,
additional interestingness constraints or thresholds, and
metadata.
 Data Mining Engine is essential to the data mining system and
ideally consists of a set of functional modules for tasks such as
characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and
evolution analysis.
Contd..
 Pattern Evaluation Module component typically employs
interestingness measures and interacts with the data mining
modules so as to focus the search toward interesting patterns. It
may use interestingness thresholds to filter out discovered
patterns.
 Graphical User Interface module communicates between users
and the data mining system, allowing the user to interact with
the system by specifying a data mining query or task, provide
information to help focus the search, browse database and data
warehouse schemas or data structures, evaluate mined patterns,
and visualize the patterns in different forms.
 By performing data mining, interesting knowledge, regularities, or
high-level information can be extracted from databases and viewed or
browsed from different angles.
 The discovered knowledge can be applied to decision making, process
control, information management, and query processing.
Kinds of Data Can Be Mined
Data mining can be applied to any kind of data as long as the data are
meaningful for a target application. The most basic forms of data for
mining applications are:
 Database-oriented data sets and applications:
1) Relational database (Records of Customers, Items)
2) Data warehouses (multiple sources data)
3) Transactional database (transaction of customer’s purchase, web
clicks)
 Advanced sets and advanced applications:
4) Object-Relational Database Systems (complex object structures,
class and subclass hierarchies)
5) Temporal, Sequence and Time-series database systems (historical
records, stock exchange and biological sequence data)
6) Spatial and Spatiotemporal database systems (geographical maps)
7) Hypertext and Multimedia database systems (text, image, audio
and video data)
8) Heterogeneous and Legacy database systems (Multisource data)
9) Data Streams management systems (video surveillance and sensor
data, which are continuously transmitted)
10) World Wide Web-based global information systems.
1) Relational Database data
 A Database Management System (DBMS), consists of a collection of
interrelated data, known as a database, and a set of software programs
to manage and access the data.
1) A Relational Database is a collection of tables, each of which is
assigned a unique name. Each table consists of a set of attributes
(columns or fields) and usually stores a large set of tuples (records or
rows).
 Entity-Relationship (ER) data model, is constructed for RDBMS,
represents the database as a set of entities and their relationships.
 When mining relational databases, we can go further by
searching for trends or data patterns.
2) Data Warehouse data
2) A Data Warehouse is a repository of information collected from
multiple sources, stored under a unified schema, and usually residing
at a single site.
 A data warehouse is usually modeled by a multidimensional data
structure, called a data cube, in which each dimension corresponds to
an attribute or a set of attributes in the schema, and each cell stores
the value of summarized data to provide information from a historical
perspective for On-Line Analytical Processing (OLAP) to facilitate
decision making.
 Multidimensional data mining allows the exploration of multiple
combinations of dimensions at varying levels of granularity in
data mining, and thus has greater potential for discovering
interesting patterns representing knowledge.
3) Transactional Data
3) Each record in a Transactional Database captures a transaction,
which typically includes a unique transaction identity number
(TransID) and a list of the items making up the transaction such as a
customer’s purchase, a flight booking, or a user’s clicks on a web
page in the transaction.
 A traditional database system is not able to perform market basket
data analysis.
 Data mining on transactional data can do so by mining frequent
itemsets, that is, sets of items that are frequently sold together.
(Figure: transactions made through POS, the Web, ATMs, etc., mined for
frequent itemsets.)


Advanced Data and Information Systems and
Advanced Applications
 There are many other kinds of data that have versatile forms and
structures and rather different semantic meanings.
 These applications bring about new challenges, like how to handle
data carrying special structures (e.g., sequences, trees, graphs, and
networks) and specific semantics (such as ordering, image, audio and
video contents, and connectivity), and how to mine patterns that carry
rich structures and semantics.
 Mining multiple data sources of complex data often leads to fruitful
findings due to the mutual enhancement and consolidation of such
multiple sources.
 Such data require sophisticated facilities for efficient storage, retrieval,
and updating.
 In response to these needs, advanced database systems and specific
application-oriented database systems have been developed.
 These include:
4) Object-Relational Database (ORD)
4) An Object-Relational Database (ORD), or Object-Relational Database
Management System (ORDBMS), is a DBMS similar to a relational
database, but with an object-oriented database model: objects, classes
and inheritance are directly supported in database schemas and in the
query language
 For data mining in object-relational systems, techniques need to be
developed for handling complex object structures, complex data
types, class and subclass hierarchies, property inheritance, and
methods and procedures.
(Figure: object-oriented data model vs. relational data model.)
5) Temporal database
5) Temporal databases provide a uniform and systematic way of dealing with
historical, current, as well as future data that include time-related attributes.
Historical data are never deleted; data can be inserted or updated only at
current and future timestamps.
Ex: Financial apps, Record-keeping, Data Warehousing, Scientific apps,
Reservation Systems etc.
 A Sequence database stores sequences of ordered events, with or without a
concrete notion of time. Examples: customer shopping sequences, Web click
streams, and biological sequences.
 A Time-series database stores sequences of values or events obtained over
repeated measurements of time (e.g., hourly, daily, weekly, monthly, yearly).
 Data mining techniques can be used to find the characteristics of object
evolution, or the trend of changes for objects in the database.
6) Spatial databases
6) Spatial databases contain spatial-related information. Examples
include geographic (map) databases, very large-scale integration
(VLSI) or Computed-Aided Design databases, and medical and
satellite image databases.
 A Spatiotemporal database is a spatial database that stores spatial
objects that change with time.
 Data mining may uncover patterns
describing the characteristics of an object.


7) Text databases
7) Text databases contain word descriptions for objects which are usually
not simple keywords but rather long sentences or paragraphs, such as
product specifications, error or bug reports, warning messages,
summary reports, news articles, research papers, books, digital
libraries etc.
 Text databases may be highly unstructured (Web pages), semi-
structured (e-mail messages), and well structured (library catalogue
databases).
 By mining text data, one may uncover general and concise
descriptions of the text documents, keyword or content
associations, as well as the clustering behavior of text objects.
 Multimedia databases store image, audio, and video data. They are
used in applications such as picture content-based retrieval (image
search), voice-mail systems, video-on-demand systems (youtube), the
World Wide Web, and speech-based user interfaces that recognize
spoken commands.
 For multimedia data mining, storage and search techniques need to
be integrated with standard data mining methods.
8) Heterogeneous database
8) A Heterogeneous database consists of a set of interconnected,
autonomous component databases. The components communicate in
order to exchange information and answer queries.
 A Legacy database is a group of heterogeneous databases that
combines different kinds of data systems, such as relational or
object-oriented databases, hierarchical databases, network
databases, spreadsheets, multimedia databases, or file systems. The
heterogeneous databases in a legacy database may be connected by
intra- or inter-computer networks.
 Data mining techniques may provide an interesting solution to the
information exchange problem by performing statistical data
distribution and correlation analysis, and transforming the given
data into higher, more generalized, conceptual levels.
9) Data Streams
10) WWW
9) In Data Streams, data flow into and out of an observation platform
(or window) dynamically. Stream data are huge or possibly infinite in
volume, dynamically changing, flow in and out in a fixed order, allow
only one or a small number of scans, and demand fast (often real-
time) response times.
Ex: video surveillance and sensor data
 Mining data streams involves the efficient
discovery of general patterns and dynamic
changes within stream data.
10) The World Wide Web provides rich, worldwide, online information
services, where data objects are linked together to facilitate
interactive access.
 Keyword-based Web search services can offer only limited help to
users.
 Capturing user access patterns in distributed information
environments is called Web usage mining (Weblog mining).
 Web mining is the development of scalable and effective Web data
analysis and mining methods.
1.3. Data Mining Functionalities (Tasks)
 Data Mining Functionalities (Tasks / Algorithms) are used to specify
the kind of patterns to be found in data mining tasks. These
algorithms examine the sample data of a problem and determine a
model that fits the problem closely.
 There are a number of data mining functionalities like:
1) Characterization and discrimination
2) Mining of frequent patterns
3) Associations and Correlations
4) Classification and Regression
5) Clustering analysis and
6) Outlier Analysis and Evolution Analysis
Kinds of Patterns
 Data mining system can mine multiple kinds of patterns to
accommodate different user expectations or applications and
should be able to discover patterns at different levels of
abstraction.
 In general, data mining tasks can be classified into three
categories:
I. Descriptive Analytics: insight into the past
II. Predictive Analytics: understanding the future
III. Prescriptive Analytics: advise on possible outcomes
Contd..
I. Descriptive mining model (Business Intelligence) characterize the
general properties of the data in the database. Use data aggregation
and data mining to provide insight into the past and answer: “What
has happened?”
Class/Concept Description (Characterization and Discrimination);
Mining Frequent Patterns, Associations, Correlations;
Cluster Analysis; Outlier Analysis; Evolution Analysis
II. Predictive mining model (Forecasting) perform inference on the
current data in order to make predictions. Predictive analytics turns
data into valuable, actionable information which use statistical
models and forecasts techniques to understand the future and
answer: “What could happen?”
Classification Regression Time series Analysis Prediction
III. Prescriptive mining model (Optimization and Simulation) goes
beyond predicting future outcomes by also suggesting actions to
benefit from the predictions. Prescriptive analytics not only
anticipates what will happen and when, but also why it will happen.
It uses optimization and simulation algorithms to advise on possible
outcomes and answer: “What should we do?”
Class/Concept Description:
1) Characterization and Discrimination
 Descriptions of a class or a concept are called Class/Concept
Descriptions.
 These descriptions can be derived via:
i. Data Characterization is a summarization of the general
characteristics or features of a target class of data.
Output forms: pie charts, bar charts, curves, multidimensional data
cubes, and multidimensional tables, including crosstabs.

ii. Data Discrimination is a comparison of the general features of
target class data objects with the general features of objects from
one or a set of contrasting classes.
Output forms: similar to characteristic descriptions.
The Discrimination descriptions expressed in rule form are
referred to as discriminant rules.
2) Mining Frequent Patterns
 Frequent patterns are patterns that occur frequently in data.
Kinds: itemsets, subsequences, and substructures.
 A Frequent Itemset is a set of items that frequently appear together
in a transactional data set, such as milk and bread.
 A Frequent Subsequence is a sequence of patterns that occurs
frequently, such as purchasing a PC, followed by a camera, and then a
memory card.
 A substructure can refer to different structural forms, such as graphs,
trees, or lattices, which may be combined with itemsets or
subsequences.
 If a substructure occurs frequently, it is called a (frequent) structured
pattern.
 Mining frequent patterns leads to the discovery of interesting
associations and correlations within data.
3) Associations, and Correlations
 Association is a process of uncovering the relationship among data
and determining association rules.
 Association rules that contain a single predicate are referred to as
single-dimensional association rules.
 Buys(X, “computer”)=>buys(X, “software”) [support = 1%, confidence
= 50%]
Can be written as “computer =>software [1%, 50%]”
 Adopting the terminology used in multidimensional databases, where
each attribute is referred to as a dimension, the above rule can be
referred to as a single-dimensional association rule; rules involving
two or more attributes (dimensions) are multidimensional association
rules.
 Mining of Correlations is a kind of additional analysis performed to
uncover interesting statistical correlations between associated-
attribute−value pairs or between two item sets to analyze that if they
have positive, negative or no effect on each other.
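To make support and confidence concrete, here is a minimal Python sketch
(with made-up transactions, not from the slides) computing both measures for
the rule computer => software:

```python
# Minimal sketch: support and confidence for the rule {computer} => {software},
# computed over a small made-up transaction list (illustrative data only).
transactions = [
    {"computer", "software", "mouse"},
    {"computer", "printer"},
    {"software"},
    {"computer", "software"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / n               # P(computer and software)
confidence = both / antecedent   # P(software | computer)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# -> support = 50%, confidence = 67% for this toy data
```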
4) Classification and Prediction for Predictive
Analysis
 There are two forms of data analysis that can be used for extracting
models describing important classes or to predict future data trends.
These two forms are as follows −
 Classification
 Prediction
 Classification is the process of finding a model (or function) that
describes and distinguishes data classes or concepts.
• A model or classifier is constructed to predict categorical (class)
labels for objects whose class label is unknown.
• The derived model is based on the analysis of a set of training data
(i.e., data objects whose class label is known).
Ex: classify animals based on (spine), or classify cars based on (fuel
mileage)
Classification Contd..
 A classification model can be represented in various forms, such as (a)
classification (IF-THEN) rules, (b) a decision tree, or a (c) neural
network.
(Figures: (a) classification IF-THEN rules, (b) a decision tree,
(c) a neural network.)
 A Decision Tree is a flow-chart-like tree structure, where each
internal node (non-leaf node) denotes a test on an attribute value,
each branch represents an outcome of the test, and each leaf node
(or terminal node) represents classes or class distributions. Decision
trees can easily be converted to classification rules.
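As a minimal sketch of constructing such a classifier (using scikit-learn, which
the slides do not reference, and a tiny made-up training set):

```python
# Minimal sketch: train a decision tree and print it in IF-THEN-like form.
# Uses scikit-learn and a tiny made-up data set (not from the slides).
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age, income in $1000s]; labels: whether the customer buys a PC
X = [[25, 30], [45, 80], [35, 60], [22, 20], [50, 90], [30, 40]]
y = ["no", "yes", "yes", "no", "yes", "no"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[40, 70]]))  # predict the class label of a new object
```

export_text prints the learned tree as nested IF-THEN tests, illustrating how
decision trees convert to classification rules.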
Classification and Prediction
 Hold the fruit and try to figure out the shape of it. Conical or has a
roughly spherical or oval shape.
 Compare the size of the fruit with the size of your palm.
 Next? texture of the surface of the fruit? If the surface is relatively
smooth, it can be an apple or a guava. If it’s rough, it can be an orange.
 Is this information sufficient? Taste a small piece of
the fruit. If it tastes sweet, there is a chance the fruit
is an apple, and if it’s sour, it might be an orange.
 Finally, the color of the fruit is revealed.
 Would you be able to guess the fruit without all the information you
gathered using touch and taste? Each experience gave you some
additional information about the nature of the fruit. Experiences are
called “features”, types of fruits (oranges/apples) are referred as
classes
 If you didn’t see or touch an apple (or orange) previously in your life,
it is not possible to classify the fruit as an apple (or an orange).
 So the previous experience you have with fruits becomes very
important before thinking about classification.
Contd..
 A Neural Network, when used for classification, is typically a
collection of neuron-like processing units with weighted connections
between the units.
 There are many other methods for constructing classification models,
such as Naive Bayesian classification, Support Vector Machines, and
k-Nearest Neighbor Classification.
 The resulting classification should maximally distinguish each class
from the others, presenting an organized picture of the data set.

 Prediction models predict continuous-valued functions or ordered
values, as opposed to a categorical label; prediction refers to both
numeric prediction and class-label prediction.
 Regression analysis is a statistical methodology that is most often used
for numeric prediction
 Classification and prediction may need to be preceded by relevance
analysis, which attempts to identify attributes that do not contribute to
the classification or prediction process. These attributes can then be
excluded.
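A minimal numeric-prediction sketch via least-squares regression (NumPy; the
experience/salary data are made up, not from the slides):

```python
# Minimal sketch: numeric prediction via simple linear regression.
# Made-up data: predict salary (in $1000s) from years of experience.
import numpy as np

years = np.array([1, 2, 3, 5, 8, 10])
salary = np.array([35, 40, 48, 60, 75, 90])

slope, intercept = np.polyfit(years, salary, deg=1)  # fit y = slope*x + b
print(f"predicted salary at 6 years: {slope * 6 + intercept:.1f}k")
```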
5) Cluster Analysis
 Cluster is a group of objects that belongs to the same class. Similar
objects are grouped in one cluster and dissimilar objects are grouped
in another cluster.
 Clustering is the process of making a group of abstract objects into
classes of similar objects.
 Ex: market research, pattern recognition, data analysis, and image
processing
 The objects are clustered or grouped based on the principle of
maximizing the intraclass similarity and minimizing the interclass
similarity.
 Each formed cluster can be viewed as a class of objects, from which
rules can be derived. Clustering can also facilitate taxonomy
formation, that is, the organization of observations into a hierarchy of
classes that group similar events together.
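A minimal clustering sketch (k-means via scikit-learn, not referenced in the
slides; made-up 2-D points):

```python
# Minimal sketch: group 2-D points into clusters with k-means,
# maximizing intra-cluster similarity. Made-up data for illustration.
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [2, 3],      # one natural group
          [9, 8], [10, 9], [8, 10]]    # another natural group

labels = KMeans(n_clusters=2, n_init=10).fit_predict(points)
print(labels)  # e.g. [0 0 0 1 1 1]: similar objects share a cluster label
```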
6) Outlier Analysis and Evolution Analysis
 Outliers are the data objects of a database that do not comply with
the general behavior or model of the data. Most data mining methods
discard outliers as noise or exceptions.
• Ex: Useful in fraud detection, rare events analysis.
• Rather than using statistical or distance measures, deviation-based
methods identify outliers by examining differences in the main
characteristics of objects in a group.

 Data Evolution Analysis describes and models regularities or trends


for objects whose behavior changes over time.
• Trend and deviation: Ex:-regression analysis
• Sequential pattern mining: Ex-digital camera -> large SD memory
• Periodicity analysis
• Similarity-based analysis
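A minimal outlier-detection sketch. Note this uses a simple z-score (distance
from the mean in standard deviations) rule, not the deviation-based method
described above, and the data are made up:

```python
# Minimal sketch: flag values whose z-score exceeds a threshold (made-up data).
import numpy as np

x = np.array([52, 55, 48, 50, 53, 49, 51, 250])  # 250 is suspicious
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 2])  # -> [250]
```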
1.4 Data Mining Technologies
As a highly application-driven domain, data mining is an
interdisciplinary field, the confluence of a set of disciplines and has
incorporated many techniques from other domains as shown below. The
interdisciplinary nature of data mining research and development
contributes significantly to the success of data mining and its extensive
applications. The contributing disciplines include: database and data
warehouse technology, statistics, machine learning, pattern recognition,
neural networks, information retrieval, high-performance computing,
data visualization, image and signal processing, and spatial or temporal
data analysis.
Contd..
Statistics studies the collection, analysis, interpretation or
explanation, and presentation of data.
 A statistical model is a set of mathematical functions that describe the
behavior of the objects in a target class in terms of random variables
and their associated probability distributions. Statistical models are
widely used to model data and data classes.
 Statistics is useful for mining various patterns from data as well as for
understanding the underlying mechanisms generating and affecting
the patterns.
Machine learning investigates how computers can learn (or improve
their performance) based on data.
 Supervised learning is basically a synonym for classification. The
supervision in the learning comes from the labeled examples in the
training data set.
 Unsupervised learning is essentially a synonym for clustering. The
learning process is unsupervised since the input examples are not
class labeled.
Contd..
 Semi-supervised learning is a class of machine learning techniques
that make use of both labeled and unlabeled examples when learning
a model. In one approach, labeled examples are used to learn class
models and unlabeled examples are used to refine the boundaries
between classes.
 Active learning is a machine learning approach that lets users play an
active role in the learning process. An active learning approach can
ask a user (e.g., a domain expert) to label an example, which may be
from a set of unlabeled examples or synthesized by the learning
program.
Classification of Data Mining Systems
 Because of the diversity of disciplines contributing to data mining,
data mining systems can be categorized according to various criteria,
as follows:
I. Based on General functionality
1) Descriptive data mining
• Characterization and Discrimination, Mining Frequent Patterns,
Associations, Correlations, Cluster Analysis, Outlier Analysis,
Evolution Analysis
2) Predictive data mining
• Classification, Regression, Time series Analysis, Prediction

II. Based on Different views, different classifications


1) Classification according to the kinds of databases mined:
• Based on Data models: Relational, Transactional, Object-
relational, and Data Warehouse Mining System.
• Based on Special types of data handled: Spatial, Time-series, Text,
Multi-media, Heterogeneous, Legacy, Stream, WWW, etc.
Contd..
2) Classification according to the kinds of Knowledge mined:
i. Based on data mining functionalities:
• Characterization, Discrimination, Association, Classification,
Clustering, Trend, Deviation and Outlier Analysis, etc.
ii. Based on the granularity or levels of abstraction of the knowledge:
• generalized knowledge (at a high level of abstraction),
• primitive-level knowledge (at a raw data level), or
• Knowledge at multiple levels (several levels of abstraction).
iii. Based on data regularities (commonly occurring patterns) and
those that mine data irregularities (such as exceptions, or outliers).
3) Classification according to the kinds of techniques utilized:
i. According to the degree of user interaction involved (e.g.,
Autonomous Systems, Interactive Exploratory Systems, Query-
driven Systems) or
ii. Methods of data analysis employed (e.g., Database-oriented, Data
Warehouse (OLAP), Machine Learning, Statistics, Visualization,
Pattern Recognition, Neural Network, etc.)
Contd..
4) Classification according to the Applications adapted:
i. Data mining systems may be tailored specifically for: Finance,
Retail, Telecommunication, Banking, Fraud Analysis, DNA
mining, Stock Market Analysis, Web Mining, Weblog Analysis, etc.
1.5 Data Mining Task Primitives
 A data mining task can be specified in the form of a data mining
query, which is input to the data mining system.
 A data mining query is defined in terms of data mining task
primitives. These primitives allow the user to interactively
communicate with the data mining system during discovery in order
to direct the mining process, or examine the findings from different
angles or depths.
 The data mining primitives specify the following:
1) The set of task-relevant data to be mined:
• Database or data warehouse name
• Database tables or data warehouse cubes
• Conditions for data selection
• Relevant attributes or dimensions
• Data grouping criteria
Contd..
2) The kind of knowledge to be mined:
• Characterization, Discrimination, Clustering
• Association/Correlation
• Classification/Prediction

3) The background knowledge to be used in the discovery process:


• Concept hierarchies
 Schema hierarchy
E.g., street < city < province_or_state < country
 Set-grouping hierarchy
E.g., {20-39} = young, {40-59} = middle_aged
 Operation-derived hierarchy
E.g., email address: a.bhanuprasad@vardhaman.org gives
login-name < college < domain
• User beliefs about relationships in the data

54
Contd..
4) The interestingness measures and thresholds for pattern evaluation:
• Simplicity (e.g., association rule length)
• Certainty (e.g., confidence)
• Utility (e.g., support)
• Novelty (not previously known, surprising)

5) The expected representation for visualizing the discovered patterns:


• Rules, tables, reports, charts, graphs, decision trees, and cubes
• Interactive Drill-down and Roll-up, pivoting, slicing and dicing
provide different perspectives to data
1.6 Integration of a Data Mining System with a
Database or Data Warehouse System
 When a Data Mining (DM) system works in an environment that
requires it to communicate with other information system
components, such as Databases (DB) and Data Warehouse (DW)
systems, possible integration schemes include:
 No coupling means that a DM system will not utilize any function of a
DB or DW system. It may fetch data from a particular source (such as
a file system), process data using some data mining algorithms, and
then store the mining results in another file.
 Loose coupling means that a DM system will use some facilities of a
DB or DW system, fetching data from a data repository managed by
these systems, performing data mining, and then storing the mining
results either in a file or in a designated place in a database or data
warehouse.
 Semi tight coupling means that besides linking a DM system to a
DB/DW system, efficient implementations of a few essential data
mining primitives can be provided in the DB/DW system.
 Tight coupling means that a DM system is smoothly integrated into
the DB/DW system.
Major Issues in Data Mining
 Data mining is a dynamic and fast-expanding field with great
strengths. The major issues in data mining research are partitioned
into five groups:
1) Mining Methodology
2) User Interaction
3) Efficiency and scalability
4) Diversity of Database types
5) Data mining and society
1) Issues related to Mining Methodology:
Researchers have been vigorously developing new data mining
methodologies. This involves:
 Mining various and new kinds of knowledge: Data mining covers a
wide spectrum of data analysis and knowledge discovery tasks,
from data characterization and discrimination to outlier analysis.
These tasks may use the same database in different ways and
require the development of numerous data mining techniques.
Major Issues in Data Mining Contd..
 Data Mining knowledge in multidimensional space: Data can be
aggregated or viewed as a multidimensional data cube, by
combinations of dimensions (attributes) at varying levels of
abstraction.
 Data mining—an interdisciplinary effort: The power of data mining can
be substantially enhanced by integrating new methods from multiple
disciplines like information retrieval, natural language processing,
software engineering knowledge into the data mining process.
 Boosting the power of discovery in a networked environment:
Knowledge derived in one set of objects can be used to boost the
discovery of knowledge in a “related” or semantically linked set of
objects residing in a linked or interconnected environment.
 Handling uncertainty, noise, or incompleteness of data: Data cleaning,
data preprocessing, outlier detection and removal, and uncertainty
reasoning are examples of techniques that need to be integrated with
the data mining process.
 Pattern evaluation and pattern- or constraint-guided mining:
Techniques are needed to assess the interestingness of discovered
patterns based on subjective measures with respect to a given user
class.
Major Issues in Data Mining Contd..
2) Issues related to User Interaction:
The user plays an important role in the data mining process. Interesting
areas of research include:
 Interactive mining: Interactive mining should allow users to
dynamically change the focus of a search, to refine mining requests
based on returned results, and to drill, dice, and pivot through the data
and knowledge space interactively, dynamically exploring “cube space”
while mining.
 Incorporation of background knowledge: Background knowledge,
constraints, rules, and other information regarding the domain under
study should be incorporated into the knowledge discovery process for
pattern evaluation as well as to guide the search toward interesting
patterns.
 Ad hoc data mining and data mining query languages: High-level data
mining query languages or other high-level flexible user interfaces will
give users the freedom to define ad hoc data mining tasks.
 Presentation and visualization of data mining results: Data mining
system should adopt expressive knowledge representations, user-
friendly interfaces, and visualization techniques.
Contd..
3) Issues related to Efficiency and scalability are always considered when
comparing data mining algorithms.
 Efficiency and scalability of data mining algorithms: Efficiency,
scalability, performance, optimization, and the ability to execute in real
time are key criteria that drive the development of many new data
mining algorithms, to effectively extract information from huge
amounts of data.
 Parallel, distributed, and incremental mining algorithms: Cloud
computing and cluster computing, which use computers in a
distributed and collaborative way to tackle very large-scale
computational tasks, are also active research themes in parallel data
mining.
4) Issues related to Diversity of Database types include:
 Handling complex types of data: Domain- or application-dedicated
data mining systems are being constructed for in depth mining of
specific kinds of data. The construction of effective and efficient data
mining tools for diverse applications remains a challenging and active
area of research.
 Mining dynamic, networked, and global data repositories: Web mining,
multisource data mining, and information network mining have
become challenging and fast-evolving data mining fields.
Contd..
5) Issues related to Data mining and society are
 Social impacts of data mining: How can we use data mining technology
to benefit society? How can we guard against its misuse? The improper
disclosure or use of data and the potential violation of individual
privacy and data protection rights are areas of concern that need to be
addressed.
 Privacy-preserving data mining: Data mining will help scientific
discovery, business management, economy recovery, and security
protection (e.g., the real-time discovery of intruders and cyberattacks).
 Invisible data mining: More and more systems should have data
mining functions built in so that people can perform data mining
or use data mining results simply by clicking a mouse, without any
knowledge of data mining algorithms. Users may be unaware that
intelligent search engines and Internet-based stores are likely
collecting data on the buying patterns of their customers, which may
be used to recommend other items for purchase in the future.
1.7 Data Objects and Attribute Types
 Data sets are made up of data objects.
 A Data Object represents an entity. In a sales database, the objects
may be customers, items and sales.
 Data objects are typically described by attributes. Data objects can
also be referred to as samples, examples, instances, data points, or
objects.
 If the data objects are stored in a database, they are Data Tuples. That
is, the rows of a database correspond to the data objects, and the
columns correspond to the attributes.
 An Attribute is a data field, representing a characteristic or feature of
a data object. The nouns attribute (used in data mining), dimension
(used in data warehousing), feature (in Machine learning), and
variable (in statistics) are often used interchangeably.
Ex: Customer_ID, Name, and Address etc. are the attributes
describing a Customer object
 Observed values for a given attribute are known as Observations.
Contd..
 An Attribute Vector (or feature vector) is a set of attributes used to
describe a given object.
 A Univariate Distribution of data involves one attribute (or variable).
 A Bivariate Distribution involves two attributes, and so on.
The type of an attribute is determined by the set of possible values like:
1) Nominal Attribute
2) Binary Attribute
3) Ordinal Attribute
4) Numeric Attribute
5) Discrete & Continuous Attributes
1) The values of a Nominal Attribute (relating to names) are symbols or
names of things that represents some kind of category, code, or state,
and so referred to as Categorical and also as Enumerations.
Ex: hair_color - black, brown, white; marital_status - single,
married, divorced, and widowed of person objects.
As nominal attribute values do not have any meaningful order about
them and are not quantitative, it makes no sense to find the mean or
median value. But the attribute’s most commonly occurring value,
known as the mode, is one of the measures of central tendency.
Contd..
2) A Binary Attribute is a nominal attribute with only two categories or
states: 0 or 1, where 0 means the attribute is absent, and 1 is present.
Binary attributes are referred to as Boolean if the two states
correspond to true and false.
Ex: The attribute medical test is binary, where a value of 1 means the
result of the test for the patient is positive, while 0 is negative.
 A binary attribute is symmetric if both of its states are equally
valuable and carry the same weight; that is, there is no preference
on which outcome should be coded as 0 or 1.
Ex: gender having the states male and female.
 A binary attribute is asymmetric if the outcomes of the states are
not equally important.
Ex: Medical test for HIV is 1 (e.g., HIV positive) and 0 (negative).
3) An Ordinal Attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude
between successive values is not known.
Ex: Size - small, medium, and large.
 Nominal, binary, and ordinal attributes are qualitative. They describe
a feature of an object without giving an actual size or quantity.
Contd..
4) A Numeric Attribute is quantitative (measurable), represented in
integer or real values, and can be interval-scaled or ratio-scaled.
 Interval-scaled Attributes are measured on a scale of equal-size
units; their values can be positive, 0, or negative.
Ex: outdoor temperature values for a number of different days.
 A Ratio-scaled Attribute is a numeric attribute with an inherent
zero-point. That is, a value can be a multiple (or ratio) of another
value.
Ex: weight, height, latitude and longitude
 The values are ordered, and we can also compute the difference
between values, as well as the mean, median, and mode.
5) A Discrete Attribute has a finite or countably infinite set of values,
which may or may not be represented as integers.
Ex: natural numbers, Cust_ID are countably infinite.
If an attribute is not discrete, it is a Continuous Attribute, typically
represented as a floating-point variable.
1.7 Basic Statistical Descriptions of Data
 Basic statistical descriptions can be used to identify properties of the
data and highlight which data values should be treated as noise or
outliers. 3 areas of basic statistical descriptions are:
1) Measures of central tendency measure the location of the middle or
center of a data distribution:
i. Mean
ii. Median
iii. Mode
iv. Midrange
2) Dispersion of the data: how are the data spread out? The most
common data dispersion measures are:
i. Range
ii. Quartiles
iii. Interquartile range
iv. Five-number summary
v. Boxplots
vi. Variance
vii. Standard deviation
3) Graphic Displays of basic statistical descriptions. Data presentation
software packages include bar charts, pie charts, and line graphs;
plots of summaries and distributions include:
i. Quantile plots
ii. Quantile–quantile plots
iii. Histograms
iv. Scatter plots
1) Measures of Central Tendency:
i. Mean, ii. Median, iii. Mode and iv. Midrange
 Measures of central tendency include the mean, median, mode, and
midrange.
i. Mean
 Let x1,x2, . . . ,xN be a set of N values or observations, such as for
some numeric attribute X, like salary.
 The Arithmetic Mean (Average) of this set of values is obtained by
dividing the sum of the values of all observations in a series (Ʃx) by
the number of items (N) constituting the series.
$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + x_3 + \cdots + x_N}{N} \qquad \text{(Eq. 1)}$$
 This corresponds to the built-in aggregate function, average (avg() in
SQL), provided in relational database systems.
 Weighted arithmetic mean (weighted average) is calculated as:
$$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + \cdots + w_N x_N}{w_1 + w_2 + w_3 + \cdots + w_N} \qquad \text{(Eq. 2)}$$
 Trimmed mean is the mean obtained after chopping off values at the
high and low extremes.
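A minimal sketch of these three means in Python (NumPy/SciPy; the data and
weights are made up for illustration):

```python
# Minimal sketch: arithmetic, weighted, and trimmed means (made-up data).
import numpy as np
from scipy import stats

x = np.array([4.0, 8.0, 15.0, 16.0, 23.0, 42.0])
w = np.array([1, 2, 2, 1, 1, 1])        # hypothetical weights per observation

print(x.mean())                          # Eq. 1: plain arithmetic mean
print(np.average(x, weights=w))          # Eq. 2: weighted arithmetic mean
print(stats.trim_mean(x, 0.2))           # mean after chopping 20% off each end
```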
ii. Median
 The median is that value of the variable which divides the group into
two equal parts, one part comprising of all values greater, and the
other, all values less than median.
 Calculation of Median – Discrete series (Eq. 3):
i. Arrange observations in an ordered array.
ii. If number of terms is odd, the median is the middle term of the
ordered array.
iii. If even, the median is the average of the middle two terms.
iii. Mode
 The mode for a set of data is the value that occurs most frequently in the set.
It can be determined for qualitative and quantitative attributes.
Data sets with one, two, or three modes are respectively called unimodal,
bimodal, and trimodal; in general, a data set with two or more modes is
multimodal. If each data value occurs only once, then there is no mode.
 Empirical relation: $\text{Mean} - \text{Mode} \approx 3 \times (\text{Mean} - \text{Median})$,
or $\text{Mode} \approx 3\,\text{Median} - 2\,\text{Mean}$.
iv. Midrange
 The midrange is the average of the largest and smallest values in the
set.
 Suppose we have the following values for salary (in thousands of dollars),
shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
i) Mean = (30+36+47+50+52+52+56+60+63+70+70+110)/12 = 696/12 = 58 => $58,000
ii) Median = (52+56)/2 = 108/2 = 54 => $54,000
iii) Mode: bimodal; the two modes are $52,000 and $70,000.
By the empirical relation, Mode ≈ 3×Median – 2×Mean = 3×54 – 2×58 = 46 => $46,000
iv) Midrange = (30+110)/2 = 140/2 = 70 => $70,000
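These results can be checked with a quick verification sketch using Python's
standard statistics module:

```python
# Quick check of the worked salary example with the standard library.
import statistics as st

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(st.mean(salary))                   # 58
print(st.median(salary))                 # 54.0
print(st.multimode(salary))              # [52, 70] -> bimodal
print((min(salary) + max(salary)) / 2)   # midrange: 70.0
```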
Relative positions of Mean Vs Median Vs Mode
 In a unimodal frequency curve with perfect symmetric data distribution, the
mean, median, and mode are all at the same center value, as shown in Fig(b).
 Data may be either Positively skewed, where the mode occurs at a value that
is smaller than the median (Fig.c), or
 Negatively skewed, where the mode occurs at a value greater than the median
(Fig.a).

 From the previous example we observe that
Mean ($58,000) > Median ($54,000) > Mode ($46,000).
Hence we can infer that the data set is positively skewed.
2) Measuring the Dispersion of Data: i. Range, ii. Quartiles,
iii. Interquartile Range, iv. Five-number summary,
v. Boxplots, vi. Variance, vii.Standard Deviation
 Let x1, x2, : : : , xN be a set of observations for some numeric attribute, X which
is sorted in increasing numeric order.
i. Range
 The Range of the set is the difference between the largest (max()) and
smallest (min()) values.
 Quantiles are points taken at regular intervals of a data distribution, dividing
it into essentially equal size consecutive sets.
 The 2-quantile is the data point dividing the lower and upper halves of the
data distribution. It corresponds to the median.
 The 4-quantiles are the three data points that split the data distribution into
four equal parts; each part represents one-fourth of the data distribution.
They are more commonly referred to as Quartiles. The 100-quantiles are
more commonly referred to as Percentiles; they divide the data distribution
into 100 equal-sized consecutive sets.
 The quartiles give an indication of a distribution’s center, spread, and
shape. The first quartile, denoted by Q1, is the 25th percentile. It cuts
off the lowest 25% of the data.
 The third quartile, denoted by Q3, is the 75th percentile—it cuts off
the lowest 75% (or highest 25%) of the data. The second quartile is the
50th percentile. As the median, it gives the center of the data
distribution.
InterQuartile Range (IQR),
Five-number summary, Boxplots
 The distance between the first and third quartiles which is the range covered
by the middle half of the data is called the InterQuartile Range (IQR) and
is defined as IQR = Q3 - Q1.
 The five-number summary of a distribution consists of the median (Q2),
the quartiles Q1 and Q3, and the smallest and largest individual observations,
written in the order of : Minimum, Q1, Median, Q3, Maximum.
 Boxplots are a popular way of visualizing a
distribution. A boxplot incorporates the five-number
summary as follows:
• Typically, the ends of the box are at the quartiles
so that the box length is the interquartile range.
• The median is marked by a line within the box.
• Two lines (called whiskers) outside the box extend
to the smallest (Minimum) and largest
(Maximum) observations.
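A minimal five-number-summary sketch (NumPy; the salary data from the
earlier example). Note that different quantile interpolation methods can shift
Q1 and Q3 slightly:

```python
# Minimal sketch: five-number summary (Minimum, Q1, Median, Q3, Maximum).
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
minimum, q1, median, q3, maximum = np.percentile(x, [0, 25, 50, 75, 100])

# 30.0 49.25 54.0 64.75 110.0 with NumPy's default linear interpolation
print(minimum, q1, median, q3, maximum)
print("IQR =", q3 - q1)  # range covered by the middle half of the data
```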
Variance and Standard Deviation
 Variance and standard deviation are measures of data dispersion.
They indicate how spread out a data distribution is, and they are
scalable in large databases.
 A low standard deviation means that the data observations tend to be
very close to the mean, while a high standard deviation indicates that
the data are spread out over a large range of values.
 The variance of N observations, x1, x2, . . . , xN, for a numeric attribute
X is:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2 = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2 \qquad \text{(Eq. 4)}$$
 where $\bar{x}$ is the mean value of the observations, as defined in Eq. 1.
 The standard deviation $\sigma$ of the observations is the square root
of the variance.
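A minimal sketch verifying that both forms of Eq. 4 agree (NumPy; the salary
data from the earlier example):

```python
# Minimal sketch: population variance computed two equivalent ways (Eq. 4).
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

var_def = ((x - x.mean()) ** 2).mean()      # (1/N) * sum((xi - mean)^2)
var_alt = (x ** 2).mean() - x.mean() ** 2   # (1/N) * sum(xi^2) - mean^2

print(var_def, var_alt, np.var(x))          # all three agree (np.var uses ddof=0)
print(np.std(x))                            # standard deviation = sqrt(variance)
```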
3) Graphic Displays:
Quantile plots, Quantile–Quantile plots, Histograms, and Scatter plots.
 Graphs are helpful for the visual inspection of data, which is useful for
data preprocessing.
 The quantile plots, quantile–quantile plots, and histograms show
univariate distributions (i.e., data for one attribute), while scatter
plots show bivariate distributions (i.e., involving two attributes).
 A quantile plot is a simple and effective way to have a first look at a
univariate data distribution.
• First, it displays all of the data for the given attribute (allowing the
user to assess both the overall behavior and unusual occurrences).
• Second, it plots quantile information.
Quantile–Quantile plots
 A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate
distribution against the corresponding quantiles of another.
 From the given q-q plot for unit price data of items sold at two branches
during a given time period, we infer that 25% of items sold at branch 1 were
less than or equal to $60, while 25% of items sold at branch 2 were less than
or equal to $64.
 Histograms: from “Histos” (pole) and “gram” (chart),
a histogram is a chart of poles. If X is
nominal, then a pole or vertical bar is
drawn for each known value of X whose
height indicates the frequency (i.e., count)
of that X value. The resulting graph is
more commonly known as a bar chart.
 If X is numeric, the range of values for X is partitioned into disjoint
consecutive subranges, referred to as buckets or bins; the range of a
bucket is known as its width. Typically, the buckets are of equal width.
 For each subrange, a bar is drawn whose height represents the total count
of items observed within the subrange.
75
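A minimal histogram sketch with equal-width buckets (hypothetical unit-price data):

```python
import matplotlib.pyplot as plt

prices = [40, 43, 47, 52, 55, 55, 58, 60, 62, 65, 68, 70,
          74, 75, 78, 80, 84, 88, 90, 96]  # hypothetical unit prices

# 6 equal-width buckets; each bar's height is the count within that bucket
plt.hist(prices, bins=6, edgecolor="black")
plt.xlabel("unit price ($)")
plt.ylabel("count of items sold")
plt.title("Histogram")
plt.show()
```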
Scatter plots
 A scatter plot is one of the most effective graphical methods for
determining if there appears to be a relationship, pattern, or trend
between two numeric attributes.
 To construct a scatter plot, each pair of values is treated as a pair of
coordinates and plotted as points in the plane.
 The scatter plot is a useful method for providing a first look at bivariate data
to see clusters of points and outliers, or to explore the possibility of
correlation relationships.
 Two attributes, X and Y, are correlated if one attribute implies the other.
Figs. (a), (b), and (c) show examples of positive correlation, negative
correlation, and no correlation between two attributes in each of the given
data sets. A line of best fit can be drawn to study the correlation between
the variables.
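A scatter-plot sketch with a least-squares line of best fit (synthetic, positively correlated data):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + rng.normal(0, 2, 50)  # y increases with x: positive correlation

plt.scatter(x, y)
slope, intercept = np.polyfit(x, y, 1)  # least-squares line of best fit
xs = np.array([x.min(), x.max()])
plt.plot(xs, slope * xs + intercept, "r--")
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Scatter plot with line of best fit")
plt.show()
```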
76
Example-1
 Suppose that the data for analysis includes the attribute age. The age
values for the data tuples are (in increasing order) 13, 15, 16, 16, 19,
20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45,
46, 52, 70.
i. Find the mean and median
ii. What is the mode of the data? Comment on the data’s modality (i.e.,
bimodal, trimodal, etc.)
iii. What is the midrange of the data?
iv. Can you find (roughly) the first quartile (Q1) and the third quartile
(Q3) of the data?
v. Give the five-number summary of the data.
i. The (arithmetic) mean of the data is 809/27 ≈ 30. The median (the middle
value of the ordered set, since the number of values in the set is odd) is 25.
ii. This data set has two values that occur with the same highest
frequency and is, therefore, bimodal. The modes (values occurring with
the greatest frequency) of the data are 25 and 35.
77
Contd..
iii. The midrange (average of the largest and smallest values in the data
set) of the data is: (70+13)/2 = 41.5.
iv. The first quartile (corresponding to the 25th percentile) of the data is:
20. The third quartile (corresponding to the 75th percentile) of the data
is: 35.
v. The five-number summary of a distribution consists of the minimum value,
first quartile, median, third quartile, and maximum value. It provides a good
summary of the shape of the distribution; for this data it is: 13, 20, 25, 35, 70.
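These answers can be reproduced with Python's statistics module (a sketch; quartiles are omitted since, as noted, software conventions differ slightly from the rough hand values):

```python
import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

print(statistics.mean(ages))        # 809/27, roughly 30
print(statistics.median(ages))      # 25
print(statistics.multimode(ages))   # [25, 35] -> two modes, i.e. bimodal
print((min(ages) + max(ages)) / 2)  # midrange: 41.5
```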
78
Example-2
 Suppose a hospital tested the age and body fat data for 18 randomly
selected adults with the following result:
Age 23 23 27 27 39 41 47 49 50
% fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2
Age 52 54 54 56 57 58 58 60 61
% fat 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
i. Calculate the mean, median, and standard deviation of age and %fat
ii. Draw the boxplots for age and %fat
iii. Draw a scatter plot and a q-q plot based on these two variables
iv. Calculate the correlation coefficient (Pearson’s product moment
coefficient). Are these two variables positively or negatively correlated?
i. For the variable age: mean = 46.44, median = 51, and standard
deviation = 12.85.
 For the variable %fat: mean = 28.78, median = 30.7, and standard
deviation = 8.99.
79
Contd..
ii. Draw the boxplots for age and %fat
80
Contd..
iii. Draw a scatter plot and a q-q plot based on these two variables.
iv. Calculate the correlation coefficient (Pearson's product moment
coefficient). Are these two variables positively or negatively correlated?
The Pearson correlation coefficient is 0.82; the variables are positively
correlated.
81
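The same numbers can be checked with NumPy (a sketch; np.std defaults to the population formula of Eq. 4, which matches the values above):

```python
import numpy as np

age = np.array([23, 23, 27, 27, 39, 41, 47, 49, 50,
                52, 54, 54, 56, 57, 58, 58, 60, 61])
fat = np.array([9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
                34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7])

print(age.mean(), np.median(age), age.std())  # ~46.44, 51.0, ~12.85
print(fat.mean(), np.median(fat), fat.std())  # ~28.78, 30.7, ~8.99

# Pearson's product-moment correlation coefficient
r = np.corrcoef(age, fat)[0, 1]
print(r)  # ~0.82 -> positively correlated
```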
1.8 Data Visualization
 Data visualization aims to communicate data clearly and effectively
through graphical representation.
• To discover data relationships that are otherwise not easily
observable by looking at the raw data.
• To create fun and interesting graphics.
Several representative approaches are:
1.8.1. Pixel-Oriented Visualization Techniques
1.8.2. Geometric Projection Visualization Techniques
1.8.3. Icon-Based Visualization Techniques
1.8.4. Hierarchical Visualization Techniques
1.8.5. Visualizing Complex Data and Relations
82
1.8.1. Pixel-Oriented Visualization Techniques
 A simple way to visualize the value of a dimension is to use a pixel
where the color of the pixel reflects the dimension’s value.
 For a data set of m dimensions, create m windows on the screen, one
for each dimension.
 The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows.
 The colors of the pixels reflect the corresponding values.
 Inside a window, the data values are arranged in some global order
shared by all windows. The global order may be obtained by sorting
all data records in a way that’s meaningful for the task at hand.
 In pixel-oriented techniques, data records can also be ordered in a
query-dependent way.
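A toy sketch of the idea, with invented customer data: each attribute gets its own window, all windows share one global order (income ascending), and each value is rendered as a colored pixel:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 400
income = np.sort(rng.uniform(20e3, 120e3, n))   # global order: income ascending
credit = income * 0.5 + rng.normal(0, 10e3, n)  # roughly follows income
txns = rng.uniform(0, 100, n)                   # unrelated to income

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, (name, vals) in zip(axes, [("income", income),
                                   ("credit_limit", credit),
                                   ("transactions", txns)]):
    # lay the n values out in a 20x20 pixel window, in the shared order
    ax.imshow(vals.reshape(20, 20), cmap="viridis")
    ax.set_title(name)
    ax.set_xticks([])
    ax.set_yticks([])
plt.show()
```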
(Fig: Pixel-oriented visualization of four attributes, sorting all customers
in income-ascending order.)
83
Contd..
 The first pixel in a row is far away from the last pixel in the previous row,
though they are next to each other in the global order. Moreover, a pixel is
next to the one above it in the window, even though the two are not next to
each other in the global order.
 To solve this problem, we can lay out the data records in a space-filling curve
to fill the windows. A space-filling curve is a curve with a range that covers
the entire n-dimensional unit hypercube as shown in Fig.
 The circle segment technique uses windows in the shape of segments of a
circle, as illustrated in Fig. In this technique the dimension windows are
located side by side and form a circle.
(Fig: two space-filling curves, the Hilbert curve and the Z-curve.)
84
1.8.2. Geometric Projection Visualization Techniques
 A drawback of pixel-oriented visualization techniques is that they cannot
help us much in understanding the distribution of data in a multidimensional
space (e.g., they do not show whether there is a dense area).
 Geometric projection techniques help users find interesting projections of
multidimensional data sets.
 A 3-D scatter plot uses three axes in a Cartesian coordinate system. If
it also uses color, it can display up to 4-D data points.
 For an n-dimensional data set, a scatter-plot matrix is an n × n grid of
2-D scatter plots that provides a visualization of each dimension with
every other dimension.
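A scatter-plot matrix can be sketched with pandas (the file name iris.csv and its four numeric columns are an assumption for illustration):

```python
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# assumed local file with columns like sepal_length, sepal_width, ...
iris = pd.read_csv("iris.csv")

# 4 numeric dimensions -> a 4 x 4 grid of pairwise 2-D scatter plots
scatter_matrix(iris.select_dtypes("number"), figsize=(8, 8), diagonal="hist")
plt.show()
```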
(Fig: Visualization of a 3-D data set using a scatter plot.)
(Fig: Visualization of the Iris data set using a scatter-plot matrix.)
85
1.8.3. Icon-Based Visualization Techniques
 Icon-based visualization techniques use small icons to represent
multidimensional data values.
 Two popular icon-based techniques: 1) Chernoff faces and 2) Stick figures.
1) Chernoff faces (introduced by Herman Chernoff in 1973) display
multidimensional data of up to 18 variables (or dimensions) as a cartoon
human face (Fig).
 They help reveal trends in the data. Components of the face, such as the eyes,
ears, mouth, and nose, represent values of the dimensions by their shape,
size, placement, and orientation.
 They make use of the ability of the human mind to recognize small
differences in facial characteristics and to assimilate many facial
characteristics at once.
 They facilitate the visualization of regularities and irregularities
present in the data.
86
Contd..
2) The Stick figure visualization technique maps multidimensional data to
five-piece stick figures, where each figure has four limbs and a body.
 Two dimensions are mapped to the display (x and y) axes and the remaining
dimensions are mapped to the angle and/or length of the limbs.
 Fig shows census data, where age and income are mapped to the display axes,
and the remaining dimensions (gender, education, and so on) are mapped to
stick figures.
 If the data items are relatively dense with respect to the two display
dimensions, the resulting visualization shows texture patterns, reflecting data
trends.
87
1.8.4. Hierarchical Visualization Techniques
 Hierarchical visualization techniques partition all dimensions into
subsets (i.e., subspaces). The subspaces are visualized in a hierarchical
manner.
 “Worlds-within-Worlds,” also known as n-Vision, is a representative
hierarchical visualization method. A user can interactively change, in the
outer world, the location of the origin of the inner world. The user then views
the resulting changes of the inner world.
 Tree-maps display hierarchical data as a set of nested rectangles. For
example, Fig shows a tree-map visualizing Google news stories. All news
stories are organized into seven categories, each shown in a large rectangle of
a unique color. Within each category (i.e., each rectangle at the top level), the
news stories are further partitioned into smaller subcategories.
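A toy one-level slice layout illustrates the rectangle-splitting idea (hypothetical category counts; real tree-maps such as the Google News one recurse into each rectangle and use more balanced layouts):

```python
import matplotlib.pyplot as plt

# hypothetical story counts per news category
categories = {"World": 30, "Business": 25, "Sports": 20,
              "Technology": 15, "Health": 10}

def slice_treemap(items, x=0.0, y=0.0, w=1.0, h=1.0):
    """One level of a tree-map: split the rectangle horizontally
    in proportion to each category's size."""
    total = sum(size for _, size in items)
    ax = plt.gca()
    for i, (name, size) in enumerate(items):
        frac = size / total
        ax.add_patch(plt.Rectangle((x, y), w * frac, h,
                                   facecolor=f"C{i}", edgecolor="white"))
        ax.text(x + w * frac / 2, y + h / 2, f"{name}\n{size}",
                ha="center", va="center")
        x += w * frac

slice_treemap(sorted(categories.items(), key=lambda kv: -kv[1]))
plt.axis("off")
plt.title("Slice-layout tree-map (one level)")
plt.show()
```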
88
1.8.5. Visualizing Complex Data and Relations
 A Tag Cloud is a visualization of statistics of user-generated tags in
which, tags are listed alphabetically or in a user-preferred order.
 The importance of a tag is indicated by font size or color.
 Figure 1.8.5 shows a tag cloud for visualizing the popular tags used in
a Website.
 Tag clouds are often used in two ways. First, in a tag cloud for a single
item, we can use the size of a tag to represent the number of times
that the tag is applied to this item by different users.
 Second, when visualizing the tag statistics on multiple items, we can
use the size of a tag to represent the number of items that the tag has
been applied to, that is, the popularity of the tag.
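A bare-bones tag-cloud sketch in which font size encodes popularity (hypothetical tag counts; the random placement is only for illustration, as real tag clouds use careful layout):

```python
import matplotlib.pyplot as plt
import random

tags = {"python": 120, "data": 90, "mining": 75, "sql": 40,
        "viz": 30, "olap": 18, "etl": 12}

random.seed(4)
max_count = max(tags.values())
for tag, count in tags.items():
    plt.text(random.random(), random.random(), tag,
             fontsize=10 + 30 * count / max_count,  # size ~ popularity
             ha="center", va="center")
plt.axis("off")
plt.show()
```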
89
(Fig: Disease influence graph of people at least 20 years old in the
NHANES data set.)
90
1.9. Measuring Data Similarity and Dissimilarity
Similarity and dissimilarity measures are referred to as measures
of proximity.
A similarity measure for two objects, i and j, is a numerical measure of
how alike two data objects are.
• It falls in the range [0, 1].
• It returns the value 0 if the objects are unalike.
• The similarity value is higher when objects are more alike.
• A value of 1 indicates complete similarity (identical).
• Examples: cosine similarity, Jaccard coefficient, Tanimoto coefficient.
A dissimilarity measure is a numerical measure of how different two
data objects are.
• It returns a value of 0 if the objects are the same (identical).
• The higher the dissimilarity value, the more dissimilar the two objects
are.
91
Distance measures:
Euclidean, Manhattan, and Minkowski
Distance measures are commonly used for computing the dissimilarity of
objects described by numeric attributes.
These measures include the
1) Euclidean distance
2) Manhattan distance
3) Minkowski distance
Let $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$
be two objects described by p numeric attributes.
The Euclidean distance between objects i and j is defined as:

$$d(i, j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$$

The Manhattan (or city block) distance is named so because it is the
distance in blocks between any two points in a city (such as 2 blocks
down and 3 blocks over for a total of 5 blocks). It is defined as:

$$d(i, j) = \sum_{k=1}^{p} |x_{ik} - x_{jk}|$$

The Minkowski distance is a generalization of the Euclidean and Manhattan
distances, where h ≥ 1 (h = 2 gives Euclidean and h = 1 gives Manhattan).
It is defined as:

$$d(i, j) = \left(\sum_{k=1}^{p} |x_{ik} - x_{jk}|^{h}\right)^{1/h}$$
92
Example
Given two objects represented by the tuples (22,1,42,10) and (20,0,36,8):
a) Compute the Euclidean distance between the two objects.
b) Compute the Manhattan distance between the two objects.
c) Compute the Minkowski distance between the two objects, using h = 3.
a) Euclidean distance:
$$d(i, j) = \sqrt{(22-20)^2 + (1-0)^2 + (42-36)^2 + (10-8)^2} = \sqrt{45} \approx 6.7082$$
b) Manhattan distance:
$$d(i, j) = |22-20| + |1-0| + |42-36| + |10-8| = 11$$
c) Minkowski distance with h = 3:
$$d(i, j) = \left(2^3 + 1^3 + 6^3 + 2^3\right)^{1/3} = 233^{1/3} \approx 6.1534$$
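All three results can be reproduced with one small helper (a sketch; h = 1 yields Manhattan and h = 2 yields Euclidean):

```python
def minkowski(a, b, h):
    """Minkowski distance of order h between two numeric tuples."""
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

i, j = (22, 1, 42, 10), (20, 0, 36, 8)
print(minkowski(i, j, 2))  # Euclidean: 6.7082...
print(minkowski(i, j, 1))  # Manhattan: 11.0
print(minkowski(i, j, 3))  # Minkowski with h = 3: 6.1534...
```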
93