
MCA Knowledge Base Systems & Data Mining

NOTES:

1. Subject Code: IT-34    Subject Name: Advanced Database Management System

2. Learning Objectives of the Course ADBMS:

To learn different database handling techniques; to gain an awareness of the basic issues in object-oriented data models; to learn about Web-DBMS integration technology and XML for Internet database applications; and to become familiar with data-warehousing and data-mining techniques and other advanced topics.

3. Unit Name : Knowledge Base Systems & Data Mining

4. Contents of the Unit

6.1 Data mining as a part of the Knowledge Discovery process - Introduction to machine learning & data mining

6.2 Association rules

6.3 Market-basket Model, support & confidence - Apriori Algorithm - Sampling Algorithm - Frequent-pattern Tree Algorithm - Partition Algorithm - Other types of Association rules

6.4 Classification - Decision tree induction - Bayesian classifiers

6.5 Clustering - k-means Algorithm

6.6 Approaches to other data mining problems - Discovery of sequential patterns - Discovery of patterns in time series - Regression - Neural Networks - Genetic Algorithms - Text mining - Data visualization

6.7 Applications of Data Mining

Learning Objectives of the Unit: to study different algorithms for performing the analysis of data.

5. Key Definitions, Key Words in the definitions



Data Mining: The process of discovering interesting, hidden and previously unknown patterns from vast amounts of data held in a data store such as a data warehouse.

6. Key Concepts

Prof. Khandagale S P UNIT NO. 6


KBS, KDD, Apriori Algorithm, K-means Algorithm, Bayesian Algorithm, FPT (Frequent-Pattern Tree) Algorithm, Applications of Data Mining

7. Questions Asked in the University Exam

Q.1 Write Note On: Apriori Algorithm (Nov. 2009 4M, Nov. 2010 4M, Nov. 2012 10M)

Q.2 Explain K-means Algorithm in data mining. (Nov. 2009 4M, Apr. 2010 5M, Apr. 2011 10M)

Q.3 Write Short Note On: Machine Learning (Apr. 2010 5M, Apr. 2012 10M, Apr. 2013 4M)

Q.4 Write Short Note On: KBS (Nov. 2012 5M, Apr. 2011 5M)

Q.5 Explain Outlier Analysis in Data Mining (Nov. 2010 4M, Apr. 2013 4M)

Q.6 Explain Text Mining (Nov. 2010 6M, Apr. 2013 6M)

Q.7 Explain association rules for data mining with the help of an algorithm (Nov. 2012 10M)

Questions for Practice:

Q.1 Write Short Note On:

1. KDD

2. Bayesian Classifier

3. Sampling Algorithm

Q.2 Explain various data mining applications

Q.3 Explain the architecture of data mining

8. Learning Resources:
Reference Books:

1. Data Mining: Concepts and Techniques, Jiawei Han & Micheline Kamber, Second Edition (Elsevier)

2. Database System Concepts, Abraham Silberschatz, Henry Korth, S. Sudarshan, 6th Edition (McGraw Hill International)

3. Database Systems: Design, Implementation and Management, Rob & Coronel, 4th Edition (Thomson Learning Press)

4. Database Management Systems, Raghu Ramakrishnan, Johannes Gehrke, Second Edition (McGraw Hill International)

Reference Link :

http://www.oracle.com/technetwork/articles/sql/11g-dw-olap-100058.html

http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMining-3.html

Knowledge Base Systems & Data Mining


Data Mining:
The discovery of new information, in terms of patterns or rules, from vast amounts of data.

The process of finding interesting structure in data.

For some experts, data mining is the process of knowledge discovery itself.

The process of discovering interesting, hidden and previously unknown patterns from vast amounts of data held in a data store such as a data warehouse.

Knowledge Discovery from Data

Another popular term for data mining is knowledge discovery from data (KDD). Knowledge discovery in data is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Knowledge Discovery in Databases (KDD)

Data mining is actually one step of a larger process known as knowledge discovery in databases (KDD).
The KDD process model comprises six phases:
Data selection
Data cleaning
Enrichment
Data transformation or encoding
Data mining
Reporting and displaying discovered knowledge
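
As an illustration only, the flow through the KDD phases can be pictured as a chain of small functions. The sketch below is a toy, not a real KDD tool: the record layout, the field names (`city`, `amount`), the threshold of 100 and all function names are assumptions made for the example, and it covers only selection, cleaning, transformation, mining and reporting.

```python
from collections import Counter

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r.get(f) for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records that have missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation/encoding: bucket the numeric amount into a category."""
    return [{"city": r["city"], "spend": "high" if r["amount"] >= 100 else "low"}
            for r in records]

def mine(records):
    """Data mining: count how often each (city, spend) pattern occurs."""
    return Counter((r["city"], r["spend"]) for r in records)

def report(patterns, top=2):
    """Reporting: present the most frequent patterns in readable form."""
    return [f"{city}/{spend}: {n}" for (city, spend), n in patterns.most_common(top)]

# Hypothetical raw data; record 3 has a missing amount and is removed by clean().
raw = [
    {"id": 1, "city": "Pune", "amount": 120},
    {"id": 2, "city": "Pune", "amount": 150},
    {"id": 3, "city": "Mumbai", "amount": None},
    {"id": 4, "city": "Mumbai", "amount": 40},
]
summary = report(mine(transform(clean(select(raw, ["city", "amount"])))))
print(summary)
```

Each phase consumes the output of the previous one, which is why poor selection or cleaning directly limits what the mining step can find.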

Goals of Data Mining and Knowledge Discovery (PICO)


Prediction:
Determine how certain attributes will behave in the future.
Identification:
Identify the existence of an item, event, or activity.
Classification:
Partition data into classes or categories.
Optimization:
Optimize the use of limited resources.

Knowledge Discovery
The term Knowledge Discovery in Databases, or KDD for short, refers to the broad
process of finding knowledge in data, and emphasizes the "high-level" application of
particular data mining methods. It is of interest to researchers in machine learning,
pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition
for expert systems, and data visualization.
The unifying goal of the KDD process is to extract knowledge from data in the context
of large databases.
It does this by using data mining methods (algorithms) to extract (identify) what is deemed knowledge, according to the specifications of measures and thresholds, using a database along with any required preprocessing, subsampling, and transformations of that database.

Data mining as a step in the KDD process

Interacting with a user / expert in KDD

KDD is not a fully automatic method of analysis.
The user is an important element in the KDD process.
The user should make decisions about, e.g., the choice of task and algorithms, selections in preprocessing, and the interpretation and evaluation of patterns.

Architecture of a typical data mining system

[Figure: layered architecture. Labels, top to bottom: Graphical user interface; Pattern evaluation; Knowledge base; Data mining engine; Database or data warehouse server; Data cleansing, data integration and filtering; Database, Data warehouse.]

Information repositories: These include databases, data warehouses, or other repositories such as spreadsheets.

Database or data warehouse server: Fetches the relevant data according to the user's request.

Knowledge base: Holds the domain knowledge, which is used to guide the search and to evaluate the interestingness of patterns. The knowledge base consists of past experience as well as user beliefs, on the basis of which certain conclusions can be drawn.

Data mining engine: The essential component of a data mining system. It consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, etc.

Pattern evaluation module: Searches for interesting patterns, communicating with the data mining modules to do so.

User interface: Acts as the interface between the user and the data mining system, letting the user specify a data mining query, provide information to help focus the search, etc.

Knowledge Based System (KBS)

A KBS works as an artificial intelligence tool that provides intelligent decisions with appropriate justification.

KBSs are systems based on the methods and techniques of artificial intelligence.

A knowledge base depends on the following concepts: hypotheses, rules, objects, attributes, relations, definitions, events, processes, facts, etc.

The core components of a KBS are: knowledge base, acquisition mechanism, inference mechanism.

KBS architecture

The typical architecture of a KBS is often described in terms of the tasks it supports:

Diagnosis: To identify problems from a number of symptoms or failures.

Interpretation: To provide an understanding of a situation from available information.

Prediction: To predict a future state from a set of data or observations.

Design: To develop a configuration that satisfies the constraints of a design problem.

Control: To collect and evaluate evidence and form opinions on that evidence.

Instruction: To train students and correct their performance.

Debugging: To identify and prescribe remedies for malfunctions.

Planning: Both short-term and long-term, as in project management.

Monitoring: To check performance and flag exceptions.

Advantages of KBS :

Documentation of knowledge

Intelligent Decision Support

Self learning reasoning & explanation.

Increase availability of expert knowledge

Efficient & cost effective.

Consistency of answers

Explanation of solution

Deal with the uncertainty

Limitations of KBS :

Lack of common sense

Inflexible & difficult to modify

Restricted Domain of Expertise

Lack of learning ability

Not always reliable

Applications Of KBS

Retail: Market basket analysis, Customer relationship management (CRM)

Finance: Credit scoring, fraud detection

Manufacturing: Optimization, troubleshooting

Medicine: Medical diagnosis

Telecommunications: Quality of service optimization

Bioinformatics: Motifs, alignment

Web mining: Search engines

What is intelligence ?

Intelligence = Knowledge + ability to perceive, feel, comprehend, process, communicate, judge, learn.

Artificial Intelligence is the design, study and construction of computer programs that
behave intelligently. -- Tom Dean.

Examples of intelligent agents:

Knowledge-based systems: capture knowledge that people have which is relevant to a problem.

Common sense reasoning systems: capture knowledge that people commonly hold, which is why this knowledge is not explicitly communicated.

Learning systems: possess the ability to expand their knowledge based on accumulated experience.

Natural language understanding systems: support dialog in natural languages such as English, French or Japanese.

Game playing systems.

Intelligent robots.

Speech and vision recognition systems.

What is Machine Learning?

Machine Learning

The study of algorithms that improve their performance at some task with experience.

Optimize a performance criterion using example data or past experience.

Role of statistics: inference from a sample.

Role of computer science: efficient algorithms to
solve the optimization problem, and
represent and evaluate the model for inference.

Machine learning is a process which causes systems to improve with experience.

Compared with data mining, machine learning
is more heuristic,
is focused on improving the performance of a learning agent, and
also looks at real-time learning and robotics, areas that are not part of data mining.

Types of Machine Learning


Some of the main types of machine learning are:

Supervised learning, in which the training data is labeled with the correct answers, e.g. spam or ham. The two most common types of supervised learning are classification (where the outputs are discrete labels, as in spam filtering) and regression (where the outputs are real-valued).

Unsupervised learning, in which we are given a collection of unlabeled data, which we wish to analyze and discover patterns within. The two most important examples are dimension reduction and clustering.

Reinforcement learning, in which an agent (e.g., a robot or controller) seeks to learn the optimal actions to take based on the outcomes of past actions.

There are many other types of machine learning as well, for example:

1. Semi-supervised learning, in which only a subset of the training data is labeled

2. Time-series forecasting, such as in financial markets

3. Anomaly detection, such as that used for fault detection in factories and in surveillance

4. Active learning, in which obtaining data is expensive, and so an algorithm must determine which training data to acquire

and many others.

Major Data Mining Tasks

Classification: predicting an item class

Clustering: finding clusters in data

Associations: e.g. A & B & C occur frequently

Visualization: to facilitate human discovery

Summarization: describing a group

Deviation Detection: finding changes

Estimation: predicting a continuous value

Link Analysis: finding relationships

Association Rules

Association rules are frequently used to generate rules from market-basket data.

A market basket corresponds to the set of items a consumer purchases during one visit to a supermarket.

The set of items purchased by a customer is known as an itemset.

An association rule is of the form X => Y, where X = {x1, x2, ..., xn} and Y = {y1, y2, ..., ym} are sets of items, with xi and yj being distinct items for all i and all j.

For an association rule to be of interest, it must satisfy a minimum support and a minimum confidence.

Retail shops are often interested in associations between the different items that people buy.

The Market-Basket Model

A large set of items, e.g., things sold in a supermarket.

A large set of baskets, each of which is a small set of the items, e.g., the things one
customer buys on one day.

Market basket analysis is a mathematical modeling technique based on the assumption that a customer who buys a certain product is likely to buy another product, or group of products, along with it.

Support is the percentage of transactions that contain all of the items in the itemset.

Milk => Screwdriver: low support

Milk => Bread: high support

If the support value is low, the rule may not be statistically significant.

Confidence: the probability that the items in the RHS will be purchased given that the items in the LHS are purchased by the customer.

Confidence and Support

Support:

The minimum percentage of instances in the database that contain all items
listed in a given association rule.

Support is the percentage of transactions that contain all of the items in the
itemset, LHS U RHS.

Confidence:

Given a rule of the form A=>B, rule confidence is the conditional probability that
B is true when A is known to be true.

Confidence can be computed as

support(LHS U RHS) / support(LHS)
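
The two measures can be checked by hand on a tiny example. The sketch below is a minimal illustration, assuming transactions are held as Python sets; the basket contents and the function names `support` and `confidence` are made up for the example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(LHS U RHS) / support(LHS), as in the formula above."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

# Hypothetical market baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "screwdriver"},
    {"bread", "butter"},
]

print(support({"milk", "bread"}, baskets))        # 2 of 4 baskets -> 0.5
print(support({"milk", "screwdriver"}, baskets))  # 1 of 4 baskets -> 0.25
print(confidence({"milk"}, {"bread"}, baskets))   # 0.5 / 0.75, about 0.667
```

Here Milk => Bread has higher support than Milk => Screwdriver, matching the low/high support examples given earlier.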

Apriori Algorithm:

A two-pass approach called a-priori limits the need for main memory.

Key idea: monotonicity: if a set of items appears at least s times, so does every subset.

Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.

Pass 1: Read baskets and count in main memory the occurrences of each item.

Requires only memory proportional to #items.

Pass 2: Read baskets again and count in main memory only those pairs both of whose items were found in Pass 1 to have occurred at least s times.

Requires memory proportional to the square of the number of frequent items only.

Apriori Algorithm
Lk: Set of frequent itemsets of size k (with min support)

Ck: Set of candidate itemsets of size k (potentially frequent itemsets)

L1 = {frequent items};

for (k = 1; Lk != ∅; k++) do

    Ck+1 = candidates generated from Lk;

    for each transaction t in database do

        increment the count of all candidates in Ck+1 that are contained in t

    Lk+1 = candidates in Ck+1 with min_support

return ∪k Lk;
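
The pseudocode above can be sketched in Python roughly as follows. This is a minimal in-memory illustration, not an optimized implementation: the function name `apriori`, the use of an absolute count `min_count` in place of a percentage, and the sample baskets are all assumptions for the example.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset contained in >= min_count transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the database: count each candidate, keep the frequent ones.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_count}

    level = count({frozenset([i]) for t in transactions for i in t})  # L1
    frequent = dict(level)
    k = 1
    while level:                                   # loop until Lk is empty
        # Candidate generation: unions of two frequent k-itemsets of size k+1.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Pruning by monotonicity: every k-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        level = count(candidates)                  # L(k+1)
        frequent.update(level)
        k += 1
    return frequent                                # union of all Lk

# Hypothetical baskets; an itemset is frequent if it occurs in at least 2 of them.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "screwdriver"},
    {"bread", "butter"},
]
result = apriori(baskets, min_count=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The pruning step is where monotonicity pays off: {milk, bread, butter} is never counted because its subset {milk, butter} is not frequent.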

The Sampling Algorithm

The sampling algorithm selects samples from the database of transactions that
individually fit into memory. Frequent itemsets are then formed for each sample.

If the frequent itemsets form a superset of the frequent itemsets for the entire
database, then the real frequent itemsets can be obtained by scanning the
remainder of the database.

In some rare cases, a second scan of the database is required to find all frequent
itemsets.

Frequent-Pattern Tree Algorithm

The Frequent-Pattern Tree Algorithm reduces the total number of candidate itemsets
by producing a compressed version of the database in terms of an FP-tree.

The FP-tree stores relevant information and allows for the efficient discovery of
frequent itemsets.

The algorithm consists of two steps:

Step 1 builds the FP-tree.

Step 2 uses the tree to find frequent itemsets.

Step 1: Building the FP-Tree

First, the frequent 1-itemsets are computed, along with the count of transactions containing each item.

The 1-itemsets are sorted in non-increasing order of count.

The root of the FP-tree is created with a null label.

For each transaction T in the database, place the frequent 1-itemsets in T in sorted order. Designate T as consisting of a head and the remaining items, the tail.

Insert itemset information recursively into the FP-tree as follows:

if the current node, N, of the FP-tree has a child with an item name = head, increment the count associated with that child by 1; else create a new node with a count of 1, link it to its parent N and link it with the item header table.

if the tail is nonempty, repeat the above step using only the tail, with the node for the old head as the current node, i.e., the old head is removed, the new head is the first item from the tail and the remaining items become the new tail.
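
The construction steps can be sketched as below. This is an illustrative version only: the class name `FPNode`, the function `build_fp_tree`, the alphabetical tie-break between equally frequent items and the sample baskets are assumptions, and the head/tail recursion is written as a simple loop.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}               # item name -> child FPNode

def build_fp_tree(transactions, min_count):
    # Frequent 1-itemsets with their counts, ranked in non-increasing order.
    counts = Counter(i for t in transactions for i in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    root = FPNode(None, None)            # root with a null label
    header = {i: [] for i in frequent}   # item header table: item -> its nodes
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent), key=rank.get)
        node = root
        for head in items:               # loop form of the head/tail recursion
            child = node.children.get(head)
            if child is None:            # no matching child: create and link it
                child = FPNode(head, node)
                node.children[head] = child
                header[head].append(child)
            child.count += 1             # new nodes end up with a count of 1
            node = child
    return root, header

# Hypothetical baskets, minimum support count of 2.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "screwdriver"},
    {"bread", "butter"},
]
root, header = build_fp_tree(baskets, min_count=2)
print(root.children["bread"].count)                   # baskets whose path starts at bread
print([node.count for node in header["milk"]])        # milk appears on two branches
```

Transactions that share a prefix share nodes, which is how the tree compresses the database.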

Step 2: The FP-growth Algorithm for Finding Frequent Itemsets

Input: FP-tree and minimum support, mins

Output: frequent patterns (itemsets)

procedure FP-growth (tree, alpha);

Begin

    if tree contains a single path P then

        for each combination, beta, of the nodes in the path

            generate pattern (beta U alpha) with support = minimum support of nodes in beta

    else

        for each item, i, in the header of the tree do

        begin

            generate pattern beta = (i U alpha) with support = i.support;

            construct beta's conditional pattern base;

            construct beta's conditional FP-tree, beta_tree;

            if beta_tree is not empty then

                FP-growth(beta_tree, beta);

        end;

End;
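
The recursive idea behind FP-growth can be illustrated without the tree itself. The sketch below is a simplification and not the full algorithm: it keeps each conditional pattern base as a plain list of (prefix, count) pairs instead of compressing it into a conditional FP-tree, and it omits the single-path shortcut. The name `pattern_growth`, the alphabetical item order and the sample baskets are assumptions for the example.

```python
from collections import Counter

def pattern_growth(base, min_count, alpha=()):
    """Yield (pattern, support) pairs.

    `base` is a list of (items, count) pairs; every `items` tuple must follow
    the same fixed item order so that "prefix" is well defined.
    """
    counts = Counter()
    for items, n in base:
        for i in set(items):
            counts[i] += n
    for item, supp in counts.items():
        if supp < min_count:
            continue
        beta = alpha + (item,)                  # pattern beta = (i U alpha)
        yield frozenset(beta), supp
        # beta's conditional pattern base: the prefixes that precede `item`.
        cond = [(items[:items.index(item)], n) for items, n in base if item in items]
        yield from pattern_growth(cond, min_count, beta)

# Hypothetical baskets, each written in one fixed (alphabetical) item order.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "screwdriver"},
    {"bread", "butter"},
]
base = [(tuple(sorted(t)), 1) for t in baskets]
patterns = dict(pattern_growth(base, min_count=2))
for p, n in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(p), n)
```

Because each recursion looks only at the prefixes of one item, every frequent itemset is generated exactly once and no candidate sets are enumerated.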

The Partition Algorithm


Divide the database into non-overlapping subsets.

Treat each subset as a separate database where each subset fits entirely into main
memory.

Apply the Apriori algorithm to each partition.

Take the union of all frequent itemsets from each partition.

These itemsets form the global candidate frequent itemsets for the entire database.

Verify the global set of itemsets by having their actual support measured for the entire
database.
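
A compact sketch of the partition idea follows. To keep the example short and self-contained, each partition is mined by brute-force enumeration as a stand-in for Apriori; this substitution, the function names, the relative threshold `min_frac` and the sample baskets are all assumptions for illustration.

```python
from itertools import combinations

def local_frequent(part, min_frac):
    """Brute-force stand-in for running Apriori on one in-memory partition."""
    need = min_frac * len(part)
    items = sorted({i for t in part for i in t})
    found = set()
    for k in range(1, len(items) + 1):
        level = {frozenset(c) for c in combinations(items, k)
                 if sum(set(c) <= t for t in part) >= need}
        if not level:            # by monotonicity, no larger itemset can be frequent
            break
        found |= level
    return found

def partition_mine(transactions, min_frac, n_parts):
    transactions = [frozenset(t) for t in transactions]
    size = -(-len(transactions) // n_parts)      # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Union of the locally frequent itemsets = global candidate itemsets.
    candidates = set().union(*(local_frequent(p, min_frac) for p in parts))
    # Verification scan: measure actual support over the entire database.
    need = min_frac * len(transactions)
    result = {}
    for c in candidates:
        n = sum(c <= t for t in transactions)
        if n >= need:
            result[c] = n
    return result

# Hypothetical baskets split into 2 partitions, minimum support 50%.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "screwdriver"},
    {"bread", "butter"},
]
result = partition_mine(baskets, min_frac=0.5, n_parts=2)
print(len(result))
```

The approach is safe because an itemset that is frequent in the whole database must be frequent in at least one partition, so the union of local results cannot miss it; the final scan only removes false positives.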

Association Rules

The cardinality of itemsets in most situations is extremely large.

Association rule mining is more difficult when transactions show variability in factors such as geographic location and seasons.

Item classifications exist along multiple dimensions.

Data quality is variable; data may be missing, erroneous, conflicting, as well as redundant.

Other Association Rules

Association Rules Among Hierarchies

Multidimensional Association

Negative Association

Classification

Classification is the process of learning a model that is able to describe different classes of data.

Learning is supervised, as the classes to be learned are predetermined.

Learning is accomplished by using a training set of pre-classified data.

The model produced is usually in the form of a decision tree or a set of rules.

Classification: A data mining technique

Decision trees and neural networks are examples of classification techniques.

Classification groups objects that share some kind of similarity into classes, much as clustering does.

Decision Tree

It is a flowchart-like tree structure.

Each internal node denotes a test on an attribute.

Each branch represents an outcome of the test.

A decision tree whose nodes have two branches is called a binary tree; one whose nodes have multiple branches is called a multiway tree.

Advantages:

The learning and classification steps are simple and fast.

They provide good accuracy.

Bayesian Classifiers
It is a probabilistic approach based on applying Bayes' theorem.

It is based on a strong independence assumption.

A Bayesian classifier is supported by a probability model.

Advantage: It works well in complex real-world situations.
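
A small naive Bayes classifier for categorical attributes shows how Bayes' theorem and the independence assumption combine. This is a sketch under stated assumptions: the function names `train` and `predict`, the toy weather data and the add-one (Laplace) smoothing are choices made for the example and do not come from the notes.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Collect the counts needed to estimate P(class) and P(value | class)."""
    class_counts = Counter(labels)
    feature_counts = Counter()           # (attribute index, value, class) -> count
    values = defaultdict(set)            # attribute index -> distinct values seen
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            feature_counts[(j, v, y)] += 1
            values[j].add(v)
    return class_counts, feature_counts, values

def predict(model, row):
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for y, ny in class_counts.items():
        score = ny / total                               # prior P(y)
        for j, v in enumerate(row):
            # Independence assumption: multiply per-attribute P(v | y).
            # Add-one smoothing (an added assumption) avoids zero probabilities.
            score *= (feature_counts[(j, v, y)] + 1) / (ny + len(values[j]))
        if score > best_score:
            best, best_score = y, score
    return best

# Hypothetical training data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("sunny", "cool")]
labels = ["no", "no", "yes", "yes", "yes"]
model = train(rows, labels)
print(predict(model, ("rainy", "mild")))   # yes
print(predict(model, ("sunny", "hot")))    # no
```

The score for each class is proportional to P(class) times the product of P(attribute value | class), so the class with the largest score is the most probable one under the model.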

Outlier Analysis & Clustering


Outlier Analysis: When a data object's behavior does not match that of similar objects, that object is called an outlier.

Outliers are objects that are dissimilar to, or inconsistent with, their fellow objects.

The main causes of outliers are measurement or execution errors, or assumptions made about the data.

Applications :

Fraud Detection

Customized marketing

Medical Analysis

Clustering: Clustering is the process where the data objects similar to each other are
placed together in a cluster & dissimilar objects into other clusters.

Applications:
Data Mining

Statistics

Biology

Machine learning

Clustering Requirements :

Scalability

Ability to deal with different types of attributes

Discovery of clusters with arbitrary shape

Ability to deal with noisy data

Incremental clustering

High dimensionality

Constraint-based clustering

Minimal requirement for domain knowledge to determine input parameters

K-means Clustering Algorithm

Algorithm: The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.

Input: k: the number of clusters, and

D: a database containing n objects.

Output: A set of k clusters that minimizes the squared-error criterion.

1) Randomly choose k objects from D as the initial cluster centers (centroids);

2) Repeat

3) (Re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster;

4) Update the cluster means, i.e., calculate the mean value of the objects in each cluster;

5) Until the centroids (center points) no longer change;
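
The five steps translate almost line for line into code. The sketch below is a minimal illustration assuming numeric points stored as tuples and squared Euclidean distance as the similarity measure; the function name `kmeans`, the `seed` and `max_iter` parameters and the sample points are assumptions for the example.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                  # 1) k random objects as centers
    for _ in range(max_iter):                        # 2) repeat
        clusters = [[] for _ in range(k)]
        for p in points:                             # 3) assign to the nearest mean
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        new_centers = [                              # 4) update the cluster means
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)         #    (an empty cluster keeps its center)
        ]
        if new_centers == centers:                   # 5) until no change
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
print(sorted(len(c) for c in clusters))
```

The result generally depends on the random initial centers; on well-separated data such as this it settles on the two obvious groups, but in general k-means only finds a local minimum of the squared-error criterion.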

Approaches to other data mining problems

Discovery of sequential patterns:

This technique identifies sequences of objects, called sequential objects, which help to predict strong dependencies among events.

A sequential object may be, for example, the goods purchased by a customer or the medical treatments given to a patient.

It is observed that the occurrence of a particular event often depends on previous events.

Discovery of patterns in Time Series

The approach is based on identifying similarities between time series of data.

Many transactions are recorded at regular time intervals, such as in weekly reports.

Applications :

Financial Market

Medical diagnosis

Market basket data analysis

Regression: It deals with the prediction of a value rather than a class.

If f is a function over the input attributes, it is called the regression function and takes the form:

Y = f(x1, x2, x3, ..., xn)
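
As a concrete case, the simplest regression function is a straight line in one variable, Y = a*x + b, fitted by least squares. The sketch below assumes that special case; the function name `fit_line` and the data are made up for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)            # 2.0 1.0
print(slope * 5 + intercept)       # predicted value for x = 5
```

Unlike a classifier, the fitted function returns a continuous value for any new input.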
Neural Network: It is a technique that uses generalized regression and provides an iterative method for carrying it out.

Types of Neural Network :

Supervised neural Network

Unsupervised neural Network

Characteristics of Neural Network

Self Adaptive

Classification Tasks

Highly quantitative Output

No unique internal representation

Time Series Data

Genetic Algorithm: Algorithms that perform randomized search procedures and that are adaptive and robust in nature are called genetic algorithms.
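
To make the idea of a randomized, adaptive search concrete, here is a toy genetic algorithm. Everything specific in it is an assumption chosen for illustration: the objective (maximize the number of 1-bits in a bit string), tournament selection, one-point crossover, bit-flip mutation, elitism and all parameter values.

```python
import random

def fitness(bits):
    return sum(bits)                 # toy objective: number of 1-bits

def genetic_search(n_bits=20, pop_size=30, generations=40, p_mut=0.02, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = [max(map(fitness, pop))]          # best fitness per generation

    def pick(population):                       # tournament selection of size 2
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        nxt = [max(pop, key=fitness)]           # elitism: carry the best over unchanged
        while len(nxt) < pop_size:
            p1, p2 = pick(pop), pick(pop)
            cut = rng.randrange(1, n_bits)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            nxt.append(child)
        pop = nxt
        history.append(max(map(fitness, pop)))
    return max(pop, key=fitness), history

best, history = genetic_search()
print(fitness(best), history[0], history[-1])
```

Because the best individual is always carried forward, the best fitness never decreases from one generation to the next; selection and crossover exploit good solutions while mutation keeps the search randomized.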

Applications of GA

Image Analysis

Scheduling

Engineering Design

Text Mining
Text data is everywhere: books, news, articles, financial analysis, blogs, social networking, etc.

According to estimates, 80% of the world's data is in unstructured text format.

We need methods to extract, summarize, and analyze useful information from unstructured text data.

Text mining seeks to automatically discover useful knowledge from this massive amount of data.

Active research on text mining is going on in both industry and academia.

Data Visualization / Visual Data Mining (VDM)

It is a recent approach for exploring very large datasets which combines traditional mining methods with information visualization techniques.

It is required since the size of the data is very large and, if it is not displayed in an organized manner, trends and patterns cannot be recognized properly.

It allows the user to combine automated calculations with human perception to observe trends and patterns.

The different methods are:

Univariate

Bivariate

Multivariate

a) Icon based method

b) Pixel based method

c) Dynamic parallel coordinate system

Applications of Data Mining:

Data mining is widely used in diverse areas. A number of commercial data mining systems are available today, yet there are many challenges in this field.

FINANCIAL DATA ANALYSIS

The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Here are a few typical cases:

Design and construction of data warehouses for multidimensional data analysis and
data mining.

Loan payment prediction and customer credit policy analysis.

Classification and clustering of customers for targeted marketing.

Detection of money laundering and other financial crimes.

TELECOMMUNICATION INDUSTRY

Today the telecommunication industry is one of the fastest-emerging industries, providing various services such as fax, pager, cellular phone, Internet messenger, images, e-mail, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is expanding rapidly. This is why data mining has become very important in helping to understand the business.

Data mining in the telecommunication industry helps to identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve quality of service. Here is a list of examples where data mining improves telecommunication services:

Multidimensional Analysis of Telecommunication data.

Fraudulent pattern analysis.

Identification of unusual patterns.

Multidimensional association and sequential patterns analysis.

Mobile Telecommunication services.

Use of visualization tools in telecommunication data analysis.

Data Mining Applications in Health Care and Insurance

The growth of the insurance industry depends entirely on the ability to convert data into knowledge, information or intelligence about customers, competitors and markets. Data mining has been applied in the insurance industry only recently, but it has brought tremendous competitive advantages to the companies that have implemented it successfully.

Data Mining Applications in Transportation

Data mining helps determine the distribution schedules among warehouses and
outlets and analyze loading patterns.

Data Mining Applications in Medicine

Data mining makes it possible to characterize patient activities in order to anticipate incoming office visits.

Data mining helps identify the patterns of successful medical therapies for different illnesses.

Data mining applications are continuously developing in various industries to provide more hidden knowledge that increases business efficiency and grows businesses.

Identification of the most frequently stolen products: the products that are stolen most often are identified, so that security mechanisms for them can be planned and thieves can be detected.

Data Mining for Retail Industry

Retail industry: huge amounts of data on sales, customer shopping history, etc.

Applications of retail data mining

Identify customer buying behaviors

Discover customer shopping patterns and trends

Improve the quality of customer service

Achieve better customer retention and satisfaction

Enhance goods consumption ratios

Design more effective goods transportation and distribution policies
