
Paper code CO 220

Submitted by
Prashant Kumar 2K12/CO/
Shubham Jha 2K12/CO/125
Tanishq Saini 2K12/CO/136
Vishnu Kumar Kanaujia 2K12/CO/14




Acknowledgment

We would like to express our sincere gratitude to our teacher, Ankita Jain Ma'am, whose sincere efforts made us capable of throwing some light on the topics we have chosen. This is a very meagre effort of ours to acknowledge the kindness and perseverance with which she enabled us to present our topic with great enthusiasm.
We would also like to express our gratitude to our kind parents and to The Supreme Personality of Godhead, who supplied us with adequate resources and capabilities to reach here.















DATA MINING
Data mining, also known as Knowledge Discovery from Data (KDD), is the process of discovering
interesting patterns from massive amounts of data.
It is analogous to mining: a process that extracts a small set of precious nuggets from a great deal
of raw material.
These patterns must be:
1. Valid on test data
2. Certain to a good degree
3. Novel, i.e. not already obvious to the system
4. Potentially useful
5. Easily understood by humans
Is Data Mining the Need of the Hour?
"We are living in the information age" is a popular saying; however, we are actually living in the data age.
Every day an alarming amount of data (in petabytes) enters storage, whether on the WWW or on common media such as inexpensive disks and drives.
Dormant within this data is potentially useful information, and there is a growing gap between the generation of data and our understanding of it, even though that understanding could resolve many issues.
Evolution of Data Mining
It can be viewed as a result of the natural evolution of information technology.
This evolution has proceeded through the following stages:
1. Data collection and database creation
2. Data management (data storage, retrieval and transaction processing)
3. Advanced data analysis (data warehousing and data mining)
The early development of data collection and database creation mechanisms served as a
prerequisite for the later development of effective mechanisms for data storage and retrieval and
transaction processing
The abundance of data, coupled with the need for powerful data analysis tools, has been described
as a "data rich but information poor" situation.
The fast- growing amount of data, collected and stored in numerous data repositories, has far
exceeded our human ability for comprehension.
Data collected in large data repositories become data tombs that are seldom visited.
Important decisions are made based not on the information-rich data stored in data repositories but
rather on a decision maker's intuition, simply because the decision maker does not have the tools to
extract the valuable knowledge embedded in the vast amounts of data.
Evolution of Database Technology
Database systems have been widely used from the beginning to store data. The timeline of their
evolution is as follows:
1960s: Data collection, database creation, IMS and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s: Data mining, data warehousing, multimedia databases and web databases
2000s: Stream data management and mining, data mining and its applications, web technology (XML, data integration) and global information systems
The Steps Of Data Mining
The knowledge discovery process is an iterative sequence of the following steps.
These steps, executed in order and iterated as needed, yield good results (a small code sketch follows the list):
1. Data cleaning i.e. Removing noise and inconsistent data
2. Data integration i.e. Combining multiple data sources
3. The resulting data are stored in a data warehouse.
4. Data selection i.e. Retrieving data relevant to the analysis task
5. Data transformation i.e. Transformation and consolidation of data into forms appropriate for
mining by performing summary or aggregation operations.
6. Data reduction performed to obtain a smaller representation of the original data without
sacrificing its integrity.
7. Data mining i.e. Applying intelligent methods to extract data patterns.
8. Pattern evaluation i.e. Identification of truly interesting patterns representing knowledge, based
on interestingness measures.
9. Knowledge presentation i.e. Visualization and knowledge representation techniques used to
present mined knowledge to users.
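As an illustration of how some of these steps might fit together in code, here is a minimal sketch using pandas on a small hypothetical "sales" table; the table and its column names are invented for the example, and a real pipeline would be far more elaborate.

# Minimal sketch of the preprocessing stages above; "sales" and its columns are
# invented for illustration.
import pandas as pd

sales = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "age":         [25, 25, None, 41, 30],             # a missing value to clean
    "amount":      [120.0, 80.0, 200.0, 200.0, -5.0],  # -5.0 is a noisy value
})

# Steps 1-2: data cleaning - remove impossible values, fill missing ones
clean = sales[sales["amount"] >= 0].copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# Step 4: data selection - keep only attributes relevant to the analysis task
selected = clean[["customer_id", "amount"]]

# Steps 5-6: transformation / reduction - aggregate to one summary row per customer
summary = selected.groupby("customer_id")["amount"].agg(["count", "sum", "mean"])
print(summary)  # the consolidated data a mining step would then consume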
(Figure: Knowledge Discovery (KDD) Process)
(Figure: Components of a Data Mining System)
(Figure: Confluence of Multiple Disciplines in Data Mining)
Some Methods Of Data Mining
1. Statistics
A statistical model is a set of mathematical functions that describe the behaviour of the objects in a
target class in terms of random variables and their associated probability distributions.
In data mining tasks such as data characterization and classification, statistical models of the target
classes can be built.
Alternatively, data mining tasks can be built on top of statistical models; for example, statistics can
be used to model noise and missing data values.
2. Machine Learning
Machine learning investigates how computers can learn, or improve their performance, based on
data, automatically learning to recognize complex patterns and make intelligent decisions.
Machine learning is a fast-growing discipline, and many of its classic problems are closely related to
data mining.
3. Data Warehousing and Database Systems
Data mining can make good use of scalable database technologies to achieve high efficiency and
scalability on large data sets, since it requires processing and storing huge volumes of data.
Data mining tasks can be used to extend the capability of existing database systems to satisfy
advanced users' sophisticated data analysis requirements.
A data warehouse integrates data originating from multiple sources and various timeframes.
4. Information Retrieval
Increasingly large amounts of text and multimedia data have been accumulated and made available
online due to the fast growth of the Web and applications.
Their effective search and analysis have raised many challenging issues in data mining; as a result,
text mining and multimedia data mining, integrated with information retrieval methods, have
become increasingly important.
Defining An Attribute
An attribute is a data field, representing a characteristic or feature of a data object. The nouns
attribute, dimension, feature, and variable are often used interchangeably in the literature.
The type of an attribute is determined by the set of possible values it can take. The main types are the following.
Nominal Attributes
Relating to names (though they can also be represented with numbers). Nominal attribute values have no meaningful order among them and are not quantitative.
Binary Attributes
Nominal attribute that has only two states, 0 or 1.
Ordinal Attributes
Possible values that have a meaningful order or ranking among them.
Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real
values.
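For illustration, the small sketch below describes one hypothetical object with one attribute of each type; all names and values are invented.

# Illustrative (invented) attributes of one "customer" object, one per attribute type
customer = {
    "hair_colour": "black",    # nominal: a name; no meaningful order among values
    "is_smoker":   0,          # binary: a nominal attribute with only two states, 0 or 1
    "drink_size":  "medium",   # ordinal: small < medium < large, but the gaps are not known
    "age":         47,         # numeric: a measurable, quantitative value
}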
Measuring the Central Tendency: Mean, Median, and Mode
Mean
The arithmetic average; the "center" of a set of data.
Median
For an odd number of values, the middle value; for an even number, the average of the two middle values.
Mode
The most frequently occurring value in the data.
Variance and standard deviation
They are measures of data dispersion. They indicate how spread out a data distribution is. A low
standard deviation means that the data observations tend to be very close to the mean, while a high
standard deviation indicates that the data are spread out over a large range of values
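The following short sketch shows how these five measures could be computed with Python's standard statistics module; the data values are invented.

# Computing the measures above with Python's standard library
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 70]

print(statistics.mean(data))       # mean: the arithmetic "center" of the data
print(statistics.median(data))     # median: average of the two middle values (even count here)
print(statistics.mode(data))       # mode: the most frequently occurring value (70)
print(statistics.pvariance(data))  # variance: average squared deviation from the mean
print(statistics.pstdev(data))     # standard deviation: square root of the variance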
GRAPHICAL REPRESENTATION OF DATA
Quantile-Quantile Plot
A quantile-quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the
corresponding quantiles of another. It is a powerful visualization tool in that it allows the user to
view whether there is a shift in going from one distribution to another.
Histogram
Histograms (or frequency histograms) are at least a century old and are widely used. Histos means
pole or mast, and gram means chart, so a histogram is a chart of poles.
Scatter Plots and Data Correlation
A scatter plot is one of the most effective graphical methods for determining if there appears to be a
relationship, pattern, or trend between two numeric attributes. To construct a scatter plot, each pair
of values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane.
(Figure: Positive and negative correlation, and 3D visualisation of a scatter plot)
Icon-Based Visualization Techniques
Chernoff faces were introduced in 1973 by statistician Herman Chernoff. They display
multidimensional data of up to 18 variables (or dimensions) as a cartoon human face. Chernoff faces
help reveal trends in the data. Components of the face, such as the eyes, ears, mouth, and nose,
represent values of the dimensions by their shape, size, placement, and orientation. For example,
dimensions can be mapped to the following facial characteristics: eye size, eye spacing, nose length,
nose width, mouth curvature, mouth width, mouth openness, pupil size, eyebrow slant, eye
eccentricity, and head eccentricity.
Data Matrix versus Dissimilarity Matrix
Data matrix (or object-by-attribute structure): This structure stores the n data objects in the form of
a relational table, or n-by-p matrix (n objects × p attributes).
Dissimilarity matrix (or object-by-object structure): This structure stores a collection of proximities
that are available for all pairs of the n objects. It is often represented by an n-by-n table,
where d(i, j) is the measured dissimilarity or difference between objects i and j.
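As a rough illustration, the sketch below builds a small invented data matrix and derives its dissimilarity matrix, assuming Euclidean distance is used as the dissimilarity measure d(i, j).

# Invented data matrix X (n = 3 objects, p = 2 attributes) and its dissimilarity matrix D
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 1.0]])

# D is n-by-n and symmetric, with zeros on the diagonal (each object is 0 away from itself)
diff = X[:, None, :] - X[None, :, :]     # pairwise attribute differences
D = np.sqrt((diff ** 2).sum(axis=-1))    # Euclidean distance for every pair of objects
print(D)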
What is Data Preprocessing?
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain
behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of
resolving such issues. Data preprocessing prepares raw data for further processing.

Data preprocessing is used in database-driven applications such as customer relationship management
and in rule-based applications (such as neural networks).

Why Data Preprocessing?
Data in the real world is dirty:
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation = "" (a value is missing)
Noisy: containing errors or outliers
e.g., Salary = -10
Inconsistent: containing discrepancies in codes or names
e.g., Age = 42 but Birthday = 03/07/1997
e.g., ratings were 1, 2, 3 and are now A, B, C
e.g., discrepancies between duplicate records
Why Is Data Dirty?
Incomplete data may come from
1. Not applicable data value when collected
2. Different considerations between the time when the data was collected and when it is
analyzed.
3. Human/hardware/software problems
Noisy data (incorrect values) may come from
1. Faulty data collection instruments
2. Human or computer error at data entry
3. Errors in data transmission
Inconsistent data may come from
1. Different data sources
2. Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
1. Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics.
2. Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprise the majority of the work of building a
data warehouse
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories:
Intrinsic
Contextual
Representational
Major Tasks in Data Preprocessing
Data Cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies

Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same or similar
analytical results
Data discretization
Part of data reduction but with particular importance, especially for
numerical data
Data Cleaning
Importance
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
Data cleaning tasks
1. Fill in missing values
2. Identify outliers and smooth out noisy data
3. Correct inconsistent data
4. Resolve redundancy caused by data integration
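A minimal sketch of these cleaning tasks, assuming pandas and an invented table with age, salary, and rating columns, might look as follows; real cleaning pipelines are usually far more involved.

# Minimal sketch of the cleaning tasks above on an invented table
import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 42, 39, 120],            # None is missing, 120 is an outlier
    "salary": [30000, 32000, -10, 45000, 50000],  # -10 is a noisy (impossible) value
    "rating": [1, 2, "A", 3, "B"],                # inconsistent codes: 1/2/3 vs A/B/C
})

# 1. Fill in missing values using a measure of central tendency (here, the median)
df["age"] = df["age"].fillna(df["age"].median())

# 2. Identify outliers and smooth noisy data: clip ages, replace impossible salaries
df["age"] = df["age"].clip(lower=0, upper=100)
df.loc[df["salary"] < 0, "salary"] = df.loc[df["salary"] >= 0, "salary"].median()

# 3. Correct inconsistent data: map the old numeric ratings onto the new letter codes
df["rating"] = df["rating"].replace({1: "A", 2: "B", 3: "C"})
print(df)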
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity Identification problem
Identify real world entities from multiple data sources,
e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different sources are different
Possible reasons: different representations, different scales, e.g., metric vs. British units
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
1. min-max normalization
2. z-score normalization
3. normalization by decimal scaling (a short sketch of these three methods follows at the end of this section)
Attribute/feature construction
New attributes constructed from the given ones
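The three normalization methods listed above can be sketched in plain Python as follows; the attribute values are invented for illustration.

# The three normalization methods above, sketched in plain Python
import math

values = [200, 300, 400, 600, 1000]
lo, hi = min(values), max(values)
mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

# 1. Min-max normalization: linear rescaling into the range [0, 1]
min_max = [(v - lo) / (hi - lo) for v in values]

# 2. Z-score normalization: centre on the mean, scale by the standard deviation
z_score = [(v - mean) / std for v in values]

# 3. Decimal scaling: divide by 10^j, the smallest power of 10 that keeps all |values| below 1
j = len(str(int(max(abs(v) for v in values))))
decimal = [v / 10 ** j for v in values]

print(min_max, z_score, decimal, sep="\n")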
Data Reduction Strategies
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume yet produces the
same (or almost the same) analytical results
Data reduction strategies
Data cube aggregation:
Dimensionality reduction e.g., remove unimportant attributes
Data Compression
Numerosity reduction e.g., fit data into models
Discretization and concept hierarchy generation
Mining Frequent Itemsets
In data mining, association rule learning is a popular and well researched method for discovering
interesting relations between variables in large databases.
In computer science and data mining, Apriori Algorithm is a classic algorithm for learning association
rules.
The Apriori algorithm is designed to operate on databases containing transactions (for example,
collections of items bought by customers, or details of website visits).
Other algorithms are designed for finding association rules in data having no transactions (Winepi
and Minepi), or having no timestamps (DNA sequencing).
Association Rule of Data Mining
Let I = {i1, i2, ..., in} be a set of n binary attributes called items. Let D = {t1, t2, ..., tm} be a set of
transactions called the database.
Each transaction in D has a unique transaction ID and contains a subset of the items in I.
A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅.
The sets of items (itemsets for short) X and Y are called the antecedent (LHS) and the consequent (RHS) of
the rule.
Some Useful Concepts
To select interesting rules from the set of all possible rules, constraints on various measures of
significance and interest can be used.
The best-known constraints are minimum thresholds on support and confidence
Support: The support supp( X) of an itemset X is defined as the proportion of transactions in the
data set which contain the itemset.
supp(X) = (number of transactions containing X) / (total number of transactions)
Confidence: The confidence of a rule is defined as
conf(X ⇒ Y) = supp(X ∪ Y) / supp(X)
Lift
The lift of a rule is defined as:
lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) × supp(Y))
Conviction
The conviction of a rule is defined as:
conv(X ⇒ Y) = (1 - supp(Y)) / (1 - conf(X ⇒ Y))
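To make these definitions concrete, here is a small worked example on an invented five-transaction database; the item names are arbitrary.

# A worked example of the measures above on an invented transaction database
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def supp(itemset):
    # proportion of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"milk"}
support    = supp(X | Y)                          # supp(X ∪ Y) = 3/5 = 0.6
confidence = supp(X | Y) / supp(X)                # conf(X ⇒ Y) = 0.6 / 0.8 = 0.75
lift       = supp(X | Y) / (supp(X) * supp(Y))    # lift(X ⇒ Y) = 0.6 / 0.64 ≈ 0.94
conviction = (1 - supp(Y)) / (1 - confidence)     # conv(X ⇒ Y) = 0.2 / 0.25 = 0.8
print(support, confidence, lift, conviction)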
Apriori Algorithm
Association rule generation is usually split up into two separate steps:
First, minimum support is applied to find all frequent itemsets in a database.
Second, frequent itemsets and the minimum confidence constraint are used to form rules.
Finding all frequent itemsets in a database is difficult since it involves searching all possible itemsets
(item combinations).
The set of possible itemsets is the power set over I and has size 2^n - 1 (excluding the empty set,
which is not a valid itemset).
Although the size of the powerset grows exponentially in the number of items n in I, efficient search
is possible using the downward-closure property of support which guarantees that for a frequent
itemset, all its subsets are also frequent and thus for an infrequent itemset, all its supersets must
also be infrequent.
Pseudo-code for Apriori Algorithm
procedure Apriori(T, minSupport)
{ // T is the transaction database and minSupport is the minimum support threshold
    L1 = {frequent 1-itemsets};
    for (k = 2; Lk-1 is not empty; k++)
    {
        Ck = candidates generated from Lk-1
        /* i.e., join Lk-1 with itself and eliminate any candidate that contains
           a (k-1)-size subset that is not frequent */
        for each transaction t in T
        {
            increment the count of all candidates in Ck that are contained in t
        }
        Lk = candidates in Ck with count >= minSupport
    }
    return the union of all Lk;
}
Working of the Algorithm
As is common in association rule mining, given a set of itemsets , the algorithm attempts to find
subsets which are common to at least a minimum number C of the itemsets. Apriori uses a "bottom
up" approach, where frequent subsets are extended one item at a time and groups of candidates are
tested against the data. The algorithm terminates when no further successful extensions are found.
Apriori uses breadth-first search and a tree structure to count candidate item sets efficiently. It
generates candidate item sets of length k from item sets of length k - 1. Then it prunes the
candidates which have an infrequent sub-pattern. According to the downward closure lemma, the
candidate set contains all frequent k-length item sets. After that, it scans the transaction database to
determine frequent item sets among the candidates. Apriori, while historically significant, suffers
from a number of inefficiencies or trade-offs, which have spawned other algorithms. Candidate
generation produces large numbers of subsets. Bottom-up subset exploration (essentially a
breadth-first traversal of the subset lattice) finds any maximal subset S only after all 2^|S| - 1 of
its proper subsets.
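For readers who prefer working code to pseudocode, the following is a compact, unoptimized Python sketch of the Apriori procedure described above; it is only one possible realisation, and the example database is invented.

# Compact Apriori sketch; itemsets are represented as frozensets
from itertools import combinations

def apriori(transactions, min_support):
    # Returns a dict mapping each frequent itemset to its support count
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Scan the database once to count how many transactions contain each candidate
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Join the frequent k-itemsets to build (k+1)-candidates, then prune every
        # candidate that has an infrequent k-subset (the downward-closure property)
        k += 1
        keys = list(survivors)
        current = {a | b for a in keys for b in keys if len(a | b) == k}
        current = {c for c in current
                   if all(frozenset(s) in survivors for s in combinations(c, k - 1))}
    return frequent

db = [{"beer", "diaper", "bread"}, {"beer", "diaper"}, {"bread", "milk"}, {"beer", "bread"}]
print(apriori(db, min_support=2))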

FREQUENT PATTERNS IN RESULT
MINIMUM SUPPORT = 20%
[40] 599 [45] 530 [50] 626 [55] 722 [60] 871 [65] 939 [70] 1141 [75] 1195 [80] 1123
[85] 829 [90] 370 [40, 50] 310 [40, 55] 302 [40, 60] 337 [40, 65] 344 [40, 70] 417 [40, 75]
412 [40, 80] 371 [45, 60] 311 [45, 65] 331 [45, 70] 380 [45, 75] 380 [45, 80] 345 [50, 55]
311 [50, 60] 362 [50, 65] 383 [50, 70] 447 [50, 75] 459 [50, 80] 413 [55, 60] 423 [55, 65]
446 [55, 70] 525 [55, 75] 544 [55, 80] 502 [55, 85] 344 [60, 65] 527 [60, 70] 640 [60, 75]
676 [60, 80] 620 [60, 85] 439 [65, 70] 688 [65, 75] 723 [65, 80] 674 [65, 85] 488 [70, 75]
898 [70, 80] 824 [70, 85] 611 [75, 80] 889 [75, 85] 649 [75, 90] 301 [80, 85] 632 [80, 90]
301 [50, 70, 75] 330 [50, 75, 80] 303 [55, 60, 75] 317 [55, 65, 70] 318 [55, 65, 75] 330 [55,
65, 80] 306 [55, 70, 75] 409 [55, 70, 80] 359 [55, 75, 80] 378 [60, 65, 70] 381 [60, 65, 75]
398 [60, 65, 80] 372 [60, 70, 75] 499 [60, 70, 80] 446 [60, 70, 85] 324 [60, 75, 80] 485 [60,
75, 85] 338 [60, 80, 85] 315 [65, 70, 75] 533 [65, 70, 80] 489 [65, 70, 85] 359 [65, 75, 80]
523 [65, 75, 85] 370 [65, 80, 85] 361 [70, 75, 80] 666 [70, 75, 85] 482 [70, 80, 85] 459 [75,
80, 85] 496 [60, 70, 75, 80] 358 [65, 70, 75, 80] 385 [70, 75, 80, 85] 368
Patterns in the third-semester results of the COE batch
Marks 201 202 203 204 205 206 207 208 209 210
0 2 1 2 2 2 2 1 1 1 1
5 0 0 0 2 0 0 0 0 0 0
10 1 1 1 4 1 0 0 0 0 0
15 0 0 0 9 0 0 0 0 0 0
20 1 3 1 9 3 3 0 0 0 0
25 3 3 0 13 3 1 0 0 0 0
30 1 2 1 9 1 2 0 0 0 0
35 1 0 0 2 0 0 0 0 0 0
40 8 18 13 27 8 9 0 0 0 0
45 6 9 8 13 7 8 0 0 0 0
50 6 24 4 10 8 11 0 0 0 0
55 8 18 4 14 12 10 0 1 1 1
60 12 14 15 11 12 18 2 0 2 0
65 14 15 8 8 12 12 7 0 12 9
70 28 13 11 5 18 20 10 11 28 32
75 21 5 20 7 17 24 35 29 34 58
80 18 18 22 0 13 13 49 62 32 35
85 11 1 18 0 19 9 29 39 22 9
90 5 0 10 1 9 4 11 3 14 1
95 0 1 8 0 1 0 2 0 0 0
Domains where Apriori Algorithm is used
Adverse Drug Reaction Detection
The Apriori algorithm is used to perform association analysis on the characteristics of patients, the
drugs they are taking, their primary diagnosis, co-morbid conditions, and the ADRs or adverse events
(AE) they experience. This analysis produces association rules that indicate what combinations of
medications and patient characteristics lead to ADRs.
Oracle Bone Inscription Explication
Oracle Bone Inscription (OBI) is one of the oldest writing systems in the world, but of the roughly
6000 words found so far only about 1500 can be explicated explicitly, so the explication of OBI
remains a key open problem in this field. Exploring the correlation between OBI words with
association rule algorithms can aid research on their explication. First, the OBI data extracted from
the OBI corpus are preprocessed; with these processed data as input, the Apriori algorithm yields
the frequent itemsets. Combined with interestingness measures, the strong association rules
between OBI words are then produced.
Advantages and Disadvantages of Data Mining
Data mining is very advantageous because huge amounts of so-called "useless" data are analysed
critically. It helps in the following ways:
Helping with decision making
Improving company revenue and lowering costs
Market basket analysis, predicting future trends and customer purchase habits
Fraud analysis
But at the same time it can sometimes prove disastrous:
Delving into someone's privacy
Misuse of information
Transgression into security systems
