
Intelligent Systems in Business

Data Mining
Iulian Năstac
Faculty of Electronics, Telecommunications and Information
Technology, Polytechnic University of Bucharest,
E-mail: nastac@ieee.org
1. Introduction
• The capacity of digital data storage worldwide has
doubled every nine months for at least a decade.
• This is twice the rate predicted by Moore’s law for
the growth of computing power during the same
period.
• In 1965, just four years after the first planar
integrated circuit was invented, Gordon Moore
observed an exponential growth in the number of
transistors per integrated circuit and predicted that
this trend would continue.
2
Moore's Law

3
More data
• This “storage law” drives the need for, and
the rapid growth of, data mining
• Our ability to capture and store data has far
outstripped our ability to process and utilize
it, leaving us with “data tombs”

4
… and more data
• Databases today can range in size into the
terabytes — more than 1,000,000,000,000
bytes of data.
• Within these masses of data lies hidden
information of strategic importance.
• But when there are so many “trees”, how
do you draw meaningful conclusions about
the “forest”?

5
The solution
• The newest answer is data mining, which is
being used both to increase revenues
(through improved marketing) and to reduce
costs (through detecting and preventing
waste and fraud).
• Worldwide, organizations of all types are
achieving measurable payoffs from this
technology.
6
Is data mining a business tool?
• Data mining finds patterns and relationships
in data by using sophisticated techniques to
build models — abstract representations of
reality.
• A good model is a useful guide to
understanding the business environment and
making decisions.

7
2. What is data mining?

• A wide variety of techniques with great
potential to help companies focus on the
most important information in their data
warehouses

8
Definition
• The analysis and non-trivial extraction of
data from large databases for the purpose of
discovering new and valuable information
in the form of patterns and rules from
relationships between data elements

9
Definitions (cont.)
• Finding useful patterns in data is known by
different names (including data mining) in
different communities:
– knowledge extraction
– information discovery
– information harvesting
– data archeology
– data pattern processing

10
A summary of definitions
• Data mining (wikipedia) is the process of
extracting patterns from large data sets by
combining methods from statistics and artificial
intelligence with database management.

• Data mining is seen as an increasingly important
tool by modern business to transform data into
business intelligence, giving an informational
advantage.
11
Current application areas
• Fastest growing area in the entire business
intelligence market.
• Direct marketing campaigns.
• Fraud detection.
• Models to aid in financial predictions.
• Scientific discovery.

12
Multidisciplinary field
• Mathematics
• Operations research
• Pattern recognition
• Statistics
• Artificial intelligence
• Database theory
• Data visualization
• Marketing
13
3. Databases and data warehouses
What is a database?
• One or more large structured sets of data,
usually associated with software to update
and query the data.
• A simple database might be a single file
containing many records, each containing the
same set of fields, where each field has a
fixed width.

14
Traditional databases
• support OLTP (on-line transaction
processing) which includes:
– insertions
– updates
– deletions
– information query requirements
• are optimized to process queries
• cannot be optimized for data mining
15
What is a data warehouse?
• A generic term for a system for storing, retrieving
and managing large amounts of any type of data.
• Data warehouse software often includes
sophisticated compression and hashing techniques
for fast searches, as well as advanced filtering.
• Planners and researchers can use this data
warehouse freely without worrying about slowing
down day-to-day operations of the production
database.
16
Observation:
• Data warehouses are quite distinct from
traditional databases in their structure,
functioning, performance and purpose.

17
Data warehouses vs.
conventional databases (OLTP)
A data warehouse is designed and optimized for complex ad hoc
queries as opposed to data manipulation tasks or simple, fixed queries.

18
A typical data warehouse

• is highly denormalized (implies loss of
logical data organization + data duplication)
• contains a lot of redundant data and is
very large
• contains data that is non-volatile (i.e. not
updated)
• runs on a separate dedicated server

19
The advantages of data warehouses over conventional
databases from a data mining perspective:

• they make connections between disparate
entities
• they are optimized for query speed
• they contain historic data
• they are subject-oriented
– irrelevant information is excluded

20
Dimensional Modeling
• Data warehouses are typically developed using dimensional models
rather than the traditional entity/relationship models associated with
conventional relational databases.
• Data is modeled as a hypercube, and the schema is a so-called star
schema with a centralized fact table surrounded by smaller
dimensional tables representing key scientific objects.
• Changing from one dimensional hierarchy (orientation) to another is
easily accomplished in a data cube by a technique called pivoting
(rotation).
• Data warehouse storage utilizes indexing techniques to support high
performance access:
– bitmap indexing
– join indexing (precomputed joins linking tuples across tables)

21
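The pivoting (rotation) described above can be sketched in a few lines of Python. This is a minimal illustration only; the product/region/quarter names and the sales figures are invented for the example, not taken from the lecture:

```python
from collections import defaultdict

# A tiny "data cube": sales indexed by (product, region, quarter).
# All names and figures below are illustrative.
cube = {
    ("shoes", "North", "Q1"): 120, ("shoes", "South", "Q1"): 80,
    ("bags",  "North", "Q1"): 60,  ("bags",  "South", "Q1"): 90,
}

def pivot(cube, row_axis, col_axis):
    """Re-orient the cube: sum the measure into a (row, column) cross-tab."""
    axes = {"product": 0, "region": 1, "quarter": 2}
    table = defaultdict(int)
    for key, value in cube.items():
        table[(key[axes[row_axis]], key[axes[col_axis]])] += value
    return dict(table)

# One orientation of the cube ...
by_region = pivot(cube, "product", "region")
# ... rotated to another orientation with a single call (the "pivot").
by_product = pivot(cube, "region", "product")
print(by_region[("shoes", "North")])   # 120
```

Changing the hierarchy is just a matter of picking a different pair of axes, which is why pivoting is cheap on a cube.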
Star schemas
• Star schemas support ad hoc analytical queries.
• They are simple and predictable, with symmetrical entry
points to the fact table from each dimension, so strong
assumptions can be made when developing query tools.
• Star schemas are easily extended to accommodate new
data.
• Performance is high because the number of joins that
have to be made is typically low.
• Snowflake schema is a variation of the star schema in
which the dimensional tables from a star schema are
organized into a hierarchy by normalizing them.
22
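A star schema of the kind described above can be sketched with SQLite from the Python standard library. The table names, columns and data are illustrative assumptions, but the shape is the point: one central fact table, small dimension tables, and analytical queries that need only one join per dimension:

```python
import sqlite3

# Minimal star schema: one fact table referencing two dimension tables.
# All table/column names and values are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE fact_sales  (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'shoes'), (2, 'bags');
    INSERT INTO dim_store   VALUES (1, 'Helsinki'), (2, 'Espoo');
    INSERT INTO fact_sales  VALUES (1, 1, 100.0), (2, 1, 50.0), (1, 2, 70.0);
""")

# An ad hoc analytical query: total sales per product, one join needed.
rows = con.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(rows)   # [('bags', 50.0), ('shoes', 170.0)]
```

Extending the schema (e.g. a new time dimension) means adding one table and one foreign key to the fact table, which is why star schemas are "easily extended".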
Data warehousing applications
• OLAP (on-line analytical processing)
– Analysis of complex data
• DSS (decision-support system) or EIS (executive
information system)
– Provides an organization’s leading decision makers with
higher-level data for complex and important decisions
• Data Mining
– Knowledge discovery
– Searching data for unanticipated new knowledge
23
Data warehousing characteristics
• multidimensional conceptual view
• generic dimensionality
• unlimited dimensions and aggregation levels
• dynamic sparse matrix handling
• client-server architecture
• multi-user support
• accessibility
• transparency
• intuitive data manipulation
• consistent reporting performance
• flexible reporting
24
Basic Data Warehouse
Architecture

25
Data warehouse size
• Is generally an order of magnitude (sometimes two orders
of magnitude) larger than the source databases.
• Components:
– Enterprise-wide data warehouse
• huge projects requiring massive investment of time and resources
– Virtual data warehouse
• provides views of operational databases that are materialized for
efficient access
– Data marts
• logical subsets of the complete data warehouse, such as a department
• more tightly focused

26
Acquisition of data for the warehouse involves the
following steps:
• Data must be extracted from multiple and heterogeneous sources
– databases
– financial market data
– environmental data
• Data must be formatted for consistency within the warehouse
– various credit cards may report their transactions differently, making it difficult to compute all
credit sales
• Data must be cleaned to ensure validity before the data are loaded into the warehouse
– Check for validity and quality
– Recognizing erroneous and incomplete data is difficult to automate
– Cleaning that requires automatic error correction can be even tougher
– The process of returning cleaned data to the source is called backflushing
• The data must be fitted into the data model
– are converted to a multidimensional model
• The data must be loaded
– Loading the data is a significant task
– Monitoring tools for loads as well as methods to recover from incomplete or incorrect loads
are required
27
Updating

• Incremental updating is the usual approach
• The refresh policy emerges as a compromise that takes into
account the answers to the following questions:
– How up-to-date must the data be?
– Can the warehouse go off-line, and for how long?
– What are the data interdependencies?
– What is the storage availability?
– What are the distribution requirements (i.e. replication,
partitioning)?
– What is the loading time?

28
Data storage involves the following processes:

• Storing the data according to the data model of the
warehouse
• Creating and maintaining required data structures
• Creating and maintaining appropriate access paths
• Providing for time-variant data as new data are
added
• Supporting the updating of warehouse data
• Refreshing the data
• Purging data
29
Important design considerations
• Usage projections
• The fit of the data model
• Characteristics of available sources
• Design of the metadata component
• Modular component design
• Design for manageability and change
• Considerations of distributed and parallel
architecture
30
Typical Functionality of Data
Warehouses
• Roll-up: Data is summarized with increasing generalization.
– E.g. weekly to quarterly to annually
• Drill-down: Increasing levels of detail are revealed.
– The complement of roll-up
• Pivot (rotation): Cross tabulation is performed.
• Slice and dice: Performing projection operations on the
dimensions.
• Sorting: Data is sorted by ordinal value.
• Selection: Data is available by value or range.
• Derived (computed) attributes: Attributes are computed by
operations on stored and derived values.
31
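The roll-up operation can be sketched in a few lines; the monthly figures and the month-to-quarter hierarchy below are invented for the example. Drill-down is simply the complement: keeping the detailed dictionary around and returning to it.

```python
from collections import defaultdict

# Monthly sales at the detailed level (illustrative figures).
monthly = {"Jan": 10, "Feb": 12, "Mar": 11, "Apr": 15, "May": 14, "Jun": 16}
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
                    "Apr": "Q2", "May": "Q2", "Jun": "Q2"}

def roll_up(data, hierarchy):
    """Summarize with increasing generalization (e.g. months -> quarters)."""
    out = defaultdict(int)
    for key, value in data.items():
        out[hierarchy[key]] += value
    return dict(out)

quarterly = roll_up(monthly, month_to_quarter)
print(quarterly)   # {'Q1': 33, 'Q2': 45}
```

Slice and dice is the same idea with filtering: restrict `data` to one member of a dimension before aggregating.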
Difficulties of implementing data
warehouses
• Construction
– design
– hardware / software limitations
– implementations
• Administration
– a team of highly skilled technical experts is needed
– careful coordination
• Quality control

32
Open Issues in Data Warehouses
• Data acquisition
• Data quality management
• Selection and construction of appropriate
access paths and structures
• Self-maintainability
• Functionality
• Performance optimization
33
4. Data Mining Models
There are two main kinds of models in data mining:
• Predictive models
– can be used to forecast explicit values, based on patterns
determined from known results
• Descriptive models
– describe patterns in existing data, and are generally used to
create meaningful subgroups such as demographic clusters

34
Basic data mining concepts
• Hypothesis driven approach
– Inferring patterns or models from data to test a
hypothesis
• Discovery driven approach
– No a priori hypothesis stated

35
Data mining is…
• Decision Trees

• Nearest Neighbor
Classification

• Neural Networks

• Rule Induction

• K-means Clustering

• Support Vector Machines
36
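As a taste of the techniques listed above, here is a bare-bones k-means sketch in pure Python, on one-dimensional illustrative data. Real implementations handle many dimensions, multiple restarts and convergence tests; this only shows the two alternating steps (assign points to the nearest centroid, then move each centroid to its cluster mean):

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Bare-bones k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obviously separated 1-D groups (illustrative data).
data = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
print(k_means(data, k=2))
```

With data this well separated the two centroids land near 1 and 10 regardless of the random start.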


Data mining is not ...
• Data warehousing

• SQL / Ad Hoc Queries / Reporting

• Software Agents

• Online Analytical Processing (OLAP)

• Data Visualization
37
Models
• Models range from “easy to understand” to
“somewhat incomprehensible”:

— Decision trees (easier)
— Rule induction
— Regression models
— Neural networks (harder)

38
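Why decision trees sit at the "easier" end: even a one-split tree reads off as a plain business rule. A toy sketch with invented income data (the threshold search below is a hypothetical illustration, not a full tree-induction algorithm):

```python
# A single-split decision tree ("stump") learned by brute force.
# Data: (annual_income_k, bought_product) pairs -- purely illustrative.
samples = [(18, 0), (22, 0), (25, 0), (40, 1), (52, 1), (61, 1)]

def best_stump(samples):
    """Try every observed value as a threshold; keep the fewest-errors split."""
    best = None
    for threshold, _ in samples:
        errors = sum((x >= threshold) != bool(y) for x, y in samples)
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best

threshold, errors = best_stump(samples)
# The fitted model reads as a plain rule a manager can audit:
print(f"IF income >= {threshold}k THEN buys (errors: {errors})")
```

A neural network fitted to the same data would predict as well or better, but would offer no such one-line explanation, which is the trade-off the slide describes.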
How to perform data mining
• Determine the business objectives
• Data preparation
• Apply a data mining algorithm
• Results analysis
• Knowledge assimilation

39
Knowledge discovery in
databases (KDD)
• KDD refers to the overall process of
discovering useful knowledge from data
• Data mining is a particular step in this
process
• E.g. an application of specific algorithms
for extracting patterns (models) from data

40
The KDD-process (Fayyad et al)

The nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data

41
The KDD phases (Elmasri et al)
• Data selection
– specific items or categories of items may be selected
• Data cleansing
– corrects or eliminates the errors from data
• Enrichment
– enhances the data with additional sources of information
• Data transformation or encoding
– may be done to reduce the amount of data
• Data mining
• Reporting and displaying the discovered information
– listings
– graphic outputs
– summary tables
– visualizations
42
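The cleansing and transformation/encoding phases can be sketched as follows. The records, the country-code mapping and the validity rules are all illustrative assumptions; real cleansing logic is domain-specific and, as the slides note, hard to automate fully:

```python
# Raw records with typical defects: missing values, invalid values,
# inconsistent encodings (all data illustrative).
raw = [
    {"age": "34",  "country": "RO"},
    {"age": "",    "country": "ro"},      # missing age
    {"age": "-5",  "country": "Romania"}, # invalid age, inconsistent name
]

COUNTRY_CODES = {"ro": "RO", "romania": "RO"}  # assumed encoding table

def cleanse(record):
    """Correct or eliminate errors (returning None eliminates the record)."""
    try:
        age = int(record["age"])
    except ValueError:
        return None                       # missing or garbled -> drop
    if not 0 <= age <= 120:
        return None                       # outside the valid range -> drop
    country = COUNTRY_CODES.get(record["country"].lower(), record["country"])
    return {"age": age, "country": country}

clean = [r for r in (cleanse(r) for r in raw) if r is not None]
print(clean)   # [{'age': 34, 'country': 'RO'}]
```

Records that cannot be repaired are the ones a backflushing step would report to the source system.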
Goals of Data Mining and
Knowledge Discovery
• Prediction
– what consumers will buy under certain discounts
– how much sales volume a store would generate in a given period
– whether deleting a product line would yield more profits
• Identification (Regression)
– data patterns can be used to identify the existence of an item, an event, or an
activity
– attempts to find a function which models the data with the least error
• Clustering and classification
– for example customers in a supermarket can be categorized into:
• discount-seeking shoppers
• shoppers in a rush
• loyal regular shoppers
• infrequent shoppers
• Association rule learning (Optimization)
– under constraints (limited resources like time, space, money)
– Searches for relationships between variables. For example, a supermarket might
gather data on customer purchasing habits. Using association rule learning, the
supermarket can determine which products are frequently bought together and use
this information for marketing purposes
43
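The supermarket example can be made concrete with a minimal sketch that counts item and pair frequencies and derives the two standard rule measures, support and confidence, for one candidate rule. The baskets below are invented:

```python
from itertools import combinations
from collections import Counter

# Market baskets (illustrative transactions).
baskets = [
    {"handbag", "shoes"}, {"handbag", "shoes", "scarf"},
    {"handbag", "scarf"}, {"shoes"}, {"handbag", "shoes"},
]

item_count = Counter()
pair_count = Counter()
for b in baskets:
    item_count.update(b)
    pair_count.update(frozenset(p) for p in combinations(sorted(b), 2))

# Rule "handbag -> shoes":
#   support    = P(handbag and shoes together)
#   confidence = P(shoes | handbag)
n = len(baskets)
both = pair_count[frozenset({"handbag", "shoes"})]
support = both / n
confidence = both / item_count["handbag"]
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Algorithms such as Apriori scale this counting to millions of baskets by pruning itemsets whose support is already too low.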
Let us remember the models
• Supervised
— Regression models
— k-Nearest-Neighbor
— Neural networks
— Rule induction
— Decision trees
• Unsupervised
— K-means clustering
— Self organized maps

44
Types of Knowledge Discovered
during Data Mining
• Association rules
– When a female retail shopper buys a handbag, she is likely to buy shoes
• Classification hierarchies
– A model may be developed for the factors that determine the desirability
of a store location on a 1-10 scale
• Sequential patterns
– Association among events with certain temporal relationships (the lady
who buys a handbag will buy shoes within a week)
• Patterns within time series
– Two products show the same selling pattern in summer but a different one
in winter
• Categorization and segmentation
– The adult population may be categorized into five groups from “most
likely to buy” to “least likely to buy” a new product
45
An Architecture for Data Mining

46
5. Profitable Applications
• A wide range of companies have deployed
successful applications of data mining.
• Two critical factors for success with data
mining are:
– a large, well-integrated data warehouse
– a well-defined understanding of the business
process

47
Example 1
• A pharmaceutical company can analyze its recent sales force activity
and their results to improve targeting of high-value physicians and
determine which marketing activities will have the greatest impact in
the next few months.
• The data needs to include competitor market activity as well as
information about the local health care systems.
• The results can be distributed to the sales force via a wide-area
network that enables the representatives to review the
recommendations from the perspective of the key attributes in the
decision process.
• The ongoing, dynamic analysis of the data warehouse allows best
practices from throughout the organization to be applied in specific
sales situations.

48
Example 2
• A credit card company can leverage its vast
warehouse of customer transaction data to identify
customers most likely to be interested in a new
credit product.
• Using a small test mailing, the attributes of
customers with affinity for the product can be
identified.
• Recent projects have indicated more than a 20-
fold decrease in costs for targeted mailing
campaigns over conventional approaches.
49
Example 3
• A diversified transportation company with a large
direct sales force can apply data mining to identify
the best prospects for its services.
• Using data mining to analyze its own customer
experience, this company can build a unique
segmentation identifying the attributes of high-
value prospects.
• Applying this segmentation to a general business
database can yield a prioritized list of prospects by
region.
50
Example 4
• A large consumer package goods company can
apply data mining to improve its sales process to
retailers.
• Data from consumer panels, shipments, and
competitor activity can be applied to understand
the reasons for brand and store switching.
• Through this analysis, the manufacturer can select
promotional strategies that best reach its target
customer segments.
51
Evolutionary Step | Business Question | Enabling Technologies | Characteristics

Data Collection (1970s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | Retrospective, static data delivery

Data Access (1980s) | "What were unit sales in Finland last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support (1990s) | "What were unit sales in Finland last March? Drill down to Helsinki." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today) | "What’s likely to happen to Helsinki unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Prospective, proactive information delivery
52
Current shortcomings
• Insufficient training
• Inadequate tool support
• Data unavailability
• Overabundance of patterns
• Changing and time-oriented data
• Spatially oriented data
• Complex data types
• Scalability
53
References
• Michael R. Berthold, and David J. Hand (2007): Intelligent Data Analysis: An Introduction,
Springer; 2nd edition

• Michael J. A. Berry and Gordon S. Linoff (2004): Data Mining Techniques: For Marketing, Sales,
and Customer Relationship Management, Wiley Computer Publishing; 2nd edition

• Kantardzic, Mehmed (2003): Data Mining: Concepts, Models, Methods, and Algorithms. John
Wiley & Sons.

• David J. Hand, Heikki Mannila, and Padhraic Smyth (2001): Principles of Data Mining (Adaptive
Computation and Machine Learning), The MIT Press

• Armstrong, J.S. (2001): Principles of Forecasting, Kluwer Academic Publishers, Norwell,
Massachusetts

• Elmasri, R. and Navathe, S.B. (2000): Data Warehousing And Data Mining, chapter 26, pp. 841-
872, in "Fundamentals of Database Systems", 3rd ed., Addison-Wesley

• Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. (1996): The KDD Process for Extracting Useful
Knowledge from Volumes of Data, Communications of the ACM, November 1996, Vol. 39, No. 11,
pp. 27-34

• Thearling, K., Data Mining Tutorial, http://www.thearling.com/dmintro/dmintro_2.htm


54
