Mohsin Sayed (43)
Shafat Ali (44)
Arshad Shaikh (46)
Maqsud Shaikh (47)
Saif Shaikh (48)
Academic Year: 2012-2013 Under the guidance of Prof. Awesh Bhornya
CERTIFICATE
This is to certify that students from the A division of Anjuman-I-Islam's Allana Institute of Management Studies (AIAIMS), pursuing the first year of MMS, have completed the dissertation project on Data Warehousing and Data Mining in the Academic Year 2013-2014.
Dr. Lukman Patel
Director, AIAIMS

Prof. Awesh Bhornya
Project Guide
ACKNOWLEDGEMENT
A project cannot be said to be the work of an individual. A project is a combination of the views, ideas, suggestions and contributions of many people. We are extremely thankful to our project guide, Prof. Awesh Bhornya, for giving us valuable guidance, helping us throughout this project, and for his special attention to us.
We wish to thank all the people who helped and assisted us wherever and whenever we needed it, by giving us their precious time and valuable suggestions.
We also wish to thank all the respondents who gave some of their valuable time to fill up the questionnaires, without which the project study would not have been a success.
Index

Data Warehousing
  History
  What is Data Warehousing
  Subject Oriented
  Integrated
  Non-volatile
  Time Variant
  Benefits of a Data Warehouse
  Key developments in early years of data warehousing
  Dimensional v/s Normalized
  Data warehouses v/s operational systems
  Operational Systems v/s Data Warehousing Systems
  Evolution in organization use
  Data Warehouse Architecture
  Data Warehouse Architecture components
  Types of Data Warehouse Architectures

Data Mining
  Overview
  The Foundations of Data Mining
  The Scope of Data Mining
  Databases can be larger in depth and breadth
  How data mining works
  Architecture of Data Mining
  Components of Data Mining
  Integration of a data mining system with a database or data warehouse system
  Conclusion
  Bibliography
Data Warehousing
History
The concept of data warehousing dates back to the late 1980s when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly the high costs associated with it. In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations it was typical for multiple decision support environments to operate independently. Though each environment served different users, they often required much of the same stored data. The process of gathering, cleaning and integrating data from various sources, usually from long-term existing operational systems (usually referred to as legacy systems), was typically in part replicated for each environment. Moreover, the operational systems were frequently reexamined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from "data marts" that were tailored for ready access by users.
The typical extract, transform, load (ETL)-based data warehouse uses staging, integration, and access layers to house its key functions. The staging layer stores raw data extracted from each of the disparate source data systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer, often storing this transformed data in an operational data store (ODS) database. The integrated data are then moved to yet another database, often called the data warehouse database, where the data is arranged into hierarchical groups often called dimensions and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve data.

A data warehouse constructed from an integrated data source system does not require ETL, staging databases, or operational data store databases. The integrated data source systems may be considered part of a distributed operational data store layer. Data federation or data virtualization methods may be used to access the distributed integrated source data systems, and to consolidate and aggregate data directly into the data warehouse database tables. Unlike the ETL-based data warehouse, the integrated source data systems and the data warehouse are all integrated, since there is no transformation of dimensional or reference data. This integrated data warehouse architecture supports drill-down from the aggregate data of the data warehouse to the transactional data of the integrated source data systems.

Data warehouses can be subdivided into data marts, which store subsets of data from a warehouse. This definition of the data warehouse focuses on data storage. The main source of the data is cleaned, transformed, catalogued and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support (Marakas & O'Brien 2009).
However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition of data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata. Data warehousing provides a single, complete and consistent store of data, obtained from a variety of different sources and made available to end users in a form they can understand and use in a business context.
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources. In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
Subject Oriented: Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions such as "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
Integrated: Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
Nonvolatile: Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.
Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
Benefits of a Data Warehouse

A data warehouse can:

o Maintain data history, even if the source transaction systems do not.
o Integrate data from multiple source systems, enabling a central view across the enterprise. This benefit is always valuable, but particularly so when the organization has grown by merger.
o Improve data quality, by providing consistent codes and descriptions, and by flagging or even fixing bad data.
o Present the organization's information consistently.
o Provide a single common data model for all data of interest, regardless of the data's source.
o Restructure the data so that it makes sense to the business users.
o Restructure the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems.
o Add value to operational business applications, notably customer relationship management (CRM) systems.
The environment for data warehouses and data marts includes the following:

o Source systems that provide data to the warehouse or mart;
o Data integration technology and processes that are needed to prepare the data for use;
o Different architectures for storing data in an organization's data warehouse or data marts;
o Different tools and applications for the variety of users;
o Metadata, data quality, and governance processes that must be in place to ensure that the warehouse or mart meets its purposes.
In regard to the source systems listed above, Rainer states, "A common source for the data in data warehouses is the company's operational databases, which can be relational databases." Regarding data integration, Rainer states, "It is necessary to extract data from source systems, transform them, and load them into a data mart or warehouse." Rainer also discusses storing data in an organization's data warehouse or data marts: there are a variety of possible architectures to store decision-support data.

Metadata are data about data. IT personnel need information about data sources; database, table, and column names; refresh schedules; and data usage measures.

Today, the most successful companies are those that can respond quickly and flexibly to market changes and opportunities. A key to this response is the effective and efficient use of data and information by analysts and managers. A data warehouse is a repository of historical data that are organized by subject to support decision makers in the organization. Once data are stored in a data mart or warehouse, they can be accessed.
Key developments in early years of data warehousing

1960s – General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.
1970s – ACNielsen and IRI provide dimensional data marts for retail sales.
1970s – Bill Inmon begins to define and discuss the term Data Warehouse.
1975 – Sperry Univac introduces MAPPER (Maintain, Prepare, and Produce Executive Reports), a database management and reporting system that includes the world's first 4GL. It was the first platform specifically designed for building Information Centers (a forerunner of contemporary enterprise data warehousing platforms).
1983 – Teradata introduces a database management system specifically designed for decision support.
1983 – At Sperry Corporation, Martyn Richard Jones defines the Sperry Information Center approach, which, while not a true data warehouse in the Inmon sense, contained many of the characteristics of data warehouse structures and processes as defined previously by Inmon, and later by Devlin. It was first used at the TSB England & Wales.
1984 – Metaphor Computer Systems, founded by David Liddle and Don Massaro, releases Data Interpretation System (DIS), a hardware/software package and GUI for business users to create a database management and analytic system.
1988 – Barry Devlin and Paul Murphy publish the article "An architecture for a business and information system" in IBM Systems Journal, where they introduce the term "business data warehouse".
1990 – Red Brick Systems, founded by Ralph Kimball, introduces Red Brick Warehouse, a database management system specifically for data warehousing.
1991 – Prism Solutions, founded by Bill Inmon, introduces Prism Warehouse Manager, software for developing a data warehouse.
1992 – Bill Inmon publishes the book Building the Data Warehouse.
1995 – The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.
1996 – Ralph Kimball publishes the book The Data Warehouse Toolkit.
2000 – Daniel Linstedt releases the Data Vault, enabling real-time, auditable data warehouses.
Dimensional v/s Normalized

In the normalized approach, the data in the data warehouse are stored following database normalization rules. The normalized structure divides data into entities, which creates several tables in a relational database. When applied in large enterprises, the result is dozens of tables that are linked together by a web of joins. Furthermore, each of the created entities is converted into a separate physical table when the database is implemented (Kimball, Ralph 2008). The main advantage of this approach is that it is straightforward to add information into the database. A disadvantage is that, because of the number of tables involved, it can be difficult for users both to:

1. Join data from different sources into meaningful information, and then
2. Access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.

It should be noted that both normalized and dimensional models can be represented in entity-relationship diagrams, as both contain joined relational tables. The difference between the two models is the degree of normalization. These approaches are not mutually exclusive, and there are other approaches. Dimensional approaches can involve normalizing data to a degree (Kimball, Ralph 2008).

In Information-Driven Business (Wiley 2010), Robert Hillard proposes comparing the two approaches based on the information needs of the business problem. The technique shows that normalized models hold far more information than their dimensional equivalents (even when the same fields are used in both models), but this extra information comes at the cost of usability. The technique measures information quantity in terms of information entropy, and usability in terms of the Small Worlds data transformation measure.
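The contrast between the two approaches can be sketched with a small example. The schemas below are invented purely for illustration, using Python's built-in sqlite3 module: the first set of tables is normalized (one table per entity, joined through keys), the second is a simple dimensional star schema with one fact table.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Normalized approach: each entity becomes its own table,
# linked through foreign keys (hypothetical schema).
con.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sale     (sale_id     INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer,
                       product_id  INTEGER REFERENCES product,
                       sale_date   TEXT, amount REAL);
""")

# Dimensional approach: one central fact table joined to a few
# denormalized dimension tables (a simple star schema).
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          name TEXT, category TEXT, brand TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY,
                          full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE fact_sales  (product_key INTEGER REFERENCES dim_product,
                          date_key    INTEGER REFERENCES dim_date,
                          sale_amount REAL, sale_units INTEGER);
""")

tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

In a real enterprise the normalized side would run to dozens of tables, which is exactly the usability cost the text describes; the dimensional side keeps the join paths short and predictable for end users.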
The benefits of data warehousing include:

o Potential high return on investment
o Competitive advantage
o Increased productivity of corporate decision makers
OLTP systems, by contrast, are application oriented, perform reads and writes touching tens of records over detailed data, prioritize high performance and availability, serve users such as clerks and DBAs, and hold databases on the order of 100MB to GB.
Operational systems are optimized for preservation of data integrity and speed of recording of business transactions through use of database normalization and an entity-relationship model. Operational system designers generally follow the Codd rules of database normalization in order to ensure data integrity. Codd defined five increasingly stringent rules of normalization. Fully normalized database designs (that is, those satisfying all five Codd rules) often result in information from a business transaction being stored in dozens to hundreds of tables. Relational databases are efficient at managing the relationships between these tables. The databases have very fast insert/update performance because only a small amount of data in those tables is affected each time a transaction is processed. Finally, in order to improve performance, older data are usually periodically purged from operational systems.
Operational Systems v/s Data Warehousing Systems

Operational                                         Data Warehouse
Holds current data                                  Holds historic data
Data is dynamic                                     Data is largely static
Read/write accesses                                 Read-only accesses
Repetitive processing                               Ad hoc, complex queries
Transaction driven                                  Analysis driven
Application oriented                                Subject oriented
Used by clerical staff for day-to-day operations    Used by top managers for analysis
Normalized data model (ER model)                    Denormalized data model (dimensional model)
Optimized for writes and small queries              Optimized for queries involving a large portion of the warehouse

Evolution in organization use

These terms refer to the level of sophistication of a data warehouse:
Offline data warehouse: Data warehouses at this stage are updated from data in the operational systems on a regular basis, and the data warehouse data are stored in a data structure designed to facilitate reporting.
Sample applications
Some of the applications of data warehousing include:
o Agriculture
o Biological data analysis
o Call record analysis
o Churn prediction for telecom subscribers, credit card users, etc.
o Decision support
o Financial forecasting
o Insurance fraud analysis
o Logistics and inventory management
o Trend analysis
Problems in data warehousing include:

o Underestimation of resources for data loading
o Hidden problems with source systems
o Required data not captured
o Increased end-user demands
o High maintenance
o Long-duration projects
o Complexity of integration
The data travels from source systems to presentation servers via the data staging area. The entire process is popularly known as ETL (extract, transform, and load) or ETT (extract, transform, and transfer). Oracle's ETL tool is called Oracle Warehouse Builder (OWB) and MS SQL Server's ETL tool is called Data Transformation Services (DTS). Each component and the tasks performed by it are explained below:
OPERATIONAL DATA
The sources of data for the data warehouse are supplied from:

o Data from the mainframe systems in the traditional network and hierarchical formats.
o Data from relational DBMSs such as Oracle and Informix.
o In addition to this internal data, operational data also includes external data obtained from commercial databases and databases associated with suppliers and customers.
LOAD MANAGER
The load manager performs all the operations associated with extraction and loading data into the data warehouse. These operations include simple transformations of the data to prepare the data for entry into the warehouse. The size and complexity of this component will vary between data warehouses and may be constructed using a combination of vendor data loading tools and custom built programs.
WAREHOUSE MANAGER
The warehouse manager performs all the operations associated with the management of data in the warehouse. This component is built using vendor data management tools and custom built programs. The operations performed by warehouse manager include:
o Analysis of data to ensure consistency
o Transformation and merging of source data from temporary storage into data warehouse tables
o Creation of indexes and views on the base tables
o Denormalization
o Generation of aggregations
o Backing up and archiving of data
In certain situations, the warehouse manager also generates query profiles to determine which indexes and aggregations are appropriate.
QUERY MANAGER
The query manager performs all operations associated with management of user queries. This component is usually constructed using vendor end-user access tools, data warehousing monitoring tools, database facilities and custom built programs. The complexity of a query manager is determined by facilities provided by the end-user access tools and database.
DETAILED DATA
This area of the warehouse stores all the detailed data in the database schema. In most cases, detailed data is not stored online but is aggregated to the next level of detail. However, detailed data is added regularly to the warehouse to supplement the aggregated data.
SUMMARIZED DATA

This area of the warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. This area of the warehouse is transient, as it is subject to change on an ongoing basis in order to respond to changing query profiles. The purpose of the summarized information is to speed up query performance. The summarized data is updated continuously as new data is loaded into the warehouse.
ARCHIVE AND BACKUP DATA

This area of the warehouse stores detailed and summarized data for the purposes of archiving and backup. The data is transferred to storage archives such as magnetic tapes or optical disks.
19
META DATA
The data warehouse also stores all the metadata (data about data) definitions used by all processes in the warehouse. It is used for a variety of purposes, including:

o In the extraction and loading process, metadata is used to map data sources to a common view of information within the warehouse.
o In the warehouse management process, metadata is used to automate the production of summary tables.
o In the query management process, metadata is used to direct a query to the most appropriate data source.

The structure of the metadata differs in each process, because the purpose is different.
The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These users interact with the warehouse using end-user access tools. Examples of end-user access tools include:

o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools

THE ETL (EXTRACT, TRANSFORM, LOAD) PROCESS
In this section we discuss the four major processes of the data warehouse: extract (take data from the operational systems and bring it to the data warehouse), transform (convert the data into the internal format and structure of the data warehouse), cleanse (make sure the data is of sufficient quality to be used for decision making) and load (put the cleansed data into the data warehouse). The four processes from extraction through loading are often referred to collectively as data staging.
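The four staging steps can be sketched end to end in a few lines. This is a minimal illustration using Python's built-in sqlite3 as a stand-in warehouse; the record layout, the naming rule (EMP_NAME vs ENAME), and the default for missing values are all invented for this example, not taken from any particular tool.

```python
import sqlite3

# Extract: hypothetical rows pulled from two operational sources.
source_rows = [
    {"EMP_NAME": "alice", "GENDER": "F",    "SALES": "1200.50"},
    {"ENAME":    "BOB",   "GENDER": "Male", "SALES": None},
]

def transform(row):
    # Resolve attribute-naming inconsistency (EMP_NAME vs ENAME)
    # and convert values to common formats.
    name = (row.get("EMP_NAME") or row.get("ENAME") or "").upper()
    gender = "M" if row["GENDER"] in ("M", "Male") else "F"
    return {"name": name, "gender": gender, "sales": row["SALES"]}

def cleanse(row):
    # Replace missing numeric data with an explicit default.
    row["sales"] = float(row["sales"]) if row["sales"] is not None else 0.0
    return row

# Load: put the cleansed data into the warehouse database.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE employee_sales (name TEXT, gender TEXT, sales REAL)")
staged = [cleanse(transform(r)) for r in source_rows]
dw.executemany("INSERT INTO employee_sales VALUES (:name, :gender, :sales)",
               staged)

print(list(dw.execute("SELECT name, gender, sales FROM employee_sales")))
```

Real staging pipelines differ mainly in scale: the same extract, transform, cleanse and load stages are applied to millions of rows, usually by dedicated ETL tools rather than hand-written code.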
EXTRACT
Some of the data elements in the operational database can reasonably be expected to be useful in decision making, but others are of less value for that purpose. For this reason, it is necessary to extract the relevant data from the operational database before bringing it into the data warehouse. Many commercial tools are available to help with the extraction process; Data Junction is one such commercial product. The user of one of these tools typically has an easy-to-use windowed interface by which to specify the following:
o Which files and tables are to be accessed in the source database?
o Which fields are to be extracted from them? (This is often done internally by an SQL SELECT statement.)
o What are those fields to be called in the resulting database?
o What is the target machine and database format of the output?
o On what schedule should the extraction process be repeated?
TRANSFORM:
The operational databases developed can be based on any set of priorities, which keep changing with the requirements. Therefore, those who develop a data warehouse based on these databases are typically faced with inconsistency among their data sources. The transformation process deals with rectifying any such inconsistency. One of the most common transformation issues is attribute naming inconsistency: it is common for a given data element to be referred to by different names in different databases. Employee Name may be EMP_NAME in one database and ENAME in another. Thus one set of data names is picked and used consistently in the data warehouse. Once all the data elements have the right names, they must be converted to common formats. The conversion may encompass the following:

o Characters must be converted from ASCII to EBCDIC or vice versa.
o Mixed-case text may be converted to all uppercase for consistency.
o Numerical data must be converted into a common format.
o Data formats have to be standardized.
o Measurements may have to be converted.
o Coded data (Male/Female, M/F) must be converted into a common format.
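A few of these conversions can be sketched directly; the sample values, date formats and unit choice (pounds to kilograms) below are assumptions for illustration. Python's "cp037" codec is one common EBCDIC code page.

```python
from datetime import datetime

# Character-set conversion: encode text as EBCDIC bytes and back.
ebcdic_bytes = "SMITH".encode("cp037")
print(ebcdic_bytes.decode("cp037"))   # round-trips back to the original text

# Mixed-case text converted to all uppercase for consistency.
print("John Smith".upper())

# Date formats standardized to a single common format (ISO 8601 here).
def to_iso(date_str, fmt):
    return datetime.strptime(date_str, fmt).date().isoformat()

print(to_iso("03/25/2013", "%m/%d/%Y"))   # US-style source
print(to_iso("25-03-2013", "%d-%m-%Y"))   # day-first source

# Measurements converted to a common unit (pounds to kilograms).
def lb_to_kg(lb):
    return round(lb * 0.45359237, 3)

print(lb_to_kg(150))
```

Each rule is trivial on its own; the work in a real warehouse is cataloguing every source format so that no field escapes standardization.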
All these transformation activities can be automated, and many commercial products are available to perform the tasks. DataMAPPER from Applied Database Technologies is one such comprehensive tool.
CLEANSING
Information quality is the key consideration in determining the value of information. The developer of the data warehouse is not usually in a position to change the quality of its underlying historic data, though a data warehousing project can put a spotlight on data quality issues and lead to improvements for the future. It is, therefore, usually necessary to go through the data entered into the data warehouse and make it as error-free as possible. This process is known as data cleansing. Data cleansing must deal with many types of possible errors, including missing data and incorrect data at one source, and inconsistent data and conflicting data when two or more sources are involved. Several algorithms are used to clean the data.
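A toy version of the two error types mentioned above, missing data and conflicting data across sources, can be sketched as follows. The records, the city-synonym table, and the merge rules are invented for illustration and are not a standard cleansing algorithm.

```python
# Hypothetical records for the same customer from two source systems.
source_a = {"cust_id": 7, "city": "Mumbai", "phone": None}
source_b = {"cust_id": 7, "city": "Bombay", "phone": "022-555-0101"}

CITY_SYNONYMS = {"Bombay": "Mumbai"}   # reconcile known conflicting values

def cleanse(a, b):
    merged = {}
    for key in a:
        va, vb = a[key], b.get(key)
        # Normalize string values through the synonym table first.
        va = CITY_SYNONYMS.get(va, va) if isinstance(va, str) else va
        vb = CITY_SYNONYMS.get(vb, vb) if isinstance(vb, str) else vb
        if va is None:                 # missing in one source: take the other
            merged[key] = vb
        elif vb is None or va == vb:   # agreement (or only one value present)
            merged[key] = va
        else:                          # unresolved conflict: flag for review
            merged[key] = ("CONFLICT", va, vb)
    return merged

print(cleanse(source_a, source_b))
```

Production cleansing tools apply the same pattern at scale, with large reference tables, fuzzy matching, and audit trails for the records they flag.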
LOADING
Loading often implies physical movement of the data from the computer(s) storing the source database(s) to the one that will store the data warehouse database, assuming they are different. This takes place immediately after the extraction phase. The most common channel for data movement is a high-speed communication link. For example, Oracle Warehouse Builder is the API from Oracle that provides the features to perform the ETL task on an Oracle data warehouse.
Types of Data Warehouse Architectures

o Data Warehouse Architecture (Basic)
o Data Warehouse Architecture (with a Staging Area)
o Data Warehouse Architecture (with a Staging Area and Data Marts)
In Figure 1-2, the metadata and raw data of a traditional OLTP system are present, as is an additional type of data: summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales. A summary in Oracle is called a materialized view.
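The idea behind a summary can be shown with plain SQL. The sketch below, using Python's built-in sqlite3 with an invented sales table, precomputes a monthly total once into a summary table; engines such as Oracle maintain the equivalent automatically as a materialized view.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (sale_month TEXT, region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2013-08", "East", 100.0),
    ("2013-08", "West", 250.0),
    ("2013-09", "East", 300.0),
])

# Precompute the long aggregation once, as a summary table.
con.execute("""
CREATE TABLE monthly_sales_summary AS
SELECT sale_month, SUM(amount) AS total_amount
FROM sales
GROUP BY sale_month
""")

# A query like "August sales" now reads the small summary table
# instead of scanning the detailed fact data.
row = con.execute("SELECT total_amount FROM monthly_sales_summary "
                  "WHERE sale_month = '2013-08'").fetchone()
print(row[0])
```

The trade-off is freshness: a hand-built summary table must be refreshed as new detail rows arrive, which is exactly what a materialized view's refresh mechanism manages for you.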
Figure 1-4 Architecture of a Data Warehouse with a Staging Area and Data Marts
Data Warehousing Systems

A data warehousing system can perform advanced analyses of operational data without impacting operational systems. OLTP is very fast and efficient at recording business transactions, but not so good at providing answers to high-level strategic questions.

Component Systems

Legacy Systems: Any information system currently in use that was built using previous technology generations. Most legacy systems are operational in nature, largely because the automation of transaction-oriented business processes had long been the priority of IT projects.

Source Systems: Any system from which data is taken for a data warehouse. A source system is often called a legacy system in a mainframe environment.

Operational Data Stores (ODS): An ODS is a collection of integrated databases designed to support the monitoring of operations. Unlike the databases of OLTP applications (which are function oriented), the ODS contains subject-oriented, volatile, and current enterprise-wide detailed information. It serves as a system of record that provides comprehensive views of data in operational sources. Like data warehouses, ODSs are integrated and subject-oriented. However, an ODS is always current and is constantly updated. The ODS is an ideal data source for a data warehouse, since it already contains integrated operational data as of a given point in time. In short, an ODS is an integrated collection of clean data destined for the data warehouse.
Dimensional Modeling is the only viable technique for delivering data to the end users in a data warehouse.

Comparison between ER and Dimensional Modeling

The characteristics of the ER model are well understood; its ability to support operational processes is its underlying characteristic. Conventional ER models are constituted to:

o Remove redundancy in the data model
o Facilitate retrieval of individual records having certain critical identifiers
o Therefore, optimize online transaction processing (OLTP) performance
In contrast, the dimensional model is designed to support the reporting and analytical needs of a data warehouse system.

Why is ER not suitable for Data Warehouses?

o End users cannot understand or remember an ER model, and cannot navigate one. There is no graphical user interface (GUI) that takes a general ER diagram and makes it usable by end users.
o ER modeling is not optimized for complex, ad-hoc queries; it is optimized for repetitive, narrow queries.
o Use of the ER modeling technique defeats the basic allure of data warehousing, namely intuitive and high-performance retrieval of data, because it leads to highly normalized relational tables.

Introduction to Dimensional Modeling Concepts

The objective of dimensional modeling is to represent a set of business measurements in a standard framework that is easily understandable by end users. A dimensional model contains the same information as an ER model but packages the data in a symmetric format whose design goals include user understandability.
The main components of a dimensional model are fact tables and dimension tables.

A fact table is the primary table in each dimensional model and is meant to contain measurements of the business. The most useful facts are numeric and additive. Every fact table represents a many-to-many relationship, and every fact table contains a set of two or more foreign keys that join to their respective dimension tables.

A fact depends on many factors. For example, sale amount, a fact, depends on product, location and time. These factors are known as dimensions: dimensions are the factors on which a given fact depends. The sale amount fact can therefore be thought of as a function of three variables:

sales amount = f(product, location, time)

Likewise, in a sales fact table we may include other facts such as sale units and cost.

Dimension tables are companion tables to a fact table in a star schema. Each dimension table is defined by its primary key, which serves as the basis for referential integrity with any given fact table to which it is joined. Most dimension tables contain textual information.

To understand the concepts of facts, dimensions, and the star schema, consider the following scenario: imagine standing in the marketplace, watching the products being sold, and writing down the quantity sold and the sales amount each day for each product in each store. Note that a measurement needs to be taken at every intersection of all dimensions (day, product, and store). The information gathered can be stored in the following fact table:
The facts are Sale Unit, Sale Amount, and Cost (note that all are numeric and additive), which depend on the dimensions Date, Product, and Store. The details of the dimensions are stored in dimension tables.
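The marketplace scenario above maps directly onto a small star schema. The sketch below, using Python's built-in sqlite3 with invented sample rows, creates the three dimension tables and the fact table, then adds the additive facts up across the Date dimension.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT);
-- One fact row per (day, product, store) intersection, with
-- numeric, additive facts: sale units, sale amount, and cost.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date,
    product_key INTEGER REFERENCES dim_product,
    store_key   INTEGER REFERENCES dim_store,
    sale_units  INTEGER, sale_amount REAL, cost REAL);
""")
con.executemany("INSERT INTO dim_date VALUES (?, ?)",
                [(1, "2013-08-01"), (2, "2013-08-02")])
con.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "Soap")])
con.executemany("INSERT INTO dim_store VALUES (?, ?)", [(1, "Mumbai")])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 10, 200.0, 150.0),
    (2, 1, 1,  5, 100.0,  75.0),
])

# sales amount as a function of (product, store): join the fact table
# to its dimensions and sum the additive fact over time.
total = con.execute("""
SELECT p.name, s.city, SUM(f.sale_amount)
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_store   s ON f.store_key   = s.store_key
GROUP BY p.name, s.city
""").fetchone()
print(total)
```

Because the facts are additive, the same query can roll up by any combination of dimensions (by store, by month, by product) without changing the schema, which is the symmetry the dimensional model is designed for.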
Data Mining
Overview
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviours, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high-performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?" This white paper provides an introduction to the basic technologies of data mining, examples of profitable applications that illustrate its relevance to today's business environment, and a basic description of how data warehouse architectures can evolve to deliver the value of data mining to end users.
Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:
Massive data collection. Commercial databases are growing at unprecedented rates. A recent META Group survey of data warehouse projects found that 19% of respondents are beyond the 50 gigabyte level, while 59% expect to be there by second quarter of 1996.1 In some industries, such as retail, these numbers can be much larger.

Powerful multiprocessor computers. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology.

Data mining algorithms. These embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.

In the evolution from business data to business information, each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining. From the user's point of view, the four steps listed in Table 1 were revolutionary because they allowed new business questions to be answered accurately and quickly.

The core components of data mining technology have been under development for decades, in research areas such as statistics, artificial intelligence, and machine learning. Today, the maturity of these techniques, coupled with high-performance relational database engines and broad data integration efforts, makes these technologies practical for current data warehouse environments.
Table 1 - Steps in the Evolution of Data Mining

Evolutionary Step                           | Business Question                                                       | Characteristics
Data Collection (1960s)                     | "What was my total revenue in the last five years?"                     | Retrospective, static data delivery
Data Access (1980s)                         | "What were unit sales in New England last March?"                       | Retrospective, dynamic data delivery at record level
Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | Retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today)                | "What's likely to happen to Boston unit sales next month? Why?"         | Prospective, proactive information delivery
Automated prediction of trends and behaviours. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered quickly and directly from the data. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include
detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors. Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products developed. When data mining tools are implemented on high performance parallel processing systems, they can analyze massive databases in minutes. Faster processing means that users can automatically experiment with more models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions.
More columns. Analysts must often limit the number of variables they examine when doing hands-on analysis due to time constraints. Yet variables that are discarded because they seem unimportant may carry information about unknown patterns. High performance data mining allows users to explore the full depth of a database, without preselecting a subset of variables.
More rows. Larger samples yield lower estimation errors and variance, and allow users to make inferences about small but important segments of a population.
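The claim that larger samples yield lower estimation errors can be checked with a quick simulation. This sketch is purely illustrative (the data are random draws, not from the text): it measures how far a sample mean typically lands from the true mean for small versus large samples.

```python
import random

random.seed(42)
population_mean = 0.5  # true mean of uniform(0, 1)

def mean_abs_error(sample_size, trials=500):
    # Average distance between the sample mean and the true mean,
    # over many repeated samples of the given size.
    errors = []
    for _ in range(trials):
        sample = [random.random() for _ in range(sample_size)]
        errors.append(abs(sum(sample) / sample_size - population_mean))
    return sum(errors) / trials

small, large = mean_abs_error(10), mean_abs_error(1000)
print(small > large)  # more rows, lower estimation error
```

The 1000-row samples estimate the mean roughly ten times more accurately than the 10-row samples, matching the familiar rule that the standard error shrinks with the square root of the sample size.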
A recent Gartner Group Advanced Technology Research Note listed data mining and artificial intelligence at the top of the five key technology areas that "will clearly have a major impact across a wide range of industries within the next 3 to 5 years."2 Gartner also listed parallel architectures and data mining as two of the top 10 new technologies in which companies will invest during the next 5 years. According to a recent Gartner HPC Research Note, "With the rapid advance in data capture, transmission and storage, large-systems users will increasingly need to implement new and innovative ways to mine the after-market value of their vast stores of detail data, employing MPP [massively parallel processing] systems to create new sources of business advantage (0.9 probability)."3 The most commonly used techniques in data mining are:
Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).
Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
Nearest neighbour method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbour technique.
Rule induction: The extraction of useful if-then rules from data based on statistical significance.
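The nearest-neighbour method in the list above is simple enough to sketch in a few lines. The customer records and class labels below are invented for illustration; a real application would use many more records and attributes:

```python
import math

# Historical records: (age, monthly_usage) -> known class label.
history = [
    ((25, 10.0), "low"), ((30, 15.0), "low"),
    ((45, 90.0), "high"), ((50, 95.0), "high"),
]

def knn_classify(record, history, k=3):
    # Rank historical records by Euclidean distance to the new record.
    ranked = sorted(history, key=lambda item: math.dist(record, item[0]))
    # Majority vote among the k most similar historical records.
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

print(knn_classify((48, 85.0), history))  # "high"
```

The new record is classified by its neighbours: of the three most similar historical records, two are "high", so the majority vote wins.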
Many of these technologies have been in use for more than a decade in specialized analysis tools that work with relatively small volumes of data. These capabilities are now evolving to integrate directly with industry-standard data warehouse and OLAP platforms. The appendix to this white paper provides a glossary of data mining terms.
The model predicts what the outcome might be, given a similar situation in the past. Hopefully, if you've got a good model, you find your treasure.
This act of model building is thus something that people have been doing for a long time, certainly before the advent of computers or data mining technology. What happens on computers, however, is not much different from the way people build models. Computers are loaded up with lots of information about a variety of situations where an answer is known, and then the data mining software must run through that data and distill the characteristics that should go into the model. Once the model is built, it can be used in similar situations where you don't know the answer.

For example, say that you are the director of marketing for a telecommunications company and you'd like to acquire some new long distance phone customers. You could just randomly go out and mail coupons to the general population, just as you could randomly sail the seas looking for sunken treasure. In neither case would you achieve the results you desired, and of course you have the opportunity to do much better than random: you could use the business experience stored in your database to build a model. As the marketing director you have access to a lot of information about all of your customers: their age, sex, credit history, and long distance calling usage. The good news is
that you also have a lot of information about your prospective customers: their age, sex, credit history etc. Your problem is that you don't know the long distance calling usage of these prospects (since they are most likely now customers of your competition). You'd like to concentrate on those prospects who have large amounts of long distance usage. You can accomplish this by building a model. Table 2 illustrates the data used for building a model for new customer prospecting in a data warehouse.
                                                      Customers   Prospects
General information (e.g. demographic data)           Known       Known
Proprietary information (e.g. customer transactions)  Known       Target
Table 2 - Data Mining for Prospecting

The goal in prospecting is to make some calculated guesses about the information in the lower right-hand quadrant, based on the model that we build going from customer general information to customer proprietary information. For instance, a simple model for a telecommunications company might be:

"98% of my customers who make more than $60,000/year spend more than $80/month on long distance."

This model could then be applied to the prospect data to try to tell something about the proprietary information that this telecommunications company does not currently have access to. With this model in hand, new customers can be selectively targeted. Test marketing is an excellent source of data for this kind of modelling. Mining the results of a test market representing a broad but relatively small sample of prospects can provide a foundation for identifying good prospects in the overall market. Table 3 shows another common scenario for building models: predicting what is going to happen in the future.
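The percentage in a rule like the one above can be mined directly from customer records. The sketch below uses a handful of invented records and field names purely to show the mechanics of deriving such a rule:

```python
# Invented customer records: yearly income and monthly long distance spend.
customers = [
    {"income": 75000, "ld_spend": 95}, {"income": 82000, "ld_spend": 110},
    {"income": 61000, "ld_spend": 85}, {"income": 68000, "ld_spend": 40},
    {"income": 45000, "ld_spend": 20}, {"income": 30000, "ld_spend": 15},
]

# Of the customers earning over $60,000/year, what fraction spend
# over $80/month on long distance?
high_income = [c for c in customers if c["income"] > 60000]
heavy_users = [c for c in high_income if c["ld_spend"] > 80]
pct = 100 * len(heavy_users) / len(high_income)

print(f"{pct:.0f}% of customers earning over $60,000/year "
      f"spend over $80/month on long distance")
```

Applied to the prospect data, the rule then ranks prospects by income even though their long distance usage is unknown.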
                                                      Yesterday   Today   Tomorrow
Static information and current plans
(e.g. demographic data, marketing plans)              Known       Known   Known
Dynamic information (e.g. customer transactions)      Known       Known   Target
Table 3 - Data Mining for Predictions

If someone told you that he had a model that could predict customer usage, how would you know if he really had a good model? The first thing you might try would be to ask him to apply his model to your customer base, where you already knew the answer. With data mining, the best way to accomplish this is by setting aside some of your data in a vault to isolate it from the mining process. Once the mining is complete, the results can be tested against the data held in the vault to confirm the model's validity. If the model works, its observations should hold for the vaulted data.
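The "vault" idea above is what is usually called a holdout set. The following sketch uses invented, noiseless data just to show the mechanics: part of the data is set aside before mining, and the mined model is validated only against that untouched part.

```python
import random

random.seed(0)

# Invented records: (income, spends_over_80), with a simple underlying
# pattern so the illustration stays deterministic.
incomes = [random.randint(20000, 100000) for _ in range(200)]
data = [(inc, inc > 60000) for inc in incomes]

random.shuffle(data)
vault, training = data[:50], data[50:]   # set aside 25% in the "vault"

# "Model" mined from the training data: a simple income-threshold rule.
def model(income):
    return income > 60000

# Validate against the vaulted data the mining process never saw.
correct = sum(model(inc) == label for inc, label in vault)
accuracy = correct / len(vault)
print(accuracy)
```

Because the invented data contain no noise, the rule holds perfectly on the vaulted records; with real data, the vault accuracy is exactly the honest estimate of how the model will perform on customers it has never seen.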
Figure 1 - Integrated Data Mining Architecture

The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact, coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting. This warehouse can be implemented in a variety of relational database systems (Sybase, Oracle, Redbrick, and so on) and should be optimized for flexible and fast data access.

An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional structures allow users to analyze the data as they want to view their business: summarizing by product line, region, and other key perspectives of their business.

The Data Mining Server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. An advanced, process-centric metadata template defines the data mining objectives for specific business issues like campaign management, prospecting, and promotion optimization. Integration with the data
warehouse enables operational decisions to be directly implemented and tracked. As the warehouse grows with new decisions and results, the organization can continually mine the best practices and apply them to future decisions.

This design represents a fundamental shift from conventional decision support systems. Rather than simply delivering data to the end user through query and reporting software, the Advanced Analysis Server applies users' business models directly to the warehouse and returns a proactive analysis of the most relevant information. These results enhance the metadata in the OLAP Server by providing a dynamic metadata layer that represents a distilled view of the data. Reporting, visualization, and other analysis tools can then be applied to plan future actions and confirm the impact of those plans.
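The multidimensional summarizing an OLAP server provides (by product line, region, and so on) can be sketched as a simple roll-up over dimensions. The sales records and dimension names below are invented for illustration:

```python
from collections import defaultdict

# Invented fact rows, each tagged with two dimensions and one numeric fact.
sales = [
    {"product_line": "Phones", "region": "West", "amount": 120.0},
    {"product_line": "Phones", "region": "East", "amount": 80.0},
    {"product_line": "Cables", "region": "West", "amount": 30.0},
]

def roll_up(facts, dimension):
    # Summarize the additive amount fact along one chosen dimension.
    totals = defaultdict(float)
    for fact in facts:
        totals[fact[dimension]] += fact["amount"]
    return dict(totals)

# The same facts viewed from two different perspectives of the business.
print(roll_up(sales, "product_line"))  # {'Phones': 200.0, 'Cables': 30.0}
print(roll_up(sales, "region"))        # {'West': 150.0, 'East': 80.0}
```

An OLAP engine does this at scale, pre-aggregating along every dimension so users can pivot between views interactively.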
Integration of a Data Mining System with a Database or Data Warehouse System

When coupling a data mining (DM) system with a database (DB) or data warehouse (DW) system, the possible integration schemes include:
No coupling: No coupling means that a DM system will not utilize any function of a DB or DW system.
Loose coupling: Loose coupling means that a DM system will use some facilities of a DB or DW system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse.
Semitight coupling: Semitight coupling means that besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives (identified by the analysis of frequently encountered data mining functions) can be provided in the DB/DW system.
Tight coupling: Tight coupling means that a DM system is smoothly integrated into the DB/DW system.
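The loose-coupling scheme above can be sketched end to end: fetch data from a repository managed by the DB system, perform the mining outside the DB, and store the results back in a designated table. SQLite stands in for the DB system here, and all table and column names are invented:

```python
import sqlite3

# The repository managed by the DB system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [("alice", 120.0), ("alice", 80.0), ("bob", 30.0)])

# 1. Fetch data from the repository.
rows = conn.execute("SELECT customer, amount FROM transactions").fetchall()

# 2. Perform the mining outside the DB system (here, a trivial
#    per-customer aggregation standing in for a real mining step).
totals = {}
for customer, amount in rows:
    totals[customer] = totals.get(customer, 0.0) + amount

# 3. Store the mining results in a designated place in the database.
conn.execute("CREATE TABLE mining_results (customer TEXT, total REAL)")
conn.executemany("INSERT INTO mining_results VALUES (?, ?)", totals.items())

result = dict(conn.execute("SELECT customer, total FROM mining_results"))
print(result)  # {'alice': 200.0, 'bob': 30.0}
```

Semitight and tight coupling would push step 2 progressively into the DB/DW engine itself, instead of running it in the application as done here.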
Major issues in data mining include:

Mining methodology and user interaction issues:
- Mining different kinds of knowledge in databases
- Interactive mining of knowledge at multiple levels of abstraction
- Incorporation of background knowledge
- Data mining query languages and ad hoc data mining
- Presentation and visualization of data mining results
- Handling noisy or incomplete data
- Pattern evaluation (the interestingness problem)

Performance issues:
- Efficiency and scalability of data mining algorithms
- Parallel, distributed, and incremental mining algorithms

Issues relating to the diversity of database types:
- Handling of relational and complex types of data
Conclusion:
Data warehousing and data mining are two important components of business intelligence. Data warehousing is necessary to analyze the business needs, integrate data from several sources, and model the data in an appropriate manner to present the business information in the form of dashboards and reports. Data mining then extracts the useful, hidden patterns from that integrated data to support proactive, knowledge-driven decisions.
Bibliography:
Google
Wikipedia
Slideshare.com
Authorstream.com
Yahoo.com
Google Images