
Summer 2013
Master of Science in Information Technology (MScIT) Revised Fall 2011
Semester 4
MIT401 Data Warehousing and Data Mining, 4 Credits (Book ID: B1633)
Assignment Set (60 Marks)
Note: Answer All the Questions



1. Define OLTP. Explain the differences between OLTP and a Data Warehouse. [03+07 Marks]

Answer.

OLTP
OLTP is an Online Transaction Processing system that handles day-to-day business transactions. Examples are railway reservation systems, online store purchases, and so on. These systems handle a tremendous amount of data daily. But the common questions every business person asks are: Are these systems good enough for analyzing your business? Can we predict our business over a long period of time? Can we forecast the business with these data? OLTP systems alone cannot answer these questions; the answer to all of them is a Data Warehouse. It is therefore worth knowing the differences between OLTP systems and a Data Warehouse.

Differences between OLTP and Data Warehouse
Application databases are OLTP (On-Line Transaction Processing) systems where every transaction has to be recorded as and when it occurs. Consider the scenario where a bank ATM has disbursed cash to a customer but was unable to record this event in the bank records. If this happened frequently, the bank would not stay in business for long. So the banking system is designed to make sure that every transaction gets recorded within the time you stand before the ATM. A Data Warehouse (DW), on the other hand, is a database (yes, it is a database) that is designed to facilitate querying and analysis. Often designed as OLAP (On-Line Analytical Processing) systems, these databases contain read-only data that can be queried and analyzed far more efficiently than your regular OLTP application databases. In this sense an OLAP system is designed to be read-optimized. Separation from your application database also ensures that your business intelligence solution is scalable (your bank and ATMs don't go down just because the CFO asked for a report), better documented, and better managed. Creation of a DW leads to a direct increase in the quality of analysis, as the table structures are simpler (you keep only the needed information in simpler tables), standardized (well-documented table structures), and often de-normalized (to reduce the linkages between tables and the corresponding complexity of queries). A well-designed DW is the foundation upon which successful BI (Business Intelligence) and analytics initiatives are built. Data Warehouses usually store many months or years of data, to support historical analysis, whereas OLTP systems usually store data from only a few weeks or months. The OLTP system stores historical data only as

needed to successfully meet the requirements of the current transaction.

Table: OLTP vs Data Warehouse

Property              OLTP                    Data Warehouse
Nature of data        3 NF                    Multidimensional
Indexes               Few                     Many
Joins                 Many                    Some
Duplicate data        Normalized              Denormalized
Aggregate data        Rare                    Common
Queries               Mostly predefined       Mostly ad hoc
Nature of queries     Mostly simple           Mostly complex
Updates               All the time            Not allowed; only refreshed
Historical data       Often not available     Essential
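To make the contrast concrete, here is a minimal Python sketch (using pandas). The table name, column names, and values are invented for illustration and are not part of the course material; it is only meant to show the row-at-a-time write pattern of OLTP next to the read-only, ad hoc aggregation pattern of a Data Warehouse query.

# Minimal sketch; table, columns, and values are invented for illustration.
import pandas as pd

# OLTP-style: record one transaction at a time, current data only.
sales = pd.DataFrame(columns=["txn_id", "date", "store", "product", "amount"])
sales.loc[len(sales)] = [1001, "2013-06-01", "S01", "Milk", 2.50]

# DW/OLAP-style: read-only, ad hoc aggregation over accumulated history.
history = pd.concat([sales] * 3, ignore_index=True)   # stand-in for months of loaded data
report = history.groupby(["store", "product"])["amount"].sum()
print(report)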

2. What is Architecture? Explain the various components involved in it. [03+07 Marks]

Answer.

Architecture
The structure that brings all the components of a Data Warehouse together is known as the architecture. In your Data Warehouse, the architecture includes a number of factors. Primarily, it includes the integrated data that is the centerpiece. The architecture includes everything that is needed to prepare the data and store it. It also includes all the means for delivering information from your Data Warehouse. The architecture is further composed of the rules, procedures, and functions that enable your Data Warehouse to work and fulfill the business requirements. Finally, the architecture is made up of the technology that empowers your Data Warehouse. The architecture provides the overall framework for developing and deploying your Data Warehouse; it is a comprehensive blueprint. It defines the standards, measurements, general design, and support techniques. Architecture is a systematic arrangement of components. The main difference between the database architecture in a standard, online transaction processing oriented system (usually an ERP or CRM system) and a Data Warehouse is that the system's relational model is usually de-normalized into dimension and fact tables, which are typical of a Data Warehouse database design.

Various components of architecture
The major components of DWH architecture are:

Source Data Component
1. Production Data
2. Internal Data
3. Archived Data
4. External Data

Data Staging Component
1. Data Extraction

2. Data Transformation
3. Data Loading

Data Storage Component
Information Delivery Component
Metadata Component
Management and Control Component

Production Data
This category of data comes from the various operational systems of the enterprise. Based on the information requirements of the Data Warehouse, you choose segments of data from the different operational systems. While dealing with this data, you come across many variations in data formats.

Internal Data
In every organization, users keep their private spreadsheets, documents, customer profiles, and sometimes even departmental databases. This is the internal data, parts of which could be useful in a Data Warehouse for analysis.

Archived Data
Operational systems are primarily intended to run the current business. In every operational system, you periodically take the old data and store it in archived files. The circumstances in your organization dictate how often, and which portions of, the operational databases are archived. Some data is archived after a year.

External Data
Most executives depend on data from external sources for a high percentage of the information they use. They use statistics relating to their industry produced by external agencies, market share data of competitors, and standard values of financial indicators for their business to check on their performance.

Data Staging Component
The extracted data coming from several disparate sources needs to be changed, converted, and made ready in a format that is suitable to be stored for querying and analysis. Three major functions need to be performed to get the data ready: you have to extract the data, transform the data, and then load the data into the Data Warehouse storage.

Data Extraction
This function has to deal with numerous data sources. You have to employ the appropriate technique for each data source. Source data may come from different source machines in diverse data formats. Part of the source data may be in relational database systems; some data may be on other legacy network and hierarchical data models; many data sources may still be in flat files.

Data Transformation
In every system implementation, data conversion is an important function. For example, when you implement an operational system such as a magazine subscription application, you have to initially populate your database with data from the prior system's records. You may be converting over from a manual system, or you may be moving from a file-oriented system to a modern system supported with relational database tables.

Data Loading
Two distinct groups of tasks form the data loading function. When you complete the design and construction of the Data Warehouse and go live for the first time, you do the initial loading of the data into the Data Warehouse storage. The initial load moves large volumes of data and uses up substantial amounts of time. (A minimal code sketch of this extract-transform-load flow appears after the Data Storage Component description below.)

Data Storage Component
The data storage for the Data Warehouse is a separate repository. The operational systems of your enterprise support the day-to-day operations. These are online transaction processing applications. The data repositories for the operational systems typically contain only the current data.
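As a rough illustration of the staging functions just described, the following Python sketch strings together a minimal extract-transform-load flow. The flat-file name, the SQLite database, and the column names are assumptions made for this sketch only; real warehouse loading involves far more tooling and volume.

# Minimal ETL sketch (illustrative only; file, table, and column names are assumptions).
import csv
import sqlite3

def extract(path):
    """Extract: read raw records from a flat-file source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: standardize formats and convert types before loading."""
    out = []
    for r in rows:
        out.append((r["product"].strip().upper(), r["store"], float(r["amount"])))
    return out

def load(rows, db="warehouse.db"):
    """Load: write the prepared rows into the warehouse storage."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales (product TEXT, store TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

# load(transform(extract("daily_sales.csv")))   # typical staging flow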

Information Delivery Component
Who are the users that need information from the Data Warehouse? The range is fairly comprehensive. The new user comes to the Data Warehouse with no training and, therefore, needs prefabricated reports and preset queries. The casual user needs information once in a while, not regularly; this type of user also needs prepackaged information.

Metadata Component
Metadata in a Data Warehouse is similar to the data dictionary or the data catalog in a database management system. In the data dictionary, you keep information about the logical data structures, the files and addresses, the indexes, and so on. The data dictionary contains data about the data in the database.

Data Marts
A Data Mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers (e.g. department-wise). In scope, the data may derive from an enterprise-wide database or Data Warehouse, or be more specialized. A Data Warehouse is a central aggregation of data (which can be distributed physically); a Data Mart is a data repository that may derive from a Data Warehouse and that emphasizes ease of access and usability for a particular designed purpose.

Management and Control Component
This component of the Data Warehouse architecture sits on top of all the other components. The management and control component coordinates the services and activities within the Data Warehouse. This component controls the data transformation and the data transfer into the Data Warehouse storage. On the other hand, it moderates the information delivery to the users. It works with the database management systems and enables data to be properly stored in the repositories. It monitors the movement of data into the staging area and from there into the Data Warehouse storage itself.

3. Explain the basic tasks in data transformation. [10 Marks]

Answer.

Basic tasks in data transformation
Data transformation operations, such as normalization and aggregation, are additional data preprocessing procedures that contribute toward the success of the mining process. Transformations are often the most complex and, in terms of processing time, the most costly part of the ETL process. They can range from simple data conversions to extremely complex data scrubbing techniques.

Before moving the extracted data from the source systems into the Data Warehouse, you inevitably have to perform various kinds of data transformations. You have to transform the data according to standards because the data come from many dissimilar source systems. You have to ensure that after all the data is put together, the combined data does not violate any business rules.

Data transformation can be achieved in the following ways (a small code sketch of normalization and summarization appears at the end of this answer):

Smoothing: works to remove noise from the data.

Aggregation: summary or aggregation operations are applied to the data. For example, daily sales data may be aggregated so as to compute weekly and annual totals.

Generalization of the data: low-level or primitive (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, a categorical attribute such as street can be generalized to a higher-level concept such as city or country.

Normalization: the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.

Attribute construction: new attributes are constructed and added from the given set of attributes to help the mining process.

The data transformation function comprises the following basic tasks:

Selection: This takes place at the beginning of the whole process of data transformation. You select either whole records or parts of several records from the source systems. The task of selection usually forms part of the extraction function itself.

Splitting / Joining: This task includes the types of data manipulation you need to perform on the selected parts of source records. Sometimes (uncommonly), you will be splitting the selected parts even further during data transformation. Joining of parts selected from many source systems is more widespread in the Data Warehouse environment.

Conversion: This is an all-inclusive task. It includes a large variety of rudimentary conversions of single fields for two primary reasons: one, to standardize among the data extracted from disparate source systems, and the other, to make the fields usable and understandable to the users.

Summarization: Sometimes you may find that it is not feasible to keep data at the lowest level of detail in your Data Warehouse. It may be that none of your users ever needs data at the lowest granularity for analysis or querying. For example, for a grocery chain, sales data at the lowest level of detail for every transaction at the checkout may not be needed; storing sales by product, by store, by day in the Data Warehouse may be quite adequate. So, in this case, the data transformation function includes summarization of daily sales by product and by store.

Enrichment: This task is the rearrangement and simplification of individual fields to make them more useful for the Data Warehouse environment. You may use one or more fields from the same input record to create a better view of the data for the Data Warehouse. This principle is extended when one or more fields originate from multiple records, resulting in a single field for the Data Warehouse.
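As a concrete illustration of two of these operations, here is a small Python sketch (the sample values and column names are invented for this sketch only) that applies min-max normalization to a numeric attribute and then summarizes daily sales into weekly totals per store:

# Illustrative sketch; sample values and column names are assumptions.
import pandas as pd

daily = pd.DataFrame({
    "date":   pd.date_range("2013-06-03", periods=14, freq="D"),
    "store":  ["S01"] * 14,
    "amount": [120, 95, 130, 110, 150, 200, 180, 125, 90, 140, 115, 160, 210, 190],
})

# Normalization: scale 'amount' into the range 0.0 to 1.0 (min-max scaling).
lo, hi = daily["amount"].min(), daily["amount"].max()
daily["amount_scaled"] = (daily["amount"] - lo) / (hi - lo)

# Summarization / aggregation: roll daily sales up to weekly totals per store.
weekly = daily.groupby(["store", pd.Grouper(key="date", freq="W")])["amount"].sum()
print(weekly)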

4. What is Data Mining? How does data mining work? [03+07 Marks]

Answer.

Data mining
Data mining is concerned with finding hidden relationships present in business data to allow businesses to make predictions for future use. It is the process of data-driven extraction of not-so-obvious but useful information from large databases. Data mining has emerged as a key business intelligence technology. But the ultimate questions are: where can it be useful, and how does it work?

Consider the purpose of data mining with a POS (point of sale) system. Supermarkets usually employ a POS system that collects data on each item that is purchased: the item's brand name, category, size, the time and date of the purchase, and the price at which the item was purchased. In addition, the supermarket usually has a customer rewards program, which is also an input to the POS system. This information can directly link the products purchased with an individual. All this data, for every purchase made over the years, is stored by the supermarket in a database.

There are various definitions of data mining. Some of them are listed below.

Data mining is the efficient discovery of valuable, non-obvious information from a large collection of data.

Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in the data. It is the automatic discovery of new facts and relationships in data that are like valuable nuggets of business data. It is the process of extracting previously unknown, valid, and actionable information from large databases and then using that information to make crucial business decisions.

It is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, visualization, and neural networks.

Working
Although data mining is still in its infancy, companies in a wide range of industries, including finance, health care, manufacturing, and transportation, are already using data mining tools and techniques to take advantage of historical data. The whole logic of data mining is based on modeling. Modeling is simply the act of building a model (a set of examples or a mathematical relationship) based on data from situations where the answer is known, and then applying the model to other situations where the answers are not known.

As a simple example of building a model, consider the director of marketing for a telecommunications company. He would like to focus his marketing and sales efforts on the segments of the population most likely to become big users of long-distance services. He knows a lot about his customers, but it is impossible to discern the common characteristics of his best customers because there are so many variables. From his existing database of customers, which contains information such as age, sex, credit history, income, zip code, occupation, and so on, he can use data mining tools, such as neural networks, to identify the characteristics of those customers who make lots of long-distance calls. For instance, he might learn that his best customers are unmarried females between the ages of 21 and 35 who earn in excess of $60,000 per year. This, then, is his model for high-value customers, and he would budget his marketing efforts accordingly. Remember, data mining is the task of discovering interesting patterns from large amounts of data, where the data can be stored in databases, Data Warehouses, or other information repositories.
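A minimal sketch of this modeling idea in Python follows. The customer records are fabricated, and a decision tree is used here as a stand-in for the neural-network tools mentioned above; it is only meant to show the pattern of fitting a model on cases where the answer is known and applying it where it is not.

# Illustrative sketch only: fabricated customer records; decision tree as a stand-in model.
from sklearn.tree import DecisionTreeClassifier

# Situations where the answer IS known: [age, income, married (1) or not (0)].
X_known = [[23, 65000, 0], [30, 72000, 0], [55, 40000, 1],
           [42, 38000, 1], [28, 61000, 0], [60, 45000, 1]]
y_known = [1, 1, 0, 0, 1, 0]   # 1 = heavy long-distance user (known from history)

model = DecisionTreeClassifier().fit(X_known, y_known)   # build the model

# Situations where the answer is NOT known: score a new prospect.
print(model.predict([[26, 68000, 0]]))                   # predicted segment for a new customer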

5. Explain the data smoothing techniques. [10 Marks]

Answer.

Data smoothing techniques
Noise is random error or variance in a measured variable. Given a numeric attribute such as price, how can we smooth out the data to remove the noise? The following data smoothing techniques are available.

1) Binning: Binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Here are several common binning techniques (a short code sketch reproducing the worked example appears at the end of this answer):

Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equidepth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

In this example, the data for price are first sorted and then partitioned into equidepth bins of depth 3.

2) Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or clusters. Intuitively, values that fall outside of the set of clusters may be considered outliers. A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.

3) Combined computer and human inspection: Outliers may be identified through a combination of computer and human inspection. In one application, for example, an information-theoretic measure was used to help identify outlier patterns in a handwritten character database for classification. The measure's value reflected the "surprise" content of the predicted character label with respect to the known label.

4) Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the best line to fit two variables, so that one variable can be used to predict the other. Multiple linear regression is an extension of linear regression, in which more than two variables are involved and the data are fit

to a multidimensional surface. Using regression to find a mathematical equation that fits the data helps smooth out the noise.
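The sketch referred to under technique 1 follows: a short Python reproduction of the worked binning example above (the implementation details are an illustrative assumption, not a prescribed algorithm).

# Equidepth binning sketch reproducing the worked price example (illustrative only).
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
depth = 3
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value in a bin is replaced by the bin's mean.
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer of the bin's min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]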

6. What is Clustering? Explain in detail. [03+07 Marks]

Answer.

Clustering
Clustering is a method of grouping data into different groups, so that the data in each group share similar trends and patterns. Clustering constitutes a major class of data mining algorithms. The algorithm attempts to automatically partition the data space into a set of regions or clusters, to which the examples in the table are assigned, either deterministically or probabilistically. The goal of the process is to identify all sets of similar examples in the data, in some optimal fashion.

The objectives of clustering are:
to uncover natural groupings;
to initiate hypotheses about the data;
to find consistent and valid organization of the data.

So the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute best criterion which would be independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering will suit their needs. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding natural clusters and describing their unknown properties (natural data types), in finding useful and suitable groupings (useful data classes), or in finding unusual data objects (outlier detection).

A retailer may want to know where similarities exist in his customer base, so that he can create and understand different groups. He can use the existing database of different customers or, more specifically, of different transactions collected over a period of time. Clustering methods will help him in identifying different categories of customers. During the discovery process, the differences between data sets can be used to separate them into different groups, and the similarity between data sets can be used to group similar data together. (A small code sketch of this kind of customer grouping follows the list of applications below.)

Possible applications
Clustering algorithms can be applied in many fields, for instance:

Marketing: finding groups of customers with similar behavior, given a large database of customer data containing their properties and past buying records;
Biology: classification of plants and animals given their features;
Libraries: book ordering;
Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;
City-planning: identifying groups of houses according to their house type, value, and geographical location;
Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
WWW: document classification; clustering weblog data to discover groups of similar access patterns.
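Here is the small sketch referred to above: a Python illustration of grouping customers into segments with k-means. The customer features, the number of clusters, and the use of scikit-learn's KMeans are all assumptions for illustration; the text itself does not prescribe a particular clustering algorithm.

# Illustrative k-means sketch: fabricated customer features [annual spend, visits per month].
from sklearn.cluster import KMeans

customers = [[200, 2], [220, 3], [2500, 20], [2400, 18],
             [1000, 8], [950, 9], [210, 2], [2600, 22]]

# Partition the customers into 3 groups of similar behavior.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(km.labels_)           # cluster assignment for each customer
print(km.cluster_centers_)  # "typical" customer of each group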

Requirements
The main requirements that a clustering algorithm should satisfy are:

scalability;
dealing with different types of attributes;
discovering clusters with arbitrary shape;
minimal requirements for domain knowledge to determine input parameters;
ability to deal with noise and outliers;
insensitivity to the order of input records;
ability to handle high dimensionality;
interpretability and usability.
