
ASSIGNMENT 1

UNIT-1:
1. Explain with an example the different schemas for multidimensional
databases?

A: A multidimensional database (MDB) is a type of database that is optimized
for data warehouse and online analytical processing (OLAP) applications.
Multidimensional databases are frequently created using input from
existing relational databases. Whereas a relational database is typically
accessed using a Structured Query Language (SQL) query, a
multidimensional database allows a user to ask questions like "How many
Aptivas have been sold in Nebraska so far this year?" and similar questions
related to summarizing business operations and trends. An OLAP application
that accesses data from a multidimensional database is known as
a MOLAP (multidimensional OLAP) application.

A multidimensional database - or a multidimensional database management
system (MDDBMS) - implies the ability to rapidly process the data in the
database so that answers can be generated quickly. A number of vendors
provide products that use multidimensional databases. Approaches to how
data is stored and the user interface vary.

Types of schemas:
Four types of schemas are available in data warehousing, of which the star schema
is the most widely used in data warehouse designs. The second most widely used data
warehouse schema is the snowflake schema. We will see these schemas in detail below.

Star Schema:

A star schema is one in which a central fact table is surrounded by
denormalized dimension tables. A star schema can be simple or complex:
a simple star schema consists of one fact table, whereas a complex star
schema has more than one fact table.
Snowflake Schema:

A snowflake schema is an enhancement of the star schema obtained by normalizing
dimension data into additional dimension tables. Snowflake schemas are useful when
there are low-cardinality attributes in the dimensions.

Galaxy Schema:

Galaxy schema contains many fact tables with some common dimensions
(conformed dimensions). This schema is a combination of many data
marts.
Fact Constellation Schema:

The dimensions in this schema are segregated into independent
dimensions based on the levels of hierarchy. For example, if geography
has five levels of hierarchy, like territory, region, country, state and city,
the constellation schema would have five dimensions instead of one.
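To make the star schema concrete, here is a minimal sketch in Python (pandas), using invented product and location data; the table and column names are hypothetical, not taken from a particular warehouse.

```python
import pandas as pd

# Hypothetical dimension tables (denormalized, one table per dimension)
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Aptiva", "ThinkPad"],
    "category": ["Desktop", "Laptop"],
})
dim_location = pd.DataFrame({
    "location_id": [10, 20],
    "state": ["Nebraska", "Iowa"],
})

# Central fact table: every fact row points to one tuple in each dimension
fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2],
    "location_id": [10, 20, 10],
    "units_sold": [5, 3, 7],
})

# "How many Aptivas have been sold in Nebraska?" =
# join the fact table to its dimensions, filter, and aggregate
answer = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_location, on="location_id")
          .query("product_name == 'Aptiva' and state == 'Nebraska'")
          ["units_sold"].sum())
print(answer)  # 5
```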

2. Explain about concept description. What are the differences between concept
description in large databases and OLAP?

A: Concept description covers the following tasks:

. Data generalization and summarization-based characterization

. Analytical characterization: analysis of attribute relevance

. Mining class comparisons: discriminating between different classes

. Mining descriptive statistical measures in large databases

Descriptive vs. predictive data mining

Descriptive mining: describes concepts or task-relevant data sets in concise,
summarative, informative, discriminative forms

Predictive mining: based on data and analysis, constructs models for the
database, and predicts the trend and properties of unknown data

Concept description:
Characterization: provides a concise and succinct summarization of the given
collection of data

Comparison: provides descriptions comparing two or more collections of
data.

The differences between concept description in large databases and OLAP:

. Data types and measures: concept description can handle complex data types of attributes and their aggregations, whereas OLAP typically works with simple non-numeric dimensions and numeric measures.

. Degree of automation: OLAP is a user-controlled, interactive process in which the user chooses each drill-down, roll-up or slice step, whereas concept description (for example, attribute-oriented induction) is a more automated process that decides which attributes should be generalized and to what level.

3. Differentiate operational database systems and data warehousing?

A: Difference between data warehouse and an operational database

When we go through the data warehouse and the operational database, we can identify a
number of differences. We can define those differences as follows.

Operational database: Stored with a functional or process orientation
Data warehouse: Stored with a subject orientation

In an operational database, data are stored with a functional or process orientation. In the
data warehouse, data are stored with a subject orientation, which facilitates multiple views
of the data and supports decision making.

Operational database: Different representations or meanings for similar data
Data warehouse: Unified view of all data elements with a common definition and representation

The data warehouse provides a unified view of all data elements with a common definition
and representation. In an operational database, similar data are allowed different
representations.

Operational database: Represents current transactions
Data warehouse: Historic in nature

Data are historic in nature in the data warehouse; a time dimension is added to facilitate
data analysis and comparisons over time. In the operational database, by contrast, data
represent current transactions. So there is a big difference in this perspective.

Operational database: Updates and deletes are common
Data warehouse: Data are added only periodically from operational systems

In the data warehouse, data are rarely changed; data are only added periodically from
the operational systems. In the operational database, by contrast, updates and deletes
are really common.

4. Describe the three-tier data warehousing architecture?

A: Three-Tier Data Warehouse Architecture:

Generally, a data warehouse adopts a three-tier architecture. Following are
the three tiers of the data warehouse architecture.

 Bottom Tier − The bottom tier of the architecture is the data warehouse
database server. It is the relational database system. We use the back end tools
and utilities to feed data into the bottom tier. These back end tools and utilities
perform the Extract, Clean, Load, and refresh functions.

 Middle Tier − In the middle tier, we have the OLAP Server that can be
implemented in either of the following ways.

o By Relational OLAP (ROLAP), which is an extended relational database
management system. ROLAP maps the operations on
multidimensional data to standard relational operations.

o By the Multidimensional OLAP (MOLAP) model, which directly implements the
multidimensional data and operations.

 Top-Tier − This tier is the front-end client layer. This layer holds the query
tools and reporting tools, analysis tools and data mining tools.

[Diagram: the three-tier architecture of a data warehouse.]

5. Describe the complex aggregation at multiple granularity?

A: In computer science, multiple granularity locking (MGL) is a locking method used
in database management systems (DBMS) and relational databases.
In MGL, locks are set on objects that contain other objects. MGL exploits the
hierarchical nature of the contains relationship. For example, a database may have files,
which contain pages, which further contain records. This can be thought of as a tree of
objects, where each node contains its children. A lock on a node, such as a shared or
exclusive lock, locks the targeted node as well as all of its descendants.
Multiple granularity locking is usually used with non-strict two-phase locking to
guarantee serializability.

Lock Modes
In addition to shared (S) locks and exclusive (X) locks from other locking schemes, like
strict two-phase locking, MGL also uses intention shared and intention exclusive locks.
IS locks conflict with X locks, while IX locks conflict with S and X locks. The null lock
(NL) is compatible with everything.
To lock a node in S (or X), MGL has the transaction lock on all of its ancestors with IS
(or IX), so if a transaction locks a node in S (or X), no other transaction can access its
ancestors in X (or S and X). This protocol is shown in the following table:

To Get Must Have on all Ancestors

IS or S IS or IX

IX, SIX or X IX or SIX

Determining what level of granularity to use for locking is done by locking the finest level
possible (at the lowest leaf level), and then escalating these locks to higher levels in the
file hierarchy to cover more records or file elements as needed. This process is known
as Lock Escalation. MGL locking modes are compatible with each other as defined in
the following matrix.

Mode NL IS IX S SIX X

NL Yes Yes Yes Yes Yes Yes

IS Yes Yes Yes Yes Yes No

IX Yes Yes Yes No No No

S Yes Yes No Yes No No

SIX Yes Yes No No No No

X Yes No No No No No

Following the locking protocol and the compatibility matrix, if one transaction holds a
node in S mode, no other transactions can have locked any ancestor in X mode.
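As an illustration of the compatibility matrix above, here is a minimal Python sketch; the dictionary simply encodes the matrix, and the function name and calling convention are invented for the example.

```python
# Encodes the MGL lock-mode compatibility matrix shown above.
COMPATIBLE = {
    "NL":  {"NL", "IS", "IX", "S", "SIX", "X"},
    "IS":  {"NL", "IS", "IX", "S", "SIX"},
    "IX":  {"NL", "IS", "IX"},
    "S":   {"NL", "IS", "S"},
    "SIX": {"NL", "IS"},
    "X":   {"NL"},
}

def can_grant(requested, held_by_others):
    """A requested lock on a node can be granted only if it is compatible
    with every lock other transactions already hold on that node."""
    return all(held in COMPATIBLE[requested] for held in held_by_others)

print(can_grant("S", ["IS"]))        # True: S is compatible with IS
print(can_grant("IX", ["S"]))        # False: IX conflicts with S
print(can_grant("X", ["IS", "IX"]))  # False: X conflicts with everything but NL
```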
6. Discuss briefly about the data warehouse architecture?
A: There are mainly three types of Datawarehouse Architectures: -

Single-tier architecture

The objective of a single layer is to minimize the amount of data stored; this is achieved
by removing data redundancy. This architecture is not frequently used in practice.

Two-tier architecture

Two-layer architecture separates the physically available sources from the data warehouse.
This architecture is not expandable and does not support a large number of end
users. It also has connectivity problems because of network limitations.

Three-tier architecture

This is the most widely used architecture.

It consists of the Top, Middle and Bottom Tier.

1. Bottom Tier: The database of the data warehouse serves as the bottom tier.
It is usually a relational database system. Data is cleansed, transformed, and
loaded into this layer using back-end tools.

2. Middle Tier: The middle tier in Data warehouse is an OLAP server which is
implemented using either ROLAP or MOLAP model. For a user, this
application tier presents an abstracted view of the database. This layer also
acts as a mediator between the end-user and the database.

3. Top-Tier: The top tier is a front-end client layer. It consists of the tools and APIs
used to connect to and get data out of the data warehouse. These could be query
tools, reporting tools, managed query tools, analysis tools and data mining
tools.
7. Demonstrate the efficient processing of OLAP queries?

A: Efficient OLAP Query Processing in Distributed Data Warehouses:

The success of Internet applications has led to an explosive growth in the demand
for bandwidth from ISPs. Managing an IP network includes complex data analysis
(see, e.g., [1]) that can often be expressed as OLAP queries. For example, using
flow-level traffic statistics data one can answer questions like: On an hourly basis,
what fraction of the total number of flows is due to Web traffic? Current day
OLAP tools assume the availability of the detailed data in a centralized
warehouse. However, the inherently distributed nature of the data collection
(e.g., flow-level traffic statistics are gathered at network routers) and the huge
amount of data extracted at each collection point (of the order of several
gigabytes per day for large IP networks) makes such an approach highly
impractical. The natural solution to this problem is to maintain a distributed data
warehouse, consisting of multiple local data warehouses (sites) adjacent to the
collection points, together with a coordinator. In order for such a solution to
make sense, we need a technology for distributed processing of complex OLAP
queries. We have developed the Skalla system for this task. Skalla translates OLAP
queries specified using relational algebra augmented with the GMDJ operator [2]
into distributed query evaluation plans. Salient features of the Skalla approach
are: the ability to handle complex OLAP queries involving correlated aggregates,
pivots, etc., and the guarantee that only partial results are ever shipped, never
subsets of the detail data. Skalla generates distributed query evaluation plans (for the coordinator
architecture) as a sequence of rounds, where a round consists of: (i) each local
site performing some computation and communicating the result to the
coordinator, and (ii) the coordinator synchronizing the results and (possibly)
communicating with the sites. The semantics of the subqueries generated by
Skalla ensure that the amount of data that has to be shipped between sites is independent of the size of the
underlying data at the sites. This is particularly important in our distributed OLAP
setting. We note that such a bound does not exist for the distributed processing
of traditional SQL join queries [3]. The Skalla evaluation scheme allows for a wide
variety of optimizations that are easily expressed in the algebra and thus readily
integrated into the query optimizer. The optimization schemes included in our
prototype contribute both to the minimization of synchronization traffic and the
optimization of the processing at the local sites. Significant features of the Skalla
approach are the ability to perform both distribution-dependent and
distribution-independent optimizations that reduce the data transferred and the
number of evaluation rounds. We conducted an experimental study of the Skalla
evaluation scheme using TPC(R) data. We found the optimizations to be very
effective, often reducing the processing time by an order of magnitude. The
results demonstrate the scalability of the Skalla techniques and quantify the
performance benefits of the optimization techniques that have gone into the
Skalla system.

8. Compare the schemas for the multidimensional data models?

A: Data Warehousing Schemas

1. Star Schema
2. Snowflake Schema
3. Fact Constellation

Star Schema
 A single large central fact table and one table for each dimension
 Every fact points to one tuple in each of the dimensions and has additional
attributes
 Does not capture hierarchies directly.

Snowflake Schema

 Variant of the star schema model.
 A single, large, central fact table and one or more tables for each dimension.
 Dimension tables are normalized, splitting dimension table data into additional tables.

Fact Constellation:

 Multiple fact tables share dimension tables.
 This schema is viewed as a collection of stars, hence it is called a galaxy schema or fact
constellation.
 Sophisticated applications require such a schema.

Case Study:

 Afco Foods & Beverages is a new company which produces dairy, bread and
meat products, with a production unit located at Baroda.
 Their products are sold in the North, North West and Western regions of India.
 They have sales units at Mumbai, Pune, Ahmadabad, Delhi and Baroda.
 The President of the company wants sales information.

Sales Information

 Report: The number of units sold: 113
 Report: The number of units sold over time.

Building the Data Warehouse

 Data Selection
 Data Preprocessing
a. Fill missing values
b. Remove inconsistency
 Data Transformation & Integration
 Data Loading

Data in the warehouse is stored in the form of fact tables and dimension tables.
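To illustrate how such reports come out of a fact table, here is a minimal pandas sketch with invented figures (chosen so the total is 113, matching the report above); the column names are assumptions made for the example.

```python
import pandas as pd

# Invented sales fact table for the case study
fact_sales = pd.DataFrame({
    "month": ["2024-01", "2024-01", "2024-02", "2024-02"],
    "city":  ["Mumbai", "Pune", "Mumbai", "Delhi"],
    "item":  ["dairy", "bread", "dairy", "meat"],
    "units": [40, 25, 30, 18],
})

# Report 1: the number of units sold
print(fact_sales["units"].sum())                   # 113

# Report 2: the number of units sold over time (roll-up on the time dimension)
print(fact_sales.groupby("month")["units"].sum())  # one total per month
```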

9. Explain the Data warehouse applications?

A:Applications of Datawarehouse:

Data warehouses, owing to their potential, have deep-rooted applications in every
industry that uses historical data for prediction, statistical analysis, and decision
making. Listed below are the applications of data warehouses across numerous
industry backgrounds.

Banking Industry
In the banking industry, concentration is given to risk management and policy reversal,
as well as analyzing consumer data, market trends, government regulations and reports,
and, more importantly, financial decision making.
Most banks also use warehouses to manage the resources available on deck in an
effective manner. Certain banking sectors utilize them for market research, performance
analysis of each product, interchange and exchange rates, and to develop marketing
programs.
Warehouses also support analysis of cardholders' transactions, spending patterns and
merchant classification, all of which provide the bank with an opportunity to introduce
special offers and lucrative deals based on cardholder activity. Apart from all these,
there is also scope for co-branding.

Finance Industry
Applications here are similar to those seen in banking and mainly revolve around evaluating
trends in customer expenses, which aids in maximizing the profits earned by their
clients.

Consumer Goods Industry


They are used for prediction of consumer trends, inventory management, market and
advertising research. In-depth analysis of sales and production is also carried out. Apart
from these, information is exchanged with business partners and clientele.


Government and Education


The federal government utilizes the warehouses for research in compliance, whereas the
state government uses them for services related to human resources like recruitment, and
accounting like payroll management.
The government uses data warehouses to maintain and analyze tax records, health
policy records and their respective providers, and also their entire criminal
law database is connected to the state’s data warehouse. Criminal activity is predicted
from patterns and trends that result from the analysis of historical data associated with
past criminals.
Universities use warehouses for extraction of information used for the proposal of
research grants, understanding their student demographics, and human resource
management. The entire financial department of most universities depends on data
warehouses, inclusive of the Financial Aid department.

Healthcare
One of the most important sectors utilizing data warehouses is the healthcare
sector. All of their financial, clinical, and employee records are fed to warehouses, as this
helps them to strategize and predict outcomes, track and analyze their service feedback,
generate patient reports, share data with tie-in insurance companies, medical aid
services, etc.

Hospitality Industry
A major proportion of this industry is dominated by hotel and restaurant services, car
rental services, and holiday home services. They utilize warehouse services to design
and evaluate their advertising and promotion campaigns where they target customers
based on their feedback and travel patterns.

Insurance
As the saying goes in the insurance services sector, “Insurance can never be bought, it
can only be sold”; the warehouses are primarily used to analyze data patterns and
customer trends, apart from maintaining records of already existing participants. The
design of tailor-made customer offers and promotions is also possible through
warehouses.

Manufacturing and Distribution Industry


This industry is one of the most important sources of income for any state. A
manufacturing organization has to take several make-or-buy decisions which can
influence the future of the sector, which is why they utilize high-end OLAP tools as a
part of data warehouses to predict market changes, analyze current business trends,
detect warning conditions, view marketing developments, and ultimately take better
decisions.
They also use them for product shipment records and records of product portfolios, to
identify profitable product lines, and to analyze previous data and customer feedback to
evaluate the weaker product lines and eliminate them.
On the distribution side, the supply chain management of products operates through data
warehouses.

The Retailers
Retailers serve as middlemen between producers and consumers. It is important for
them to maintain records of both the parties to ensure their existence in the market.
They use warehouses to track items, their advertising promotions, and consumers'
buying trends. They also analyze sales to determine fast-selling and slow-selling product
lines and determine their shelf space through a process of elimination.

Services Sector
Data warehouses find themselves to be of use in the service sector for maintenance of
financial records, revenue patterns, customer profiling, resource management, and
human resources.

Telephone Industry
The telephone industry operates over both offline and online data, burdening it with
a lot of historical data which has to be consolidated and integrated.
Apart from those operations, analysis of fixed assets, analysis of customer’s calling
patterns for sales representatives to push advertising campaigns, and tracking of
customer queries, all require the facilities of a data warehouse.

Transportation Industry
In the transportation industry, data warehouses record customer data enabling traders
to experiment with target marketing where the marketing campaigns are designed by
keeping customer requirements in mind.
The internal environment of the industry uses them to analyze customer feedback,
performance, manage crews on board as well as analyze customer financial reports for
pricing strategies.

10. Discuss briefly about the multidimensional data models?


A: Multidimensional Data Model:
The multidimensional data model stores data in the form of a data cube. Mostly, data
warehousing supports two- or three-dimensional cubes.

A data cube allows data to be viewed in multiple dimensions. Dimensions are
entities with respect to which an organization wants to keep records. For example,
in a store sales record, dimensions allow the store to keep track of things like
monthly sales of items across branches and locations.

A multidimensional database helps to provide data-related answers to complex
business queries quickly and accurately.

Data warehouses and Online Analytical Processing (OLAP) tools are based on a
multidimensional data model. OLAP in data warehousing enables users to view
data from different angles and dimensions.

Schemas for the Multidimensional Data Model are:

Star Schema

Snowflake Schema

Fact Constellation Schema
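A small sketch of the data cube idea using pandas, with invented sales records; pivoting on two dimensions gives one two-dimensional view of the cube.

```python
import pandas as pd

# Invented sales records with three dimensions: item, location and month
sales = pd.DataFrame({
    "item":     ["milk", "milk", "bread", "bread"],
    "location": ["Mumbai", "Pune", "Mumbai", "Pune"],
    "month":    ["Jan", "Jan", "Feb", "Feb"],
    "units":    [100, 80, 60, 90],
})

# One two-dimensional view of the cube: units summed by item and location
view = sales.pivot_table(values="units", index="item",
                         columns="location", aggfunc="sum", fill_value=0)
print(view)
```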

UNIT 2:
1. Explain the various data pre-processing techniques. How does data
reduction help in data pre-processing?

A: Data preprocessing is a data mining technique that involves transforming
raw data into an understandable format. Real-world data is often incomplete,
inconsistent, and/or lacking in certain behaviors or trends, and is likely to
contain many errors. Data preprocessing is a proven method of resolving
such issues and prepares raw data for further processing.
Data preprocessing is used in database-driven applications such as customer
relationship management and in rule-based applications (like neural networks).

Data goes through a series of steps during preprocessing:

Data cleaning

1. Fill in missing values (attribute or class value):
o Ignore the tuple: usually done when the class label is missing.
o Use the attribute mean (or majority nominal value) to fill in the
missing value.
o Use the attribute mean (or majority nominal value) for all samples
belonging to the same class.
o Predict the missing value by using a learning algorithm: consider the
attribute with the missing value as a dependent (class) variable and
run a learning algorithm (usually Bayes or decision tree) to predict
the missing value.
2. Identify outliers and smooth out noisy data (a code sketch follows this list):
o Binning
 Sort the attribute values and partition them into bins (see
"Unsupervised discretization" below);
 Then smooth by bin means, bin median, or bin boundaries.
o Clustering: group values in clusters and then detect and remove
outliers (automatic or manual)
o Regression: smooth by fitting the data into regression functions.
3. Correct inconsistent data: use domain knowledge or expert decision.
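A minimal sketch of the binning step referenced above (equal-frequency bins, smoothing by bin means); the nine values are invented for illustration.

```python
# Smoothing noisy data by binning: sort the values, partition them into
# equal-frequency bins, then replace each value with its bin mean.
values = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # invented attribute values
num_bins = 3

values.sort()
bin_size = len(values) // num_bins
smoothed = []
for start in range(0, len(values), bin_size):
    bin_vals = values[start:start + bin_size]
    bin_mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([bin_mean] * len(bin_vals))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```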

Data transformation

1. Normalization:
o Scaling attribute values to fall within a specified range.
 Example: to transform V in [min, max] to V' in [0, 1], apply V' = (V - Min) / (Max - Min)
o Scaling by using mean and standard deviation (useful when min and
max are unknown or when there are outliers): V' = (V - Mean) / StDev
2. Aggregation: moving up in the concept hierarchy on numeric attributes.
3. Generalization: moving up in the concept hierarchy on nominal attributes.
4. Attribute construction: replacing or adding new attributes inferred by
existing attributes.

Data reduction

1. Reducing the number of attributes
o Data cube aggregation: applying roll-up, slice or dice operations.
o Removing irrelevant attributes: attribute selection (filtering and
wrapper methods), searching the attribute space (see Lecture 5:
Attribute-oriented analysis).
o Principal component analysis (numeric attributes only): searching for
a lower-dimensional space that can best represent the data (a code sketch follows this list).
2. Reducing the number of attribute values
o Binning (histograms): reducing the number of attribute values by grouping
them into intervals (bins).
o Clustering: grouping values in clusters.
o Aggregation or generalization
3. Reducing the number of tuples
o Sampling
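A minimal sketch of two of the reduction steps above, tuple sampling and principal component analysis, using NumPy on invented data; the choice of 100 samples and 2 components is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))     # invented data: 1000 tuples, 5 numeric attributes

# Reducing the number of tuples: simple random sampling without replacement
sample = data[rng.choice(len(data), size=100, replace=False)]

# Reducing the number of attributes: principal component analysis via SVD,
# projecting onto the 2 components that best represent the data
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:2].T

print(sample.shape, reduced.shape)    # (100, 5) (1000, 2)
```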

2. Define data cleaning? Express the different techniques for handling
missing values?

A: DATA CLEANING: "Data cleaning is the number one problem in data
warehousing" (DCI (Discovery Corps, Inc.) survey). Data quality is an essential
characteristic that determines the reliability of data for making decisions.
High-quality data is:

Complete: All relevant data, such as accounts, addresses and relationships for a given customer, is linked.
Accurate: Common data problems like misspellings, typos, and random abbreviations have been cleaned up.
Available: Required data are accessible on demand; users do not need to search manually for the information.
Timely: Up-to-date information is readily available to support decisions.

In general, data quality is defined as an aggregated value over a set of quality
criteria [Naumann, 2002; Heiko and Johann, 2006]. Starting with the quality criteria
defined in [Naumann, 2002], the author describes the set of criteria that are affected
by comprehensive data cleansing and defines how to assess scores for each one of
them for an existing data collection. To measure the quality of a data collection,
scores have to be assessed for each of the quality criteria. The assessment of scores
for quality criteria can be used to quantify the necessity of data cleansing for a data
collection as well as the success of a performed data cleansing process. Quality
criteria can also be used within the optimization of data cleansing by specifying
priorities for each of the criteria, which in turn influences the execution of the data
cleansing methods affecting the specific criteria.

Data cleaning routines work to "clean" the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving
inconsistencies. The actual process of data cleansing may involve removing
typographical errors or validating and correcting values against a known list of
entities. The validation may be strict. Data cleansing differs from data validation in
that validation almost invariably means data is rejected from the system at entry,
and is performed at entry time rather than on batches of data. Data cleansing may
also involve activities like harmonization of data and standardization of data. For
example, harmonization of short codes (St, Rd) to actual words (Street, Road);
standardization of data is a means of changing a reference data set to a new
standard, e.g., use of standard codes.

The major data cleaning tasks include:
 Identify outliers and smooth out noisy data
 Fill in missing values
 Correct inconsistent data
 Resolve redundancy caused by data integration

Among these tasks, missing values cause inconsistencies for data mining, so
handling missing values is a good solution to overcome these inconsistencies. In the
medical domain, missing data might occur because the value is not relevant to a
particular case, could not be recorded when the data was collected, is ignored by
users because of privacy concerns, or because it may be unfeasible for the patient to
undergo the clinical tests, equipment may malfunction, etc. Methods for resolving
missing values are therefore needed in health care systems to enhance the quality
of diagnosis. The following sections describe the proposed data cleaning methods.

Handling Missing Values:

The missing value treatment method plays an important role in data
preprocessing. Missing data is a common problem in statistical analysis. The
tolerance level of missing data is classified as follows (missing value percentage - significance):

Up to 1% - Trivial
1-5% - Manageable
5-15% - Sophisticated methods needed to handle
More than 15% - Severe impact on interpretation

Several methods have been proposed in the literature to treat missing data. Those
methods are divided into three categories, as proposed by Dempster et al.

The different patterns of missing values are discussed in the next section.

Pattern of missing values:

Missing values in a database fall into three categories, viz., Missing
Completely at Random (MCAR), Missing at Random (MAR) and Non-Ignorable (NI).

Missing Completely at Random (MCAR): This is the highest level of randomness. It
occurs when the probability of an instance (case) having a missing value for an
attribute does not depend on either the known values or the missing data; the
missing values are randomly distributed across all observations. This is not a
realistic assumption for much real-world data.

Missing at Random (MAR): Missingness does not depend on the true value of the
missing variable, but it might depend on the values of other variables that are observed.
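To make the missing-value handling concrete, here is a small pandas sketch comparing overall-mean imputation with class-conditional mean imputation; the tiny patient-style table and column names are invented for the example.

```python
import pandas as pd

# Invented records with a missing blood-pressure value
df = pd.DataFrame({
    "age":       [45, 52, 60, 38],
    "bp":        [120.0, None, 140.0, 110.0],
    "diagnosis": ["A", "A", "B", "B"],
})

# Strategy 1: fill with the overall attribute mean
df["bp_mean_fill"] = df["bp"].fillna(df["bp"].mean())

# Strategy 2: fill with the mean of samples belonging to the same class
df["bp_class_fill"] = df.groupby("diagnosis")["bp"].transform(
    lambda s: s.fillna(s.mean()))

print(df)
```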

3.Explain Data Integration and Transformation?

A:Data Integration:

Data integration is one of the steps of data pre-processing that involves combining
data residing in different sources and providing users with a unified view of these data.
• It merges the data from multiple data stores (data sources)
• It includes multiple databases, data cubes or flat files.
• Metadata, Correlation analysis, data conflict detection, and resolution of semantic
heterogeneity contribute towards smooth data integration.
• There are mainly 2 major approaches for data integration - commonly known as "tight
coupling approach" and "loose coupling approach".
Tight Coupling
o Here data is pulled over from different sources into a single physical location through
the process of ETL - Extraction, Transformation and Loading.
o The single physical location provides a uniform interface for querying the data.
o The ETL layer helps to map the data from the sources so as to provide a uniform data
warehouse. This approach is called tight coupling since in this approach the data is
tightly coupled with the physical repository at the time of query.
ADVANTAGES:
1. Independence (Lesser dependency to source systems since data is physically
copied over)
2. Faster query processing
3. Complex query processing
4. Advanced data summarization and storage possible
5. High Volume data processing
DISADVANTAGES:
1. Latency (since data needs to be loaded using ETL)
2. Costlier (data localization, infrastructure, security)

Loose Coupling
o Here a virtual mediated schema provides an interface that takes the query from the
user, transforms it in a way the source database can understand and then sends the
query directly to the source databases to obtain the result.
o In this approach, the data only remains in the actual source databases.
o However, mediated schema contains several "adapters" or "wrappers" that can
connect back to the source systems in order to bring the data to the front end.
ADVANTAGES:
1. Data freshness (low latency - almost real time)
2. Higher agility (when a new source system comes in or an existing source system
changes, only the corresponding adapter is created or changed, largely not affecting
the other parts of the system)
3. Less costly (a lot of infrastructure cost can be saved since data localization is not
required)
DISADVANTAGES:
1. Semantic conflicts
2. Slower query response
3. High order dependency to the data sources
For example, let's imagine that an electronics company is preparing to roll out a new
mobile device. The marketing department might want to retrieve customer information
from a sales department database and compare it to information from the product
department to create a targeted sales list. A good data integration system would let the
marketing department view information from both sources in a unified way, leaving out
any information that didn't apply to the search.
II. DATA TRANSFORMATION:
In data mining pre-processing, and especially in metadata and data warehousing, we use
data transformation in order to convert data from a source data format into the destination
data format.
We can divide data transformation into 2 steps:
• Data Mapping:
It maps the data elements from the source to the destination and captures any
transformation that must occur.
• Code Generation:
It creates the actual transformation program.
Data transformation:
• Here the data are transformed or consolidated into forms appropriate for mining.
• Data transformation can involve the following:
Smoothing:
• It works to remove noise from the data.
• It is a form of data cleaning where users specify transformations to correct data
inconsistencies.
• Such techniques include binning, regression, and clustering.
Aggregation:
• Here summary or aggregation operations are applied to the data.
• This step is typically used in constructing a data cube for analysis of the data at
multiple granularities.
• Aggregation is a form of data reduction.
Generalization :
• Here low-level or “primitive” (raw) data are replaced by higher-level concepts through
the use of concept hierarchies.
• For example, attributes, like age, may be mapped to higher-level concepts, like youth,
middle-aged, and senior.
• Generalization is a form of data reduction.
Normalization:
• Here the attribute data are scaled so as to fall within a small specified range, such as
-1.0 to 1.0, or 0.0 to 1.0.
• Normalization is particularly useful for classification algorithms involving neural
networks, or distance measurements such as nearest-neighbor classification and
clustering.
• For distance-based methods, normalization helps prevent attributes with initially large
ranges (e.g., income) from outweighing attributes with initially smaller ranges.
• There are three methods for data normalization (a code sketch follows this section):

1. Min-max normalization:

o Performs a linear transformation on the original data.
o Suppose that minA and maxA are the minimum and maximum values of an attribute, A.
o Min-max normalization maps a value, v, of A to v' in the range [new_minA, new_maxA]
by computing v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA.
o Min-max normalization preserves the relationships among the original data values.

2. Z-score normalization:

o Here the values for an attribute, A, are normalized based on the mean and standard
deviation of A.
o A value, v, of A is normalized to v' by computing v' = (v - meanA) / σA, where meanA
and σA are the mean and standard deviation of A, respectively.
o This method of normalization is useful when the actual minimum and maximum of
attribute A are unknown, or when there are outliers that dominate the min-max
normalization.

3. Normalization by decimal scaling:

o Here the normalization is done by moving the decimal point of values of attribute A.
o The number of decimal points moved depends on the maximum absolute value of A.
o A value, v, of A is normalized to v' by computing v' = v / 10^j, where j is the
smallest integer such that Max(|v'|) < 1.
• Attribute construction:
o Here new attributes are constructed and added from the given set of attributes to help
the mining process.
o Attribute construction helps to improve the accuracy and understanding of structure in
high-dimensional data.
o By combining attributes, attribute construction can discover missing information about
the relationships between data attributes that can be useful for knowledge discovery.
E.g., the structure of stored data may vary between applications, requiring semantic
mapping prior to the transformation process. For instance, two applications might store
the same customer credit card information using slightly different structures.

To ensure that critical data isn’t lost when the two applications are integrated,
information from Application A needs to be reorganized to fit the data structure of
Application B.
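A minimal sketch of the three normalization methods described above, applied to an invented list of values; for min-max scaling the new range is assumed to be [0, 1].

```python
import math
import statistics

values = [200, 300, 400, 600, 1000]          # invented attribute values

# 1. Min-max normalization to the assumed new range [0, 1]
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# 2. Z-score normalization
mean, stdev = statistics.mean(values), statistics.pstdev(values)
z_score = [(v - mean) / stdev for v in values]

# 3. Decimal scaling: divide by 10^j, with j the smallest integer
#    such that every scaled absolute value is below 1
j = math.ceil(math.log10(max(abs(v) for v in values) + 1))
decimal_scaled = [v / 10 ** j for v in values]

print(min_max)
print(z_score)
print(decimal_scaled)
```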

4. Discuss briefly about the data smoothing techniques?

A:Data Smoothing:

When data collected over time displays random variation, smoothing techniques can be
used to reduce or cancel the effect of these variations. When properly applied, these
techniques smooth out the random variation in the time series data to reveal underlying
trends.

XLMiner features four different smoothing techniques: Exponential, Moving Average, Double
Exponential, and Holt-Winters. Exponential and Moving Average are relatively simple
smoothing techniques and should not be performed on data sets involving seasonality.
Double Exponential and Holt-Winters are more advanced techniques that can be used on
data sets involving seasonality.

Exponential Smoothing

Exponential Smoothing is one of the more popular smoothing techniques due to its flexibility,
ease in calculation, and good performance. Exponential Smoothing uses a simple average
calculation to assign exponentially decreasing weights starting with the most recent
observations. New observations are given relatively more weight in the average calculation
than older observations. The Exponential Smoothing tool uses the following formulas.

S0= x0

St = αxt-1 + (1-α)St-1, t > 0

where

 original observations are denoted by {xt}, starting at t = 0
 α is the smoothing factor, which lies between 0 and 1

Exponential Smoothing should only be used when the data set contains no seasonality. The
forecast is a constant value that is the smoothed value of the last observation.
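A minimal sketch of the exponential smoothing recurrence given above; the series and the value α = 0.3 are invented for illustration, and this is not the XLMiner implementation.

```python
def exponential_smoothing(x, alpha):
    """S0 = x0; St = alpha * x[t-1] + (1 - alpha) * S[t-1] for t > 0."""
    s = [x[0]]
    for t in range(1, len(x)):
        s.append(alpha * x[t - 1] + (1 - alpha) * s[t - 1])
    return s

series = [3.0, 10.0, 12.0, 13.0, 12.0, 10.0, 12.0]   # invented observations
print(exponential_smoothing(series, alpha=0.3))
```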

Moving Average Smoothing

In Moving Average Smoothing, each observation is assigned an equal weight, and each
observation is forecasted by using the average of the previous observation(s). Using the time
series X1, X2, X3, ....., Xt, this smoothing technique predicts Xt+k as follows :

St = Average (xt-k+1, xt-k+2, ....., xt), t= k, k+1, k+2, ...N

where, k is the smoothing parameter.

XLMiner allows a parameter value between 2 and t-1 where t is the number of observations in
the data set. Note that when choosing this parameter, a large parameter value will
oversmooth the data, while a small parameter value will undersmooth the data. The past three
observations will predict the future observations. As with Exponential Smoothing, this
technique should not be applied when seasonality is present in the data set.
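A minimal sketch of moving average smoothing with smoothing parameter k (the window size); the series is the same invented one as above.

```python
def moving_average(x, k):
    """St = average of the k most recent observations, defined from t = k - 1 onward."""
    return [sum(x[t - k + 1:t + 1]) / k for t in range(k - 1, len(x))]

series = [3.0, 10.0, 12.0, 13.0, 12.0, 10.0, 12.0]   # invented observations
print(moving_average(series, k=3))   # five smoothed values: 25/3, 35/3, 37/3, ...
```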

Double Exponential Smoothing


Double Exponential Smoothing can be defined as the recursive application of an exponential
filter twice in a time series. Double Exponential Smoothing should not be used when the data
includes seasonality. This technique introduces a second equation that includes a trend
parameter; thus, this technique should be used when a trend is inherent in the data set, but
not used when seasonality is present. Double Exponential Smoothing is defined by the
following formulas.

St = At + Bt,   t = 1, 2, 3, ..., N

where At = a xt + (1 - a) St-1,   0 < a <= 1

Bt = b (At - At-1) + (1 - b) Bt-1,   0 < b <= 1

The forecast equation is: Xt+k = At + k Bt,   k = 1, 2, 3, ...

where a denotes the alpha parameter and b denotes the trend parameter. These two
parameters can be entered manually.

XLMiner includes an optimize feature that will choose the best values for alpha and trend
parameters based on the Forecasting Mean Squared Error. If the trend parameter is 0, then
this technique is equivalent to the Exponential Smoothing technique. (However, results may
not be identical due to different initialization methods for these two techniques.)

Holt-Winters Smoothing

Holt Winters Smoothing introduces a third parameter (g) to account for seasonality (or
periodicity) in a data set. The resulting set of equations is called the Holt-Winters method, after
the names of the inventors. The Holt-Winters method can be used on data sets involving trend
and seasonality (a, b , g). Values for all three parameters can range between 0 and 1.

The following are the three models associated with this method.

Multiplicative: Xt = (At + Bt) * St + et, where At and Bt are previously calculated initial estimates
and St is the average seasonal factor for the t-th season.

At = a xt / St-p + (1 - a)(At-1 + Bt-1)

Bt = b (At - At-1) + (1 - b) Bt-1

St = g xt / At + (1 - g) St-p

Additive: Xt = (At + Bt) + St + et

No Trend: b = 0, so Xt = At * St + et

Holt-Winters smoothing is similar to Exponential Smoothing if b and g = 0, and is similar to
Double Exponential Smoothing if g = 0.

5. Explain different data mining tasks for knowledge discovery?

A: What is the KDD Process?

The term Knowledge Discovery in Databases, or KDD for short, refers to the broad
process of finding knowledge in data, and emphasizes the "high-level" application of
particular data mining methods. It is of interest to researchers in machine learning,
pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition
for expert systems, and data visualization.

The unifying goal of the KDD process is to extract knowledge from data in the
context of large databases.

It does this by using data mining methods (algorithms) to extract (identify) what is
deemed knowledge, according to the specifications of measures and thresholds, using
a database along with any required preprocessing, subsampling, and transformations
of that database.

An Outline of the Steps of the KDD Process


The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:

1. Developing an understanding of
o the application domain
o the relevant prior knowledge
o the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of
variables, or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
o Removal of noise or outliers.
o Collecting necessary information to model or account for noise.
o Strategies for handling missing data fields.
o Accounting for time sequence information and known changes.
4. Data reduction and projection.
o Finding useful features to represent the data depending on the goal of the
task.
o Using dimensionality reduction or transformation methods to reduce the
effective number of variables under consideration or to find invariant
representations for the data.
5. Choosing the data mining task.
o Deciding whether the goal of the KDD process is classification,
regression, clustering, etc.
6. Choosing the data mining algorithm(s).
o Selecting method(s) to be used for searching for patterns in the data.
o Deciding which models and parameters may be appropriate.
o Matching a particular data mining method with the overall criteria of the
KDD process.
7. Data mining.
o Searching for patterns of interest in a particular representational form or
a set of such representations as classification rules or trees, regression,
clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.

The terms knowledge discovery and data mining are distinct.

KDD refers to the overall process of discovering useful knowledge from data. It
involves the evaluation and possibly interpretation of the patterns to make the decision
of what qualifies as knowledge. It also includes the choice of encoding schemes,
preprocessing, sampling, and projections of the data prior to the data mining step.
Data mining refers to the application of algorithms for extracting patterns from data
without the additional steps of the KDD process.
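To tie the steps together, here is a minimal end-to-end sketch under stated assumptions: an invented numeric data set, mean imputation and min-max scaling for preprocessing, and k-means clustering (via scikit-learn, assumed to be available) as the data mining step.

```python
import numpy as np
from sklearn.cluster import KMeans   # scikit-learn is assumed to be installed

# Step 2 - target data set: invented numeric records (age, income) with a gap
data = np.array([[25.0, 40000.0],
                 [30.0, np.nan],
                 [45.0, 80000.0],
                 [50.0, 82000.0]])

# Step 3 - cleaning and preprocessing: fill the missing value with the column mean
col_means = np.nanmean(data, axis=0)
data = np.where(np.isnan(data), col_means, data)

# Step 4 - reduction and projection: min-max scale each attribute to [0, 1]
mins, maxs = data.min(axis=0), data.max(axis=0)
scaled = (data - mins) / (maxs - mins)

# Steps 5-7 - task and algorithm choice, then mining: k-means clustering, k = 2
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

# Step 8 - interpret the mined pattern: cluster membership of each record
print(labels)
```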
