UNIT-1:
1. Explain with an example the different schemas for multidimensional
databases?
Types of schemas:
There are three main types of schemas used in data warehouse design: star, snowflake, and
fact constellation. Of these, the star schema is the one mostly used in data warehouse
designs; the second most used is the snowflake schema. We will see these schemas in detail.
Star Schema:
A single large central fact table connected to one table for each dimension.
Snowflake Schema:
A variant of the star schema in which the dimension tables are normalized into multiple
related tables, making dimension hierarchies explicit.
Galaxy Schema (Fact Constellation Schema):
Galaxy schema contains many fact tables with some common dimensions
(conformed dimensions). This schema is a combination of many data
marts.
2. Explain the concept description. What are the differences between a data warehouse
and an operational database?
Concept description:
Characterization: provides a concise and succinct summarization of the given
collection of data.
Comparison (discrimination): provides descriptions comparing two or more collections of data.
When we go through the data warehouse and the operational database we can identify a
number of differences. We can define those differences as follows.
In an operational database, data are stored with a functional or process orientation. But
when we move to the data warehouse we can see that data are stored with a subject
orientation that facilitates multiple views of the data and supports decision making.
A data warehouse provides a unified view of all data elements with a common definition
and representation. In an operational database, similar data are allowed different
representations.
Data are historic in nature in a data warehouse. Here a time dimension is added to facilitate
data analysis and comparisons over time. But if we turn our eye to the operational database
we can recognize that in the operational database model, data represent current
transactions. So there is a big difference in this perspective.
Bottom Tier − The bottom tier of the architecture is the data warehouse
database server. It is the relational database system. We use the back end tools
and utilities to feed data into the bottom tier. These back end tools and utilities
perform the Extract, Clean, Load, and refresh functions.
Middle Tier − In the middle tier, we have the OLAP Server that can be
implemented in either of the following ways.
Top-Tier − This tier is the front-end client layer. This layer holds the query
tools and reporting tools, analysis tools and data mining tools.
Lock Modes
In addition to shared (S) locks and exclusive (X) locks from other locking schemes, like
strict two-phase locking, MGL also uses intention shared and intention exclusive locks.
IS locks conflict with X locks, while IX locks conflict with S and X locks. The null lock
(NL) is compatible with everything.
To lock a node in S (or X) mode, MGL requires the transaction to first lock all of its
ancestors in IS (or IX) mode. Consequently, if a transaction holds a node in S (or X), no
other transaction can lock any of its ancestors in X (or in S and X). This protocol is
summarized in the following table:
To lock a node in:    Hold on all ancestors:
IS or S               IS or IX
IX, SIX or X          IX or SIX
Determining what level of granularity to use for locking is done by locking the finest level
possible (at the lowest leaf level), and then escalating these locks to higher levels in the
file hierarchy to cover more records or file elements as needed. This process is known
as Lock Escalation. MGL locking modes are compatible with each other as defined in
the following matrix.
Mode  NL   IS   IX   S    SIX  X
NL    Yes  Yes  Yes  Yes  Yes  Yes
IS    Yes  Yes  Yes  Yes  Yes  No
IX    Yes  Yes  Yes  No   No   No
S     Yes  Yes  No   Yes  No   No
SIX   Yes  Yes  No   No   No   No
X     Yes  No   No   No   No   No
Following the locking protocol and the compatibility matrix, if one transaction holds a
node in S mode, no other transactions can have locked any ancestor in X mode.
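As an informal sketch (not tied to any particular DBMS, and with hypothetical function and variable names), the compatibility matrix above can be encoded as a lookup table that is consulted before a lock is granted:

```python
# Sketch: encoding the MGL lock compatibility matrix as a lookup table.
# For each held mode, the set of request modes that are compatible with it.
COMPATIBLE = {
    "NL":  {"NL", "IS", "IX", "S", "SIX", "X"},
    "IS":  {"NL", "IS", "IX", "S", "SIX"},
    "IX":  {"NL", "IS", "IX"},
    "S":   {"NL", "IS", "S"},
    "SIX": {"NL", "IS"},
    "X":   {"NL"},
}

def can_grant(requested, held_modes):
    """A requested lock can be granted only if it is compatible with
    every lock currently held on the node by other transactions."""
    return all(requested in COMPATIBLE[held] for held in held_modes)

# Example: another transaction holds IS on the node, so S can be granted
# but X cannot.
print(can_grant("S", ["IS"]))   # True
print(can_grant("X", ["IS"]))   # False
```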
6. Discuss briefly the data warehouse architecture.
A: There are mainly three types of data warehouse architectures:
Single-tier architecture
The objective of a single-tier architecture is to minimize the amount of data stored; this
is achieved by removing data redundancy. This architecture is not frequently used in practice.
Two-tier architecture
A two-tier architecture physically separates the data sources from the warehouse itself;
it is not easily expandable and supports only a limited number of end users.
Three-tier architecture
1. Bottom Tier: The database of the data warehouse serves as the bottom tier.
It is usually a relational database system. Data is cleansed, transformed, and
loaded into this layer using back-end tools.
2. Middle Tier: The middle tier in Data warehouse is an OLAP server which is
implemented using either ROLAP or MOLAP model. For a user, this
application tier presents an abstracted view of the database. This layer also
acts as a mediator between the end-user and the database.
3. Top-Tier: The top tier is a front-end client layer. It holds the tools and APIs
that you use to connect to and get data out of the data warehouse. These include
query tools, reporting tools, managed query tools, analysis tools and data mining
tools.
7. Demonstrate the efficient processing of OLAP queries?
The success of Internet applications has led to an explosive growth in the demand
for bandwidth from ISPs. Managing an IP network includes complex data analysis
(see, e.g., [1]) that can often be expressed as OLAP queries. For example, using
flow-level traffic statistics data one can answer questions like: On an hourly basis,
what fraction of the total number of flows is due to Web traffic? Current day
OLAP tools assume the availability of the detailed data in a centralized
warehouse. However, the inherently distributed nature of the data collection
(e.g., flow-level traffic statistics are gathered at network routers) and the huge
amount of data extracted at each collection point (of the order of several
gigabytes per day for large IP networks) makes such an approach highly
impractical. The natural solution to this problem is to maintain a distributed data
warehouse, consisting of multiple local data warehouses (sites) adjacent to the
collection points, together with a coordinator. In order for such a solution to
make sense, we need a technology for distributed processing of complex OLAP
queries. We have developed the Skalla system for this task. Skalla translates OLAP
queries specified using relational algebra augmented with the GMDJ operator [2]
into distributed query evaluation plans. Salient features of the Skalla approach
are: the ability to handle complex OLAP queries involving correlated aggregates,
pivots, etc., and the guarantee that only partial results are ever shipped, never subsets of the detail
data. Skalla generates distributed query evaluation plans (for the coordinator
architecture) as a sequence of rounds, where a round consists of: (i) each local
site performing some computation and communicating the result to the
coordinator, and (ii) the coordinator synchronizing the results and (possibly)
communicating with the sites. The semantics of the subqueries generated by
Skalla ensure that the amount
of data that has to be shipped between sites is independent of the size of the
underlying data at the sites. This is particularly important in our distributed OLAP
setting. We note that such a bound does not exist for the distributed processing
of traditional SQL join queries [3]. The Skalla evaluation scheme allows for a wide
variety of optimizations that are easily expressed in the algebra and thus readily
integrated into the query optimizer. The optimization schemes included in our
prototype contribute both to the minimization of synchronization traffic and the
optimization of the processing at the local sites. Significant features of the Skalla
approach are the ability to perform both distribution-dependent and
distribution-independent optimizations that reduce the data transferred and the
number of evaluation rounds. We conducted an experimental study of the Skalla
evaluation scheme using TPC(R) data. We found the optimizations to be very
effective, often reducing the processing time by an order of magnitude. The
results demonstrate the scalability of the Skalla techniques and quantify the
performance benefits of the optimization techniques that have gone into the
Skalla system.
1. Star Schema
2. Snowflake Schema
3. Fact Constellation
Star Schema
A single large central fact table and one table for each dimension
Every fact points to one tuple in each of the dimensions and has additional
attributes
Does not capture hierarchies directly.
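As an illustration only, a minimal star schema can be sketched with Python's built-in sqlite3 module; the fact table, dimension tables, and column names below are hypothetical, not taken from the text:

```python
import sqlite3

# Minimal, hypothetical star schema: one central fact table (sales_fact)
# pointing to one denormalized table per dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, day INT, month INT, year INT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT, region TEXT);

CREATE TABLE sales_fact (
    date_id     INTEGER REFERENCES dim_date(date_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    store_id    INTEGER REFERENCES dim_store(store_id),
    units_sold  INTEGER,   -- additional (measure) attributes
    revenue     REAL
);
""")

# A typical OLAP-style query over the star schema:
# total revenue per product category and region.
query = """
SELECT p.category, s.region, SUM(f.revenue)
FROM sales_fact f
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_store   s ON f.store_id   = s.store_id
GROUP BY p.category, s.region;
"""
print(conn.execute(query).fetchall())
```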
Snowflake Schema
A refinement of the star schema in which some dimension tables are normalized into a
hierarchy of smaller tables, so dimension hierarchies are captured explicitly.
Fact Constellation:
Multiple fact tables share dimension tables; it can be viewed as a collection of star
schemas (also called a galaxy schema).
Case Study:
Afco Foods & Beverages is a new company which produces dairy, bread and
meat products with production unit located at Baroda.
Their products are sold in the North, North West and Western regions of India.
They have sales units at Mumbai, Pune, Ahmadabad, Delhi and Baroda.
The President of the company wants sales information.
Sales Information
Data Preprocessing
a. Fill missing values
b. Remove inconsistency
Data Transformation & Integration
Data Loading
Data in warehouse is stored in form of fact tables and dimension tables.
A: Applications of Data Warehouse:
Banking Industry
In the banking industry, concentration is given to risk management and policy reversal,
as well as analyzing consumer data, market trends, government regulations and reports,
and, more importantly, financial decision making.
Most banks also use warehouses to manage the resources available on deck in an
effective manner. Certain banking sectors utilize them for market research, performance
analysis of each product, interchange and exchange rates, and to develop marketing
programs.
Analysis of cardholders’ transactions, spending patterns and merchant classification
provides the bank with an opportunity to introduce special offers and lucrative
deals based on cardholder activity. Apart from all these, there is also scope for co-
branding.
Finance Industry
Applications in the finance industry are similar to those seen in banking; they mainly
revolve around evaluating trends in customer expenses, which aids in maximizing the
profits earned by their clients.
Healthcare
One of the most important sectors that utilize data warehouses is the healthcare
sector. All of their financial, clinical, and employee records are fed to warehouses as it
helps them to strategize and predict outcomes, track and analyze their service feedback,
generate patient reports, share data with tie-in insurance companies, medical aid
services, etc.
Hospitality Industry
A major proportion of this industry is dominated by hotel and restaurant services, car
rental services, and holiday home services. They utilize warehouse services to design
and evaluate their advertising and promotion campaigns where they target customers
based on their feedback and travel patterns.
Insurance
As the saying goes in the insurance services sector, “Insurance can never be bought, it
can only be sold”, warehouses are primarily used to analyze data patterns and
customer trends, apart from maintaining records of already existing participants. The
design of tailor-made customer offers and promotions is also possible through
warehouses.
The Retailers
Retailers serve as middlemen between producers and consumers. It is important for
them to maintain records of both the parties to ensure their existence in the market.
They use warehouses to track items, their advertising promotions, and the consumers
buying trends. They also analyze sales to determine fast selling and slow selling product
lines and determine their shelf space through a process of elimination.
Services Sector
Data warehouses find themselves to be of use in the service sector for maintenance of
financial records, revenue patterns, customer profiling, resource management, and
human resources.
Telephone Industry
The telephone industry operates on both offline and online data, burdening it with
a lot of historical data that has to be consolidated and integrated.
Apart from those operations, analysis of fixed assets, analysis of customer’s calling
patterns for sales representatives to push advertising campaigns, and tracking of
customer queries, all require the facilities of a data warehouse.
Transportation Industry
In the transportation industry, data warehouses record customer data enabling traders
to experiment with target marketing where the marketing campaigns are designed by
keeping customer requirements in mind.
The internal environment of the industry uses them to analyze customer feedback,
performance, manage crews on board as well as analyze customer financial reports for
pricing strategies.
Data warehouses and Online Analytical Processing (OLAP) tools are based on a
multidimensional data model. OLAP in data warehousing enables users to view
data from different angles and dimensions.
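As a small, hedged illustration of viewing the same data along different dimensions, the following sketch (using pandas and made-up data, both of which are assumptions, not part of the original text) pivots sales by region and quarter:

```python
import pandas as pd

# Made-up sales records with two dimensions (region, quarter) and one measure.
sales = pd.DataFrame({
    "region":  ["North", "North", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [100, 120, 80, 95],
})

# View the data "from a different angle": rows = region, columns = quarter.
cube = sales.pivot_table(index="region", columns="quarter",
                         values="amount", aggfunc="sum")
print(cube)
```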
Snowflake Schema
UNIT 2:
1. Explain the various data pre-processing techniques. How does
data reduction help in data pre-processing?
Data cleaning
Data transformation
1. Normalization:
o Scaling attribute values to fall within a specified range.
Example: to transform V in [min, max] to V' in
[0,1], apply V'=(V-Min)/(Max-Min)
o Scaling by using mean and standard deviation (useful when min and
max are unknown or when there are outliers):
V' = (V - Mean) / StDev (see the code sketch after this list)
2. Aggregation: moving up in the concept hierarchy on numeric attributes.
3. Generalization: moving up in the concept hierarchy on nominal attributes.
4. Attribute construction: replacing or adding new attributes inferred by
existing attributes.
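A minimal sketch of the two scaling formulas from item 1 above, using only the Python standard library; the sample values are made up:

```python
from statistics import mean, stdev

values = [10.0, 20.0, 30.0, 50.0]          # toy attribute values

# Min-max scaling to [0, 1]: V' = (V - Min) / (Max - Min)
lo, hi = min(values), max(values)
min_max_scaled = [(v - lo) / (hi - lo) for v in values]

# Z-score scaling: V' = (V - Mean) / StDev
mu, sigma = mean(values), stdev(values)
z_scaled = [(v - mu) / sigma for v in values]

print(min_max_scaled)   # [0.0, 0.25, 0.5, 1.0]
print(z_scaled)
```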
Data reduction
The missing value treatment method plays an important role in data
preprocessing. Missing data is a common problem in statistical analysis. The
tolerance level of missing data can be classified by the percentage of values missing:
Missing Value (Percentage) - Significance
Up to 1% - Trivial
1-5% - Manageable
5-15% - Sophisticated methods required to handle
More than 15% - Severe impact on interpretation
Several methods have been proposed in the literature to treat missing data. Those
methods are divided into three categories, as proposed by Dempster et al.
The different patterns of missing values are discussed in the next section.
Patterns of missing values:
Missing values in a database fall into three categories: Missing
Completely at Random (MCAR), Missing at Random (MAR) and Non-Ignorable (NI).
Missing Completely at Random (MCAR): This is the highest level of randomness. It
occurs when the probability of an instance (case) having a missing value for an
attribute does not depend on either the known values or the missing values; the
missing values are randomly distributed across all observations. This is not a realistic
assumption for much real-world data.
Missing at Random (MAR): Missingness does not depend on the true value of the
missing variable, but it might depend on the values of other variables that are observed.
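As a hedged illustration, the following sketch (assuming pandas and made-up column names) reports the percentage of missing values per attribute, which can be compared with the tolerance levels above, and then applies a simple mean imputation:

```python
import pandas as pd
import numpy as np

# Toy data with some missing values.
df = pd.DataFrame({
    "age":    [25, np.nan, 40, 35, np.nan],
    "income": [50000, 62000, np.nan, 58000, 61000],
})

# Percentage of missing values per attribute
# (compare with the tolerance levels listed above).
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Simple treatment: fill missing numeric values with the column mean.
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)
```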
A: Data Integration:
Data integration is one of the steps of data pre-processing that involves combining
data residing in different sources and providing users with a unified view of these data.
• It merges the data from multiple data stores (data sources)
• It includes multiple databases, data cubes or flat files.
• Metadata, Correlation analysis, data conflict detection, and resolution of semantic
heterogeneity contribute towards smooth data integration.
• There are mainly 2 major approaches for data integration - commonly known as "tight
coupling approach" and "loose coupling approach".
Tight Coupling
o Here data is pulled over from different sources into a single physical location through
the process of ETL - Extraction, Transformation and Loading.
o The single physical location provides a uniform interface for querying the data.
o ETL layer helps to map the data from the sources so as to provide a uniform data
warehouse. This approach is called tight coupling since in this approach the data is
tightly coupled with the physical repository at the time of query.
ADVANTAGES:
1. Independence (less dependency on source systems since data is physically
copied over)
2. Faster query processing
3. Complex query processing
4. Advanced data summarization and storage possible
5. High Volume data processing
DISADVANTAGES: 1. Latency (since data needs to be loaded using ETL)
Loose Coupling
o Here a virtual mediated schema provides an interface that takes the query from the
user, transforms it in a way the source database can understand and then sends the
query directly to the source databases to obtain the result.
o In this approach, the data only remains in the actual source databases.
o However, mediated schema contains several "adapters" or "wrappers" that can
connect back to the source systems in order to bring the data to the front end.
ADVANTAGES:
Data Freshness (low latency - almost real time)
Higher Agility (when a new source system comes or existing source system changes -
only the corresponding adapter is created or changed - largely not affecting the other
parts of the system)
Less costly (a lot of infrastructure cost can be saved since data localization is not
required)
DISADVANTAGES:
1. Semantic conflicts
2. Slower query response
3. High order dependency to the data sources
For example, let's imagine that an electronics company is preparing to roll out a new
mobile device. The marketing department might want to retrieve customer information
from a sales department database and compare it to information from the product
department to create a targeted sales list. A good data integration system would let the
marketing department view information from both sources in a unified way, leaving out
any information that didn't apply to the search.
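A rough sketch of the scenario above, with invented customer fields; it assumes pandas and simply joins the two departmental extracts on a shared customer key to produce a unified view:

```python
import pandas as pd

# Hypothetical extracts from two departmental sources.
sales = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "last_purchase": ["2023-01-10", "2023-02-02", "2023-03-15"],
})
product = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "device_interest": ["phone", "tablet", "laptop"],
})

# Unified view: only customers present in both sources are kept,
# which "leaves out information that doesn't apply" to the campaign.
unified = sales.merge(product, on="customer_id", how="inner")
print(unified)
```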
II. DATA TRANSFORMATION:
In data mining pre-processing, and especially in metadata and data warehousing, we use
data transformation to convert data from a source data format into the destination
data format.
We can divide data transformation into 2 steps:
• Data Mapping:
It maps the data elements from the source to the destination and captures any
transformation that must occur.
• Code Generation:
It creates the actual transformation program.
Data transformation:
• Here the data are transformed or consolidated into forms appropriate for mining.
• Data transformation can involve the following:
Smoothing:
• It works to remove noise from the data.
• It is a form of data cleaning where users specify transformations to correct data
inconsistencies.
• Such techniques include binning, regression, and clustering.
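A minimal sketch of one of these techniques, smoothing by bin means with equal-frequency bins; the data values and bin size are made up:

```python
# Sketch: smoothing noisy values by (equal-frequency) bin means.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    # Every value in the bin is replaced by the bin's mean.
    smoothed.extend([bin_mean] * len(bin_values))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```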
Aggregation:
• Here summary or aggregation operations are applied to the data.
• This step is typically used in constructing a data cube for analysis of the data at
multiple granularities.
• Aggregation is a form of data reduction.
Generalization :
• Here low-level or “primitive” (raw) data are replaced by higher-level concepts through
the use of concept hierarchies.
• For example, attributes, like age, may be mapped to higher-level concepts, like youth,
middle-aged, and senior.
• Generalization is a form of data reduction.
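To make the age example concrete, a small sketch that replaces raw ages with higher-level concepts; pandas, the cut points, and the labels are illustrative assumptions:

```python
import pandas as pd

ages = pd.Series([16, 23, 38, 47, 65, 72])

# Replace raw ("primitive") ages with higher-level concepts.
generalized = pd.cut(ages,
                     bins=[0, 25, 55, 120],
                     labels=["youth", "middle-aged", "senior"])
print(generalized.tolist())
# ['youth', 'youth', 'middle-aged', 'middle-aged', 'senior', 'senior']
```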
Normalization:
• Here the attribute data are scaled so as to fall within a small specified range, such as
-1.0 to 1.0, or 0.0 to 1.0.
• Normalization is particularly useful for classification algorithms involving neural
networks, or distance measurements such as nearest-neighbor classification and
clustering
• For distance-based methods, normalization helps prevent attributes with initially large
ranges (e.g., income) from outweighing attributes with initially smaller ranges.
• There are three methods for data normalization:
1. Min-max normalization:
o A value v of attribute A is transformed to v' by computing
v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
o Min-max normalization preserves the relationships among the original data values.
2. Z-score normalization:
o Here the values for an attribute, A, are normalized based on the mean and standard
deviation of A:
v' = (v - meanA) / std_devA
3. Normalization by decimal scaling:
o Here the normalization is done by moving the decimal point of values of attribute A.
o The number of decimal points moved depends on the maximum absolute value of A.
o A value v of A is normalized to v' by computing v' = v / 10^j, where j is the
smallest integer such that max(|v'|) < 1.
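A brief sketch of the decimal-scaling formula above (min-max and z-score were illustrated earlier in this unit); the sample values are made up:

```python
import math

# Normalization by decimal scaling: v' = v / 10**j, with the smallest j
# such that max(|v'|) < 1.
values = [-986, 217, 450]

j = math.ceil(math.log10(max(abs(v) for v in values) + 1))
scaled = [v / 10 ** j for v in values]
print(j, scaled)  # 3 [-0.986, 0.217, 0.45]
```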
• Attribute construction:
o Here new attributes are constructed and added from the given set of attributes to help
the mining process.
o Attribute construction helps to improve the accuracy and understanding of structure in
high-dimensional data.
o By combining attributes, attribute construction can discover missing information about
the relationships between data attributes that can be useful for knowledge discovery.
E.g., the structure of stored data may vary between applications, requiring semantic
mapping prior to the transformation process. For instance, two applications might store
the same customer credit card information using slightly different structures:
To ensure that critical data isn’t lost when the two applications are integrated,
information from Application A needs to be reorganized to fit the data structure of
Application B.
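A hedged sketch of such a data-mapping step; since the actual structures of Application A and Application B are not shown, the field names and the transform helper below are hypothetical placeholders:

```python
# Hypothetical example of data mapping: reorganizing a record from
# Application A's structure to fit Application B's structure.
FIELD_MAP = {            # source field (App A) -> target field (App B)
    "cc_number": "card_number",
    "cc_expiry": "expiration_date",
    "cust_name": "cardholder_name",
}

def transform(record_a):
    """Generated transformation: rename fields according to the mapping."""
    return {FIELD_MAP[key]: value
            for key, value in record_a.items() if key in FIELD_MAP}

record_a = {"cc_number": "4111-1111-1111-1111",
            "cc_expiry": "09/27",
            "cust_name": "A. Sharma"}
print(transform(record_a))
```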
A: Data Smoothing:
When data collected over time displays random variation, smoothing techniques can be
used to reduce or cancel the effect of these variations. When properly applied, these
techniques smooth out the random variation in the time series data to reveal underlying
trends.
XLMiner features four different smoothing techniques: Exponential, Moving Average, Double
Exponential, and Holt-Winters. Exponential and Moving Average are relatively simple
smoothing techniques and should not be performed on data sets involving seasonality.
Double Exponential and Holt-Winters are more advanced techniques that can be used on
data sets involving seasonality.
Exponential Smoothing
Exponential Smoothing is one of the more popular smoothing techniques due to its flexibility,
ease in calculation, and good performance. Exponential Smoothing uses a simple average
calculation to assign exponentially decreasing weights starting with the most recent
observations. New observations are given relatively more weight in the average calculation
than older observations. The Exponential Smoothing tool uses the following formulas.
S0 = x0
St = αxt + (1 - α)St-1 , for t > 0
where α (0 < α ≤ 1) is the smoothing constant; larger values of α give relatively more weight
to recent observations.
Exponential Smoothing should only be used when the data set contains no seasonality. The
forecast is a constant value that is the smoothed value of the last observation.
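A minimal sketch of the recursion above, not tied to XLMiner; the alpha value and the sample series are made up:

```python
def exponential_smoothing(series, alpha=0.3):
    """Simple exponential smoothing: S0 = x0, St = alpha*xt + (1 - alpha)*St-1."""
    smoothed = [series[0]]                       # S0 = x0
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

data = [3, 10, 12, 13, 12, 10, 12]
print(exponential_smoothing(data, alpha=0.3))
```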
Moving Average Smoothing
In Moving Average Smoothing, each observation is assigned an equal weight, and each
observation is forecasted by using the average of the previous observation(s). Using the time
series X1, X2, X3, ..., Xt, this smoothing technique predicts Xt+k as the average of the last N
observations:
Xt+k = (Xt + Xt-1 + ... + Xt-N+1) / N
XLMiner allows a parameter value N between 2 and t-1, where t is the number of observations
in the data set. Note that when choosing this parameter, a large parameter value will
oversmooth the data, while a small parameter value will undersmooth the data. For example,
with a parameter of 3, the past three observations are averaged to predict future observations.
As with Exponential Smoothing, this technique should not be applied when seasonality is
present in the data set.
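A minimal sketch of moving-average forecasting with a window of N = 3 observations; the data are made up:

```python
def moving_average_forecast(series, n=3):
    """Forecast the next value as the average of the last n observations."""
    window = series[-n:]
    return sum(window) / len(window)

data = [3, 10, 12, 13, 12, 10, 12]
print(moving_average_forecast(data, n=3))  # (12 + 10 + 12) / 3 ≈ 11.33
```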
Double Exponential Smoothing
Double Exponential Smoothing adds a trend component to the forecast:
St = At + Bt , t = 1, 2, 3, ..., N
where At is the smoothed level, Bt is the trend estimate, α denotes the Alpha (level) parameter,
and β denotes the trend parameter. These two parameters can be entered manually.
XLMiner includes an optimize feature that will choose the best values for alpha and trend
parameters based on the Forecasting Mean Squared Error. If the trend parameter is 0, then
this technique is equivalent to the Exponential Smoothing technique. (However, results may
not be identical due to different initialization methods for these two techniques.)
Holt-Winters Smoothing
Holt-Winters Smoothing introduces a third parameter (γ) to account for seasonality (or
periodicity) in a data set. The resulting set of equations is called the Holt-Winters method, after
the names of the inventors. The Holt-Winters method can be used on data sets involving trend
and seasonality (α, β, γ). Values for all three parameters can range between 0 and 1.
Multiplicative model: Xt = (At + Bt) * St + et, where At and Bt are previously calculated
estimates of the level and trend, and St is the average seasonal factor for the t-th season.
The seasonal factor is updated as St = γ*xt/At + (1 - γ)*St-p, where p is the length of the
seasonal cycle.
The term Knowledge Discovery in Databases, or KDD for short, refers to the broad
process of finding knowledge in data, and emphasizes the "high-level" application of
particular data mining methods. It is of interest to researchers in machine learning,
pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition
for expert systems, and data visualization.
The unifying goal of the KDD process is to extract knowledge from data in the
context of large databases.
It does this by using data mining methods (algorithms) to extract (identify) what is
deemed knowledge, according to the specifications of measures and thresholds, using
a database along with any required preprocessing, subsampling, and transformations
of that database.
1. Developing an understanding of
o the application domain
o the relevant prior knowledge
o the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of
variables, or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
o Removal of noise or outliers.
o Collecting necessary information to model or account for noise.
o Strategies for handling missing data fields.
o Accounting for time sequence information and known changes.
4. Data reduction and projection.
o Finding useful features to represent the data depending on the goal of the
task.
o Using dimensionality reduction or transformation methods to reduce the
effective number of variables under consideration or to find invariant
representations for the data.
5. Choosing the data mining task.
o Deciding whether the goal of the KDD process is classification,
regression, clustering, etc.
6. Choosing the data mining algorithm(s).
o Selecting method(s) to be used for searching for patterns in the data.
o Deciding which models and parameters may be appropriate.
o Matching a particular data mining method with the overall criteria of the
KDD process.
7. Data mining.
o Searching for patterns of interest in a particular representational form or
a set of such representations as classification rules or trees, regression,
clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
KDD refers to the overall process of discovering useful knowledge from data. It
involves the evaluation and possibly interpretation of the patterns to make the decision
of what qualifies as knowledge. It also includes the choice of encoding schemes,
preprocessing, sampling, and projections of the data prior to the data mining step.
Data mining refers to the application of algorithms for extracting patterns from data
without the additional steps of the KDD process.