Data Warehouse
Data Warehouse is a central managed and integrated database containing data from the
operational sources in an organization (such as SAP, CRM, ERP system). It may gather
manual inputs from users determining criteria and parameters for grouping or classifying
records.
That database contains structured data for query analysis and can be accessed by users.
The data warehouse can be created or updated at any time, with minimal disruption to
operational systems. This is ensured by a strategy implemented in an ETL process.

A source for the data warehouse is a data extract from operational databases. The data is
validated, cleansed, transformed and finally aggregated and it becomes ready to be loaded
into the data warehouse.
A data warehouse is a dedicated database which contains detailed, stable, non-volatile and
consistent data which can be analyzed over time (it is time-variant).
Sometimes, where only a portion of detailed data is required, it may be worth considering
using a data mart. A data mart is generated from the data warehouse and contains data
focused on a given subject and data that is frequently accessed or summarized.
Business Intelligence - Data Warehouse - ETL:



Keeping the data warehouse filled with very detailed and not efficiently selected data may lead
to the database growing to a huge size, which may make it difficult to manage and unusable. A good
example of successful data management are the set-ups used by leaders in the field of telecoms,
such as O2 broadband. To significantly reduce the number of rows in the data warehouse, the data is
aggregated, which leads to easier data maintenance and efficiency in browsing and data
analysis.



Key Data Warehouse systems and the most widely used database engines for storing and serving
data for enterprise business intelligence and performance management:
- Teradata
- SAP BW - Business Information Warehouse
- Oracle
- Microsoft SQL Server
- IBM DB2
- SAS

Data Warehouse Architecture
The main difference between the database architecture in a standard, on-line transaction
processing (OLTP) oriented system (usually an ERP or CRM system) and a Data Warehouse is that the
system's relational model is usually de-normalized into dimension and fact tables, which are
typical of a data warehouse database design.
The differences in the database architectures are caused by the different purposes for which the
systems exist.
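
To make the distinction concrete, here is a small, purely illustrative sketch (all table and field names are invented for this example): the same sale stored as one normalized OLTP record and, alternatively, as a fact row referencing dimension rows in a star schema.

# Hypothetical example: the same sale in OLTP style vs. star-schema style.

# OLTP-style, normalized record: everything needed to process the transaction.
oltp_order = {
    "order_id": 1001,
    "customer_id": 77,
    "product_id": "P-15",
    "order_date": "2009-03-02",
    "quantity": 3,
    "price": 20.00,
}

# Star-schema style: descriptive attributes live in dimension tables,
# measures live in a fact table that references the dimensions by key.
dim_customer = {77: {"customer_name": "ACME Ltd", "region": "North"}}
dim_product = {"P-15": {"product_name": "Widget", "product_group": "Tools"}}
dim_date = {"2009-03-02": {"year": 2009, "month": 3, "quarter": "Q1"}}

fact_sales = [
    {"date_key": "2009-03-02", "customer_key": 77, "product_key": "P-15",
     "quantity": 3, "sales_amount": 3 * 20.00},
]

# Analytical query: total sales per product group (join fact to dimension).
totals = {}
for row in fact_sales:
    group = dim_product[row["product_key"]]["product_group"]
    totals[group] = totals.get(group, 0) + row["sales_amount"]
print(totals)  # {'Tools': 60.0}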

In a typical OLTP system the database performance is crucial, as end-user interface
responsiveness is one of the most important factors determining the usefulness of the
application. That kind of database needs to handle inserting thousands of new records
every hour. To achieve this, the database is usually optimized for the speed of inserts, updates
and deletes and for holding as few records as possible. So from a technical point of view
most of the SQL queries issued will be INSERT, UPDATE and DELETE statements.

In contrast to OLTP systems, a Data Warehouse is a system that should give a response to
almost any question regarding company performance measures. Usually the
information delivered from a data warehouse is used by people who are in charge of making
decisions. So the information should be accessible quickly and easily, but it does not need to
be the most recent possible nor at the lowest detail level.

Data mart
Data marts are designated to fulfill the role of strategic decision support for managers
responsible for a specific business area.

A data warehouse operates on an enterprise level and contains all data used for reporting
and analysis, while a data mart is used by a specific business department and is focused on
a specific subject (business area).

A scheduled ETL process populates data marts with subject-specific information from the data
warehouse.

The typical approach for maintaining a data warehouse environment with data marts is to
have one Enterprise Data Warehouse which comprises divisional and regional data
warehouse instances together with a set of dependent data marts which derive the
information directly from the data warehouse.



It is crucial to keep data marts consistent with the enterprise-wide data warehouse system
as this will ensure that they are properly defined, constituted and managed. Otherwise the
DW environment mission of being "the single version of the truth" becomes a myth.
However, in data warehouse systems there are cases where developing an independent
data mart is the only way to get the required figures out of the DW environment.
Developing independent data marts, which are not 100% reconciled with the data
warehouse environment and in most cases include a supplementary source of data, must
be clearly understood and all the associated risks must be identified.

Data marts are usually maintained and made available in the same environment as the data
warehouse (systems like Oracle, Teradata, MS SQL Server, SAS) and are smaller in size
than the enterprise data warehouse.
There are also many cases when data marts are created and refreshed on a server and then
distributed to the end users using shared drives or email and stored locally. This approach
generates high maintenance costs; however, it makes it possible to keep data marts available
offline.

There are two approaches to organizing data in data marts:
- Database data mart tables or their extracts represented by text files - one-dimensional, non-aggregated
data sets; in most cases the data is processed and summarized many times by the
reporting application.
- Multidimensional database (MDDB) - aggregated data organized in a multidimensional
structure. The data is aggregated only once and is ready for business analysis right away.

In the next stage, the data from data marts is usually gathered by a reporting or on-line analytical
processing (OLAP) tool, such as Hyperion, Cognos, Business Objects, Pentaho BI or Microsoft
Excel, and made available for business analysis.

Usually, a company maintains multiple data marts serving the needs of finance, marketing, sales,
operations, IT and other departments as needed.

Example uses of data marts in an organization: CRM reporting, customer migration analysis,
production planning, monitoring of marketing campaigns, performance indicators, internal
ratings and scoring, risk management, integration with other systems (systems which use the
processed DW data) and more uses specific to the individual business.

Reporting
A successful reporting platform implementation in a business intelligence environment
requires great attention to be paid from both the business end users and IT professionals.
The fact is that the reporting layer is what business users might consider to be the data warehouse
system, and if they do not like it, they will not use it, even though it might be a perfectly
maintained data warehouse with high-quality data, stable and optimized ETL processes and
faultless operation. It will be just useless for them, and thus useless for the whole organization.



The problem is that the report generation process is not particularly interesting from the IT
point of view as it does not involve heavy data processing and manipulation tasks. The IT
professionals do not tend to pay great attention to this BI area as they consider it rather
'look and feel' than the 'real heavy stuff'.
On the other hand, the lack of technical exposure of the business users usually makes the
report design process too complicated for them.
The conclusion is that the key to success in reporting (and the whole BI environment) is the
collaboration between the business and IT professionals.

Data mining
Data mining is the use of intelligent information management tools to discover knowledge and
extract information that helps support the decision-making process in an
organization. Data mining is an approach to discovering data behavior in large data sets by
exploring the data, fitting different models and investigating different relationships in vast
repositories.

The information extracted with a data mining tool can be used in such areas as decision
support, prediction, sales forecasts, financial and risk analysis, estimation and optimization.

Sample real-world business uses of data mining applications include:
- CRM - aids customer classification and retention campaigns
- Web site traffic analysis - guest behavior prediction or relevant content delivery
- Public sector organizations may use data mining to detect occurrences of fraud such as money
laundering and tax evasion, match crime and terrorist patterns, etc.
- Genomics research - analysis of vast data stores

The most widely known and encountered data mining techniques:
- Statistical modeling, which uses mathematical equations to do the analysis. The most popular
statistical models are: generalized linear models, discriminant analysis, linear regression and
logistic regression.
- Decision list models and decision trees
- Neural networks
- Genetic algorithms
- Screening models

Data mining tools offer a number of data discovery techniques to provide expertise on the data and
to help identify the relevant set of attributes in the data:
- Data manipulation, which consists of constructing new data subsets derived from existing
data sources.
- Browsing, auditing and visualization of the data, which helps identify non-typical, suspect
relationships between variables in the data.
- Hypothesis testing

A group of the most significant data mining tools is represented by:
- SPSS Clementine
- SAS Enterprise Miner
- IBM DB2 Intelligent Miner
- STATISTICA Data Miner
- Pentaho Data Mining (WEKA)
- Isoft Alice

DATA FEDERATION
Try to remember all the times when you had to collect data from many different sources. I
guess this reminds you of all those situations when you were searching for and processing
data for many hours, because you had to check every source one by one. After that it turned
out that you had missed something and you spent another day attaching the missing data.
However, there is a solution which can simplify your work. That solution is called "data
federation".
What is this and how does it work?
Data federation is a kind of software that standardizes the integration of data from different
(sometimes very dispersed) sources. It collects data from multiple sources, creates a strategic
layer of data and optimizes the integration of dispersed views, providing standardized
access to the integrated view of information within a single layer of data. The created layer
ensures re-usability of the data. Because of that, data federation plays an important role
when building SOA-style software (Service Oriented Architecture - an approach to building
systems whose main aim is to create software that lives up to users' expectations).
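
As a rough, hand-written illustration of the idea (the source names and the adapter functions below are invented; real data federation products provide this layer out of the box), two differently formatted sources can be exposed through one standardized view without copying the data into a central database.

# Illustrative sketch of a data federation layer: two dispersed sources with
# different formats are exposed as one standardized, integrated view.

# Source 1: a CRM-like system returning dictionaries.
crm_customers = [
    {"cust_id": 1, "name": "Alice", "country": "PL"},
    {"cust_id": 2, "name": "Bob", "country": "DE"},
]

# Source 2: a legacy system exporting semicolon-separated text.
legacy_export = "3;Carol;FR\n4;Dave;UK"

def read_crm():
    # Adapter: map the CRM fields into the standardized shape.
    for row in crm_customers:
        yield {"customer_id": row["cust_id"], "customer_name": row["name"],
               "country": row["country"], "source": "CRM"}

def read_legacy():
    # Adapter: parse the flat-file format into the same standardized shape.
    for line in legacy_export.splitlines():
        cust_id, name, country = line.split(";")
        yield {"customer_id": int(cust_id), "customer_name": name,
               "country": country, "source": "LEGACY"}

def federated_customers():
    # The "single layer of data": callers see one view, data stays at the sources.
    yield from read_crm()
    yield from read_legacy()

for customer in federated_customers():
    print(customer)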
The most important advantages of good data federation software
Data federation strong points:
- Large companies process huge amounts of data that come from different sources.
In addition, as the company develops, there are more and more data sources
that employees have to deal with, and eventually it is nearly impossible to work
without specific supporting software.
- Standardized and simplified access to data - thanks to that, even when we use data
from different, dispersed sources that additionally have different formats, they can
look like they are from the same source.
- You can easily create integrated views of data and libraries of data stores that can be
used multiple times.
- Data is provided in real time, from the original sources, not from cumulative databases
or duplicates.
- Efficient delivery of up-to-date data and protection of the database at the same time.
Data federation weaknesses
There is no garden without weeds, so data federation also has some drawbacks.

While using this software, parameters should be closely watched and optimized, because the
aggregation logic takes place on the server, not in the database. Wrong parameters or other
errors can affect the transmission or the correctness of the results. We should also remember
that the data comes from a large number of sources, and it is important to check their reliability.
By using unproven and uncorrected data we can introduce errors into our work, and
that can cause financial losses for our company. Software can't always judge whether a source is
reliable or not, so we have to make sure that there are dedicated information managers who
watch over the correctness of the data and decide which sources our data federation
application can use.

As you can see, a data federation system is very important, especially for big companies. It
makes working with large amounts of data from different sources much easier.
Top Data Federation tools
Below is a list of the most popular enterprise data integration tools providing
data federation features:
- SAP BusinessObjects Data Federator
- Sybase Data Federation
- IBM InfoSphere Federation Server
- Oracle Data Service Integrator
- SAS Enterprise Data Integration Server - data federation features


Business Intelligence staffing
This study illustrates how global companies typically structure business intelligence
resources and which teams and departments are involved in the support and maintenance of
an Enterprise Data Warehouse.
The setup may vary widely across organizations depending on the company
requirements, the sector and the BI strategy; however, it might be considered as a template.

The successful business intelligence strategy requires the involvement of people from
various departments. Those are mainly:
- Top Management (CIO, board of directors members, business managers)
- Finance
- Sales and Marketing
- IT - both business analysts and technical managers and specialists
Steering committee, Business owners
A typical enterprise data warehouse environment is under the control and direction of a
steering committee. The steering committee sets the policy and strategic direction of the
data warehouse and is responsible for the prioritization of the DW initiatives.
The steering committee is usually chaired by a DW business owner. The business owner acts
as sponsor and champion of the whole BI environment in an organization.


Business Data Management group, Data governance team
This group forms the Business Stream of the BI staff and consists of a cross-functional
team of representatives from the business and IT (mostly IT managers and analysts)
with a shared vision to promote data standards and data quality, manage DW metadata and
ensure that the Data Warehouse is The Single Version of the Truth. Within the data
governance group each major data branch has a business owner appointed to act as data
steward. One of the key roles of the data steward is to approve access to the DW data.
The new DW projects and enhancements are shaped and approved by the Business Data
Management group. The team also defines the Business Continuity Management and
disaster recovery strategy.
Data Warehousing team
The IT stream of a Data Warehouse is usually the largest and the most relevant from the
technical perspective. It is hard to name this kind of team precisely and easily as it may
differ significantly among organizations. This group might be referred to as the Business
Intelligence team, Data Warehousing team, Information Analysis Group,
Information Delivery team, MIS Support, BI development, Datawarehousing
Delivery team, or BI solutions and services. Basically, the creativity of HR departments in
naming teams and positions is much broader than you can imagine.
Some organizations also group the staff into smaller teams, oriented on a specific function
or a topic.

The typical positions occupied by the members of the data warehousing team are:
- Data Warehouse analyst - these are people who know the OLTP and MIS systems and the
dependencies between them in an organization. Very often the analysts create functional
specification documents, high-level solution outlines, etc.
- DW architect - a technical person who needs to understand the business strategy and implement
that vision through technology
- Business Intelligence specialist, Data Warehouse specialist, DW developer - the technical BI
experts
- ETL modeler, ETL developer - data integration technical experts
- Reporting analyst, OLAP analyst, Information Delivery analyst, Report designer - people with
analytical minds and some technical exposure
- Team leaders - most often each of the Data Warehousing team subgroups has its own team
leader who reports to a Business Intelligence IT manager. Those are very often experts who have
been promoted
- Business Intelligence IT manager - a very important role, because a BI manager is the link
between the steering committee, the data governance group and the experts


DW Support and administration
This team is focused on the support, maintenance and resolution of operational problems
within the data warehouse environment. The people included may be:
- Database administrators (DBA)
- Administrators of other BI tools in an organization (database, reporting, ETL)
- Helpdesk support, customer support - supporting the reporting front-ends, access to the DW
applications, and other hardware and software problems
Sample DW team organogram
An Enterprise Data Warehouse organizational chart with all the people involved and a sample
staff count. The head count in our sample DW organization lists 35 people. Be aware that
this organogram depends heavily on the company size and its mission; however, it may be
treated as a good illustration of the proportions of BI resources in a typical environment.
Keep in mind that not all positions involve full-time resources.
- Steering committee, Business owners *
- Business Data Management group
- DW team:
  - DW/ETL specialists and developers
  - report designers and analysts
  - DW technical analysts
  - Testers * (an analyst may also play the role of a tester)
- Support and maintenance *

Data Warehouse metadata
The metadata in a data warehouse system describes the definitions, meaning, origin and rules
of the data used in a Data Warehouse. There are two main types of metadata in a data
warehouse system: business metadata and technical metadata. Those two types
illustrate both the business and the technical point of view on the data.
The Data Warehouse Metadata is usually stored in a Metadata Repository which is accessible
by a wide range of users.
Business metadata
Business metadata (data warehouse metadata, front room metadata, operational metadata)
- this type of metadata stores business definitions of the data; it contains high-level
definitions of all fields present in the data warehouse and information about cubes, aggregates
and data marts.
Business metadata is mainly addressed to and used by the data warehouse users, report
authors (for ad-hoc querying), cube creators, data managers, testers and analysts.

Typically, the following information needs to be provided to describe business metadata:
- DW Table Name
- DW Column Name
- Business Name - short and descriptive header information
- Definition - extended description with a brief overview of the business rules for the field
- Field Type - a flag may indicate whether a given field stores a key or a discrete value, whether it is
active or not, or what data type it is. The content of that field (or fields) may vary upon business needs.
Technical metadata
Technical metadata (ETL process metadata, back room metadata, transformation metadata)
is a representation of the ETL process. It stores data mapping and transformations from
source systems to the data warehouse and is mostly used by datawarehouse developers,
specialists and ETL modellers.
Most commercial ETL applications provide a metadata repository with an integrated
metadata management system to manage the ETL process definition.
The definition of technical metadata is usually more complex than the business metadata
and it sometimes involves multiple dependencies.

The technical metadata can be structured in the following way (a sample entry is sketched after the list):
- Source Database - or system definition; it can be a source system database, another data warehouse,
a file system, etc.
- Target Database - the Data Warehouse instance
- Source Tables - one or more tables which are input to calculate the value of the field
- Source Columns - one or more columns which are input to calculate the value of the field
- Target Table - the target DW table; table and column are always single in a metadata repository
- Target Column - the target DW column
- Transformation - the descriptive part of a metadata entry. It usually contains a lot of information, so it
is important to use a common standard throughout the organisation to keep the data consistent.
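
As a simple, hypothetical illustration (the field names and values are invented to mirror the structure above), a single technical metadata entry could be represented like this:

# Hypothetical technical metadata entry following the structure described above.
technical_metadata_entry = {
    "source_database": "CRM_PROD",                       # source system definition
    "target_database": "ENTERPRISE_DW",                  # data warehouse instance
    "source_tables": ["CUSTOMERS", "CUSTOMER_SEGMENTS"], # inputs for the target field
    "source_columns": ["CUSTOMERS.SEGMENT_ID", "CUSTOMER_SEGMENTS.SEGMENT_NAME"],
    "target_table": "D_CUSTOMER",                        # always a single target table
    "target_column": "CUSTOMER_SEGMENT",                 # always a single target column
    "transformation": ("Look up SEGMENT_NAME by SEGMENT_ID; "
                       "default to 'UNKNOWN' when no match is found."),
}

# A metadata repository is then essentially a searchable collection of such entries.
repository = [technical_metadata_entry]
for entry in repository:
    print(entry["target_table"] + "." + entry["target_column"], "<-", entry["source_columns"])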


Some tools dedicated to metadata management (many of them are bundled with ETL tools):
- Teradata Metadata Services
- Erwin Data Modeler
- Microsoft Repository
- IBM (Ascential) MetaStage
- Pentaho Metadata
- Ab Initio EME (Enterprise Metadata Environment)

SCD - Slowly changing dimensions
Slowly changing dimensions (SCD) determine how the historical changes in the dimension
tables are handled. Implementing the SCD mechanism enables users to know which
category an item belonged to on any given date.

Types of Slowly Changing Dimensions in the Data Warehouse architectures:
- Type 0 SCD is not used frequently, as it applies when no effort has been
made to deal with the changing dimension issues. So some dimension data may be
overwritten and other data may stay unchanged over time, which can result in
confusing end-users.
- Type 1 SCD DW architecture applies when no history is kept in the database. The
new, changed data simply overwrites the old entries. This approach is used quite often
with data which changes over time as a result of correcting data quality
errors (misspellings, data consolidations, trimming spaces, language-specific
characters).
Type 1 SCD is easy to maintain and is used mainly when losing the ability to track the
old history is not an issue.
- In the Type 2 SCD model the whole history is stored in the database. An additional
dimension record is created, so the segmentation between the old record values and
the new (current) value is easy to extract and the history is clear. The fields
'effective date' and 'current indicator' are very often used in that dimension.
- Type 3 SCD - only the information about a previous value of a dimension is written
into the database. An 'old' or 'previous' column is created which stores the
immediate previous attribute. In Type 3 SCD users are able to describe history
immediately and can report both forward and backward from the change.
However, that model can't track all historical changes, such as when a dimension
changes twice or more. It would require creating additional columns to store historical
data and could make the whole data warehouse schema very complex.
- Type 4 SCD - the idea is to store all historical changes in a separate historical data table
for each of the dimensions.
In order to manage Slowly Changing Dimensions properly and easily it is highly
recommended to use Surrogate Keys in the Data Warehouse tables.
A Surrogate Key is a technical key added to a fact table or a dimension table which is used
instead of a business key (like product ID or customer ID).

Surrogate keys are always numeric and unique on a table level which makes it easy to
distinguish and track values changed over time.


In practice, in big production Data Warehouse environments, mostly the Slowly Changing
Dimensions Type 1, Type 2 and Type 3 are considered and used. It is a common
practice to apply different SCD models to different dimension tables (or even columns in the
same table) depending on the business reporting needs of a given type of data.
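
As a minimal illustration of the two most frequently used models (the table layout, field names and in-memory representation are invented for this sketch, not taken from any particular tool), Type 1 simply overwrites the attribute, while Type 2 closes the current row and inserts a new current row with its own surrogate key.

import copy
from datetime import date

# A dimension row with the fields often used for Type 2 SCD handling.
dim_customer = [
    {"cust_key": 1, "cust_id": "C001", "city": "London",
     "effective_date": date(2008, 1, 1), "current_flag": True},
]

def scd_type1_update(rows, cust_id, new_city):
    # Type 1: overwrite the attribute in place, no history is kept.
    for row in rows:
        if row["cust_id"] == cust_id:
            row["city"] = new_city

def scd_type2_update(rows, cust_id, new_city, change_date):
    # Type 2: close the current row and add a new current row with a new surrogate key.
    for row in rows:
        if row["cust_id"] == cust_id and row["current_flag"]:
            row["current_flag"] = False
    new_key = max(r["cust_key"] for r in rows) + 1
    rows.append({"cust_key": new_key, "cust_id": cust_id, "city": new_city,
                 "effective_date": change_date, "current_flag": True})

type1_rows = copy.deepcopy(dim_customer)
scd_type1_update(type1_rows, "C001", "Paris")                       # still one row, history lost
scd_type2_update(dim_customer, "C001", "Paris", date(2010, 6, 1))   # two rows, history kept
print(type1_rows)
print(dim_customer)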

ETL process and concepts
ETL stands for extraction, transformation and loading. ETL is a process that involves the
following tasks:
- extracting data from source operational or archive systems which are the primary
source of data for the data warehouse
- transforming the data - which may involve cleaning, filtering, validating and
applying business rules
- loading the data into a data warehouse or any other database or application that
houses data
The ETL process is also very often referred to as Data Integration process and ETL tool as a
Data Integration platform.
The terms closely related to and managed by ETL processes are: data migration, data
management, data cleansing, data synchronization and data consolidation.

The main goal of maintaining an ETL process in an organization is to migrate and transform
data from the source OLTP systems to feed a data warehouse and form data marts.
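
As a minimal, illustrative sketch of the three steps (the extract content, the validation rule and the data structures below are invented and stand in for what a real ETL tool would do), the flow can be reduced to three small functions:

import csv
import io

# Extract: read a (simulated) operational extract.
source_extract = "cust_id,name,country\n1,  Alice ,PL\n2,Bob,\n3,Carol,FR"

def extract(text):
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform: cleanse (trim spaces), validate (country required), apply rules.
    clean, rejected = [], []
    for row in rows:
        row["name"] = row["name"].strip()
        if not row["country"]:
            rejected.append(row)          # invalid records go to a reject flow
        else:
            clean.append(row)
    return clean, rejected

def load(rows, target):
    # Load: append validated records to the target structure (here just a list).
    target.extend(rows)

data_warehouse_table = []
rows, rejects = transform(extract(source_extract))
load(rows, data_warehouse_table)
print("loaded:", data_warehouse_table)
print("rejected:", rejects)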

Data Warehousing ETL tutorial
The ETL and Data Warehousing tutorial is organized into lessons representing various
business intelligence scenarios, each of which describes a typical data warehousing
challenge.
This guide might be considered as an ETL process and Data Warehousing knowledge
base with a series of examples illustrating how to manage and implement the ETL process in
a data warehouse environment.

The purpose of this tutorial is to outline and analyze the most widely encountered real-life
data warehousing problems and challenges that need to be addressed during the design and
architecture phases of a successful data warehouse project deployment.

Going through the sample implementations of the business scenarios is also a good way to
compare Business Intelligence and ETL tools and get to know the different approaches
to designing the data integration process. This also gives an idea and helps identify strong
and weak points of various ETL and data warehousing applications.

This tutorial shows how to use the following BI, ETL and datawarehousing tools: Datastage,
SAS, Pentaho, Cognos and Teradata.

Data Warehousing & ETL Tutorial lessons
- Surrogate key generation example, which includes information on business keys and
surrogate keys and shows how to design an ETL process to manage surrogate keys
in a data warehouse environment. Sample design in Pentaho Data Integration
- Header and trailer processing - considerations on processing files arranged in blocks
consisting of a header record, body items and a trailer. These files usually
come from mainframes; the approach also applies to EDI and EPIC files. Solution examples in
Datastage, SAS and Pentaho Data Integration
- Loading customers - a data extract is placed on an FTP server. It is copied to an ETL
server and loaded into the data warehouse. Sample loading in Teradata MultiLoad
- Data allocation - an ETL process case study for allocating data. Examples in Pentaho Data
Integration and Cognos PowerPlay
- Data masking and scrambling - algorithms and ETL deployments. Sample Kettle
implementation
- Site traffic analysis - a guide to creating a data warehouse with data marts for
website traffic analysis and reporting. Sample design in Pentaho Kettle
- Data Quality - an ETL process design aimed at testing and cleansing data in a Data
Warehouse. Sample outline in PDI
- XML ETL processing

Generate surrogate key
Goal
Fill in a data warehouse dimension table with data which comes from different source
systems and assign a unique record identifier (surrogate key) to each record.
Scenario overview and details
To illustrate this example, we will use two made-up sources of information to provide data
for the customers dimension. Each extract contains customer records with a business key
(natural key) assigned to it.

In order to isolate the data warehouse from source systems, we will introduce a technical
surrogate key instead of re-using the source system's natural (business) key.
A unique and common surrogate key is a one-field numeric key which is shorter, easier to
maintain and understand, and more independent from changes in the source system than a
business key. Also, if the surrogate key generation process is implemented correctly, adding a
new source system to the data warehouse processing will not require major effort.

Surrogate key generation mechanism may vary depending on the requirements, however
the inputs and outputs usually fit into the design shown below:
Inputs:
- an input represented by an extract from the source system
- datawarehouse table reference for identifying the existing records
- maximum key lookup

Outputs:

- output table or file with newly assigned surrogate keys
- new maximum key
- updated reference table with new records
Proposed solution
Assumptions:
- The surrogate key field for our made up example is WH_CUST_NO.
- To make the example clearer, we will use SCD 1 to handle changing dimensions. This
means that new records overwrite the existing data.
The ETL process implementation requires several inputs and outputs.
Input data:
- customers_extract.csv - first source system extract
- customers2.txt - second source system extract
- CUST_REF - a lookup table which contains mapping between natural keys and surrogate
keys
- MAX_KEY - a sequence number which represents the last key assignment

Output data:
- D_CUSTOMER - table with new records and correctly associated surrogate keys
- CUST_REF - new mappings added
- MAX_KEY sequence increased


The design of an ETL process for generating surrogate keys will be as follows (a simplified sketch follows the list):
- The loading process will be executed twice, once for each of the input files
- Check if the lookup reference data is correct and available:
  - CUST_REF table
  - MAX_KEY sequence
- Read the extract and first check if a record already exists. If it does, assign the existing surrogate
key to it and update the descriptive data in the main dimension table
- If it is a new record, then:
  - populate a new surrogate key and assign it to the record. The new key is populated by
incrementing the old maximum key by 1
  - insert a new record into the customers table
  - insert a new record into the mapping table (which stores the business to surrogate key mapping)
  - update the new maximum key
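
A condensed sketch of the assignment logic described above, with in-memory structures standing in for the CUST_REF mapping, the MAX_KEY sequence and the D_CUSTOMER dimension table (the sample records are invented):

# Simplified sketch of the surrogate key assignment steps listed above.
cust_ref = {"SRC1-100": 1, "SRC1-101": 2}   # business key -> surrogate key
max_key = 2                                  # last assigned surrogate key
d_customer = {1: {"name": "Alice"}, 2: {"name": "Bob"}}

extract = [
    {"business_key": "SRC1-101", "name": "Robert"},   # existing customer, SCD 1 overwrite
    {"business_key": "SRC2-500", "name": "Carol"},    # new customer from a second source
]

for record in extract:
    bkey = record["business_key"]
    if bkey in cust_ref:
        # Existing record: reuse the surrogate key, overwrite descriptive data (SCD 1).
        wh_cust_no = cust_ref[bkey]
        d_customer[wh_cust_no]["name"] = record["name"]
    else:
        # New record: increment the maximum key and register the new mapping.
        max_key += 1
        wh_cust_no = max_key
        cust_ref[bkey] = wh_cust_no
        d_customer[wh_cust_no] = {"name": record["name"]}

print(d_customer)          # key 2 updated, key 3 newly assigned
print(cust_ref, max_key)   # new mapping added, MAX_KEY increased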
Sample Implementations
Generation of surrogate key - implementations in various ETL environments:
- PDI surrogate key - surrogate key generation example implemented in Pentaho Data Integration


Processing a header and trailer textfile

Goal
Process a text file which contains records arranged in blocks consisting of a header record,
details (items, body) and a trailer. The aim is to normalize records and load them into a
relational database structure.
Scenario overview and details
Typically, header and item processing needs to be implemented when processing files that
originate from mainframe systems; it also applies to EDI transmission files, SWIFT schemas and
EPIC files.

The input file in our scenario is a stream of records representing invoices. Each invoice
consists of :
- A header containing invoice number, dates, customer reference and other details
- One or more items representing ordered products, including an item number, quantity,
price and value
- A trailer which contains summary information for all the items

The records are distinguished by the first character in each line: H stands for headers, I for
items and T for trailers.
The lines in a section of the input file are fixed length (so, for example, all headers have the
same number of characters, but this differs from the items).


Input data:
- Text file in a header-trailer format. The file presented in our example is divided into headers, items
and trailers.
  A header starts with a letter H and then contains an invoice number, a customer number, a date and
the invoice currency.
  Every item starts with a letter I, which is followed by the product ID, product age, quantity and net value.
  The trailer starts with a T and contains two values which act as a checksum: the total number of invoice
lines and the total net value.

Sample header and trailer input textfile:



Output data:
- INVC_HEADER and INVC_LINE relational tables - one with the invoice headers and the other with the
invoice lines
- a rejects text file with loading errors (desired output)
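
A small, illustrative parser for this kind of file is sketched below. Real header-trailer files are usually fixed length; a ';' delimiter and an invented field layout are used here only to keep the example readable.

sample_file = """H;1001;C055;20100815;EUR
I;P01;2;100.00
I;P02;1;50.00
T;2;150.00
H;1002;C077;20100816;USD
I;P09;5;20.00
T;1;100.00"""

invoice_headers, invoice_lines, rejects = [], [], []
current_header, item_count, net_total = None, 0, 0.0

for line in sample_file.splitlines():
    record_type = line[0]
    if record_type == "H":
        # Header: invoice number, customer number, date, currency.
        _, invoice_no, customer_no, invoice_date, currency = line.split(";")
        current_header = {"invoice_no": invoice_no, "customer_no": customer_no,
                          "date": invoice_date, "currency": currency}
        item_count, net_total = 0, 0.0
    elif record_type == "I":
        # Item: normalize into a relational invoice-line row.
        _, product_id, quantity, net_value = line.split(";")
        invoice_lines.append({"invoice_no": current_header["invoice_no"],
                              "product_id": product_id, "quantity": int(quantity),
                              "net_value": float(net_value)})
        item_count += 1
        net_total += float(net_value)
    elif record_type == "T":
        # Trailer acts as a checksum: number of lines and total net value.
        _, total_lines, total_value = line.split(";")
        if int(total_lines) == item_count and abs(float(total_value) - net_total) < 0.01:
            invoice_headers.append(current_header)
        else:
            rejects.append(current_header)   # loading error -> rejects file

print(len(invoice_headers), "valid invoice(s),", len(rejects), "rejected")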

FTP copy and load customers extract
Goal
Load customers data into a data warehouse according to the business requirements.
The customers data details are very often changed in the source system and we want to reflect those
changes in the data warehouse. We also want to be able to keep track of those changes.
Scenario overview and details
The customers data is extracted from the source system on a monthly basis and placed on an FTP server.
The ETL process needs to get the file locally and load it into the data warehouse. There are several
possible scenarios which need to be included in the process:
- a new customer is added
- a customer already exists and needs to be updated
- an existing customer remains unchanged
- a record may be invalid

Additionally, we want to keep track of the changes made to the customers data and be able to see all
the changes to a given record over time. To handle historical data we will use the Type 4 Slowly Changing
Dimension. SCD Type 4 means that the changed or deleted data is stored in a separate table. A
timestamp and a change index will be added to handle records which change more than once.

Input data:
- D_CUSTOMER - data warehouse table with customers
- dwh_cust_extract_1008.txt - customers extract for the current month

Output data:
- D_CUSTOMER - updated table which stores only current records
- D_CUSTOMER_HIST table - table with historical data for the customers dimension. It stores deleted
records and customers that have already been updated
- Cust_errors_100827.txt - a log file with loading errors
Proposed solution
The design of an ETL process flow for the customers loading will be as follows:
- A text file with the customer extract is generated by a source system and placed on an FTP server
- The file is retrieved to an ETL server
- The customer extract is loaded into a temporary table
- The existing DW customers file is loaded into a temporary lookup file
- Each record from the customers file is validated and looked up against the existing customers file. The
transform needs to apply the following rules (a simplified sketch follows after this list):
  a) If a record is malformed and does not pass validation, it is redirected to a reject flow
  b) If the lookup does not match any record, it means that this is a new customer and it needs to be loaded
into the customers table
  c) If the lookup matches, we need to compare the non-key fields to check if the customer details have
changed. There are two options available: all fields remain the same (then we leave the record as it is
and proceed to the next one), or a field has changed. In that case the current record needs to be
inserted into the historical table and replaced by a new one in the main customers table
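
A condensed sketch of rules a) to c), with the tables reduced to in-memory dictionaries and the matching logic simplified (the sample records are invented):

from datetime import date

# D_CUSTOMER keyed by business key; D_CUSTOMER_HIST keeps replaced versions (SCD Type 4).
d_customer = {"C001": {"name": "Alice", "city": "London"},
              "C002": {"name": "Bob", "city": "Berlin"}}
d_customer_hist, rejects = [], []

monthly_extract = [
    {"cust_id": "C002", "name": "Bob", "city": "Munich"},   # changed customer
    {"cust_id": "C003", "name": "Carol", "city": "Paris"},  # new customer
    {"cust_id": "",     "name": "???",  "city": ""},        # malformed record
]

for rec in monthly_extract:
    # a) validation: malformed records go to the reject flow
    if not rec["cust_id"]:
        rejects.append(rec)
        continue
    current = d_customer.get(rec["cust_id"])
    if current is None:
        # b) lookup miss: a brand new customer
        d_customer[rec["cust_id"]] = {"name": rec["name"], "city": rec["city"]}
    elif (current["name"], current["city"]) != (rec["name"], rec["city"]):
        # c) non-key fields changed: move the old version to the history table,
        #    then replace it in the main table (a timestamp supports repeated changes)
        d_customer_hist.append({**current, "cust_id": rec["cust_id"],
                                "valid_until": date.today()})
        d_customer[rec["cust_id"]] = {"name": rec["name"], "city": rec["city"]}
    # otherwise: unchanged record, nothing to do

print(d_customer)
print(d_customer_hist)
print(rejects)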
Implementation
Loading customers - ETL process implementations in various environments:
- Teradata MultiLoad and an FTP shell script to load the customers extract

Data allocation

Goal
Populate data for a daily sales report which indicates a profit margin for each invoice in the
data warehouse.
This means that it will be feasible to get information on how much revenue is generated
by each invoice line.
Financial background
In absolute terms the profit margin can be illustrated with the following expression:
Profit margin = sales amount - costs - sales deductions (discounts) - rebates
Net profit margin = profit margin - taxes
Sales amount is the gross total sales figure listed on an invoice and paid by a customer.
Sales deductions are discounts given during a sales transaction (listed on an invoice).
Costs include variable and fixed costs (provided on a monthly and yearly basis).
Rebates and customer bonuses are usually given to a customer and calculated on a monthly,
quarterly and yearly basis.
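
A small worked example with invented figures (taxes are assumed here to be a flat rate applied to the margin):

# Worked example of the expressions above, all amounts in the invoice currency.
sales_amount = 1000.00      # gross total listed on the invoice
costs = 600.00              # variable + fixed costs attributed to the invoice
sales_deductions = 50.00    # discount listed on the invoice
rebates = 30.00             # customer rebate allocated to the invoice
tax_rate = 0.20             # assumed flat tax rate

profit_margin = sales_amount - costs - sales_deductions - rebates
net_profit_margin = profit_margin * (1 - tax_rate)
print(profit_margin, net_profit_margin)   # 320.0 256.0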
Data allocation concept
Data allocation (a technique also referred to as filling gaps) is useful when dealing with data
which has different levels of detail (granularity) and there are gaps for some measures.
In data warehousing systems, the allocation technique is in many cases compulsory and is
used widely in order to get a consistent and complete set of data.

The concept of data allocation is closely related to the granularity of the data. In data
warehousing, data granularity refers to the level of detail in a given fact table. The tables
below illustrate various levels of granularity.
Coarse-grained data (low granularity)
Date value
2007 1000
2008 2000
2009 1500
...

Fine-grained (high granularity)
Date value
20080101 8
20080102 15
20080103 12
20080107 14
20080109 11
...
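
The relationship between the two levels is simple aggregation; rolling the fine-grained sample rows above up to a yearly grain might look like this:

# Roll fine-grained (daily) rows up to coarse-grained (yearly) rows.
fine_grained = [
    ("20080101", 8), ("20080102", 15), ("20080103", 12),
    ("20080107", 14), ("20080109", 11),
]

coarse_grained = {}
for date_key, value in fine_grained:
    year = date_key[:4]
    coarse_grained[year] = coarse_grained.get(year, 0) + value

print(coarse_grained)   # {'2008': 60}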


Sample measures that are very often allocated in a data warehouse are costs, operational forecasts,
sales plans, customer rebates and bonuses, etc.

There are two approaches to data allocation:
- Dynamic allocation (weighted or proportional allocation) - values are allocated using calculated
subtotals of another value. The weighted type of allocation is often used in real-life data warehouse
environments; a short sketch follows after this list. Sample uses of dynamic allocation: designating portions
of a budget pool, allocating manufacturing costs to products, etc.
- Fixed allocation - which means that there is a constant value assigned to all records included in the
allocation group. Be aware that this approach might be risky and confusing, as those values cannot be
summarized. Sample uses of fixed allocation are storing values that do not change often (for example
a credit card limit) or cannot be allocated dynamically.
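
A small sketch of the weighted (dynamic) approach mentioned above, with invented figures: a monthly variable-cost total for a product group is spread over the group's invoice lines in proportion to each line's share of the sales amount, which also yields a profit margin per line.

# Dynamic (weighted) allocation: spread a cost total over invoice lines
# in proportion to each line's share of the group's sales.

monthly_variable_costs = 900.00   # total assigned per year, month and product group

invoice_lines = [
    {"invc_line": 1, "sales_amount": 500.00},
    {"invc_line": 2, "sales_amount": 300.00},
    {"invc_line": 3, "sales_amount": 200.00},
]

group_sales = sum(line["sales_amount"] for line in invoice_lines)

for line in invoice_lines:
    weight = line["sales_amount"] / group_sales
    line["allocated_cost"] = round(monthly_variable_costs * weight, 2)
    line["profit_margin"] = line["sales_amount"] - line["allocated_cost"]

for line in invoice_lines:
    print(line)   # allocated costs 450.0, 270.0, 180.0 - they sum back to 900.0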

It is also important to keep in mind that in some business cases allocation is unsuitable. Prior to using
allocation it is necessary to analyze the data thoroughly and make sure it fits into the business logic.
Scenario details
The company's data warehouse stores the sales data down to the invoice line level of detail
and the costs which are calculated on a monthly (variable costs) and quarterly basis (fixed
costs).

- The variable costs total value is assigned per year, month and product group.
- The fixed costs figure is a grand total assigned per year and month.

The aim is to compare revenue to fixed and variable costs in all time dimension levels
available.

The source data has the following table structure:
Date_id;invc_head;invc_line;prod_id;prod_grp;cust_id;quantity;price;sales_amount
Solution outline
The data allocation ETL process will be realized in a few steps:
1. Load a technical invoices table - the table contains all data related to the invoices,
including gross sales and net sales
2. Load updated monthly and yearly costs into a separate costs table
3. Create another technical table which assigns importance levels and the following
figures, populated using the fixed allocation mechanism mentioned above for
groups of data records: variable costs, fixed costs, sales invoice total
4. Load the DW invoices table - with cost figures allocated accordingly and the calculated
profit margin

Solutions and sample implementations
- Data allocation in Pentaho Data Integration - sample ETL processing in PDI based on
production/manufacturing data
- For further analysis, please also refer to the Cognos measure allocation example. Cognos Business
Intelligence applications provide an automated, built-in mechanism to implement the data allocation
technique.