
ABC

Datawarehouse Modeling and Design

Unit 1.

An Introduction to Data Warehousing

What is Data warehousing?


According to Bill Inmon, known as the father of data warehousing, a data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management decisions.

- Subject-oriented means that all relevant data about a subject is gathered and stored as a single set in a useful format.
- Integrated refers to data being stored in a globally accepted fashion with consistent naming conventions, measurements, encoding structures, and physical attributes, even when the underlying operational systems store the data differently.
- Non-volatile means the data warehouse is read-only: data is loaded into the data warehouse and accessed there.
- Time-variant means that data keeps getting added as time goes on, with time being the most important dimension.

Data warehousing is a concept. It is a set of hardware and software components that can be used to better analyze the massive amounts of data that companies are accumulating, in order to make better business decisions. Data warehousing is not just the data in the data warehouse, but also the architecture and tools to collect, query, analyze and present information.

Unit 1.1

Data warehousing concepts

Operational vs. informational data: Operational data is the data you use to run your business. This data is what is typically stored, retrieved, and updated by your Online Transactional Processing (OLTP) system. An OLTP system may be, for example, a reservations system, an accounting application, or an order entry application.

Informational data is created from the wealth of operational data that exists in your business and some external data useful to analyze your business. Informational data is what makes up a data warehouse. Informational data is typically:

- Summarized operational data
- De-normalized and replicated data
- Infrequently updated from the operational systems
- Optimized for decision support applications
- Possibly "read only" (no updates allowed)
- Stored on separate systems to lessen impact on operational systems

Ver 1.10


A data warehouse is a "subject-oriented, integrated, non-volatile, time variant collection of data in support of management decisions" [Inm]. The end-users of a data warehouse are usually business analysts, as distinct from field personnel or call takers.

Question: What do you think is the skill profile of the data warehouse end-user?

Characteristic | Operational | Decision support
Data Content | Current values | Archival, summarized, calculated data
Data Organization | Application by application | Subject areas across enterprise
Nature of Data | Dynamic | Static until refreshed
Data Structure & Format | Complex; suitable for operational computation | Simple; suitable for business analysis
Access Probability | High | Moderate to low
Data Update | Updated on a field-by-field basis | Accessed and manipulated; no direct update
Usage | Highly structured repetitive processing | Highly unstructured analytical processing
Response Time | Sub-second to 2-3 seconds | Seconds to minutes

Source: [STG]

Question: Do the descriptions under "Data Structure & Format" fit in with the skill profiles of the respective end-users?

A data mart is a scaled-down deployment of a data warehouse that contains data focusing on a departmental user's analytical requirements. For example, the Ohio-based Huntington Bank Corporation set up a data mart for its general ledger system, to get the ledger system's functional information to the bank's financial analysts and budget coordinators quickly.

Data mining is the process of examining data for trends and patterns that might have evaded human analysis. For example, Shopko's Sunday circulars contained coupons advertising health and beauty aids, consumables, and household chemicals, which were all located on the left-hand side of the stores. Shopko's data mining exercise revealed that people who were coming in to shop gravitated to the left-hand side of the store for the promotional items and were not necessarily shopping the whole store. Consequently, it added apparel promotions to the Sunday circulars.

An On-Line Analytical Processing (OLAP) application is intended to give end-users the ability to perform any business logic and statistical analysis that is relevant. This analysis must happen fast, i.e., it must deliver most responses to users within about five seconds, with the simplest analyses taking no more than one second and very few taking more than 20 seconds.

Multidimensional databases are non-relational DBMS products that are specialized for the kinds of queries used in data warehouses. This is in contrast to using specialized analysis tools that run on top of a traditional RDBMS.

What is the ROI for a data warehouse? A recent study [Fis] of 45 major companies by the International Data Corporation found an average three-year return on investment in data warehouse systems of 401%. More instructive is the very wide range of returns reported by the companies, from 16,000 percent to minus 1,857 percent.
Moral: data warehousing is not a silver bullet; use with care!

Multi-dimensional data structures can be implemented with multidimensional databases or extended RDBMSs. Relational databases can support this structure through specific database designs (schema), such as "star-schema", intended for multi-dimensional analysis, and highly indexed or summarized designs. These structures are sometimes referred to as relational OLAP (ROLAP)-based structures.

Metadata/Information Catalogue: Metadata describes the data that is contained in the data warehouse (e.g. data elements and business-oriented descriptions) as well as the source of that data and the transformations or derivations that may have been performed to create the data element.

Unit 1.2

Benefits of Data Warehousing

A well designed and implemented data warehouse can be used to:

- Understand business trends and make better forecasting decisions
- Bring better products to market in a more timely manner
- Analyze daily sales information and make quick decisions that can significantly affect your company's performance

Data warehousing can be a key differentiator in many different industries. At present, some of the most popular data warehouse applications include:

- sales and marketing analysis across all industries
- inventory turn and product tracking in manufacturing
- category management, vendor analysis, and marketing program effectiveness analysis in retail
- profitable lane or driver risk analysis in transportation

Unit 1.3

Datawarehousing Application Class: How it has evolved

Throughout the history of systems development, the primary emphasis has been given to the operational systems and the data they process. But the fundamental requirements of the operational and analysis systems are different: the operational systems need performance, whereas the analysis systems need flexibility and broad scope. It has rarely been acceptable to have business analysis interfere with and degrade performance of the operational systems.

Data warehousing has quickly evolved into a unique and popular business application class. Early builders of data warehouses already consider their systems to be key components of their IT strategy and architecture. The source inputs for building a data warehouse application are listed below.

1. Data from legacy systems.

In the 1970's virtually all business system development was done on IBM mainframe computers using tools such as Cobol, CICS, IMS, DB2, etc. The 1980's brought in the new mini-computer platforms such as AS/400 and VAX/VMS. The late eighties and early nineties made UNIX a popular server platform with the introduction of client/server architecture. By some estimates, more than 70 percent of business data for large corporations still resides in the mainframe environment.

2. Extracted data from micro desktop databases.


In recent times advanced users frequently use desktop database programs that allow them to store and work with the information extracted from the legacy sources. Many desktop reporting and analysis tools are increasingly targeted towards end users and have gained considerable popularity on the desktop.

A downside to this is the difficulty in sharing analyses with others. For example, during budgeting, one user (say the boss) may create analysis models (say allocation rules) that are to be used by all others. The first user then generates the final output by putting these analyses together. Furthermore, the semantics of the data may need to be standardized before letting it out to the users. In a desktop environment, this may be nearly impossible.

As the data is stored on disparate systems, it is very difficult to ensure that updates to the data are communicated to all users. For example, say sales data comes in, and one person sends brand-wise summaries to some key users, who then forward them to their subordinates. Some hours after that it is realized that data from one of the warehouses was missed out, and revised reports are sent. Result: different people working on different versions of the same data. Unnecessary reconciliation issues crop up later.

3. Decision-Support and Management Information System (MIS)

The last category of analysis systems has been decision support systems and executive information systems. Decision support systems tend to focus more on detail and are targeted towards lower to mid-level managers. Executive information systems have generally provided a higher level of consolidation and a multi-dimensional view of the data, as high level executives need the ability to slice and dice the same data more than the ability to drill down to review the data detail. This category is quite close to data warehousing applications, with the following distinguishing characteristics.

These systems have data in descriptive standard business terms, rather than in cryptic computer field names. Data names and data structures in these systems are designed for use by non-technical users.

Data warehousing applications are prominent today because the key technology is available, hardware prices are down, good server software exists, internet applications are available and, most importantly, lots of tools are available too.

Unit 1.4

Self Review through a Case study.

A large grocery chain has around 500 large stores in 3 states. It also has many departments. Each store deals with around 60,000 products. 40,000 products are bought from external vendors. The remaining 20,000 are prepared by different departments. Each product has a product code.

1. Which is the primary key?
2. What is the place for data collection?
3. What are the different business activities?
4. What will the management be interested in?
5. What should be the business goal?
6. What is the grain?
7. What are the business measurements for the fact table?
8. Give an approximate database size, and the size of the fact table.

After answering the above questions, attempt the following conceptual questions.

1. Define grain statement.
2. Define measure.
3. Find the difference between OLTP and OLAP. Supply two SQL statements to justify both systems.
4. Justify the importance of time with respect to both OLTP and OLAP.
5. Do OLAP and data warehousing go together?


Data Warehouses: Modeling and Design

The dimensional model


Traditional normalized database designs are inappropriate for data warehouses for 2 reasons:

- DSS processing can involve accessing hundreds of thousands of rows at a time across several tables. Complex joins can seriously compromise performance.
- Usage/access paths in an OLTP environment are known a priori. In DSS, usage is very unstructured; users often decide what data to analyze moments before they request it, and applications cannot be hardcoded for a particular schema.

Data modeling is a useful design tool because it allows automatic generation of normalized database schema from an ER diagram. Because traditional normalized database designs are inappropriate, traditional data modeling is also inappropriate as a design tool. It continues to be a useful tool for modeling and understanding business information (the business essence) in a technology-independent way, and provides a foundation for mapping the data in the operational data stores to that in the data warehouse.

Database designs for data warehouses follow a star schema. There are one or more central fact tables, each "surrounded" by several dimension tables that provide the foreign keys that define each fact table row. The fact table is like a transaction table while the dimension tables are like master tables. For example, if the attendance of a student in a course is a "transaction," the associated dimensions would be Course, Instructor, Student and Course_date. For each such "transaction", we might like to store no_of_days_attended; this would then be the fact of the transaction. Note that Date_id is actually a foreign key. Having a separate table called time is essential for performing the kind of temporal analysis that is typically required in a business context.

[Figure: star schema for course attendance. The central Attendance Fact table (Course_id, Student_id, Date_id, Instructor_id, #of_days_attended, lab_days_attended) is surrounded by the Course, Instructor, Student and Time dimension tables.]
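The attendance star schema can be tried out with a small script. The following is only a minimal sketch, assuming Python's built-in sqlite3 module; the table names, column names and sample rows are hypothetical and are not part of the course material. It builds the four dimension tables and the fact table, then asks for attendance facts by course for one quarter.

```python
import sqlite3

# Hypothetical names and sample data, modelled on the attendance example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE course     (course_id INTEGER PRIMARY KEY, course_name TEXT);
CREATE TABLE student    (student_id INTEGER PRIMARY KEY, student_name TEXT);
CREATE TABLE instructor (instructor_id INTEGER PRIMARY KEY, instructor_name TEXT);
CREATE TABLE time_dim   (date_id INTEGER PRIMARY KEY, day TEXT, quarter TEXT, year INTEGER);

-- Fact table: one foreign key per dimension plus the numeric facts.
CREATE TABLE attendance_fact (
    course_id           INTEGER REFERENCES course(course_id),
    student_id          INTEGER REFERENCES student(student_id),
    date_id             INTEGER REFERENCES time_dim(date_id),
    instructor_id       INTEGER REFERENCES instructor(instructor_id),
    no_of_days_attended INTEGER,
    lab_days_attended   INTEGER
);

INSERT INTO course     VALUES (1, 'SQL Basics'), (2, 'Data Modeling');
INSERT INTO student    VALUES (100, 'Asha'), (101, 'Ravi');
INSERT INTO instructor VALUES (7, 'Rao');
INSERT INTO time_dim   VALUES (10, '1997-01-15', '1Q1997', 1997),
                              (11, '1997-04-10', '2Q1997', 1997);
INSERT INTO attendance_fact VALUES
    (1, 100, 10, 7, 3, 1),
    (1, 101, 10, 7, 2, 0),
    (2, 100, 10, 7, 4, 2),
    (1, 100, 11, 7, 5, 1);
""")

# Aggregated facts broken down by a dimension attribute for one time period.
rows = conn.execute("""
    SELECT c.course_name, t.quarter,
           SUM(f.no_of_days_attended), SUM(f.lab_days_attended)
    FROM attendance_fact f, course c, time_dim t
    WHERE f.course_id = c.course_id
      AND f.date_id = t.date_id
      AND t.quarter = '1Q1997'
    GROUP BY c.course_name, t.quarter
    ORDER BY c.course_name
""").fetchall()

for row in rows:
    print(row)
```

The fact table carries nothing but keys and numbers; every descriptive attribute used for grouping or filtering lives in a dimension table.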

The driver of data warehouse design is the nature of the standard data warehouse query, which is "Give me [aggregated] facts broken down by dimensions D1 and D2 for such-and-such time period." For example, we might be interested in looking at attendance by Course by Day, or the sum of attendances by Course by Year, by Student by Quarter, and so on. This translates into SQL which looks something like this:

    select D1.attrib1, D2.attrib2, sum(F.fact1), sum(F.fact2)
    from fact F, dimension1 D1, dimension2 D2, time T
    where F.dim1_key = D1.key
      and F.dim2_key = D2.key
      and F.timekey = T.key
      and T.quarter = '1Q1997'
    group by D1.attrib1, D2.attrib2
    order by D1.attrib1, D2.attrib2

Exercise: Write an SQL that reveals the number of attendances from SBU1 by Course by Year.

Note that time is conceptually just another dimension. A dimensional constraint or filter such as T.quarter = '1Q1997' is called an application constraint.

Exercise: You run a grocery store chain. Your sales fact table records, for each sale, the UPC, the date and time, store id, details of promotional offers on the product sold, and of course the sale quantity and selling price. Imagine the kinds of temporal analyses you would like to do with your sales data. What attributes would you like the Time dimension table to have? How large could this dimension table possibly get?

Browsing is the activity of exploring a single dimension table, prior to firing the template query above, with a two-fold purpose:

(i) choosing attributes for the select clause of the query; this might be done by simply dragging the attribute name onto a graphic representing the template query or report.
(ii) choosing application constraints for selecting a subset of rows of the table for the query.

Consider the completely denormalized product table, with each row containing information about product, product category, and package description. The browsing activity might result in the application constraint package_desc = 'TetraPak' after a select package_desc where category = 'beverage'.

Drilling down is the action of dragging an attribute name (from a dimension table) onto an existing report.

The size of a dimension table is invariably a tiny fraction of the size of the fact table. Besides, a data warehouse is updated only once a day, and it is only on a small fraction of these days that a dimension table is ever updated. The dimension tables are thus left completely unnormalized. A policy of having completely unnormalized dimension tables allows graphical browsing and automatic SQL generation for all standard user queries of the kind mentioned above. It also pays to have as many descriptive or qualifying attributes for each dimension as can be imagined, so that the end-user can set a variety of application constraints. (Consider, for example, an analyst who wants to know how the sale of paints on Holi (the festival) days differs from sales on other days.)

There is a subtle difference between a dimension and an entity. "Time" is a dimension, but is associated with a great many entities, as shown in the accompanying figure.

[Figure: the Time dimension and its associated entities. Labels recoverable from this part of the figure: Day, Month, Quarter, Year.]
For convenience, one of the hierarchies (the one most commonly used in queries) is usually designated the primary dimension, and every other hierarchy (for example, one running through Week, 4-week period and 13-week period, labels that also appear in the figure) a secondary dimension. Each level in a hierarchy is said to roll up to the next level (though it is apparent that roll-ups are not always uniquely defined).

Dimensional modeling, then, attempts to depict the facts and their associated dimensions, without explicitly depicting the entities and relationships that make up a dimensional hierarchy.

Exercise: You want to analyze course attendances as well as course nominations. Develop a star schema to do this. Would you choose to have one fact table or multiple? If you choose the former, will you have any new dimensions?

In general, a single fact table is a good idea where multiple types of facts share a subtype-supertype relationship with the bulk of the attributes being common. For example, transactions involving bank accounts have different flavors depending on whether it is a savings account or a checking account that is being operated. This difference in flavor manifests itself as mild variations in the composition of attributes that make up the transaction fact. With relatively unrelated fact types, on the other hand, the number of common attributes is small, so the preferred choice is to have a custom fact table for each fact type, and replicate common attributes in all custom fact tables to avoid joins.

Exercise: What happens to a table that represents a dimension which has subtypes?

It is common to have data points (facts) that are described as an adjective of your base data (e.g. actual sales and budgeted sales). Rather than anticipate all adjectives during warehouse design, we can create a partitioning dimension that holds only the adjectives and their descriptions (each adjective is called a partition), with the fact table row containing a column called just sales, along with a new foreign key called partition. This makes it easy to add a new type of fact such as "forecast sales": just insert a row in the partition dimension for the new adjective "forecast", and have the fact table foreign key partition indicate "forecast" for each record that represents a forecast.
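The partitioning dimension can be illustrated with a small sketch, again assuming Python's sqlite3 module. The table names (partition_dim, sales_fact), the key values and the figures are hypothetical. It shows that adding "forecast" needs one new dimension row and no change to the fact table structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Partitioning dimension: holds only the adjectives and their descriptions.
CREATE TABLE partition_dim (
    partition_key INTEGER PRIMARY KEY,
    adjective     TEXT,
    description   TEXT
);
-- The fact table has a single 'sales' column plus the partition foreign key.
CREATE TABLE sales_fact (
    product_key   INTEGER,
    time_key      INTEGER,
    partition_key INTEGER REFERENCES partition_dim(partition_key),
    sales         REAL
);
INSERT INTO partition_dim VALUES (1, 'actual', 'Actual sales'),
                                 (2, 'budget', 'Budgeted sales');
INSERT INTO sales_fact VALUES (1, 199701, 1, 120.0),
                              (1, 199701, 2, 100.0);
""")

# A new type of fact: one new dimension row, then ordinary fact rows.
conn.execute("INSERT INTO partition_dim VALUES (3, 'forecast', 'Forecast sales')")
conn.execute("INSERT INTO sales_fact VALUES (1, 199701, 3, 110.0)")

# Sales broken down by adjective, using the same query shape as any dimension.
totals = dict(conn.execute("""
    SELECT p.adjective, SUM(f.sales)
    FROM sales_fact f
    JOIN partition_dim p ON f.partition_key = p.partition_key
    GROUP BY p.adjective
"""))
print(totals)
```

Queries that mix partitions without grouping or filtering on the adjective would add actuals to budgets, so the partition constraint matters in every query.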

Unit 1.5

Issues in Dimensional Modeling

Big dimensions
Denormalization increases redundancy and, consequently, size. Sometimes a fully denormalized dimension table does become uncomfortably large. When this happens, the dimension may be normalized or snowflaked in the following manner. The dimension table stores one key for each level of the dimension's hierarchy. The lowest level key joins the dimension table to the central fact table. The rest of the keys join the dimension table to the corresponding higher-level tables. In a snowflake schema, every dimension is normalized in this manner. The word "snowflake" refers to the shape of the fully normalized schema when represented graphically.

Exercise: Snowflake the Time dimension. How will you handle multiple hierarchies? Do you think Time is a big dimension?

There is significant difference of opinion about whether snowflaking should be done at all, even for big dimensions. Ralph Kimball [Kim] believes it should never be done, while the Stanford Technology Group (an Informix company) believes it is useful, since in a dimension table of 500,000 rows, it is conceivable to save two kilobytes per row through normalization and hence save a full gigabyte of disk. Kimball compares the saving to the overall size of the typical warehouse, which is about 50 GB. Performance is also a factor to be considered: without snowflaking, a query that needs to analyze sales by brand will have to rummage through 500,000 product rows to filter out perhaps a score of brands. Another factor is the complexity of the structures as perceived by the end-user (a business analyst), and the associated loss of browsing ability (recall the definition of browsing given earlier). Finally, load programs and overall maintenance become more difficult to manage as the data model becomes more complex.

A dimension such as Customer may have many qualifying attributes, such as age, sex, and income_level, which are of interest to the business analyst not as specific values but as a combination of brackets. Instead of retaining these as individual attributes in the customer dimension table, we can replace these by a demographics_key that points to a row in a minidimension table as shown:

Demographics minidimension: demographics_key, age_bracket, income_bracket, sex, ....
Sales fact: time_key, customer_key, demographics_key, product_key, ....
Customer dimension: customer_key, demographics_key, first_name, address, ....

Such a schema speeds up queries with complex demographic conditions. Also note that not all combinations need to be stored in the minidimension table: a customer in the age group <10 is unlikely to be in a high-income category.

Question: Why make demographics_key a part of the fact table?

Dimensions change their characteristics, albeit slowly. When a customer changes her address, we can either modify the address in the customer's record in the customer dimension table (and lose historical information) or insert a new customer dimension record to capture history.
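The two ways of handling an address change can be compared side by side. This is a minimal sketch assuming Python's sqlite3 module; the customer_id business identifier, the helper names overwrite_address and add_new_version, and the sample data are all hypothetical additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY,  -- key referenced by fact rows
    customer_id  TEXT,                 -- hypothetical business identifier
    first_name   TEXT,
    address      TEXT
);
CREATE TABLE sales_fact (customer_key INTEGER, time_key INTEGER, amount REAL);
INSERT INTO customer_dim VALUES (1, 'C001', 'Meera', 'Pune');
INSERT INTO sales_fact   VALUES (1, 19970105, 50.0);
""")

def overwrite_address(customer_id, new_address):
    """Option 1: modify the record in place; the old address is lost."""
    conn.execute("UPDATE customer_dim SET address = ? WHERE customer_id = ?",
                 (new_address, customer_id))

def add_new_version(customer_id, new_address):
    """Option 2: insert a new dimension record; old facts keep the old key."""
    (first_name,) = conn.execute(
        "SELECT first_name FROM customer_dim WHERE customer_id = ? "
        "ORDER BY customer_key DESC LIMIT 1", (customer_id,)).fetchone()
    cur = conn.execute(
        "INSERT INTO customer_dim (customer_id, first_name, address) "
        "VALUES (?, ?, ?)", (customer_id, first_name, new_address))
    return cur.lastrowid

# The customer moves; later sales point at the new dimension record.
new_key = add_new_version("C001", "Mumbai")
conn.execute("INSERT INTO sales_fact VALUES (?, 19970610, 75.0)", (new_key,))

sales_by_address = dict(conn.execute("""
    SELECT c.address, SUM(f.amount)
    FROM sales_fact f JOIN customer_dim c ON f.customer_key = c.customer_key
    GROUP BY c.address
"""))
print(sales_by_address)
```

With the second option, sales made before the move still roll up under the old address. Calling overwrite_address instead would leave a single record and attribute every sale to the latest address.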


Exercise: You may be able to limit the number of changes of interest; for example, you might rarely analyze data that is over a year old, and it is rare for marital status to change more than three times a year. How could you handle this efficiently?

A demographics_key can also undergo changes over time. A customer may move into a higher income bracket, requiring a change in demographics_key. It is easy to see that demographics_key can be treated just like an attribute, albeit a more complex one.
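The demographics minidimension can be sketched as follows, assuming Python's sqlite3 module. The bracket values, the amount column and the sample rows are hypothetical. The query applies a demographic condition by joining the fact table to the small minidimension table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Minidimension: one row per stored combination of brackets.
CREATE TABLE demographics (
    demographics_key INTEGER PRIMARY KEY,
    age_bracket TEXT, income_bracket TEXT, sex TEXT
);
CREATE TABLE customer (
    customer_key INTEGER PRIMARY KEY,
    demographics_key INTEGER, first_name TEXT, address TEXT
);
CREATE TABLE sales_fact (
    time_key INTEGER, customer_key INTEGER,
    demographics_key INTEGER, product_key INTEGER, amount REAL
);
INSERT INTO demographics VALUES (1, '20-29', 'low',  'F'),
                                (2, '20-29', 'high', 'F'),
                                (3, '30-39', 'high', 'M');
-- Customer 10 has since moved to the high income bracket (key 2).
INSERT INTO customer VALUES (10, 2, 'Anu', 'Pune'), (11, 3, 'Raj', 'Delhi');
-- Each fact row records the demographics key that applied at sale time.
INSERT INTO sales_fact VALUES (19970101, 10, 1, 500, 20.0),
                              (19970102, 11, 3, 500, 35.0),
                              (19970103, 10, 2, 501, 60.0);
""")

high_income_sales = conn.execute("""
    SELECT d.age_bracket, SUM(f.amount)
    FROM sales_fact f
    JOIN demographics d ON f.demographics_key = d.demographics_key
    WHERE d.income_bracket = 'high'
    GROUP BY d.age_bracket
    ORDER BY d.age_bracket
""").fetchall()
print(high_income_sales)
```

Only three bracket combinations are stored here, which matches the remark that combinations that do not occur need not be stored.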

Unit 1.6

Aggregation strategies

Data-intensive queries access a large number of rows. Data-selective queries touch only a few rows, but contain complex and diverse selection criteria. Data warehouse end-users take more strategic decisions and hence execute a large number of data-intensive queries, unlike OLTP end-users. A typical data-intensive query would be "give me sales by region for each brand." Preaggregation strategies are required to reduce response time for such queries. A simple measure of the need for pre-aggregation is to compute the compression ratio, which is (the number of rows reported)/(the number of rows retrieved).

Full aggregation refers to the precomputation and storage of all possible aggregates (i.e., combinations of all levels of all dimensions).

Exercise: How would you estimate the increase in the overall database size for full aggregation? How many additional tables are necessary to store precomputed aggregates?

The level technique allows the answer to this question to be zero. With this technique, every aggregate fact is stored in the base fact table. Fact table rows that store product category sales by store by day would have the product_key point to rows in the Product table that identify a product category rather than an individual product. A level field in the Product table would have level = 'category' for such rows and level = 'base' for rows representing individual products.

Exercise: How would you determine the total number of fact and dimension tables for full aggregation using a different dimension table for each level and a different fact table for each aggregate?

The level technique and the separate table technique are of course two ends of the [full aggregation] spectrum. With a hybrid approach, some aggregates can be stored using the level technique while others can be stored in separate tables. It may be better to use the level technique for cases where new facts keep pouring in.

The alternative to precomputation is dynamic or SQL-based aggregation, which is meaningful for aggregates that are not computed often enough to warrant precomputation. For example, aggregate sales for product categories can be computed by selecting SUM(sale_value) and grouping by category. Aggregates for the complete product hierarchy (sales by sub-category, category, brand, etc.) can be computed by successive select statements that group at the respective level. Again, a hybrid approach is possible along this axis: for example, monthly totals may be precomputed (preaggregated) while yearly totals may be arrived at using SQL-based aggregation of monthly totals.

Aggregate navigators are tools that allow user applications to fire base-level SQL as though they were performing pure dynamic aggregation. The navigator maintains definitions of the current aggregation table structure, and uses this to rephrase the SQL to access the relevant aggregate tables instead. The algorithm typically looks for the smallest aggregate fact table whose associated dimension tables contain all the dimensional attributes required for the supplied query. Once this is done, the base-level fact and dimension table names are replaced with the aggregate fact and dimension table names.

Base-level query:

    select category_description, sum(qty)
    from sales_fact, product, store, time
    where { join conditions on product, store & time }
      and store.city = Cincinnati
      and time.day = 01011996
    group by category_description

Query as rephrased by the navigator:

    select category_description, sum(qty)
    from category_sales_fact, category_product, store, time
    where { join conditions on category_product, store & time }
      and store.city = Cincinnati
      and time.day = 01011996
    group by category_description
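The navigator's selection and rewrite steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CATALOG structure, the row counts, the third aggregate (category_month_sales_fact, month_time) and the function names choose_aggregate and rewrite are hypothetical, and the rewrite is a plain whole-word substitution of table names rather than real SQL parsing.

```python
import re

# Hypothetical catalog: for each fact table, its size and, per base dimension,
# the dimension table it uses and the attributes that table carries.
CATALOG = [
    {"fact": "sales_fact", "rows": 10_000_000, "dims": {
        "product": ("product", {"product_name", "brand", "category_description"}),
        "store": ("store", {"store_name", "city"}),
        "time": ("time", {"day", "month", "year"})}},
    {"fact": "category_sales_fact", "rows": 200_000, "dims": {
        "product": ("category_product", {"category_description"}),
        "store": ("store", {"store_name", "city"}),
        "time": ("time", {"day", "month", "year"})}},
    {"fact": "category_month_sales_fact", "rows": 5_000, "dims": {
        "product": ("category_product", {"category_description"}),
        "store": ("store", {"store_name", "city"}),
        "time": ("month_time", {"month", "year"})}},
]

def choose_aggregate(required):
    """Return the smallest fact table whose dimension tables carry every
    attribute in `required` (a dict: dimension name -> set of attributes)."""
    candidates = [
        t for t in CATALOG
        if all(dim in t["dims"] and attrs <= t["dims"][dim][1]
               for dim, attrs in required.items())
    ]
    return min(candidates, key=lambda t: t["rows"])

def rewrite(sql, chosen):
    """Replace base-level table names with the chosen aggregate's names."""
    base = CATALOG[0]
    mapping = {base["fact"]: chosen["fact"]}
    for dim, (name, _) in base["dims"].items():
        mapping[name] = chosen["dims"][dim][0]
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], sql)

base_sql = ("select category_description, sum(qty) "
            "from sales_fact, product, store, time "
            "where sales_fact.product_key = product.product_key "
            "and store.city = 'Cincinnati' and time.day = 01011996 "
            "group by category_description")

needed = {"product": {"category_description"}, "store": {"city"}, "time": {"day"}}
chosen = choose_aggregate(needed)
print(rewrite(base_sql, chosen))
```

Because the query constrains time.day, the monthly aggregate is ruled out even though it is the smallest table, and the daily category aggregate is chosen instead.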


Exercise: Can you use partitioning dimensions to handle adjectives for aggregates? How about "Average" itself as an adjective?
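The level technique described earlier in this unit can be tried out as follows. This is a minimal sketch assuming Python's sqlite3 module, with hypothetical product rows, keys and quantities. Category aggregates are stored in the same fact table as the base facts, so every query has to constrain the level field.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_key INTEGER PRIMARY KEY,
    description TEXT, category TEXT,
    level TEXT                       -- 'base' or 'category'
);
CREATE TABLE sales_fact (product_key INTEGER, store_key INTEGER, day TEXT, qty INTEGER);

INSERT INTO product VALUES
    (1,   'Orange juice 1L', 'beverage', 'base'),
    (2,   'Cola 500ml',      'beverage', 'base'),
    (3,   'Bread loaf',      'bakery',   'base'),
    (100, 'All beverages',   'beverage', 'category'),
    (101, 'All bakery',      'bakery',   'category');
INSERT INTO sales_fact VALUES (1, 1, '1996-01-01', 10),
                              (2, 1, '1996-01-01', 5),
                              (3, 1, '1996-01-01', 7);
""")

# Precompute category totals and store them in the base fact table itself.
conn.execute("""
    INSERT INTO sales_fact (product_key, store_key, day, qty)
    SELECT agg.product_key, f.store_key, f.day, SUM(f.qty)
    FROM sales_fact f
    JOIN product base ON f.product_key = base.product_key AND base.level = 'base'
    JOIN product agg  ON agg.category = base.category AND agg.level = 'category'
    GROUP BY agg.product_key, f.store_key, f.day
""")

# Reading a precomputed aggregate: constrain level = 'category'.
category_sales = conn.execute("""
    SELECT p.category, f.qty
    FROM sales_fact f JOIN product p ON f.product_key = p.product_key
    WHERE p.level = 'category'
    ORDER BY p.category
""").fetchall()

# Totals over base rows only, and over all rows (which double counts).
(base_total,) = conn.execute("""
    SELECT SUM(f.qty) FROM sales_fact f
    JOIN product p ON f.product_key = p.product_key
    WHERE p.level = 'base'
""").fetchone()
(unfiltered_total,) = conn.execute("SELECT SUM(qty) FROM sales_fact").fetchone()
print(category_sales, base_total, unfiltered_total)
```

No additional tables were needed, but a query that forgets the level constraint counts each sale once per stored level.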

Unit 1.7

Self review Case study on Dimensional Modeling

The following case study is to be read together with the solution given. At the end of it one must be very clear how to draw a dimensional model.

The key to creating an efficient MDDB application is thorough analysis of both the data and its users. After the data elements have been identified for reporting to the end users, the business entities will fall into distinct groups of variables with similar characteristics, or dimensions. For example, consider a sales organization which sells articles to different customers through different suppliers spread across various geographical locations. From the transaction data (base tables) of the organization, we can design a fact table which contains the denormalized sales data at the granularity required for creating an MDDB. This fact table stores the Units and the Dollars of the sales volume at a daily level. The fact table for this case can be outlined as follows:

Product | Geography | Supplier | Time | Units | Dollars

In this example, the first 4 columns represent the key determinants of the two facts (Products sold in Units and Dollars). In an MDDB model, the fields of the four dimensions must intersect to determine the values of the facts. To create the dimensions for the MDDB which is to be built from the base table, it is advisable to have dimension tables for each of the dimensions. These dimension tables should be used to derive the dimension fields, hierarchies and other attributes, if required. In this case, the dimensions could be derived as follows:

Product dimension: PROD GRP | PROD FAMILY | ARTICLE
Geography dimension: COUNTRY | REGION | SHOP
Supplier dimension: GROUP SUPP | SUPPLIER
Time dimension: YEAR | QTR | MONTH | DAY

(The sample rows shown under each dimension level in the original tables are not recoverable.)

Measure dimension:

Measure Code | Measure Name | Precision
Units | Products Sold in Units | Unit
Dollars | Products Sold in Dollars | K

Note that:

- This design complies with the classic STAR schema design of a data warehouse.
- The fact table contains a compound primary key, with one segment for each dimension, and additional columns of additive, numeric facts.
- Each dimension in the design has a defined hierarchy. The parent-child relationship (additive/semi-additive/non-additive) could be business driven. Alternatively, the dimension tables can be designed to have the attribute level indicator of each record.
- Each dimension contains (and is not restricted to) a key segment.

Deriving a dimension from a table in an MDDB is always advisable for the following primary reasons:

- Dimension size can be reduced by selecting only the valid dimension fields from the table, thus reducing the size of the MDDB.
- Modifications to the dimension hierarchies can be handled easily.
- It facilitates better maintenance of the cube build process.
- It facilitates standardization of dimensions across different MDDB applications in an organization.
- Descriptions and levels of the dimension fields can be stored in the tables.

Facts of the data have been clubbed into a Measure dimension to store different attributes of the facts and handle any changes in future (for example, Precision is one of the properties of the facts included here).

Unit 2.

Data Warehouse Architecture


Unit 2.1

Data Warehousing Architecture Model

The following components should be considered for a successful implementation of a data warehousing solution:

- [an] Open data warehousing architecture with common interfaces for product integration
- Data modeling with the ability to model star-schema and multi-dimensionality
- Extraction and transformation/propagation tools to load the data warehouse
- Data warehouse database server
- Analysis/end-user tools: OLAP/multidimensional analysis, report and query
- Tools to manage information about the warehouse (metadata)
- Tools to manage the data warehouse environment

Transforming operational data into informational data: Creating the informational data, that is, the data warehouse, from the operational systems is a key part of the overall data warehousing solution. Building the informational database is done with the use of transformation or propagation tools. These tools not only move the data from multiple operational systems, but often manipulate the data into a more appropriate format for the warehouse. This could mean:

- The creation of new fields that are derived from existing operational data
- Summarizing data to the most appropriate level needed for analysis
- Denormalizing the data for performance purposes
- Cleansing of the data to ensure that integrity is preserved
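The four kinds of transformation listed above can be seen together in a small sketch. This is plain Python with hypothetical field names, sample orders and a hypothetical transform helper; real transformation or propagation tools do far more, so treat it only as an illustration of the ideas.

```python
from collections import defaultdict

# Hypothetical operational extracts.
orders = [
    {"order_id": 1, "product_id": "P1", "qty": 2, "unit_price": 10.0,
     "date": "1997-01-05", "city": " cincinnati "},
    {"order_id": 2, "product_id": "P2", "qty": 1, "unit_price": 4.5,
     "date": "1997-01-05", "city": "CINCINNATI"},
    {"order_id": 3, "product_id": "P1", "qty": 3, "unit_price": 10.0,
     "date": "1997-01-06", "city": "Columbus"},
    {"order_id": 4, "product_id": "P9", "qty": 1, "unit_price": 1.0,
     "date": "1997-01-06", "city": "Columbus"},   # unknown product
]
products = {
    "P1": {"name": "Cola", "category": "beverage"},
    "P2": {"name": "Bread", "category": "bakery"},
}

def transform(orders, products):
    """Return (monthly summary keyed by month/city/category, rejected ids)."""
    summary = defaultdict(float)
    rejected = []
    for order in orders:
        product = products.get(order["product_id"])
        if product is None:
            # Cleansing: reject rows that break referential integrity.
            rejected.append(order["order_id"])
            continue
        city = order["city"].strip().title()         # cleansing: one spelling
        amount = order["qty"] * order["unit_price"]   # derived field
        category = product["category"]                # denormalizing
        month = order["date"][:7]                     # summarizing to month
        summary[(month, city, category)] += amount
    return dict(summary), rejected

summary, rejected = transform(orders, products)
print(summary)
print(rejected)
```

The rejected list matters as much as the summary: rows that fail cleansing need to be reported back rather than silently dropped.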


Even with the use of automated tools, however, the time and costs required for data conversion are often significant. Bill Inmon has estimated that 80% of the time required to build a data warehouse is typically consumed in the conversion process.

Data warehouse database servers, the heart of the warehouse: Once ready, data is loaded into a relational database management system (RDBMS) which acts as the data warehouse. Some of the requirements of database servers for data warehousing include: performance, capacity, scalability, open interfaces, multiple data structures, an optimizer that supports star-schema, and bitmapped indexing. Some of the popular data stores for data warehousing are relational databases like Oracle, DB2 and Informix, or specialized data warehouse databases like RedBrick and SAS. To provide the level of performance needed for a data warehouse, an RDBMS should provide capabilities for parallel processing on Symmetric Multiprocessor (SMP) or Massively Parallel Processor (MPP) machines, near-linear scalability, data partitioning, and system administration.

Data Warehousing Solutions - what is hot

Solution Area | Product | Vendor
Report and Query | Impromptu | Cognos
Report and Query | BrioQuery | Brio Technology
Report and Query | Business Objects | Business Objects Inc
Report and Query | Crystal Reports | Seagate Software
OLAP / MD analysis | DSS Agent/Server | Microstrategy
OLAP / MD analysis | DecisionSuite | Information Advantage
OLAP / MD analysis | EssBase | Hyperion Solutions
OLAP / MD analysis | Express Server | Oracle Corp.
OLAP / MD analysis | PowerPlay | Cognos Corporation
OLAP / MD analysis | Brio Enterprise | Brio Technology
OLAP / MD analysis | Business Objects | Business Objects
Data mining | Enterprise Miner | SAS Institute
Data mining | Clementine | SPSS
Data mining | Discovery Server | Pilot Software
Data mining | Intelligent Miner | IBM
Data mining | Darwin | Thinking Machines
Data Modeling | ER/Win | Platinum
Data extraction, transformation, load | DataPropagator | IBM
Data extraction, transformation, load | InfoPump | Platinum Technology
Data extraction, transformation, load | Integrity Data Re-Eng. | Vality Technology
Data extraction, transformation, load | Warehouse Manager | Prism Solutions
Data extraction, transformation, load | PowerMart | Informatica
Databases for data warehousing | DB2 | IBM
Databases for data warehousing | Oracle Server | Oracle
Databases for data warehousing | MS SQL Server | Microsoft
Databases for data warehousing | RedBrick Warehouse | Red Brick Corp.
Databases for data warehousing | SAS System | SAS Institute
Databases for data warehousing | Teradata DBS | NCR

• profitability analysis or risk assessment in banking
• claims analysis or fraud detection in insurance

Unit 3.

Issues in Datawarehousing Projects.

It is important to recognize the issues involved in building, and hence in managing, the data warehouse. Interestingly, some users may say that a data warehouse that is only 50 gigabytes is not a full-fledged data warehouse, and they may refer to it instead as a data mart. For a smaller company, 50 gigabytes or even much less can represent every relevant piece of information covering the last 10 years and can well represent a powerful data warehouse.

The issues a project leader must keep in mind: [sas]
1. Separating the data for business analysis from the operational data.
2. The logical transformation of the data, including data warehouse modeling and denormalization of the data.
3. The issues associated with physical transformation of the data.
4. The generation of summary views.

Unit 3.1

Carrying Data from OLTP to Warehousing data

The issue here is how and when to separate. The reasons for separating the operational data from analysis data have not significantly changed with the evolution of the data warehousing systems, except that now they are considered more formally during the data warehouse building process. In the analysis and design phase, building the data warehouse is done through a journey from the existing ER model. With advances in technology, today's data warehousing systems go beyond producing standard reports and support very sophisticated online analysis, including multi-dimensional analysis.

Data warehousing systems are most successful when data can be combined from more than one operational system. When the data needs to be brought together from more than one source application, it is natural that this integration be done at a place independent of the source applications. The primary reason for combining data from multiple source applications is the ability to cross-reference data from these applications. Nearly all data in a typical data warehouse is built around the time dimension.

The data warehouse system can serve not only as an effective platform to merge data from multiple current applications; it can also integrate multiple versions of the same application. For example, an organization may have migrated to a new standard business application that replaces an old mainframe-based, custom-developed legacy application. The data warehouse system can serve as a very powerful and much needed platform to combine the data from the old and the new applications. Designed properly, the data warehouse can allow for year-on-year analysis even though the base operational application has changed.

Operational systems are designed for acceptable performance for pre-defined transactions. For example, an order processing system might specify the number of active order takers and the average number of orders for each operational hour. Even the query and reporting transactions against the operational system are most likely to be predefined with predictable volume. Even though many of the queries and reports that are run against a data warehouse are predefined, it is nearly impossible to accurately predict the activity against a data warehouse.

Data is mostly non-volatile. This attribute of the data warehouse has many very important implications for the kind of data that is brought to the data warehouse and the timing of the data transfer. Many data warehousing projects have failed miserably when they attempted to synchronize volatile data between the operational and data warehousing systems.

In short, the separation of operational data from the analysis data is the most fundamental data warehousing concept. Not only is the data stored in a structured manner outside the operational system, businesses today are allocating considerable resources to build data warehouses at the same time that the operational applications are deployed.

Unit 3.2

Issue 2: Logical transformation of operational data

The data is logically transformed when it is brought to the data warehouse from the operational systems. The issues associated with the logical transformation of data brought from the operational systems to the data warehouse may require considerable analysis and design effort. The architecture of the data warehouse and the data warehouse model greatly impact the success of the project. This section reviews some of the most fundamental concepts of relational database theory that do not fully apply to data warehousing systems. Even though most data warehouses are deployed on relational database platforms, some basic relational principles are knowingly modified when developing the logical and physical model of the data warehouses.

Synchronized data in the source systems is important. For example, if the product codes are not standard across the source systems, and product attributes are stored across systems, it becomes impossible to maintain all the product attributes in the warehouse. This is one of the most important concerns to be taken care of before initiating a data warehousing project. While data scrubbing and cleaning can take care of the past data, these requirements become essential for continuous, efficient updates.

The data warehouse model needs to be extensible and structured such that the data from different applications can be added as a business case can be made for the data. A data warehouse project in most cases cannot include data from all possible applications right from the start. Many of the successful data warehousing projects have taken an incremental approach to adding data from the operational systems and aligning it with the existing data.

Data warehouse model aligns with the business structure: A data warehouse logical model aligns with the business structure rather than the data model of any particular application. The same logic can be applied to entities in an entity relationship diagram, which are used as the starting point for operational systems. Though the relevant points are being covered (i.e. narrow definition of entities in applications and the need to create one consolidated attribute base), the impact is not felt, perhaps because a direct comparison with ER modeling is not made. I feel we should also introduce the concept of an enterprise data model here.

Unit 3.2.1

De-normalization of data

A data modeler in an operational system would take a normalized logical data model and convert it into a physical data model that is significantly de-normalized. De-normalization reduces the need for database table joins in the queries. Some of the reasons for de-normalizing the data warehouse model are the same as they would be for an operational system, namely, performance and simplicity.

Static relationships in historical data: Another reason that de-normalization is an important process in data warehousing modeling is that the relationship between many attributes does not change in this historical data. An important example is the price of a product. The prices in an operational system may change constantly. Some of these price changes may be carried to the data warehouse with a periodic snapshot of the product price table. In a data warehousing system you would carry the list price of the product, as it stood when the order was placed, with each order, regardless of the selling price for this order. Operational systems maintain dynamic relationships between business entities, whereas a data warehouse system captures relationships between business entities at a given time.
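The list-price example above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumed names: the tables product, orders and dw_order_fact, their columns, and the load_order helper are hypothetical and do not come from the course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Operational (normalized) tables: the list price lives only in product.
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, list_price REAL);
    CREATE TABLE orders  (order_id INTEGER PRIMARY KEY, product_id INTEGER,
                          order_date TEXT, selling_price REAL);
    -- Warehouse (de-normalized) table: product name and list price are copied
    -- onto every order row as they stood when the order was placed.
    CREATE TABLE dw_order_fact (order_id INTEGER, order_date TEXT, product_name TEXT,
                                list_price_at_order REAL, selling_price REAL);
""")

def load_order(order_id, product_id, order_date, selling_price):
    """Record an order operationally, then copy it de-normalized into the warehouse."""
    conn.execute("INSERT INTO orders VALUES (?, ?, ?, ?)",
                 (order_id, product_id, order_date, selling_price))
    conn.execute("""
        INSERT INTO dw_order_fact
        SELECT o.order_id, o.order_date, p.name, p.list_price, o.selling_price
        FROM orders o JOIN product p ON p.product_id = o.product_id
        WHERE o.order_id = ?""", (order_id,))

conn.execute("INSERT INTO product VALUES (1, 'Widget', 10.0)")
load_order(100, 1, "2024-01-15", 9.5)
# The operational price changes; the warehouse row for order 100 is unaffected.
conn.execute("UPDATE product SET list_price = 12.0 WHERE product_id = 1")
load_order(101, 1, "2024-02-15", 11.0)

history = conn.execute(
    "SELECT order_id, list_price_at_order FROM dw_order_fact ORDER BY order_id"
).fetchall()
```

Each warehouse row keeps the list price that applied at order time, so historical queries need no join back to the constantly changing product table.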

Unit 3.3

Issue 3: Physical transformation of operational data

Historical data and the current operational application data are likely to have some missing or invalid values. Physical transformation of data homogenizes and purifies the data. These data warehousing processes are typically known as "data scrubbing" or "data staging" processes. Physical transformation includes the use of easy-to-understand standard business terms, and standard values for the data. A complete dictionary associated with the data warehouse can be a very useful tool. During these physical transformation processes the data is sometimes "staged" before it is entered into the data warehouse. The data may be combined from multiple applications during this "staging" step, or the integrity of the data may be checked during this process.

The terms and names used in the operational systems are transformed into uniform standard business terms by the data warehouse transformation processes. It is important to give a single physical definition of an attribute. As an attribute is defined physically for the data warehouse, it is essential to use meaningful data types and lengths. Use the standard data length and data type for each attribute everywhere it is used. A functional data dictionary can facilitate this consistent use of physical attributes.

The second important point is to use entity attribute values consistently. All attributes in the data warehouse need to be consistent in the use of predefined values. Different source applications invariably use different attribute values to represent the same meaning. These different values need to be converted into a single, most sensible value as the data is loaded into the data warehouse. Or, if the data is to be used by the same set of users, one may need to store the different attributes too, so that users do not see a disconnect between their operational and decision support systems.

A far more important problem is inconsistent definition and use of the entities themselves, e.g. some applications may be storing information at the price code detail level (encoded in the product code), while others may be storing at a planning code level (all price codes, variants, etc. are clubbed). Moreover, because of user habits, some of the codes used in the planning system may be outdated, and replaced by new codes in the sales system. Clubbing information from multiple sources then becomes a big problem.
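Converting source-specific attribute values into one standard value could look roughly like the following. It is a minimal sketch: the source system names, the status codes and the standardize function are invented for illustration, and a real mapping would come from the data dictionary.

```python
# Hypothetical mapping from (source system, raw code) to the warehouse's
# single standard value. Every entry here is an assumption for illustration.
STANDARD_VALUES = {
    ("order_entry", "O"): "OPEN",
    ("order_entry", "C"): "CLOSED",
    ("billing", "1"): "OPEN",
    ("billing", "0"): "CLOSED",
}

def standardize(source_system, raw_value):
    """Return the standard value, keeping the original alongside it so users
    can still relate warehouse data to their operational system."""
    key = (source_system, raw_value.strip().upper())
    if key not in STANDARD_VALUES:
        # Fail loudly rather than silently load an unmapped code.
        raise ValueError(f"unmapped value {raw_value!r} from {source_system}")
    return {
        "status": STANDARD_VALUES[key],
        "source_status": raw_value,
        "source_system": source_system,
    }

rows = [standardize("order_entry", "o"), standardize("billing", "1")]
```

Both rows end up with the same standard status even though the two source applications encode it differently, while the original code is retained next to it.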

Unit 3.4 Issue 4: Data with default and missing values to be interpreted consistently.
The data brought into the data warehouse is sometimes incomplete or contains values that cannot be transformed properly. It is very important for the data warehouse transformation process to use intelligent default values for the missing or corrupt data. It is also important to devise a mechanism for users of the data warehouse to be aware of these default values.

Some data attributes can easily be defaulted to a reasonable value when the original is missing or corrupt. Other values can be obtained by referencing other current data. For example, a missing product attribute such as unit-of-measure on an order entity can be obtained by accessing the current product database.

Some attributes cannot be filled by defaults for missing values. In fact, it may be dangerous to attempt to assign a default for certain types of missing values. A poor default may corrupt the data and lead to invalid analysis at a later stage. In these cases, it is safest to leave the missing values blank. In some cases, it may make sense to pick a specific value or symbol that indicates a missing value.

... February is not stored in the data warehouse. Also, missing data for part of the year prevents any meaningful year-on-year analysis. It is important to design a good system to log and identify data that is missing from the data warehouse. When a user runs a query against the data warehouse, it is essential to understand the population against which the query is run. Accurate and complete transformations help maintain the integrity of the data warehouse.
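The defaulting policy described in this unit might be sketched as below. The field names, the CURRENT_PRODUCTS lookup and the fill_order_row helper are assumptions made for illustration; the point is only that looked-up defaults are recorded so users can see them, while values that are unsafe to guess stay blank.

```python
# Stand-in for the current product database (hypothetical contents).
CURRENT_PRODUCTS = {"P1": {"unit_of_measure": "EA"}}

def fill_order_row(row):
    """Fill what can be safely derived, leave the rest blank, and log what was defaulted."""
    cleaned = dict(row)
    defaulted = []

    # A missing unit-of-measure can be looked up in the current product data.
    if not cleaned.get("unit_of_measure"):
        product = CURRENT_PRODUCTS.get(cleaned.get("product_id"))
        if product:
            cleaned["unit_of_measure"] = product["unit_of_measure"]
            defaulted.append("unit_of_measure")

    # A missing quantity is dangerous to guess, so it is left blank (None).
    if cleaned.get("quantity") in ("", None):
        cleaned["quantity"] = None

    # Make the defaults visible to warehouse users.
    cleaned["defaulted_fields"] = defaulted
    return cleaned

result = fill_order_row({"product_id": "P1", "unit_of_measure": "", "quantity": ""})
```

Keeping a defaulted_fields list on each row is one simple way to let analysts exclude or inspect defaulted values; a separate audit log would serve the same purpose.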

Unit 3.5

Issue 5: Mapping Data to reflect Business view

Summary views often are generated not only by summarizing the detail data but also by applying business rules to the detail data. For example, the summary views may contain a filter that applies the exact business rules for considering an order a sale, or a filter that applies the business rules for allocating a sale to a channel entity. The summary views can hide the complexities of the detail data from the end user for many, if not most, analysis tasks.

The business rules that are applied in generating summary views can be complex. These business rules may determine exactly what constitutes a sale, or they may determine how a sale is allocated to a sales or channel entity. In addition to applying the business rules while generating summary views, the data warehousing system may perform complex database operations such as multi-table joins. Product sales may be computed by joining the Sales, Invoice, and Product tables. The criteria to join these tables may be complex. While individuals mining data in the warehouse detail records need to understand all the complexities of business rules, most users can retrieve effective summary business information without fully understanding the detail data.

The single most important reason for building the summary views is the significant performance gains they facilitate. The summary views in a data warehouse provide multiple views into the same detail data. These views are predefined dimensions into the detail data. These views provide an efficient method for the analyst to link with the detail data when necessary.
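A summary view that joins Sales, Invoice and Product and applies a business rule can be sketched with sqlite3 as follows. The table layouts, the sample rows and the rule itself (an order counts as a sale only once its invoice is paid) are assumptions for illustration, not rules stated in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, brand TEXT);
    CREATE TABLE sales   (sale_id INTEGER PRIMARY KEY, product_id INTEGER, channel TEXT);
    CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, sale_id INTEGER,
                          amount REAL, status TEXT);

    INSERT INTO product VALUES (1, 'Acme'), (2, 'Zenith');
    INSERT INTO sales   VALUES (10, 1, 'retail'), (11, 1, 'retail'), (12, 2, 'direct');
    INSERT INTO invoice VALUES (100, 10, 50.0, 'PAID'),
                               (101, 11, 30.0, 'CANCELLED'),
                               (102, 12, 20.0, 'PAID');

    -- Assumed business rule: an order is a sale only when its invoice is PAID.
    CREATE VIEW sales_by_brand_channel AS
    SELECT p.brand, s.channel, SUM(i.amount) AS total_sales
    FROM sales s
    JOIN invoice i ON i.sale_id = s.sale_id
    JOIN product p ON p.product_id = s.product_id
    WHERE i.status = 'PAID'
    GROUP BY p.brand, s.channel;
""")

# End users query the summary without knowing the join criteria or the rule.
summary = conn.execute(
    "SELECT brand, channel, total_sales FROM sales_by_brand_channel ORDER BY brand"
).fetchall()
```

A plain SQL view only hides the complexity; to obtain the performance gains mentioned above, a warehouse would typically store the result as a precomputed summary table that is refreshed at load time.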

Unit 3.6 Issue 6: Selecting Tools to be used against the data warehouse
In most data warehousing projects, there is a need to select a preferred data warehouse access tool for the most active users. A small number of users generate most of the analysis activity against the data warehouse. The data warehouse performance can be tuned to the requirements of the tool appropriate for these active users. This tool can be used for training and demonstration of the data warehouse. A user can start with a low-level tool that is already familiar to him or her. After becoming familiar with the data warehouse he or she may be able to justify the cost and effort involved with using a more complex tool.

Unit 3.6.1

An Evaluation Checklist

The choice of an OLAP tool for a particular environment and application depends on the key requirements of the analysts, programmers and the end-users of the application. Before getting into the details of subjecting an OLAP tool to any evaluation criteria and performing any test, one needs to have the performance requirements very clear to guide the evaluation process. Some of the focus areas can be found out by having the following questions answered at the outset: [das, rak]

Data Access Features
• What would be the final data format?
• What are the common selection criteria to be used?
• Are mathematical operations (addition, subtraction etc.) critical to the selection?
• Are statistical operations (statistical functions: mean, average, standard deviation etc.) critical?
• Is the use of other data manipulation features (Sort, Discard, Filter, Shifting) critical?

Data Exploration Features
• Is graphical representation very important?
• Are traffic light analysis, key performance analysis, pattern matching, and lifetime analysis useful for the users?

Usability
• How much OLAP familiarity exists with the users?
• Do the users have any quantified performance expectations?
• Are expert options, user programming required?
• Is GUI a deciding factor?
• What is the tolerable online response time?

Size and Scalability
• How big is the current volume of data?
• What is the data growth potential in future?
• How fast is the data growing?
• Are the users platform-specific for the tool?
• What is the batch update window?
• What is the maximum number of dimensions for a single MDDB?
• What is the maximum aggregation level?

Critical Resources
• What are the critical system resources that need to be optimally used by the OLAP tool?
• What are the resources that are factors for evaluation?

Set of Features
• What features will be high priority for the users?
• Will the features help the users do the work more productively?

Limitations and Constraints
• What are the limitations and constraints that should be ruled out?
• Are there any problems in the current tool (if any)?

Unit 3.7

Performance considerations

Physical design for a data warehouse is concerned primarily with query performance and less with storage or update performance. Queries can be speeded up in two ways: by speeding up the retrieval of rows from an individual table, and by speeding up the multiple-table join process.

Speeding up retrieval. A bitmap index creates an array where the columns are the domain of the indexed field and the rows correspond to the rows of the table. If we indexed Marital_Status with values Single, Married and Other, we would have three columns in the bitmap. Each value in the array is an on/off bit that indicates the value of the field in the corresponding row. This indexing scheme speeds up row selection by the use of bitwise operations. Bitmap indexes are often used in conjunction with B-trees. Suppose you want to index product_category, for which there are 1000 distinct values. Suppose there are 100 leaves for the B-tree; each leaf will then represent a range of 10 values for the field being indexed (product_category). Then a bitmap can be maintained at each leaf; each such bitmap will have the same number of rows as before but only 10 columns. Bitmap-based techniques are suitable only for low-cardinality data (i.e., the indexed field must have no more than a couple of hundred distinct values). For medium and high cardinality data, a more traditional B-tree implementation is usually used. Not surprisingly, bitmap-based indexing schemes are costly to update, but this is okay: remember that updates are a rare phenomenon in a data warehouse!

The use of aggregates can relieve the pressure to build indexes. A query that does not constrain over a given dimensional level (e.g., "get sales by brand by region" does not constrain on the product dimension levels below the level "brand") can be redirected to a suitable aggregate table. Thus only one sort order on the master composite index on the fact table needs to be built. In the sales data warehouse, for example, this composite index could be time by product by store. Put differently, only queries that constrain on the lowest levels will use the base fact table and this composite index.

Speeding up joins. Traditional databases typically join two tables at a time; this can be disastrous for data warehouse queries. Worse, the performance varies dramatically with the order in which the tables are joined. A DBMS specialized for data warehouses will instead proceed as follows. First all dimensional constraints are evaluated and a list of respective primary keys is generated. These are combined to generate a sorted list of composite keys that is matched with the fact table index, which is itself a sorted list of composite keys. Note that this process can be parallelized, and indeed is, by leading database vendors.
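The bitmap scheme described under "Speeding up retrieval" can be illustrated with a toy implementation that stores one Python integer per distinct value and uses its bits as the on/off column. This is a teaching sketch, not how any particular DBMS implements bitmap indexes; the BitmapIndex class and the Region column are hypothetical.

```python
class BitmapIndex:
    """One bitmap (a Python int) per distinct value; bit i is set when row i has that value."""

    def __init__(self, values):
        self.n_rows = len(values)
        self.bitmaps = {}
        for row, value in enumerate(values):
            self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << row)

    def rows_matching(self, bitmap):
        """Translate a result bitmap back into a list of row numbers."""
        return [row for row in range(self.n_rows) if (bitmap >> row) & 1]

# Five table rows, indexed on two low-cardinality columns.
marital = BitmapIndex(["Single", "Married", "Other", "Married", "Single"])
region = BitmapIndex(["East", "East", "West", "West", "West"])

# Marital_Status = 'Married' AND Region = 'West' is a single bitwise AND.
married_west = marital.bitmaps["Married"] & region.bitmaps["West"]

# Marital_Status IN ('Single', 'Other') is a single bitwise OR.
single_or_other = marital.bitmaps["Single"] | marital.bitmaps["Other"]
```

Row selection reduces to AND/OR on whole bitmaps, which is why the technique works well for low-cardinality fields: the number of bitmaps grows with the number of distinct values.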

Another possibility is to reduce the number of dimensional tables by creating dimensions whose instances are actually combinations of two or more dimensions. If, for example, we were interested only in 200 brand-region combinations, a brand_region dimension table containing 200 rows could replace the individual brand and region tables.

Storage: aggregate explosion. Full aggregation is often dangerous in practice. Consider a sales data warehouse where there are 10000 products and 100 stores. On any given day, not all products are sold in all stores; perhaps only 1% of the possible 10^6 combinations actually occur, i.e. about 10000 fact rows per day. Yet, most products will be sold somewhere, and each store will sell something, so that both product-wise and store-wise [daily] aggregates are relevant. The number of these aggregates is itself 10100 (10000 product totals plus 100 store totals), so that the database size will double if these aggregates are precomputed. Adding a dimension (say customer) will clearly compound the problem. (Exercise: how?) The term compounded growth factor (CGF) [Pen] indicates the database size with full aggregation as a multiple of its size with no pre-aggregation, and is usually between 1.5 and 2.5 per dimension.

The solution to this problem is trial and error: use business analysis and query patterns to decide what aggregates are worth precomputing. Pareto's law can be expected to hold in this case: 80% of queries will utilize only 20% of all possible aggregates, so that it is possible to meet performance requirements adequately by constructing only a fraction of all possible aggregates.
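The arithmetic behind the "database size will double" claim can be checked in a few lines. The figures are the ones used in the example above; the variable names are only illustrative.

```python
products, stores = 10_000, 100
sparsity = 0.01  # share of product-store combinations that actually sell on a given day

# Base fact rows per day: 1% of the 10^6 possible combinations.
base_rows = round(products * stores * sparsity)

# Daily aggregates: one row per product (across stores) plus one per store (across products).
aggregate_rows = products + stores

# Size with these aggregates precomputed, as a multiple of the base size.
growth_factor = (base_rows + aggregate_rows) / base_rows
```

With these numbers base_rows is 10000, aggregate_rows is 10100 and growth_factor is 2.01, which is the doubling described in the text.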

Unit 3.8

Risks in Datawarehousing Projects.

• Definitions of data are always inconsistent across user types, upstream data systems and applications. No two users/systems agree to a common definition easily. This will happen in the RA/Design stage.
• Data ownership in data warehouses is very low, so the sanctity of the data is suspected most of the time. This comes out as a problem only when an application is developed to show the data to the users. This will happen in the Testing/Implementation stage.
• High dependency on upstream systems. Any delays in making the upstream interfaces ready affect the RA/Design/Development cycle. This will happen in RA/Design/Development.
• In a reporting application, problems that mostly originate in the upstream systems are attributed to the application. This gives rise to end-user dissatisfaction. This will happen in Acceptance/Testing/Implementation.
• Obtaining test data for data validation is a risk if real time data is of very high confidentiality. This will happen in the Development/Testing stage.
• Squeezed development cycle due to high visibility of the reporting application. The background process of collecting data from different sources is not visible to the users. This will happen in the Development/Implementation stage.
• Data-processing time needs to be minimized to ensure availability of the most up-to-date data worldwide at all times. This will happen in the Production/Implementation stage.
• The datamart/warehouse support team is located in one country, yet it is often expected to address issues from users located in different time zones. This will happen in the post-production/maintenance stage. [das]

Unit 3.9

Case study: Insurance

This case study will touch upon almost all concepts that have been introduced before. Think of an insurance company that insures automobiles, homes and individuals. A transaction is related either to policy formulation or to claims processing. An insurance company sells coverages, and the data warehouse is to be used to assess the profitability of the existing coverages. Also of importance is the efficiency of claims redressal.

Exercise: List some queries that would be appropriate for an insurance data warehouse to address.

Because a transaction is related either to policy formulation or to claims processing, we can have two fact tables, one for policy formulation and one for claims processing.

Question: What about having just one fact table? Conversely, what about having one fact table for each type of claim (small/large) or each type of policy (domestic/automobile/industrial)?

The policy creation and claims processing fact tables have the following structures (attributes in italics indicate those which will also appear in monthly policy and claims snapshots).

Policy Creation Fact Table
    TransactionDate: time dimension with alternate hierarchies
    EffectiveDate: SYNONYM of TransactionDate
    InsuredParty#: big, dirty dimension
    Employee#: (Agent/Broker/Rater/Underwriter)
    Coverage#: This is the company's "product." May be represented nicely as a subtype/supertype hierarchy.
    CoveredItem#: Each Coverage specifies some CoveredItems.
    Policy#: Quite possibly a degenerate dimension
    TransactionType: (Create/Rate/Underwrite/Cancel...)
    Fact: Set of transaction attributes functionally determined by the above. Attribute set may differ for each coverage.
    Additional policy creation snapshot attributes: Premium accrued, Premium due, No-claims bonus (derived)

Claims Processing Fact Table
    TransactionDate
    EffectiveDate
    InsuredParty#
    Employee#: the authorizer of the claim
    Coverage#
    CoveredItem#
    Policy#
    Claimant#: usually a dirty dimension
    Claim#: a codified description of a claim
    ThirdParty#: (Witness/Expert/Payee). This assumes only one predefined third party is involved.
    TransactionType: (Open/Set reserve/Inspect/Pay...)
    Fact: Set of transaction attributes functionally determined by the above
    Additional claims processing snapshot attributes: OutstandingClaims: number of outstanding claims at a point in time (semi additive).

Questions:
• InsuredParty is a dirty dimension, which means that multiple instances of InsuredParty may actually represent the same insured party. How can we clean it?
• Does CoveredItem need to be distinguished from Coverage? Or does a coverage automatically specify a CoveredItem too?
• Is CoveredItem a big dimension? Will CoveredItem figure in a fact table representing a monthly policy status snapshot?
• The Rater may assign a risk_grade to the policy during the Rate transaction. To which table/dimension does risk_grade belong?
• A degenerate dimension is one which has no separate dimension table. How might Policy be a degenerate dimension?
• The CoveredItem called "automobile" has attributes different from the CoveredItem called "computer." Are Automobile and Computer subclasses (subtypes) or instances of CoveredItem?
• When and how would you split the policy fact table to handle subtypes represented by different attribute sets of Fact? Would this require queries to always access more than one fact table?
• Should the attributes of a claim (as represented by the codified identifier Claim#) constitute a dimension or should they be part of the FACTS of the claims processing fact table? Note that the attribute set of a claim depends on the coverage scheme.

Coverages come in a huge number of flavors, just as do customers (insured parties). The set of attributes which are specified most often as a combination to browse coverages can be hived off into a minidimension. Examples of such attributes could be risk_level and market_segment.

The subset of rows in the policy creation fact table that represent policy cancellations may have no useful FACT attributes, since the purpose of each such row is merely to record the fact of cancellation of an existing policy. If we fragment the fact table horizontally so as to create a separate table to store just policy cancellations, we have what is called a factless fact table. Each row of a factless fact table contains only foreign keys corresponding to the relevant dimensions, but no fact attributes per se. A factless fact table can also be used to store information about coverages and covered items that did not have any buyers in, say, a given month. In the simplest case this table would have columns for the foreign keys Coverage#, CoveredItem# and Month. At the end of each month, a row would be inserted into this table for each Coverage/Covered item combination that did not attract any buyers (did not figure in any new policy). This is an example of a coverage table (not to be confused with insurance coverages).

Exercise: Write a SQL query to determine the number of coverage/covered item combinations that did not figure in any new policy in a given month.

Aggregates. Imagine a requirement to report, for each coverage, the total premiums received and total claim payments made by calendar month. This needs to be further aggregated to report total premium and claims payments made by month. Note that the requirement to "aggregate across policies" usually implies a requirement to also aggregate across other dimensions such as Employee and InsuredParty to get a useful result.

Exercises: How would you design the warehouse to handle a query along the lines of "How many of our customers have chosen money-back policies?" [Hint: this may require some changes to the operational system too.] If each precomputed aggregate is stored in a separate aggregate fact table, what is the maximum number of such aggregate fact tables required?

Workout: A sales and marketing data warehouse needs to be built for a manufacturer of hospital health care products. Salesmen are assigned territories, which roll up to districts, regions and areas. A product rolls up to subgroup, group and family, where a subgroup is defined by the assembly line that it rolls out from. (A single manufacturing facility may, of course, have more than one assembly line.) The products are typically bought by hospitals through buying groups to which the respective hospitals belong, though a hospital may sometimes buy through a direct contract or even through a buying group in which it is not a member. We need to be able to analyze these different types of sales. Design the data warehouse.
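Returning to the factless coverage table from the insurance case study, the sketch below shows one possible shape for it and a count query over it, using sqlite3. The table name unsold_coverage, its column names, the sample rows and the unsold_combinations helper are all hypothetical; this is one way the table could be queried, not a model answer prescribed by the course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Factless fact table: foreign keys only, no fact attributes.
    -- One row per coverage/covered-item combination with no buyers in a month.
    CREATE TABLE unsold_coverage (coverage_id INTEGER, covered_item_id INTEGER, month TEXT);
    INSERT INTO unsold_coverage VALUES
        (1, 10, '2024-01'), (1, 11, '2024-01'), (2, 10, '2024-01'),
        (1, 10, '2024-02');
""")

def unsold_combinations(month):
    """Count distinct coverage/covered-item combinations with no new policy in the month."""
    (count,) = conn.execute(
        """SELECT COUNT(*) FROM (
               SELECT DISTINCT coverage_id, covered_item_id
               FROM unsold_coverage
               WHERE month = ?)""",
        (month,),
    ).fetchone()
    return count
```

For the sample rows, unsold_combinations('2024-01') counts three combinations and unsold_combinations('2024-02') counts one. Because the table carries no measures, every question asked of it is a count or an existence test over its keys.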

Unit 3.10 Self-Review: A Case Study to arrive at a complete Datawarehouse Solution.

Data Warehousing for Finance Systems


About this case study:

Customer Name:         A leading computer manufacturer
Industry:              Computer Hardware and Software Manufacture
Project(s) Name:       Data Warehousing
Service offered:       RA/Design/Development/Implementation/Warranty
Focus Area:            Data Reporting for Finance and Operations
Technology:            RDBMS/OLAP
User Profile:          Finance Managers, Analysts, Planners and CFO office
Application Features:  • 24x7 Worldwide Access
                       • Executive Level Slicing/Dicing/Drilling
                       • Global application with regional granularity
                       • Seamless DSS to corp. changes
                       • 2 Data warehouses, 1 Datamart and 6 OLAP Cubes
                       • 4 Years Historical, Current and 2 years Forecast data available for analysis on OLAP
                       • Upstream/downstream interfaces
                       • Ad-hoc and canned reporting
                       • Bookmarking
                       • Business KPIs
                       • Robust security features
                       • Admin features
                       • Bulletin Board

Client Profile:
The client designs, manufactures and markets personal computers and related personal computing and communicating solutions for sale primarily to education, creative, consumer and business customers. It leads the area of computing in revolutionary products and innovative designs in all aspects.

Client's driver for the warehousing project:

Why did the customer have to undertake this project? What were his drivers? What did he want at the end of the day?

The OLAP Reporting Systems started as part of an initiative for meeting the analytical reporting needs of the Client's Finance & Operations executives through presentation of useful data in a timely and accurate manner. Finance executives at the Client's site needed key global and regional Actual, Plan and Forecast data for Trend Analysis, Visualization, Budgeting, Planning, and Modeling to support Decision-Making. The transaction systems did not segment or aggregate data for business analysis, so there was a need to have a common, consistent, fast and analysis-ready tool for the Finance users world-wide. Because the financial data is highly sensitive, a two layer user security was required to prevent users from accessing data outside their area as well as during the freeze period. Before this project, Finance data available on Mainframe systems was manually processed to produce some custom reports for top management. The access was restricted and limited to user expertise for data within a certain timeframe.

Unit *.

OLAP FUNDAMENTALS

In 1993, E.F. Codd & Associates published a white paper, commissioned by Arbor Software (now Hyperion Solutions), entitled 'Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate'. Dr Codd is, of course, very well known as a respected database researcher from the 1960s through to the late 1980s and is credited with being the inventor of the relational database model, but his OLAP rules proved to be controversial due to being vendor-sponsored, rather than mathematically based. [ola]

Basic Features
1. Multidimensional Conceptual View. Dr Codd believes this to be the central core of OLAP.

2. Intuitive Data Manipulation. Dr Codd prefers data manipulation to be done through direct actions on cells in the view, without recourse to menus or multiple actions.

3. Accessibility. In this rule, Dr Codd essentially describes OLAP engines as middleware, sitting between heterogeneous data sources and an OLAP front-end. Most products can achieve this, but often with more data staging and batching than vendors like to admit.

4. Batch Extraction vs Interpretive. This rule effectively requires that products offer both their own staging database for OLAP data and live access to external data. Today, this would be regarded as the definition of a hybrid OLAP, which is indeed becoming the most popular architecture, so Dr Codd has proved to be very perceptive in this area.

5. OLAP Analysis Models. Dr Codd requires that OLAP products should support all four analysis models that he describes in his white paper (Categorical, Exegetical, Contemplative and Formulaic). Perhaps Dr Codd was anticipating data mining in this rule?

6. Client Server Architecture. Dr Codd requires not only that the product should be client/server but that the server component of an OLAP product should be sufficiently intelligent that various clients can be attached with minimum effort and programming for integration. This is a much tougher test than simple client/server, and relatively few products qualify. Perhaps he was anticipating a widely accepted API standard, which OLE DB for OLAP is expected to become.

7. Transparency. This test is also a tough but valid one. Full compliance means that a user of, say, a spreadsheet should be able to get full value from an OLAP engine and not even be aware of where the data ultimately comes from. Like the previous feature, this is a tough test for openness.

8. Multi-User Support. Dr Codd recognizes that OLAP applications are not all read-only and says that, to be regarded as strategic, OLAP tools must provide concurrent access (retrieval and update), integrity and security.

Special Features
9. Treatment of Non-Normalized Data. This refers to the integration between an OLAP engine and de-normalized source data. Dr Codd points out that any data updates performed in the OLAP environment should not be allowed to alter stored de-normalized data in feeder systems. He could also be interpreted as saying that data changes should not be allowed in what are normally regarded as calculated cells within the OLAP database.

10. Storing OLAP Results: Keeping Them Separate from Source Data. This is really an implementation rather than a product issue. In effect, Dr Codd is endorsing the widely held view that read-write OLAP applications should not be implemented directly on live transaction data, and OLAP data changes should be kept distinct from transaction data.

11. Extraction of Missing Values. All missing values are cast in the uniform representation defined by the Relational Model.

12. Treatment of Missing Values. All missing values are to be ignored by the OLAP analyzer regardless of their source.
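The two missing-value features can be illustrated with a small sketch. It assumes, purely for illustration, that Python's None stands in for the uniform missing-value representation (much as NULL does in the relational model); the function names and the sample data are hypothetical and not taken from any OLAP product.

```python
from typing import Optional, Sequence

# Assumption: None is the single, uniform marker for a missing cell,
# playing the role that NULL plays in the relational model.
Cell = Optional[float]


def total(cells: Sequence[Cell]) -> Cell:
    """Sum the known cells; the result is missing only if every cell is missing."""
    known = [c for c in cells if c is not None]
    return sum(known) if known else None


def average(cells: Sequence[Cell]) -> Cell:
    """Average over known cells only, so a missing cell is not counted as zero."""
    known = [c for c in cells if c is not None]
    return sum(known) / len(known) if known else None


# Hypothetical quarterly sales with two missing quarters.
quarterly_sales = [120.0, None, 80.0, None]
sales_total = total(quarterly_sales)
sales_average = average(quarterly_sales)
```

Ignoring missing cells rather than treating them as zero matters most for averages: here the average is taken over two known quarters, not four.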

Reporting Features
13. Flexible Reporting. Dr Codd requires that the dimensions can be laid out in any way that the user requires in reports. We would agree, and most products are capable of this in their formal report writers. Dr Codd does not explicitly state whether he expects the same flexibility in the interactive viewers.

14. Uniform Reporting Performance. Dr Codd requires that reporting performance be not significantly degraded by increasing the number of dimensions or database size. Curiously, nowhere does he mention that the performance must be fast, merely that it be consistent. There are differences between products, but the principal factor that affects performance is the degree to which the calculations are performed in advance and where live calculations are done (client, multidimensional server engine or RDBMS). This is far more important than database size, number of dimensions or report complexity.

15. Automatic Adjustment of Physical Level. Dr Codd requires that the OLAP system adjusts its physical schema automatically to adapt to the type of model, data volumes and sparsity.

Dimension Control
16. Generic Dimensionality. Dr Codd takes the purist view that each dimension must be equivalent in both its structure and operational capabilities. However, he does allow additional operational capabilities to be granted to selected dimensions (presumably including time), but he insists that such additional functions should be grantable to any dimension.

17. Unlimited Dimensions & Aggregation Levels. Technically, no product can possibly comply with this feature, because there is no such thing as an unlimited entity on a limited computer. In any case, few applications need more than about eight or ten dimensions, and few hierarchies have more than about six consolidation levels.

18. Unrestricted Cross-dimensional Operations. Dr Codd asserts, and we agree, that all forms of calculation must be allowed across all dimensions, not just the 'measures' dimension. In fact, many products that use only relational storage are weak in this area. Most products with a multidimensional database are strong. These types of calculations are important if you are doing complex calculations, not just cross tabulations, and are particularly relevant in applications that analyse profitability.
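To make the idea of cross-dimensional calculation concrete, the following sketch computes one figure across the measures dimension (margin = revenue - cost) and one across the product dimension (each product's share of total revenue). The cube layout, the roll_up helper and the sample figures are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical cube: (product, region, measure) -> value.
cube = {
    ("laptop", "east", "revenue"): 500.0,
    ("laptop", "east", "cost"): 300.0,
    ("laptop", "west", "revenue"): 300.0,
    ("laptop", "west", "cost"): 200.0,
    ("desktop", "east", "revenue"): 200.0,
    ("desktop", "east", "cost"): 150.0,
}


def roll_up(cells, keep):
    """Aggregate the cube onto the dimension positions listed in `keep`."""
    out = defaultdict(float)
    for key, value in cells.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)


# Consolidate away the region dimension: (product, measure) -> value.
by_product = roll_up(cube, keep=(0, 2))
products = sorted({product for product, _ in by_product})

# Calculation across the measures dimension.
margin = {
    p: by_product[(p, "revenue")] - by_product[(p, "cost")] for p in products
}

# Calculation across the product dimension.
total_revenue = sum(by_product[(p, "revenue")] for p in products)
revenue_share = {p: by_product[(p, "revenue")] / total_revenue for p in products}
```

A profitability analysis typically needs both kinds of calculation at once, which is why this feature is singled out for such applications.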

An OLAP server can store data in three different ways: Multidimensional OLAP (MOLAP), Relational OLAP (ROLAP) and Hybrid OLAP (HOLAP).

MOLAP
MOLAP is a high performance, multidimensional data storage format. With MOLAP, data is stored on the OLAP server. MOLAP gives the best query performance, because it is specifically optimized for multidimensional data queries. MOLAP storage is appropriate for small to medium-sized data sets where copying all of the data to the multidimensional format would not require significant loading time or utilize large amounts of disk space.
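As a rough illustration of multidimensional storage, the sketch below copies facts into a dense three-dimensional array addressed by dimension position. It is a toy model under assumed dimension members and data, not the storage format of any particular OLAP server.

```python
# Hypothetical dimension members; a cell is addressed by one index per dimension.
products = ["laptop", "desktop"]
regions = ["east", "west"]
quarters = ["Q1", "Q2"]

# Dense 3-D array: cells[product][region][quarter]. Every combination gets a
# slot, even when empty, which is why large sparse data sets use so much space.
cells = [[[0.0 for _ in quarters] for _ in regions] for _ in products]


def load(facts):
    """Copy (product, region, quarter, amount) facts into the array."""
    for product, region, quarter, amount in facts:
        p = products.index(product)
        r = regions.index(region)
        q = quarters.index(quarter)
        cells[p][r][q] += amount


def cell(product, region, quarter):
    """Direct lookup by position: no scan over the fact rows is needed."""
    return cells[products.index(product)][regions.index(region)][
        quarters.index(quarter)
    ]


def slice_by_region(region):
    """Fix one dimension member and return the remaining 2-D slice."""
    r = regions.index(region)
    return {
        (p, q): cells[i][r][k]
        for i, p in enumerate(products)
        for k, q in enumerate(quarters)
    }


load([
    ("laptop", "east", "Q1", 100.0),
    ("laptop", "east", "Q2", 150.0),
    ("desktop", "west", "Q1", 70.0),
    ("laptop", "east", "Q1", 25.0),
])
east_slice = slice_by_region("east")
```

The load step is the copying cost mentioned above, and the zero-filled cells for desktop sales in the east show where disk space goes when the data is sparse.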

ROLAP
With ROLAP, data remains in the original relational tables. A separate set of relational tables is used to store and reference aggregation data. ROLAP is ideal for large databases or legacy data that is infrequently queried.

HOLAP
HOLAP combines elements from MOLAP and ROLAP. HOLAP keeps the original data in relational tables but stores aggregations in a multidimensional format. HOLAP provides connectivity to large data sets in relational tables while taking advantage of the faster performance of the multidimensional aggregation storage.
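The difference between ROLAP and HOLAP aggregation storage can be sketched with Python's built-in sqlite3 module. The table names, column names and sample rows are hypothetical, and an in-memory dictionary is used here only as a stand-in for a multidimensional aggregation store.

```python
import sqlite3

# Hypothetical detail rows: (product, region, amount).
FACTS = [
    ("laptop", "east", 100.0),
    ("laptop", "east", 25.0),
    ("laptop", "west", 150.0),
    ("desktop", "east", 70.0),
]


def build_relational(facts):
    """Detail data stays in a relational table in both ROLAP and HOLAP."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales_fact (product TEXT, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", facts)
    return conn


def add_rolap_aggregates(conn):
    """ROLAP style: aggregates live in a separate relational table."""
    conn.execute(
        "CREATE TABLE sales_by_product AS "
        "SELECT product, SUM(amount) AS total_amount "
        "FROM sales_fact GROUP BY product"
    )


def build_holap_aggregates(conn):
    """HOLAP style: aggregates are held in a multidimensional structure,
    here a dict keyed by (product, region)."""
    query = (
        "SELECT product, region, SUM(amount) "
        "FROM sales_fact GROUP BY product, region"
    )
    return {(product, region): total for product, region, total in conn.execute(query)}


conn = build_relational(FACTS)

add_rolap_aggregates(conn)
rolap_totals = conn.execute(
    "SELECT product, total_amount FROM sales_by_product ORDER BY product"
).fetchall()

holap_cube = build_holap_aggregates(conn)

# Drill-through to detail still goes to the relational table.
detail = conn.execute(
    "SELECT amount FROM sales_fact WHERE product = ? AND region = ? "
    "ORDER BY amount",
    ("laptop", "east"),
).fetchall()
```

In both cases the detail rows are never copied out of the relational tables; only the place where the pre-computed totals are kept differs.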


References and bibliography:

[Fis] L. Fisher (1996), "Along the Infobahn: Data Warehouses," in Strategy & Business, Booz, Allen and Hamilton, Inc.
[Guff] F. McGuff (1997), "Data Modeling for Data Warehouses," http://members.com/fmcguff/dwmodel
[Inm] W. H. Inmon (199?), Building the Data Warehouse, John Wiley, NY.
[In2] W. H. Inmon (1996), Creating the Data Warehouse Data Model from the Corporate Data Model, Prism Solutions Tech Topic Vol. 1 No. 2, Prism Solutions Inc., Sunnyvale, CA. Primary reference.
[Kim] R. Kimball (1996), The Data Warehouse Toolkit, John Wiley, NY.
[Mer] M. E. Meredith and A. Khader (1997), "Divide and Aggregate: Designing Large Warehouses," technical report, Miller Freeman Inc.
[STG] "Designing the Data Warehouse on Relational Databases," technical report, Stanford Technology Group, Inc. (an Informix Company).
[Pen] N. Pendse (1997), "Database explosion," Business Intelligence Ltd.
Data Warehousing - For Better Business Decisions, Anjaneyulu Marempudi (marempudi@msn.com)
[dash] Data Warehousing For Finance Systems, by Assis Dash, Infosys
[sas] SAS Institute website, http://www.sas.com
[ola] What is OLAP, by N. Pendse, Business Intelligence Limited, 2000
[dash/rakesh] Multidimensional database tool evaluation, by Assiss Das and Dr. Rakesh Agarwal.
