In data warehousing, a fact table consists of the measurements, metrics or facts of a business process. It is
often located at the centre of a star schema or a snowflake schema, surrounded by dimension tables.
Fact tables provide the (usually) additive values that act as independent variables by which dimensional
attributes are analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the
most atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as
"Sales volume by Day by Product by Store". Each record in this fact table is therefore uniquely defined by
a day, product and store. Other dimensions might be members of this fact table (such as location/region)
but these add nothing to the uniqueness of the fact records. These "affiliate dimensions" allow for
additional slices of the independent facts but generally provide insights at a higher level of aggregation (a
region contains many stores).
Example
If the business process is SALES, then the corresponding fact table will typically contain columns
representing both raw facts and aggregations in rows such as:
• $12,000, being "sales for New York store for 15-Jan-2005"
• $34,000, being "sales for Los Angeles store for 15-Jan-2005"
• $22,000, being "sales for New York store for 16-Jan-2005"
• $50,000, being "sales for Los Angeles store for 16-Jan-2005"
• $21,000, being "average daily sales for Los Angeles Store for Jan-2005"
• $65,000, being "average daily sales for Los Angeles Store for Feb-2005"
• $33,000, being "average daily sales for Los Angeles Store for year 2005"
"average monthly sales" is a measurement which is stored in the fact table. The fact table also contains
foreign keys from the dimension tables, where time series (e.g. dates) and other dimensions (e.g. store
location, salesperson, product) are stored.
All foreign keys between fact and dimension tables should be surrogate keys, not reused keys from
operational data.
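As a sketch of this layout (table and column names are illustrative, not taken from any particular system), a minimal star schema with integer surrogate keys might be declared as follows:

```python
import sqlite3

# Minimal star schema sketch: one fact table keyed entirely by surrogate
# foreign keys to its dimension tables (all names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim    (date_key    INTEGER PRIMARY KEY, full_date    TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, store_name   TEXT);

-- Grain: sales volume by day by product by store.
CREATE TABLE sales_fact (
    date_key     INTEGER REFERENCES date_dim(date_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    store_key    INTEGER REFERENCES store_dim(store_key),
    sales_amount REAL,
    PRIMARY KEY (date_key, product_key, store_key)  -- composite of all FKs
);
""")
```

The composite primary key here enforces the stated grain: at most one fact row per day, product, and store.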
The centralized table in a star schema is called a fact table. A fact table typically has two types of
columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a
fact table is usually a composite key that is made up of all of its foreign keys. Fact tables contain the
content of the data warehouse and store different types of measures, such as additive, non-additive, and semi-additive measures.
Measure types
• Additive - Measures that can be added across all dimensions.
• Non-additive - Measures that cannot be added across any dimension.
• Semi-additive - Measures that can be added across some dimensions but not others.
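The distinction can be illustrated with a small, invented data set: sales amounts are additive across every dimension, while an account balance is the classic semi-additive measure, since it sums across accounts on one date but not across dates:

```python
# Hypothetical rows: (date, account, balance). Balances are semi-additive:
# summing across accounts on one date is meaningful; summing across dates is not.
rows = [
    ("2005-01-15", "A", 100.0),
    ("2005-01-15", "B", 200.0),
    ("2005-01-16", "A", 110.0),
    ("2005-01-16", "B", 190.0),
]

# Valid: total balance held on 2005-01-15, summed across the account dimension.
total_on_day = sum(b for d, a, b in rows if d == "2005-01-15")

# Invalid: summing account A's balance across dates would double-count money;
# averaging over the time dimension is the usual aggregation instead.
avg_over_time = sum(b for d, a, b in rows if a == "A") / 2
```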
A fact table might contain either detail level facts or facts that have been aggregated (fact tables that
contain aggregated facts are often instead called summary tables).
Special care must be taken when handling ratios and percentages. One good design rule[1] is never to store percentages or ratios in fact tables but to calculate them in the data access tool. Store only the numerator and denominator in the fact table; these can be aggregated, and the aggregated values can then be used to calculate the ratio or percentage in the data access tool.
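A toy calculation shows why this rule matters: averaging stored ratios gives the wrong answer, while aggregating numerator and denominator first does not (the figures are invented for illustration):

```python
# Two stores: (returns, sales). Return rate = returns / sales.
facts = [(5, 100), (50, 500)]

# Wrong: store the per-row ratio in the fact table and average the ratios.
naive = sum(r / s for r, s in facts) / len(facts)        # (0.05 + 0.10) / 2

# Right: aggregate numerator and denominator separately, then divide
# in the data access tool.
correct = sum(r for r, s in facts) / sum(s for r, s in facts)   # 55 / 600
```

The naive average weights both stores equally, even though one has five times the sales volume of the other.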
In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called "factless fact tables" or "junction tables". Factless fact tables can, for example, be used to model many-to-many relationships or to capture events.[1]
Types of fact tables
There are basically three fundamental measurement events, which characterize all fact tables.[2]
• Transactional
A transactional table is the most basic and fundamental. The grain associated with a transactional
fact table is usually specified as "one row per line in a transaction", e.g., every line on a receipt.
Typically a transactional fact table holds data of the most detailed level, causing it to have a great
number of dimensions associated with it.
• Periodic snapshots
The periodic snapshot, as the name implies, takes a "picture of the moment", where the moment
could be any defined period of time, e.g. a performance summary of a salesman over the previous
month. A periodic snapshot table is dependent on the transactional table, as it needs the detailed
data held in the transactional fact table in order to deliver the chosen performance output.
• Accumulating snapshots
This type of fact table is used to show the activity of a process that has a well-defined beginning
and end, e.g., the processing of an order. An order moves through specific steps until it is fully
processed. As steps towards fulfilling the order are completed, the associated row in the fact table
is updated. An accumulating snapshot table often has multiple date columns, each representing a
milestone in the process. Therefore, it's important to have an entry in the associated date dimension
that represents an unknown date, as many of the milestone dates are unknown at the time of the
creation of the row.
Steps in designing a fact table
• Identify a business process for analysis (like sales).
• Identify measures or facts (sales dollar) by asking questions like "what number of XX is relevant for the business process?" (replace the XX and test whether the question makes business sense).
• Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension) by asking questions that make business sense, like "analyse by XX", where XX is replaced with the subject to test.
• List the columns that describe each dimension (region name, branch name, business unit name).
• Determine the lowest level (granularity) of summary in a fact table (e.g. sales dollar).
Or utilize the four-step design process described in Kimball.[1]
Early-arriving fact
In the Data Warehouse practice of ETL, an early fact or early-arriving fact [1] denotes the detection of a
dimensional natural key during fact table source loading, prior to the assignment of a corresponding
primary key or surrogate key in the dimension table. Hence, the fact which cites the dimension arrives
early, relative to the definition of the dimension value.
Procedurally, an early fact can be treated as:
• an error, on the presumption that the dimensional attribute values should have been collected
before fact source loading.
• a valid fact, whose collection pauses whilst the missing dimensional attribute value itself is
collected.
• a valid fact, where a primary key value is generated on the dimension with no attributes (a stub or dummy row); the fact completes processing, and the dimension attributes are populated (overwritten) later in the load processing on the new row.
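The third option — generating a stub dimension row so the fact can complete processing — might be sketched as follows (the table layout and the helper name `resolve_surrogate_key` are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE supplier_dim (
    supplier_key  INTEGER PRIMARY KEY AUTOINCREMENT,
    supplier_code TEXT UNIQUE,      -- natural key from the source system
    supplier_name TEXT)""")

def resolve_surrogate_key(natural_key):
    """Return the surrogate key for a natural key, creating a stub
    (attribute-less) row for an early-arriving fact if none exists yet."""
    row = conn.execute(
        "SELECT supplier_key FROM supplier_dim WHERE supplier_code = ?",
        (natural_key,)).fetchone()
    if row:
        return row[0]
    # Early-arriving fact: insert a dummy row now; its attributes are
    # overwritten later in the load when the dimension source delivers them.
    cur = conn.execute(
        "INSERT INTO supplier_dim (supplier_code, supplier_name) VALUES (?, ?)",
        (natural_key, "UNKNOWN"))
    return cur.lastrowid

key1 = resolve_surrogate_key("ABC")   # creates the stub row
key2 = resolve_surrogate_key("ABC")   # reuses the same surrogate key
```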
Dimension table
In data warehousing, a dimension table is one of the set of companion tables to a fact table.
The fact table contains business facts or measures and foreign keys which refer to candidate keys
(normally primary keys) in the dimension tables.
In contrast to fact tables, dimension tables contain descriptive attributes (or fields) which are typically
textual fields or discrete numbers that behave like text. These attributes are designed to serve two critical
purposes: query constraining/filtering and query result set labeling.
Dimension attributes are supposed to be:
• Verbose - labels consisting of full words,
• Descriptive,
• Complete - no missing values,
• Discretely valued - only one value per row in the dimension table,
• Quality assured - no misspelling, no impossible values.
Dimension table rows are uniquely identified by a single key field. It is recommended that the key field be a simple integer, because the key value is meaningless and is used only to join the fact and dimension tables.
The usage of surrogate dimension keys brings several advantages, among them:
• Performance - join processing is much more efficient if a single-field surrogate key is used,
• Buffering from operational key management practices - prevents situations in which deleted data rows could reappear when their natural keys are reused or reassigned after a long period of dormancy,
• Mapping to integrate disparate sources,
• Handling unknown or not-applicable connections,
• Tracking changes in dimension attribute values.
Usage of surrogate keys also brings additional cost, due to the burden put on the ETL system. Still, pipeline processing can be improved, and ETL tools have built-in surrogate key processing.
The goal of a dimension table is to create standardized, conformed dimensions that can be shared across the enterprise's data warehouse environment, joining to multiple fact tables representing various business processes.
Conformed dimensions are highly important to the enterprise nature of a DW/BI system for the following reasons:
• Consistency - every fact table is filtered consistently, so query answers are labeled consistently,
• Integration - queries can drill into different process fact tables separately for each individual fact table and then join the results on common dimension attributes,
• Reduced development time to market - the common dimensions are available without reinventing the wheel.
Over time, the attributes of a given row in a dimension table may change. For example, the shipping
address for a company may change. Kimball refers to this phenomenon as Slowly Changing Dimensions.
Strategies for dealing with this kind of change are divided into three categories:
• Type One - Simply overwrite the old value(s).
• Type Two - Add a new row containing the new value(s), and distinguish between the rows using
Tuple-versioning techniques.
• Type Three - Add a new attribute to the existing row.
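The first two strategies can be sketched on a single changing attribute (the row layout and function names are illustrative; for simplicity the expired row's end date is set to the change date itself rather than the day before, as a production implementation would do):

```python
def apply_type1(rows, key, new_state):
    """Type One: overwrite the old value in place; history is lost."""
    for r in rows:
        if r["code"] == key:
            r["state"] = new_state

def apply_type2(rows, key, new_state, change_date):
    """Type Two: close the current row and append a new versioned row."""
    current = next(r for r in rows if r["code"] == key and r["current"])
    current["current"] = False
    current["end_date"] = change_date
    rows.append({"code": key, "state": new_state, "current": True,
                 "start_date": change_date, "end_date": "9999-12-31"})

# Type One demo: the old state is simply overwritten.
t1 = [{"code": "XYZ", "state": "CA"}]
apply_type1(t1, "XYZ", "IL")

# Type Two demo: history is preserved across two rows.
rows = [{"code": "ABC", "state": "CA", "current": True,
         "start_date": "2000-01-01", "end_date": "9999-12-31"}]
apply_type2(rows, "ABC", "IL", "2004-12-22")
```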
Degenerate dimension
According to Ralph Kimball,[1] in a data warehouse, a degenerate dimension is a dimension that is derived from the fact table and does not have its own dimension table. Degenerate dimensions are often used when a fact table's grain represents transactional-level data and one wishes to maintain system-specific identifiers such as order numbers, invoice numbers, and the like without forcing their inclusion in their own dimension. The decision to use degenerate dimensions is often based on the desire
to provide a direct reference back to a transactional system without the overhead of maintaining a separate
dimension table.
Other Definitions
While many sources agree with Kimball, other definitions of degenerate dimension can be found both
online and in other reference works:
• "A degenerate dimension is data that is dimensional in nature but stored in a fact table." [2]
• "Any values in the fact table that don’t join to dimensions are either considered degenerate
dimensions or measures." [3]
• "A degenerate dimension is when the dimension attribute is stored as part of fact table, and not in a
separate dimension table." [4]
• "A degenerate dimension acts as a dimension key in the fact table but does not join a
corresponding dimension table because all its interesting attributes have already been placed in
other analytic dimensions." [5]
Supplier_History

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State  Create_Date
123           ABC            Acme Supply Co  CA              22-Dec-2004
This method resembles how database audit tables and change data capture techniques function.
Type 6 / Hybrid
The Type 6 method combines the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6). One possible explanation
of the origin of the term was that it was coined by Ralph Kimball during a conversation with Stephen Pace
from Kalido. Ralph Kimball calls this method "Unpredictable Changes with Single-Version Overlay" in
The Data Warehouse Toolkit[1].
The Supplier table starts out with one record for our example supplier:
Supplier_Key  Supplier_Code  Supplier_Name   Current_State  Historical_State  Start_Date   End_Date     Current_Flag
123           ABC            Acme Supply Co  CA             CA                01-Jan-2000  31-Dec-9999  Y
The Current_State and the Historical_State are the same. The Current_Flag attribute indicates that this is
the current or most recent record for this supplier.
When Acme Supply Company moves to Illinois, we add a new record, as in Type 2 processing:
Supplier_Key  Supplier_Code  Supplier_Name   Current_State  Historical_State  Start_Date   End_Date     Current_Flag
123           ABC            Acme Supply Co  IL             CA                01-Jan-2000  21-Dec-2004  N
124           ABC            Acme Supply Co  IL             IL                22-Dec-2004  31-Dec-9999  Y
We overwrite the Current_State information in the first record (Supplier_Key = 123) with the new
information, as in Type 1 processing. We create a new record to track the changes, as in Type 2
processing. And we store the history in a second State column (Historical_State), which incorporates Type
3 processing.
If our example supplier company were to relocate again, we would add another record to the Supplier
dimension, and we would once again overwrite the contents of the Current_State column:
Supplier_Key  Supplier_Code  Supplier_Name   Current_State  Historical_State  Start_Date   End_Date     Current_Flag
123           ABC            Acme Supply Co  NY             CA                01-Jan-2000  21-Dec-2004  N
124           ABC            Acme Supply Co  NY             IL                22-Dec-2004  03-Feb-2008  N
125           ABC            Acme Supply Co  NY             NY                04-Feb-2008  31-Dec-9999  Y
Note that, for the current record (Current_Flag = 'Y'), the Current_State and the Historical_State are
always the same.[1]
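The Type 6 update described above can be sketched as a single function combining the three techniques (the row layout is illustrative, and for simplicity the expired row's end date is set to the change date itself rather than the day before):

```python
def type6_move(rows, code, new_state, change_date, new_key):
    """Apply a Type 6 (hybrid 1 + 2 + 3) change of state for one supplier."""
    # Type 1 overlay: overwrite Current_State on every existing row.
    for r in rows:
        if r["code"] == code:
            r["current_state"] = new_state
    # Type 2: expire the current row...
    latest = next(r for r in rows if r["code"] == code and r["flag"] == "Y")
    latest["flag"] = "N"
    latest["end"] = change_date
    # ...and append a new versioned row; Historical_State is the Type 3
    # "as-was" column preserved per version.
    rows.append({"key": new_key, "code": code,
                 "current_state": new_state, "historical_state": new_state,
                 "start": change_date, "end": "9999-12-31", "flag": "Y"})

rows = [{"key": 123, "code": "ABC", "current_state": "CA",
         "historical_state": "CA", "start": "2000-01-01",
         "end": "9999-12-31", "flag": "Y"}]
type6_move(rows, "ABC", "IL", "2004-12-22", 124)
type6_move(rows, "ABC", "NY", "2008-02-04", 125)
```

After both moves, every row carries Current_State 'NY', while Historical_State still records 'CA' and 'IL' on the older versions, mirroring the table above.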
Type 2 / Type 6 Fact Implementation
Surrogate Key Alone
In many Type 2 and Type 6 SCD implementations, the surrogate key from the dimension is put into the
fact table in place of the natural key when the fact data is loaded into the data repository.[1] The surrogate
key is selected for a given fact record based on its effective date and the Start_Date and End_Date from
the dimension table. This allows the fact data to be easily joined to the correct dimension data for the
corresponding effective date.
Here is the Supplier table as we created it above using Type 6 methodology:
Supplier_Key  Supplier_Code  Supplier_Name   Current_State  Historical_State  Start_Date   End_Date     Current_Flag
123           ABC            Acme Supply Co  NY             CA                01-Jan-2000  21-Dec-2004  N
124           ABC            Acme Supply Co  NY             IL                22-Dec-2004  03-Feb-2008  N
125           ABC            Acme Supply Co  NY             NY                04-Feb-2008  31-Dec-9999  Y
The following SQL retrieves the correct Supplier Surrogate_Key for each Delivery fact record, based on
the primary effective date, Delivery_Date:
SELECT
supplier.supplier_key
FROM supplier
INNER JOIN delivery
ON supplier.supplier_code = delivery.supplier_code
AND delivery.delivery_date >= supplier.start_date
AND delivery.delivery_date <= supplier.end_date
A fact record with an effective date of August 9, 2001 will be linked to Surrogate_Key 123, with a
Historical_State of 'CA'. A fact record with an effective date of October 11, 2007 will be linked to
Surrogate_Key 124, with a Historical_State of 'IL'.
Once the Delivery table contains the correct Supplier_Key, it can easily be joined to the Supplier table
using that key. The following SQL retrieves, for each fact record, the correct supplier name and the state
the supplier was located in at the time of the delivery:
SELECT
delivery.delivery_cost,
supplier.supplier_name,
supplier.historical_state
FROM delivery
INNER JOIN supplier
ON delivery.supplier_key = supplier.supplier_key
If you've utilized Type 6 processing for your dimension, then you can easily retrieve the state the company
is currently located in, using the same Supplier_Key:
SELECT
delivery.delivery_cost,
supplier.supplier_name,
supplier.current_state
FROM delivery
INNER JOIN supplier
ON delivery.supplier_key = supplier.supplier_key
Both Surrogate and Natural Key
An alternate implementation is to place both the surrogate key and the natural key into the fact table.[2]
This allows the user to select the appropriate dimension records based on:
• the primary effective date on the fact record (above),
• the most recent or current information,
• any other date associated with the fact record.
This method allows more flexible links to the dimension, even if you have used the Type 2 approach
instead of Type 6.
Here is the Supplier table as we might have created it using Type 2 methodology:
Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State  Start_Date   End_Date     Current_Flag
123           ABC            Acme Supply Co  CA              01-Jan-2000  21-Dec-2004  N
124           ABC            Acme Supply Co  IL              22-Dec-2004  03-Feb-2008  N
125           ABC            Acme Supply Co  NY              04-Feb-2008  31-Dec-9999  Y
The following SQL retrieves the most current Supplier_Name and Supplier_State for each fact record:
SELECT
delivery.delivery_cost,
supplier.supplier_name,
supplier.supplier_state
FROM delivery
INNER JOIN supplier
ON delivery.supplier_code = supplier.supplier_code
WHERE supplier.current_flag = 'Y'
If there are multiple dates on the fact record, the fact can be joined to the dimension using another date
instead of the primary effective date. For instance, the Delivery table might have a primary effective date
of Delivery_Date, but might also have an Order_Date associated with each record.
The following SQL retrieves the correct Supplier_Name and Supplier_State for each fact record based on
the Order_Date:
SELECT
delivery.delivery_cost,
supplier.supplier_name,
supplier.supplier_state
FROM delivery
INNER JOIN supplier
ON delivery.supplier_code = supplier.supplier_code
AND delivery.order_date >= supplier.start_date
AND delivery.order_date <= supplier.end_date
Some cautions:
• If the join query is not written correctly, it may return duplicate rows and/or give incorrect answers.
• The date comparison might not perform well.
• Some business intelligence tools do not handle generating complex joins well.
• The ETL processes needed to create the dimension table need to be carefully designed to ensure that there are no overlaps in the time periods for each distinct item of reference data.
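The last caution — no overlapping time periods per natural key — can be enforced with a small validation pass in the ETL (the column layout follows the Supplier example; the helper name is illustrative):

```python
from datetime import date

def has_overlaps(rows):
    """rows: list of (natural_key, start_date, end_date) tuples.
    Return True if any two periods for the same key overlap."""
    by_key = {}
    for key, start, end in rows:
        by_key.setdefault(key, []).append((start, end))
    for periods in by_key.values():
        periods.sort()
        for (s1, e1), (s2, e2) in zip(periods, periods[1:]):
            if s2 <= e1:        # next period starts before the previous ends
                return True
    return False

ok = [("ABC", date(2000, 1, 1), date(2004, 12, 21)),
      ("ABC", date(2004, 12, 22), date(9999, 12, 31))]
bad = ok + [("ABC", date(2004, 6, 1), date(2005, 1, 1))]
```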
Combining Types
Different SCD Types can be applied to different columns of a table. For example, we can apply Type 1 to
the Supplier_Name column and Type 2 to the Supplier_State column of the same table, the Supplier table.
Concept
The core of any OLAP system is an OLAP cube (also called a 'multidimensional cube' or a hypercube). It
consists of numeric facts called measures which are categorized by dimensions. The cube metadata is
typically created from a star schema or snowflake schema of tables in a relational database. Measures are
derived from the records in the fact table and dimensions are derived from the dimension tables.
Each measure can be thought of as having a set of labels, or meta-data associated with it. A dimension is
what describes these labels; it provides information about the measure.
A simple example would be a cube that contains a store's sales as a measure, and Date/Time as a
dimension. Each Sale has a Date/Time label that describes more about that sale.
Any number of dimensions can be added to the structure such as Store, Cashier, or Customer by adding a
foreign key column to the fact table. This allows an analyst to view the measures along any combination
of the dimensions.
For example:
Sales Fact Table
+-------------+---------+
| sale_amount | time_id |
+-------------+---------+
|     2008.10 |    1234 |---+
+-------------+---------+   |
                            |    Time Dimension
                            |   +---------+-------------------+
                            |   | time_id | timestamp         |
                            |   +---------+-------------------+
                            +-->|    1234 | 20080902 12:35:43 |
                                +---------+-------------------+
Multidimensional databases
Multidimensional structure is defined as “a variation of the relational model that uses multidimensional
structures to organize data and express the relationships between data”.[7] The structure is broken into
cubes and the cubes are able to store and access data within the confines of each cube. “Each cell within a
multidimensional structure contains aggregated data related to elements along each of its dimensions”.[8]
Even when data is manipulated, it remains easy to access, and the database stays compact. The data still remains interrelated. Multidimensional structure is quite popular for analytical databases that use online analytical processing (OLAP) applications (O'Brien & Marakas, 2009). Analytical databases use these structures because of their ability to deliver answers to complex business queries swiftly. Data can be viewed in different ways, which gives a broader picture of a problem, unlike other models.[9]
Aggregations
It has been claimed that for complex queries OLAP cubes can produce an answer in around 0.1% of the
time for the same query on OLTP relational data.[10][11] The most important mechanism in OLAP which
allows it to achieve such performance is the use of aggregations. Aggregations are built from the fact
table by changing the granularity on specific dimensions and aggregating up data along these dimensions.
The number of possible aggregations is determined by every possible combination of dimension
granularities.
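For example, counting the top "all" level in each hierarchy, the number of possible aggregations is the product of the level counts per dimension (the hierarchy depths here are invented for illustration):

```python
from math import prod

# Hierarchy depths per dimension, counting the top "all" level:
# Time: all / year / month / day = 4 levels
# Product: all / category / SKU = 3 levels
# Store: all / region / store = 3 levels
levels = {"time": 4, "product": 3, "store": 3}

possible_aggregations = prod(levels.values())   # 4 * 3 * 3
```

Because this product grows multiplicatively with each added dimension, materializing every aggregation quickly becomes infeasible, which motivates the view selection problem discussed next.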
The combination of all possible aggregations and the base data contains the answers to every query which can be answered from the data.[12]
Because usually there are many aggregations that can be calculated, often only a predetermined number
are fully calculated; the remainder are solved on demand. The problem of deciding which aggregations
(views) to calculate is known as the view selection problem. View selection can be constrained by the
total size of the selected set of aggregations, the time to update them from changes in the base data, or
both. The objective of view selection is typically to minimize the average time to answer OLAP queries,
although some studies also minimize the update time. View selection is NP-complete. Many approaches to the problem have been explored, including greedy algorithms, randomized search, genetic algorithms and the A* search algorithm.
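A minimal greedy sketch of view selection: repeatedly pick the candidate aggregation with the best query-time benefit per unit of storage, subject to a size budget. The sizes, benefits, and view names are invented, and a real greedy algorithm (e.g. for the data-cube lattice) recomputes each view's benefit after every pick, since materializing one view reduces the benefit of related ones:

```python
# Candidate aggregations: name -> (storage size, total query time saved).
candidates = {
    "by_day_product": (100, 50),
    "by_day_store":   (80, 60),
    "by_month":       (10, 30),
}

def greedy_select(candidates, budget):
    """Greedily pick views by benefit per unit of storage until the
    storage budget is exhausted."""
    chosen, used = [], 0
    for name, (size, benefit) in sorted(candidates.items(),
                                        key=lambda kv: kv[1][1] / kv[1][0],
                                        reverse=True):
        if used + size <= budget:
            chosen.append(name)
            used += size
    return chosen

views = greedy_select(candidates, budget=100)
```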
Types
OLAP systems have been traditionally categorized using the following taxonomy.[13]
Multidimensional
Main article: MOLAP
MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP. MOLAP stores data in an optimized multi-dimensional array storage, rather than in a relational database. It therefore requires the pre-computation and storage of information in the cube - the operation known as processing.
Relational
Main article: ROLAP
ROLAP works directly with relational databases. The base data and the dimension tables are stored as relational tables, and new tables are created to hold the aggregated information. It depends on a specialized schema design. This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.
Hybrid
Main article: HOLAP
There is no clear agreement across the industry as to what constitutes "Hybrid OLAP", except that a
database will divide data between relational and specialized storage. For example, for some vendors, a
HOLAP database will use relational tables to hold the larger quantities of detailed data, and use
specialized storage for at least some aspects of the smaller quantities of more-aggregate or less-detailed
data.
Comparison
Each type has certain benefits, although there is disagreement about the specifics of the benefits between
providers.
• Some MOLAP implementations are prone to database explosion, a phenomenon
causing vast amounts of storage space to be used by MOLAP databases when certain
common conditions are met: high number of dimensions, pre-calculated results and
sparse multidimensional data.
• MOLAP generally delivers better performance due to specialized indexing and storage
optimizations. MOLAP also needs less storage space compared to ROLAP because the
specialized storage typically includes compression techniques.[14]
• ROLAP is generally more scalable.[14] However, large volume pre-processing is difficult
to implement efficiently so it is frequently skipped. ROLAP query performance can
therefore suffer tremendously.
• Since ROLAP relies more on the database to perform calculations, it has more
limitations in the specialized functions it can use.
• HOLAP encompasses a range of solutions that attempt to mix the best of ROLAP and
MOLAP. It can generally pre-process swiftly, scale well, and offer good function
support.
Other types
The following acronyms are also sometimes used, although they are not as widespread as the ones above:
• WOLAP - Web-based OLAP
• DOLAP - Desktop OLAP
• RTOLAP - Real-Time OLAP