
1. What is a Data Warehouse?

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources. In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users. A data warehouse is commonly described by four characteristics: Subject Oriented, Integrated, Nonvolatile, and Time Variant.

Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Integrated
Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant
In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.

2. OLTP (Online Transaction Processing)

Operational or OLTP systems are designed to optimize the creation and updating of individual records. Typical transactions might represent the addition of new customers, taking orders or recording a change of address. Reporting and analysis systems, by contrast, typically need to summarize large numbers of individual transaction records, and perform better if summary data is pre-calculated and stored.

3. OLAP (Online Analytical Processing)

The term online analytical processing is used to distinguish the requirements of reporting and analysis systems from those of transaction processing systems designed to run day-to-day business operations. Online transaction processing (OLTP) focuses on capturing and updating information efficiently. This works best in a normalized, relational database, where every piece of data is stored in only one place, as part of a single record in a specific table. Management reporting, on the other hand, usually requires many records to be summarized, and information from different parts of the database to be combined, e.g. to derive a useful ratio. Good performance requires a different data structure, and the use of aggregates.

OLAP tools represent data as if it were held in one or more multi-dimensional arrays, known as cubes, with cells like a spreadsheet. These cubes often have more than 3 dimensions, so strictly speaking they should be called hypercubes, but it is much easier to visualize and explain how OLAP cubes are structured in plain 3-D. The edges of the cube represent the important dimensions of the business, such as time, country and product. One edge usually represents different measures, but some tools use separate cubes for each measure.

Each cell can be uniquely identified by specifying a member from each dimension, e.g. {1999, Cost of sales, UK}. By selecting one or more members from each dimension, the user can slice and dice the cube to view almost any subset of the data from different perspectives. Dimension members may be organized into a hierarchy, with summary-level members such as year, region or product group. The user can then drill down from one level to the next to see more detailed data, and then drill back up. Most OLAP tools also enable the user to switch instantly between tabular and chart formats, and to save favorite views of the data as reports for future reference.

4. MOLAP, ROLAP, and HOLAP

In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages:
- Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
- Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
- Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data; indeed, this is possible, but in that case only summary-level information will be included in the cube itself.
- Requires additional investment: Cube technology is often proprietary and does not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement (see the sketch after the HOLAP discussion below).
Advantages:
- Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
- Can leverage functionalities inherent in the relational database: Often, relational databases already come with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:

- Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) against the relational database, the query time can be long if the underlying data size is large.
- Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building out-of-the-box complex functions into the tool, as well as the ability to allow users to define their own functions.

HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.
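To make the ROLAP point above concrete, here is a minimal sketch (the star-schema table and column names are invented for the example): slicing the cube to the {1999, UK} members simply adds WHERE conditions to an ordinary SQL query.

-- Hypothetical star schema: sales_fact joined to time, geography and product dimensions.
SELECT p.product_name,
       SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
       JOIN time_dim      t ON f.time_key    = t.time_key
       JOIN geography_dim g ON f.geo_key     = g.geo_key
       JOIN product_dim   p ON f.product_key = p.product_key
WHERE  t.year    = 1999      -- member selected from the Time dimension
AND    g.country = 'UK'      -- member selected from the Geography dimension
GROUP BY p.product_name;

Adding or removing such WHERE conditions is all a ROLAP engine has to do to present a different slice of the cube.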

5. Differences Between OLTP and Data Warehouse Environments

Data warehouses and OLTP systems have very different requirements. Here are some examples of differences between typical data warehouses and OLTP systems:

Workload
Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations. OLTP systems support only predefined operations. Your applications might be specifically tuned or designed to support only these operations.

Data modifications
A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end users of a data warehouse do not directly update the data warehouse. In OLTP systems, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date, and reflects the current state of each business transaction.

Schema design
Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance. OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to guarantee data consistency.

Typical operations
A typical data warehouse query scans thousands or millions of rows. For example, "Find the total sales for all customers last month." A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order for this customer."
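As a hedged sketch of the contrast (the table and column names are invented for illustration), the two kinds of operation might look like this in SQL:

-- Typical data warehouse query: scans and aggregates many rows.
SELECT c.customer_name,
       SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
       JOIN customer_dim c ON f.customer_key = c.customer_key
       JOIN time_dim     t ON f.time_key     = t.time_key
WHERE  t.month = 'AUG' AND t.year = 2003
GROUP BY c.customer_name;

-- Typical OLTP operation: touches a handful of rows for one business transaction.
SELECT order_id, order_date, status
FROM   orders
WHERE  customer_id = 1001
AND    status = 'OPEN';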

Historical data
Data warehouses usually store many months or years of data. This is to support historical analysis. OLTP systems usually store data from only a few weeks or months. The OLTP system stores historical data only as needed to successfully meet the requirements of the current transaction.

6. Data Warehouse Architecture (Basic)

Figure 1-2 shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse.

Figure 1-2 Architecture of a Data Warehouse

In Figure 1-2, the metadata and raw data of a traditional OLTP system are present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales.

7. Data Warehouse Architecture (with a Staging Area)

In Figure 1-2, you need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management. Figure 1-3 illustrates this typical architecture.

Figure 1-3 Architecture of a Data Warehouse with a Staging Area

8. Data Warehouse Architecture (with a Staging Area and Data Marts)

Although the architecture in Figure 1-3 is quite common, you may want to customize your warehouse's architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business. Figure 1-4 illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.

Figure 1-4 Architecture of a Data Warehouse with a Staging Area and Data Marts

9. Data Mart

Similar to a data warehouse, but holding only the data needed for a specific business function or department. Different data marts may be stored in different locations on different platforms using different database products.

10. Mart or Warehouse?

A data mart need not be small, but is likely to contain a subset or summary of the detailed information available in the warehouse. It will be structured to optimize the specific reports and analyses needed by a clearly defined group of users, and is much easier to build than a complete data warehouse. A central data warehouse can feed multiple data marts, with overlapping content. Each mart then provides a customized view of the organization, based on consistent data from the main warehouse. The warehouse may be allowed to grow from the first mart to be implemented, possibly sharing the same hardware platform and database. This approach can lead to major problems and rework as the warehouse expands.

11. Data Staging

The process of extracting, transforming, loading and checking data on its way from source system to data warehouse. Copies may be stored at intermediate steps in a data staging area. The most difficult and time-consuming aspect of building a data warehouse is taking data from disparate source systems, converting it into a consistent form that can be loaded into the warehouse, checking its quality and automating this process.

12. Conceptual, Logical, and Physical Data Models

There are three levels of data modeling: conceptual, logical, and physical. This section explains the differences among the three, the order in which each one is created, and how to go from one level to the other.

Conceptual Data Model
Features of a conceptual data model include:
- Includes the important entities and the relationships among them.
- No attribute is specified.
- No primary key is specified.
At this level, the data modeler attempts to identify the highest-level relationships among the different entities.

Logical Data Model
Features of a logical data model include:
- Includes all entities and relationships among them.
- All attributes for each entity are specified.
- The primary key for each entity is specified.
- Foreign keys (keys identifying the relationship between different entities) are specified.
- Normalization occurs at this level.
At this level, the data modeler attempts to describe the data in as much detail as possible, without regard to how they will be physically implemented in the database. In data warehousing, it is common for the conceptual data model and the logical data model to be combined into a single step (deliverable).

The steps for designing the logical data model are as follows:
- Identify all entities.
- Specify primary keys for all entities.
- Find the relationships between different entities.
- Find all attributes for each entity.
- Resolve many-to-many relationships.
- Normalize.

Physical Data Model
Features of a physical data model include:
- Specification of all tables and columns.
- Foreign keys are used to identify relationships between tables.
- Denormalization may occur based on user requirements.
- Physical considerations may cause the physical data model to be quite different from the logical data model.
At this level, the data modeler specifies how the logical data model will be realized in the database schema. The steps for physical data model design are as follows:
- Convert entities into tables.
- Convert relationships into foreign keys.
- Convert attributes into columns.
- Modify the physical data model based on physical constraints / requirements.

13. Dimensional Data Model

Dimensional modeling is the process of identifying the dimensions required for analysis, defining the hierarchies and levels they contain, and making sure they conform. This is different from the third normal form, commonly used for transactional (OLTP) systems. To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:

Dimension: A business perspective useful for analyzing data. A dimension usually contains one or more hierarchies that can be used to drill up or down to different levels of detail. Typical dimensions include product, customer, time, department, location and channel. Dimension data is typically collected at the lowest level of detail and then aggregated into higher level totals that are more useful for analysis. These natural rollups or aggregations within a dimension table are called hierarchies.

Hierarchy: An organisation of data into a logical tree structure defining parent-child relationships between the levels in a dimension. It controls data consolidation and drill-down paths. A typical time dimension would have a hierarchy based on date, week, month, quarter and year.
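As a hedged sketch of such a time hierarchy (the table and column names are invented for the example), each level of the hierarchy typically becomes a column of the dimension table:

-- Time dimension: one row per day, with the hierarchy levels stored as columns.
CREATE TABLE time_dim (
    time_key      NUMBER PRIMARY KEY,   -- surrogate key referenced by the fact table
    calendar_date DATE,
    week          NUMBER,
    month         VARCHAR2(10),
    quarter       VARCHAR2(2),
    year          NUMBER
);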

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.

Level: A position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the Month, Quarter, and Year levels.

Level relationships: Level relationships specify top-to-bottom ordering of levels from most general (the root) to most specific information. They define the parent-child relationship between the levels in a hierarchy.

Star schema
In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.

The most natural way to model a data warehouse is as a star schema, where only one join establishes the relationship between the fact table and any one of the dimension tables. A star schema optimizes performance by keeping queries simple and providing fast response time. All the information about each level is stored in one row.

Snowflake schema
A type of star schema in which the dimension tables are partly or fully normalized. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables. Whether one uses a star or a snowflake largely depends on personal preference and business needs.
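A minimal star-schema sketch (the product and store dimensions and the measure are invented for the example, and the time_dim from the earlier sketch is reused): each dimension joins to the central fact table through a single surrogate key, so any query needs only one join per dimension.

-- Dimension (lookup) tables, one per business perspective.
CREATE TABLE product_dim (
    product_key   NUMBER PRIMARY KEY,
    product_name  VARCHAR2(100),
    product_group VARCHAR2(50)
);

CREATE TABLE store_dim (
    store_key  NUMBER PRIMARY KEY,
    store_name VARCHAR2(100),
    region     VARCHAR2(50)
);

-- Fact table in the middle: one foreign key per dimension plus the additive measure.
CREATE TABLE sales_fact (
    time_key     NUMBER REFERENCES time_dim (time_key),
    product_key  NUMBER REFERENCES product_dim (product_key),
    store_key    NUMBER REFERENCES store_dim (store_key),
    sales_amount NUMBER(12,2)
);

A snowflake variant would further normalize product_dim, for example splitting product_group out into its own lookup table referenced from product_dim.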

Dimension table
A table containing the data for one dimension within a star schema. The primary key is used to link to the fact table, and each level in the dimension has a corresponding field in the dimension table. Dimension tables describe the business entities of an enterprise, represented as hierarchical, categorical information such as time, departments, locations, and products. Dimension tables are sometimes called lookup or reference tables.

Fact: Data, usually numeric and additive, that can be examined and analyzed. Examples include sales, cost, and profit. Fact and measure are synonymous; fact is more commonly used with relational environments, measure is more commonly used with multidimensional environments. There are three types of facts:
- Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
- Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
- Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Derived fact (or measure): A fact (or measure) that is generated from existing data using a mathematical operation or a data transformation. Examples include averages, totals, percentages, and differences.

Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns: Date, Store, Product, Sales_Amount. The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.

Say we are a bank with a fact table containing the following columns: Date, Account, Current_Balance, Profit_Margin. The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add balances up for all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add it up at the account level or the day level.

Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: a date column, a store column, and a sales amount column.

Types of Fact Tables
Based on the above classifications, there are two types of fact tables:
- Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.
- Snapshot: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Fact Table Granularity
The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This involves two steps:
- Determine which dimensions will be included.
- Determine where along the hierarchy of each dimension the information will be kept.
The determining factors usually go back to the requirements.

Which Dimensions To Include
Determining which dimensions to include is usually a straightforward process, because business processes will often dictate clearly what the relevant dimensions are. For example, in an off-line retail world, the dimensions for a sales fact table are usually time, geography, and product. This list, however, is by no means a complete list for all off-line retailers. A supermarket with a rewards card program, where customers provide some personal information in exchange for a rewards card and the supermarket offers lower prices on certain items to customers who present the card at checkout, will also have the ability to track the customer dimension. Whether the data warehousing system includes the customer dimension will then be a decision that needs to be made.

What Level Within Each Dimension To Include
Determining at which level of the hierarchy the information is stored along each dimension is a bit more tricky. This is where user requirements (both stated and possibly future) play a major role. In the above example, will the supermarket want to do analysis at the hourly level (i.e., looking at how certain products may sell by different hours of the day)? If so, it makes sense to use 'hour' as the lowest level of granularity in the time dimension. If daily analysis is sufficient, then 'day' can be used as the lowest level of granularity. Since the lower the level of detail, the larger the amount of data in the fact table, the granularity exercise is in essence figuring out the sweet spot in the tradeoff between detailed level of analysis and data storage.

Measure: A numeric value stored in a fact table or cube. Typical examples include sales value, sales volume, price, stock and headcount.

Granularity (or grain): The level of detail of the facts (or measures) stored in the data warehouse.
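To make additivity and granularity concrete, here is a hedged sketch against the hypothetical retail and bank fact tables above (the table and column names are invented): the additive Sales_Amount rolls up cleanly from the daily grain to weekly totals, while the semi-additive Current_Balance may only be summed across accounts for a single day.

-- Additive fact: daily sales summed up to the week level per store.
SELECT store,
       TRUNC(sale_date, 'IW') AS week_start,
       SUM(sales_amount)      AS weekly_sales
FROM   retail_sales_fact
GROUP BY store, TRUNC(sale_date, 'IW');

-- Semi-additive fact: balances summed across accounts for one snapshot date,
-- but never summed across dates for one account.
SELECT snapshot_date,
       SUM(current_balance) AS total_balance_all_accounts
FROM   account_balance_fact
GROUP BY snapshot_date;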

Cube: A multi-dimensional representation of data that can be viewed from different perspectives.

Metadata: Data describing the data held in the warehouse. This may include a description of the tables and fields in the warehouse, what they mean, similar descriptions of the source data and a mapping between the two, details of the transformations made, the reliability of each item, when it was last updated, etc.

Slowly growing dimensions are dimension tables that have slowly increasing dimension data, without updates to existing dimensions. You maintain slowly growing dimensions by appending new data to the existing table.

Slowly changing dimensions are dimension tables that have slowly increasing dimension data, as well as updates to existing dimensions. When updating existing dimensions, you decide whether to keep all historical dimension data, no historical data, or just the current and previous versions of dimension data.

Type 1: In a Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept. In our example, we originally have the following table:

Customer Key | Name      | State
1001         | Christina | Illinois

After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:

Customer Key | Name      | State
1001         | Christina | California

Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.
Usage: About 50% of the time.
When to use Type 1: Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.

Type 2: In a Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key. In our example, recall we originally have the following table:

Customer Key | Name      | State
1001         | Christina | Illinois

After Christina moved from Illinois to California, we add the new information as a new row into the table:

Customer Key | Name      | State
1001         | Christina | Illinois
1005         | Christina | California

Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
Usage: About 50% of the time.
When to use Type 2: Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.

Type 3: In a Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active. In our example, recall we originally have the following table:

Customer Key | Name      | State
1001         | Christina | Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns: Customer Key, Name, Original State, Current State, Effective Date. After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):

Customer Key | Name      | Original State | Current State | Effective Date
1001         | Christina | Illinois       | California    | 15-JAN-2003

Advantages:
- This does not increase the size of the table, since existing information is updated in place.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.
Usage: Type 3 is rarely used in actual practice.
When to use Type 3: Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
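A hedged sketch of how Types 1 and 2 might be applied to the customer dimension above (customer_dim, its columns, and the surrogate-key handling are illustrative assumptions, not the only way to do it):

-- Type 1: overwrite the attribute in place; no history is kept.
UPDATE customer_dim
SET    state = 'California'
WHERE  customer_key = 1001;

-- Type 2: keep the old row and add a new row with a new surrogate key;
-- both the Illinois and the California versions remain in the dimension.
INSERT INTO customer_dim (customer_key, name, state)
VALUES (1005, 'Christina', 'California');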

ETL: Extraction, transformation, and loading. ETL refers to the methods involved in accessing and manipulating source data and loading it into a data warehouse. The order in which these processes are performed varies. Note that ETT (extraction, transformation, transportation) and ETM (extraction, transformation, move) are sometimes used instead of ETL.

14. What is a Surrogate Key?

A surrogate key is a substitution for the natural primary key. It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table. Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. Such keys can be generated with an Informatica sequence generator, an Oracle sequence, or SQL Server identity values. A surrogate key is useful because the natural primary key (i.e. Customer Number in a Customer table) can change, and this makes updates more difficult. Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better, so you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system, and as far as the client is concerned you may display only the AIRPORT_NAME.
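A brief, hedged sketch of one way to populate such a key with an Oracle sequence (the sequence, table, and column names are invented for the example):

-- Sequence that supplies surrogate key values for the airport dimension.
CREATE SEQUENCE airport_id_seq START WITH 1 INCREMENT BY 1;

CREATE TABLE airport_dim (
    airport_id   NUMBER PRIMARY KEY,    -- surrogate key, internal to the warehouse
    airport_name VARCHAR2(100),         -- natural key shown to business users
    city_name    VARCHAR2(100)
);

INSERT INTO airport_dim (airport_id, airport_name, city_name)
VALUES (airport_id_seq.NEXTVAL, 'Heathrow', 'London');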

15. ODS (Operational Data Store)

A subject-oriented, integrated, frequently updated store of detailed data needed to support transactional systems with integrated data, e.g. the current master customer list shared by several systems. If available, the ODS is often a good source for dimension data. When the term was first coined, the ODS was expected to support queries requiring data at the most detailed level available, which typically had to be excluded from the warehouse to keep size within reasonable bounds. Thus the ODS would contain maximum detail and be refreshed in real time, whereas the warehouse would be lightly summarized and refreshed at regular intervals. Nowadays, hardware costs are less of a limiting factor, and it is more usual to restructure even the most detailed data for query and analysis and include them in the warehouse, possibly in near real time. Thus an ODS is less likely to be queried directly by users, and acts more like a source system.

16. ETL (Extract, Transform & Load)

The process of extracting data from source systems, transforming it into the required structure and loading it into the data warehouse. ETL tools are available to assist with this process.

Data cleansing: The process of correcting errors and removing inconsistencies before importing data into the warehouse.

Aggregate: A pre-stored summary of data, or grouping of detailed data, which satisfies a specific business rule. Example rules: sum, min, count, or combinations of them.

17. Why are OLTP database designs not generally a good idea for a Data Warehouse?

OLTP databases are normalized structures designed to reduce data redundancy, increase transaction performance, and decrease overall reads/writes. When you want to build summaries and cubes for analysis, pulling them from a highly normalized, transaction-based structure creates a huge amount of overhead and is slow and inefficient.

18. Why should you put your data warehouse on a different system than your OLTP system?

There are a lot of reasons. OLTP is designed to serve real-time transactional business needs; the idea is to get in and get out quickly and efficiently. OLAP is designed around large batch windows that aggregate and summarize data into reporting schemas. The two are diametrically opposed. Processing the OLAP portion of the database kills the processing and memory utilization needed to keep the OLTP system active and "lively". The OLAP workload also interferes with the data, locking rows and causing contention while it builds the aggregate views that are needed. The list goes on. They are two vastly different types of systems; run on the same machine, the resources will interfere with each other and end up choking your entire system to death.

Drill: To navigate from one item to a set of related items. Drilling typically involves navigating up and down through the levels in a hierarchy. When selecting data, you can expand or collapse a hierarchy by drilling down or up in it, respectively.

Drill down: To expand the view to include child values that are associated with parent values in the hierarchy.

Drill up: To collapse the list of descendant values that are associated with a parent value in the hierarchy.

Slice and dice: An informal term referring to data retrieval and manipulation. We can picture a data warehouse as a cube of data, where each axis of the cube represents a dimension. To "slice" the data is to retrieve a piece (a slice) of the cube by specifying measures and values for some or all of the dimensions. When we retrieve a data slice, we may also move and reorder its columns and rows as if we had diced the slice into many small pieces. A system with good slicing and dicing makes it easy to navigate through large amounts of data.

Aggregation: The process of consolidating data values into a single value. For example, sales data could be collected on a daily basis and then be aggregated to the week level, the week data could be aggregated to the month level, and so on. The data can then be referred to as aggregate data. Aggregation is synonymous with summarization, and aggregate data is synonymous with summary data.
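As a hedged sketch of aggregation and drill paths (reusing the invented sales_fact and time_dim), a GROUP BY ROLLUP produces the month rows, year subtotals, and grand total in one pass; these are exactly the levels a user steps through when drilling up and down the time hierarchy:

-- Aggregate sales at the month level; ROLLUP adds year subtotals and a grand total.
SELECT t.year,
       t.month,
       SUM(f.sales_amount) AS sales
FROM   sales_fact f
       JOIN time_dim t ON f.time_key = t.time_key
GROUP BY ROLLUP (t.year, t.month);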

19. Decision Support Systems (DSS)

Decision Support Systems (DSS) are a specific class of computerized information system that supports business and organizational decision-making activities. A properly designed DSS is an interactive software-based system intended to help decision makers compile useful information from raw data, documents, personal knowledge, and/or business models to identify and solve problems and make decisions. Typical information that a decision support application might gather and present would be:
- Access to all of your current information assets, including legacy and relational data sources, cubes, data warehouses, and data marts.
- Comparative sales figures between one week and the next.
- Projected revenue figures based on new product sales assumptions.
- The consequences of different decision alternatives, given past experience in a context that is described.

Business Intelligence Software Solutions

Business Intelligence (BI) gives you the ability to gain insight into your business or organization by understanding your company's information assets. These assets can include customer databases, supply chain information, personnel data, manufacturing, sales and marketing activity, as well as any other source of information critical to your operation. Business intelligence software allows you to integrate these disparate data sources into a single coherent framework for real-time reporting and detailed analysis by anyone in your extended enterprise: customers, partners, employees, managers, and executives.

20. Materialized Views for Data Warehouses

A materialized view, also called a snapshot, is a table that stores derived data together with the query that defines it. A materialized view can query tables, views, and other materialized views. Collectively these are called master tables (a replication term) or detail tables. The materialized views commonly created are primary key, rowid, and subquery materialized views. In data warehouses, you can use materialized views to precompute and store aggregated data such as the sum of sales. Materialized views in these environments are often referred to as summaries, because they store summarized data. They can also be used to precompute joins with or without aggregations. Materialized views improve query performance by precalculating expensive join and aggregation operations before query execution and storing the results in the database. Unlike a view, a materialized view stores both its definition and its data, whereas a view stores only the definition.

Whenever a query is fired against the base table, the Oracle server checks whether the query can be satisfied by the data in the materialized view; if it can, the Oracle server redirects the query to the materialized view rather than the base table.

21. Refresh of Materialized Views

A materialized view refresh is an efficient batch operation that makes a materialized view reflect a more current state of its master table or master materialized view. A refresh of an updatable materialized view first pushes the deferred transactions at the materialized view site to its master site or master materialized view site. Then, the data at the master site or master materialized view site is pulled down and applied to the materialized view. A row in a master table may be updated many times between refreshes of a materialized view, but the refresh updates the row in the materialized view only once with the current data. For example, a row in a master table may be updated 10 times since the last refresh of a materialized view, but the result is still only one update of the corresponding row in the materialized view during the next refresh.

Refresh Types
The following types of refresh methods are supported by Oracle:
- Complete: build from scratch.
- Fast: only apply the data changes.
- Force: try a fast refresh; if that is not possible, do a complete refresh.
- Never: never refresh the materialized view.

Each materialized view may have its own refresh method. The query that defines the materialized view determines which of the methods is applicable; hence it may not always be possible for a materialized view to be fast refreshable. All materialized views can be refreshed using a complete refresh. Once defined, a materialized view can be refreshed to reflect the latest data changes either ON DEMAND or ON COMMIT. The user can control the refresh of materialized views by choosing to refresh ON DEMAND. If the user chooses to refresh ON COMMIT, Oracle will automatically refresh the materialized view on the commit of any transaction that updates tables referenced by the materialized view.
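A hedged sketch of a summary materialized view in Oracle (the sales and times tables, their columns, and the chosen refresh options are assumptions for the example, not a recommendation):

CREATE MATERIALIZED VIEW sales_by_month_mv
  BUILD IMMEDIATE             -- populate the summary immediately
  REFRESH FORCE ON DEMAND     -- try a fast refresh, fall back to a complete refresh
  ENABLE QUERY REWRITE        -- allow the optimizer to use it transparently
AS
SELECT t.year, t.month, SUM(s.amount_sold) AS total_sales
FROM   sales s
       JOIN times t ON s.time_id = t.time_id
GROUP BY t.year, t.month;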

Complete Refresh

To refresh the materialized view, the result set of the query replaces the existing materialized view data. Oracle can perform a complete refresh for any materialized view. Depending on the amount of data that satisfies the defining query, a complete refresh can take a substantially longer amount of time to perform than a fast refresh. If you perform a complete refresh of a master materialized view, then the next refresh performed on any materialized views based on this master materialized view must be a complete refresh; if a fast refresh is attempted for such a materialized view after its master materialized view has performed a complete refresh, an error is returned.

Fast Refresh
To perform a fast refresh, the master that manages the materialized view first identifies the changes that occurred in the master since the most recent refresh of the materialized view and then applies these changes to the materialized view. Fast refreshes are more efficient than complete refreshes when there are few changes to the master, because the participating server and network replicate a smaller amount of data. You can perform fast refreshes of materialized views only when the master table or master materialized view has a materialized view log.

Force Refresh
To perform a force refresh of a materialized view, the server that manages the materialized view attempts to perform a fast refresh. If a fast refresh is not possible, then Oracle performs a complete refresh. Use the force setting when you want a materialized view to refresh if a fast refresh is not possible.

22. Materialized View Log

A materialized view log is required on a master if you want to fast refresh materialized views based on the master. When you create a materialized view log for a master table or master materialized view, Oracle creates an underlying table as the materialized view log. A materialized view log can hold the primary keys, rowids, or object identifiers of rows (or combinations of these) that have been updated in the master table or master materialized view. A materialized view log can also contain other columns to support fast refreshes of materialized views with subqueries. The following are the types of materialized view logs:
- Primary Key: The materialized view log records changes to the master table or master materialized view based on the primary key of the affected rows.
- Row ID: The materialized view log records changes to the master table or master materialized view based on the rowid of the affected rows.
- Object ID: The materialized view log records changes to the master object table or master object materialized view based on the object identifier of the affected row objects.
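A hedged sketch of a materialized view log that would support fast refresh of the sales_by_month_mv example above (the table and column names remain assumptions):

-- Log on the master table; recording rowids, a sequence, the referenced columns
-- and new values supports fast refresh of aggregate materialized views.
CREATE MATERIALIZED VIEW LOG ON sales
  WITH ROWID, SEQUENCE (time_id, amount_sold)
  INCLUDING NEW VALUES;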

23. Types of Materialized Views

There are three types of materialized views:
- Read-only materialized views
- Updatable materialized views
- Writeable materialized views

Writeable Materialized Views
A writeable materialized view is one that is created using the FOR UPDATE clause but is not part of a materialized view group. Users can perform DML operations on a writeable materialized view, but if you refresh the materialized view, then these changes are not pushed back to the master and are lost in the materialized view itself. Writeable materialized views are typically allowed wherever fast-refreshable read-only materialized views are allowed.
Advantages:
- They are created with the FOR UPDATE clause without adding the materialized view to a materialized view group. In such a case, the materialized view is updatable, but the changes are lost when the materialized view refreshes.

Read-Only Materialized Views
You can make a materialized view read-only during creation by omitting the FOR UPDATE clause or disabling the equivalent option in the Replication Management tool. Read-only materialized views use many of the same mechanisms as updatable materialized views, except that they do not need to belong to a materialized view group.
Advantages:
- There is no possibility of conflicts, as they cannot be updated.
- Complex materialized views are supported.

Updatable Materialized Views
You can make a materialized view updatable during creation by including the FOR UPDATE clause or enabling the equivalent option in the Replication Management tool. For changes made to an updatable materialized view to be pushed back to the master during refresh, the updatable materialized view must belong to a materialized view group.
Advantages:
- Can be updated even when disconnected from the master site or master materialized view site.
- Requires fewer resources than multimaster replication.
- Is refreshed on demand; hence the load on the network might be reduced compared to using multimaster replication, because multimaster replication synchronises changes at regular intervals.

Query rewrite
The query rewrite facility is totally transparent to an application, which need not be aware of the existence of the underlying materialized view.
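As a hedged illustration (continuing the invented sales_by_month_mv example), the application keeps writing its query against the detail tables, and the optimizer may transparently answer it from the summary:

-- The application queries the base tables as usual...
SELECT t.year, SUM(s.amount_sold) AS yearly_sales
FROM   sales s
       JOIN times t ON s.time_id = t.time_id
GROUP BY t.year;
-- ...with query rewrite enabled, Oracle can satisfy this from sales_by_month_mv
-- by further aggregating the monthly totals, without any change to the application.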
