
A TECHNICAL ARCHITECTURE FOR THE DATA WAREHOUSE
European Oracle User Group, Firenze, Italia, 1995

David Walker - Consultant
Data Management & Warehousing, Wokingham, United Kingdom

Summary

The paper aims to set out a brief description of the process of implementing a Data Warehouse and then to look in more detail at one particular aspect of the process, namely the Technical Architecture. Whilst it cannot be exhaustive, it does set out a model that can be used as a guide in the implementation of a system.

Introduction

When an organisation decides to build a Data Warehouse there are four key elements to the process. These elements are:

The Business Analysis

A process of gathering together the users of information (not the IT department, but departments such as marketing, sales, finance, customer service, etc.) and asking them what their goals are and how they measure them. The measurements are the Key Performance Indicators (KPI) and will eventually become the answers required from the Data Warehouse. The constraints that the users place on these KPI, such as time, location, etc., are the dimensions of the Data Warehouse. This basic design allows users to formulate natural language questions against a relational database [for example: What is the total sales (KPI) and average sales (KPI) by region (dimension) over the past two years (dimension)?]

Obtaining a common understanding between departments of the KPI and dimensions will in itself prove a challenge and is normally approached as a series of workshops. Users should understand that the Data Warehouse is an iterative design and should therefore not expect answers to all their questions on the first pass. It is also likely that the information required from the system will change over time, as decisions based on information that has become available will change the course of the business and require different KPI to support further decisions.
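By way of illustration of how such a KPI-and-dimension question maps onto a relational database, the sketch below expresses it in SQL against a simple star schema; the table and column names (SALES_FACT, REGION_DIM, TIME_DIM and their columns) are invented for the example rather than taken from any particular system.

    -- The example question: total and average sales by region over the past two years.
    SELECT r.region_name,
           SUM(s.sales_value) total_sales,
           AVG(s.sales_value) average_sales
    FROM   sales_fact s, region_dim r, time_dim t
    WHERE  s.region_id   = r.region_id
    AND    s.julian_date = t.julian_date
    AND    t.calendar_year >= 1993        -- "the past two years", relative to 1995
    GROUP BY r.region_name;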


The Database Schema Design

The next stage of the process is to take the requirements identified by the users and design a schema that will support the queries. By taking the questions identified by the users and looking at the data available on the existing systems, a mapping can be produced. This will inevitably show a mismatch between the requirement and the information available. Decisions then have to be made either to supplement this from external sources, to modify internal sources to provide such information, or to omit the information until a later iteration.

The inevitable consequence of this approach is that one, or a number of, star schemas will be derived. These are schemas that have a large central (or fact) table containing the KPI and foreign keys to a number of smaller dimension tables. Each dimension table has a relatively low number of rows but a large number of (highly indexed) columns that are to some extent de-normalised. A user may see a table called TIME that has columns Day of Year, Day of Week, Day of Month, Month, Week Number, Year and Julian Date. For five years of information this will contain only 1825 (5*365) rows. The primary key will be the Julian Date. This dimension will be one of perhaps eight that relate to the large central table. If each of these dimensions contains 2000 rows, then the central table could have up to 2.56 x 10^26 (2000^8) rows. In practice there is a high level of sparsity (no sales on Saturday or Sunday would reduce the TIME dimension by nearly 30% alone) and it is therefore practical to build systems that support the design. Each star schema can be considered a Data Mart for one or more departments. These star schemas may also contain pre-calculated summaries, averages, etc.

The schema design must also include Meta Data, or data about data. This will allow users to interrogate the database to discover where a particular field is and, if appropriate, how it is derived (for example, how was the gross profit calculated, which system provided the stock levels, etc.). The Meta Data can also be used to enforce security on the system by determining what a user can or cannot see.
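A minimal sketch of such a star schema is given below, reusing the illustrative names introduced earlier (SALES_FACT, REGION_DIM, TIME_DIM); it shows the central fact table and only two of its dimensions.

    -- Illustrative star schema: a central fact table and two dimension tables.
    CREATE TABLE time_dim (
        julian_date   NUMBER PRIMARY KEY,
        day_of_year   NUMBER,
        day_of_week   NUMBER,
        day_of_month  NUMBER,
        week_number   NUMBER,
        month_number  NUMBER,
        calendar_year NUMBER
    );

    CREATE TABLE region_dim (
        region_id   NUMBER PRIMARY KEY,
        region_name VARCHAR2(40)
    );

    CREATE TABLE sales_fact (
        julian_date NUMBER REFERENCES time_dim (julian_date),
        region_id   NUMBER REFERENCES region_dim (region_id),
        sales_value NUMBER(12,2),   -- the KPI
        sales_count NUMBER
    );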


The Technical Architecture

This is the substantive part of the paper and is covered in detail below. At this stage it is sufficient to say that the technical architecture is the design of a practical method to build, maintain and secure the Data Warehouse. As the system is rolled out it will grow in importance and will become mission critical. Whilst not required on a twenty-four hours a day, seven days a week basis, it will need to be available to run very large, and possibly long running, queries. Other issues will include backup and recovery, response, accuracy and presentation.

Project Management

As has already been stated, the building of a Data Warehouse is both an interdepartmental and an iterative undertaking. It is necessary for the process to have complete buy-in from management and to have strong project management to maintain the momentum over the long term. Not only does the project manager have to deal with the initial implementation, but he or she must also build the infrastructure necessary to keep the project rolling over the longer term. A failure to achieve this is more often the cause of the collapse of a Data Warehouse project than the technical hurdles encountered along the way.

The Technical Architecture

[Figure: the technical architecture. From top to bottom: a Presentation Layer of EIS, Decision Support Systems and Third Party Tools; Middleware; the Data Warehouse itself, comprising the Meta Data, the Transaction Repository Data and Security; the Data Acquisition layer; and the source systems (Operational Systems, OLTP Systems, Legacy Systems and External Data Sources).]


Data Acquisition

As can be seen at the bottom of the diagram, there are a number of different types of source system. The first problem is not only to identify the sources, but to map and extract the data required from each of the systems. This may require complex mappings to obtain all the information. Data acquisition is made up of a number of components, each of which is required to ensure the successful loading of data. In some cases (such as data loading via SQL*Net) a number of these components are performed together.

Data Extraction: Extraction, Transformation, Collation, Migration
Data Load: Loading, Exception Processing, Quality Assurance, Publication

Extraction

This is the physical process of extracting the data from the source system, either to flat files or to tapes.

Transformation

This is the action required to change codings into common formats. A common problem is that the support desk customer id may be different from that of the finance or marketing database. In order for the data to be accessible by all, it has to be converted into a common format.

Collation

Once a file has been extracted and transformed it can be collated with other files. It may be advantageous at this stage to pre-sort some of the data.

Migration

This is the moving of the data from the source system to the Data Warehouse. This can take place either via a network or by tape loads. Networking issues arise if the volumes to be moved cannot be supported by the available bandwidth. If tapes are used then there are issues of media compatibility and reliability, and of the ability of operators to consistently load the tapes into the system in the correct order.
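As a sketch of the transformation step, the statement below assumes a raw staging table of support desk records and a hypothetical cross-reference table mapping the support desk customer id onto the common warehouse customer key; all of the names are illustrative.

    -- Recode the support desk customer id to the common warehouse key
    -- using a cross-reference table built for the purpose (hypothetical).
    INSERT INTO stage_calls_common (customer_key, call_date, call_duration)
    SELECT x.warehouse_customer_key,
           s.call_date,
           s.call_duration
    FROM   stage_calls_raw s,
           customer_xref   x
    WHERE  s.helpdesk_customer_id = x.helpdesk_customer_id;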


Loading

Once obtained, the data must be loaded into the system. This may require the use of Oracle SQL*Loader. Consideration needs to be given to whether the parallel and/or direct load options can be supported, with the requirement to re-build the indexes afterwards. Another consideration is whether users will be on the system whilst the load takes place, and, where aggregates have been used, how they are re-calculated after loading.

Exception Processing

Despite the best efforts of the IT department there will, after loading, be exceptions. Examples of these range from the simple (a new customer key was not inserted and therefore all transactions associated with that customer have not yet been loaded) to the complex (a particular branch office was unable to generate its load file, but all other branches have. How are the gaps accounted for?). This process must handle both the expected and the unexpected.

Quality Assurance

Once the data has been loaded it is necessary to perform some form of Quality Assurance on it. The load and exception processing may have all succeeded, but how good is the data? Some key indicators need to be checked after each load to ensure that they are within the expected norms (i.e. sales in Central London did not rise by 700% overnight). This can normally be achieved by setting high and low bounds on KPI; the exceptions are then manually checked to see if they are valid and, if so, the bounds are changed. If not, the changes either need to be fixed or rolled back.

Publication

After the completion of the Quality Assurance, end users need to be notified of the state of the database. This can be achieved in a number of ways, from a mail message or message of the day to an automatic initial query that returns the status of the system. These messages should cover whether the Data Warehouse is available or unavailable, the date of the last load and the percentage of feeds used (e.g. if one branch of twenty failed to load and a statistical process was used to populate the data, then the data might be described as available with 95% source information included). This information is important to the users as it may affect the way in which they use the outputs from the system.
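As a sketch of how the Quality Assurance bounds might be checked, the query below assumes two hypothetical tables: DAILY_KPI, populated by the load, and KPI_BOUNDS, holding the expected high and low values for each indicator.

    -- Flag any KPI from the latest load that falls outside its expected bounds.
    SELECT k.kpi_name,
           k.kpi_value,
           b.low_bound,
           b.high_bound
    FROM   daily_kpi  k,
           kpi_bounds b
    WHERE  k.kpi_name  = b.kpi_name
    AND    k.load_date = TRUNC(SYSDATE)
    AND   (k.kpi_value < b.low_bound OR k.kpi_value > b.high_bound);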


Data acquisition is performed in two stages. The first is the initial data load. This can be managed effectively either by a series of bespoke programs that load data directly into the database (for example PL/SQL routines reading from other Oracle databases, or from databases accessible via SQL*Net and the Oracle Gateways), or by one of the growing number of tools on the market designed for this purpose, which include Platinum InfoPump, Prism Warehouse Manager and HP Intelligent Warehouse. This exercise can be considered a one-off effort and, whilst it is desirable to re-use the technology for the second stage, the differences in volumes and change control may not make this possible.

The second stage is the day to day updating of the system. It is unlikely that a single tool will be able to perform this task entirely. However, one should try to use a small number of tools rather than build a complex set of bespoke code to perform the data acquisition. There are two reasons for this. The first is the changing dynamics of the systems supplying the information. It is unlikely that the source systems will remain static; indeed they are likely to have changed in some form before the end of the first phase of the Data Warehouse project. Therefore a tool that allows the Warehouse Manager to update the relationships between the source and target entities without having to re-craft large code segments is desirable. The second reason is the speed of development that is required. It is unacceptable to say that the Data Warehouse will not be updated for six weeks whilst new COBOL extraction code is written and tested. Thus traditional software development techniques do not meet the needs of the Data Warehouse.

The Transaction Repository

The Transaction Repository is the core of the system. This is where the data that has been loaded is held. The data here is the lowest level that will be accessible by the users. The volumes, as has already been described, can be very large. A 250 million row table with 100 bytes per row will be in excess of 30Gb, and this is without any indexes, rollback space or other associated database requirements. Further drains on disk resources will include the use of mirroring (doubling the disk space requirement) or RAID technologies (which can add up to 20% to the disk space requirement). With a large Data Warehouse this may start to cause a problem, as the upper bound of most large UNIX systems (for example Sequent, Pyramid, HP and IBM) is around 3.5Tb of raw disk. This may also not all be usable, due to system design considerations such as channel bandwidth to the disk array or slots available in the system cabinets. The total amount of disk space may be expanded by the use of external mass storage devices supplied by companies such as EMC, DG (Clariion), etc.

An alternative strategy is to distribute the Data Warehouse over a number of systems. One might build a system that has a fact table and its associated dimensions on a single machine, with the next fact table residing on another machine; if they shared any dimension tables then Oracle Replication Server would be used to ensure consistency between the copies. A third solution may be to build a completely distributed system and then use a tool such as MicroStrategy DSS Server or HP Intelligent Warehouse to co-ordinate the query. This provides users with a single entry point, but allows the back-end power of a distributed system.

It is important to ensure that, in the rush to build a disk system large enough to hold all the data, one does not preclude processing power. Often SMP architectures can take either processor or disk controller cards up to a combined limit, whilst MPP architectures may need to dedicate nodes to managing the virtual disk array presented to the other nodes. Finally, it is important to realise that where clustering technologies and MPP systems are utilised there is an overhead involved in returning information from the nodes to the processor that spawned the parallel query. This means that simply adding nodes does not mean linear growth in system capacity.

The performance of the Data Warehouse can now also be dramatically improved by the use of bit-mapped and hash indexes, pre-joined tables and the parallel query option. These technologies, whilst very powerful, are not panaceas and can only be used in a controlled fashion within the Data Warehouse.
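As a brief sketch of these options, the statements below create a bit-mapped index on a low-cardinality key of the illustrative fact table and request the parallel query option for a query against it; the degree of parallelism shown is arbitrary.

    -- A bit-mapped index on a low-cardinality column of the fact table.
    CREATE BITMAP INDEX sales_fact_region_bix ON sales_fact (region_id);

    -- A query requesting the parallel query option (degree of 8 chosen arbitrarily).
    SELECT /*+ PARALLEL(s, 8) */
           t.calendar_year,
           SUM(s.sales_value) total_sales
    FROM   sales_fact s, time_dim t
    WHERE  s.julian_date = t.julian_date
    GROUP BY t.calendar_year;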


Data Aggregation

It is possible to improve the performance of the system by the use of Data Aggregation. This allows often-used bands of data (for example quarterly figures) to be pre-calculated and stored in the database. The level of aggregation may also be matched to the level of user. For example a CEO will want to have a high level view (traditionally associated with EIS systems) and drill down into a lower level of data (traditional Decision Support Systems). However, the CEO's DSS system may be the entry point for the managers who work under him, and they drill down from the DSS level into the transaction repository to track individual items or trends on a lower scale.

The use of aggregates, whilst improving overall speed, may in itself occupy large amounts of disk space. For example a schema with three dimensions (Region 50 rows, Customer 10,000 rows and Time 365 rows) will have a fact table of total sales with up to 182.5 million (50*10,000*365) rows. If data is now stored not only for each day, but also for 52 weeks, 12 months and 4 quarters, the total size rises to 216.5 million (50*10,000*{365+52+12+4}) rows, a growth of nearly 20% in the total database size. This does not take into account any aggregations of the other two dimensions, which will also cause large amounts of growth in the database size.

It is therefore possible to produce a system that contains Data Marts, suitable for use by front end tools, which query against a smaller database made up of the required aggregates and submit jobs in batch to the transaction repository, where they run over a much longer time on a separate system. This allows users to get good local response for a large proportion of the queries and gives them an understanding that certain types of highly detailed query will result in batch jobs. If queries are monitored then it is possible to change the aggregation strategy over a period of time. This is because, once the system is implemented, the users will start to manipulate the data in ways that were not envisaged in the original design.
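A minimal sketch of how one such aggregate might be built from the illustrative schema used earlier is shown below; SALES_MONTH_AGG is a hypothetical summary table that the front end tools would query in place of the detailed fact table.

    -- Build a pre-calculated monthly aggregate from the detailed fact table.
    CREATE TABLE sales_month_agg AS
    SELECT s.region_id,
           t.calendar_year,
           t.month_number,
           SUM(s.sales_value) total_sales,
           AVG(s.sales_value) average_sales
    FROM   sales_fact s, time_dim t
    WHERE  s.julian_date = t.julian_date
    GROUP BY s.region_id, t.calendar_year, t.month_number;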


Meta Data & Security

The value of the Data Warehouse is defined by the ability of end users to obtain business benefit from the information they glean from the system. This requires an understanding of what data is there and, more importantly, how it was derived. Different departments, even different individuals within a department, may understand the information in different ways. When the system returns an answer that is unexpected, the user must be able not only to view the results but to pursue the unexpected element until they are satisfied with its derivation. The Meta Data also provides a way for new and interesting perspectives to be obtained on the data that is in the system. For example an insurance underwriter may be able to combine factors such as location, weather and car type as a basis for new underwriting decisions, whereas historically the information, whilst held within the organisation, may not have been accessible from a single point and would have required too much effort to collate.

The value of this type of cross-functional fertilisation raises the problem of security. The Data Warehouse is at its best when queries range across the system, bringing back a new perspective. However, there are internal problems (for example sales quotas between different departments being confidential) and external problems (for example key queries being run by an employee who then leaves to join a competitor). There is no easy answer to this, but defining access rights against the Meta Data does enable some level of security to be introduced. It will be a matter of balancing the needs of the company as a whole against the risks of allowing such broad access to staff. The outcome of this challenge will depend on the culture of the organisation and in particular the co-operation of end user managers.

Meta Data: Master schema; Star schema; Star schema description; Table; Table description; Table row count; Column; Column description; Column derivation; Column format

Security: Control of user access to the data
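As a sketch of how this Meta Data might be held, and of how access rights can be defined alongside it, the statements below are purely illustrative; the table, role and grants are invented for the example.

    -- An illustrative Meta Data table describing the columns of the warehouse.
    CREATE TABLE meta_column (
        schema_name        VARCHAR2(30),
        table_name         VARCHAR2(30),
        column_name        VARCHAR2(30),
        column_description VARCHAR2(240),
        column_derivation  VARCHAR2(240),  -- e.g. how gross profit is calculated
        column_format      VARCHAR2(30)
    );

    -- Access control: a hypothetical role that may read the sales fact table
    -- and the Meta Data describing it, but nothing else.
    CREATE ROLE marketing_user;
    GRANT SELECT ON sales_fact  TO marketing_user;
    GRANT SELECT ON meta_column TO marketing_user;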


Middleware and Presentation

The wide variety of tools on the market makes it almost impossible to make any sensible comment in such a small space. What is clear is that there are a number of different types of query tool. These can be classified into Summary Query, Trawling (or Mining), Simple Data Access and Data Manipulation. Some of these tools will already exist in the organisation (often spreadsheets and, in larger organisations, SAS). Added to this are the users' requirements for how they want the data to be presented (for example graphically, text with dimensions, etc.).

The only way to choose a tool is to build up a shortlist from current market literature (the Internet and magazines provide countless references), based on the users' stated requirements and on more traditional constraints such as cost and support. Most vendors are more than happy to provide evaluation and trial copies which end users can then evaluate for themselves. It is certain that individual preference and differing requirements will mean that no one tool will suffice, and given the cost of PC tools in general one can afford to allow a number to be tried and then accepted or rejected.

The choice of middleware is somewhat dependent on the final choice of tool; however, there are two clear alternatives within the Oracle arena: SQL*Net and ODBC. Tools such as the HP Intelligent Warehouse will also provide their own connectivity, routing the query through the query manager or co-ordinator. The choice of middleware should be limited to one or two, forcing the rejection of front end tools that do not meet this requirement.

Conclusion

The Data Warehouse is an end user tool which the IT department must build and support. It is a dynamic, iterative, long-term project that requires different techniques from those traditionally employed by information systems professionals. Staff are required to work in cross-functional teams that will discover business critical facts that may be the key to the success of your employer (and consequently you) in the longer term. The scale of the infrastructure required to support such a venture cannot and should not be underestimated. It is certain that the Data Warehouse will involve larger volumes of data and new querying techniques that will not have been envisaged at the initial design phase. It is therefore critical that the system is built with a flexible and scalable technical architecture designed to meet the unexpected.
