Chapter 4 - The Extended Logical Data Model

"Too much of a good thing is just right."
Too much of a good thing is just right, but even a little of a bad thing will mess up your data warehouse. We have arrived at building the Extended Logical Data Model (ELDM), which will serve as the input to the Physical Data Model. We are at a critical stage. We must produce excellence in this section because we don't want bad input going into our Physical Data Model. This comes from Computer Science 101: Garbage In, Garbage Out (GIGO). Building a poor Extended Logical Data Model makes about as much sense as Mae West running east looking for a sunset! Things will be headed in the wrong direction. If, however, you can produce quality in your ELDM, your warehouse is well on its way to being just right! This chapter will begin with the Application Development Life Cycle, then talk about the Logical Model and Normalization. From there, we get to the meat and discuss the metrics, a critical part of the Extended Logical Data Model. The ELDM will become our final input into the Physical Data Model.
you get an outstanding Physical Model; fourth, you design the application; and lastly, you perform your development and assurance testing. During the testing phase of an application it is important to check that Teradata is using parallelism, that there are no large spool space peaks, and that AMP utilization is even. A hot AMP is a bad sign, and so is running out of spool.
"He who asks a question may be a fool for five minutes, but he who never asks a question remains a fool forever."
The biggest key to this section is knowledge and not being afraid to ask the right questions. Knowledge about the user environment is vitally important. If you can ask the right questions, you will build a model that maps to the users' needs. In addition, you will be able to deliver a world-class data warehouse that remains cool forever. Here is how this works: the logical modelers will create a Logical Data Model. Then it is up to you to ask the right questions and find out about the demographics of the data. Only then can you build the proper Physical Data Model. Remember: the Logical Data Model is the input to the Extended Logical Data Model, and the Extended Logical Data Model is the input to the Physical Data Model. The Physical Data Model is where denormalization and the advantages of parallelism are determined.
"When you are courting a nice girl an hour seems like a second. When you sit on a red-hot cinder a second seems like an hour. That's relativity."
The first step of the design phase is called the Logical Data Model (LDM). The LDM is a logical representation of tables that reside in the data warehouse database. Tables, rows and columns are the equivalent of files, records and fields in most programming languages. A properly normalized data model allows users to ask a wide variety of questions today, tomorrow, and forever.
The following illustration displays the Employee Table. The columns are EMP, DEPT, LNAME, FNAME and SAL:

EMP | DEPT | LNAME | FNAME | SAL

Notice that each of the four rows in the Employee Table is listed across all of the columns, and each row has a value for each column. A row is the smallest unit that can be inserted into, or deleted from, a table in a data warehouse database.
Primary Keys
"Instead of giving politicians the keys to the city, it might be better to change the locks."
The Primary Key of a table is the column or group of columns whose values uniquely identify each row of that table.

Every table has to have a primary key: Tables are very flexible when it comes to defining how a table's data can be laid out. However, every table must have a primary key, because each row within the table must always be uniquely identifiable.

Every table can have only one primary key: If the table happens to have several possible column combinations that could work as a primary key, only one can be chosen. You cannot have more than one primary key on a table. The smallest group of columns, often just one, is usually the best.

PK means primary key: Primary keys will be marked with the letters PK.
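As a sketch of how a primary key is declared, here is hypothetical DDL for the Employee Table from this chapter; the data types are assumptions for illustration only:

```sql
-- Hypothetical DDL for the Employee Table; column types are assumed.
CREATE TABLE Employee_Table
    ( EMP     INTEGER NOT NULL   -- PK: uniquely identifies each row
    , DEPT    INTEGER            -- FK to the Department Table
    , LNAME   VARCHAR(30)
    , FNAME   VARCHAR(30)
    , SAL     DECIMAL(9,2)
    , PRIMARY KEY (EMP) );
```

Declaring EMP as the primary key guarantees that no two rows can share the same EMP value.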
Once the tables are defined, related, and normalized, the Logical Data Model is complete. The next phase involves the business users, who establish the foundation for creating the Extended Logical Data Model. Read on as we discuss this next step of the Teradata design process.
"It's time for the human race to enter the Solar System."
A data model for an operational system will often differ from the data model for a data warehouse. Operational systems must track everything the business is doing, but data warehouses usually start small and, once the Return On Investment (ROI) is realized, continue to grow. The approach to modeling is the same for an operational system as for a data warehouse's Decision Support System (DSS); the difference lies in the fact that in our DSS warehouse we want to track whatever gives us the most bang for the buck, and once we make money off that we expand our tracking. Management understands that money doesn't grow on trees, and you may only get one chance to fund a data warehouse project. Figuring out what will bring the best ROI as you start the warehouse is extremely important.
When trying to identify an entity, the best thing to do first is establish whether it's a major entity or a minor entity.
"When they discover the center of the universe, a lot of people will be disappointed to discover they are not in it."
A one-to-one relationship is found when each occurrence of one entity (entity A) can be related to only one occurrence of another entity (entity B), or none at all. The same applies when you relate entity B to entity A. This is the rarest form of relationship, and may not be found in most data models you create.
"One's dignity may be assaulted, vandalized, and cruelly mocked, but it cannot be taken away unless it is surrendered."
A one-to-many relationship is found when each occurrence of one entity (entity A) can be related to only one occurrence of another entity (entity B), but each occurrence of entity B can be related to multiple occurrences of entity A. This is a very common relationship and will appear in almost all models. For example, an employee can be assigned to only one department, but a department can have multiple employees assigned to it.
"Cats are smarter than dogs. You can't get eight cats to pull a sled through snow."
A many-to-many relationship is found when each occurrence of one entity (entity A) can be related to many different occurrences of another entity (entity B), and the same applies when you relate entity B to entity A. This is also a very common form of relationship.
"A lot of people approach risk as if its the enemy when its really fortunes accomplice."
Because a Many-to-Many relationship does not have a direct Primary Key-Foreign Key relationship, an associative table is utilized as the middleman. The associative table has a multi-column Primary Key: one of its Primary Key columns is the Primary Key of table A, and the other is the Primary Key of table B. The example below shows how the Student Table is joined to the Course Table via the associative table, the Student Course Table.
"The surprising thing about young fools is how many survive to become old fools."
As you can see from our example, we were able to join our many-to-many related tables via the associative table.
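The joining SQL itself did not survive in this copy; a minimal sketch of the three-table join, with assumed table and column names (Student_ID, Course_ID), would look like this:

```sql
-- Hypothetical names; the associative Student_Course_Table carries
-- both Primary Keys and resolves the many-to-many relationship.
SELECT  S.Student_ID
      , C.Course_ID
FROM    Student_Table AS S
INNER JOIN Student_Course_Table AS SC
        ON S.Student_ID = SC.Student_ID
INNER JOIN Course_Table AS C
        ON SC.Course_ID = C.Course_ID;
```

Each leg of the join follows a Primary Key-Foreign Key relationship, so no direct Student-to-Course key is ever needed.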
"If computers get too powerful, we can organize them into a committee that will do them in."
An attribute is a characteristic of an entity or a relation describing its character, amount or extent.
"This is the sixth book I've written, which isn't bad for a guy who's only read two."
A subset of an entity is a group of occurrences of an entity that have a common interest or quality that separates the group from the rest of the occurrences of that entity. We call the entity as a whole the Superset.
"He's the kind of guy who lights up a room just by flicking a switch."
A dependent of an entity is a noun that can have attributes of its very own; however, it exists only as part of some other entity. That entity is called the parent of the dependent.
"They always say time changes things, but you actually have to change them yourself."

The third scenario of special relations is the time relation. This type of scenario is a relationship between an entity, a subset, or a dependent, and a time value.
"The reputation of a thousand years may be determined by the conduct of one hour."
A normalized data warehouse will have many different tables that are related to one another. Tables that have a relationship can be joined. Most normalized databases will have many tables, each with fewer columns. This provides flexibility and is a natural way for business users to view the business. Each table created will have a Primary Key, and all of the columns in the table should relate to that Primary Key. A Foreign Key is a column in the table that is also a Primary Key in another table. The Primary Key-Foreign Key relationship is how joins are performed between two tables.
Relational Databases use the Primary Key Foreign Key relationships to join tables together.
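For example, here is a sketch using the Employee and Department tables that appear later in this chapter (the exact table names are assumptions):

```sql
-- DEPT is a Foreign Key in the Employee Table and the Primary Key
-- of the Department Table, so the two tables join on DEPT.
SELECT  E.LNAME
      , D.DEPT_NAME
FROM    Employee_Table AS E
INNER JOIN Department_Table AS D
        ON E.DEPT = D.DEPT;
```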
"Never insult an alligator until after you have crossed the river."
Never insult a modeler until after you have the ERWin diagram in hand. Many believe that a normalized model is best, while others argue that dimensional models are better; a combination of both is an excellent strategy. Dimensional Modeling was originally designed for retail supermarket applications because their systems did not have the performance to handle joins, full table scans, aggregations, and sorting. Dimensional Modeling often implements fewer tables and can be adapted to enhance performance, because it was designed around answering specific questions.
The best attributes tend to be textual and discrete. It is wise to label your attributes with real words instead of abbreviations. A typical dimensional attribute will include a short description, long description, brand name, category name, packaging size, packaging type, and several other product characteristics. Not all numeric data has to be put into a dimensional table. For example, the price of certain products is always changing; gasoline is a good case. Because the cost of gasoline is continuously changing, it's possible to consider the price of gasoline not as a constant attribute, but rather as a measured fact. Sometimes there will be data that's questionable. In a case like that, the data can go into either the fact table or the dimensional table, depending on how the designer is developing his dimensional model. Dimensional tables contain rich sets of values, columns, and dimensional attributes that give users the tools to cut and analyze their data every which way. Remember, the role of a dimensional table is to identify and represent hierarchical relationships within a business.
"Just because something doesn't do what you planned it to do doesn't mean it's useless."
Thomas Edison wasn't a big fan of Dimensional Modeling. You just can't string together a few tables and hope the system's fast as lightning. To properly build a dimensional model, a battle plan is needed. Fortunately for us, we have the four-step Dimensional Modeling Process, which enables us to slowly, yet accurately, map out our model. The steps go as follows:
1) Select the business process to model: A process is any business action done by a company that is recorded by its data-collection system. If you're having trouble thinking of actions your company is doing, simply ask around. People whose jobs depend on being able to analyze the data will know exactly what the key business processes are. Other forms of database modeling typically focus on processes by department, rather than processes by business. This is not good: if we create our model by splitting up processes by department, there will be many areas containing duplicated data under different labels. Dimensional modeling tries to limit duplicate data as much as possible. Publish data only once. This will reduce ETL development, and will reduce overhead on data management and disk storage.
2) Define the grain of the business process: Defining the grain means that you're stating exactly what each fact table row represents. This helps convey the level of detail linked to the fact table measurements. A common mistake among a team of database developers is failing to agree on table granularity. If the grain is improperly defined, it will become a major problem in the future. If you find in the next two steps that you've improperly determined the grain, that's OK! It's happened to everyone (yeah, I'm looking at you!). Just go back to step two, come up with a better grain, and go from there.
3) Pick the dimensions that apply to each fact table row: It's imperative that our fact tables are surrounded by a set of dimensions representing all possible descriptions that take on single values within the context of each measurement. For each dimension, every attribute must be listed, no matter how minute or large. Typical dimensions are date, customer, and employee.
"The weak can never forgive. Forgiveness is the attribute of the strong."
Each dimension in a dimensional model is going to have attributes that make that table unique from the rest. Each table will vary, depending on what information is in which table. Attributes will look something like the following:
"People can have the Model T in any color so long as it's black."
The following two pictures represent an Entity-Relationship (ER) Model and a Dimensional Model (DM):
"The man who is swimming against the stream knows the strength of it."
Dimensional Modeling offers a number of warehousing advantages that the ER model lacks. First off, the dimensional model is very predictable. Query tools and report writers can make strong assumptions about the dimensional model to make user interfaces more understandable and processing more efficient. Metadata can use the cardinality of values within a dimension to help guide user-interface behavior. Because the framework is predictable, it strengthens processing: the database engine can make strong assumptions about constraining the dimension tables first and then linking to the fact table all at once, using the Cartesian product of the dimension table keys that satisfy the user's constraints. This approach enables the engine to evaluate arbitrary n-way joins to a fact table in a single pass through the fact table's index. There are also several standard approaches for handling certain modeling situations. Each situation has a well-understood set of alternative decisions that can be programmed into report writers, query tools, and user interfaces. These situations include:
- Slowly changing dimensions, involving dimension tables that change slowly over time. Dimensional modeling provides techniques for handling these slowly changing dimensions, depending on the company.
- Miscellaneous products, where a business needs to track a number of different lines of business within a single common set of attributes and facts.
- Pay-in-advance databases, where the company's transactions are more than small revenue accounts, but the business may want to look at single transactions as well as a regular report of revenue.
The last strength of the dimensional model is the growing pool of DBA utilities and software processes that manage and use aggregates. Remember, an aggregate is summary data that is logically redundant within the data warehouse and is used to enhance query performance. A well-formed strategy for comprehensive aggregates is needed for any medium-to-large data warehouse implementation. Another way to look at it: if you don't have any aggregates, lots of money can end up being wasted on hardware upgrades. Aggregate management software packages and aggregate navigation utilities depend on a specific single structure of the fact and dimension tables, which in turn depends on the dimensional model. If you stick to ER modeling, you will not benefit from these tools.
Because a) not every user can be interviewed during the dimensional model design phase, and b) user requirements vary from group to group, different star schemas will be optimal for different groups of users. A single star-join schema will never fit everyone's requirements in a data warehouse. The data warehousing industry long since discovered that a single database will not work for all purposes. There are several reasons why user groups need their own star-join schemas and can't share a star-join schema with another group:
- Sequencing of data. Finance users love to see data sequenced one way, while marketing users love to see it sequenced another way.
- Data definitions. The sales department considers a sale closed business, while the accounting department sees it as booked business.
- Granularity. Finance users look at things in terms of monthly and yearly revenue. Accounting users look at things in terms of quarterly revenue.
- Geography. The sales department looks at things in terms of ZIP code, while the marketing department might look at things in terms of states.
- Products. Sales tends to look at things in terms of future products, while the finance department tends to look at things in terms of existing products.
- Time. Finance looks at the world through calendar dates, while accounting looks at the world through closing dates.
- Sources of data. One source system will feed one star join, while another source system feeds another.
The differences between business operations are far greater than the short list above. Because there are many types of users on a database, many situations can arise at any time, and on any subject.
- Every star join will contain much of the same data as the others. Star joins can become unnecessarily large when every star join goes back and gathers the same data that the other star joins have already collected.
- The results from each star join will be inconsistent with the results of every other star join. The ability to correctly distinguish the right data from the wrong data will be nearly impossible.
- New star joins will require just as much work to create as the star joins already in the data warehouse. When each star join is built independently, new star joins are built on a data warehouse with no foundation.
- The interface that supports the applications that feed the star joins will become unmanageable.
Because of these conclusions, dimensional modeling as the sole basis for a data warehouse can lead to many problems when multiple star joins are involved. It will never become apparent that there is a problem when you're looking at just one star join; but when a database contains multiple star joins, the limitations of the dimensional model become apparent. In the end, dimensional and 3NF modeling work well together and both have their benefits: dimensional modeling is great for pulling information out of data, while 3NF is great for the quick retrieval of that data. Most warehouses today take advantage of both and find ways to implement both theories in their database.
Follow Socrates' advice and assume you know nothing when starting this portion of the design process. However, if you make a column your Primary Index that is never accessed in the WHERE clause or used to JOIN to other tables, then Socrates had you in mind with the last three words of his quote. It will be your job to interview users, look at applications, and find out what columns are being used in the WHERE clause of the SELECTs. The key here is to investigate which tables are being joined, and which columns join the tables together. You will be able to get some join information from the users, but common sense plays a big part in join decisions. Understand how columns are accessed, and your warehouse is on its way to providing true wisdom!

COLUMN ACCESS in the WHERE CLAUSE:
- How frequently the table will be accessed via this column.
- The number of rows that will be accessed, multiplied by the frequency at which the column will be used.
- How frequently the table will be joined to another table by this column being used in the WHERE clause.
- The number of rows joined.
Quite often, designers new to Teradata believe that selecting the Primary Index will be easy. They just pick the column that will provide the best data distribution. They assume that if they keep the Primary Index the same as the Primary Key column (which is unique by definition), then the data will distribute evenly (which is true). The Primary Index is about distribution, but even more important is join access. Remember this golden rule: the Primary Index is the fastest way to access data, and Secondary Indexes are next.
- Write SQL and utilize a wide variety of tools to get the data demographics.
- Combine these demographics with the column and join access information to complete the Extended Logical Data Model.
- Then use this information to choose the Primary Indexes, Secondary Indexes, and other options in the Physical Database Design.
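As a sketch, the core demographics for a column (here DEPT of a hypothetical Employee_Table) can be gathered with SQL along these lines:

```sql
-- Distinct values and NULL count for the DEPT column.
SELECT  COUNT(DISTINCT DEPT)     AS Distinct_Values
      , COUNT(*) - COUNT(DEPT)   AS Max_Rows_Null  -- rows where DEPT is NULL
FROM    Employee_Table;

-- Rows per value; the top row approximates Maximum Rows per Value,
-- and the overall pattern suggests Typical Rows per Value.
SELECT  DEPT
      , COUNT(*) AS Rows_Per_Value
FROM    Employee_Table
GROUP BY DEPT
ORDER BY Rows_Per_Value DESC;
```

Change Rating, by contrast, comes from knowing the application, not from SQL.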
DATA DEMOGRAPHICS:
Distinct Values: The total number of unique values that will be stored in this column.
Maximum Rows per Value: The number of rows that will have the most popular value in this column.
Typical Rows per Value: The typical number of rows for each column value.
Maximum Rows Null: The number of rows with NULL values for the column.
Change Rating: A relative rating for how often the column value will change. The range is 0-9, with 0 describing columns that never change and 9 describing columns that change with every write operation.
DATA DEMOGRAPHICS (Employee Table example)

                        EMP    DEPT   LNAME  FNAME  SAL
Distinct Values         20K    5K     12K    N/A    N/A
Max Rows per Value      1      50     1K     N/A    N/A
Max Rows Null           0      12     0      N/A    N/A
Typical Rows per Value  1      15     3      N/A    N/A
Change Rating           0      2      1      N/A    N/A
Once we have our table templates we are ready for the Physical Database Design. Read on, this is becoming interesting!
"Nothing can stand against persistence; even a mountain will be worn down over time."
We have arrived at the moment of truth. We are arriving at the top of the mountain. It is now time to create the Physical Database Design model. The biggest keys to a good physical model are choosing the correct:

- Primary Indexes
- Secondary Indexes
- Denormalization Tables
- Derived Data Options

The physical model is important because it is the piece that makes Teradata perform at the optimum level on a daily basis. If you have done a great job with the physical model, Teradata should perform like lightning. If you have done the job on the physical model and Teradata is still not performing at your anticipated speed, then you might want to get an upgrade. Remember, Teradata needs to be designed to perform best on a daily basis. You don't justify an UPGRADE because of a slow year-end or quarter-end report! You justify an upgrade when you have done due diligence on the physical model and have reached the point where your system is not performing well on a daily basis.
No two minds are alike, but two tables are usually joined in exactly the same manner by everybody. This is why the most important factor when picking a Primary Index is Join Access Frequency.
You still need to check whether the index will distribute well; with poor distribution you are at risk of running out of Perm or Spool. Then you must look at Value Access Frequency to see if the column is accessed frequently. If it is, it is a great candidate for a primary or secondary index.
- If the distribution for a column is BAD, it has already been eliminated as a Primary Index candidate.
- If you have a column with high Join Access Frequency, this is your Primary Index of choice.
- If you have a column with high Value Access Frequency, you can always create a secondary index for it.
- If no join columns survived the distribution analysis, pick the column with the best value access as your Primary Index.
- If all of the above fail, or two columns are equally important, pick the column with the best distribution as your Primary Index.
Denormalization
"Most databases denormalize because they have to, but Teradata denormalizes because it wants to."
Denormalization is the process of implementing methods to improve SQL query performance. The biggest factors to consider when deciding to denormalize a table are PERFORMANCE and VOLATILITY: will performance improve significantly, and does the volatility factor make denormalization worthwhile? It is in the physical model that you determine the places to denormalize for extra speed. Before you go crazy denormalizing, address these considerations for optimal system performance first, or you are wasting your time. Make sure statistics will be collected properly. Then make sure you have chosen your Primary Indexes based on user ACCESS, on UNIQUENESS for distribution purposes, and on the columns being stable values (low Change Rating). Also, know your environment and your business priorities. For example, in OLTP environments the performance benefits of secondary indexes most often outweigh the performance costs of batch maintenance. Improved performance is an admirable goal; however, one must be aware of the hazards of denormalization. Denormalization will always reduce the amount of flexibility that you have with your data and can complicate the development effort. In addition, it will increase the risk of data anomalies. Lastly, it can also take on extra I/O and space overhead in some cases. Others believe that denormalization has a positive effect on application coding because it can reduce the potential for data problems.
Either way, you should consider denormalization if users run certain queries over and over again and speed is a necessity. The key word here is performance; performance for known queries is the most complete answer. Whenever you denormalize from your logical model, it is a great idea to include the denormalization in "The Denormalization Exception Report". This report keeps track of all deviations from 3rd normal form in your data warehouse.
Derived Data
Derived data is data that is calculated from other data. For instance, averaging all of the employees' salaries would calculate the Average Employee Salary. It is important to be able to determine whether it is better to calculate derived data on demand or to place the information into a summary table. The four key factors for deciding whether to calculate or store stand-alone derived data are:
- Response Time Requirements
- Access Frequency of the request
- Volatility of the column
- Complexity of the calculation
Response Time Requirements: Derived data can take a period of time to calculate while a query is running. If user requirements demand speed and requests are taking too long, then you might consider denormalizing to speed up the request. If there is no need for speed, then be formal and stay with normal.

Access Frequency of the request: If one user needs the data occasionally, calculate on demand; but if there are many users requesting the information daily, then consider denormalizing so many users can be satisfied.

Volatility of the column: If the data changes often, then there is no reason to store the data in another table or temporary table. If the data never changes and you can run the query one time and store the answer for a long period, then you may want to consider denormalizing. If the game stays the same, there is no need to be formal; make it denormal.

Complexity of the calculation: The more complex the calculation, the longer the request may take to process. If the calculation takes a long time and you don't have the time to wait, you might consider placing the result in a table.

When you look at the above considerations, you begin to see a clear picture: if there are several requests for derived data, and the data is relatively stable, then denormalize and make sure that when any additional requests are made, the answer is ready to go.
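As a sketch of storing derived data (the summary table name and columns here are assumptions for illustration):

```sql
-- Hypothetical summary table holding pre-calculated derived data.
CREATE TABLE Avg_Salary_By_Dept
    ( DEPT     INTEGER
    , Avg_Sal  DECIMAL(9,2) )
PRIMARY INDEX (DEPT);

-- Populate it once; users then read the stored answer
-- instead of recalculating the average on every request.
INSERT INTO Avg_Salary_By_Dept
SELECT  DEPT
      , AVG(SAL)
FROM    Employee_Table
GROUP BY DEPT;
```

Because salaries change slowly, the low volatility makes this a good candidate for storage rather than on-demand calculation.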
Temporary Tables
Setting up Derived, Volatile, or Global Temporary tables allows users to use a temporary table during their entire session. This is a technique where everyone wins. A great example might be this: let's say you have a table with 120,000,000 rows. Yes, the number is 120 million rows. It is a table that tracks detail data for an entire year, and you have been asked to run calculations on a per-month basis. You can create a temporary table, insert only the month you need to calculate, and run queries until you log off the session. In theory, your queries will run twelve times faster. After you log off, the data in the temporary table goes away.
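A sketch of that idea in Teradata syntax (the detail table and column names are assumptions):

```sql
-- Hypothetical yearly detail table with a Sale_Date column;
-- copy only January into a session-local volatile table.
CREATE VOLATILE TABLE January_Detail AS
    ( SELECT *
      FROM   Yearly_Detail
      WHERE  EXTRACT(MONTH FROM Sale_Date) = 1 )
WITH DATA
ON COMMIT PRESERVE ROWS;

-- Queries against January_Detail now scan one month, not twelve.
```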
TABLE 1 - Employee Table

EMP(PK)  DEPT(FK)  LNAME     FNAME   SAL
1        40        BROWN     CHRIS   65000.00
2        20        JONES     JEFF    70000.00
3        20        NGUYEN    XING    55000.00
4        40        JACOBS    SHERRY  30000.00
5        10        SIMPSON
6        30        HAMILTON  LUCY

TABLE 2 - Department Table

DEPT(PK)  DEPT_NAME
10        Human Resources
20        Sales
30        Finance
40        Information Technology
DEPT  Count_Sal
10    1
20    2
30    1
40    2
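The result above would come from an aggregate query along these lines (a sketch; the original SQL did not survive in this copy):

```sql
-- Count the salaries in each department of the Employee Table.
SELECT  DEPT
      , COUNT(SAL) AS Count_Sal
FROM    Employee_Table
GROUP BY DEPT
ORDER BY DEPT;
```

Note that COUNT(SAL) counts only non-NULL salaries, which is why counting a column can differ from counting rows.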
INSERT INTO Sales_Report_vt
SELECT  Sale_Date
      , SUM(Daily_Sales)
      , AVG(Daily_Sales)
      , MAX(Daily_Sales)
      , MIN(Daily_Sales)
FROM    Sales_Table
GROUP BY Sale_Date;
The CREATE statement of a volatile table has a few options that need further explanation. The LOG option indicates there will be transaction logging of before-images. ON COMMIT PRESERVE ROWS means that at the end of a transaction the rows in the volatile table are not deleted; the information in the table remains for the entire session. Users can run queries against the volatile table until they log off. Then the table and its data go away.
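The CREATE statement being described did not survive in this copy; a sketch consistent with the INSERT above would be:

```sql
-- Sketch of the volatile-table DDL the options above describe.
CREATE VOLATILE TABLE Sales_Report_vt, LOG
    ( Sale_Date DATE
    , Sum_Sale  DECIMAL(9,2)
    , Avg_Sale  DECIMAL(7,2)
    , Max_Sale  DECIMAL(7,2)
    , Min_Sale  DECIMAL(7,2) )
PRIMARY INDEX (Sale_Date)
ON COMMIT PRESERVE ROWS;
```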
CREATE GLOBAL TEMPORARY TABLE Sales_Report_gt, LOG
    ( Sale_Date DATE
    , Sum_Sale  DECIMAL(9,2)
    , Avg_Sale  DECIMAL(7,2)
    , Max_Sale  DECIMAL(7,2)
    , Min_Sale  DECIMAL(7,2) )
PRIMARY INDEX (Sale_Date)
ON COMMIT PRESERVE ROWS;

Now that the Global Temporary Table has been created, the table must be populated with an INSERT/SELECT statement like the following:

INSERT INTO Sales_Report_gt
SELECT  Sale_Date
      , SUM(Daily_Sales)
      , AVG(Daily_Sales)
      , MAX(Daily_Sales)
      , MIN(Daily_Sales)
FROM    Sales_Table
GROUP BY Sale_Date;