
Chapter 4
The Extended Logical Data Model

"Too much of a good thing is just right."
Too much of a good thing is just right, but even a little of a bad thing will mess up your data warehouse. We have arrived at building the Extended Logical Data Model. The Extended Logical Data Model will serve as the input to the Physical Data Model. We are in a critical stage. We must produce excellence in this section because we don't want bad input going into our Physical Data Model. This comes from Computer Science 101: Garbage In, Garbage Out (GIGO). Building a poor Extended Logical Data Model makes about as much sense as Mae West running east looking for a sunset! Things will be headed in the wrong direction. If, however, you can produce quality in your ELDM, your warehouse is well on its way to being just right! This chapter will begin with the Application Development Life Cycle, and then talk about the Logical Model and Normalization. From there, we get to the meat and discuss the metrics, which are a critical part of the Extended Logical Data Model. The ELDM will become our final input into the Physical Data Model.

The Application Development Life Cycle

"Failure accepts no alibis. Success requires no explanation."


The design of the physical database is key to the success of implementing a data warehouse along with the applications that reside on the system. Whenever something great is built, whether a building, an automobile, or a business system, there is a process that facilitates its development. When designing the physical database, that process is known as the Application Development Life Cycle. If you follow the process you will need no alibis because success will be yours. The six major phases of this process are as follows:

- Design - Developing the Logical Data Model (LDM), Extended Logical Data Model (ELDM), and the Physical Data Model (PDM).
- Development - Generating the queries according to the requirements of the business users.
- Test - Measuring the impact that the queries have on system resources.
- Production - Following the plan and moving the procedures into the production environment.
- Deployment - Training the users on the system application and query tools (i.e., SQL, MicroStrategy, Business Objects, etc.).
- Maintenance - Re-examining strategies such as loading data and index selection.

It is important that you understand the fundamentals of the application development life cycle and the order in which to perform them. First there is the Business Discovery, second the Logical Data Model, third an outstanding Physical Model, fourth the Application Design, and lastly you perform your Development and Assurance Testing. During the testing phase of an application it is important to check that Teradata is using parallelism, that there are no large spool space peaks, and that AMP utilization is even. A HOT AMP is a bad sign, and so is running out of Spool.
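One common way to spot a HOT AMP during testing is to compare per-AMP space for a table using the DBC.TableSize dictionary view. The sketch below is only an illustration; the database name is a placeholder and exact view names can vary slightly by Teradata release.

SELECT  DatabaseName
       ,TableName
       ,SUM(CurrentPerm)  AS Total_Perm
       ,MAX(CurrentPerm)  AS Largest_AMP
       ,AVG(CurrentPerm)  AS Average_AMP
       ,MAX(CurrentPerm) / NULLIFZERO(AVG(CurrentPerm)) AS Skew_Ratio
FROM DBC.TableSize
WHERE DatabaseName = 'Sales_DB'      /* hypothetical database name */
GROUP BY 1, 2
ORDER BY Skew_Ratio DESC;

A Skew_Ratio well above 1 means one AMP holds far more of the table than the average AMP - the HOT AMP warning sign described above.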

Asking the Right Questions

"He who asks a question may be a fool for five minutes, but he who never asks a question remains a fool forever."
The biggest key to this section is knowledge and not being afraid to ask the right questions. Knowledge about the user environment is vitally important. If you can ask the right questions, you will build a model that maps to the users' needs. In addition, you will be able to deliver a world-class data warehouse that remains cool forever. Here is how this works. The logical modelers will create a logical data model. Then it is up to you to ask the right questions and find out about the demographics of the data. Only then can you build the proper Physical Data Model. Remember: The Logical Data Model will be the input to the Extended Logical Data Model. The Extended Logical Data Model will be the input to the Physical Data Model. The Physical Data Model is where Denormalization and the advantage of parallelism are determined.

Logical Data Model

"When you are courting a nice girl an hour seems like a second. When you sit on a red-hot cinder a second seems like an hour. Thats relativity."
The first step of the design phase is called the Logical Data Model (LDM). The LDM is a logical representation of tables that reside in the data warehouse database. Tables, rows and columns are the equivalent of files, records and fields in most programming languages. A properly normalized data model allows users to ask a wide variety of questions today, tomorrow, and forever.
The following illustration displays the Employee Table. The columns are emp, dept, lname, fname and sal:

Employee Table

EMP (PK)   DEPT (FK)   LNAME     FNAME     SAL
1          40          BROWN     CHRIS     95000.00
2          20          JONES     JEFF      70000.00
3          20          NGUYEN    XING      55000.00
4          10          JACOBS    SHERRY    34000.00

Notice that each of the four rows in the Employee Table is listed across all of the columns, and each row has a value for each column. A row is the smallest unit that can be inserted into, or deleted from, a table in a data warehouse database.

Primary Keys

"Instead of giving politicians the keys to the city, it might be better to change the locks."
The Primary Key of a table is the column or group of columns whose values will identify each row of that table.

Every table has to have a primary key: Tables are very flexible when it comes to defining how a table's data can be laid out. However, every table must have a primary key, because each row within that table must always be uniquely identifiable.

Every table can only have one primary key: If the table happens to have several possible combinations of columns that could work as a primary key, only one can be chosen. You cannot have more than one primary key on a table. The smallest group of columns, often just one, is usually the best.

PK means primary key: Primary keys will be marked with the letters PK.

Foreign Keys

"Life is a foreign language; all men mispronounce it."


A foreign key is a column or group of columns that is a primary key in another table. Foreign keys help to relate a group of rows to another group of rows. Both groups of rows are required to have a common column containing like data so that they can match up with each other, and these groups of rows can be found in the same table or in multiple tables. FK means foreign key: When drawing out the design of your tables, FK will stand for your foreign keys. They can also be numbered so that you can properly mark multi-column foreign keys (FK1, FK2, etc.).

Primary Key - Foreign Keys Establish a Relationship


"One of the keys to happiness is bad memory." Primary Key Foreign Key relationships establish a relation therefore the tables can be joined. The picture below shows the joining of the Employee Table and the Department Table.

Normalization

"The noblest search is the search for excellence."


The term normalizing a database came from IBM's Dr. Codd. He was intrigued by Nixon's phrase about trying to normalize relations with China, so he began his own search for excellence to normalize relations between tables. He called the process normalization. Most modelers believe the noblest search is the search for perfection, so they often take years to perfect the data model. This is a mistake. When creating a logical data model, don't strive for perfection; aim for excellence. And do it quickly!

Normalizing a database is a process of placing columns that are not key (PK/FK) columns into tables in such a way that flexibility is utilized, redundancy is minimized, and update anomalies are vaporized! Industry experts consider Third Normal Form (3NF) mandatory; however, it is not an absolute necessity when utilizing a Teradata Data Warehouse. Teradata has also been proven to be extremely successful with Second Normal Form implementations. The first three normal forms are described as follows:

1. First Normal Form (1NF) eliminates repeating groups; for each Primary Key there is only one occurrence of the column data.
2. Second Normal Form (2NF) is when the columns/attributes relate to the entire Primary Key, not just a portion of it. All tables with a single-column Primary Key are considered second normal form.
3. Third Normal Form (3NF) states that all columns/attributes relate to the entire Primary Key and no other key or column.

The interesting thing about a normalized model is that at each level, consistency is maintained. For example, in first normal form there are no repeating groups. When a table is normalized from first to second normal form, it will have no repeating groups and attributes must relate to the entire primary key. In third normal form it will have no repeating groups (like first normal form), attributes must relate to the entire primary key (like second normal form), and attributes must relate to the primary key and not to each other.


Once the tables are defined, related, and normalized, the Logical Data Model is complete. The next phase involves the business users, who establish the foundation for creation of the Extended Logical Data Model. Read on as we discuss this next step of the Teradata design process.

"Examine what is said, not him who speaks."


An entity represents a person, place, or anything else that can be uniquely identified. We mentioned earlier in the book that, in simple terms, a noun is the name of a person, place, or thing. Entities are nouns that we want to track. We will want to track our favorite nouns such as employees, customers, locations, departments, products, and stores. A great building starts with a great foundation. A great table starts by tracking a great entity. If a business can name a noun that it wants to keep track of, then this is the start of a great table.


Entities of Operational Systems Vs Decision Support

"Its time for the human race to enter the Solar System."
A data model of an operational system will often be different from the data model of a data warehouse. Operational systems must track everything that the business is doing, but data warehouses usually start small, and once the Return On Investment (ROI) is realized they continue to grow. The approach to modeling is the same for operational systems as for a data warehouse's Decision Support System (DSS), but the difference lies in the fact that we want to track what will give us the most bang for the buck in our DSS warehouse, and once we make money off this we will expand our tracking. Management understands that money doesn't grow on trees, and you may only get one chance to fund a data warehouse project. Figuring out what will bring the best ROI as you start the warehouse is extremely important.

When trying to identify an entity, the best thing to do first is to establish whether it is a major entity or a minor entity.


Examples of Major and Minor Entities


"Fate chooses your relations, you choose your friends."


A relation can be a state of being, an association, an action, or an event that will tie two or more entities together. Relations come in three different forms: one-to-one, one-to-many, and many-to-many. No matter what form the relation takes, the tables being related will have a Primary Key - Foreign Key relationship. The term Normalization in a relational database actually comes from Dr. Codd of IBM. He was inspired by President Richard Nixon, who at the time was trying to build a positive relationship with China. President Nixon termed what he was doing normalizing relations with China. Because there is only one China and only one United States of America, their normalizing of relations could be considered a one-to-one relationship.

"When they discover the center of the universe, a lot of people will be disappointed to discover they are not in it."
A one-to-one relation is found when each occurrence of one entity (entity A) can be related to only one occurrence of another entity (entity B), or none at all. The same applies when you relate entity B to entity A. This is the rarest form of relations, and may not be found in most data models you create.

"One's dignity may be assaulted, vandalized, and cruelly mocked, but it cannot be taken away unless it is surrendered."
A one-to-many relation is found when each occurrence of one entity (entity A) can be related to only one occurrence of another entity (entity B), but each occurrence of entity B can have multiple occurrences of entity A relating to it. This is a very common relation found in tables, and will appear in almost all models. For example, you can only have one department assigned to an employee, but you can have multiple employees assigned to a department.


"Cats are smarter than dogs. You cant get eight cats to pull a sled through snow."
A many-to-many relation is found when each occurrence of on entity (entity A) can be related to many different occurrences of another entity (entity B). The same will apply when you relate entity B to entity A. This is also a very common form of a relation.


"A lot of people approach risk as if its the enemy when its really fortunes accomplice."
Because a Many-to-Many relationship does not have a direct Primary Key Foreign Key relationship an associative table is utilized as the middle man. The associative table has a multi-column Primary Key. One of the associative tables Primary Key columns is the Primary Key of table A and the other Primary Key column of the associative table is the Primary Key of table B. The example below shows the exact syntax for joining the Student Table to the Course Table via the associative table; Student Course Table.

"The surprising thing about young fools is how many survive to become old fools."
As you can see from our example below we were able to join our Many-to-Many relationships table via the associative table.
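The original illustration is not reproduced here, but a join of that shape might look like the following sketch. The table and column names (Student_Table, Course_Table, Student_Course_Table, Student_ID, Course_ID) are assumptions made for the example, not names taken from this chapter's figures.

SELECT  S.Student_ID
       ,C.Course_ID
       ,C.Course_Name
FROM Student_Table AS S
INNER JOIN Student_Course_Table AS SC
   ON S.Student_ID = SC.Student_ID        /* PK of Student = one part of the associative PK */
INNER JOIN Course_Table AS C
   ON SC.Course_ID = C.Course_ID;         /* PK of Course = the other part of the associative PK */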


"If computers get too powerful, we can organize them into a committee that will do them in."
An attribute is a characteristic of an entity or a relation describing its character, amount or extent.

"This is the sixth book Ive written, which isnt bad for a guy whos only read two."


A subset of an entity is a group of occurrences of an entity that have a common interest or quality that separates the group from the rest of the occurrences of that entity. We call the entity as a whole the Superset.

"Hes the kind of guy who lights up a room just by flicking a switch."
A dependent of an entity is a noun and it could have attributes of its very own, however it only exists as part of some other entity. This entity is called the parent of the dependent.


"To me, old age is always 15 years older than I am."


In this section we're going to learn about special cases of relations, which are called recursive relations. A recursive relation is a relation between occurrences of the same entity. This also extends to relations to subsets of entities and dependents of entities.

"My idea of an agreeable person is a person who agrees with me."


The second type of special case is called complex relations. A complex relation is a relation that's shared between more than two entities. This also includes subsets and dependents of entities, which is why it can be complex.


"They always say time changes things, but you actually have to change them yourself."The third scenario of special relations is the time relation. This type of scenario is a relation between an entity, a subset, or a dependent, and a time value.

A Normalized Data Warehouse

"The reputation of a thousand years may be determined by the conduct of one hour."
A normalized data warehouse will have many different tables that are related to one another. Tables that have a relation can be joined. Most normalized databases will have many tables with fewer columns. This provides flexibility and is a natural way for business users to view the business. Each table created will have a Primary Key. All of the columns in the table should relate to the Primary Key. A Foreign Key is a column in the table that is also a Primary Key in another table. The Primary Key - Foreign Key relation is how joins are performed between two tables.


Relational Databases use the Primary Key Foreign Key relationships to join tables together.

"Never insult an alligator until after you have crossed the river."
Never insult a modeler until after you have the ERwin diagram in hand. Many believe that a normalized model is best, while others argue that dimensional models are better. A combination of both is an excellent strategy. Dimensional Modeling was originally designed for retail supermarket applications because their systems did not have the performance to perform joins, full table scans, aggregations, and sorting. Dimensional Modeling often implements fewer tables and can be adapted to enhance performance. This is because dimensional modeling was designed around answering specific questions.


"Facts are stupid things."


The Dimension Table

The dimension table helps to describe the measurements on the fact table. The dimension table can and will contain many columns and/or attributes. Dimension tables tend to be relatively shallow when it comes to the average number of rows within a dimension table. Each dimension is defined by its primary key, which serves as the basis for referential integrity with any fact table to which it is joined.
Dimension attributes serve as the source of query constraints, groupings, and report labels. When a query or report is requested, attributes can be identified as the words following "by." For example, a user may ask to see last week's total sales by Product_Id and by Customer_Number. Product_Id and Customer_Number will have to be available as dimension attributes. Dimension table attributes are key to making a data warehouse usable and understandable. Dimensions also implement the user interface to the data warehouse.


The best attributes tend to be textual and discrete. It would be wise to label your attributes as real words instead of abbreviations. Typical dimensional attributes include a short description, long description, brand name, category name, packaging size, packaging type, and several other product characteristics. Not all numeric data has to be put into a dimension table. For example, the price of certain products is always changing; gasoline is a good example. Because the standard cost of gasoline is continuously changing, it's possible to consider the price of gasoline not as a constant attribute, but rather as a measured fact. Sometimes there will be data that's questionable. If a case like that arises, it's possible that the data can go into either the fact table or the dimension table, depending on how the designer is developing his dimensional model. Dimension tables contain rich sets of values, columns, and dimensional attributes that give users the tools to cut and analyze their data in every which way. Remember, the role of a dimension table is to identify and represent hierarchical relationships within a business.

"Just because something doesnt do what you planned it to do doesnt mean its useless."
Thomas Edison wasnt a big fan of Dimensional Modeling. You just cant string together a few tables and hope the systems fast as lightning. To properly build a dimensional model, a battle plan is needed. Fortunately for us, we have the four-step Dimensional Modeling Process. The four-step Dimensional Modeling process enables us to slowly, yet accurately, map out our model. The steps go as follows:
1)

Select the business process to model: A process is any business action done by a company that is recorded by their data-collection system. If youre having trouble thinking of actions your company is doing, simply ask around. People whose job depends on being able to analyze the data will know exactly what the key business processes are. Other forms of database modeling typically focus on process by department, rather than processes by business. This is not good. If we create our model by splitting up processes by department, there will be many areas containing data duplication using different labels. Dimensional modeling tries to limit duplicate data as much as possible. Only publish data once. This will reduce ETL development, and will reduce overhead on data management and disk storage.

2)

Define the grain of said business process: Defining the grain means that youre listing exactly what each fact table row represents. This will help to convey any level of detail linked to fact table measurements. A common mistake amongst a team of database developers is that they will not agree on table granularity. If grain is improperly defined, it will become a major problem in the future. If you find in the next two steps that youve improperly determined the grain, thats ok! Its happened to everyone (yeah, Im looking at you!). Just go back to step two, come up with a better grain, and go from there.

3)

Pick any dimension that applies to each fact table row: Its imperative that our fact tables are full of a set of dimensions that represent all possible descriptions that take on single values within the context of each measurement. For each dimension, every attribute must be listed, no matter how minute or large. Typical dimensions are date, customer, and employee.
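As referenced above, here is a minimal sketch of what the output of the four steps might look like for a daily retail sales process. Every name here (Daily_Sales_Fact, Product_Id, Store_Id, and so on) is a hypothetical example rather than a table from this book, and the column types are assumptions.

CREATE TABLE Daily_Sales_Fact
( Sale_Date      DATE          NOT NULL   /* foreign key to the Date dimension    */
 ,Product_Id     INTEGER       NOT NULL   /* foreign key to the Product dimension */
 ,Store_Id       INTEGER       NOT NULL   /* foreign key to the Store dimension   */
 ,Sale_Quantity  INTEGER                  /* fact measured at the declared grain  */
 ,Sale_Amount    DECIMAL(9,2)             /* fact measured at the declared grain  */
)
PRIMARY INDEX (Product_Id);

The grain here is one row per product, per store, per day; the three key columns identify the dimensions, and the two numeric columns are the facts.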


"The weak can never forgive. Forgiveness is the attribute of the strong."
Each dimension in a dimensional model is going to have attributes that make that table unique from the rest. Each table will vary, depending on what information is on which table. Attributes will look something like the following:
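The original illustration is not reproduced here, but a product dimension built from the attribute types mentioned earlier (short description, long description, brand, category, packaging) might be sketched like this. The table and column names are assumptions for illustration only.

CREATE TABLE Product_Dimension
( Product_Id          INTEGER      NOT NULL   /* primary key of the dimension */
 ,Product_Short_Desc  VARCHAR(30)
 ,Product_Long_Desc   VARCHAR(100)
 ,Brand_Name          VARCHAR(30)
 ,Category_Name       VARCHAR(30)
 ,Package_Size        VARCHAR(20)
 ,Package_Type        VARCHAR(20)
)
UNIQUE PRIMARY INDEX (Product_Id);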

"People can have the Model T in any color so long as its black."
The following two pictures represent an Entity-Relational (ER) Model and a Dimensional Model (DM):


"The man who is swimming against the stream knows the strength of it."
Dimensional Modeling contains a number of warehousing advantages that the ER model lacks. First off, the dimensional model is very predictable. Query tools and report writers are able to make strong assumptions about the dimensional model to make user interfaces more understandable and processing more efficient. Metadata is able to use the cardinality of values within a dimension to help guide user-interface behavior. Because the framework is predictable, it strengthens processing. The database engine is able to make strong assumptions about constraining the dimension tables and then linking to the fact table all at once with the Cartesian product of the dimension table keys that satisfy the user's constraints. This approach enables the evaluation of arbitrary n-way joins to a fact table within a single pass through the fact table's index. There are several standard approaches for handling certain modeling situations. Each situation has a well-understood set of alternative decisions that can be programmed in report writers, query tools, and user interfaces. These situations can include:
- Slowly changing dimensions, involving dimension tables that change slowly over time. Dimensional modeling provides techniques for handling these slowly changing dimensions, depending on the company.
- Miscellaneous products, where a business needs to track a number of different lines of business within a single common set of attributes and facts.
- Pay-in-advance databases, where the transactions of the company are more than small revenue accounts, but the business may want to look at single transactions as well as a regular report of revenue.

The last strength of the dimensional model is the growing pool of DBA utilities and software processes that regulate and use aggregates. Remember, an aggregate is summary data that is logically redundant within the data warehouse and is used to enhance query performance. A well-formed strategy for comprehensive aggregates is needed for any medium to large data warehouse implementation. Another way to look at it: if you don't have any aggregates, lots of money could end up being wasted on hardware upgrades that bring minuscule benefit. Aggregate management software packages and navigation utilities depend on a specific single structure of the fact and dimension tables, which in turn is dependent on the dimensional model. If you stick to ER modeling, you will not benefit from these tools.

"The problem with political jokes is that they get elected."


The dimensional model can fit a data warehouse very nicely. However, this doesn't mean that the dimensional model is perfect. There are several situations where an ER model is better than the dimensional model, and vice versa. A star-join schema tends to fit the needs of the users who have been interviewed; it's possible that it doesn't fit all users. There will always be users who don't contribute to the dimensional modeling process. The star-join design tends to optimize the access of data for solely one group of users at the expense of everyone else. A star-join schema with just one shape and a set number of constraints can be extremely optimal for one group of users, and horrendous for another. Star-join schemas are shaped around user requirements, and because a) not every user can be interviewed during the dimensional model design phase, and b) user requirements vary from group to group, different star-join schemas will be optimal for different groups of users. A single star-join schema will never fit everyone's requirements in a data warehouse. The data warehousing industry has long since discovered that a single database design will not work for all purposes. There are several reasons why different departments need their own star-join schemas and cannot share a single star-join schema:
- Sequencing of data. Finance users love to see data sequenced one way, while marketing users love to see it sequenced another way.
- Data definitions. The sales department considers a sale as closed business, while the accounting department sees it as booked business.
- Granularity. Finance users look at things in terms of monthly and yearly revenue. Accounting users look at things in terms of quarterly revenue.
- Geography. The sales department looks at things in terms of ZIP code, while the marketing department might look at things in terms of states.
- Products. Sales tends to look at things in terms of future products, while the finance department tends to look at things in terms of existing products.
- Time. Finance looks at the world through calendar dates, while accounting looks at the world through closing dates.
- Sources of data. One source system will feed one star join while another source system feeds another.

The differences between business operations are far more numerous than the short list above suggests. Because there are many types of users on a database, many situations can arise at any time, and on any subject.

"A problem well stated is a problem half-solved."


Each department of a company tends to conduct its business differently than other departments. Each department sees an area of the data warehouse differently than the others because of its unique user requirements. Each department tends to care about different aspects of the company, which is why each department sees things differently from the others. Because of the need for each department to view things in its own unique way, each department requires a different star-join schema for its data warehouse. A star-join schema that is optimal for the finance department is practically useless for marketing. The reality of the business world is that one star-join schema will not fit all aspects of a company's data warehouse. It is possible to design a star-join schema specifically for each department, but even then certain problems will arise. Most of the problems don't even become apparent until multiple star joins have been designed for the data warehouse. When you have multiple independent star-join schema environments, the same detailed data will appear on each star join. Data will no longer be reconcilable, and any new star-join creation will require the same amount of work as the old star joins. Because of this:

- Every star join will contain much of the same data as the others. Star joins can become unnecessarily large when every star join thinks that it has to go back and gather the same data that the other star joins have already collected.
- The results from each star join will be inconsistent with the results of every other star join. The ability to correctly distinguish the right data from the wrong data will be nearly impossible.
- New star joins will require just as much work to create as the previous star joins already in the data warehouse. When each star join is built independently, new star joins will be built on a data warehouse with no foundation.
- The interface that supports the applications that feed the star joins will become unmanageable.

Because of these conclusions, dimensional modeling as the basis for a data warehouse can lead to many problems when multiple star joins are involved. It will never become apparent that there is a problem with a star join when you're looking at just one star join. But when a database contains multiple star joins, it becomes apparent that the dimensional model has many limitations. In the end, Dimensional and 3NF modeling work well together and both have their benefits. Dimensional modeling is great for pulling information out of data, while 3NF is great for the quick retrieval of that data. Most warehouses today take advantage of both and will find ways to implement both theories in their database.

"Choice, not chance, determines destiny."


The information you gather to create the Extended Logical Data Model will provide the main source of information, as well as the backbone, for the choices you will make in the Physical Data Model. Leave nothing to chance. You are in the process of determining the final outline of your data warehouse. The completed Logical Data Model is used as the basis for the development of the Extended Logical Data Model (ELDM). The ELDM includes information regarding physical column access and data demographics for each table. This also serves as input to the implementation of the Physical Data Model. The ELDM contains data demographics, data distribution, sizing, and even access. In other words, it maps transactions and applications to their related objects. The successful completion of the ELDM depends heavily on input from the business users. Users are very knowledgeable about the questions that need to be answered today. This knowledge in turn brings definition to how the warehouse will be utilized. In addition, this input provides clues about the queries that will be used to access the warehouse, the transactions that will occur, and the rate at which these transactions occur. Inside the ELDM, each column will have information concerning expected usage and demographic characteristics. How do we find this information? The best way is to run queries that use aggregate functions. The SAMPLE function can also be helpful when collecting data demographics. This information, along with the frequency at which a column will appear in the WHERE clause, is extremely important. The two major components of an ELDM are the following:

- COLUMN ACCESS in the WHERE clause or JOIN clause
- DATA DEMOGRAPHICS
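As a rough sketch of the kind of aggregate queries mentioned above, the statements below count distinct values, NULLs, and rows per value for a single column. The table and column names (Employee_Table, DEPT) are borrowed from the examples in this chapter; exact queries will vary with your own tables.

/* Distinct values and NULL rows for one column */
SELECT  COUNT(DISTINCT DEPT)     AS Distinct_Values
       ,COUNT(*) - COUNT(DEPT)   AS Null_Rows
FROM Employee_Table;

/* Maximum and typical rows per value */
SELECT  MAX(Rows_Per_Value) AS Max_Rows_Per_Value
       ,AVG(Rows_Per_Value) AS Typical_Rows_Per_Value
FROM (SELECT DEPT, COUNT(*) AS Rows_Per_Value
      FROM Employee_Table
      GROUP BY DEPT) AS T;

/* A small sample can give quick estimates on very large tables */
SELECT * FROM Employee_Table SAMPLE .01;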

"The only true wisdom is in knowing you know nothing."


Follow Socrates' advice and assume you know nothing when starting this portion of the design process. However, if you make a column your Primary Index that is never accessed in the WHERE clause or used to JOIN to other tables, then Socrates had you in mind with the last three words of his quote. It will be your job to interview users, look at applications, and find out what columns are being used in the WHERE clause of the SELECTs. The key here is to investigate what tables are being joined. It is also important to know what columns are joining the tables together. You will be able to find some join information from the users, but common sense plays a big part in join decisions. Understand how columns are accessed, and your warehouse is on its way to providing true wisdom!

COLUMN ACCESS in the WHERE CLAUSE:

Value Access Frequency - How frequently the table will be accessed via this column.
Value Access Rows * Frequency - The number of rows that will be accessed multiplied by the frequency at which the column will be used.
Join Access Frequency - How frequently the table will be joined to another table by this column being used in the WHERE clause.
Join Access Rows - The number of rows joined.

Quite often, designers new to Teradata believe that selecting the Primary Index will be easy. They just pick the column that will provide the best data distribution. They assume that if they keep the Primary Index the same as the Primary Key column (which is unique by definition), then the data will distribute evenly (which is true). The Primary Index is about distribution, but even more important is Join Access. Remember this golden rule! The Primary Index is the fastest way to access data, and Secondary Indexes are next.

"Don't count the days, make the days count."


Muhammad Ali has given you some advice that is reflective of your next objective. Count the DATA so you can make the DATA count. So how do you accomplish this task? The following will be your guide:
- Write SQL and utilize a wide variety of tools to get the data demographics.
- Combine these demographics with the column and join access information to complete the Extended Logical Data Model.
- Then use this information to create the Primary Indexes, Secondary Indexes, and other options in the Physical Database Design.

DATA DEMOGRAPHICS:

Distinct Values - The total number of unique values that will be stored in this column.
Maximum Rows per Value - The number of rows that will have the most popular value in this column.
Typical Rows per Value - The typical number of rows for each column value.
Maximum Rows NULL - The number of rows with NULL values for the column.
Change Rating - A relative rating for how often the column value will change. The value range is from 0-9, with 0 describing columns that do not change and 9 describing columns that change with every write operation.

Data Demographics answer these questions:

- How evenly will the data be spread across the AMPs?
- Will my data have spikes causing AMP space problems?
- Will the column change too often to be an index?

Extended Logical Data Model Template


Below is an example of an Extended Logical Data Model. The Value Access and Data Demographics have been collected. We can now use this to pick our Primary Indexes and Secondary Indexes. During the first pass at the table you should pick your potential Primary Indexes. Label them UPI or NUPI based on whether or not the column is unique. At this point in time, don't look at the change rating. The reason for the Extended Logical Data Model is to provide input for the Physical Model so that the Parsing Engine (PE) Optimizer can best choose the least costly access method or join path for user queries. The optimizer will look at factors such as row selection criteria and index and column demographics when making the best index choices. A great Primary Index will have:

- A Value Access Frequency that is high
- A Join Access Frequency that is high
- Reasonable distribution
- A change rating below 2

The example below illustrates an ELDM template of the Employee Table (assuming 20,000 Employees):

ELDM Template - Employee Table (20,000 rows)

                           EMP      DEPT   LNAME   FNAME   SAL
PK & FK                    PK, SA   FK
ACCESS
  Value Access Frequency   6K       5K     100     0       0
  Join Access Frequency    7K       6K     0       0       0
  Value Access Rows        70K      50K    0       0       0
DATA DEMOGRAPHICS
  Distinct Rows            20K      5K     12K     N/A     N/A
  Max Rows Per Value       1        50     1K      N/A     N/A
  Max Rows Null            0        12     0       N/A     N/A
  Typical Rows Per Value   1        15     3       N/A     N/A
  Change Rating            0        2      1       N/A     N/A

Once we have our table templates we are ready for the Physical Database Design. Read on, this is becoming interesting!

The Physical Data Model

"Nothing can stand against persistence; even a mountain will be worn down over time."
We have arrived at the moment of truth. We are arriving at the top of the mountain. It is now time to create the Physical Data Model. The biggest keys to a good physical model are choosing the correct:

- Primary Indexes
- Secondary Indexes
- Denormalization Tables
- Derived Data Options

The physical model is important because it is the piece that makes Teradata perform at the optimum level on a daily basis. If you have done a great job with the physical model, Teradata should perform like lightning. If you have done the job on the physical model and Teradata is still not performing at your anticipated speed, then you might want to get an upgrade. Remember, Teradata needs to be designed to perform best on a daily basis. You don't justify an UPGRADE because of a slow Year-End or Quarter-End report! You justify an upgrade if you have done due diligence on the physical model and have reached the point in time when your system is not performing well on a daily basis.


No two minds are alike, but two tables are usually joined in exactly the same manner by everybody. This is why the most important factor when picking a Primary Index is Join Access Frequency.

You still need to check whether the index will distribute well. With poor distribution you are at risk of running out of Perm or Spool. Then you must look at Value Access Frequency to see if the column is accessed frequently. If it is, then it is a great candidate for a primary or secondary index.
If the distribution for a column is BAD, it has already been eliminated as a Primary Index candidate. If you have a column with a high Join Access Frequency, then this is your Primary Index of choice. If you have a column that has a high Value Access Frequency, you can always create a secondary index for it. If there are no join columns that survived the distribution analysis, then pick the column with the best Value Access as your Primary Index. If all of the above fail, or two columns are equally important, then pick the column with the best distribution as your Primary Index.
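As a rough sketch of how those choices end up in DDL, the example below makes EMP a Unique Primary Index and gives DEPT a secondary index for its value access. The column data types are assumptions for illustration; only the column names come from the Employee Table used throughout this chapter.

CREATE TABLE Employee_Table
( EMP    INTEGER NOT NULL     /* high join and value access - chosen as the UPI */
 ,DEPT   INTEGER              /* frequent value access - good secondary index   */
 ,LNAME  CHAR(20)
 ,FNAME  CHAR(20)
 ,SAL    DECIMAL(10,2)
)
UNIQUE PRIMARY INDEX (EMP);

CREATE INDEX (DEPT) ON Employee_Table;    /* non-unique secondary index */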

Denormalization

"Most databases denormalize because they have to, but Teradata denormalizes because it wants to."
Denormalization is the process of implementing methods to improve SQL query performance. The biggest keys to consider when deciding to denormalize a table are PERFORMANCE and VOLATILITY. Will performance improve significantly, and does the volatility factor make denormalization worthwhile? It is in the physical model that you can also determine the places to denormalize for extra speed. Before you go crazy denormalizing, remember these valid considerations for optimal system performance first, or you are wasting your time. Make sure statistics will be collected properly. Then make sure you have chosen your Primary Indexes based on user ACCESS, UNIQUENESS for distribution purposes, and stable values (a low Change Rating). Also, know your environment and your business priorities. For example, most often the performance benefits of secondary indexes in OLTP environments outweigh the performance costs of batch maintenance. Improved performance is an admirable goal; however, one must be aware of the hazards of denormalization. Denormalization will always reduce the amount of flexibility that you have with your data and can also complicate the development effort. In addition, it will also increase the risk of data anomalies. Lastly, it could also take on extra I/O and space overhead in some cases. Others believe that denormalization has a positive effect on application coding because some feel it will reduce the potential for data problems.
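Collecting statistics in Teradata is done with the COLLECT STATISTICS statement; a minimal sketch is shown below, using the Employee Table column names from this chapter as placeholders.

COLLECT STATISTICS ON Employee_Table COLUMN (DEPT);   /* column-level statistics */
COLLECT STATISTICS ON Employee_Table INDEX  (EMP);    /* statistics on the index */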


Either way, you should consider denormalization if users run certain queries over and over again and speed is a necessity. The key word here is performance. Performance for known queries is the most complete answer. Whenever you denormalize from your logical model, it is a great idea to include the denormalization in "The Denormalization Exception Report". This report keeps track of all deviations from 3rd normal form in your data warehouse.

Derived Data
Derived data is data that is calculated from other data. For instance, taking all of the employees' salaries and averaging them would calculate the Average Employee Salary. It is important to be able to determine whether it is better to calculate derived data on demand or to place this information into a summary table. The 4 key factors for deciding whether to calculate or store stand-alone derived data are:

- Response Time Requirements
- Access Frequency of the request
- Volatility of the column
- Complexity of the calculation

Response Time Requirements - Derived data can take a period of time to calculate while a query is running. If user requirements need speed and their requests are taking too long, then you might consider denormalizing to speed up the request. If there is no need for speed, then be formal and stay with normal.

Access Frequency of the request - If one user needs the data occasionally, then calculate on demand, but if there are many users requesting the information daily, then consider denormalizing so many users can be satisfied.

Volatility of the column - If the data changes often, then there is no reason to store the data in another table or temporary table. If the data never changes and you can run the query one time and store the answer for a long period of time, then you may want to consider denormalizing. If the game stays the same, there is no need to be formal; make it denormal.

Complexity of the calculation - The more complex the calculation, the longer the request may take to process. If the calculation takes a long time and you don't have the time to wait, then you might consider placing it in a table.

When you look at the above considerations you begin to see a clear picture. If there are several requests for derived data, and the data is relatively stable, then denormalize and make sure that when any additional requests are made the answer is ready to go.
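As a small sketch of the two choices, the Average Employee Salary mentioned above can either be calculated on demand or stored once in a summary table. Table and column names follow this chapter's Employee Table; Dept_Avg_Salary is a hypothetical summary table name.

/* Calculate on demand each time it is requested */
SELECT AVG(SAL) AS Avg_Employee_Salary
FROM Employee_Table;

/* Or store the stable, frequently requested answer in a summary table */
CREATE TABLE Dept_Avg_Salary
( DEPT        INTEGER
 ,Avg_Salary  DECIMAL(10,2)
)
PRIMARY INDEX (DEPT);

INSERT INTO Dept_Avg_Salary
SELECT DEPT, AVG(SAL)
FROM Employee_Table
GROUP BY DEPT;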


Temporary Tables
Setting up Derived, Volatile, or Global Temporary tables allows users to use a temporary table during their entire session. This is a technique where everyone wins. A great example might be this: let's say you have a table that has 120,000,000 rows. Yes, the number is 120 million rows. It is a table that tracks detail data for an entire year. You have been asked to run calculations on a per-month basis. You can create a temporary table, insert only the month you need to calculate, and run queries until you log off the session. Your queries in theory will run twelve times faster. After you log off, the data in the temporary table goes away.
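A minimal sketch of that idea is shown below. The detail table name and the dates are hypothetical, and the CREATE ... AS ... WITH DATA form is just one way to build and fill the temporary table in a single step.

CREATE VOLATILE TABLE January_Detail_vt AS
( SELECT *
  FROM   Yearly_Detail_Table                       /* hypothetical 120-million-row table */
  WHERE  Sale_Date BETWEEN DATE '2004-01-01'
                       AND DATE '2004-01-31'
) WITH DATA
ON COMMIT PRESERVE ROWS;

/* Queries against January_Detail_vt now touch roughly 1/12th of the detail rows */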

TABLE 1 - Employee Table

EMP (PK)   DEPT (FK)   LNAME      FNAME     SAL
1          40          BROWN      CHRIS     65000.00
2          20          JONES      JEFF      70000.00
3          20          NGUYEN     XING      55000.00
4          40          JACOBS     SHERRY    30000.00
5          10          SIMPSON    MORGAN    40000.00
6          30          HAMILTON   LUCY      20000.00

TABLE 2 - Department Table

DEPT (PK)   DEPT_NAME
10          Human Resources
20          Sales
30          Finance
40          Information Technology

TABLE 3 - Dept_Salary Table (Temporary Table)

DEPT   Sum_SAL     Count_Sal   Avg(Sal)
10     40000.00    1           40000
20     125000.00   2           75000
30     20000.00    1           20000
40     95000.00    2           47500

Volatile Temporary Tables


Volatile tables have multiple characteristics in common with derived tables. They are materialized in spool and are unknown to the Data Dictionary. They require NO data dictionary access or transaction logging. The table definition is designed for optimal performance because the definition is kept in memory. A derived table is restricted to a single query statement; however, unlike a derived table, a volatile table may be utilized multiple times, and in more than one SQL statement, throughout the life of a session. This feature allows additional queries to utilize the same rows in the temporary table without requiring the rows to be rebuilt. The ability to use the rows multiple times is the biggest advantage over derived tables. An example of how to create a volatile table would be as follows:

CREATE VOLATILE TABLE Sales_Report_vt, LOG
( Sale_Date  DATE
 ,Sum_Sale   DECIMAL(9,2)
 ,Avg_Sale   DECIMAL(7,2)
 ,Max_Sale   DECIMAL(7,2)
 ,Min_Sale   DECIMAL(7,2)
)
ON COMMIT PRESERVE ROWS;

Now that the Volatile Table has been created, the table must be populated with an INSERT/SELECT statement like the following:

INSERT INTO Sales_Report_vt
SELECT  Sale_Date
       ,SUM(Daily_Sales)
       ,AVG(Daily_Sales)
       ,MAX(Daily_Sales)
       ,MIN(Daily_Sales)
FROM Sales_Table
GROUP BY Sale_Date;

The CREATE statement of a volatile table has a few options that need further explanation. The LOG option indicates there will be transaction logging of before images. ON COMMIT PRESERVE ROWS means that at the end of a transaction, the rows in the volatile table will not be deleted. The information in the table remains for the entire session. Users can ask questions of the volatile table until they log off. Then the table and its data go away.
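Because the rows are preserved for the whole session, the same materialized volatile table can be queried again and again without rebuilding it, for example:

SELECT  Sale_Date
       ,Sum_Sale
       ,Avg_Sale
FROM Sales_Report_vt
ORDER BY Sale_Date;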

Global Temporary Tables


Global Temporary Tables are similar to volatile tables in that they are local to a user's session. However, when the table is created, the definition is stored in the Data Dictionary. In addition, these tables are materialized in a permanent area known as Temporary Space. Because of this, global tables can survive a system restart and the table definition will not be discarded at the end of the session. However, when a session terminates normally, the rows inside the Global Temporary Table will be removed. Lastly, global tables require no spool space; they use Temp Space. Users from other sessions cannot access another user's materialized global table. However, unlike volatile tables, once the table is de-materialized, the definition still resides in the Data Dictionary. This allows for future materialization of the same table. If the global table definition needs to be dropped, then an explicit DROP command must be executed. How real does Teradata consider global temporary tables? They can even be referenced from a view or macro. An example of how to create a global temporary table would be as follows:

CREATE GLOBAL TEMPORARY TABLE Sales_Report_gt, LOG
( Sale_Date  DATE
 ,Sum_Sale   DECIMAL(9,2)
 ,Avg_Sale   DECIMAL(7,2)
 ,Max_Sale   DECIMAL(7,2)
 ,Min_Sale   DECIMAL(7,2)
)
PRIMARY INDEX (Sale_Date)
ON COMMIT PRESERVE ROWS;

Now that the Global Temporary Table has been created, the table must be populated with an INSERT/SELECT statement like the following:

INSERT INTO Sales_Report_gt
SELECT  Sale_Date
       ,SUM(Daily_Sales)
       ,AVG(Daily_Sales)
       ,MAX(Daily_Sales)
       ,MIN(Daily_Sales)
FROM Sales_Table
GROUP BY Sale_Date;
