
Identifying Candidate Dimension Columns

Compliments of Phil Gunning

Multidimensional clustering (MDC) is a significant new clustering capability introduced in DB2 UDB V8.1 that enables tables to be continuously and automatically clustered along multiple dimensions. MDC tables don't require regular reorganization (except to consolidate row overflows and to reclaim space in extents caused by deletions) and provide significant performance and availability enhancements for DB2 UDB V8 databases.

It is important to understand how to identify candidate dimension columns so that you do not waste space and so that you obtain the performance improvements you desire.

Critical to the performance of MDC tables is the selection of the columns that will be used as dimensions. Since each unique combination of values of the dimension columns (each cell) is stored in its own extent, the amount of space used by an MDC table can be many times that of a non-MDC table. In addition to the increase in space usage, query performance can degrade because of the additional I/O required to retrieve the extents (blocks). Candidate dimension columns can be identified in a couple of ways. Use the following three steps to evaluate the space usage and the space occupied per cell:

• Determine the number of cells
• Determine the space occupied per cell
• Determine the cell utilization

If the table under analysis already exists as a regular (non-MDC) table, the following query can be used to determine the number of cells for the candidate dimensions under analysis:

WITH cell_table AS (
  SELECT DISTINCT dimcol1, dimcol2, ..., dimcolN
  FROM table )
SELECT COUNT(*) AS cell_count
FROM cell_table

If a non-MDC version of the table does not exist, you can estimate the number of cells by estimating the number of unique combinations of the candidate dimension columns (for example, by multiplying the cardinalities of the individual columns). Note that such an estimate will be inaccurate if there are correlations between the dimension columns.
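If the candidate columns already exist in a table with current statistics, the per-column cardinalities needed for such an estimate can be read from the catalog. A minimal sketch (the schema, table, and column names are hypothetical):

SELECT COLNAME, COLCARD
FROM SYSCAT.COLUMNS
WHERE TABSCHEMA = 'MYSCHEMA'
  AND TABNAME = 'MY_TABLE'
  AND COLNAME IN ('DIMCOL1', 'DIMCOL2')

Multiplying the COLCARD values gives an upper bound on the number of cells; the DISTINCT query above gives the exact count.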

The next step is to determine the space occupied per cell as follows:

First, determine the number of rows in the table by issuing the following SQL statement:

SELECT COUNT(*) FROM table

Or make an estimate of the number of rows if the table doesn't exist. Then determine the extent size in bytes using the following SQL statement:

SELECT PAGESIZE * EXTENTSIZE
FROM SYSCAT.TABLESPACES
WHERE TBSPACE LIKE 'tablespace_name'

Next, determine the minimum, average, and maximum rows per cell using the following query:

WITH cell_table (Dimc1, Dimc2, ..., DimcN, RpC) AS (
  SELECT Dimc1, Dimc2, ..., DimcN, COUNT(*)
  FROM table
  GROUP BY Dimc1, Dimc2, ..., DimcN )
SELECT AVG(RpC) AS AvgRpC, MIN(RpC) AS MinRpC, MAX(RpC) AS MaxRpC
FROM cell_table

Using the average RpC result from the previous query, compute the space used by the rows in a cell (space per cell) using the following formula:

Space per Cell = Rows per Cell x (average row size)

Next, using the results of the above queries and formulas, compute the space utilization per cell
as follows:

Cell Utilization = Space Per Cell / (extentsize in bytes)

A cell utilization value of 1.0 is the optimum value. It indicates that one cell will fully occupy one extent. If cell utilization is higher than 1.0, multiple extents will be required to store the rows in a cell. A cell utilization value of 0.1 or less indicates that a large amount of the space allocated to the table will be unused, and this should be avoided: a table with a cell utilization of 0.1 or less could use many times the amount of space of a non-MDC table. To increase cell utilization, experiment with other dimension columns and extent sizes through a trial-and-error process.
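As a quick worked example with hypothetical numbers: suppose the tablespace has a 4 KB page size and an extent size of 8 pages (32,768 bytes), the candidate dimensions yield an average of 200 rows per cell, and the average row size is 120 bytes. Then Space per Cell = 200 x 120 = 24,000 bytes and Cell Utilization = 24,000 / 32,768 ≈ 0.73, which is reasonably close to the optimum of 1.0.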

When considering the use of MDC, you need to collect and analyze the queries that will run against the table being reviewed and determine whether MDC tables could benefit those queries. The following types of columns and predicates can benefit from an MDC table:

• Columns with coarse granularity
• Columns used in an ORDER BY or GROUP BY clause
• Foreign key columns in fact tables used in a STAR schema
• Index columns from STAR schema dimension tables
• Columns used in range, equality and IN predicates

After you have created the MDC tables, run the query or query workload against the non-MDC and MDC versions of the tables using db2batch and DB2 Explain to validate your design and ensure that you have met your performance objectives with the MDC table. This is an iterative process and may lead you to select new dimension candidates or extent sizes, and/or to use generated columns to reduce the cardinality of the candidate dimensions.
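As a minimal sketch of such a comparison run (the database name sample and the statement file mdc_queries.sql are hypothetical), db2batch reads the statements from the file and reports elapsed time and other statistics for each one:

db2batch -d sample -f mdc_queries.sql

Running the same file against the non-MDC and MDC versions of the table, and comparing the access plans with Explain, shows whether the chosen dimensions actually reduce I/O.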

Generated columns can be used to reduce the granularity of a candidate dimension so that the rows of each cell fill up most of an extent; this prevents extents from holding only a few rows, which would waste space and would not necessarily improve performance.

For example, we could use a student identification number and divide it by 10 to create a
generated column to use as a dimension.
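A minimal sketch of that example (the table and column names are hypothetical; the generated column collapses every 10 consecutive student IDs into one dimension value):

CREATE TABLE enrollment (
  student_id   INTEGER NOT NULL,
  course_id    INTEGER NOT NULL,
  grade        CHAR(2),
  student_band INTEGER GENERATED ALWAYS AS (student_id / 10)
)
ORGANIZE BY DIMENSIONS (student_band)

Note that queries filtering on student_id may need an equivalent predicate on student_band to benefit from block-level elimination; whether the optimizer derives it automatically depends on the release and on the monotonicity of the generated expression.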

Previous Tips of the Month

January - Converting Non-Partitioned Tablespaces to Partitioned Tablespaces without Down Time
February - Drop a Specific Stored Procedure When There are Other Procedures with the Same Name
March - "Truncating" a Table
April - Display or Set the Indexing Rules for a Text Index
May - Common Mistakes When Using Joins
June - DB2 Replication Pruning


January's Tip of the Month

Converting Non-Partitioned Tablespaces to Partitioned Tablespaces without Down Time


By Robert Catterall for DB2 Magazine

Here's what we do at CheckFree to convert non-partitioned tablespaces without down time:

1. Create a new tablespace (and table and associated indexes) with the desired
specifications (for example, DSSIZE 64G).

2. Unload data from the "original" table, and load into the new tablespace.

3. Assuming that the data in the "original" tablespace was updated in the course of the
unload/load procedure, use a DB2 log analysis tool (available from IBM and third-party
software vendors) to extract the changes from the log and apply them to the "new"
tablespace.

4. Do step 3 iteratively to bring the "new" tablespace closer, in terms of content currency, to
the "original" tablespace.

5. When a brief window of no access or read-only access to the "original" tablespace is available, do a final log extract and apply to achieve data currency synchronization.

6. If step 5 is accomplished during a read-only period (re: the "original" tablespace), go briefly to non-access, and change the name of the "original" table to something different (other than the name of the "new" table). Then, change the name of the "new" table to the former name of the "original" table.

7. Rebind packages invalidated via the RENAME TABLE operation.

8. Restore application program access to the tablespace.

It's strongly recommended that you keep the "old" tablespace and table (what I referred to as the
"original" tablespace and table), around for a while (2-4 weeks is our norm), just in case
something goes wrong. Make sure that when you finally drop the "old" tablespace, you don't
accidentally drop the new one. The RESTRICT ON DROP option of CREATE TABLE can help
here.
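As a minimal sketch of the safeguard mentioned above and of the rename in step 6 (the database, tablespace, and table names are hypothetical):

-- Step 1: create the new table with protection against an accidental DROP
CREATE TABLE NEW_TAB
  (ACCT_ID INTEGER NOT NULL,
   POSTED_DATE DATE NOT NULL)
  IN DB1.NEWTS
  WITH RESTRICT ON DROP;

-- Step 6: swap the names during the brief no-access window
RENAME TABLE ORIG_TAB TO ORIG_TAB_OLD;
RENAME TABLE NEW_TAB TO ORIG_TAB;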
We've successfully executed this procedure a number of times. It's similar to what happens under
the covers when you run an online REORG of an actively updated tablespace.

February's Tip of the Month

Drop a Specific Stored Procedure When There are Other Procedures with the Same Name
From IBM's Frequently Asked Questions

How can you drop one specific stored procedure when you have several overloaded procedures
with the same name?

Each procedure must have a unique specificname in the DB2 system catalog. This specificname
can be used to delete the one procedure without affecting others with the same
procname/routinename.

Use the following query to find the specificname for a procedure (be sure to modify the WHERE clause with the correct name). The output is ordered by specificname, so the rows for each procedure are grouped together, helping you identify the specificname whose parameters match the procedure to be dropped.

db2 select substr(specificname,1,10) as "SPECIFICNAME", \
substr(routinename,1,10) as "ROUTINE", substr(parmname,1,20) as "PARMNAME", \
CASE rowtype \
WHEN 'B' THEN 'Input/Output' \
WHEN 'C' THEN 'Result after casting' \
WHEN 'O' THEN 'Output' \
WHEN 'P' THEN 'Input' \
WHEN 'R' THEN 'Result before casting' \
END \
as "PARMTYPE", \
substr(typename,1,10) \
from syscat.routineparms where routinename = 'MY_PROCEDURE' \
order by specificname, routinename
Once the correct specificname is determined, use this DDL:

db2 drop specific procedure <schema>.<specificname>
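Alternatively, the specific names can be listed directly from SYSCAT.ROUTINES; a sketch (MY_PROCEDURE is the same placeholder as above, and MYSCHEMA and the specific name in the DROP are hypothetical):

db2 "select routineschema, routinename, specificname from syscat.routines where routinename = 'MY_PROCEDURE'"
db2 "drop specific procedure MYSCHEMA.SQL030212103948100"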

March's Tip of the Month

"Truncating" a Table
From www.tek-tips.com

The following options are good for DB2 V7:

1. Using DELETE

DELETE FROM <tablename>

This causes all the rows in the table to be deleted. Log records are written. If the table is
large (in terms of millions of records), a very large active log space is required. DELETE
Triggers are fired.

2. Using IMPORT/LOAD

IMPORT FROM /dev/null OF DEL REPLACE INTO <tablename>
LOAD FROM /dev/null OF DEL REPLACE INTO <tablename> NONRECOVERABLE

More privileges are required for these tasks: IMPORT ... REPLACE requires CONTROL on the table, and LOAD requires LOAD authority on the table. The good thing about this approach is that there is minimal logging.

The tables may be left in check pending state. DELETE triggers are not fired.

From the database recovery point of view, it is advisable to use IMPORT.

3. Using NLI

ALTER TABLE <tablename> ACTIVATE NOT LOGGED INITIALLY WITH EMPTY TABLE

The table should have been created with the NOT LOGGED INITIALLY option. No logging is done. DELETE triggers are not fired.
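A minimal sketch of the NLI approach from the db2 command line (the database name sample and the table myschema.mytable are hypothetical):

db2 connect to sample
db2 "alter table myschema.mytable activate not logged initially with empty table"
db2 connect reset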

April's Tip of the Month

Display or Set the Indexing Rules for a Text Index


From IBM's Frequently Asked Questions

The desrulix command can be used to display or set the indexing rules for a text index.

The desrulix command (set/get default indexing rules) is executed from the command line with
this syntax:

DESRULIX [-h|-H|-?|-copyright]
         [-quiet]
         -s <search service name>
         -x <index name>
         [-dfmt <document format> numerical or: TDS, ASCIISECTION, RTF, HTML]
         [-ccsid <default code page>]
         [-lang <default language> numerical or: ARB, CAT, CHS, CHT, DAN, DEU, DES, ENG,
                ENU, ESP, FIN, FRA, FRC, HBR, ISL, ITA, JAP, KOR, NLD, NOB, NON, NOR,
                PTG, PTB, RUS, SVE]
The search service is TXINS000, and the index name is the DB2 index name, which can be found under Indexes in the Control Center. For example, to display the indexing rules:

C:\PROGRA~1\SQLLIB\BIN>desrulix -s txins000 -x ix025913_004
The output looks like this:
DESRULIX - indexing rules

Default indexing rules

--------------------------------------------
Document format . . . . . . : TDS/ASCII
Default CCSID . . . . . . .: 819
Default language . . . . . .: EN_US

--------------------------------------------

DESRULIX: Command terminated successfully.

May's Tip of the Month

Common Mistakes When Using Joins


Compliments of Gabrielle Wiorkowski, Gabrielle & Associates

The purpose of this tip is to alert you to some mistakes that have been made by many people, maybe even you. These mistakes can occur when someone is rushed or tired; they can result in poor performance and can give joins an undeservedly bad reputation. An awareness of these common mistakes can help prevent them in the future.

Forgetting to Include a Join Predicate

Perhaps the most distressing mistake is a join statement for which the user has forgotten to include a predicate. The result is called a Cartesian product. In relational terms, a Cartesian product includes a row in the result table for every combination of rows in the participating tables. In other words, the number of rows in a Cartesian product equals the number of rows in one table multiplied by the number of rows in the second table. If S has 1,000 rows and SPJ has 10,000, the statement

SELECT S.SN, PN, JN
FROM S, SPJ;

returns a Cartesian product with 10 million rows. It is unlikely that any application would have use for such a result. A more likely explanation is that the user or developer forgot to key in the WHERE clause.
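For comparison, the intended statement presumably joins the two tables on the supplier number, as in the later examples in this tip; a sketch of the corrected query:

SELECT S.SN, PN, JN
FROM S, SPJ
WHERE S.SN = SPJ.SN;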

Forgetting to Specify a Join Condition

A partial Cartesian product can be less obvious and can occur when a join condition is forgotten.
This is especially easy to do when the SQL statements are complicated. Consider the following
select on the catalog tables that estimates the average number of values that appear on a page.
Which join condition is missing, thus requiring a partial Cartesian product to be performed?

SELECT I.CREATOR, I.TBNAME, I.NAME,
       I.FIRSTKEYCARDF, I.FULLKEYCARDF,
       C.NAME, C.COLCARDF,
       T.CARDF, T.NPAGESF,
       T.CARDF/T.NPAGESF/I.FIRSTKEYCARDF,
       T.CARDF/T.NPAGESF/I.FULLKEYCARDF
FROM SYSIBM.SYSINDEXES I,
     SYSIBM.SYSTABLES T,
     SYSIBM.SYSCOLUMNS C,
     SYSIBM.SYSKEYS K
WHERE I.FIRSTKEYCARDF > 0
  AND I.FULLKEYCARDF > 0
  AND T.CARDF > 0
  AND T.NPAGESF > 0
  AND I.CREATOR = 'PAT'
  AND K.IXCREATOR = I.CREATOR
  AND C.TBCREATOR = I.CREATOR
  AND T.CREATOR = I.CREATOR
  AND T.NAME = C.TBNAME
  AND T.NAME = I.TBNAME
  AND C.NAME = K.COLNAME;
The missing join condition is I.NAME = K.IXNAME; its omission results in a partial Cartesian product.

An even more subtle oversight involves a composite index. If the predicate on the first column of a composite index on K.IXCREATOR and K.IXNAME is forgotten, a matching index scan cannot be used; this does not result in a Cartesian product, but it significantly degrades the join's performance.
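A matching index scan on the composite index requires predicates on both of its columns; in the query above that means including both of the following:

AND K.IXCREATOR = I.CREATOR
AND I.NAME = K.IXNAME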

Forgetting to Include a Local Predicate

A seemingly simple SELECT statement required 1.5 hours to run and gave many duplicate rows.
The problem is illustrated using the SPJ and S tables in the following SELECT statement, which determines information on suppliers who supply parts for job J4 or are located in London.

SELECT S.SN, STATUS, CITY
FROM SPJ, S
WHERE (SPJ.SN = S.SN
AND SPJ.JN = 'J4')
OR S.CITY = 'London'
ORDER BY S.SN, STATUS, CITY;
It is necessary to form a partial Cartesian product of SPJ and S for suppliers in London because
there is no join predicate for the ORed predicate (OR S.CITY = 'London').

The statement can be reformulated as follows to avoid a partial Cartesian product:

SELECT S.SN, STATUS, CITY
FROM SPJ, S
WHERE (SPJ.SN = S.SN
AND SPJ.JN = 'J4')
UNION
SELECT S.SN, STATUS, CITY
FROM S
WHERE CITY = 'London'
ORDER BY 1, 2, 3;
DISTINCT is not used because UNION eliminates duplicates. This avoids a second sort.

Failing to Limit the Number of Rows Selected

Even the simplest omission can be costly. It most frequently occurs with developers new to the
set processing capability of SQL. Accustomed to one-record-at-a-time processing, they feel
comfortable joining two entire tables and using the host program to test for the desired values.
For example, when looking for suppliers of part P5 and the jobs on which it is used, they might
join the entire S and SPJ tables with this statement:

SELECT S.SN, PN, JN
FROM S, SPJ
WHERE S.SN = SPJ.SN;
Then they use a cursor to fetch the returned rows and test for those containing P5. This join is
much more expensive than a join of only the P5 rows from one table with the other, which can be
accomplished with this statement:
SELECT S.SN, PN, JN
FROM S, SPJ
WHERE S.SN = SPJ.SN
AND PN = 'P5';
June's Tip of the Month

DB2 Replication Pruning


From www.tek-tips.com

When you are pruning your replication tables using the asncmd prune command, the process uses the database logs. If there is a large amount of data in your change data (CD) tables and/or the unit-of-work (UOW) table, the logs will fill up. This causes incomplete pruning of the tables, and it also takes log space away from other applications, causing those applications to fail.

There are a few different ways to get out of the situation (where pruning requires more log space
than is available):

A) Temporarily increase the amount of log space available.

B) Check the DATA_CAPTURE setting for the CD table in SYSCAT.TABLES. If it is Y, change it to N; that way the entries logged in the transaction log as a result of the DELETEs issued by PRUNE will be much smaller and might fit within the log space currently available.
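One way to make that change, sketched with a hypothetical CD table name (ALTER TABLE ... DATA CAPTURE NONE sets DATA_CAPTURE to N, and DATA CAPTURE CHANGES sets it back to Y afterwards if required):

ALTER TABLE ASN.CDMYTABLE DATA CAPTURE NONE

-- after pruning, restore the original setting if it was Y
ALTER TABLE ASN.CDMYTABLE DATA CAPTURE CHANGES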

C) Quiesce the source tables. Allow Capture to catch up with the log so that all data is captured. Stop Capture. Allow Apply to catch up and apply all data (subs_set synchpoint = register table synchpoint where global_record = 'Y'). Now all data is subject to pruning. Drop and recreate the CD and UOW tables. This causes zero logging and zero guesswork.

D) Work around this problem by pruning changes manually:

• Start Capture with the NOPRUNE parm so that pruning is NOT done automatically.
• Start Apply so that the captured changes are applied to the targets. Changes cannot be pruned until they have been applied.
• Stop Capture so that you can manually delete rows from the CD tables.
• For each CD table, issue the following:

DELETE FROM <cd_table> CD
WHERE CD.IBMSNAP_UOWID IN
(SELECT UOW.IBMSNAP_UOWID FROM ASN.IBMSNAP_UOW UOW
WHERE UOW.IBMSNAP_COMMITSEQ <=
(SELECT MIN(PC.SYNCHPOINT) FROM ASN.IBMSNAP_PRUNCNTL PC))

• Then, for the UOW table, issue:

DELETE FROM ASN.IBMSNAP_UOW UOW
WHERE UOW.IBMSNAP_COMMITSEQ <=
(SELECT MIN(PC.SYNCHPOINT) FROM ASN.IBMSNAP_PRUNCNTL PC)

• Restart Capture.

E) Option D may still need a lot of log space, so you can break the manual deletes down into smaller bits:

• Stop Capture and Apply. Keep Capture down until you are finished with this manual effort. Capture will just suffer continual contention while you are pruning these large numbers of rows, and keeping Capture and Apply down will keep all of the pruncntl values static while you perform this task.

• Determine the minimum synchpoint for a CD table based on the pruncntl table:

SELECT HEX(MIN(A.SYNCHPOINT)) FROM ASN.IBMSNAP_PRUNCNTL A,
ASN.IBMSNAP_REGISTER B
WHERE (A.SOURCE_TABLE = B.SOURCE_TABLE
AND A.SOURCE_OWNER = B.SOURCE_OWNER
AND A.SOURCE_VIEW_QUAL = B.SOURCE_VIEW_QUAL
AND B.PHYS_CHANGE_OWNER = '<your CD owner>'
AND B.PHYS_CHANGE_TABLE = '<your CD table>')

• This is the highest value that can be safely pruned from this CD table. Eventually all the rows in the CD table with a commitseq less than or equal to that value can be pruned, but the commitseq value actually comes from a join with the UOW table. Now find the minimum commitseq value in the UOW table. Choose interim values of commitseq in between the two values just selected and use each in turn as the prunpoint in the following SQL:

DELETE FROM <cd_table> A WHERE A.IBMSNAP_UOWID IN (SELECT DISTINCT
B.IBMSNAP_UOWID FROM ASN.IBMSNAP_UOW B WHERE B.IBMSNAP_COMMITSEQ <=
<prunpoint>)

• Keep on issuing these deletes, with commits, until you have pruned all of the values through to the highest value determined at the beginning.

• The UOW table can be pruned safely only after all of the CD tables have been pruned. The safe value for the UOW table pruning is the minimum synchpoint from the pruncntl table (excluding null and all-zero synchpoint values):

DELETE FROM ASN.IBMSNAP_UOW WHERE IBMSNAP_COMMITSEQ <= <min synchpoint>
