After completing this module, you will be able to:
  Describe the data distribution form and method.
  Describe Hashing.
  Describe Primary Index hash mapping.
  Describe the reconfiguration process.
  Describe a Block Layout.
  Describe File System Read Access.
Data Distribution
[Figure: records from the client, in random sequence (2 32 67 12 90 6 54 75 18 25 80 41), pass from the host (ASCII) to Teradata and are spread across AMP 1 to AMP 4, where they are formatted and stored.]
Hashing
The Hashing Algorithm creates a fixed-length value from any length input string.
Input to the algorithm is the Primary Index (PI) value of a row.
The output from the algorithm is the Row Hash.
  A 32-bit binary value.
  The logical storage location of the row.
  Used to identify the AMP of the row.
  Table ID + Row Hash is used to locate the Cylinder and Data Block.
  Used for distribution, placement, and retrieval of the row.
Row Hash uniqueness depends directly on PI uniqueness.
Good data distribution depends directly on Row Hash uniqueness.
The algorithm produces random, but consistent, Row Hashes.
The same PI value and data type combination always hash identically.
Rows with the same Row Hash will always go to the same AMP.
Different PI values rarely produce the same Row Hash (Collisions).
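The properties listed above (fixed-length output, consistency, bucket taken from the leading bits) can be sketched in Python. This is only an illustration: CRC-32 is used as a stand-in and is not Teradata's hashing algorithm, and the function names `stand_in_row_hash` and `hash_bucket` are hypothetical.

```python
import zlib


def stand_in_row_hash(pi_value: str) -> int:
    """Map an input string of any length to a 32-bit value.

    ASSUMPTION: CRC-32 is a stand-in chosen for illustration only;
    it is not the Teradata Hashing Algorithm.
    """
    return zlib.crc32(pi_value.encode("utf-8")) & 0xFFFFFFFF


def hash_bucket(row_hash: int) -> int:
    """Return the first (high-order) 16 bits of a 32-bit Row Hash."""
    return (row_hash >> 16) & 0xFFFF


# Same input, same hash: the result is consistent.
first = stand_in_row_hash("Teradata")
second = stand_in_row_hash("Teradata")

# A much longer input still produces a 32-bit value.
long_input = stand_in_row_hash("x" * 1000)

print(f"{first:08X} {second:08X} {long_input:08X}")
```

Applied to the Row Hash from Example 1 in this module, `hash_bucket(0xF66DE2DC)` gives 63085, which matches the Bucket Num shown there.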
Example 1:
SELECT HASHROW ('Teradata')                           AS "Hash Value"
      ,HASHBUCKET (HASHROW ('Teradata'))              AS "Bucket Num"
      ,HASHAMP (HASHBUCKET (HASHROW ('Teradata')))    AS "AMP Num"
      ,HASHBAKAMP (HASHBUCKET (HASHROW ('Teradata'))) AS "AMP Fallback Num" ;

Hash Value  Bucket Num  AMP Num
F66DE2DC    63085       2
Example 2:
SELECT HASHROW ('Teradata')  AS "Hash Value 1"
      ,HASHROW ('Teradata ') AS "Hash Value 2"
      ,HASHROW (' Teradata') AS "Hash Value 3" ;

Hash Value 1  Hash Value 2  Hash Value 3
F66DE2DC      F66DE2DC      53F30AB4

The trailing blank does not change the Row Hash; the leading blank does.
Example:
CREATE TABLE tableA
  (c1_bint   BYTEINT
  ,c2_sint   SMALLINT
  ,c3_int    INTEGER
  ,c4_dec    DECIMAL(8,0)
  ,c5_dec2   DECIMAL(8,2)
  ,c6_float  FLOAT
  ,c7_char   CHAR(10))
UNIQUE PRIMARY INDEX (c1_bint, c2_sint);

INSERT INTO tableA (5, 5, 5, 5, 5, 5, '5');

SELECT HASHROW (c1_bint)  AS "Hash Byteint"
      ,HASHROW (c2_sint)  AS "Hash Smallint"
      ,HASHROW (c3_int)   AS "Hash Integer"
      ,HASHROW (c4_dec)   AS "Hash Dec80"
      ,HASHROW (c5_dec2)  AS "Hash Dec82"
      ,HASHROW (c6_float) AS "Hash Float"
      ,HASHROW (c7_char)  AS "Hash Char"
FROM tableA;

Output from SELECT
Hash Byteint   609D1715
Hash Smallint  609D1715
Hash Integer   609D1715
Hash Dec80     609D1715
Hash Dec82     BD810459
Hash Float     E40FE360
Hash Char      334EC17C
Multi-Column Hashing
The Hashing Algorithm uses multiplication and addition to create the hash value for a multi-column index. Assume PI = (A, B)
[Hash(A) * Hash(B)] + [Hash(A) + Hash(B)] = [Hash(B) * Hash(A)] + [Hash(B) + Hash(A)]
Example: A PI of (3, 5) will hash the same as a PI of (5, 3) if both c1 & c2 are equivalent data types.
CREATE TABLE tableB
  (c1_int  INTEGER
  ,c2_dec  DECIMAL(8,0))
UNIQUE PRIMARY INDEX (c1_int, c2_dec);

INSERT INTO tableB (5, 3);
INSERT INTO tableB (3, 5);

SELECT c1_int AS c1
      ,c2_dec AS c2
      ,HASHROW (c1_int) AS "Hash c1"
      ,HASHROW (c2_dec) AS "Hash c2"
      ,HASHROW (c1_int, c2_dec) AS "Hash c1c2"
FROM tableB;
A PI of (3, 5) will hash differently than a PI of (5, 3) if column1 and column2 are not equivalent data types. Example:

CREATE TABLE tableB
  (c1_int  INTEGER
  ,c2_dec  DECIMAL(8,2))
UNIQUE PRIMARY INDEX (c1_int, c2_dec);

INSERT INTO tableB (5, 3);
INSERT INTO tableB (3, 5);

SELECT c1_int AS c1
      ,c2_dec AS c2
      ,HASHROW (c1_int) AS "Hash c1"
      ,HASHROW (c2_dec) AS "Hash c2"
      ,HASHROW (c1_int, c2_dec) AS "Hash c1c2"
FROM tableB;

*** Query completed. 2 rows found. 5 columns returned.

c1  c2    Hash c1   Hash c2   Hash c1c2
5   3.00  609D1715  A4E56902  0E452DAE
3   5.00  6D27DAA6  BD810459  336B8C96

These two rows will not hash the same and probably will not produce a hash synonym.
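The commutative formula above can be checked with a short Python sketch. It takes the per-column hashes printed in this module as given inputs. Reducing the result modulo 2^32 is an assumption made here to keep the value at 32 bits, and `combine` is a hypothetical name, so the output is not expected to reproduce the "Hash c1c2" column shown for tableB.

```python
MASK_32 = 0xFFFFFFFF


def combine(hash_a: int, hash_b: int) -> int:
    """[Hash(A) * Hash(B)] + [Hash(A) + Hash(B)], kept to 32 bits.

    ASSUMPTION: the modulo-2**32 reduction is illustrative; the source
    only gives the schematic multiply-and-add formula.
    """
    return (hash_a * hash_b + hash_a + hash_b) & MASK_32


# Per-column hashes taken from the example output in this module.
H_INT_5 = 0x609D1715   # INTEGER 5 (also DECIMAL(8,0) 5)
H_INT_3 = 0x6D27DAA6   # INTEGER 3
H_DEC2_3 = 0xA4E56902  # DECIMAL(8,2) 3.00
H_DEC2_5 = 0xBD810459  # DECIMAL(8,2) 5.00

# Equivalent data types: (5, 3) and (3, 5) combine to the same value.
same_types_53 = combine(H_INT_5, H_INT_3)
same_types_35 = combine(H_INT_3, H_INT_5)

# INTEGER with DECIMAL(8,2): the column hashes differ, so the results differ.
mixed_53 = combine(H_INT_5, H_DEC2_3)
mixed_35 = combine(H_INT_3, H_DEC2_5)

print(same_types_53 == same_types_35, mixed_53 == mixed_35)
```

Because multiplication and addition are both commutative, column order alone never changes the combined value; only a difference in the per-column hashes does.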
[Result fragment: Hash c3 = 34D30C52, Hash c4 = 34D30C52, Count(*) = 12, 7]
[Figure: a Row Hash split into its first 16 bits and the remaining 16 bits, with map entries pointing to AMP 0 through AMP 9.]
Hash Maps
Hash Maps are the mechanism for determining which AMP gets a row. There are four (4) Hash Maps on every TPA node. The Hash Maps are loaded into PDE memory space of each TPA node when PDE software boots.
[Figure: the Communications Layer Interface (PE, AMP) with the Hash Maps labeled Current Configuration Primary, Reconfiguration Primary, and Reconfiguration Fallback.]
For a PI or USI operation, only the AMP whose number appears in the referenced Hash Map entry is interrupted.
The first 16 bits of a Row Hash is the Destination Selection Word (DSW). The DSW points to one map and one entry within that map. The referenced Hash Map entry identifies the AMP for the row hash.
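Example: assume a Row Hash of 0023 1AB2. In an 8-AMP system, the referenced Hash Map entry (entries are numbered 000, 001, 002, ...) identifies AMP 05; in a 16-AMP system, it identifies AMP 14.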
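The lookup described above can be sketched in Python as a plain array index. The map size of 65,536 entries comes from the review questions in this module; the map contents are hypothetical, with only entry 0x0023 filled in so that the result mirrors the slide's example, and `dsw` and `amp_for_row_hash` are illustrative names.

```python
HASH_MAP_ENTRIES = 65_536  # one entry per possible 16-bit DSW value


def dsw(row_hash: int) -> int:
    """Destination Selection Word: the first 16 bits of the Row Hash."""
    return (row_hash >> 16) & 0xFFFF


def amp_for_row_hash(row_hash: int, hash_map: list[int]) -> int:
    """Return the AMP number stored in the referenced Hash Map entry."""
    return hash_map[dsw(row_hash)]


# HYPOTHETICAL maps: every entry is 0 except the one the example touches.
hash_map_8_amps = [0] * HASH_MAP_ENTRIES
hash_map_8_amps[0x0023] = 5

hash_map_16_amps = [0] * HASH_MAP_ENTRIES
hash_map_16_amps[0x0023] = 14

row_hash = 0x00231AB2
print(amp_for_row_hash(row_hash, hash_map_8_amps))
print(amp_for_row_hash(row_hash, hash_map_16_amps))
```

The same Row Hash can land on a different AMP when the configuration changes, because only the map contents differ; the DSW computation stays the same.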
Reconfiguration
                        Existing AMPs                     New AMPs
Before reconfiguration  16,384  16,384  16,384  16,384    EMPTY   EMPTY
After reconfiguration   10,923  10,923  10,923  10,923    10,922  10,922
The system creates new Hash Maps to accommodate the new configuration. Old and new maps are compared.
Each AMP reads its rows, and moves only those that hash to a new AMP.
It is not necessary to offload and reload data due to a reconfiguration.
Percentage of Rows Moved to new AMPs = Number of New AMPs / (SUM of Old + New AMPs) = 2 / 6 = 1 / 3 = 33.3%
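The arithmetic behind the reconfiguration example can be reproduced with a few lines of Python. The total of 65,536 Hash Map entries is taken from the review questions in this module; giving the leftover entries to the lowest-numbered AMPs is an assumption made for the sketch, and the function names are hypothetical.

```python
def entries_per_amp(total_entries: int, amp_count: int) -> list[int]:
    """Spread Hash Map entries as evenly as possible across AMPs.

    ASSUMPTION: leftover entries go to the lowest-numbered AMPs.
    """
    base, extra = divmod(total_entries, amp_count)
    return [base + 1 if i < extra else base for i in range(amp_count)]


def fraction_moved(old_amps: int, new_amps: int) -> float:
    """Number of new AMPs divided by the sum of old and new AMPs."""
    return new_amps / (old_amps + new_amps)


before = entries_per_amp(65_536, 4)   # 4 existing AMPs
after = entries_per_amp(65_536, 6)    # 4 existing + 2 new AMPs
moved = fraction_moved(old_amps=4, new_amps=2)

print(before)
print(after)
print(f"{moved:.1%}")
```

With 4 existing and 2 new AMPs, the sketch yields 16,384 entries per AMP before and 10,923 or 10,922 after, and about one third of the rows move, in line with the figures above.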
[Figure labels: 48-bit Table ID, DSW, Vdisk, Data Block. Only the AMP whose number appears in the referenced Hash Map entry is interrupted.]
Each Database/User/Profile/Role is assigned a globally unique numeric ID.
Each Table, View, Macro, Trigger, Stored Procedure, Join Index, and Hash Index is assigned a globally unique numeric ID.
Table ID
A Unique Value for Tables, Views, Macros, Triggers, and Stored Procedures comes from the DBC.Next dictionary table. Unique Value also defines the type of table:
SUB-TABLE ID 16 Bits
Table ID plus Row ID makes every row in the system unique. Examples shown in this manual use the Unique Value to represent the entire Table ID.
Row ID
On INSERT, the system stores both the data values and the Row ID.
ROW ID = ROW HASH and UNIQUENESS VALUE
Row Hash
  Row Hash is based on Primary Index value.
  Multiple rows in a table could have the same Row Hash.
  NUPI duplicates and hash synonyms have the same Row Hash.
Uniqueness Value
  The system creates a numeric 32-bit Uniqueness Value.
  The first row for a Row Hash has a Uniqueness Value of 1.
  Additional rows have ascending Uniqueness Values.
  Row IDs determine sort sequence within a Data Block.
  Row IDs support Secondary Index performance.
  The Row ID makes every row within a table uniquely identifiable.
Duplicate Rows
  Row ID uniqueness does not imply data uniqueness.
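The way Uniqueness Values ascend per Row Hash can be sketched in Python. This is a simplified model under stated assumptions: the class `RowIdAllocator` and the helper `pack_row_id` are hypothetical, the sketch only ever counts upward, and packing the two 32-bit parts into one 64-bit integer is just one convenient way to obtain the sort order.

```python
from collections import defaultdict


class RowIdAllocator:
    """Hand out (Row Hash, Uniqueness Value) pairs for inserted rows."""

    def __init__(self) -> None:
        # The first row for any Row Hash gets a Uniqueness Value of 1.
        self._next_uniq: dict[int, int] = defaultdict(lambda: 1)

    def assign(self, row_hash: int) -> tuple[int, int]:
        uniq = self._next_uniq[row_hash]
        self._next_uniq[row_hash] = uniq + 1
        return (row_hash, uniq)


def pack_row_id(row_hash: int, uniq: int) -> int:
    """Place the 32-bit Row Hash above the 32-bit Uniqueness Value."""
    return (row_hash << 32) | uniq


allocator = RowIdAllocator()
row_ids = [allocator.assign(h) for h in (1000, 999, 1000, 999, 1002)]

# Sorting by the packed value orders rows by Row Hash, then Uniqueness Value.
sorted_ids = sorted(row_ids, key=lambda rid: pack_row_id(*rid))
print(sorted_ids)
```

Two rows that share a Row Hash (NUPI duplicates or hash synonyms) still receive distinct Row IDs, which is why the Row ID identifies a row within a table even when the data values repeat.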
The PE sends a request to an AMP via the Message Passing Layer (PDE & BYNET).
An entry in the Cylinder Index identifies the Data Block.
The Data Block is the physical I/O unit and may or may not be memory resident.
[Figure: the request (Table ID, Row Hash, PI Value) arrives in AMP Memory, where the Master Index and the Cylinder Index (accessed in FSG Cache) lead to the CI and the Row on the Vdisk.]
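[Figure: Cylinder Indexes on the VDisk. One Cylinder Index holds SRD - A (DBD - A1, DBD - A2) and SRD - B (DBD - B1, DBD - B2), with Data Block A1 and Data Block A2 shown; another Cylinder Index holds SRD - B (DBD - B3, DBD - B4, DBD - B5).]
[Figure: Cylinder 0, Seg. 0 of the Vdisk holds entries CID 1, CID 2, CID 3 ... CID n, each referring to the CI of a Cylinder.]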
The Master Index and Cylinder Index entries are 4 bytes larger to include the Partition #s to support Partitioned Primary Index (PPI) tables.
For non-partitioned tables, the partition number is 0 and the Master and Cylinder Index entries (for NPPI tables) will use 0 as the partition number in the entry.
[Figure: a Cylinder on the VDisk holding Data Block A1, Data Block A2, Data Block B1, Data Block B2, and an FSE.]
One DBD per data block identifies the block's location, its first Part # / Row ID, and its last Part # / Row Hash.
A row has a maximum size of 64,256 bytes in releases V2R3 through V2R5.1.
[Figure: Data Block layout with a Header (36 bytes), Row 1 through Row 4, the Row Reference Array (entries -3, -2, -1, 0), and a Trailer (2 bytes).]
The Primary Index value determines the Row Hash.
The system generates the Uniqueness Value.
NPPI: Non-Partitioned Primary Index (typical Teradata primary index).
For an NPPI table, the Row ID will be unique for every row in a table (for both SET and MULTISET).
Rows in a table may vary in length. The maximum row length is 64,256 bytes (or 62.75 KB).
In V2R5, if the Primary Index is not partitioned, then the row is implicitly assumed to be in partition #0.
Master Index

Lowest                        Highest                        Pdisk and
Table ID  Part #  Row ID      Table ID  Part #  Row Hash     Cylinder Number
   :        :       :            :        :       :             :
  078       0     58234, 2      095       0     72194          204
  098       0     00107, 1      100       0     00676          037
  100       0     00773, 3      100       0     01361          169
  100       0     01361, 2      100       0     02884          777
  100       0     02937, 1      100       0     03602          802
  100       0     03662, 1      100       0     03999          117
  100       0     04123, 2      100       0     05888          888
  100       0     05974, 1      100       0     07328          753
  100       0     07353, 1      120       0     00469          477
  123       1     00343, 2      123       2     01864          529
  123       2     06923, 1      123       3     00231          943
   :        :       :            :        :       :             :
To CYLINDER INDEX
Cylinder Index - Cylinder #169

SRDs     Table ID  First DBD Offset  DBD Count
SRD #1   100       FFFF              12

DBDs     Part #  Lowest Row ID  Part #  Highest RowHash  Start Sector  Start Sector  Sector Count
  :        :        :             :        :               :             :             :
DBD #4     0     00867, 2         0     00902            1010          0270          3
DBD #5     0     00938, 1         0     00996            0093          0301          5
DBD #6     0     00998, 1         0     01010            0789          0349          5
DBD #7     0     01010, 3         0     01177            0525          0470          4
DBD #8     0     01185, 2         0     01258            0056          0481          6
DBD #9     0     01290, 1         0     01333            1138          0550          5
  :        :        :             :        :               :             :             :
This example assumes that only 1 table ID has rows on this cylinder and the table is not partitioned.
[Figure: Data Block layout with sector labels 790 to 794 and the Row Heap.]
A block is the physical I/O unit.
The block header contains the Table ID (6 bytes).
Only rows for the same table reside in the same data block.
Rows are not split across block boundaries.
Blocks within a table vary in size. The system adjusts block sizes dynamically.
Blocks may be from 512 bytes to 127.5 KB (1 to 255 disk sectors).
With V2R3 and V2R4.0, the maximum block size is 127 sectors (63.5 KB).
Data blocks are not chained together.
Row Reference Array pointers are stored (sorted) in reverse sequence based on Row ID within the block.
The Primary Index data value is used as a row qualifier to eliminate synonyms.
Search: PI Value 3755, Row Hash 1000

Row ID              Index
Hash    Uniq Value  Value  Data Columns
 998    1           4219   Row data
 999    1           2968   Row data
 999    2           6324   Row data
1000    1           1006   Row data
1000    2           3755   Row data
1002    1           6838   Row data
1008    1           8825   Row data
1010    1           0250   Row data
AMP memory, cache size, and locality of reference determine whether either of these steps requires physical I/O.
Often, the Cylinder Index is memory resident and a Unique Primary Index retrieval requires only one (1) I/O.
[Figure: the Message Passing Layer delivers the Table ID, Row Hash, and PI Value to AMP Memory, where the Master Index, the Cylinder Index (accessed in FSG Cache), and the Data Block (accessed in FSG Cache) lead to the CI and the Row on the Vdisk.]
Review Questions
1. The Row Hash for a PI value of 824 is the same for the data types of INTEGER and DECIMAL(18,0). True or False. _______
2. The first 16 bits of the Row Hash is referred to as the _________ or the _______ _________ .
3. The Hash Map consists of 65,536 entries which identify an _____ number for the Row Hash.
4. The Current Configuration ___________ Hash Map is used to locate the AMP to locate/store a row based on PI value.
5. The ____________ utility is used to redistribute rows to a new system configuration with more AMPs.
6. The Unique Value of a Table ID comes from the dictionary table named DBC.________ .
7. The Row ID consists of the _______ ________ and the __________ _____ .
8. The _______ _______ contains a Cylinder Index Descriptor (CID) for each allocated Cylinder.
9. The _______ _______ contains an entry for each data block in the cylinder.
10. The ____ __________ ________ consists of a set of 2 byte pointers to the data rows in data block.
11. For Teradata V2R5.0, the maximum block size is approximately _______ and the maximum row size is approximately _______ .
12. The Primary Index data value is used as a row qualifier to eliminate hash _____________ .