Lecture Clustering

BBIT/SEM Advanced Databases
Clustering
Oracle Server Concepts Manual
Database System Concepts, Silberschatz/Korth, Sec. 10.7
Fundamentals of Database Systems, Elmasri/Navathe, Sec. 5.10
Stephen Mc Kearney, 2001.
Overview
Topics: Definition; Intra-file Clustering; Inter-file Clustering; Clustering in Pages; Unclustered Relations; Clustered Relations; Advantages; Disadvantages; Applications; Clustering Index; Clustering in Oracle; Criteria for Clustering; Comparison
Questions:
What types of clustering exist? When is it used?
How does clustering work?
Advantages & Disadvantages?
How is it implemented?
How is it implemented in Oracle?
How do you decide to cluster data?
How does clustering compare to B+-Trees?
Compare clustered and unclustered?
Definition
Clustering means that records related to each other are stored physically beside each other.
Clustering is a method of storing data on a disc. A cluster is used to store tuples from one or more relations physically close to related tuples in the database. The purpose of clustering is to speed up the performance of certain types of queries. Tuples that are physically close to each other can be retrieved more quickly than tuples that are scattered across the disc.
Because clustering affects how the data is actually stored on the disc, the decision to use clustering in the database is part of the physical database design process. Clustering does not affect the applications that access the relations which have been clustered. Clustered and unclustered relations appear the same to users of the system.
Intra-file Clustering
Data items in a single file are stored together.
Supplier 1, Supplier 2, Supplier 3, ..., Supplier n
Suppliers are stored in the order they are most often retrieved.
In intra-file clustering records in a single file are stored close to related records in the same file. For example, if suppliers are normally ordered by their supplier number then each supplier would be stored next to the supplier with the next highest supplier number.
Inter-file Clustering
Data items in two or more files are stored together.
Supplier 1, Shipment A, Shipment B, Supplier 2, Shipment C, Shipment D, Shipment E, Supplier 3, Shipment F, Shipment G
Shipments from one file are stored beside suppliers in another file.
In inter-file clustering records from one file are stored close to records from another file. For example, a shipment from a shipments file would be stored close to the supplier of the shipment.
Clustering Data in Pages

Figure: pages laid out on a disc.
Pages that are close together are quicker to retrieve: the disc must rotate less to read each page.
Pages that are far apart are slower to retrieve: the disc must rotate further to read each page.
Data that is stored close together will be quicker to retrieve.
Clustering affects the physical position of data on the disc. When two data items are stored on the same page on the disc, they can be read with one page read operation. Because the computer reads one page at a time, data items stored on the same page will be read at the same time. When two data items are stored on pages that are close to each other on the disc, they can be read with two page read operations. Because the pages occur one after another there is no disc head movement between reads (no seek time). When two data items are stored in separate locations on the disc, they can be read with two page read operations and a seek operation. Because the pages occur at separate locations on the disc the disc head must move to a new position on the disc to read the second page.
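The three cases above can be written down as a small cost model. This is only a sketch: the timing constants are invented for illustration, and `read_cost_ms` is a hypothetical helper, not part of any DBMS.

```python
# Illustrative timings only (assumed, not measured).
PAGE_READ_MS = 1.0   # time to transfer one page
SEEK_MS = 10.0       # time to move the disc head to a new position


def read_cost_ms(page_a: int, page_b: int) -> float:
    """Estimated cost of reading two data items, given the pages they are stored on."""
    if page_a == page_b:
        return PAGE_READ_MS                  # same page: one page read
    if abs(page_a - page_b) == 1:
        return 2 * PAGE_READ_MS              # adjacent pages: two reads, no seek
    return 2 * PAGE_READ_MS + SEEK_MS        # separate locations: two reads and a seek


same_page = read_cost_ms(7, 7)
adjacent = read_cost_ms(7, 8)
far_apart = read_cost_ms(7, 900)
print(same_page, adjacent, far_apart)
```

Whatever the real timings are, the ordering is the point: same page is cheapest, adjacent pages cost one extra read, and separated pages add a seek on top.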
Unclustered Relations
Figure: the emp and dept tables stored in separate pages. Adapted from Oracle7 Server Concepts Manual.
Unclustered relations are stored in their own pages on the disc. That is, each page will contain tuples from one relation only. The pages may be positioned anywhere on the disc. Therefore, to join two relations at least two pages must be read from the disc, one page for each relation. In the figure above, the emp relation (table) is stored at one location on the disc and the dept relation (table) is stored at another location.
Clustered Relations
Figure: the emp and dept tables stored together in a cluster on deptno. Adapted from Oracle7 Server Concepts Manual.
Clustered relations are stored using a cluster key. Each relation belonging to the cluster has an attribute corresponding to the cluster key. Each block will store tuples with a particular cluster key value. In the figure above, the cluster key is deptno and all the departments and employees with deptno=10 are stored together. This type of cluster will improve the performance of queries that join the emp and the dept relations. Note that the cluster key value is only stored once for each distinct value. For example, the value deptno=10 is only stored once and all tuples with deptno=10 are stored together.
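A rough way to picture this is one block per cluster key value, holding the tuples of both relations. The sketch below models that grouping in memory; the sample rows and the `build_cluster` helper are hypothetical, and a real DBMS lays blocks out on disc rather than in a dictionary.

```python
from collections import defaultdict

# Hypothetical sample rows: (deptno, department name) and (deptno, employee name).
dept_rows = [(10, "SALES"), (20, "RESEARCH")]
emp_rows = [(10, "Smith"), (20, "Jones"), (10, "Allen")]


def build_cluster(dept_rows, emp_rows):
    """Group dept and emp tuples into one block per cluster key value."""
    blocks = defaultdict(lambda: {"dept": [], "emp": []})
    for deptno, dname in dept_rows:
        blocks[deptno]["dept"].append(dname)   # deptno is the block key, not repeated per row
    for deptno, ename in emp_rows:
        blocks[deptno]["emp"].append(ename)
    return dict(blocks)


cluster = build_cluster(dept_rows, emp_rows)
print(cluster[10])
```

Each key value appears once, as the dictionary key, and every tuple with that value sits in the same block, which mirrors the way deptno=10 is stored once in the cluster.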
Advantages
Speeds up some queries
Uses less space
Cluster layout: Supplier 1, Shipment A, Shipment B, Supplier 2, Shipment C, Shipment D, Shipment E, Supplier 3, Shipment F, Shipment G
Shipments A and B are for supplier 1.
A query for all shipments of supplier 1 will be quick because all the shipments for supplier 1 follow immediately after supplier 1.
Clustering will speed up some database queries. For example, a cluster consisting of suppliers and shipments will speed up queries that request all the shipments for a particular supplier. The cluster improves the supplier/shipment query because the data for each shipment is stored on the same page as the corresponding supplier. Hence, when the supplier record is read the set of shipments is also read. The cluster key value that is used to cluster relations is only stored once in each page. This may save disc space.
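The saving can be illustrated by counting how many pages hold the records a query needs. In this sketch the clustered layout follows the supplier/shipment example, while the unclustered placement and the `pages_touched` helper are assumptions made for the comparison.

```python
def pages_touched(layout, wanted):
    """Count the pages that hold at least one of the wanted records."""
    return sum(1 for page in layout if any(rec in wanted for rec in page))


# Records needed by "all shipments for supplier 1".
wanted = {"Supplier 1", "Shipment A", "Shipment B"}

# Clustered: each supplier shares a page with its own shipments.
clustered = [
    ["Supplier 1", "Shipment A", "Shipment B"],
    ["Supplier 2", "Shipment C", "Shipment D", "Shipment E"],
    ["Supplier 3", "Shipment F", "Shipment G"],
]

# Unclustered (assumed placement): suppliers in one file,
# shipments in arrival order in another.
unclustered = [
    ["Supplier 1", "Supplier 2", "Supplier 3"],
    ["Shipment C", "Shipment A", "Shipment F"],
    ["Shipment D", "Shipment B", "Shipment G", "Shipment E"],
]

clustered_reads = pages_touched(clustered, wanted)
unclustered_reads = pages_touched(unclustered, wanted)
print(clustered_reads, unclustered_reads)
```

With this placement the clustered layout needs one page read and the unclustered layout needs three; a different unclustered placement could do better or worse, but it can never beat a single page.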
Disadvantages
Slows down some queries
Slows down writes
Cluster layout: Supplier 1, Shipment A, Shipment B, Supplier 2, Shipment C, Shipment D, Shipment E, Supplier 3, Shipment F, Shipment G
To read all the shipment records the supplier records must also be read.
A query for all shipments will be slow because the shipments are not stored together on the disc.
Clustering will slow down certain types of queries. For example, the cluster on suppliers and shipments will slow down queries that ask for all shipments. The cluster slows down the all shipments query because the shipments are stored with each supplier. To read all the shipments the DBMS must also read the supplier data. Inserting new records into a cluster may also be slow. For example, adding a new shipment for supplier 1 will involve making space after shipment B.
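A back-of-the-envelope count shows why the scan gets slower. The sketch assumes equal-sized records packed densely at four per page; the capacity and both helper functions are illustrative assumptions, not figures from a real system.

```python
import math

PAGE_CAPACITY = 4  # assumed number of records per page

# Number of shipments stored with each supplier in the example.
shipments_per_supplier = {"Supplier 1": 2, "Supplier 2": 3, "Supplier 3": 2}


def scan_pages_unclustered(counts):
    """Pages read to scan a file that contains only shipment records."""
    total_shipments = sum(counts.values())
    return math.ceil(total_shipments / PAGE_CAPACITY)


def scan_pages_clustered(counts):
    """Pages read to scan the cluster: one supplier record precedes each group."""
    total_records = sum(1 + n for n in counts.values())
    return math.ceil(total_records / PAGE_CAPACITY)


unclustered_scan = scan_pages_unclustered(shipments_per_supplier)
clustered_scan = scan_pages_clustered(shipments_per_supplier)
print(unclustered_scan, clustered_scan)
```

Seven shipments fit in two pages on their own, but the ten records of the cluster need three, so the "all shipments" query reads an extra page of supplier data it does not want.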
Applications 1 - Hierarchies
ER Diagram: Customer, Order, Order Line
ER Instance: Customer 1 and Customer 2; Customer 1 has Order 1, Order 2 and Order 3; each order has Order Line 1 and Order Line 2.
Cluster: Customer 1, Order 1, Order Line 1, Order Line 2, Order 2, Order Line 1, Order Line 2, Order 3, ...
A hierarchy of customer to orders to order lines.
Clustering is used when the data has a hierarchical structure. For instance, in the example above, the cluster would be used when the most common queries will retrieve all the orders and order lines for a customer. A cluster to store the above structure would cluster all the order lines with their corresponding orders and then the orders and order lines would be stored with their corresponding customer.
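One way to picture the resulting storage order is a depth-first walk of the hierarchy, in which every parent is followed immediately by its descendants. The sketch below assumes a small in-memory hierarchy; `cluster_order` and the sample data are hypothetical.

```python
def cluster_order(node, children):
    """Return records in depth-first order: each parent followed by its descendants."""
    order = [node]
    for child in children.get(node, []):
        order.extend(cluster_order(child, children))
    return order


# Hypothetical instance: one customer, two orders, three order lines.
children = {
    "Customer 1": ["Order 1", "Order 2"],
    "Order 1": ["Order 1 / Line 1", "Order 1 / Line 2"],
    "Order 2": ["Order 2 / Line 1"],
}

layout = cluster_order("Customer 1", children)
print(layout)
```

A query for everything belonging to Customer 1 then reads one contiguous run of records instead of visiting three separate files.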
Applications 2 - Lists
List of Products: Product 1, Product 2, Product 3
Cluster: Product 1, Product 2, Product 3
A cluster may be used when queries will retrieve lists of data items. In the example above, the cluster of products will improve queries requesting all the products.
Applications 3 - SQL Joins
Equi-joins
SELECT name, address, deptname
FROM emp, dept
WHERE emp.deptno = dept.deptno;
The emp and dept relations may be clustered on the deptno attribute.
A cluster may be used to cluster relations that are frequently joined together. In the above example, the relations emp and dept may be clustered on the deptno attribute. The value of each deptno will be stored once together with all the corresponding emp and dept tuples.
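To see why the join becomes cheap, consider evaluating it over blocks that already hold matching emp and dept tuples. The following sketch assumes each block stores the department name and its employees under one deptno; the data and the `join_clustered` helper are invented for illustration.

```python
def join_clustered(blocks):
    """Evaluate the emp/dept equi-join on deptno by reading each cluster block once."""
    result = []
    for deptno in sorted(blocks):
        block = blocks[deptno]
        for name, address in block["emp"]:
            # The matching dept tuple is already in the same block: no second lookup.
            result.append((name, address, block["deptname"]))
    return result


# Hypothetical cluster contents keyed by deptno.
blocks = {
    10: {"deptname": "Accounts", "emp": [("Ann", "1 High St"), ("Bob", "2 Low Rd")]},
    20: {"deptname": "Sales", "emp": [("Cat", "3 Mill Ln")]},
}

rows = join_clustered(blocks)
print(rows)
```

The join result falls out of a single pass over the cluster, because the matching was done once, when the tuples were stored.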
Clustering Index
Index on Deptno: 10, 20, 30
Page P1: Dept, Employee, Employee, Employee (all records with deptno=10)
Page P2: Dept, Employee, Employee, Employee (all records with deptno=20)
Page P3: Dept, Employee, Employee, Employee (all records with deptno=30)
The DBMS uses a clustering index when it implements a cluster. The clustering index is used to index the cluster key. This allows the DBMS to efficiently access the data in the cluster. The cluster index contains an entry for each cluster key value. The index may be a B+-Tree.
Ref: Elmasri, sec 6.1.2
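The idea of one index entry per cluster key value can be sketched with a sorted list and binary search. This stands in for the B+-Tree purely for illustration; the `ClusterIndex` class is hypothetical and is not how Oracle or any other DBMS stores its index.

```python
from bisect import bisect_left


class ClusterIndex:
    """One (cluster key value, page id) entry per distinct key, kept in key order."""

    def __init__(self, entries):
        self.entries = sorted(entries)
        self.keys = [key for key, _ in self.entries]

    def lookup(self, key):
        """Return the page holding all records with this key, or None if absent."""
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.entries[i][1]
        return None


index = ClusterIndex([(10, "P1"), (20, "P2"), (30, "P3")])
page_for_20 = index.lookup(20)
page_for_25 = index.lookup(25)
print(page_for_20, page_for_25)
```

A lookup for deptno=20 goes straight to page P2, where the department and all of its employees are stored together.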
Clustering in Oracle
Create a cluster
CREATE CLUSTER emp_dept (deptno NUMBER(3));
Create a cluster index
CREATE INDEX emp_dept_index ON CLUSTER emp_dept;
Create Tables
CREATE TABLE dept (
  deptno NUMBER(3) PRIMARY KEY,
  ...
) CLUSTER emp_dept (deptno);
CREATE TABLE emp (
  empno NUMBER(5),
  deptno NUMBER(3) REFERENCES dept,
  ...
) CLUSTER emp_dept (deptno);
There are three steps required to create a cluster in Oracle:
1. Create the cluster. The space for the cluster is allocated on the disc.
2. Create the cluster index. Oracle requires a cluster index to be able to access the cluster. Therefore, the cluster index must exist before data can be added to the cluster.
3. Create the tables. When the tables are created a parameter is added to the CREATE TABLE command indicating the cluster to which the table will belong.
Once the cluster has been created the normal data manipulation commands (INSERT, DELETE, UPDATE, SELECT) may be used. Therefore, using a cluster to improve the performance of a database does not affect the application programs that access the data.
Criteria for Clustering
Query Requirements: Joins, Lists, Hierarchies
Space Requirements: Clustering may save space
Update Requirements: Clustering may slow updates
Deciding to cluster a set of relations depends on three factors:
Query requirements: Clustering improves joins between relations because it stores related tuples together in the same page. When the most common queries involve joining two relations, a cluster may improve performance.
Space requirements: Because each cluster key value is only stored once, storing relations in a cluster can use less storage space than storing the same relations separately. If storage space is restricted, clustering the data may save space.
Update requirements: Clusters are difficult to update because space must be left to allow for additional clustered tuples. If space is not available, it may be necessary to move tuples between pages.
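The update cost can be illustrated with a page that has limited free space. The sketch below uses a simplification: when the page for a key is full, the new tuple goes to a chained overflow page rather than tuples being moved between pages. The page capacity and the `insert_tuple` helper are assumptions for illustration.

```python
PAGE_CAPACITY = 4  # assumed number of tuples per page


def insert_tuple(pages, key, tup):
    """Insert tup into the last page for key; allocate an overflow page when it is full.

    pages maps a cluster key value to a chain of pages (each a list of tuples).
    Returns True if a new page had to be allocated.
    """
    chain = pages.setdefault(key, [[]])
    if len(chain[-1]) < PAGE_CAPACITY:
        chain[-1].append(tup)      # free space was left: cheap insert
        return False
    chain.append([tup])            # no space left: extra page, extra read later
    return True


pages = {10: [["dept 10", "emp A", "emp B"]]}
first = insert_tuple(pages, 10, "emp C")    # fits in the remaining free slot
second = insert_tuple(pages, 10, "emp D")   # page is full, so a new page is chained
print(first, second, pages[10])
```

Once a key spills onto a second page, reading "all records with deptno=10" costs two page reads instead of one, which is why free space has to be planned when the cluster is designed.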
Comparison with Other Techniques
B+-Tree | Cluster
Fast access to individual tuples | Fast access across relations
Does not affect the order of data | Changes the order of the data
Can be ignored if not useful | Must be searched to access data
Easy to create and delete | Difficult to create and delete
A B+-Tree is designed to provide fast access to individual tuples in a relation. A cluster is designed to improve the performance of queries that join two or more relations together.
A B+-Tree does not affect the order of the actual data. Although the index may be ordered, the actual data remains unordered. A cluster orders the actual data.
A B+-Tree does not have to be used to answer a query. It is possible to access the data directly if using the B+-Tree is too inefficient. As a cluster affects the physical ordering of the data, the cluster must be accessed to retrieve the data. Hence, a cluster will slow down certain queries.
A B+-Tree index is easy to create and delete because it is separate from the data. A cluster is difficult to create or change because it must be created before the data is added to the database. Deleting a cluster will destroy the data.
Partitioned Table
CREATE TABLE sales (
  acct_no NUMBER(5),
  acct_name CHAR(30),
  amount_of_sale NUMBER(6),
  week_no INTEGER
)
PARTITION BY RANGE (week_no) (
  PARTITION sales1 VALUES LESS THAN (4) TABLESPACE ts0,
  PARTITION sales2 VALUES LESS THAN (8) TABLESPACE ts1,
  ...
  PARTITION sales13 VALUES LESS THAN (52) TABLESPACE ts12
);
Oracle Concepts Manual
Partitioned Index 1
Partitioned Index 2
Partitioned Index 3
Equipartitioned Tables
Better availability and reliability
Disc Striping
