Académique Documents
Professionnel Documents
Culture Documents
11/16/2014
Content
Database Design
Entity-relationship model
Relational database design
Multimedia database
In the early days, database applications were built directly on top of file systems
Drawbacks of using file systems to store data:
Data redundancy and inconsistency
Multiple file formats, duplication of information in different files
Difficulty in accessing data
Need to write a new program to carry out each new task
Data isolation multiple files and formats
Integrity problems
Integrity constraints (e.g. account balance > 0) become buried in
Levels of Abstraction
View of Data
Data Models
Language for accessing and manipulating the data organized by the appropriate
data model
DML also known as query language
Retrieval of information
Insertion of new information
Deletion of information
Modification of information
Two classes of languages
Procedural user specifies what data is required and how to get those data
Declarative (nonprocedural) user specifies what data is required without
specifying how to get those data
SQL is the most widely used query language
Data
(DDL)
Definition
Language
Example:
Relational Model
Attributes
Database Design
The process of designing the general structure of the database:
Logical Design Deciding on the database schema. Database design requires that
we find a good collection of relation schemas.
Business decision What attributes should we record in the database?
Computer Science decision What relation schemas should we have and
how should the attributes be distributed among the various relation schemas?
Extend the relational data model by including object orientation and constructs to
XML:
Extensible
Language
Markup
Storage Management
Storage manager is a program module that provides the interface between the lowlevel data stored in the database and the application programs and queries
submitted to the system.
The storage manager is responsible to the following tasks:
Interaction with the file manager
Efficient storing, retrieving and updating of data
Components:
Authorization and integrity manager
Transaction manager
File manager
Buffer manager
Issues:
Storage access
File organization
Indexing and hashing
Query Processing
1.
2.
3.
Evaluation
Transaction Management
Database Architecture
The architecture of a database systems is greatly influenced by
the underlying computer system on which the database is running:
Centralized
Client-server
Parallel (multi-processor)
Distributed
Database Users
Users are differentiated by the way they expect to interact with
the system
Application programmers interact with system through DML calls
Sophisticated users form requests in a database query language
Specialized users write specialized database applications that do not fit into the
traditional data processing framework
Nave users invoke one of the permanent application programs that have been
written previously
Examples, people accessing database over the web, bank tellers, clerical staff
Database Administrator
Coordinates all the activities of the database system; the database administrator has
a good understanding of the enterprises information resources and needs.
Database administrator's duties include:
Schema definition
Storage structure and access method definition
Schema and physical organization modification
Granting user authority to access the database
Specifying integrity constraints
Acting as liaison with users
Monitoring performance and responding to changes in requirements
2000s:
XML and XQuery standards
Automated database administration
2014
11/16/2014
Relational Model
Example of a Relation
Basic Structure
Then r = {
Attribute Types
Relation Schema
Relation Instance
Database
Keys
Let K R
K is a superkey of R if values for K are sufficient to identify a unique tuple of each
possible relation r(R)
by possible r we mean a relation r that could exist in the enterprise we are
modeling.
Example: {customer_name, customer_street} and
{customer_name}
are both superkeys of Customer, if no two customers can possibly have the
same name
In real life, an attribute such as customer_id would be used instead of
customer_name to uniquely identify customers, but we omit it to keep
our examples small, and instead assume customer names are unique.
K is a candidate key if K is minimal
Example: {customer_name} is a candidate key for Customer, since it is a
superkey and no subset of it is a superkey.
Primary key: a candidate key chosen as the principal means of identifying tuples
within a relation
Should choose an attribute whose value never, or very rarely, changes.
E.g. email address is unique, but may change
Foreign Keys
A relation schema may have an attribute that corresponds to the primary key of
another relation. The attribute is called a foreign key.
E.g. customer_name and account_number attributes of depositor are foreign
keys to customer and account respectively.
Only values occurring in the primary key attribute of the referenced relation
may occur in the foreign key attribute of the referencing relation.
Schema diagram
Query Languages
Relational Algebra
Procedural language
Six basic operators
select:
project:
union:
set difference:
Cartesian product: x
rename:
The operators take one or two relations as inputs and produce a new relation as
a result.
A=B ^ D > 5
12
23 10
23 10
(r)
Select Operation
Notation: p(r)
p is called the selection predicate
Defined as:
p(r) = {t | t
r and p(t)}
Example of selection:
branch_name=Perryridge(account)
Project Operation
Notation:
where A1, A2 are attribute names and r is a relation name.
The result is defined as the relation of k columns obtained by erasing the columns
that are not listed
Duplicate rows removed from result, since relations are sets
Example: To eliminate the branch_name attribute of account
account_number, balance (account)
Union Operation
Notation: r s
Defined as:
r s = {t | t r or t s}
For r s to be valid.
1. r, s must have the same arity (same number of attributes)
2. The attribute domains must be compatible (example: 2nd column
of r deals with the same type of values as does the 2nd
column of s)
Example: to find all customers with either an account or a loan
customer_name (depositor)
customer_name (borrower)
Notation r s
Defined as:
r s = {t | t r and
s}
Cartesian-Product Operation
Example
Cartesian-Product Operation
Notation r x s
Defined as:
r x s = {t q | t r and q s}
Assume that attributes of r(R) and s(S) are disjoint. (That is, R S = ).
If attributes of r(R) and s(S) are not disjoint, then renaming must be used.
Composition of Operations
A=C(r x s)
Rename Operation
x (E)
returns the result of expression E under the name X, and with the
attributes renamed to A1 , A2 , ., An .
Banking Example
branch (branch_name, branch_city, assets)
customer (customer_name, customer_street, customer_city)
account (account_number, branch_name, balance)
loan (loan_number, branch_name, amount)
depositor (customer_name, account_number)
borrower (customer_name, loan_number)
Example Queries
Find the loan number for each loan of an amount greater than $1200
Find the names of all customers who have a loan, an account, or both, from the
bank
balance(account) - account.balance
(account.balance < d.balance (account x d (account)))
Formal Definition
A basic expression in the relational algebra consists of either one of the following:
A relation in the database
A constant relation
Let E1 and E2 be relational-algebra expressions; the following are all relationalalgebra expressions:
E1 E2
E1 E2
E1 x E2
p (E1), P is a predicate on attributes in E1
s(E1), S is a list consisting of some of the attributes in E1
Additional Operations
We define additional operations that do not add any power to the
relational algebra, but that simplify common queries.
Set intersection
Natural join
Division
Assignment
Set-Intersection Operation
Notation: r s
Defined as:
r s = { t | t r and t s }
Assume:
r, s have the same arity
attributes of r and s are compatible
Note: r s = r (r s)
Set-Intersection Operation
Example
Natural-Join Operation
Example:
R = (A, B, C, D)
S = (E, B, D)
Result schema = (A, B, C, D, E)
r
s is defined as:
r.A,
x s))
Division Operation
Notation:
Suited to queries that include the phrase for all.
Let r and s be relations on schemas R and S respectively where
R = (A1, , Am , B1, , Bn )
S = (B1, , Bn)
The result of r s is a relation on schema
R S = (A1, , Am)
r s = { t | t R-S (r
u s ( tu r ) }
Where tu means the concatenation of tuples t and u to produce a single tuple
Property
Let q = r s
Then q is the largest relation satisfying q x s r
Definition in terms of the basic algebra operation
Let r(R) and s(S) be relations, and let S R
r
R-S
(r ) R-S
R-S
(r ) x s ) R-S,S(r ))
To see why
R-S,S (r) simply reorders attributes of r
R-S
R-S
R-S
Assignment Operation
Find all customers who have an account at all branches located in Brooklyn city.
account)
Extended Relational-AlgebraOperations
Generalized Projection
Aggregate Functions
Outer Join
Generalized Projection
Each of F1, F2, , Fn are are arithmetic expressions involving constants and
attributes in the schema of E.
Given relation credit_info(customer_name, limit, credit_balance), find how much
more each person can spend:
customer_name, limit credit_balance (credit_info)
Aggregate Operation
Example
branch_name
Outer Join
Null Values
It is possible for tuples to have a null value, denoted by null, for some of their
attributes
null signifies an unknown value or that a value does not exist.
The result of any arithmetic expression involving null is null.
Aggregate functions simply ignore null values (as in SQL)
For duplicate elimination and grouping, null is treated like any other value, and two
nulls are assumed to be the same (as in SQL)
Comparisons with null values return the special truth value: unknown
If false was used instead of unknown, then not (A < 5)
would not be equivalent to
A >= 5
Three-valued logic using the truth value unknown:
OR: (unknown or true)
= true,
(unknown or false)
= unknown
(unknown or unknown) = unknown
AND: (true and unknown)
= unknown,
(false and unknown)
= false,
(unknown and unknown) = unknown
NOT: (not unknown) = unknown
In SQL P is unknown evaluates to true if predicate P evaluates to
unknown
Result of select predicate is treated as false if it evaluates to unknown
The content of the database may be modified using the following operations:
Deletion
Insertion
Updating
All these operations are expressed using the assignment operator.
Deletion
Deletion Examples
Insertion
Insertion Examples
Updating
A mechanism to change a value in a tuple without charging all values in the tuple
Use the generalized projection operator to do this task
r F1 , F2 ,, Fl , (r )
Each Fi is either
Update Examples
2014
11/16/2014
SQL
Data Definition
Basic Query Structure
Set Operations
Aggregate Functions
Null Values
Nested Subqueries
Complex Queries
Views
Modification of the Database
Joined Relations**
History
IBM Sequel language developed as part of System R project at the IBM San Jose
Research Laboratory
Renamed Structured Query Language (SQL)
ANSI and ISO standard SQL:
SQL-86
SQL-89
SQL-92
SQL:1999 (language name became Y2K compliant!)
SQL:2003
Commercial systems offer most, if not all, SQL-92 features, plus varying feature
sets from later standards and special proprietary features.
Not all examples here may work on your particular system.
Example:
create table branch
(branch_name
char(15) not null,
branch_city char(30),
assets
integer)
not null
primary key (A1, ..., An )
The drop table command deletes all information about the dropped relation from
the database.
The alter table command is used to add attributes to an existing relation:
alter table r add A D
where A is the name of the attribute to be added to relation r and D is the domain
of A.
All tuples in the relation are assigned null as the value for the new attribute.
The alter table command can also be used to drop attributes of a relation:
alter table r drop A
where A is the name of an attribute of relation r
Dropping of attributes not supported by many databases
SQL is based on set and relational operations with certain modifications and
enhancements
A typical SQL query has the form:
select A1, A2, ..., An
from r1, r2, ..., rm
where P
Ai represents an attribute
ri represents a relation
P is a predicate.
This query is equivalent to the relational algebra expression.
A1 , A2 ,, An ( P (r1 r2 rm ))
The select clause list the attributes desired in the result of a query
corresponds to the projection operation of the relational algebra
Example: find the names of all branches in the loan relation:
select branch_name
from loan
In the relational algebra, the query would be:
branch_name (loan)
NOTE: SQL names are case insensitive (i.e., you may use upper- or lower-case
letters.)
E.g. Branch_Name BRANCH_NAME branch_name
Some people use upper case wherever we use bold font.
SQL allows duplicates in relations as well as in query results.
To force the elimination of duplicates, insert the keyword distinct after select.
Find the names of all branches in the loan relations, and remove duplicates
select distinct branch_name
from loan
The keyword all specifies that duplicates not be removed.
select all branch_name
from loan
An asterisk in the select clause denotes all attributes
select *
from loan
The select clause can contain arithmetic expressions involving the operation, +, ,
, and /, and operating on constants or attributes of tuples.
The query:
select loan_number, branch_name, amount 100
from loan
would return a relation that is the same as the loan relation, except that the value
of the attribute amount is multiplied by 100.
The where clause specifies conditions that the result must satisfy
Corresponds to the selection predicate of the relational algebra.
To find all loan number for loans made at the Perryridge branch with loan amounts
greater than $1200.
select loan_number
from loan
where branch_name = 'Perryridge' and amount > 1200
Comparison results can be combined using the logical connectives and, or, and
not.
Comparisons can be applied to results of arithmetic expressions.
SQL includes a between comparison operator
Example: Find the loan number of those loans with loan amounts between
$90,000 and $100,000 (that is, $90,000 and $100,000)
The SQL allows renaming relations and attributes using the as clause:
old-name as new-name
Find the name, loan number and loan amount of all customers; rename the column
name loan_number as loan_id.
Tuple Variables
Tuple variables are defined in the from clause via the use of the as clause.
Find the customer names and their loan numbers for all customers having a loan at
some branch.
select customer_name, T.loan_number, S.amount
from borrower as T, loan as S
where T.loan_number = S.loan_number
Find the names of all branches that have greater assets than
some branch located in Brooklyn.
select distinct T.branch_name
from branch as T, branch as S
where T.assets > S.assets and S.branch_city = 'Brooklyn'
String Operations
Find the names of all customers whose street includes the substring Main.
select customer_name
from customer
where customer_street like '% Main%'
List in alphabetic order the names of all customers having a loan in Perryridge
branch
select distinct customer_name
from
borrower, loan
where borrower loan_number = loan.loan_number and
branch_name = 'Perryridge'
order by customer_name
We may specify desc for descending order or asc for ascending order, for each
attribute; ascending order is the default.
Example: order by customer_name desc
Duplicates
In relations with duplicates, SQL can define how many copies of tuples appear in
the result.
Multiset versions of some of the relational algebra operators given multiset
relations r1 and r2:
1.
2.
A (t1) in
3.
satisfies selections
A (r1
A1 , A2 ,, An ( P (r1 r2 rm ))
Set Operations
The set operations union, intersect, and except operate on relations and
correspond to the relational algebra operations
Each of the above operations automatically eliminates duplicates; to retain all
duplicates use the corresponding multiset versions union all, intersect all and
except all.
Suppose a tuple occurs m times in r and n times in s, then, it occurs:
m + n times in r union all s
min(m,n) times in r intersect all s
max(0, m n) times in r except all s
Aggregate Functions
Note:
Find the names of all branches where the average account balance is more than
$1,200.
Null Values
It is possible for tuples to have a null value, denoted by null, for some of their
attributes
null signifies an unknown value or that a value does not exist.
The predicate is null can be used to check for null values.
Example: Find all loan number which appear in the loan relation with null
values for amount.
select loan_number
from loan
where amount is null
The result of any arithmetic expression involving null is null
Example: 5 + null returns null
However, aggregate functions simply ignore nulls
More on next slide
Nested Subqueries
Example Query
Find all customers who have both an account and a loan at the bank.
select distinct customer_name
from borrower
where customer_name in (select customer_name
from depositor )
Find all customers who have a loan at the bank but do not have
an account at the bank
Find all customers who have both an account and a loan at the Perryridge branch
select distinct customer_name
from borrower, loan
where borrower.loan_number = loan.loan_number and
branch_name = 'Perryridge' and
(branch_name, customer_name ) in
Set Comparison
Find all branches that have greater assets than some branch located in Brooklyn.
select distinct T.branch_name
from branch as T, branch as S
where T.assets > S.assets and
S.branch_city = 'Brooklyn'
select branch_name
from branch
where assets > some
(select assets
from branch
where branch_city = 'Brooklyn')
Example Query
Find the names of all branches that have greater assets than all branches located
in Brooklyn.
select branch_name
from branch
where assets > all
(select assets
from branch
where branch_city = 'Brooklyn')
The exists construct returns the value true if the argument subquery is nonempty.
exists r r
not exists r r =
Example Query
Find all customers who have an account at all branches located in Brooklyn.
select distinct S.customer_name
from depositor as S
where not exists (
(select branch_name
from branch
Note that X Y =
Note: Cannot write this query using = all and its variants
XY
The unique construct tests whether a subquery has any duplicate tuples in its
result.
Find all customers who have at most one account at the Perryridge branch.
select T.customer_name
from depositor as T
where unique (
select R.customer_name
from account, depositor as R
where T.customer_name = R.customer_name and
R.account_number = account.account_number and
account.branch_name = 'Perryridge')
Example Query
Find all customers who have at least two accounts at the Perryridge branch.
select distinct T.customer_name
from depositor as T
where not unique (
select R.customer_name
from account, depositor as R
where T.customer_name = R.customer_name and
R.account_number = account.account_number and
account.branch_name = 'Perryridge')
Derived Relations
With Clause
The with clause provides a way of defining a temporary view whose definition is
available only to the query in which the with clause occurs.
Find all accounts with the maximum balance
with max_balance (value) as
select max (balance)
from account
select account_number
from account, max_balance
where account.balance = max_balance.value
Find all branches where the total account deposit is greater than the average of the
total account deposits at all branches.
with branch_total (branch_name, value) as
select branch_name, sum (balance)
from account
group by branch_name
with branch_total_avg (value) as
select avg (value)
from branch_total
select branch_name
Views
In some cases, it is not desirable for all users to see the entire logical model (that
is, all the actual relations stored in the database.)
Consider a person who needs to know a customers name, loan number and
branch name, but has no need to see the loan amount. This person should see a
relation described, in SQL, by
(select customer_name, borrower.loan_number, branch_name
from borrower, loan
where borrower.loan_number = loan.loan_number )
A view provides a mechanism to hide certain data from the view of certain users.
Any relation that is not of the conceptual model but is made visible to a user as a
virtual relation is called a view.
View Definition
A view is defined using the create view statement which has the form
create view v as < query expression >
where <query expression> is any legal SQL expression. The view name is
represented by v.
Once a view is defined, the view name can be used to refer to the virtual relation
that the view generates.
When a view is created, the query expression is stored in the database; the
expression is substituted into queries using the view.
Example Queries
View Expansion
Example Query
Delete the record of all accounts with balances below the average at the bank.
delete from account
where balance < (select avg (balance )
from account )
Problem:
2.
Next, delete all tuples found above (without recomputing avg or
retesting the tuples)
Provide as a gift for all loan customers of the Perryridge branch, a $200 savings
account. Let the loan number serve as the account number for the new savings
account
insert into account
select loan_number, branch_name, 200
from loan
where branch_name = 'Perryridge'
insert into depositor
select customer_name, loan_number
from loan, borrower
where branch_name = 'Perryridge'
and loan.account_number = borrower.account_number
The select from where statement is evaluated fully before any of its results are
inserted into the relation (otherwise queries like
insert into table1 select * from table1
would cause problems)
Increase all accounts with balances over $10,000 by 6%, all other accounts
receive 5%.
Write two update statements:
update account
set balance = balance 1.06
where balance > 10000
update account
set balance = balance 1.05
where balance 10000
The order is important
Can be done better using the case statement (next slide)
Same query as before: Increase all accounts with balances over $10,000 by 6%,
all other accounts receive 5%.
update account
set balance = case
Update of a View
Create a view of all loan data in the loan relation, hiding the amount attribute
create view loan_branch as
select loan_number, branch_name
from loan
Add a new tuple to branch_loan
insert into branch_loan
values ('L-37, 'Perryridge)
This insertion must be represented by the insertion of the tuple
('L-37', 'Perryridge', null )
into the loan relation
Some updates through views are impossible to translate into updates on the
database relations
create view v as
select loan_number, branch_name, amount
from loan
where branch_name = Perryridge
insert into v values ( 'L-99','Downtown', '23')
Others cannot be translated uniquely
insert into all_customer values ('Perryridge', 'John')
Have to choose loan or account, and
create a new loan/account number!
Most SQL implementations allow updates only on simple views (without
aggregates) defined on a single relation
Joined Relations**
Join operations take two relations and return as a result another relation.
These additional operations are typically used as subquery expressions in the
from clause
Join condition defines which tuples in the two relations match, and what
attributes are present in the result of the join.
Join type defines how tuples in each relation that do not match any tuple in the
other relation (based on the join condition) are treated.
Relation loan
Relation borrower
Note: borrower information missing for L-260 and loan information missing
for L-155
Find all customers who have either an account or a loan (but not both) at the
bank.
select customer_name
from (depositor natural full outer join borrower )
where account_number is null or loan_number is null
2014
11/16/2014
Entity-Relationship Model
Entity Sets
Relationship Sets
Design Issues
Mapping Constraints
Keys
E-R Diagram
Extended E-R Features
Design of an E-R Database Schema
Reduction of an E-R Schema to Tables
Entity Sets
A database can be modeled as:
a collection of entities,
relationship among entities.
Attributes
An entity is represented by a set of attributes, that
attribute
Attribute types:
Composite Attributes
Relationship Sets
A relationship is an association among several
entities
Example:
Hayes depositor A-102
customer entity relationship set account entity
A relationship set is a mathematical relation
among n 2 entities, each taken from entity sets
{(e1, e2, en) | e1
E1, e2
E2, , en
En}
where (e1, e2, , en) is a relationship
Example:
(Hayes, A-102)
depositor
set.
For instance, the depositor relationship set
between entity sets customer and account may
have the attribute access-date
a relationship set.
Relationship sets that involve two entity sets are
binary (or degree two). Generally, most
relationship sets in a database system are binary.
Relationship sets may involve more than two
entity sets.
E.g. Suppose employees of a bank may
have jobs (responsibilities) at multiple
branches, with different jobs at different
branches. Then there is a ternary
Mapping Cardinalities
Express the number of entities to which another
One to one
One to many
Many to one
Many to many
E-R Diagrams
Roles
Entity sets of a relationship need not be distinct
The labels manager and worker are called roles; they specify how employee
entities interact via the works-for relationship set.
Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to
rectangles.
Role labels are optional, and are used to clarify semantics of the relationship
Cardinality Constraints
We express cardinality constraints by drawing
A customer is associated with at most one loan via the relationship borrower
A loan is associated with at most one customer via borrower
One-To-Many Relationship
In the one-to-many relationship a loan is
Many-To-One Relationships
In a many-to-one relationship a loan is associated
Many-To-Many Relationship
Keys
A super key of an entity set is a set of one or
Cardinality Constraints on
Ternary Relationship
We allow at most one arrow out of a ternary (or
Relationships
Some relationships that appear to be non-binary
Converting Non-Binary
Relationships to Binary Form
2. add (ei , ai ) to RA
4. add (ei , ci ) to RC
Design Issues
Use of entity sets vs. attributes
Specialization
Top-down design process; we designate
Specialization Example
Generalization
A bottom-up design process combine a number
used interchangeably.
Specialization and
Generalization (Contd.)
Can have multiple specializations of an entity set
Design Constraints on a
Specialization/Generalization
Constraint on which entities can be members of a
condition-defined
E.g. all customers over 65 years are members of senior-citizen entity
set; senior-citizen ISA person.
user-defined
Disjoint
Design Constraints on a
Specialization/Generalization
(Contd.)
Completeness constraint -- specifies whether or
an object.
Whether a real-world concept is best expressed
by an entity set or a relationship set.
The use of a ternary relationship versus a pair of
binary relationships.
The use of specialization/generalization
contributes to modularity in the design.
UML
UML: Unified Modeling Language
UML has many components to graphically model
Tables
A strong entity set reduces to a table with the
same attributes.
E.g. given entity set customer with composite attribute name with component
attributes first-name and last-name the table corresponding to the entity set
has two attributes
name.first-name and name.last-name
EM
E.g.,
Representing Relationship
Sets as Tables
Redundancy of Tables
Many-to-one and one-to-many relationship sets that are total on the manyside can be represented by adding an extra attribute to the many side,
containing the primary key of the one side
E.g.: Instead of creating a table for relationship account-branch, add an
attribute branch to the entity set account
That is, extra attribute can be added to either of the tables corresponding to
the two entity sets
E.g. The payment table already contains the information that would appear in
the loan-payment table (i.e., the columns loan-number and paymentnumber).
Representing Specialization
as Tables
Method 1:
Form a table for the higher level entity
Form a table for each lower level entity set,
include primary key of higher level entity set
and local attributes
table
table attributes
person
name, street, city
customer
name, credit-rating
employee
name, salary
Method 2:
Form a table for each entity set with all local
and inherited attributes
table
table attributes
person
name, street, city
customer
name, street, city , credit-rating
employee
name, street, city salary
Relations Corresponding to
Aggregation
2014
11/16/2014
Advanced SQL
User-Defined Types
Types and domains are similar. Domains can have constraints, such as not null,
specified on them.
Domain Constraints
Domain constraints are the most elementary form of integrity constraint. They
test values inserted in the database, and test queries to ensure that the
comparisons make sense.
New domains can be created from existing data types
Example:
create domain Dollars numeric(12, 2)
create domain Pounds numeric(12,2)
We cannot assign or compare a value of type Dollars to a value of type Pounds.
However, we can convert type as below
(cast r.A as Pounds)
(Should also multiply by the dollar-to-pound conversion-rate)
Large-Object Types
Large objects (photos, videos, CAD files, etc.) are stored as a large object:
blob: binary large object -- object is a large collection of uninterpreted binary
data (whose interpretation is left to an application outside of the database
system)
clob: character large object -- object is a large collection of character data
When a query returns a large object, a pointer is returned rather than the
large object itself.
Bfile
Nclob
Integrity Constraints
Constraints on a Single
Relation
not null
primary key
unique
check (P ), where P is a predicate
Referential Integrity
Ensures that a value that appears in one relation for a given set of attributes also
appears for a certain set of attributes in another relation.
Example: If Perryridge is a branch name appearing in one of the tuples in
the account relation, then there exists a tuple in the branch relation for
branch Perryridge.
Primary and candidate keys and foreign keys can be specified as part of the SQL
create table statement:
The primary key clause lists attributes that comprise the primary key.
The unique key clause lists attributes that comprise a candidate key.
The foreign key clause lists the attributes that comprise the foreign key and
the name of the relation referenced by the foreign key. By default, a foreign
key references the primary key attributes of the referenced table.
Assertions
Assertion Example
Every loan has at least one borrower who maintains an account with a minimum
balance or $1000.00
create assertion balance_constraint check
(not exists (
select *
from loan
where not exists (
select *
from borrower, depositor, account
where loan.loan_number = borrower.loan_number
and borrower.customer_name =
depositor.customer_name
and depositor.account_number =
account.account_number
and account.balance >= 1000)))
The sum of all loan amounts for each branch must be less than the sum of all
account balances at the branch.
create assertion sum_constraint check
(not exists (select *
from branch
where (select sum(amount )
from loan
where loan.branch_name =
branch.branch_name )
>= (select sum (amount )
from account
where loan.branch_name =
branch.branch_name )))
Authorization
Forms of authorization on parts of the database:
Read - allows reading, but not modification of data.
Insert - allows insertion of new data, but not modification of existing data.
Update - allows modification, but not deletion of data.
Delete - allows deletion of data.
Forms of authorization to modify the database schema
Index - allows creation and deletion of indices.
Resources - allows creation of new relations.
Alteration - allows addition or deletion of attributes in a relation.
Drop - allows deletion of relations.
Authorization Specification in
SQL
Privileges in SQL
select: allows read access to relation,or the ability to query using the view
Example: grant users U1, U2, and U3 select authorization on the branch
relation:
grant select on branch to U1, U2, U3
insert: the ability to insert tuples
update: the ability to update using the SQL update statement
delete: the ability to delete tuples.
all privileges: used as a short form for all the allowable privileges
more in Chapter 8
Revoking Authorization in
SQL
explicitly.
If the same privilege was granted twice to the same user by different grantees, the
user may retain the privilege after the revocation.
All privileges that depend on the privilege being revoked are also revoked.
Embedded SQL
Example Query
Note: above details vary with language. For example, the Java embedding defines
Java iterators to step through result tuples.
Dynamic SQL
JDBC
JDBC is a Java API for communicating with database systems supporting SQL
JDBC supports a variety of features for querying and updating data, and for
retrieving query results
JDBC also supports metadata retrieval, such as querying about relations present in
the database and the names and types of relation attributes
Model for communicating with the database:
Open a connection
Create a statement object
Execute queries using the Statement object to send queries and fetch results
JDBC Code
public static void JDBCexample(String dbid, String
userid, String passwd)
{
try {
Class.forName ("oracle.jdbc.driver.OracleDriver");
Connection conn = DriverManager.getConnection(
"jdbc:oracle:thin:@aura.belllabs.com:2000:bankdb", userid, passwd);
Statement stmt = conn.createStatement();
Do Actual Work .
stmt.close();
conn.close();
}
catch (SQLException sqle) {
System.out.println("SQLException : " + sqle);
}
}
Update to database
try {
stmt.executeUpdate( "insert into account values
('A-9732', 'Perryridge', 1200)");
} catch (SQLException sqle) {
System.out.println("Could not insert tuple. " + sqle);
}
rset.getFloat(2));
}
SQL:1999
SQL Functions
Define a function that, given the name of a customer, returns the count of the
number of accounts owned by the customer.
create function account_count (customer_name varchar(20))
returns integer
begin
declare a_count integer;
select count (* ) into a_count
from depositor
where depositor.customer_name = customer_name
return a_count;
end
Find the name and address of each customer that has more than one account.
select customer_name, customer_street, customer_city
from customer
where account_count (customer_name ) > 1
Table Functions
Usage
select *
from table (accounts_of (Smith))
SQL Procedures
Procedural Constructs
set n = n 1
For loop
Permits iteration over all results of a query
Example: find total of all balances at the Perryridge branch
declare n integer default 0;
for r as
select balance from account
where branch_name = Perryridge
do
set n = n + r.balance
end for
.. signal out-of-stock
end
The handler here is exit -- causes enclosing begin..end to be exited
Other actions possible on exception
External Language
Functions/Procedures
SQL:1999 permits the use of functions and procedures written in other languages
such as C or C++
Declaring external language procedures and functions
create procedure account_count_proc(in customer_name varchar(20),
out count integer)
language C
external name /usr/avi/bin/account_count_proc
create function account_count(customer_name varchar(20))
returns integer
language C
external name /usr/avi/bin/author_count
Recursion in SQL
Example of Fixed-Point
Computation
2014
11/16/2014
Combine Schemas?
A Lossy Decomposition
Functional Dependencies
Require that the value for a certain set of attributes determines uniquely the value
for another set of attributes.
A functional dependency is a generalization of the notion of a key.
Let R be a relation schema
R and
R
The functional dependency
holds on R if and only if for any legal relations r(R), whenever any two tuples t1 and
t2 of r agree on the attributes , they also agree on the attributes . That is,
t1[ ] = t2 [ ]
t1[ ] = t2 [ ]
Example: Consider r(A,B ) with the following instance of r.
Use of Functional
Dependencies
that r satisfies F.
specify constraints on the set of legal relations
We say that F holds on R if all legal relations on R satisfy the set of
functional dependencies F.
Note: A specific instance of a relation schema may satisfy a functional
dependency even if the functional dependency does not hold on all legal instances.
For example, a specific instance of loan may, by chance, satisfy
amount customer_name.
A functional dependency is trivial if it is satisfied by all instances of a relation
Example:
customer_name, loan_number customer_name
customer_name customer_name
In general, is trivial if
is trivial (i.e.,
is a superkey for R
We decompose R into:
(U
))
= loan_number
= amount
(U ) = ( loan_number, amount )
( R - ( - ) ) = ( customer_id, loan_number )
Goals of Normalization
such that (c, t, b) classes means that t is qualified to teach c, and b is a required
textbook for c
The database is supposed to list for each course the set of teachers any one of
which can be the courses instructor, and the set of books, all of which are required
for the course (no matter who teaches it).
BCNF
Insertion anomalies i.e., if Marilyn is a new teacher that can teach database, two
tuples need to be inserted
(database, Marilyn, DB Concepts)
(database, Marilyn, Ullman)
Therefore, it is better to decompose classes into:
Functional-Dependency
Theory
We now consider the formal theory that tells us which functional dependencies are
implied logically by a given set of functional dependencies.
We then develop algorithms to generate lossless decompositions into BCNF and
3NF
We then develop algorithms to test if a decomposition is dependency-preserving
Given a set F set of functional dependencies, there are certain other functional
dependencies that are logically implied by F.
For example: If A B and B C, then we can infer that A C
The set of all functional dependencies logically implied by F is the closure of F.
We denote the closure of F by F+.
We can find all of F+ by applying Armstrongs Axioms:
if
, then
(reflexivity)
if
, then
(augmentation)
if
, and , then
(transitivity)
These rules are
sound (generate only functional dependencies that actually hold) and
complete (generate all functional dependencies that hold).
Example
R = (A, B, C, G, H, I)
F={ AB
AC
CG H
CG I
B H}
some members of F+
AH
by transitivity from A B and B H
AG I
by augmenting A C with G, to get AG CG
and then transitivity with CG I
CG HI
by augmenting CG I to infer CG CGI,
and augmenting of CG H to infer CGI HI,
and then transitivity
F =F
repeat
+
+
R = (A, B, C, G, H, I)
F = {A B
AC
CG H
CG I
B H}
(AG)+
1. result = AG
2. result = ABCG (A C and A B)
3. result = ABCGH
(CG H and CG AGBC)
4. result = ABCGHI
(CG I and CG AGBCH)
Is AG a candidate key?
Is AG a super key?
Does AG R? == Is (AG)+ R
Is any subset of AG a superkey?
Does A R? == Is (A)+ R
Does G R? == Is (G)+ R
attributes of R.
Testing functional dependencies
To check if a functional dependency
F ), just check if
For each
, we output a
Canonical Cover
Extraneous Attributes
Testing if an Attribute is
Extraneous
To test if attribute A
; if it does, A is extraneous in
is extraneous in
compute
check that
Canonical Cover
R = (A, B, C)
F = {A BC
BC
AB
AB C}
Combine A BC and A B into A BC
Set is now {A BC, B C, AB C}
A is extraneous in AB C
Check if the result of deleting A from AB C is implied
by the other dependencies
Yes: in fact, B C is already present!
Set is now {A BC, B C}
C is extraneous in A BC
Check if A C is logically implied by A B and the
other dependencies
Yes: using transitivity on A B and B C.
Can use attribute closure of A in more complex
cases
The canonical cover is:
AB
BC
Lossless-join Decomposition
For the case of R = (R1, R2), we require that for all possible relations r on schema
R
r = R1 (r )
R2 (r )
A decomposition of R into R1 and R2 is lossless join if and only if at least one of the
following dependencies is in F :
R1 R2 R1
R1 R2 R2
Example
R = (A, B, C)
F = {A B, B C)
Can be decomposed in two different ways
R1 = (A, B), R2 = (B, C)
Lossless-join decomposition:
R1 R2 = {B} and B BC
Dependency preserving
R1 = (A, B), R2 = (A, C)
Lossless-join decomposition:
R1 R2 = {A} and A AB
R2)
Dependency Preservation
(F1 F2 Fn ) = F
If it is not, then checking updates for violation of functional
dependencies may require computing joins, which is expensive.
To check if a dependency
, Rn we apply the following test (with attribute closure done with respect to F)
result =
while (changes to result) do
for each Ri in the decomposition
t = (result Ri)+ Ri
result = result t
If result contains all attributes in , then the functional dependency
is preserved.
We apply the test on all dependencies in F to check if a decomposition is
dependency preserving
This procedure takes polynomial time, instead of the exponential time required to
compute F+ and (F1 F2
Fn)
Example
R = (A, B, C )
F = {A B
B C}
Key = {A}
R is not in BCNF
Design Goals
tables.
R could have been a single relation containing all attributes that are of
interest (called universal relation).
Normalization breaks R into smaller relations.
R could have been the result of some ad hoc design of relations, which we
then test/convert to normal form.
When an E-R diagram is carefully designed, identifying all entities correctly, the
tables generated from the E-R diagram should not need further normalization.
However, in a real (imperfect) design, there can be functional dependencies from
non-key attributes of an entity to other attributes of the entity
Example: an employee entity with attributes department_number and
department_address, and a functional dependency department_number
department_address
Good design would have made department an entity
Functional dependencies from non-key attributes of a relationship set possible, but
rare --- most relationships are binary
Denormalization for
Performance
Temporal data have an association time interval during which the data are valid.
A snapshot is the value of the data at a particular point in time
Several proposals to extend ER model by adding valid time to
attributes, e.g. address of a customer at different points in time
entities, e.g. time duration when an account exists
relationships, e.g. time during which a customer owned an account
But no accepted standard
Adding a temporal component results in functional dependencies like
customer_id customer_street, customer_city
not to hold, because the address varies over time
A temporal functional dependency X Y holds on schema R if the functional
dependency X Y holds on all snapshots for all legal instances r (R )
In practice, database designers may add start and end time attributes to relations
E.g. course(course_id, course_title)
course(course_id, course_title, start, end)
Constraint: no two tuples can have overlapping valid times
2014
11/16/2014
Triggers
Trigger Example
Suppose that instead of allowing negative account balances, the bank deals with
overdrafts by
setting the account balance to zero
creating a loan in the amount of the overdraft
giving this loan a loan number identical to the account number of the
overdrawn account
The condition for executing the trigger is an update to the account relation that
results in a negative balance value.
Instead of executing a separate action for each affected row, a single action can be
executed for all rows affected by a transaction
Use
for each statement
instead of
for each row
Use
referencing old table or referencing new table to refer to
temporary tables (called transition tables) containing the affected rows
Can be more efficient when dealing with SQL statements that update a large
number of rows
separate table
Have an external process that repeatedly scans the table,
carries out external-world actions and deletes action from
table
E.g. Suppose a warehouse has the following tables
inventory (item, level ): How much of each item is in the
warehouse
minlevel (item, level ) :
What is the minimum desired
level of each item
reorder (item, amount ): What quantity should we reorder at a time
orders (item, amount ) : Orders to be placed (read by
external process)
Triggers in MS-SQLServer
Syntax
create trigger overdraft-trigger on account
for update
as
if inserted.balance < 0
begin
insert into borrower
(select customer-name,account-number
from depositor, inserted
where inserted.account-number =
depositor.account-number)
insert into loan values
(inserted.account-number, inserted.branch-name,
inserted.balance)
update account set balance = 0
from account, inserted
where account.account-number = inserted.account-number
end
Authorization in SQL
Forms of authorization on parts of the database:
Read authorization - allows reading, but not modification of data.
Insert authorization - allows insertion of new data, but not modification of existing
data.
Update authorization - allows modification, but not deletion of data.
Delete authorization - allows deletion of data
Forms of authorization to modify the database schema:
Index authorization - allows creation and deletion of indices.
Resources authorization - allows creation of new relations.
Users can be given authorization on views, without being given any authorization
on the relations used in the view definition
Ability of views to hide data serves both to simplify usage of the system and to
enhance security by allowing users access only to data they need for their job
A combination or relational-level security and view-level security can be used to
limit a users access to precisely the data that user needs.
View Example
Suppose a bank clerk needs to know the names of the customers of each branch,
but is not authorized to see specific loan information.
Approach: Deny direct access to the loan relation, but grant access to the
view cust-loan, which consists only of the names of customers and the
branches at which they have a loan.
The cust-loan view is defined in SQL as follows:
create view cust-loan as
select branchname, customer-name
from borrower, loan
where borrower.loan-number = loan.loan-number
Authorization on Views
Creation of view does not require resources authorization since no real relation is
being created
The creator of a view gets only those privileges that provide no additional
Granting of Privileges
Privileges in SQL
select: allows read access to relation,or the ability to query using the view
Example: grant users U1, U2, and U3 select authorization on the branch
relation:
grant select on branch to U1, U2, U3
insert: the ability to insert tuples
update: the ability to update using the SQL update statement
delete: the ability to delete tuples.
references: ability to declare foreign keys when creating relations.
usage: In SQL-92; authorizes a user to use a specified domain
all privileges: used as a short form for all the allowable privileges
with grant option: allows a user who is granted a privilege to pass the privilege on
to other users.
Example:
grant select on branch to U1 with grant option
gives U1 the select privileges on branch and allows U1 to grant this
privilege to others
Roles
Roles permit common privileges for a class of users can be specified just once by
creating a corresponding role
Privileges can be granted to or revoked from roles, just like user
Roles can be assigned to users, and even to other roles
SQL:1999 supports roles
create role teller
create role manager
grant select on branch to teller
grant update (balance) on account to teller
grant all privileges on account to manager
grant teller to manager
grant teller to alice, bob
grant manager to avi
Revoking Authorization in
SQL
Limitations of SQL
Authorization
End users don't have database user ids, they are all
mapped to the same database user id
All end-users of an application (such as a web application)
may be mapped to a single database user
The task of authorization in above cases falls on the
application program, with no support from SQL
Benefit: fine grained authorizations, such as to individual
tuples, can be implemented by the application.
Drawback: Authorization must be done in application
code, and may be dispersed all over an application
Checking for absence of authorization loopholes
becomes very difficult since it requires reading large
amounts of application code
Audit Trails
Application Security
Encryption (Cont.)
Authentication
Digital Certificates
imposter?
Solution: use the public key of the web site
Problem: how to verify if the public key itself is genuine?
Solution:
Every client (e.g. browser) has public keys of a few rootlevel certification authorities
A site can get its name/URL and public key signed by a
certification authority: signed document is called a
certificate
Client can use public key of certification authority to verify
certificate
Multiple levels of certification authorities can exist. Each
certification authority
presents its own public-key certificate signed by a
higher level authority, and
Uses its private key to sign the certificate of other
web sites/authorities
2014
11/16/2014
Object-Based Databases
Extend the relational data model by including object orientation and constructs to
deal with added data types.
Allow attributes of tuples to have complex types, including non-atomic values such
as nested relations.
Preserve relational foundations, in particular the declarative access to data, while
extending modeling power.
Upward compatibility with existing relational languages.
Motivation:
Permit non-atomic domains (atomic indivisible)
Example of non-atomic domain: set of integers,or set of tuples
Allows more intuitive modeling for applications with complex data
Intuitive definition:
allow relations whenever we allow atomic (scalar) values relations within
relations
Retains mathematical foundation of relational model
Violates first normal form.
title,
a set of authors,
Publisher, and
a set of keywords
Non-1NF relation books
Complex Types
Structured types
Inheritance
Object orientation
Methods
Type Inheritance
Type Inheritance
SQL:99 does not support multiple inheritance
As in most other languages, a value of a
Table Inheritance
Array construction
array [Silberschatz,`Korth,`Sudarshan]
Multisets
multiset [`parsing,`analysis ])
Querying Collection-Valued
Attributes
Unnesting
The transformation of a nested relation into a
K.keyword
from books as B, unnest(B.author_array ) as
A (author ),
unnest (B.keyword_set ) as K (keyword )
Define a type Department with a field name and a field head which is a reference
to the type Person, with table people as scope:
create type Department (
name varchar (20),
head ref (Person) scope people)
We can then create a table departments as follows
create table departments of Department
We can omit the declaration scope people from the type declaration and instead
make an addition to the create table statement:
create table departments of Department
(head with options scope people)
Initializing Reference-Typed
Values
To create a tuple with a reference value, we can first create the tuple with a null
reference and then set the reference separately:
insert into departments
values (`CS, null)
update departments
set head = (select p.person_id
from people as p
where name = `John)
where name = `CS
The type of the object-identifier must be specified as part of the type definition of
the referenced table, and
The table definition must specify that the reference is user generated
create type Person
(name varchar(20)
address varchar(20))
ref using varchar(20)
create table people of Person
ref is person_id user generated
When creating a tuple, we must provide a unique value for the identifier:
We can then use the identifier value when inserting a tuple into departments
Avoids need for a separate query to retrieve the identifier:
insert into departments
values(`CS, `02184567)
Path Expressions
Persistent Programming
Languages
Relational systems
simple data types, powerful query languages, high protection.
Persistent-programming-language-based OODBs
complex data types, integration with programming language, high
performance.
Object-relational systems
complex data types, powerful query languages, high protection.
Note: Many real systems blur these boundaries
E.g. persistent programming language built as a wrapper on a relational
database offers first two benefits, but may have poor performance.
2014
11/16/2014
Classification of Physical
Storage Media
backed
Cache fastest and most costly form of storage; volatile; managed by the
computer system hardware.
Main memory:
Flash memory
Data survives power failure
Data can be written at a location only once, but location can be erased and
written to again
Can support only a limited number (10K 1M) of write/erase cycles.
Erasing of memory has to be done to an entire bank of memory
Reads are roughly as fast as main memory
But writes are slow (few microseconds), erase is slower
Cost per unit of storage roughly similar to main memory
Widely used in embedded devices such as digital cameras
Is a type of EEPROM (Electrically Erasable Programmable Read-Only
Memory)
Magnetic-disk
Optical storage
non-volatile, data is read optically from a spinning disk using a laser
CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms
Write-one, read-many (WORM) optical disks used for archival storage (CDR, DVD-R, DVD+R)
Multiple write versions also available (CD-RW, DVD-RW, DVD+RW, and
DVD-RAM)
Reads and writes are slower than with magnetic disk
Juke-box systems, with large numbers of removable disks, a few drives, and
12
bytes)
Storage Hierarchy
Disk Subsystem
Performance Measures of
Disks
RAID
much higher than the chance that a specific single disk will
fail.
E.g., a system with 100 disks, each with MTTF of
100,000 hours (approx. 11 years), will have a system
MTTF of 1000 hours (approx. 41 days)
Techniques for using redundancy to avoid data loss are
critical with large numbers of disks
Originally a cost-effective alternative to large, expensive disks
I in RAID originally stood for ``inexpensive
Today RAIDs are used for their higher reliability and
bandwidth.
The I is interpreted as independent
RAID Levels
RAID Level 0:
sequentially
Parity block becomes a bottleneck for independent block
writes since every block write also writes to parity disk
RAID Level 5: Block-Interleaved Distributed Parity;
partitions data and parity among all N + 1 disks, rather than
storing data in N disks and parity in 1 disk.
E.g., with 5 disks, parity block for nth set of blocks is
stored on disk (n mod 5) + 1, with the data blocks stored
on the other 4 disks.
Block writes occur in parallel if the blocks and their parity blocks are on
different disks.
Subsumes Level 4: provides same benefits, but avoids bottleneck of parity
disk.
RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores extra
redundant information to guard against multiple disk failures.
Better reliability than Level 5 at a higher cost; not used as widely.
3 in 10 years)
I/O requirements have increased greatly, e.g. for Web
servers
When enough disks have been bought to satisfy required
rate of I/O, they often have spare storage capacity
so there is often no extra monetary cost for Level 1!
Level 5 is preferred for applications with low update rate,
and large amounts of data
Level 1 is preferred for all other applications
Hardware Issues
Storage Access
A database file is partitioned into fixed-length storage units called blocks. Blocks
are units of both storage allocation and data transfer.
Database system seeks to minimize the number of block transfers between the
disk and memory. We can reduce the number of disk accesses by keeping as
many blocks as possible in main memory.
Buffer portion of main memory available to store copies of disk blocks.
Buffer manager subsystem responsible for allocating buffer space in main
memory.
Buffer Manager
Programs call on the buffer manager when they need a block from disk.
If the block is already in the buffer, buffer manager returns the address of
the block in main memory
If the block is not in the buffer, the buffer manager
Allocates space in the buffer for the block
Buffer-Replacement Policies
Most operating systems replace the block least recently used (LRU strategy)
Idea behind LRU use past pattern of block references as a predictor of future
references
Queries have well-defined access patterns (such as sequential scans), and a
database system can use the information in a users query to predict future
references
LRU can be a bad strategy for certain access patterns involving repeated
scans of data
For example: when computing the join of 2 relations r and s by a nested
loops
for each tuple tr of r do
for each tuple ts of s do
if the tuples tr and ts match
Mixed strategy with hints on replacement strategy provided
by the query optimizer is preferable
Pinned block memory block that is not allowed to be written back to disk.
Toss-immediate strategy frees the space occupied by a block as soon as the
final tuple of that block has been processed
Most recently used (MRU) strategy system must pin the block currently being
processed. After the final tuple of that block has been processed, the block is
unpinned, and it becomes the most recently used block.
Buffer manager can use statistical information regarding the probability that a
request will reference a particular relation
E.g., the data dictionary is frequently accessed. Heuristic: keep datadictionary blocks in main memory buffer
Buffer managers also support forced output of blocks for the purpose of recovery
File Organization
Fixed-Length Records
Simple approach:
Store record i starting from byte n (i 1), where n is the size of each
record.
Record access is simple but records may cross blocks
Modification: do not allow records to cross block boundaries
Disadvantage: 1) Deletion of records
2) Boundary crossing
Deletion of record i:
alternatives:
move records i + 1, . . ., n
to i, . . . , n 1
move record n to i
do not move records, but link all free records on a free list
Free Lists
Store the address of the first deleted record in the file header.
Use this first record to store the address of the second deleted record, and so on
Can think of these stored addresses as pointers since they point to the location of
a record.
More space efficient representation: reuse space for normal attributes of free
records to store pointers. (No pointers stored in in-use records.)
Variable-Length Records
If a record is deleted, the space that it occupies is freed, and its entry is set to
deleted.
The level of indirection allows records to be moved to prevent fragmentation of
space inside a block
Storing large objects (blob, clob) requires special arrangements.
Organization of Records in
Files
Heap a record can be placed anywhere in the file where there is space
Sequential store records in sequential order, based on the value of the search
key of each record
Hashing a hash function computed on some attribute of each record; the result
specifies in which block of the file the record should be placed
Records of each relation may be stored in a separate file. In a multitable
clustering file organization records of several different relations can be stored
in the same file
Motivation: store related records on the same block to minimize I/O
Suitable for applications that require sequential processing of the entire file
The records in the file are ordered by a search-key
Read in sorted order
Difficulty in insertion and deletion
2014
11/16/2014
Basic Concepts
Ordered Indices
B+-Tree Index Files
B-Tree Index Files
Static Hashing
Dynamic Hashing
Comparison of Ordered Indexing and Hashing
Index Definition in SQL
Multiple-Key Access
Basic Concepts
search-key pointer
Index files are typically much smaller than the original file
Two basic kinds of indices:
Ordered indices: search keys are stored in sorted order
Hash indices: search keys are distributed uniformly across buckets using
a hash function.
Ordered Indices
In an ordered index, index entries are stored sorted on the search key value.
Dense index Index record appears for every search-key value in the file.
Sparse Index: contains index records for only some search-key values.
Applicable when records are sequentially ordered on search-key
To locate a record with search-key value K we:
Find index record with largest search-key value < K
Search file sequentially starting at the record to which the index record points
Multilevel Index
If primary index does not fit in memory, access
becomes expensive.
Solution: treat primary index kept on disk as a
If deleted record was the only record in the file with its particular search-key value,
the search-key is deleted from the index also.
Single-level index deletion:
Dense indices deletion of search-key: similar to file record deletion.
Sparse indices
if an entry for the search key exists in the index, it is deleted by
replacing the entry in the index with the next search-key value in the file
(in search-key order).
If the next search-key value already has an index entry, the entry is
deleted instead of being replaced.
Secondary Indices
Frequently, one wants to find all the records whose values in a certain field (which
is not the search-key of the primary index) satisfy some condition.
Example 1: In the account relation stored sequentially by account number,
we may want to find all accounts in a particular branch
Example 2: as above, but where we want to find all accounts with a specified
balance or range of balances
We can have a secondary index with an index record for each search-key value
Typical node
For i = 1, 2, . . ., n1, pointer Pi either points to a file record with search-key value
Ki, or to a bucket of pointers to file records, each record having search-key value
Ki. Only need bucket structure if search-key does not form a primary key.
If Li, Lj are leaf nodes and i < j, Lis search-key values are less than Ljs search-key
values
Pn points to next leaf node in search-key order
Non leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf
node with m pointers:
All the search-keys in the subtree to which P1 points are less than K1
For 2 i n 1, all the search-keys in the subtree to which Pi points have
Example of a B+-tree
Example of B+-tree
and n
Since the inter-node connections are done by pointers, logically close blocks
need not be physically close.
The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
The B+-tree contains a relatively small number of levels
Level below root has at least 2* n/2 values
Next level has at least 2* n/2 * n/2 values
.. etc.
If there are K search-key values in the file, the tree height is no more than
logn/2 (K)
thus searches can be conducted efficiently.
Insertions and deletions to the main file can be handled efficiently, as the index can
be restructured in logarithmic time (as we shall see).
Queries on B+-Trees
If there are K search-key values in the file, the height of the tree is no more than
logn/2 (K) .
A node is generally the same size as a disk block, typically 4 kilobytes
and n is typically around 100 (40 bytes per index entry).
With 1 million search key values and n = 100
at most log50(1,000,000) = 4 nodes are accessed in a lookup.
Contrast this with a balanced binary tree with 1 million search key values
around 20 nodes are accessed in a lookup
above difference is significant since every node access may need a disk I/O,
costing around 20 milliseconds
Updates on B+-Trees:
Insertion
would appear
If the search-key value is already present in the
leaf node
Add record to the file
If necessary add a pointer to the bucket.
If the search-key value is not present, then
add the record to the main file (and create a
bucket if necessary)
If there is room in the leaf node, insert (keyvalue, pointer) pair in the leaf node
Otherwise, split the node (along with the new
(key-value, pointer) entry) as discussed in the
next slide.
Splitting a non-leaf node: when inserting (k,p) into an already full internal node N
Copy N to an in-memory area M with space for n+1 pointers and n keys
Insert (k,p) into M
Copy P1,K1, , K n/2-1,P n/2 from M back into node N
Copy Pn/2+1,K n/2+1,,Kn,Pn+1 from M into newly allocated node N
Insert (K n/2,N) into parent N
Read pseudocode in book!
Find the record to be deleted, and remove it from the main file and from the bucket
(if present)
Remove (search-key value, pointer) from the leaf node if there is no bucket or if the
Updates on B+-Trees:
Deletion
Otherwise, if the node has too few entries due to the removal, but the entries in the
node and a sibling do not fit into a single node, then redistribute pointers:
Redistribute the pointers between the node and a sibling such that both have
more than the minimum number of entries.
Update the corresponding search-key value in the parent of the node.
The node deletions may cascade upwards till a node which has n/2 or more
pointers is found.
If the root node has only one pointer after deletion, it is deleted and the sole child
becomes the root.
Good space utilization important since records use more space than pointers.
To improve space utilization, involve more sibling nodes in redistribution during
splits and merges
Involving 2 siblings in redistribution (to avoid split / merge where possible)
results in each node having at least
entries
Multiple-Key Access
1.
2.
3.
Composite search keys are search keys containing more than one attribute
E.g. (branch_name, balance)
Lexicographic ordering: (a1, a2) < (b1, b2) if either
a1 < b1, or
a1=b1 and a2 < b2
Static Hashing
Static Hashing
To insert a record with search key Ki, compute h(Ki), which gives the address of
that bucket
For lookup, follow the same
Deletion is equally straightforward
Hashing can be used for two different purposes:
Hash file organization
Hash index organization
Hash Functions
Worst hash function maps all search-key values to the same bucket; this makes
access time proportional to the number of search-key values in the file.
An ideal hash function is uniform, i.e., each bucket is assigned the same number
of search-key values from the set of all possible values.
Ideal hash function is random, so each bucket will have the same number of
records assigned to it irrespective of the actual distribution of search-key values in
the file.
Typical hash functions perform computation on the internal binary representation of
the search-key.
For example, for a string search-key, the binary representations of all the
characters in the string could be added and the sum modulo the number of
buckets could be returned. .
Overflow chaining the overflow buckets of a given bucket are chained together in
a linked list.
Above scheme is called closed hashing.
An alternative, called open hashing, which does not use overflow buckets,
is not suitable for database applications.
open hashing, uses linear probing i.e., insert records to some other buckets
following the current bucket.
Other policies : compute hash function again
Hash Indices
Hashing can be used not only for file organization, but also for index-structure
creation.
A hash index organizes the search keys, with their associated record pointers, into
a hash file structure.
Strictly speaking, hash indices are always secondary indices
if the file itself is organized using hashing, a separate primary hash index on
it using the same search-key is unnecessary.
However, we use the term hash index to refer to both secondary index
structures and hash organized files.
Dynamic Hashing
i.
shrinks.
Multiple entries in the bucket address table may point to a bucket (why?)
Structure: Example
Hash structure after insertion of one Brighton and two Downtown records
Bitmap Indices
Bitmap indices are a special type of index designed for efficient querying on
multiple keys
Records in a relation are assumed to be numbered sequentially from, say, 0
Given a number n it must be easy to retrieve record n
Particularly easy if records are of fixed size
Applicable on attributes that take on a relatively small number of distinct values
E.g. gender, country, state,
E.g. income-level (income broken up into a small number of levels such as
0-9999, 10000-19999, 20000-50000, 50000- infinity)
A bitmap is simply an array of bits
In its simplest form a bitmap index on an attribute has a bitmap for each value of
the attribute
Bitmap has as many bits as records
In a bitmap for value v, the bit for a record is 1 if the record has the value v
for the attribute, and is 0 otherwise
Bitmaps are packed into words; a single word and (a basic CPU instruction)
computes and of 32 or 64 bits at once
E.g. 1-million-bit maps can be and-ed with just 31,250 instruction
Counting number of 1s can be done fast by a trick:
Use each byte to index into a pre-computed array of 256 elements each