Introduction
Database: An integrated collection of data, usually stored on
secondary storage, pertaining to some organization
Database management system (DBMS): A collection of software
programs that controls the database and allows individuals
to access (read/write) the data contained therein. Example: a
university database might contain information about entities such
as students, faculty, courses, and classrooms, together with
information relating the entities to each other: students enrolled
in courses, faculty teaching courses, courses using rooms.
Database system: The database and the DBMS together.
Before DBMS:
After DBMS:
Why a DBMS? (or Problems with the "file-processing" approach)
1. Reduce application development time
– DBMS handles common functionality (e.g., all students
with GPA > 3.0)
– Ideally, the DBMS interface is "what" oriented (instead
of "how")
– Provides many tools to assist users
2. Resolve data redundancy and inconsistency (data sharing):
data is better utilized (discovered and reused), redundancy
of data is minimized
3. Data independence: In the file-processing approach, application
programs depend on data representation and storage details
– Different file formats
– Different programming languages
A database system, however, contains not only the database
itself but also a complete definition or description of the
database, termed metadata. The DBMS software refers
to metadata to know the structure of a specific database.
4. Data integrity: one may enforce consistency constraints on
data, e.g., number of seats sold ≤ number of seats on the
plane × 1.1
5. Centralized control: DBA tunes the database to balance
users' needs
6. Security: mechanisms to prevent unauthorized access. These
mechanisms are based on content instead of the file-oriented
approach.
7. Concurrency control: avoids undesirable race conditions that
arise with simultaneous access/updates to data
8. Crash recovery: ensures the integrity of data in the presence
of failures
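A minimal sketch of item 4 (data integrity), using Python's built-in sqlite3 module. It reads the seat example as "seats sold may not exceed 110% of the seats on the plane", which is an interpretation; the table and column names (flight, capacity, seats_sold) and the helper sell are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the CHECK clause is the consistency constraint.
conn.execute("""
    CREATE TABLE flight (
        flight_no  TEXT PRIMARY KEY,
        capacity   INTEGER NOT NULL,
        seats_sold INTEGER NOT NULL,
        CHECK (seats_sold <= capacity * 1.1)
    )
""")

def sell(flight_no, capacity, seats_sold):
    """Try to record a flight; return False if the DBMS rejects the row."""
    try:
        conn.execute("INSERT INTO flight VALUES (?, ?, ?)",
                     (flight_no, capacity, seats_sold))
        return True
    except sqlite3.IntegrityError:
        return False

ok = sell("US101", 100, 110)        # within the 10% overbooking margin
rejected = sell("US102", 100, 111)  # violates the constraint
```

The point of the sketch is that the rule lives in the database definition, so every application that writes to the table is held to it.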
Data model: A set of concepts to describe the structure of a
database. That is, a specification tool (class of data structures)
for describing data, including entities, relationships, semantics,
constraints, etc. Purpose: to capture information about real
world data.
– High-level or conceptual data model: Specify the overall
logical structure of the database (closer to human perception).
They describe what data is stored in the database.
Examples are:
1. Object-based data models: these models are closer to
human perception and farther from system perception.
Examples are: 1) Entity-Relationship, and 2) Object-
Oriented models.
2. Record-based data models: concepts that may be under-
stood by end users but not too far from the way data is
organized within the computer. Examples are: 1) Rela-
tional, 2) Network, and 3) Hierarchical models.
– Low-level or physical data models: describe data at the
lowest level (closer to system perception). They describe how
data is stored in the database (e.g., record formats and
orderings). Examples are: 1) Unifying and 2) Frame memory
models.
Instance of the database: the collection of information stored
in the database at a particular moment in time (changes fre-
quently).
Database schema: the overall design of the database (changes
infrequently). By analogy with a program, the schema corresponds
to a variable's declaration and the instance to its current content.
Database management systems architecture:
(Note: The only data that actually exist is at the physical level.)
Data Independence:
1. Physical data independence: Modify the physical scheme
(data structures, e.g., B-tree or hash index) without causing
application programs to be rewritten. These modifications
may be needed to enhance performance or to accommodate new
software releases. Most relational vendors support this kind of
data independence.
2. Logical data independence: Modify the conceptual scheme
(e.g., add a new attribute to a table, rename an attribute)
without causing application programs to be rewritten. This
kind of data independence is harder to achieve.
There are several languages associated with a database:
1. Data Definition Language (DDL): The database scheme is
specified by a set of definitions that are expressed by a special
language named DDL. The result of compiling DDL
statements is a set of tables stored in a file called data catalog.
This file contains metadata (data about the data stored in
the database).
2. Data Manipulation Language (DML): a language that en-
ables users to access or manipulate data (retrieve, insert, re-
place, delete) as organized by a certain data model. We will
look at a commercial DML named SQL. In general, there are
two types of DML:
(a) Procedural: Describes what data is needed and how to
get it: e.g., relational algebra
(b) Non-procedural: Describes what data is needed without
specifying how to get it: e.g., tuple relational calculus
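The contrast between the two DML styles can be sketched in Python. The procedural half chains hand-written selection and projection steps in the spirit of relational algebra; the non-procedural half uses SQL through the built-in sqlite3 module as a stand-in for a calculus-style language. The sample rows and the helper names select and project are made up for illustration.

```python
import sqlite3

students = [("Ann", 3.6), ("Bob", 2.9), ("Cho", 3.2)]  # made-up sample rows

# Procedural style: spell out HOW, as a sequence of algebra-like steps.
def select(rows, predicate):
    """Keep only the rows that satisfy the predicate (selection)."""
    return [row for row in rows if predicate(row)]

def project(rows, index):
    """Keep only one column of each row (projection)."""
    return [row[index] for row in rows]

procedural = project(select(students, lambda row: row[1] > 3.0), 0)

# Non-procedural style: state WHAT is wanted; the DBMS picks the plan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, gpa REAL)")
conn.executemany("INSERT INTO student VALUES (?, ?)", students)
declarative = [name for (name,) in conn.execute(
    "SELECT name FROM student WHERE gpa > 3.0 ORDER BY name")]
conn.close()
```

Both halves answer the earlier example request, all students with GPA > 3.0, and return the same names.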
There are several kinds of users associated with a system:
1. Database administrator: defines schemas, storage structures
and access method definitions, physical organization,
authorization, integrity constraints.
2. Application programmers: they write a program and make
it available to the end-users
3. Sophisticated users: they use a query language (SQL) to
access the database interactively
4. Naive (end) users: they invoke the application programs
Session 2: E-R Data Model (CH-2)
CSCI-585 Tuesday Jan. 16, 2001
Cyrus Shahabi
• Data Model: A collection of conceptual tools for describing data, data
relationships, data semantics & consistency constraints. Example: Entity
Relationships Data Model (E-R model)
• Based on a perception of a real world which consists of a set of basic
objects called entities and relationships among these objects.
• Entity: An entity is an object that exists and is distinguishable from other
objects (e.g., Tony with SS#, csci585 in Spring-2001, …)
• Entity set: A set of entities of the same type (e.g., students, courses).
Presented as:
Student
• Entity sets need not be disjoint (e.g., Pamu (TA) is both a student and
an employee of USC)
• Attributes: An entity is represented by a set of attributes (e.g., student:
name, SS#, age, address, …). Attribute can be considered as a function
that maps an entity into a domain (e.g., SS#: entity->integer).
Presented as:
• Role: The function that an entity plays in a relationship is called its role.
Normally implicit.
• Recursive relationship: Same entity set participates more than once in a
relationship in different roles. Role names, hence, become essential:
[Figures: recursive relationship Manage on Employee; relationship Enrolled between Students and Courses]
(From now on, when I say entity or relationship, I mean entity set and
relationship set)
• Keys: Entities and relationships are distinguishable using various keys.
• Superkey: A combination of one or more attributes that allow us to
identify uniquely an entity in an entity set (e.g., SS#, name & SS#).
• Candidate key: A minimal superkey (no proper subset is a superkey) that
uniquely identifies an entity (e.g., SS#, name & address, phone#).
• Primary key: A candidate key chosen by DBA to identify entities of an
entity set (e.g., SS#).
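The key definitions above can be checked mechanically on sample data. The sketch below is only an illustration: a key is a property of every legal instance of an entity set, so a test on one sample can rule a key out but cannot prove it. The function names and the three sample students are made up.

```python
from itertools import combinations

def is_superkey(rows, attrs):
    """True if no two rows agree on all the given attributes."""
    seen = set()
    for row in rows:
        value = tuple(row[a] for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True

def candidate_keys(rows, attributes):
    """Minimal superkeys of the sample, found smallest first."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for attrs in combinations(attributes, size):
            if any(set(k) <= set(attrs) for k in keys):
                continue  # contains a smaller key, so not minimal
            if is_superkey(rows, attrs):
                keys.append(attrs)
    return keys

students = [
    {"ssn": 1, "name": "Tony", "address": "A St"},
    {"ssn": 2, "name": "Tony", "address": "B St"},
    {"ssn": 3, "name": "Pamu", "address": "A St"},
]
keys = candidate_keys(students, ["ssn", "name", "address"])
```

On this sample, (ssn, name) is a superkey but not a candidate key, because ssn alone already identifies each student.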
• Weak entity set: An entity set that does not have enough attributes to
form a primary key (e.g., transaction#, date, amount. Different accounts
might have similar transaction#).
• Derived vs. stored: The value of the attribute can be derived from
either other attributes (e.g., age from DOB) or related entities (e.g.,
NumberOfEmployees).
Each row in this table corresponds to one entity of the entity set.
We may add/delete/modify rows in the table.
• Weak entity set with attributes a1, a2, …, an and an owner entity
set with primary key b1, b2, …, bm : represent it as a table with
n+m columns, one for each of { a1, a2, …, an} U { b1, b2, …, bm
}. b1, b2, …, bm is the foreign key of the resulting relation
referring to the corresponding relation of the owner entity set.
Example:
• (Idea: keep rows unique.)
• N-ary relationship set R with attributes a1, a2, …, an among
entity sets E1, E2, …, Em (m entity sets): represent it as a table with
n+m columns, one for each of { a1, a2, …, an} U {prim-key(E1),
prim-key(E2), …, prim-key(Em)}.
• Binary relationship set R with attributes a1, a2, …, an among
entity sets corresponding to relations S and T:
• If 1:1 then choose either relation (say S) and extend it with
prim-key(T) U { a1, a2, …, an}
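The weak-entity rule above can be sketched with sqlite3. The example assumes the earlier transaction scenario: a weak entity set with attributes transaction#, date and amount (n = 3), owned by an account entity set whose primary key is an account number (m = 1). Table and column names, and the helper try_insert, are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE account (
    account_no INTEGER PRIMARY KEY,
    balance    REAL
);
-- n + m = 4 columns: the weak entity's own attributes plus the owner's key
CREATE TABLE txn (
    txn_no     INTEGER,
    txn_date   TEXT,
    amount     REAL,
    account_no INTEGER REFERENCES account(account_no),
    PRIMARY KEY (account_no, txn_no)
);
INSERT INTO account VALUES (1, 500.0), (2, 900.0);
""")

def try_insert(txn_no, account_no):
    """Insert a transaction row; return False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO txn VALUES (?, '2001-01-16', 10.0, ?)",
                     (txn_no, account_no))
        return True
    except sqlite3.IntegrityError:
        return False

first = try_insert(1, 1)
same_number_other_account = try_insert(1, 2)  # allowed: the owner key differs
duplicate = try_insert(1, 1)                  # rejected: row would not be unique
orphan = try_insert(7, 99)                    # rejected: no such owner account
```

Making the owner's key part of the primary key is what keeps rows unique even though different accounts reuse the same transaction numbers.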
(Attribute inheritance)
• There might exist many specializations of the same entity set
based on different distinguishing characteristics. Hence, an
entity can be a member of a number of subclasses. Example:
• An entity cannot merely exist by being a member of a subclass
but no superclass. However, it is not essential that every entity
in a superclass be a member of some subclass.
• Why specialization:
1. Define a set of subclasses of an entity set.
2. Associate additional specific attributes with each subclass.
3. Establish additional specific relationship sets between each
subclass and other entity sets.
EER-to-Relational Mapping
• Option 1: One table for superclass + two tables for subclasses
(one for each) consisting of their corresponding attributes plus
the primary key of the superclass.
• Option 2: Same as option 1, but without creating a table for the
superclass:
(Disclaimer: Some example queries are covered, but you need to go read
the book and do more exercise on your own, not everything is covered!)
Emp (SS#, name, age, salary, dno)
Dept (dno, dname, floor, mgrSS#)
• Structured Query Language (SQL) consists of four basic
commands: Select, Insert, Update, and Delete.
• The select command has the following syntax:
select e.name
from Emp e, Dept d
where e.dno = d.dno and d.dname = 'Shoe' and e.salary > 50000
and e.SS# in (
select e.SS#
from Emp e, Dept d
where e.dno = d.dno and d.dname = 'Toy')
Note: not in can be used for negation of in (… who does not work
for toy department.)
select name
from Emp
where salary > all
( select e.salary
from Emp e, Dept d
where e.dno = d.dno and d.floor=1)
select name
from Emp
where salary > some
( select e.salary
from Emp e, Dept d
where e.dno = d.dno and d.floor=2)
Find the employees who make more than the average salary:
select SS#
from Emp
where salary >
(select avg(salary)
from Emp)
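The nested queries above can be tried out with sqlite3. This is a sketch under a few assumptions: the column SS# is renamed ssn, the four employees and two departments are invented, and because SQLite does not accept the `> all` form, the floor-1 query compares against MAX instead (the two differ when the subquery is empty).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dept (dno INTEGER PRIMARY KEY, dname TEXT, floor INTEGER, mgrssn INTEGER);
CREATE TABLE Emp  (ssn INTEGER PRIMARY KEY, name TEXT, age INTEGER, salary REAL, dno INTEGER);
INSERT INTO Dept VALUES (1, 'Toy', 1, 101), (2, 'Shoe', 2, 102);
INSERT INTO Emp VALUES
    (101, 'Ann', 40, 60000, 1),
    (102, 'Bob', 35, 65000, 2),
    (103, 'Cy',  30, 30000, 1),
    (104, 'Di',  28, 45000, 2);
""")

def names(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# Membership test with "in" and a nested select.
toy_staff = names("""
    SELECT name FROM Emp
    WHERE ssn IN (SELECT e.ssn FROM Emp e, Dept d
                  WHERE e.dno = d.dno AND d.dname = 'Toy')
    ORDER BY name""")

# "salary > all (...)" rewritten with MAX for SQLite.
above_floor1 = names("""
    SELECT name FROM Emp
    WHERE salary > (SELECT MAX(e.salary) FROM Emp e, Dept d
                    WHERE e.dno = d.dno AND d.floor = 1)
    ORDER BY name""")

# Employees who make more than the average salary.
above_average = names("""
    SELECT name FROM Emp
    WHERE salary > (SELECT AVG(salary) FROM Emp)
    ORDER BY name""")
```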
update Emp
set salary = 1.1 * salary
where SS# in (
select e.SS#
from Emp e, Dept d
where e.dno = d.dno and d.dname = 'Toy')
1. create schema s
Creates a schema!
3. drop table r
Get rid of the entire relation r
4. delete from r
Deletes only the tuples but keeps the relation
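The difference between updating, deleting tuples and dropping a relation can be seen in a short sqlite3 session. The sketch assumes a cut-down version of the Emp and Dept relations with SS# renamed ssn and two invented employees.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dept (dno INTEGER PRIMARY KEY, dname TEXT);
CREATE TABLE Emp  (ssn INTEGER PRIMARY KEY, name TEXT, salary REAL, dno INTEGER);
INSERT INTO Dept VALUES (1, 'Toy'), (2, 'Shoe');
INSERT INTO Emp VALUES (101, 'Ann', 60000, 1), (102, 'Bob', 65000, 2);
""")

# Give every Toy employee a 10% raise.
conn.execute("""
    UPDATE Emp SET salary = 1.1 * salary
    WHERE ssn IN (SELECT e.ssn FROM Emp e, Dept d
                  WHERE e.dno = d.dno AND d.dname = 'Toy')""")
salaries = dict(conn.execute("SELECT name, salary FROM Emp"))

# delete: the tuples go, the (now empty) relation stays.
conn.execute("DELETE FROM Emp")
rows_after_delete = conn.execute("SELECT COUNT(*) FROM Emp").fetchall()[0][0]

# drop table: the relation itself is gone, so querying it fails.
conn.execute("DROP TABLE Emp")
try:
    conn.execute("SELECT COUNT(*) FROM Emp")
    table_exists = True
except sqlite3.OperationalError:
    table_exists = False
```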
Object-Oriented Concepts
Attributes
Object-Oriented Concepts
• Function types:
• Constructors: to initialize new instances of ADT
• Destructors: to release resources used by instances
• Actors: to perform all other operations (e.g., GET and SET
functions for virtual attributes).
Cyrus Shahabi
Computer Science Department
University of Southern California
shahabi@usc.edu
• Overview
• JDBC Package
• Connecting to databases with JDBC
• Executing select queries
• Executing update queries
Overview
• Role of an application: Update databases,
extract info, through:
– User interfaces
– Non-interactive programs
• Client software:
– Provide general and specific capabilities
– Oracle provides different capabilities than
Sybase (its own methods, … )
Client server architecture
• Client-Server architectures:
– 2 tier
– 3 tier
– Layer 1:
• user interface
– Layer 2:
• Middleware
– Layer 3:
• DB server
• Middleware:
– Server for client
Client for DB
Client server architecture
• Example: Web interaction with DB
– Layer 1: web browser
– Layer 2: web server + cgi program
– Layer 3: DB server
Client server architecture
• Application layer (1):
– User interfaces
– Other utilities (report generator, …)
– Connect to middleware
– Can connect to DB too
– Can have more than one connection
– Can issue SQL, or invoke methods in lower layers.
• Middleware layer (2):
– More reliable than user applications
Database interaction in Access
• Direct interaction with DB
• For implementing applications
• Not professional
• Developer edition:
– Generates stand alone application
• Access application:
– GUI + “Visual Basic for Applications” code
Database interaction in Access
• Connection to DB through:
– Microsoft Jet database engine
• Support SQL access
• Different file formats
– Open Database Connectivity (ODBC)
• Support SQL DBs
• Requires driver for each DB server
– Driver allows the program to become a client for DB
• Client behaves independently of the DB server
Database interaction in Access
• Making data source available to ODBC application:
– Install ODBC driver manager
– Install specific driver for a DB server
– Database should be registered for ODBC manager
• (Example 1)
Connecting to DB with JDBC
• Step 2: Make connection to the DB
• Connection conn = DriverManager.getConnection(URL, Properties);
– Properties: specific to the driver
• URL = Protocol + user
– Protocol= jdbc:<subprotocol>:<subname>
» E.g.: jdbc:odbc:mydatabase
» E.g.:
jdbc:oracle:thin://oracle.cs.fsu.edu/bighit
• (Example 1)
Connecting to DB with JDBC
• Step 3: Make Statement object
– Used to send SQL to DB
• executeQuery(): SQL that returns table
• executeUpdate(): SQL that doesn’t return table
• execute(): SQL that may return both, or a different kind of result
• (Example 2)
Executing select queries
• Step 5: issue select queries
– Queries that return table as result
– Using statement object
– Uses executeQuery() method
– Return the results as ResultSet object
• Meta data in ResultSetMetaData object
– Every call to executeQuery() deletes previous
results
• (Example 2)
Executing select queries
• Step 6: retrieve the results of select queries
– Using ResultSet object
• Returns results as a set of rows
• Accesses values by column name or column number
• Uses a cursor to move between the results
• Supported methods:
– JDBC 1: scroll forward
– JDBC 2: scroll forward/backward, absolute/relative
positioning, updating results.
– JDBC 2: supports SQL99 data types(blob, clob,…)
• (Example 2)
Executing update queries
• Step 7: issue update queries
– Queries that return a row count (integer) as result
• Number of rows affected by the query
• Errors are reported by throwing an SQLException
– Using statement object
– Uses executeUpdate() method
• (Example 3)
Executing update queries
• Step 8: More Advanced
– Use PreparedStatement
• typically faster than a regular Statement when the same SQL is executed repeatedly, since it is precompiled
• (Example 4)
– Cursors
• forward, backward, absolute/relative positions
• (Example 5)
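The connect, statement, result-set and update steps above follow a pattern shared by most database APIs. The sketch below shows the same flow with Python's DB-API and the built-in sqlite3 module; it is an analogy rather than JDBC code, the comments name the rough JDBC counterpart of each step, and the emp table and its rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # roughly DriverManager.getConnection(url)
cur = conn.cursor()                  # plays the role of a Statement

cur.execute("CREATE TABLE emp (ssn INTEGER PRIMARY KEY, name TEXT, salary REAL)")
# Parameter markers (?) are the counterpart of a PreparedStatement.
cur.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                [(101, "Ann", 60000), (102, "Bob", 45000), (103, "Cy", 30000)])

# Select query: returns a table, read row by row through the cursor.
cur.execute("SELECT name, salary FROM emp ORDER BY ssn")
columns = [d[0] for d in cur.description]   # roughly ResultSetMetaData
rows = cur.fetchall()                       # roughly iterating a ResultSet

# Update query: returns a row count instead of a table.
cur.execute("UPDATE emp SET salary = salary + ? WHERE salary < ?", (1000, 50000))
updated = cur.rowcount                      # roughly the int from executeUpdate()

conn.commit()
conn.close()
```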
Introduction to Spatial Database
Systems
by Cyrus Shahabi
from
Ralf Hartmut Güting's
VLDB Journal v3, n4, October 1994
Outline
• Introduction & definition
• Modeling
• Querying
• Data structures and algorithms
• System architecture
• Conclusion and summary
Introduction
• Various fields/applications require management of
geometric, geographic or spatial data:
– A geographic space: surface of the earth
– Man-made space: layout of VLSI design
– Model of rat brain
Introduction …
• Common challenge: dealing with large
collections of relatively simple geometric
objects
• Different from image and pictorial database
systems:
– Containing sets of objects in space rather than
images or pictures of a space
Definition
• A spatial database system:
– Is a database system
• A DBMS with additional capabilities for handling
spatial data
– Offers spatial data types (SDTs) in its data
model and query language
• Structure in space: e.g., POINT, LINE, REGION
• Relationships among them: (l intersects r)
– Supports SDT in its implementation
• Providing at least spatial indexing (retrieving objects
in particular area without scanning the whole space)
• Efficient algorithm for spatial joins (not simply
filtering the cartesian product)
Modeling
• WLOG assume 2-D and GIS application,
two basic things need to be represented:
– Objects in space: cities, forests, or rivers
– modeling single objects
– Space: say something about every point in
space (e.g., partition of a country into districts)
– modeling spatially related collections of
objects
Modeling …
• Fundamental abstractions for modeling
single objects:
– Point: object represented only by its location in
space, e.g., center of a state
– Line (actually a curve or polyline):
representation of moving through or
connections in space, e.g., road, river
– Region: representation of an extent in 2d-space,
e.g., lake, city
Modeling …
• Instances of spatially related
collections of objects:
– Partition: set of region objects that are
required to be disjoint (adjacency or
region objects with common
boundaries), e.g., thematic maps
– Networks: embedded graph in plane
consisting of set of points (vertices)
and lines (edges) objects, e.g.
highways, power supply lines, rivers
Modeling …
A sample (ROSE) spatial type system
EXT={lines, regions}, GEO={points, lines, regions}
Modeling …
• Spatial operators returning numbers
– dist: geo1 x geo2 → real
– perimeter, area: regions → real
• Spatial operations on set of objects
– sum: set(obj) x (obj → geo) → geo
– A spatial aggregate function, geometric union of all
attribute values, e.g., union of set of provinces determine
the area of the country
– closest: set(obj) x (obj → geo1) x geo2 → set(obj)
– Determines within a set of objects those whose spatial
attribute value has minimal distance from geometric query
object
Modeling …
• Spatial relationships:
– Topological relationships: e.g., adjacent, inside, disjoint.
Are invariant under topological transformations like
translation, scaling, rotation
– Direction relationships: e.g., above, below, or north_of,
southwest_of, …
– Metric relationships: e.g., distance
• Enumeration of all possible topological relationships
between two simple regions (no holes, connected):
– Based on comparing two objects' boundaries (∂A) and
interiors (A°), there are 4 intersection sets, each of which may be
empty or not, giving 2^4 = 16 combinations. 8 of these are not valid
and 2 pairs are symmetric, so:
• 6 valid topological relationships:
disjoint, in, touch, equal, cover, overlap
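One way to make the six relationships concrete is to restrict the regions to axis-aligned rectangles with positive area. The sketch below is an illustration under that assumption, not the general boundary/interior intersection computation; the function name relate and the tuple layout (xmin, ymin, xmax, ymax) are made up, and the symmetric cases are folded together as in the list above.

```python
def relate(a, b):
    """Classify two rectangles (xmin, ymin, xmax, ymax) into one of six relationships."""
    if a == b:
        return "equal"
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # No common point at all.
    if ax2 < bx1 or bx2 < ax1 or ay2 < by1 or by2 < ay1:
        return "disjoint"
    # Boundaries meet but the interiors do not.
    interiors_meet = ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2
    if not interiors_meet:
        return "touch"
    a_in_b = bx1 <= ax1 and ax2 <= bx2 and by1 <= ay1 and ay2 <= by2
    b_in_a = ax1 <= bx1 and bx2 <= ax2 and ay1 <= by1 and by2 <= ay2
    if a_in_b or b_in_a:
        inner, outer = (a, b) if a_in_b else (b, a)
        # "in" when the boundaries stay apart, "cover" when they share points.
        strictly_inside = (outer[0] < inner[0] and inner[2] < outer[2] and
                           outer[1] < inner[1] and inner[3] < outer[3])
        return "in" if strictly_inside else "cover"
    return "overlap"
```

For example, relate((1, 1, 2, 2), (0, 0, 3, 3)) gives "in", while moving the small rectangle to (0, 0, 1, 1) gives "cover" because the two boundaries now share points.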
Modeling …
• DBMS data model must be extended by SDTs at
the level of atomic data types (such as integer,
string), or better be open for user-defined types
(OR-DBMS approach):
relation states (sname: STRING; area: REGION; spop: INTEGER)
relation cities (cname: STRING; center: POINT; ext: REGION;
cpop: INTEGER);
relation rivers (rname: STRING; route: LINE)
Querying
• Two main issues:
1. Connecting the operations of a spatial algebra
(including predicates to express spatial
relationships) to the facilities of a DBMS
query language.
2. Providing graphical presentation of spatial
data (i.e., results of queries), and graphical
input of SDT values used in queries.
Querying …
Fundamental spatial algebra operations:
• Spatial selection: returning those objects satisfying a
spatial predicate with the query object
– “All cities in Bavaria”
SELECT cname FROM cities c WHERE c.center inside Bavaria.area
– “All rivers intersecting a query window”
SELECT * FROM rivers r WHERE r.route intersects Window
– “All big cities no more than 100 Kms from Hagen”
SELECT cname FROM cities c WHERE dist(c.center, Hagen.center) <
100 and c.cpop > 500k
(conjunction with other predicates and query optimization)
Querying …
• Spatial join: A join which compares any two
joined objects based on a predicate on their spatial
attribute values.
– “For each river passing through Bavaria, find all cities
within less than 50 Kms.”
SELECT r.rname, c.cname, length(intersection(r.route, c.area))
FROM rivers r, cities c
WHERE r.route intersects Bavaria.area and
dist(r.route,c.area) < 50 Km
Querying …
• Graphical I/O issue: how to determine “Window” or
“Bavaria” in previous examples (input); or how to
show “intersection(route, Bavaria.area)” or “r.route”
(output) (results are usually a combination of several
queries).
• Requirements for spatial querying [Egenhofer]:
– Spatial data types
– Graphical display of query results
– Graphical combination (overlay) of several query results
(start a new picture, add/remove layers, change order of layers)
– Display of context (e.g., show background such as a raster
image (satellite image) or boundary of states)
– Facility to check the content of a display (which query
contributed to the content)
Querying …
• Extended dialog: use pointing device to select objects within a
subarea, zooming, …
• Varying graphical representations: different colors, patterns,
intensity, symbols to different objects classes or even objects
within a class
• Legend: clarify the assignment of graphical representations to
object classes
• Label placement: selecting object attributes (e.g., population) as
labels
• Scale selection: determines not only size of the graphical
representations but also what kind of symbol be used and
whether an object be shown at all
• Subarea for queries: focus attention for follow-up queries
Data Structures …
• Representation of a value of a SDT must be
compatible with two different views:
1. DBMS perspective:
• Same as attribute values of other types with respect to
generic operations
• Can have varying and possibly large size
• Reside permanently on disk page(s)
• Can efficiently be loaded into memory
• Offers a number of type-specific implementations for
generic operations needed by the DBMS (e.g.,
transformation functions from/to ASCII or graphic)
Data Structures …
2. Spatial algebra implementation perspective, the
representation:
• Is a value of some programming language data type
• Is some arbitrary data structure which is possibly
quite complex
• Supports efficient computational geometry
algorithms for spatial algebra operations
• Is not geared only to one particular algorithm but is
balanced to support many operations well enough
Data Structures …
• From both perspectives, the representation should
be mapped by the compiler into a single or
perhaps a few contiguous areas (to support DBMS
paging). Also supports:
• Plane sweep sequence: object’s vertices stored in a
specific sweep order (e.g., x-order) to expedite
plane-sweep operation.
• Approximations: stores some approximations as
well, e.g., MBR
• Stored unary function values: such as perimeter or
area be stored once the object is constructed to
eliminate future expensive computations.
Spatial Indexing
• To expedite spatial selection (as well as other
operations such as spatial joins, …)
• It organizes space and the objects in it in some
way so that only parts of the space and a subset
of the objects need to be considered to answer a
query.
• Two main approaches:
1. Dedicated spatial data structures (e.g., R-tree)
2. Spatial objects mapped to a 1-D space to utilize
standard indexing techniques (e.g., B-tree)
Spatial Indexing
• A fundamental idea: use of approximations: 1)
continuous (e.g., bounding box), or 2) grid.
Spatial Indexing …
• Spatial data structures either store points or
rectangles (for line or region values)
• Operations on those structures: insert, delete,
member
• Query types for points:
– Range query: all points within a query rectangle
– Nearest neighbor: point closest to a query point
– Distance scan: enumerate points in increasing distance
from a query point.
• Query types for rectangles:
– Intersection query
– Containment query
[Figure: rectangles tested against a query rectangle]
Spatial Indexing …
• A spatial index structure organizes points into buckets.
• Each bucket has an associated bucket region, a part of
space containing all objects stored in that bucket.
• For point data structures, the regions are disjoint &
partition space so that each point belongs into
precisely one bucket.
• For rectangle data structures, bucket regions may
overlap.
[Figure: a kd-tree partitioning of 2d-space where each bucket can hold up to 3 points]
Spatial Indexing …
• One dimensional embedding: z-order or bit-interleaving
– Find a linear order for the cells of the grid while maintaining
“locality” (i.e., cells close to each other in space are also close to each
other in the linear order)
– Define this order recursively for a grid that is obtained by
hierarchical subdivision of space
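Bit interleaving can be sketched in a few lines. The function below assumes that each cell is addressed by integer grid coordinates (x, y) and that, at every level, the x bit is written before the y bit; that bit order and the name z_order are assumptions made for illustration.

```python
def z_order(x, y, bits):
    """Interleave the bits of x and y (x first) into a single z-order code."""
    code = 0
    for i in range(bits - 1, -1, -1):      # from the coarsest level to the finest
        code = (code << 1) | ((x >> i) & 1)
        code = (code << 1) | ((y >> i) & 1)
    return code

# Cells of a 4 x 4 grid listed in z-order.
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda cell: z_order(cell[0], cell[1], 2))
```

With one bit per axis the four cells get the codes 00, 01, 10 and 11, and each further subdivision appends two more bits, so cells that share a long code prefix lie in the same block of the grid.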
[Figure: z-order cell codes for recursively subdivided grids]
Spatial Indexing …
• Any shape (approximated as set of cells) over the grid
can now be decomposed into a minimal number of cells
at different levels (using always the highest possible
level)
[Figure: a shape decomposed into z-elements such as 1000, 10010 and 100110 over a grid with axes labelled 000 to 111]
Spatial Indexing …
• Spatial index structures for points:
[Figure: a grid file (scales X1 to X4 and Y1 to Y3 mapped to buckets) and a kd-tree]
Spatial Indexing …
Spatial index structures for rectangles: unlike points,
rectangles don’t fall into a unique cell of a partition and
might intersect partition boundaries
– Transformation approach: instead of k-dimensional
rectangles, 2k-dimensional points are stored using a point data
structure
– Overlapping regions: partitioning space is abandoned &
bucket regions may overlap (e.g., R-tree & R*-tree)
– Clipping: keep partitioning, a rectangle that intersects
partition boundaries is clipped and represented within each
intersecting cell (e.g., R+-tree)
Spatial Indexing …
• A rectangle with 4 coordinates (Xleft, Xright, Ybottom, Ytop)
can be considered as a point in 4d-space
• For illustration, consider how an interval i = (i1, i2)
with 2 coordinates can be mapped to 2d-space (as a
point).
Intersection query with interval i:
Find all points (x,y) where:
x < i2 and y > i1
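The interval case of the transformation approach can be sketched directly. Each stored interval (left, right) is treated as the 2d point (x, y) = (left, right), and the intersection query becomes the corner condition stated above. The function names and sample intervals are made up; a real system would answer the corner query with a point index rather than a scan.

```python
def to_point(interval):
    """Map an interval (left, right) to the 2d point (x, y) = (left, right)."""
    left, right = interval
    return (left, right)

def intersecting(points, query):
    """Return the stored points whose intervals intersect the query interval."""
    i1, i2 = query
    return [(x, y) for (x, y) in points if x < i2 and y > i1]

stored = [to_point(iv) for iv in [(1, 3), (4, 6), (5, 9), (10, 12)]]
hits = intersecting(stored, (2, 5))
```

Because the condition uses strict inequalities, the interval (5, 9), which only touches the query (2, 5) at its end point, is not reported.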
Spatial Join
• Traditional join methods such as hash join or
sort/merge join are not applicable.
• Filtering cartesian product is expensive.
• Two general classes:
1. Grid approximation/bounding box
2. None/one/both operands are presented in a spatial index
structure
– Grid approximations and overlap predicate:
– A parallel scan of two sets of z-elements corresponding to
two sets of spatial objects is performed
– Too fine a grid, too many z-elements per object
(inefficient)
– Too coarse a grid, too many “false hits” in a spatial join
Spatial Join …
• Bounding boxes: for two sets of rectangles R, S all
pairs (r,s), r in R, s in S, such that r intersects s:
– No spatial index on R and S: bb_join which uses a
computational geometry algorithm to detect rectangle
intersection, similar to external merge sorting
– Spatial index on either R or S: index join scan the
non-indexed operand and for each object, the bounding
box of its SDT attribute is used as a search argument on
the indexed operand (only efficient if non-indexed
operand is not too big or else bb-join might be better)
– Both R and S are indexed: synchronized traversal of
both structures so that pairs of cells of their respective
partitions covering the same part of space are
encountered together.
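For the case with no index on either input, the idea of a sort-based bounding-box join can be sketched as an in-memory plane sweep along the x axis. This is a simplified stand-in for illustration, not the external algorithm the slides refer to; rectangles are closed tuples (xmin, ymin, xmax, ymax), and the function names and sample data are made up. A nested-loop join is included as the baseline to compare against.

```python
def intersects(r, s):
    """Closed rectangles (xmin, ymin, xmax, ymax) share at least one point."""
    return r[0] <= s[2] and s[0] <= r[2] and r[1] <= s[3] and s[1] <= r[3]

def nested_loop_join(R, S):
    """Baseline: filter the cartesian product."""
    return [(r, s) for r in R for s in S if intersects(r, s)]

def sweep_join(R, S):
    """Sort both inputs by xmin, then sweep once, pairing rectangles still open."""
    events = sorted([(r[0], 0, r) for r in R] + [(s[0], 1, s) for s in S])
    active = ([], [])   # open rectangles from R and from S
    pairs = []
    for x, side, rect in events:
        other = 1 - side
        # Drop rectangles of the other input that ended before the sweep line.
        active[other][:] = [o for o in active[other] if o[2] >= x]
        for o in active[other]:
            if rect[1] <= o[3] and o[1] <= rect[3]:   # y-ranges overlap
                pairs.append((rect, o) if side == 0 else (o, rect))
        active[side].append(rect)
    return pairs

R = [(0, 0, 2, 2), (3, 3, 5, 5), (6, 0, 7, 1)]
S = [(1, 1, 4, 4), (5, 5, 6, 6), (8, 8, 9, 9)]
result = sweep_join(R, S)
```

Each rectangle is only compared with rectangles of the other input whose x-ranges are still open at the sweep line, which is where the saving over the nested loop comes from.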
System Architecture
• Extensions required to a standard DBMS architecture:
– Representations for the data types of a spatial algebra
– Procedures for the atomic operations (e.g., overlap)
– Spatial index structures
– Access operations for spatial indices (e.g., insert)
– Filter and refine techniques
– Spatial join algorithms
– Cost functions for all these operations (for query optimizer)
– Statistics for estimating selectivity of spatial selection and
join
– Extensions of optimizer to map queries into the specialized
query processing method
– Spatial data types & operations within data definition and
query language
– User interface extensions to handle graphical representation
and input of SDT values
System Architecture …
• The only clean way to accommodate these
extensions is an integrated architecture based on
the use of an extensible DBMS.
• There is no difference in principle between:
– a standard data type such as a STRING and a spatial
data type such as REGION
– same for operations: concatenating two strings or
forming intersection of two regions
– clustering and secondary index for standard attribute
(e.g., B-tree) & for spatial attribute (R-tree)
– sort/merge join and bounding-box join
– query optimization (only reflected in the cost functions)
System Architecture
Extensibility of the architecture is orthogonal to the data
model implemented by that architecture:
– Probe is OO
– DASDBS is nested relational
– POSTGRES, Starburst and Gral extended relational models
• OO is good due to extensibility at the data type level,
but lacks extensibility at index structures, query
processing or query optimization.
• Hence, current commercial solutions are OR-DBMSs:
– NCR Teradata Object Relational (TOR)
– IBM DB2 (spatial extenders)
– Informix Universal Server (spatial datablade)
– Oracle 8i (spatial cartridges)
CSCI585-Spring2001
XML Overview
XML Overview (cont.)
XML Terminology
A Simple XML Document
A Simple Document Type Definition
The DTD Language (0)
The DTD Language (1)
The DTD Language (2)
The DTD Language (3)
The DTD Language (4)
The DTD Language (5)
• Parameter entity references appear only within a DTD and cannot be used in an XML document. They are prefixed with a %.
– Format and usage:
<!ENTITY % name "replacement_characters">
• Example:
<!ENTITY % pcdata "(#PCDATA)">
<!ELEMENT authortitle %pcdata;>
The DTD Language (6)
• External entities allow us to include data from another XML document (think of an #include<...> statement in C):
– Format and usage:
<!ENTITY quotes SYSTEM "http://www.stocks.com/quotes.xml">
• Example:
<document>
<heading>Current stock quotes</heading>
&quotes; <!-- data from quotes.xml -->
</document>
The DTD Language (7)
<!ATTLIST box length CDATA "0">
<!ATTLIST box width CDATA "0">
<!ATTLIST frame visible (true|false) "true">
<!ATTLIST person marital (single | married | divorced | widowed) #IMPLIED>
The DTD Language (8)
– Source, of one of the three enumerated types, default list
– Taxed, with the fixed value yes
– Fixed attribute type is a special case of default
– It determines that the default value cannot be changed by an XML document conforming to the DTD
– E.g., a book in our XML example must be taxed
The DTD Language (9)
The DTD Language (10)
The DTD Language (11)
The DTD Language (12)
The DTD Language (13)
• Example: Sales Order XML Document
<Orders>
<SalesOrder SONumber="12345">
<Customer CustNumber="543">
<CustName>ABC Industries</CustName>
<Street>123 Main St.</Street>
<City>Chicago</City>
<State>IL</State> <ZIP>60609</ZIP>
</Customer>
<OrderDate>10222000</OrderDate>
<Item ItemNumber="1">
<Part PartNumber="234">
<Description>Turkey wrench</Description>
<Price>9.95</Price>
</Part>
<Quantity>10</Quantity>
</Item>
</SalesOrder>
</Orders>
The DTD Language (14)
• An XML document that satisfies the constraints of a DTD is said to be valid with respect to that DTD.
• Document type declaration (at the "prolog" of an XML document):
<!DOCTYPE BOOKCATALOG SYSTEM "http://t t.com/bookcatalog.dtd">
• XML document claims validity with respect to the BOOKCATALOG DTD
Example XSL
– Each template element describes one transformation rule
– The match attribute of a template element specifies the rule pattern while its content is the template used to produce the corresponding portion of the result tree
<!-- Rule 1 --> <xsl:template match="/">
<html><head><title>Our New Catalog</title></head>
<body>
<xsl:apply-templates/>
</body>
</html>
</xsl:template>
– The pattern "/" denotes the root of the source tree
– The template contains some standard XHTML header and trailer constructs
– The apply-templates element is a rule-processing instruction that denotes recursive processing of the contents of the matched element
– XSLT includes several other instructions which permit templates with constructs such as for-loops, conditional sections, and sorting
Example XSL …
<!-- Rule 2 --> <xsl:template match="book/title">
<h1><xsl:apply-templates/></h1>
</xsl:template>
<!-- Rule 3 --> <xsl:template match="book/author">
<b><xsl:apply-templates/></b>
</xsl:template>
– Pattern "book/title" matches a title element if its parent is a book element
– The template calls for recursive processing of the contents, enclosed in XHTML literals for bold display (<b>...</b>)
– XSL processing includes implicit rules that match elements, attributes, and character data (text) not matched by any explicit rules; these rules simply copy data from source to result tree
– In our example, all character data (such as the text "The spy..." in the title) is copied to the result tree
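The way Rules 1 to 3 and the implicit copy rule interact can be mimicked in a few lines of Python. This is only a sketch of the control flow, not an XSLT processor: the names `apply_templates`, `transform`, and `SOURCE`, as well as the sample catalog, are assumptions made for illustration, and text is copied without escaping.

```python
import xml.etree.ElementTree as ET


def apply_templates(node, parent_tag=None):
    """Return result-tree markup for `node`, mimicking Rules 2 and 3."""
    # Recursive processing of the contents: own text, then each child and its tail.
    inner = node.text or ""
    for child in node:
        inner += apply_templates(child, node.tag) + (child.tail or "")
    if parent_tag == "book" and node.tag == "title":    # Rule 2: book/title
        return "<h1>" + inner + "</h1>"
    if parent_tag == "book" and node.tag == "author":   # Rule 3: book/author
        return "<b>" + inner + "</b>"
    return inner  # implicit rule: only the character data is copied


def transform(xml_text):
    """Rule 1: wrap the processed root in the XHTML header and trailer."""
    root = ET.fromstring(xml_text)
    return ("<html><head><title>Our New Catalog</title></head><body>"
            + apply_templates(root) + "</body></html>")


# Hypothetical source document, not taken from the slides.
SOURCE = "<catalog><book><title>The spy</title><author>J. Doe</author></book></catalog>"
print(transform(SOURCE))
```

Elements without an explicit rule (here `catalog` and `book`) contribute no markup of their own; only their character data reaches the result.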
XML-QL: For Querying XML Data
• Motivation:
– Why querying XML data?
– Why a new query language for XML?
• Requirements for an XML query language
• XML-QL features
• Other XML query languages
Why query XML data?
• Data interchange on the Internet
– Integrating, transforming, cleaning and aggregating XML data
• Examples:
– businesses publish data about their products & services, for customers to compare and process
– business partners could exchange internal operational data between their information systems on secure channels
– search robots could integrate automatically information from related sources that publish their data in XML format
  • stock quotes from financial sites
  • sports scores from news sites
• Semi-structured: XML data is not rigidly structured
• Self-describing: schema exists with data
• Can naturally model irregularities
– Missing elements (e.g., bestseller?)
– Multiple occurrences of the same element (reviews*)
– Elements w/ atomic values in some data items and structured values in others
– Collections of elements with heterogeneous structure
1. Precise Semantics
2. Rewritability, Optimizability
3. Query Operations
4. Compositional Semantics
5. No Schema Required
6. Exploit Available Schema
7. Preserve Order and Association
8. Mutually Embedding with XML
9. Support for New Datatypes
10. Suitable for Metadata
11. Server-side Processing
12. Programmatic Manipulation
13. XML Representation
<!ATTLIST article type CDATA>
<!ELEMENT publisher (name, address)>
<!ELEMENT author (firstname?, lastname)>
• Where: what to select (similar to the where-clause in SQL)
• Construct: what to return (similar to the select-clause in SQL)
• Extraction: by binding the variables $t, $a and $y, and returning only $a
<result> <author><firstname> John </firstname> <lastname> Smith </lastname> </author>
<title> Tractability </title> </result>
<result> <author><firstname> John </firstname> <lastname> Smith </lastname> </author>
<title> Decidability </title> </result>
<result> <author><lastname> Arvind </lastname> </author>
<title> Efficiency </title> </result>
...
WHERE <bib> <book> <title> $t </>
<publisher> <name> Addison-Wesley </> </>
</> CONTENT_AS $p </> IN "www.a.b.c/bib.xml"
CONSTRUCT <result> <title> $t </>
WHERE <author> $a </> IN $p
CONSTRUCT <author> $a </>
</>
• Lore (Lightweight Object Repository) & Lorel
• XSL, easy to express recursive processing:
– all author elements, regardless of how deep they occur in the data:
<xsl:template> <xsl:apply-templates/> </xsl:template>
<xsl:template match="author"> <result> <xsl:value-of/> </result>
</xsl:template>
• Your third homework (HW#3):
Create a DTD, an XSL, and an example XML document from a given EER diagram and then query it using an XML query language of choice (suggested: XML-QL)!
Information Integration
Outline
• Information Integration
– Definition
– Motivating example
– Architectures
• Datalog
• Source Descriptions & Query Reformulation
– Global-as-View
– Local-as-View: Bucket Algorithm
– Source Capabilities: Recursive Rewritings
• Wrappers
• Matching Objects Across Sources
Information Integration
Single Interface to Multiple Sources
[Figure: Information Agent]
Motivation:
TheaterLoc Entertainment Agent
[Figure: TheaterLoc agent over the sources Tiger Map Server, Etak Geocoder, Hollywood.com Trailers, Zagat, CuisineNet, Yahoo Movies]
Information Integration
The problem of providing
uniform (sources transparent to user)
access to (query, and eventually updates too)
multiple (even 2 is a problem!)
autonomous (not affect the behavior of sources)
heterogeneous (different data models, schemas)
structured (at least semistructured)
data sources (not only databases)
3
Related Technologies
• Distributed databases:
– Sources are homogeneous
– Data is distributed a priori
– Sources are not autonomous
– Similarities at the optimization and execution level
• Information retrieval:
– Keyword search, no semantics
• Data mining:
– Discovering properties and patterns in data
Principal Dimensions of
Information Integration
• Virtual vs. materialized architecture
• Access: query only or query & update?
• Mediated schema
– Mediated schema requires schema integration and then query
reformulation. Two main approaches:
• Global as View
• Local as View
– Language for descriptions and queries: conjunctive queries (CQs), union
of CQs, Datalog (recursion), first-order logic (∧,∨,¬), description logics…
• Types of Sources
– Structured (DB’s) vs. semi-structured (Web)
– Source capabilities: positive and negative
4
Materialized Architecture:
Data Warehouse
Virtual Architecture:
Mediator
5
Mediator
Architecture
• User queries in
global (mediator)
schema
• Mediator translates
and decomposes
user query into
multiple source
queries
Datalog
• Datalog Program = set of datalog rules
• Datalog rule = conjunctive query
head:
Big-LA-buyers(buyer,seller,product, price) :-
body:
Person(buyer, “Los Angeles”, phone),
Purchase(buyer, seller, product, price),
price > 10000.
Conjunctive Queries and Views
CREATE VIEW Big-LA-buyers AS
SELECT buyer, seller, product, price
FROM Person, Purchase
WHERE Person.city = “Los Angeles” AND
Person.name = Purchase.buyer AND
Purchase.price > 10000
Big-LA-buyers(buyer,seller,product, price) :-
Person(buyer, “Los Angeles”, phone),
Purchase(buyer, seller, product, price),
price > 10000.
Datalog rule ~ view definition
Rule body ~ select-from-where construct of SQL
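The correspondence between the rule body and a select-from-where block can be checked with a small Python sketch that evaluates the Big-LA-buyers rule over in-memory relations. The function name `big_la_buyers` and the sample tuples are assumptions for illustration only.

```python
def big_la_buyers(person, purchase):
    """Evaluate the rule body as a join, a selection, and a projection.

    person:   set of (name, city, phone) tuples
    purchase: set of (buyer, seller, product, price) tuples
    """
    return {
        (buyer, seller, product, price)
        for (name, city, _phone) in person
        for (buyer, seller, product, price) in purchase
        # Shared variable `buyer` is the join condition; the constant and the
        # comparison are the selections.
        if name == buyer and city == "Los Angeles" and price > 10000
    }


# Made-up sample data.
PERSON = {("ann", "Los Angeles", "555-0100"), ("bob", "Chicago", "555-0101")}
PURCHASE = {
    ("ann", "acme", "car", 25000),
    ("ann", "acme", "pen", 2),
    ("bob", "acme", "boat", 90000),
}
print(big_la_buyers(PERSON, PURCHASE))
```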
Transitive Closure
Suppose we are representing a graph by a relation Edge(X,Y):
Edge(a,b), Edge (a,c), Edge(b,d), Edge(c,d), Edge(d,e)
[Figure: directed graph with edges a→b, a→c, b→d, c→d, d→e]
How do we answer the query:
Find all nodes reachable from a.
Recursion in Datalog
Path(X, Y) :- Edge(X, Y).
Path(X, Y) :- Path(X, Z), Path(Z, Y).
Semantics: evaluate the rules bottom-up until a fixpoint:
Iteration #0: Edge: {(a,b), (a,c), (b,d), (c,d), (d,e)}
Path: {}
Iteration #1: Path: {(a,b), (a,c), (b,d), (c,d), (d,e)}
Iteration #2: Path gets the new tuples: (a,d), (b,e), (c,e)
Iteration #3: Path gets the new tuple: (a,e)
Iteration #4: Nothing changes => stop.
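The bottom-up evaluation can be reproduced with a naive fixpoint loop. The sketch below is an illustration under assumed names (`transitive_closure`, `EDGES`); it applies both Path rules in every round and records the tuples that are new in each round.

```python
def transitive_closure(edges):
    """Naive bottom-up evaluation of the two Path rules until a fixpoint."""
    path = set()
    history = []  # new tuples found in each iteration
    while True:
        derived = set(edges)                              # Path(X,Y) :- Edge(X,Y)
        derived |= {(x, y)                                # Path(X,Y) :- Path(X,Z), Path(Z,Y)
                    for (x, z) in path
                    for (z2, y) in path if z == z2}
        new = derived - path
        if not new:            # nothing changes => stop
            return path, history
        path |= new
        history.append(new)


EDGES = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")}
PATH, HISTORY = transitive_closure(EDGES)
REACHABLE_FROM_A = sorted(y for (x, y) in PATH if x == "a")
print(REACHABLE_FROM_A)
```

The query "find all nodes reachable from a" is then a selection on the first column of Path.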
8
Source Descriptions
Elements of source descriptions:
• Contents: source contains movies, directors, cast.
• Constraints: only movies produced after 1965.
• Completeness: contains all American movies.
• Capabilities:
– Negative: source requires movie title or director as input
– Positive: source can perform selections, joins, …
9
Approaches to Specification of
Source Descriptions
• Global-as-view:
– Mediator relation defined as view over source
relations
Ex: TSIMMIS (Stanford), HERMES (Maryland).
• Local-as-View:
– Source relation defined as view over mediator
relations
Ex: Information Manifold (AT&T),
Tukwila(UW), InfoMaster (Stanford).
Query Reformulation
Problem: rewrite the user query expressed in the mediated
schema into a query expressed in the source schemas.
Given a query Q in terms of the mediated-schema relations,
and descriptions of the information sources,
Find a query Q’ that uses only the source relations, such that
• Q’ |= Q (i.e., answers are correct; i.e., Q’ ⊆ Q) and
• Q’ provides all possible answers to Q given the sources.
10
Answering queries using views
• Query Containment: q’ ⊆ q ↔ ∀D q’(D) ⊆ q(D)
• Query Equivalence: q’ = q ↔ q’ ⊆ q ^ q ⊆ q’
Given query q and view definitions V={V1…Vn}
• q’ is an Equivalent Rewriting of q using V if:
– q’ refers only to views in V, and
– q’ = q
• q’ is a Maximally-Contained Rewriting of q using V if:
– q’ refers only to views in V, and
– q’ ⊆ q, and
– there is no rewriting q1, such that q’ ⊆ q1 ⊆ q and q1 ≠ q’
Global-as-View (GAV)
Each mediator relation is defined as a view
over source relations.
MovieActor(title,actor) ←
DB1(title,actor,year)
MovieActor(title, actor) ←
DB2(title,director,actor,year)
MovieReview(title, review) ←
DB1(title,actor,year) ^ DB3(title,review)
11
Query Reformulation in GAV
Query reformulation = rule unfolding+simplification
Query: Find reviews for ‘Brando’ movies
q(title,review) :- MovieActor(title,‘Brando’),
MovieReview(title,review)
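1. q’(title,review) :- DB1(title,‘Brando’,year),
DB1(title,actor,year’), DB3(title,review)
The atom DB1(title,actor,year’) is redundant, since DB1(title,‘Brando’,year) already implies it; simplification gives:
q’(title,review) :- DB1(title,‘Brando’,year),
DB3(title,review)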
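A small Python sketch can make the GAV semantics concrete: the mediator relations are computed as views over the sources, and the user query is then answered over those views. All function names and the sample tuples below are assumptions for illustration. The function `unfolded_rule_1` encodes only the simplified unfolding shown above, which uses the DB1 definition of MovieActor, so its answers are contained in the view-based answers.

```python
def movie_actor(db1, db2):
    """MovieActor(title, actor), defined over DB1 and DB2."""
    return {(t, a) for (t, a, _y) in db1} | {(t, a) for (t, _d, a, _y) in db2}


def movie_review(db1, db3):
    """MovieReview(title, review), defined as DB1 joined with DB3 on title."""
    titles = {t for (t, _a, _y) in db1}
    return {(t, r) for (t, r) in db3 if t in titles}


def brando_reviews(db1, db2, db3):
    """q(title, review) :- MovieActor(title, 'Brando'), MovieReview(title, review)."""
    actors = movie_actor(db1, db2)
    reviews = movie_review(db1, db3)
    return {(t, r) for (t, a) in actors if a == "Brando"
            for (t2, r) in reviews if t == t2}


def unfolded_rule_1(db1, db3):
    """q'(title, review) :- DB1(title, 'Brando', year), DB3(title, review)."""
    return {(t, r) for (t, a, _y) in db1 if a == "Brando"
            for (t2, r) in db3 if t == t2}


# Made-up source contents.
DB1 = {("Movie A", "Brando", 1972), ("Movie B", "Someone", 1975)}
DB2 = {("Movie C", "Director X", "Brando", 1979)}
DB3 = {("Movie A", "classic"), ("Movie C", "intense")}
print(brando_reviews(DB1, DB2, DB3))
```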
12
Local-as-View (LAV)
Each source relation is defined as a view over
mediator relations
S1: V1(title, year, director) →
Movie(title,year,director,genre) ^
American(director) ^ year ≥1960 ^
genre = ‘Comedy’
S2: V2 (title, review) →
Movie(title,year,director,genre) ^ year≥1990 ^
MovieReview(title, review)
13
LAV vs. GAV
See [Ullman,ICDT-1997] for a detailed comparison.
• Local as View:
– Easier to add sources: specify the query expression.
– Easier to specify constraints on contents of the sources:
they are part of the query expression describing them.
• Global as view:
– Easier query reformulation
14
The Bucket Algorithm: Example
V1(student,number,year) → Registered(student,course,year),
Course(course,number), number ≥ 500, year ≥ 1992
V2(student,dept,course) → Registered(student,course,year),
Enrolled(student,dept)
V3(student,course) → Registered(student,course,year), year≤1990
V4(student,course,number) → Registered(student,course,year),
Course(course,number), Enrolled(student,dept), number ≤ 100
15
2. Checking Containment
q(S,D) :- Enrolled(S,D), Registered(S,C,Y), Course(C,N), N ≥300, Y≥1995
Buckets, one per subgoal of q:
Enrolled(S,D)    Registered(S,C,Y)    Course(C,N)
V2(S,D,C’)       V1(S,N’,Y)           V1(S’,N,Y’)
V4(S,C’,N’)      V2(S,D’,C)
                 V4(S,C,N’)
Negative Capabilities:
Binding Patterns
Sources:
AAAIdb^f(X) → AAAIPapers(X)
CitationDB^bf(X,Y) → Cites(X,Y)
AwardDB^b(X) → AwardPaper(X)
(superscripts are binding patterns: b = argument must be bound, f = argument may be free)
Recursive Rewritings
q(X) :- AwardPaper(X)
• Problem: Unbounded union of conjunctive queries
q1(X) :- AAAIdb(X), AwardDB(X)
q1(X) :- AAAIdb(X1), CitationDB(X1,X), AwardDB(X)
…
q1(X) :- AAAIdb(X1), CitationDB(X1,X2), …,
CitationDB(Xn,X), AwardDB(X)
• Solution: Recursive Rewriting
papers(X) :- AAAIdb(X)
papers(X) :- papers(Y), CitationDB(Y,X)
q’(X) :- papers(X), AwardDB(X)
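The recursive rewriting can be executed with a worklist that respects the binding patterns: AAAIdb is called with no input, CitationDB only with a bound first argument, and AwardDB only with a bound paper. The sketch below assumes the three sources are exposed as Python callables; that interface, the name `award_papers`, and the sample data are hypothetical.

```python
def award_papers(aaai_db, citation_db, award_db):
    """Evaluate papers/q' from the recursive rewriting.

    aaai_db():      returns an iterable of papers          (pattern f)
    citation_db(x): returns papers cited by the bound x    (pattern bf)
    award_db(x):    returns True if bound x won an award   (pattern b)
    """
    papers = set(aaai_db())          # papers(X) :- AAAIdb(X)
    frontier = list(papers)
    while frontier:                  # papers(X) :- papers(Y), CitationDB(Y,X)
        y = frontier.pop()
        for x in citation_db(y):
            if x not in papers:
                papers.add(x)
                frontier.append(x)
    return {x for x in papers if award_db(x)}   # q'(X) :- papers(X), AwardDB(X)


# Made-up sources; note the citation cycle p1 -> p2 -> p3 -> p1.
CITES = {"p1": ["p2"], "p2": ["p3"], "p3": ["p1"]}
AWARDS = {"p3", "p9"}
RESULT = award_papers(lambda: ["p1"],
                      lambda x: CITES.get(x, []),
                      lambda x: x in AWARDS)
print(RESULT)
```

Paper p9 has an award but is never reached through the sources, so it cannot be returned: the binding patterns limit which answers are obtainable.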
17
[Open world Assumption (source descriptions with containment)]
18
Example of Extraction Rule
[Muslea et al 1999]
Page:
<b>Name:</b>Chinois on Main<b> Cuisine :<p> </b> Pacific New Wave <br>
SkipTo( <b> )
19
Matching Objects Across Sources
Problem: how to decide that objects in two sources
refer to the same object in the real world
Example:
Zagat      Fodors
CPK        California Pizza Kitchen
Ralph’s    Ralph’s Grill
• Information Retrieval techniques for similarity
joins [Cohen,SIGMOD-98] [Tejada&Knoblock]
20
CSCI585
by Cyrus Shahabi
from
Christian S. Jensen’s
Chapter 1
Definitions
■ Temporal DBMS manages time-referenced data,
hence, times are associated with database
entities
■ Two types of time: valid time and transaction
time
■ Valid time, vt, of a fact (any logical statement that
is either true or false) is the collected times
(possibly spanning the past, present & future)
when the fact is true
■ Although all facts have a valid time, the valid
time of a fact may not necessarily be recorded in
the database (unknown or irrelevant to the app.)
◆ If a database models different worlds, database facts
might have several valid times, one for each world
Modeling
■ BCDM pros:
◆ Since no two tuples with mutually identical explicit
values are allowed in BCDM relation instance, the full
history of a fact is contained in exactly one tuple
◆ Relation instances that are syntactically different have
different information content and vice versa
■ BCDM cons:
◆ Bad internal representation and display to users of
temporal info
◆ Varying length and voluminous timestamps of tuples
are impractical to manage directly
◆ Timestamp values are hard to comprehend in BCDM
format
■ Fixed-length format for tuples, where each
tuple’s timestamp encodes a rectangular or stair-
based bitemporal region
■ Several tuples may be needed to represent a
single fact
cID    TapeNum   Ts   Te   Vs   Ve
C101   T1234     2    UC   2    4
C102   T1245     5    7    5    now
C102   T1245     8    UC   5    7
C102   T1234     9    9    9    11
C102   T1234     10   13   9    13
C102   T1234     14   15   9    now
C102   T1234     16   UC   9    15
■ C101 rents T1234 on May 2nd for 3 days, & returns it on 5th
■ C102 rents T1245 on 5th open-ended, & returns it on 8th
■ C102 rents T1234 on 9th to be returned on 12th. On 10th the rent is
extended to include 13th but tape is not returned until 16th.
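To see how the fixed-length bitemporal tuples are read, the following Python sketch answers the question "at transaction time tt, which rentals did the database believe to be valid at time vt?". It is an illustration under stated assumptions: intervals are treated as closed, UC ("until changed") is modeled as infinity, and a valid-time end of "now" is read as the transaction time being queried, which gives the stair-shaped region. The names `timeslice` and `RENTALS` are hypothetical.

```python
UC = float("inf")   # "until changed": the tuple is still current
NOW = "now"         # valid-time end that grows with the current time

# (cID, TapeNum, Ts, Te, Vs, Ve), copied from the table above.
RENTALS = [
    ("C101", "T1234", 2, UC, 2, 4),
    ("C102", "T1245", 5, 7, 5, NOW),
    ("C102", "T1245", 8, UC, 5, 7),
    ("C102", "T1234", 9, 9, 9, 11),
    ("C102", "T1234", 10, 13, 9, 13),
    ("C102", "T1234", 14, 15, 9, NOW),
    ("C102", "T1234", 16, UC, 9, 15),
]


def timeslice(rows, tt, vt):
    """Rentals that the database state at transaction time tt records as valid at vt."""
    result = set()
    for cid, tape, ts, te, vs, ve in rows:
        ve_bound = tt if ve == NOW else ve   # assumption: "now" = queried transaction time
        if ts <= tt <= te and vs <= vt <= ve_bound:
            result.add((cid, tape))
    return result


print(timeslice(RENTALS, tt=10, vt=12))
```

For example, on the 9th the database did not yet record tape T1234 as rented on the 12th; after the extension entered on the 10th it does.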
■ Non-first-normal-form representation
■ Relation is thought of as recording
information about some types of objects
(e.g., information about customers)
CustomerID                    TapeNum
[2, Now] x [2,4]     C101     [2, Now] x [2,4]     T1234

[5, 7] x [5, inf]    C102     [5, 7] x [5, inf]    T1245
[8, Now] x [5, 7]             [8, Now] x [5, 7]
[9,9] x [9, 11]               [9,9] x [9, 11]      T1234
[10,13] x [9, 13]             [10,13] x [9, 13]
[14,15] x [9, inf]            [14,15] x [9, inf]
[16, Now] x [9, 15]           [16, Now] x [9, 15]
■ C101 rents T1234 on May 2nd for 3 days, & returns it on 5th
■ C102 rents T1245 on 5th open-ended, & returns it on 8th
■ C102 rents T1234 on 9th to be returned on 12th. On 10th the rent is
extended to include 13th but tape is not returned until 16th.
■ Note that 2nd tuple records two facts: rental
information for customer C102 for the two tapes
■ Pros of the two latter models:
◆ No need to update the relation at every tick; this is
achieved by introducing a “now” variable that assumes
the current value
■ Two choices to enter time values into relations
1. At the level of tuples (tuple timestamping)
2. At the level of attribute values (attribute timestamping)
Relation 1     Relation 2     Relation 3
A  Vs  Ve      A  Vs  Ve      A  Vs  Ve
a  2   8       a  2   4       a  2   8
b  2   8       a  5   8       b  2   4
               b  2   8       b  5   8
■ BCDM only allows coalesced relation
instances, i.e., relations are only different
if they are not snapshot equivalent
◆ The last two relations are not legal in BCDM
■ However, the three relations are not
equivalent from an interval-based view:
◆ First relation: a tape was checked out for 7
days
◆ Second relation: the tape was checked out for
3 days initially and then for 4 more days
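Coalescing is what makes the three relations indistinguishable from a snapshot point of view. The sketch below is a minimal illustration with an assumed name (`coalesce`) and assumed conventions: tuples are (A, Vs, Ve) with closed integer intervals, and value-equivalent tuples are merged when their intervals overlap or are adjacent.

```python
def coalesce(rows):
    """Merge value-equivalent tuples whose closed integer intervals overlap or meet."""
    out = []
    for a, vs, ve in sorted(rows):
        if out and out[-1][0] == a and vs <= out[-1][2] + 1:
            # Same explicit value and no gap: extend the previous interval.
            out[-1] = (a, out[-1][1], max(out[-1][2], ve))
        else:
            out.append((a, vs, ve))
    return out


R1 = [("a", 2, 8), ("b", 2, 8)]
R2 = [("a", 2, 4), ("a", 5, 8), ("b", 2, 8)]
R3 = [("a", 2, 8), ("b", 2, 4), ("b", 5, 8)]
print(coalesce(R2), coalesce(R3))
```

All three relations coalesce to the first one, which is why only the first is a legal BCDM instance; the interval-based distinction (one rental of 7 days versus 3 days plus 4 days) is lost in the process.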
Querying
■ Temporal queries “can” be expressed via
conventional query languages such as SQL (e.g.,
current temporal applications); however, with
great difficulty
S-CheckedOut
cID    TapeNum
C101   T1234
C102   T1425
C102   T1324
C103   T1243
V-CheckedOut
cID    TapeNum   Vs   Ve
C101   T1234     2    now
C101   T1245     5    10
C102   T1245     22   25
C102   T1425     9    19
C102   T1434     4    14
C102   T1324     9    now
C103   T1243     7    21
■ At time 17, the first relation is a snapshot of the
second
■ Many modeling issues impact the language
design, e.g., time stamping tuples or attributes
■ Language design must consider:
◆ time-varying nature of data,
◆ predicated on temporal values,
◆ temporal constructs,
◆ supporting states and/or events,
◆ supporting multiple calendars,
◆ modification of temporal relations,
◆ cursors, views, integrity constraints, handling now,
aggregates, schema versioning, periodic data
DBMS Design
Logical Design
Conceptual Design
■ ER diagrams become obscure and cluttered
when an attempt is made to capture temporal
aspects (see example)
■ CheckedOut relationship should become ternary
by introducing an artificial entity set to capture
time of rental
■ However, still issues remain: varying rental price
over time, transaction time inclusion, …
■ Some industrial solution: ignore temporal
aspects in the ER diagram and supplement it
with textual phrases, e.g., “full temporal support”
◆ ⇒ no automatic mapping from ER to model
■ Dozens of temporally enhanced ER models proposed
DBMS Implementation
Query Processing
Implementation of Algebraic Operators
■ Efficient implementation of temporal selection,
joins, aggregates, and duplicate elimination ⇒
temporal index structures
■ A variety of binary temporal joins have been
proposed (time-join, time-equijoin, …) as
extensions of nested-loop or merge joins that
exploit sort orders or local workspace, as well as
partition-based joins
■ Also, incremental techniques for implementing
operators on relations capturing transaction time
have been discussed
◆ Caching the results of previous computations to be
reused later (easy to do since the records of updates,
i.e., changes to previously cached results, are already
contained in a temporal DBMS)
Indexing Structures
Summary
■ Popular approaches:
◆ Snapshot-based semantics for database design
◆ BCDM for modeling
◆ TSQL2 as a query language
■ Well understood issues (some with efficient
implementation):
◆ Semantics of the time domain: its structure,
dimensionality, and indeterminacy
◆ Representational issues and operations on timestamps
◆ Temporal joins, aggregates and coalescing
◆ Temporal index structures supporting vt, tt, or both
◆ Prototype implementations of temporal DBMS
Open Problems
■ Legacy awareness
■ Architecture awareness
■ Visualization of temporal data
■ Conceptual design
■ Performance (cost models for temporal
operators and maintaining statistics for
query optimizer)
Multimedia Storage Servers
Cyrus Shahabi
shahabi@usc.edu
Integrated Media Systems Center &
Computer Science Department
University of Southern California
Los Angeles CA, 90089-0781 1
CSCI585-
CSCI585-Spring2001
OUTLINE
Introduction
Continuous Media
Magnetic Disk Drives
Display of CM (single disk, multi-disks )
Optimization Techniques
Additional Issues
Case Study (Yima)
[Figure: server components: Network, Storage Manager, Memory]
Some Applications
Video-on-demand
News-on-demand
News-editing
Movie-editing
Interactive TV
Digital libraries
Distance Learning
Medical databases
NASA databases
Continuous Display
Data should be transferred from the storage device to the
memory (or display) at a pre-specified rate.
Otherwise: frequent disruptions & delays, termed hiccups
NTSC quality: 45Mb/s
[Figure: Disk, Memory]
Compression
MPEG-1 30:1 reduction in both size and
bandwidth requirement (NTSC 45 Mb/s is
reduced to 1.5 Mb/s)
MPEG-2 3-10:1 reduction
(NTSC ~ 4, DVD ~ 8, hdtv ~ 20 Mb/s)
Problem: lose information
(cannot be tolerated by some applications,
NASA)
Disk Seek Time Model
\[
T_{Seek} =
\begin{cases}
c_1 + (c_2 \times \sqrt{d}) & \text{if } d < z \text{ cylinders} \\
c_3 + (c_4 \times d) & \text{if } d \ge z \text{ cylinders}
\end{cases}
\]
\[
T_{AvgRotLatency} = \frac{1}{2} \times \frac{60\ \text{sec}}{rpm}
\]
CSCI585-
CSCI585-Spring2001
Disk Service Time Model
\[
T_{Service} = T_{Transfer} + T_{AvgRotLatency} + T_{Seek}
\]
\[
BW_{Effective} = \frac{B}{T_{Service}}, \qquad T_{Transfer} = \frac{B}{BW_{Max}}
\]
TTransfer: data transfer time [s]
TAvgRotLatency: average rotational latency [s]
TService: service time [s]
B: block size [MB]
BWEffective: effective bandwidth [MB/s]
Sample Calculations
Assumptions:
TSeek = 10 ms
BWMax = 20 MB/s
Spindle speed: 10,000 rpm
\[
BW_{Effective} = \frac{B}{\dfrac{B}{BW_{Max}} + \dfrac{30\ \text{sec}}{rpm} + T_{Seek}}
\]
B             1 KB         10 KB       100 KB      1 MB         10 MB
BWEffective   0.076 MB/s   0.74 MB/s   5.55 MB/s   15.87 MB/s   19.49 MB/s
% of BWMax    0.38%        3.7%        27.8%       79.4%        97.5%
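The table can be recomputed directly from the service time model. The sketch below uses the stated assumptions (TSeek = 10 ms, BWMax = 20 MB/s, 10,000 rpm); the function name `effective_bandwidth` is made up, and block sizes are taken in decimal units (1 KB = 0.001 MB), which is an assumption that happens to reproduce the values above.

```python
def effective_bandwidth(block_mb, bw_max=20.0, seek_s=0.010, rpm=10000):
    """Effective bandwidth in MB/s for a block of `block_mb` megabytes."""
    t_transfer = block_mb / bw_max          # TTransfer = B / BWMax
    t_rot = 30.0 / rpm                      # TAvgRotLatency = (1/2) * (60 sec / rpm)
    t_service = t_transfer + t_rot + seek_s
    return block_mb / t_service             # BWEffective = B / TService


for label, block_mb in [("1 KB", 0.001), ("10 KB", 0.01), ("100 KB", 0.1),
                        ("1 MB", 1.0), ("10 MB", 10.0)]:
    bw = effective_bandwidth(block_mb)
    print(f"{label:>6}: {bw:7.3f} MB/s ({100 * bw / 20.0:5.2f}% of BWMax)")
```

The fixed 13 ms of overhead per block dominates for small blocks, which is why the effective bandwidth only approaches BWMax for blocks of several megabytes.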
Summary
Average rotational latency depends on the spindle
speed of the disk platters (rpm)
Seek time is a non-linear function of the number
of cylinders traversed
Average rotational latency + seek time = overhead
(wasteful)
Average rotational latency and seek time reduce
the maximum bandwidth of a disk drive to the
effective bandwidth
Round-robin Display
[Figure: timeline. Retrieve from disk: X1, Y3, X2, Y4, X3, Y5, with seek time between retrievals. Display from memory: X1, X2, X3 for one stream and Y3, Y4, Y5 for the other.]
Cycle-based Display
[Figure: timeline. Retrieve from disk, one group of blocks per cycle: X1, Y3, Z5; then X2, Y4, Z6; then X3, Y5, Z7. Display from memory: X1, Y3, Z5; then X2, Y4, Z6.]
[Figure: Subcycle 1, Subcycle 2]
Hybrid
For the blocks retrieved within a region, use the GSS
scheme
This is the most general approach
Tp=?, N=?, Memory=?, max-latency=?
By varying R and g, all the possible display techniques
can be achieved
Round-robin (R=1, g=N)
Cycle-based (R=1, g=1)
Constrained placement (R>0, g=1), ...
A configuration planner calculates the optimal values of
R & g for a given application.
Multiple-disks
Single disk: even in the best case with 0 seek time,
68/1.5 = 45 MPEG-1 streams
Typical applications (MOD): 1000 streams
Solution: aggregate bandwidth and storage space of
multiple disk drives
How to place a video?
[Figure label: Memory]
RAID Striping
All disks take part in transmission of a block
Can be conceptualized as a single disk
Even distribution of display load
Efficient admission
Is not scalable in throughput
[Figure: block X1 declustered across disks d1, d2, d3 as fragments X1.1, X1.2, X1.3; block X2 as X2.1, X2.2, X2.3]
Round-robin retrieval
Only a single disk takes part in transmission of each block
Retrieval schedule
Round-robin retrieval of the blocks
Even distribution of display load
Efficient admission
Not scalable in latency
[Figure: blocks of objects X, Y, Z, W placed round-robin across disks d1, d2, d3, and the resulting retrieval schedule over time: X1,Y1,W1,Z1; X2,Y2,W2,Z2; X3,Y3,W3,Z3]
Hybrid Striping
Partition D disks into clusters of d disks
Each block is declustered across the d disks that constitute
a cluster (each cluster is a logical disk drive)
RAID striping within a cluster
Round-robin retrieval across the clusters
RAID striping (d=D), Round-robin retrieval (d=1)
C0 C1 C2
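The placement rule behind hybrid striping can be sketched as a mapping from a block index to the cluster, and the disks, that hold it. This is an illustration under assumptions not stated on the slide: clusters are numbered from 0, consecutive blocks go to consecutive clusters starting at a per-object start cluster, and disks are numbered consecutively within clusters. The name `hybrid_placement` is hypothetical.

```python
def hybrid_placement(block_index, start_cluster, total_disks, disks_per_cluster):
    """Return (cluster, disks) for one block under hybrid striping.

    total_disks is D and disks_per_cluster is d; the block is declustered
    across all d disks of its cluster (RAID striping within the cluster),
    and consecutive blocks visit clusters round-robin.
    """
    if total_disks % disks_per_cluster != 0:
        raise ValueError("D must be a multiple of d")
    num_clusters = total_disks // disks_per_cluster
    cluster = (start_cluster + block_index) % num_clusters
    first_disk = cluster * disks_per_cluster
    return cluster, list(range(first_disk, first_disk + disks_per_cluster))


print(hybrid_placement(1, 0, total_disks=6, disks_per_cluster=2))
```

With d = D every block lands on the single cluster that spans all disks (RAID striping); with d = 1 each block lands on exactly one disk (round-robin retrieval).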
Scalability
[Figure: worst latency (sec) versus factor of increase in resources (memory+disk), for round-robin retrieval]
[Figure: blocks of objects X and Y placed round-robin across clusters C0 to C5; groups G0 to G5 rotate across the clusters over time]
Admission Control
When a request arrives, search for an empty slot in the group
currently accessing the cluster that holds the first block
failure: look up a group and find no empty slot
success: find a group with an empty slot and assign the request to it
[Figure: groups G0 to G5 over clusters C0 to C5, which hold blocks of objects X, Y, and Z]
Optimization Techniques
Request Migration
Migrating a request from a busy group to a group with
more idle slots reduces the possible latency of future
requests
[Figure: groups G0 to G5 over clusters C0 to C5, which hold blocks of objects X, Y, and Z]
Object Replication
With multiple copies of an object, simultaneous checking
for an empty slot reduces the worst case startup latency
and the average latency
[Figure: groups G0 to G5 over clusters C0 to C5; two copies of object X are placed with different starting clusters (X0 on C2 and on C5)]
Maximize Throughput
Objective: support more requests than the
maximum throughput of the system!
Observation: for most of the applications (e.g.,
MOD) many requests for the same stream (e.g.,
newly released movies) arrive close to each other
Solution: Share streams among multiple requests
How?
C. Shahabi 37
CSCI585-
CSCI585-Spring2001
Stream Sharing
Batching requests: introduce a startup delay and
multiplex single stream to support all the requests
for the same stream
Piggybacking: if two (or more) streams are close
enough, speed-up one and slow down the other so
they meet and then share streams
Buffer sharing: keep blocks of the first request in
memory so that the second request can be served
from memory (rather than disk)
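Of the three techniques, batching is the simplest to sketch. The Python below groups requests for the same stream that arrive before a pending batch starts, so one disk stream serves all of them. It is an illustration under assumptions: every batch starts a fixed `window` after its first request, and the names `batch_requests` and `REQUESTS` are hypothetical.

```python
def batch_requests(requests, window):
    """Group (arrival_time, stream_id) requests into shared streams.

    Returns a list of (start_time, stream_id, arrival_times); each entry is
    one stream that is multiplexed to all requests in the batch.
    """
    batches = []
    pending = {}  # stream_id -> most recent batch for that stream
    for arrival, stream in sorted(requests):
        batch = pending.get(stream)
        if batch is not None and arrival <= batch[0]:
            batch[2].append(arrival)          # joins the batch that has not started yet
        else:
            batch = (arrival + window, stream, [arrival])   # startup delay = window
            pending[stream] = batch
            batches.append(batch)
    return batches


# Made-up arrivals: three requests for stream A, one for stream B.
REQUESTS = [(0, "A"), (3, "A"), (4, "B"), (12, "A")]
print(batch_requests(REQUESTS, window=5))
```

Four requests are served with three disk streams here; the price is a startup delay of up to `window` time units for the first request of each batch.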
Additional Issues (3)
Multi-zone disk drives (variable transfer rate)
E.g., Track-pairing
E.g., FIXB (elevator allocation of blocks over zones)
Heterogeneity
Different media types
Different disk types
Fault tolerance: what kind of failures are acceptable?
Replication vs. parity-based
OnLine reorganization
C. Shahabi 41
...
[Figure labels: Storage Systems; 100 Mb/s or 1 Gb/s; NTSC (MPEG-2); 10.2 Channel Audio]
Multi-node, scalable server architecture
Media format independent: supports DVD MPEG-2 (8 Mb/s), HD MPEG-2 (20
Mb/s), MPEG-4 (800 Kb/s), 10.2 channel audio, etc.
Standard transmission protocols: RTP, RTSP
Selective retransmission of lost media packets for improved playback quality
Synchronization across multiple media streams
C. Shahabi 42