Académique Documents
Professionnel Documents
Culture Documents
Week 1
Spring 2006
Melanie Nelson, Ph.D.
Introductions
About me
Experience with
Class introductions
Copyright Melanie Nelson 2006
Course Overview
Goals
Grading
Intro to DBs
Overview of common types of biological data
Introduction to biology-specific problems/issues
2% participation
10% homework
38% midterm
50% final project
Course Overview
Online resources
My website:
www.32geeks.com/classes/biodb_design_2006
Course Overview
Week 1
Week 2
Introduction to databases
Fundamentals of the relational model (includes a brief intro
to SQL)
Database design process
ER diagrams
Normalization
Week 3
Course Overview
Week 4
Week 5
Week 6
Final Project
Requirements document
ER or UML diagram
Short (1-2 page) report describing any difficult or unusual
design decisions
Make 10-15 minute presentation about DB to class
Copyright Melanie Nelson 2006
Introduction to Databases
Information in a database is
Structured (searchable)
Capable of being shared with multiple applications
(multiple uses)
User manually
transfers data
App #1
App #1
Files
App #2
App #3
App #2
Files
App #3
Files
App #1
App #2
Database
No need to
transfer
data among
apps
DBMS
App #3
Hierarchical
Network
Relational
Object-Oriented
Hierarchical Databases
Cell
Information organized
Type 2
into tree, or parent-child
CHO
Cells
relationships
integrin IV
p53
Data gets duplicated
calmodulin
Medium
when one child has
expression
integrin IV
more than one parent
Data gets lost when a High
DUSP-2
High
expression
expression
child doesnt have a
parent
Low
High
expression
expression
ICE
Hierarchical Databases
Network Databases
Extends hierarchical
model to allow children
to have multiple parents
Model has:
Records
Links between records
HEK
cells
CHO
cells
CHO
cell
High
Exp.
CHO
cell
Low
Exp.
calmodulin
HEK
cell
Medium
Exp.
p53
DUSP-2
integrin IV
HEK
cell
High
Exp.
ICE
Network Databases
Relational Databases
Information is modeled
as tables (relations)
with links between
tables
Rigorous mathematical
basis
Cell Line
Cell line ID
Cell line 1
CHO cells
Cell line 2
HEK cells
Protein Expression
Cell line ID
Protein ID
Expression level
Cell line 1
Protein 1
High
Cell line 1
Protein 2
Low
Cell line 1
Protein 3
High
Cell line 2
Protein 2
Medium
Cell line 2
Protein 4
High
Protein
Protein ID
Protein Name
Protein 1
calmodulin
Protein 2
integrin IV
Protein 3
DUSP-2
Protein 4
ICE
Protein 5
p53
Relational Databases
Oracle
IBM DB2
MS SQL Server
PostgreSQL
MySQL
Object-Oriented Databases
CellLine
Type: CHO cells
Protein
Name:Calmodulin
Expression: High
CellLine
Type: HEK cells
Protein
Name: Integrin IV
Expression: Medium
Protein
Name: Integrin IV
Expression: Low
Protein
Name: DUSP-2
Expression: High
Protein
Name: p53
Expression: High
Protein
Name: ICE
Expression: ???
Object-Oriented Databases
Data structure
Data manipulation
Data integrity
Mathematically rigorous
Relational Terms
Relation = Table
Consists of
heading (a fixed set
of attributes)
body (a set of tuples)
Attribute = Column
Also called a field
Proteins
Protein ID
Protein Name
Protein 1
calmodulin
Protein 2
integrin IV
Protein 3
DUSP-2
Protein 4
ICE
Protein 5
p53
Tuple = Row
Also called a
record.
A set of
attribute:value
pairs
Protein ID
Protein Name
Protein ID
Protein Name
Protein 1
Calmodulin, CaM
Protein 1
Calmodulin
Protein 3
Protein 1
CaM
Protein 3
DUSP-2
Protein 3
Protein 3
PAC1
Protein ID
Protein Name 1
Protein Name 2
Protein 1
Calmodulin
CaM
Protein 2
DUSP-2
Dual specificity
phosphatase 2
Protein Name 3
PAC1
Types of Relations
snapshot
Used to improve performance
Can lead to integrity issues
Copyright Melanie Nelson 2006
Data Integrity
Entity Integrity
Referential Integrity
Domain Integrity (other rules)
Copyright Melanie Nelson 2006
Entity Integrity
Candidate Keys
Candidate keys
Study_
in_Lab
Available_Protein
Homo
sapiens
human
Protein_
ID
Protein_
Name
Species_
ID
Protein 1
calmodulin
Mus
musculus
house mouse
Protein 2
integrin IV
56
56
Bos taurus
cow
Protein 3
DUSP-2
Species_
ID
1
Species_
Sci_Name
Foreign key of
child table
Primary key
of parent table
Copyright Melanie Nelson 2006
Referential Integrity
A foreign key value must either
Species
Species_
ID
Species_
Sci_Name
Species_
Common_
Name
Study_i
n_Lab
Homo
sapiens
human
Mus
musculus
house mouse
Bos taurus
cow
56
Available_Protein
Protein_
ID
Protein_
Name
Species_
ID
Protein 1
calmodulin
Protein 2
integrin IV
56
Protein 3
DUSP-2
Protein 9
PTP1B
72
Referential Integrity
Domain Integrity
Attribute integrity
Business rules
Introduction to SQL
Except that the spec says SQL doesnt stand for anything
Introduction to SQL
Example Tables
Z
Bio_molecule_sequence
Bio_mol_id INTEGER (FK)
Source_database VARCHAR2 (32)
Date_inserted DATETIME
Seq_text CLOB
Bio_molecule
Bio_mol_id INTEGER
Species VARCHAR2 (100)
Bio_mol_type_code CHAR (1) (FK)
Function_desc VARCHAR2 (2000) (O)
Bio_molecule_name
Bio_molecule_type
Bio_mol_type_code CHAR (1)
Bio_mol_type VARCHAR2 (32)
Bio_mol_desc VARCHAR2 (500)
CREATE TABLE
CREATE TABLE
Create a table to store biological molecules
CREATE TABLE Bio_molecule (
Bio_mol_id INTEGER PRIMARY KEY,
Species VARCHAR2 (50) NOT NULL,
Bio_mol_type_code CHAR (1) NOT NULL,
Function_desc VARCHAR2 (2000)
)
PRIMARY KEY is equivalent to UNIQUE, NOT NULL
Copyright Melanie Nelson 2006
ALTER TABLE
CREATE INDEX
INSERT
INSERT
Insert the name PTP1B for biological
molecule #1456. It is a primary name.
INSERT INTO Bio_molecule_name
(Bio_mol_id, Bio_mol_name, Primary_name)
VALUES (1456, PTP1B, Y)
Text is surrounded by single quotes.
Copyright Melanie Nelson 2006
UPDATE
UPDATE
Change calmodulin to be the primary name.
UPDATE Bio_molecule_name
SET
Primary_name = Y
WHERE Bio_mol_name = calmodulin
AND Bio_mol_id = 1
Bio_mol_id portion of where clause is probably
unnecessary.
Copyright Melanie Nelson 2006
DELETE
DELETE
Delete all Incyte sequence data
DELETE FROM Bio_molecule_sequence
WHERE Source_database = INCYTE
Bio_molecule_name
Bio_mol_id
Bio_mol_name
Primary_ name
Calmodulin
CaM
DUSP-2
Dual specificity
phosphatase 2
PAC1
Bio_mol_
id
Bio_mol_name
Primary_
name
Calmodulin
CaM
Bio_molecule_name
Bio_mol_
id
Bio_mol_name
Primary_
name
Calmodulin
CaM
Bio_mol_ name
Bio_mol
_id
DUSP-2
Calmodulin
Dual specificity
phosphatase 2
Dual specificity
phosphatase 2
PAC1
SELECT
SELECT
SELECT DISTINCT
SELECT DISTINCT
Find out what types of biological molecules are
represented in the Bio_molecule table:
SELECT DISTINCT Bio_mol_type_code
FROM Bio_molecule
SELECT *
FROM Bio_molecule, Bio_molecule_type
Species
Bio_
mol_
type_
code
Function
_desc
Homo
sapiens
Calcium
sensor
Mus
musculus
Phosphatase
Bio_molecule_type
Bio_mole_
type_code
Bio_mol_
type
Bio_mol_
desc
Protein
Expressed
protein
mRNA
Messenger
RNA
bm.
Bio_
mol
_id
bm.
Species
bm.
Bio_
mol_
type_
code
bm.
Function
_desc
bmt.
Bio_
mol_
type_
code
bmt.
Bio_
mol_
type
bmt.Bio_
mol_
desc
Homo
sapiens
Calcium
sensor
Protein
Expressed
protein
Homo
sapiens
Calcium
sensor
mRNA
Messenger RNA
Mus
musculus
Phosphatase
Protein
Expressed
protein
Mus
musculus
Phosphatase
mRNA
Messenger RNA
Relational Operators
Join
SELECT *
FROM Bio_molecule, Bio_molecule_type
WHERE Bio_molecule.Bio_mol_type_code =
Bio_molecule_type.Bio_mol_type_code
Copyright Melanie Nelson 2006
Species
Function
_desc
Homo
sapiens
Calcium
sensor
Mus
musculus
Phosphatase
Bio_molecule_type
Bio_mole_
type_code
Bio_mol_
type
Bio_mol_
desc
Protein
Expressed
protein
mRNA
Messenger
RNA
bm.
Bio_
mol
_id
bm.
Species
bm.
Bio_
mol_
type_
code
bm.
Function
_desc
bmt.
Bio_
mol_
type_
code
bmt.
Bio_
mol_
type
bmt.Bio_
mol_
desc
Homo
sapiens
Calcium
sensor
Protein
Expressed
protein
Mus
musculus
Phosphatase
mRNA
Messenger RNA
Types of join
Equi-join
Natural join
Outer join
JOIN
JOIN
Show the biomolecule type, rather than the code,
for all types represented in Bio_molecule:
SELECT DISTINCT Bio_mol_type
FROM Bio_molecule bm,
Bio_molecule_type bmt
WHERE bm.Bio_mol_type_code = bmt.Bio_mol_type_code
SELECT DISTINCT Bio_mol_type
FROM Bio_molecule bm
JOIN Bio_molecule_type bmt
ON bm.Bio_mole_type_code = bmt.Bio_mol_type_code
Copyright Melanie Nelson 2006
ORDER BY
SELECT
Bio_mol_name
FROM
Bio_molecule_name
WHERE
Bio_mol_id = 478
ASC or DESC
Copyright Melanie Nelson 2006
Aggregate Functions
COUNT
GROUP BY
Aggregate Functions
SELECT MAX(Date_inserted)
FROM Bio_molecule_sequence
Can be used on numeric and date fields
SUM
AVG
String Functions
Substrings
Length
Subqueries
Subqueries
EXISTS
Subqueries
Subqueries
CLOBs
SELECT *
FROM Protein_Sequence
UNION
SELECT *
FROM Nucleotide_Sequence
Copyright Melanie Nelson 2006
Sequence
Protein 1
ALVCYFMIEGD.
Protein 2
Biopol_ID
Sequence
Protein 1
ALVCYFMIEGD.
Protein 2
KLMIKAGGKLV.
DNA 1
ATTGCATTAGC.
DNA 2
GCGGTATGCC.
KLMIKAGGKLV.
Nucleotide_Sequence
Biopol_ID
Sequence
DNA 1
ATTGCATTAGC.
DNA 2
GCGGTATGCC.
Biopol
_ ID
Sequence
pI
Protein
1
ALVCYFMIEGD.
4.5
Protein
2
KLMIKAGGKLV.
Biopol_ID
Sequence
Protein 1
ALVCYFMIEGD.
Protein 2
KLMIKAGGKLV.
DNA 1
ATTGCATTAGC.
DNA 2
GCGGTATGCC.
7.3
Nucleotide_Sequence
Biopol
_ID
Sequence
Promoter
DNA 1
ATTGCATTAGC.
TATA
DNA 2
GCGGTATGCC.
TAAA
SELECT *
FROM Protein_Stock
INTERSECT
SELECT *
FROM Plasmid_Stock
Stock_location
Protein 1
Box 2
Protein 2
Stock_location
Protein 1
Box 2
Box 5
Plasmid_Stock
Protein_ID
Stock_location
Protein 1
Box 2
Protein 3
Box 3
SELECT *
FROM Protein_Stock
EXCEPT
SELECT *
FROM Plasmid_Stock
Copyright Melanie Nelson 2006
Stock_location
Protein 1
Box 2
Protein 2
Box 5
Stock_location
Protein 3
Box 3
Plasmid_Stock
Protein_ID
Stock_location
Protein 1
Box 2
Protein 3
Box 3
Protein_
Name
Species_
Sci_Name
Species_
Common_
Name
Protein 1
calmodulin
Homo
sapiens
human
Protein 2
integrin IV
Bos taurus
cow
Protein 1
calmodulin
Mus
musculus
house
mouse
Protein 3
ICE
Homo
sapiens
human
Lab_Species
Species_
Sci_Name
Species_
Common_ Name
Homo sapiens
human
Mus musculus
house mouse
Protein_
Name
Protein 1
calmodulin