
Databases and Data Mining

Lecture 1: Introduction to Data Mining for Bioinformatics


Fall 2005 Peter van der Putten (putten_at_liacs.nl)

Course Outline

Objective
Understand the basics of data mining
Gain understanding of the potential for applying it in the bioinformatics domain
Limited hands-on experience

Schedule
Date        Time            Room      Activity
4-Nov-05    13.45 - 15.30   174       Lecture: Introduction
18-Nov-05   13.45 - 15.30   174       Lecture: Predictive Data Mining
            15.45 - 17.30   306/308   Practical Assignments
25-Nov-05   13.45 - 15.30   403       Lecture: Descriptive Data Mining & Search
2-Dec-05    13.45 - 15.30   174       Lecture: Bioinformatics Data Mining Cases
            15.45 - 17.30   306/308   Practical Assignments

Evaluation
Practical assignment (2nd) plus take home exercise

Agenda Today
What is data mining?
A short summary of life
Data mining revisited

What is data mining?

Genomic Microarrays Case Study

Problem:
Leukemia (different types of leukemia cells look very similar)
Given data for a number of samples (patients), can we
Accurately diagnose the disease?
Predict the outcome for a given treatment?
Recommend the best treatment?

Solution
Data mining on micro-array data

Example: ALL/AML data


38 training patients, 34 test patients, ~7,000 patient attributes (microarray gene data)
2 classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)
Use the training data to build a diagnostic model

ALL

AML

Results on test data: 33 of 34 correct; the single error may be a mislabeled sample

Sources of (artificial) intelligence


Reasoning versus learning
Learning from data
Patient data, customer records, stock prices, piano music, criminal mug shots, websites, robot perceptions, etc.

Some working definitions.
Data Mining and Knowledge Discovery in Databases (KDD) are used interchangeably
Data mining =
The process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data

Multidisciplinary field originating from artificial intelligence, pattern recognition, statistics, machine learning, bioinformatics, econometrics, ...

A short summary of life
Bio Building Blocks
Biotech
Data Mining Applications

The Promise.


DNA, Proteins, Cells


From DNA to Proteins

Discovering the structure of DNA


James Watson & Francis Crick - Rosalind Franklin

The structure of DNA

DNA Trivia
DNA stores instructions for the cell to perform its functions
Double helix, two interwoven strands
Each strand is a sequence of so-called nucleotides
Deoxyribonucleic acid (DNA) comprises 4 different types of nucleotides (bases): adenine (A), thymine (T), cytosine (C) and guanine (G)
The nucleotide uracil (U) does not occur in DNA

Each strand is the reverse complement of the other
Complementary bases
A with T
C with G
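The pairing rule above can be sketched in a few lines of Python. This is a minimal illustration only; the function name and the example sequence are hypothetical, and the sketch assumes the strand contains only the four bases A, T, C and G.

```python
# Complementary base pairs as listed above: A with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the reverse complement of a DNA strand.

    Walks the strand back to front and swaps each base for its partner.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


example = "ATGCCG"  # hypothetical sequence, for illustration only
print(reverse_complement(example))  # prints CGGCAT
```

Applying the function twice returns the original strand, which is one quick way to check the mapping.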

DNA Trivia
Each nucleus contains 3 x 10^9 nucleotides
Human body contains 3 x 10^12 cells
Human DNA contains 26k expressed genes; in principle, each gene codes for a protein
DNA of different persons varies 0.2% or less
Human DNA contains 3.2 x 10^9 base pairs
ΦX-174 virus: 5,386
Salamander: 100 x 10^9
Amoeba dubia: 670 x 10^9

Primary Protein Structure


Proteins are built out of peptides, which are polymer chains of amino acids
Twenty amino acids are encoded by the standard genetic code shared by nearly all organisms and are called standard amino acids (100 amino acids exist in nature)

Protein Structure from Primary to Quaternary

Proteins: 3D Structure

A representation of the 3D structure of myoglobin, showing coloured alpha helices. This protein was the first to have its structure solved by X-ray crystallography by Max Perutz and Sir John Cowdery Kendrew in 1958, which led to them receiving a Nobel Prize in Chemistry. http://en.wikipedia.org/wiki/Protein

Proteins: 3D Structure

Molecular surface of several proteins showing their comparative sizes. From left to right are: Antibody (IgG), Hemoglobin, Insulin (a hormone), Adenylate Kinase (an enzyme), and Glutamine Synthetase (an enzyme).

Proteins: 3D Structure

G Protein-Coupled Receptors (GPCR) represent more than half the current drug targets

DNA Codes for Proteins but Proteins also Control Gene Expression

Protein regulation occurs at each step of synthesis

Repressor Protein Switching Genes On and Off

Regulatory Protein Coordinating Gene Expression

Importance of Combinatorial Gene Control

Combinations of a few gene regulatory proteins can generate many different cell types during development

Some working definitions.
Bioinformatics =

Bioinformatics is the research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data [http://www.bisti.nih.gov/].
Or, more pragmatically: Bioinformatics or computational biology is the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems [Wikipedia, Nov 2005]

NCBI Tools for data mining:


Nucleotide sequence analysis
Protein sequence analysis
Structures
Genome analysis
Gene expression

Data mining or not?

Bioinformatics and data mining

From sequence to structure to function
Genomics (DNA), Transcriptomics (RNA), Proteomics (proteins), Metabolomics (metabolites)
Pattern matching and search
Sequence matching and alignment
Structure prediction
Predicting structure from sequence
Protein secondary structure prediction

Function prediction
Predicting function from structure Protein localization

Expression analysis
Genes: microarray data analysis etc.
Proteins

Regulation analysis

Bioinformatics and data mining

Classical medical and clinical studies
Medical decision support tools
Text mining on medical research literature (MEDLINE)
Spectrometry, imaging
Systems biology and modeling biological systems
Population biology & simulation

Spin-off: Biologically inspired computational learning


Evolutionary algorithms, neural networks, artificial immune systems

Examples of my related research

Topology preserving property of self-organizing maps
Neural network for clustering & classification inspired by cortical maps

Benchmarking Artificial Immune Systems

Predicting throat cancer survival rate
Value of fusing data from various sources for this purpose

Automated recognition of sick yeast cells in images (with prof. Verbeek)

Recommender systems in bioinformatics
Amazon.com style recommendations

Data mining revisited

Some working definitions.

Concepts: kinds of things that can be learned
Aim: intelligible and operational concept description
Example: the relation between patient characteristics and the probability of being diabetic

Instances: the individual, independent examples of a concept
Example: a patient, candidate drug etc.

Attributes: measuring aspects of an instance
Example: age, weight, lab tests, microarray data etc.

Pattern or attribute space

Data mining tasks


Predictive data mining
Classification: classify an instance into a category
Regression: estimate some continuous value

Descriptive data mining


Matching & search: finding instances similar to x
Clustering: discovering groups of similar instances
Association rule extraction: if a & b then c
Summarization: summarizing group descriptions
Link detection: finding relationships

Data Mining Tasks: Search

Finding best matching instances
Every instance is a point in pattern space.
Attributes are the dimensions of an instance, e.g. age, weight, gender etc.
Pattern spaces may be high dimensional (10 to thousands of dimensions)

[Figure: instances as points in pattern space; axes: e.g. age, e.g. weight]
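Search in pattern space can be sketched as ranking instances by their distance to a query point. The sketch below assumes Euclidean distance, which the slides do not prescribe, and uses made-up (age, weight) instances; in practice attributes would usually be rescaled first so that no single one dominates the distance.

```python
import math

# Hypothetical instances in a 2-dimensional pattern space: (age in years, weight in kg).
instances = [(25, 60.0), (47, 82.0), (52, 90.0), (70, 68.0)]


def euclidean(a, b):
    """Straight-line distance between two points with the same number of dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_matches(query, data, k=2):
    """Return the k instances closest to the query point, nearest first."""
    return sorted(data, key=lambda inst: euclidean(query, inst))[:k]


print(best_matches((50, 85.0), instances))  # prints [(47, 82.0), (52, 90.0)]
```

The same code works unchanged for instances with more attributes, since the distance sums over however many dimensions the points have.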

Data Mining Tasks: Clustering

Clustering is the discovery of groups in a set of instances
Groups are different, instances in a group are similar
In 2 to 3 dimensional pattern space you could just visualise the data and leave the recognition to a human end user

[Figure: clusters of instances in pattern space; axes: e.g. age, e.g. weight]

In >3 dimensions this is not possible
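When the data cannot be inspected visually, the grouping has to be found by an algorithm. The slides do not name one at this point; the sketch below uses k-means purely as an illustration, with hypothetical 4-dimensional instances (age, weight, blood pressure, a lab value) and starting centroids chosen by hand.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it, and repeat."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical instances: (age, weight in kg, systolic blood pressure, lab value).
points = [
    (30, 60, 120, 4.5), (32, 62, 118, 4.7), (29, 58, 122, 4.4),
    (65, 90, 150, 7.9), (68, 88, 155, 8.1), (66, 92, 148, 8.0),
]

centroids, clusters = kmeans(points, [points[0], points[3]])
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

The number of groups (here two) is an input to this particular method, which is a design choice rather than something the data provides.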

Data Mining Tasks: Classification

The goal of a classifier is to separate classes on the basis of known attributes
The classifier can be applied to an instance with unknown class
For instance, classes are healthy (circle) and sick (square); attributes are age and weight

[Figure: healthy and sick instances in pattern space; axes: age, weight]

Examples of Classification Techniques


Majority class vote
Machine learning & AI
Decision trees
Nearest neighbor
Neural networks
Genetic algorithms / evolutionary computing
Artificial Immune Systems
Good old statistics
...
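The first entry in the list is a useful baseline. The sketch below assumes that "majority class vote" means always predicting the most frequent class in the training data, which is one common reading and not something the slides spell out; the labels are made up.

```python
from collections import Counter


def majority_class(train_labels):
    """Return the most frequent class label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


# Hypothetical training labels: 7 healthy patients, 3 sick patients.
train_labels = ["healthy"] * 7 + ["sick"] * 3

prediction = majority_class(train_labels)
print(prediction)  # prints healthy; every new instance gets this same label
```

Any more elaborate classifier should at least beat the accuracy of this baseline on the test data.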

Example Classification Algorithm 1: Decision Trees

20000 patients: age > 67?
  yes: 1200 patients: weight > 85 kg?
    yes: 400 patients, diabetic (50%)
    no: 800 patients, diabetic (10%)
  no: 18800 patients: gender = male?
    etc.
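Applying the example tree to a new patient amounts to following its splits from the root to a leaf. The sketch below hard-codes only the splits given in the example; the function name is hypothetical, and the branch for patients aged 67 or younger returns None because the slide leaves that subtree as "etc.".

```python
def diabetic_rate(age, weight_kg):
    """Follow the splits of the example tree and return the fraction of
    diabetic patients in the leaf that the instance ends up in.

    Returns None for the branch whose subtree is not specified.
    """
    if age > 67:
        if weight_kg > 85:
            return 0.50  # leaf with 400 patients
        return 0.10      # leaf with 800 patients
    return None          # gender split and everything below it are elided


print(diabetic_rate(72, 90))  # prints 0.5
print(diabetic_rate(72, 80))  # prints 0.1
print(diabetic_rate(40, 90))  # prints None
```

A tree learner would derive such splits from the training data instead of having them written by hand.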

Decision Trees in Pattern Space

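The goal of a classifier is to separate classes (circle, square) on the basis of the attributes age and weight

Each line corresponds to a split in the tree

Decision areas are tiles in pattern space

[Figure: pattern space divided into rectangular tiles by axis-parallel lines; axes: age, weight]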

Example classification algorithm 3: Neural Networks


Inspired by neuronal computation in the brain (McCulloch & Pitts, 1943 (!))

[Figure: network diagram; input: e.g. customer characteristics; output: e.g. response]

Input (attributes) is coded as activation on the input layer neurons; activation feeds forward through a network of weighted links between neurons and causes activations on the output neurons (for instance diabetic yes/no)
The algorithm learns the optimal weights from the training instances using a general learning rule.

Neural Networks
Example simple network (2 layers)

Inputs: age, body_mass_index
Weights: weight_age, weight_body_mass_index
Output: probability of being diabetic

$$P(\text{diabetic}) = f\left(\text{age} \cdot \text{weight}_{\text{age}} + \text{body\_mass\_index} \cdot \text{weight}_{\text{body\_mass\_index}}\right)$$
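The formula for the simple network can be tried out directly. The slides do not say which function f is; the sketch below assumes the logistic sigmoid, adds an optional bias term that is not in the formula above, and uses made-up weights that have not been learned from any data.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def diabetic_probability(age, body_mass_index, weight_age, weight_bmi, bias=0.0):
    """Output of the simple network: f applied to the weighted sum of the inputs.

    The bias term is an assumption added for illustration; with bias=0.0 this
    is exactly f(age * weight_age + body_mass_index * weight_bmi).
    """
    return sigmoid(age * weight_age + body_mass_index * weight_bmi + bias)


# Hypothetical weights; a learning rule would normally set these from training data.
p = diabetic_probability(age=50, body_mass_index=30,
                         weight_age=0.04, weight_bmi=0.1, bias=-5.0)
print(round(p, 3))  # prints 0.5
```

Because the weighted sum is linear in the inputs, the points where the output equals 0.5 form a straight line in the (age, body_mass_index) plane.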

Neural Networks in Pattern Space

Classification
Simple network: only a line is available (why?) to separate classes

Multilayer network:
Any classification boundary possible

[Figure: class boundaries in pattern space; axes: e.g. age, e.g. weight]

Descriptive data mining: association rules


Discovery of interesting patterns
Rule format: if A (and B and C etc.) then Z
Example:
If customer buys potatoes (A) and sauerkraut (B) then customer buys sausage (Z)

Important measures
Support of the condition: how often do potatoes and sauerkraut occur together (A, B)
Confidence of the rule: how often do sausages then occur as well, divided by the support of the condition (is A, B -> Z always true?)

Could be used for instance for mining gene expression data
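The two measures can be computed directly from a list of transactions. The sketch below uses made-up shopping baskets and hypothetical function names; support is expressed as a fraction of all transactions, and confidence as the support of condition plus consequence divided by the support of the condition.

```python
# Hypothetical shopping baskets, one set of items per customer visit.
baskets = [
    {"potatoes", "sauerkraut", "sausage"},
    {"potatoes", "sauerkraut", "sausage"},
    {"potatoes", "sauerkraut"},
    {"potatoes", "bread"},
    {"sausage", "bread"},
]


def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(condition, consequence, transactions):
    """Of the transactions matching the condition, the fraction that also
    contain the consequence."""
    both = set(condition) | set(consequence)
    return support(both, transactions) / support(condition, transactions)


condition = {"potatoes", "sauerkraut"}
print(support(condition, baskets))                    # 3 of 5 baskets
print(confidence(condition, {"sausage"}, baskets))    # 2 of those 3 baskets
```

For gene expression data, a transaction could for example be a sample and an item a gene that is over-expressed in it; that mapping is an illustration, not something the slides specify.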

Quiz Question

What have we learned today
An introduction into applying data mining for bioinformatics
A short history of life
Basic data mining concepts
