2 Fontana Salsomaggiore2011

From sequence to function through genomics
Fontana Paolo
Fondazione Edmund Mach
HMM (Hidden Markov Model)

An HMM is a graph of connected states in which each state can potentially emit a symbol. The model is parameterized by probabilities that govern each state and the transitions between states. An HMM describes the probability of a given sequence against a potentially unlimited number of sequences. Suppose we have an alphabet made of two letters (a, b) and want to build a sequence using an HMM whose architecture consists of two states:
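As a minimal sketch of the two-state, two-letter model described above, the forward algorithm below computes the probability that the model emits a given sequence, summed over all hidden state paths. The state names and every probability are illustrative assumptions; the slides do not give the actual parameters.

```python
from itertools import product

# Hypothetical two-state HMM over the alphabet {a, b}.
# All numbers below are assumptions chosen for illustration only.
STATES = ("s1", "s2")
START = {"s1": 0.5, "s2": 0.5}
TRANS = {"s1": {"s1": 0.9, "s2": 0.1}, "s2": {"s1": 0.2, "s2": 0.8}}
EMIT = {"s1": {"a": 0.75, "b": 0.25}, "s2": {"a": 0.25, "b": 0.75}}

def sequence_probability(seq):
    """Forward algorithm: P(seq) summed over all hidden state paths.

    `seq` must be a non-empty string over the alphabet {a, b}.
    """
    # Initialisation: start in each state and emit the first symbol.
    forward = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    # Recursion: move to each state from any previous state, then emit.
    for symbol in seq[1:]:
        forward = {
            s: EMIT[s][symbol] * sum(forward[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(forward.values())

# The probabilities of all sequences of a fixed length sum to 1.
total = sum(sequence_probability("".join(p)) for p in product("ab", repeat=3))
```

Because the model assigns a probability to every possible string, summing over all strings of a given length returns 1, which is a convenient sanity check on the parameters.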
Whole Genome Shotgun (WGS)
- If a fragment is entirely contained within a repeat, there may be several positions where it can be placed, and if the copies are not exactly identical this can cause errors in the final consensus.
- Repeats can be positioned in a way that causes ambiguity, so that two or more layouts are compatible with the input fragments.
To order the contigs and thereby build a scaffold, BAC ends (reads located at the ends of a BAC) are used.
Genome structural variation

A mate pair that spans a deletion event maps to the corresponding regions of the reference, but the distance of the two reads is greater than the insert size, while if the event is an insertion then the distance is smaller. An inversion is detected if the orientation of the reads is flipped.
We can apply a similar concept to linked insertions and everted duplications
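The rule above can be sketched as a small classifier for a single mate pair. The function name, the `tolerance` parameter, and the label strings are assumptions introduced for illustration; real callers model the insert-size distribution of the library rather than using a fixed tolerance.

```python
def classify_mate_pair(mapped_distance, insert_size, orientation_ok=True, tolerance=200):
    """Label one mate pair by comparing its mapping on the reference
    with the expected insert size of the library.

    tolerance is a hypothetical slack (in bases) around the insert size.
    """
    if not orientation_ok:
        # Flipped read orientation suggests an inversion.
        return "inversion"
    if mapped_distance > insert_size + tolerance:
        # Reads map farther apart than expected: sequence missing in the sample.
        return "deletion"
    if mapped_distance < insert_size - tolerance:
        # Reads map closer than expected: extra sequence in the sample.
        return "insertion"
    return "concordant"
```

In practice a single discordant pair is weak evidence; calls are usually made only when several pairs support the same event.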
Protein-Coding Genes in Eukaryotes

Why are the proteomes of various eukaryotes similar in size, given the enormous phenotypic differences between eukaryotes? (Proteome: the complete set of all protein-encoding genes, or all the proteins produced by them.) Claverie calls this the N-value paradox (N is for number), while Betran and Long call this the G-value paradox (G is for genes).
Protein-Coding Genes in Eukaryotes
We do know that organisms such as worms and flies appear to have about 13,000 to 20,000 protein-coding genes, while plants, mice, and humans have only slightly more (about 20,000 to 40,000 genes). Why do organisms such as humans, having so much greater biological complexity than insects and nematodes, not have even twice as many genes? The genes of higher eukaryotes employ more complex forms of gene regulation, such as alternative splicing. Also, the architecture of individual genes tends to be more complex, for example with more domains present in an average human protein relative to insects.
Can you find a gene here?
Landmarks? Signals?
(hard to see)
The gene is human casein kinase II.
Introns make things harder
[Figure: a eukaryotic gene from 5' to 3': intergenic region, exon, intron, exon, intron, exon, intergenic region. Labels: start codon (ATG), splice sites, stop codon (TAG/TGA/TAA), 5' UTR, 3' UTR, mRNA transcript.]
Eukaryotic Gene Syntax

[Figure: the complete mRNA is ATG, coding segment, TGA. On the gene: ATG (start codon) ... exon ... GT (donor site), intron, AG (acceptor site) ... exon ... GT (donor site), intron, AG (acceptor site) ... exon ... TGA (stop codon).]
Regions of the gene outside of the CDS are called UTRs (untranslated regions), and are mostly ignored by gene finders, though they are important for regulatory functions.
Types of Exons
Four types of exons are defined, for convenience:
- initial exons extend from a start codon to the first donor site;
- internal exons extend from one acceptor site to the next donor site;
- final exons extend from the last acceptor site to the stop codon;
- single exons (which occur only in intronless genes) extend from the start codon to the stop codon.
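The exon types defined above depend only on the position of each coding exon within its gene, which a short sketch can make explicit. The function name and the (start, end) tuple representation are assumptions made for illustration.

```python
def label_exons(exons):
    """Label the coding exons of one gene, given in transcript order.

    `exons` is a list of (start, end) coordinate pairs; only the number
    and order of the exons matter for the labels.
    """
    if not exons:
        raise ValueError("a gene needs at least one coding exon")
    if len(exons) == 1:
        # Intronless gene: start codon to stop codon.
        return ["single"]
    # First exon starts at the start codon, last one ends at the stop codon,
    # everything in between runs from an acceptor site to a donor site.
    return ["initial"] + ["internal"] * (len(exons) - 2) + ["final"]
```

For example, a three-exon gene yields one label of each multi-exon type, while a two-exon gene has no internal exon.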
Known genes provide training signals for computerized gene finding

[Figure: example sequences with the signals labeled start, splice donor, splice acceptor, and stop: atg caggtg ggtgag cagatg ggtgag cagttg ggtgag tga; ggtgag caggcc]
What is Gene Prediction?

Gene prediction is the problem of parsing a sequence into nonoverlapping coding segments (CDSs) consisting of exons separated by introns.
Gene Prediction Approaches
Intrinsic (ab initio)

GENSCAN, FGENESH, GeneMark.hmm, GlimmerM, Genie;
Extrinsic (similarity-based)
Spliced alignment: GenomeScan, EuGene, FGENESH+, FGENESH_C, GeneId+, AUGUSTUS, etc.;
Genomic comparison: TwinScan, TWAIN, SLAM, SGP, FGENESH_2, etc.

Genscan
Generalized Hidden Markov Model (GHMM): the output of a state can be a string of finite length. Moreover, the probability distribution need not be the same for all states: for example, one state may use a weight matrix to generate its output sequence, while another may use an HMM. The states correspond to the functional units of a gene (promoter, exons, introns, ...) and the transitions between states must be biologically consistent.
General Things to Remember about (Protein-coding) Gene Prediction Software

- It is, in general, organism-specific.
- It works best on genes that are reasonably similar to something seen previously.
- It finds protein-coding regions far better than non-coding regions.
- In the absence of external (direct) information, alternative forms will not be identified.
- It is imperfect! (It's biology, after all.)
Homology: two genes or proteins are said to be homologous if they derive from a common ancestor. Homology is a qualitative property to which no percentage value can be assigned.
Similarity: a function that associates a numerical value with a pair of strings.
There are two different types of homology:
1. Two homologous sequences are orthologous if they belong to two different species and their divergence began with the speciation event that gave rise to those two species.
2. Two homologous sequences are paralogous if their divergence began with a gene duplication event.
Collinearity between Lg13 and Lg16 of apple
AGSGYWKATGTDKVITTEGRKVGIKKALVFYIGKAPKGTKTNWIMHEYRLLENSRKNGSSKVD
ALIGN
AGSGYWKATGADKPIGLPKPVGIKKALVFYAGKAPKGEKTNWIMHEYRLADVDRSVRKKKNSLRLD
ALGORITHM
AGSGYWKATGTDKVITTEGRKVGIKKALVFYIGKAPKGTKTNWIMHEYRLLENSR----KNGSSKVD
AGSGYWKATG DK I    + VGIKKALVFY GKAPKG KTNWIMHEYRL +  R    K  S ++D
AGSGYWKATGADKPIGLP-KPVGIKKALVFYAGKAPKGEKTNWIMHEYRLADVDRSVRKKKNSLRLD
Exact alignment algorithms

Global (Needleman and Wunsch): S1 and S2 are aligned over their full length.
Local (Smith and Waterman): the best-matching subregions of S1 and S2 are aligned.
1. The first step in aligning two sequences is to decide the score to assign to matches, mismatches, and gaps.
2. Build an n x m matrix (n is the length of S1 and m that of S2) in which every letter of S1 is compared with every letter of S2, and each comparison is assigned a score according to the scores chosen above.
3. From the matrix, derive the alignment with the highest global score.
       A  G  C  A  C  A  C  A
    0  1  2  3  4  5  6  7  8
A   1  0  1  2  3  4  5  6  7
C   2  1  1  1  2  3  4  5  6
A   3  2  2  2  1  2  3  4  5
C   4  3  3  2  2  1  2  3  4
A   5  4  4  3  2  2  1  2  3
C   6  5  5  4  3  2  2  1  2
T   7  6  6  5  4  3  3  2  2
T   8  7  7  6  5  4  4  3  3
S1: A_CACACTT
S2: AGCACAC_A

S1: A_CACACTT
S2: AGCACACA_
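The matrix shown above for S1 = ACACACTT and S2 = AGCACACA can be reproduced with a short dynamic-programming sketch. It assumes the scoring that the numbers imply (match 0, mismatch 1, gap 1, minimized), which is the distance form of global alignment; the function name and parameters are illustrative.

```python
def edit_distance_matrix(s1, s2, mismatch=1, gap=1):
    """Fill the global-alignment matrix with unit costs (lower is better).

    Rows follow s1, columns follow s2; entry [i][j] is the cheapest way
    to align the first i letters of s1 with the first j letters of s2.
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per letter.
    for i in range(1, rows):
        d[i][0] = i * gap
    for j in range(1, cols):
        d[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = d[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else mismatch)
            d[i][j] = min(diagonal, d[i - 1][j] + gap, d[i][j - 1] + gap)
    return d

matrix = edit_distance_matrix("ACACACTT", "AGCACACA")
```

The bottom-right cell gives the cost of the best global alignment; both alignments listed above reach it with two gaps and one mismatch.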
These algorithms are too slow to be applied to similarity searches against current biological databases.
BLAST
BLAST is based on a heuristic algorithm, which means that the alignment it produces is not exact. The BLAST algorithm can be divided into three parts.
1) Read all words of length W contained in the query sequence; for each of them, generate a list of neighboring words that score above a threshold T when aligned with the query word.
2) Scan all sequences in the database for W-mers that exactly match the words in the list just produced.
3) Check whether, and how far, each hit can be extended. This is done by trying to extend the alignment in both directions without inserting gaps. The result is an HSP (High-scoring Segment Pair) that cannot be extended further. The parameter S defines a score threshold above which an HSP is considered worth reporting.
Besides W, T, and S there is another important parameter, X, which determines how long the program should persist on a W-mer hit before stopping. The statistics underlying BLAST also relate the value of S to the expected number of HSPs that reach that threshold in a database of random sequences of the same size as the one considered:
$$E = K m n e^{-\lambda S}$$
where m and n are the lengths of the query and of the database, and K and $\lambda$ are constants that depend on the scoring system.
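The three stages and the expectation formula can be sketched in a simplified form. This sketch makes several assumptions that differ from real BLAST: it uses exact word matches only (no neighborhood words above T), extends to the right only, and uses made-up values for the match and mismatch scores, the drop-off X, and the constants K and lambda. All function names are hypothetical.

```python
import math

def seed_hits(query, subject, w):
    """Return (query_pos, subject_pos) pairs of exact shared words of length w."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits

def extend_right(query, subject, i, j, w, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension to the right of a seed at (i, j).

    Stops when the running score falls more than x_drop below the best
    score seen so far; returns (best_score, length_of_best_segment).
    """
    score = best = w * match
    best_len = w
    k = w
    while i + k < len(query) and j + k < len(subject):
        score += match if query[i + k] == subject[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break
    return best, best_len

def expected_hsps(score, query_len, db_len, k=0.1, lam=0.3):
    """E = K * m * n * exp(-lambda * S); k and lam are illustrative values."""
    return k * query_len * db_len * math.exp(-lam * score)

hits = seed_hits("ACGTAC", "TTACGTTT", 4)
```

A larger X lets the extension ride through short runs of mismatches at the cost of more work per hit, and a higher score S makes E shrink exponentially.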
FUNCTION?
Seeding for sequence alignment: PatternHunter approach

BLAST looks for matches of k consecutive letters as seeds (the default value of k is 11 for nucleotide alignments). Instead, PatternHunter uses k non-consecutive letters as seeds. The relative position of the k letters is called a spaced seed model, and k is its weight. For example, if we use the weight-6 model 1110111, then the following sequences all match one another under the seed: actgcct, acttcct, actacct.
tactgcctg |||| |||| tactacctg 1: 1110101 2: 1110101 3: 1110101
With BLAST's seed model if a hit at position i is identified, the chance to have a second hit at position i+1 is very high because it requires only one extra base match. The dependency between the hits makes the detection of homologs less efficient: many regions will have more than one hit, which is unhelpful, while many other regions will be missed.
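A small sketch shows how a spaced seed model is applied to two aligned strings, using the weight-6 model 1110111 and the example sequences from the slide. The function name is an assumption, and the sketch only compares strings that are already placed position by position; it is not PatternHunter's indexing scheme.

```python
def spaced_seed_hits(a, b, model):
    """Offsets at which two equal-length, position-aligned strings agree
    at every position marked '1' in the seed model.

    Positions marked '0' are "don't care" and are ignored.
    """
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    span = len(model)
    return [
        offset
        for offset in range(len(a) - span + 1)
        if all(
            a[offset + k] == b[offset + k]
            for k, mark in enumerate(model)
            if mark == "1"
        )
    ]

spaced = spaced_seed_hits("tactgcctg", "tactacctg", "1110111")
consecutive = spaced_seed_hits("tactgcctg", "tactacctg", "111111")
```

With the single mismatch in the middle of the 9-letter example, the spaced model still finds a hit while a consecutive seed of the same weight finds none.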
Sensitivity = number of TP / (number of TP + number of FN)
Above 30% identity, 90% of the sequences turn out to be homologous to the query; below 25%, fewer than 10% are.
AGSGYWKATGTDKVITTEGRKVGIKKALVFYIGKAPKGTKTNWIMHEYRLLENSRKNGSSKVD
ALIGN
AGSGYWKATGADKPIGLPKPVGIKKALVFYAGKAPKGEKTNWIMHEYRLADVDRSVRKKKNSLRLD
ALGORITHM
AGSGYWKATGTDKVITTEGRKVGIKKALVFYIGKAPKGTKTNWIMHEYRLLENSR----KNGSSKVD
AGSGYWKATG DK I    + VGIKKALVFY GKAPKG KTNWIMHEYRL +  R    K  S ++D
AGSGYWKATGADKPIGLP-KPVGIKKALVFYAGKAPKGEKTNWIMHEYRLADVDRSVRKKKNSLRLD
Evaluating the biological significance of the alignment produced
Are there more refined methods for finding functionally or structurally related protein sequences? The idea is to identify the domains or positions that are conserved, and therefore subject to a structural or functional constraint, within proteins belonging to the same family.
Multiple alignment
The multiple alignment of three or more sequences can be defined as a hypothesis of positional homology between bases or amino acids.
1YEA   AKESTGFKPGSAKKGATLFKTRCQQCHTIEE-------GGPNKVGPNLHGIFGRHSGQVK
1YCC   ----TEFKAGSAKKGATLFKTRCLQCHTVEK-------GGPHKVGPNLHGIFGRHSGQAE
2PCBB  ---------GDVEKGKKIFVQKCAQCHTVEK-------GGKHKTGPNLHGLFGRKTGQAP
5CYTR  ---------GDVAKGKKTFVQKCAQCHTVEN-------GGKHKVGPNLWGLFGRKTGQAE
1CCR   -ASFSEAPPGNPKAGEKIFKTKCAQCHTVDK-------GAGHKQGPNLNGLFGRQSGTTP
1CRY   ---------QDAASGEQVFK-QCLVCHSIGP-------GAKNKVGPVLNGLFGRHSGTIE
1HROA  -----SAPPGDPVEGKHLFHTICITCHTDIK-------G-ANKVGPSLYGVVGRHSGIEP
1CXC   -------QEGDPEAGAKAFN-QCQTCHVIVDDSGTTIAGRNAKTGPNLYGVVGRTAGTQA
1C2RA  ---------GDAAKGEKEFN-KCKTCHSIIAPDGTEIVKG-AKTGPNLYGVVGRTAGTYP
155C   -------NEGDAAKGEKEFN-KCKACHMIQAPD-GTDIKG-GKTGPNLYGVVGRKIASEE
2C2C   --------EGDAAAGEKVSK-KCLACHTFDQ-------GGANKVGPNLFGVFENTAAHKD
2mtac  -----APQFFNIIDGSPLNFDD-----AMEEGRDTEAVKHFLETGENVYNEDPEILPEAE
       . * : * : . .
Looking at a multiple alignment of related protein sequences, one can notice conserved regions, typically 20-30 amino acids long.
The basic idea is to classify different sequences as belonging to the same family if they share the same motifs. One way to achieve this is to define profiles: that is, which residues are allowed at a given position, which are highly conserved or degenerate, and which positions or regions can tolerate insertions or deletions.
N homologous sequences
Compute all possible pairwise alignments
Build a guide tree based on the similarity scores between all pairs
Choose the pair of sequences with the highest degree of similarity and group them into a cluster, fixing their alignment
The multiple alignment includes all the sequences
Multiple alignment
Limitation: if the algorithm gets one alignment wrong, the error will negatively affect all subsequent ones.
Given a multiple alignment of a set of sequences, a profile for that alignment gives the frequency with which each character appears in a given column.

ATC_A
ATATA
ACCT_
CT_TC

      C1   C2   C3   C4   C5
A    .75       .25       .50
C    .25  .25  .50       .25
T         .75       .75
_              .25  .25  .25
Profile values are often converted to log ratios. If p(y,j) is the frequency of character y at position j and p(y) is the frequency with which character y appears anywhere in the multiple alignment, then the value $\log \frac{p(y,j)}{p(y)}$ is used as the entry in the profile matrix. For a character y and a column j, let p(y,j) be the frequency with which y appears in column j of the profile, and let S(x,j) denote the score for aligning x with column j:
$$S(x,j) = \sum_{y} s(x,y)\, p(y,j)$$
where s(x,y) is the score for aligning character x with character y.
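The profile, its log-ratio form, and the column score S(x,j) can all be computed directly from the four-sequence alignment used on the slide. This is a minimal sketch; the function names and the +1/-1 pair score passed as `s` in the usage line are assumptions for illustration.

```python
from collections import Counter
import math

ALIGNMENT = ["ATC_A", "ATATA", "ACCT_", "CT_TC"]

def column_profile(alignment):
    """Frequency p(y, j) of each character y in each column j."""
    n = len(alignment)
    return [
        {ch: count / n for ch, count in Counter(column).items()}
        for column in zip(*alignment)
    ]

def log_odds_profile(alignment):
    """Entries log(p(y, j) / p(y)), with p(y) taken over the whole alignment."""
    background = Counter("".join(alignment))
    total = sum(background.values())
    return [
        {y: math.log(p / (background[y] / total)) for y, p in column.items()}
        for column in column_profile(alignment)
    ]

def column_score(x, column, s):
    """S(x, j) = sum over y of s(x, y) * p(y, j)."""
    return sum(s(x, y) * p for y, p in column.items())

profile = column_profile(ALIGNMENT)
# Hypothetical pair score: +1 for identical characters, -1 otherwise.
score_a_col1 = column_score("A", profile[0], lambda x, y: 1 if x == y else -1)
```

With that pair score, aligning A against the first column gives 0.75 - 0.25 = 0.5, reflecting that three of the four sequences carry an A there.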

This concept can be applied in biology to identify proteins belonging to the same family: one can define a set of positions that are more or less conserved in a sequence. To this end, a linear chain of match, insertion, and deletion states is defined, referring to a multiple alignment of proteins (profile).
All states can emit a character except the deletion state.
The aim of all this work is to find a model that assigns a high probability to the protein sequences that belong to the same family; in this way we obtain a set of states and transitions with which we can evaluate the probability that an unknown sequence belongs to a given protein family. Naturally, several possible paths can generate the same sequence: the right one must be found, namely the one that maximizes the score.
Advantages
- Solid statistical foundation.
- They can be used in a wide range of tasks, such as data mining for classifying biological data, protein structure analysis, pattern discovery, etc.
Disadvantages
- Overfitting: because of the starting data, some members of a protein family may be overrepresented, weighing too heavily in the construction of the model and making it too stringent.
- The result is a linear model that cannot describe higher-order correlations within a protein, such as hydrogen bonds, disulfide bridges, etc., which can occur between amino acids that are distant in the sequence but close because of the protein fold.
With the techniques seen so far, one has to deal with the enormous quantity of data available in public biological databases. The figure illustrates the growth of DNA sequence data from the advent of sequencing techniques in 1975 to the present day.
[Figure: cumulative increase of molecular biology and genetics articles (dashed line) and of DNA sequence records in GenBank (solid line).] Note how the exponential increase in sequence data led, around the mid-1990s, to an inversion of the two positions. Today the sheer quantity of data means that the scientific publications that should describe them cannot keep pace. (Adapted from M.S. Boguski, Science 286, 453-455, 1999).
One of the main tasks of bioinformatics is to organize the data and extract from them information that is useful and usable by the scientific community.
There is a whole field of bioinformatics devoted to data mining, and the process by which knowledge is obtained from the analysis of the data held, for example, in primary databases, and which can generate secondary or specialized databases, goes by the name of:
KDD (Knowledge Discovery in Databases)
Knowledge Discovery in Databases (KDD)

[Figure: the KDD process. Data -> cleaning and integration -> data warehouse -> selection and transformation -> prepared data -> data mining -> patterns -> evaluation and visualization -> knowledge -> knowledge base -> knowledge application]
Data mining (KDD) goals

The main aim of data mining is to create a knowledge base that can be used to predict the function of unknown biological data.
Description
Annotation: the process of interpreting raw data to provide biologically usable information.
Prediction
Construction of a model with predictive power.
Data mining (KDD) operations

Verification
- Validating the hypothesis
- Statistical analysis
Search
- Data exploration
- Predictive models
- Database segmentation
ONTOLOGY is a way to capture knowledge in a written and computable form. This means that the computer finds patterns so we don't have to.
IN PHILOSOPHY: Ontology (from Greek) is the philosophical study of the nature of being, existence or reality in general, as well as of the basic categories of being and their relations.
IN COMPUTER SCIENCE: Ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts.
Gene Ontology
mRNA synthesis DNA directed rna synthesis
Transcription
Gene expression
id: GO:0006352
The Gene Ontology is like a dictionary. Each concept has:
- a name (term: transcription initiation)
- a definition (definition: Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter.)
- an ID number (id: GO:0006352)
There are also relationships between them. Gene Ontology is a DAG (Directed Acyclic Graph). Nucleic acid binding is a type of binding. DNA binding is a type of nucleic acid binding.
Appropriate Relationships to Parents

GO currently has many relationship types, but the most frequent are:
Is_a
An is_a child of a parent means that the child is a complete type of its parent, but can be discriminated in some way from other children of the parent.
CAR
A Ferrari is a CAR. A FIAT 500 is a CAR.
Appropriate Relationships to Parents

and:
Part_of
A part_of child of a parent means that the child is always a constituent of the parent that in combination with other constituents of the parent make up the parent.
CAR
The wheel is a part of a CAR
True Path Violations Create Incorrect Definitions

"...the pathway from a child term all the way up to its top-level parent(s) must always be true".
[Figure: chromosome is part_of nucleus; mitochondrial chromosome is_a chromosome.]
A mitochondrial chromosome is not part of a nucleus!
[Figure, corrected: nuclear chromosome is_a chromosome and is part_of nucleus; mitochondrial chromosome is_a chromosome and is part_of mitochondrion.]
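The true path rule can be checked mechanically by walking every is_a and part_of edge upward from a term. The sketch below encodes the two graphs from the slides as small dictionaries; the data structure and function name are assumptions for illustration, not the GO file format.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following is_a / part_of edges upward.

    `parents` maps a term to a list of (parent_term, relation) pairs.
    """
    seen = set()
    stack = [term]
    while stack:
        for parent, _relation in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Flawed graph: every chromosome is declared part of the nucleus.
flawed = {
    "mitochondrial chromosome": [("chromosome", "is_a")],
    "chromosome": [("nucleus", "part_of")],
}

# Corrected graph: the part_of edges sit on the specific child terms.
fixed = {
    "nuclear chromosome": [("chromosome", "is_a"), ("nucleus", "part_of")],
    "mitochondrial chromosome": [("chromosome", "is_a"), ("mitochondrion", "part_of")],
}

violation = "nucleus" in ancestors("mitochondrial chromosome", flawed)
```

In the flawed graph the mitochondrial chromosome inherits the nucleus as an ancestor, which is exactly the false statement the rule is meant to prevent.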
Has_part relationship
To overcome this problem a new relationship has been recently added: has_part. Previously we have been used to propagating gene products up the graph. With the addition of has_part this is no longer so simple.
[Figure: propagation of the gene products ABF1 and MGM101 across the terms nucleus, mitochondrion, and chromosome.]
The ontologies are used to categorize gene products.
Biological process ontology

Which process is a gene product involved in?
Molecular function ontology

Which molecular function does a gene product have?
Cellular component ontology

Where does a gene product act?
AMINO ACID SEQUENCE
Similarity searches (HMM, profiles, HMM-HMM, etc.)
Is there anything really similar out there? NO / YES
Fold recognition, etc.: try to find the 3D structural model or features.
Try functional transfer; annotate the sequence. Good luck!
ARGOT
It is a knowledge-based and integrated approach which combines:
1. clustering of GO terms, based on their semantic similarities
2. a weighting scheme which assesses retrieved hits sharing a certain number of biological features with the sequence to be annotated
What do you need?

A metric based on:
1) Topology: the GO graph.
2) Information content: how informative is the term? Can you quantify it?
3) Semantic similarity: a measure to establish "How much does term A have to do with term B?"
4) A weighting scheme: finding some biological features in common between our target and known proteins annotated in GO (BLAST, HMM, etc.). How do we get and weight these features?
Are C and D similar? Are A and B similar?

Edge distance: AB = 2, CD = 2: very close! But:
- Is antioxidant activity a sort of transcription regulator activity? Certainly not.
- For sure, glutathione peroxidase activity shares something with phospholipid-hydroperoxide glutathione peroxidase activity!

Edge distance cutoff <= 2:
"A is similar to B": correct? NO
and
"C is similar to D": correct? YES
Are C and D similar? Are A and B similar?

Information content (Resnik 1999)
Semantic similarity (Lin 1998)
List of common subsumers

[Figure: GO graph annotated with information content. Root: IC=0; A: IC=2.9; B: IC=1.8; common subsumer of C and D: IC=3.1; C: IC=4.2; D: IC=5.8]

Semantic similarity: AB = 0: absolutely not similar! CD = 0.62: quite similar!

Semantic similarity >= 0.6:
"A is NOT similar to B": correct? YES
and
"C is similar to D": correct? YES
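The two similarity values quoted on the slide follow from Lin's formula, which divides twice the information content of the most informative common subsumer by the summed information content of the two terms. The sketch below reproduces them; the function names, the use of the natural logarithm, and the convention for the undefined 0/0 case are assumptions.

```python
import math

def information_content(term_count, total_count):
    """IC(t) = -log p(t), where p(t) is the fraction of annotations that
    fall on term t or its descendants. The root has p = 1 and IC = 0.
    Natural log is an assumption; the slides do not state the base."""
    return -math.log(term_count / total_count)

def lin_similarity(ic_a, ic_b, ic_common):
    """Lin similarity: 2 * IC(common subsumer) / (IC(a) + IC(b))."""
    if ic_a + ic_b == 0:
        # Both terms carry no information; 0.0 is a convention chosen here.
        return 0.0
    return 2 * ic_common / (ic_a + ic_b)

sim_ab = lin_similarity(2.9, 1.8, 0.0)   # A and B share only the root
sim_cd = lin_similarity(4.2, 5.8, 3.1)   # C and D share an informative parent
```

Unlike edge distance, this measure separates the two pairs: A and B meet only at the uninformative root, while C and D share a specific parent.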
Step I
Trimming the GO graph: keep only the nodes of BLAST hits (black circles) and their parents (white circles).
Step II
1) Calculating IC
2) Calculating weights: the absolute value of the sum of the log of the child nodes' BLAST e-values.
Step III
1) Discarding nodes with Z-score < 0:
$$Z_i = \frac{S_i - \bar{S}}{\sigma}$$
where $\bar{S}$ is the average, calculated as the score of the root node divided by the total number of nodes that compose the initial trimmed GO graph, $S_i$ is the score of node i, and $\sigma$ is the standard deviation, assuming a Gaussian distribution of the weights.
2) Clustering of nodes based on semantic similarity (stringent 0.7 threshold).
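The weighting and filtering steps can be sketched as follows. Several details are assumptions, since the slides do not specify them: base-10 logarithms for the e-values, the population standard deviation for sigma, and node scores supplied as a plain dictionary. The function names are hypothetical and this is not the ARGOT implementation.

```python
import math

def node_weight(evalues):
    """Absolute value of the sum of the logs of the BLAST e-values
    attached to a node (log base 10 is an assumption)."""
    return abs(sum(math.log10(e) for e in evalues))

def keep_nodes(scores, root):
    """Keep the nodes whose Z-score is not negative.

    The mean is the root score divided by the number of nodes, as the
    text describes; sigma is the population standard deviation around it.
    """
    n = len(scores)
    mean = scores[root] / n
    sigma = math.sqrt(sum((s - mean) ** 2 for s in scores.values()) / n)
    return {node for node, s in scores.items() if (s - mean) / sigma >= 0}

# Toy trimmed graph with made-up node scores.
kept = keep_nodes({"root": 9.0, "a": 6.0, "b": 2.0, "c": 1.0}, root="root")
```

Here the mean is 9 / 4 = 2.25, so the two low-scoring nodes fall below it and are discarded before clustering.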
Specificity = TN/(TN+FP)
Sensitivity = TP/(TP+FN)
Y-axis = sensitivity; X-axis = 1 - specificity
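A ROC curve is obtained by sweeping a score threshold and recording one (1 - specificity, sensitivity) point per threshold. The sketch below does this for a toy list of scores and true/false labels; the function names and the example data are assumptions, and it requires at least one positive and one negative label.

```python
def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def roc_points(scores, labels):
    """One (1 - specificity, sensitivity) point per distinct threshold.

    A prediction counts as positive when its score is >= the threshold.
    """
    points = []
    for threshold in sorted(set(scores), reverse=True):
        pairs = list(zip(scores, labels))
        tp = sum(1 for s, is_true in pairs if s >= threshold and is_true)
        fp = sum(1 for s, is_true in pairs if s >= threshold and not is_true)
        fn = sum(1 for s, is_true in pairs if s < threshold and is_true)
        tn = sum(1 for s, is_true in pairs if s < threshold and not is_true)
        points.append((1 - specificity(tn, fp), sensitivity(tp, fn)))
    return points

points = roc_points([0.9, 0.8, 0.3, 0.1], [True, False, True, False])
```

Lowering the threshold moves along the curve from the lower left (few predictions, few false positives) to the upper right (everything predicted positive).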
ROC plots (10,000)
In (a) the results of InC, AC and TS scores are reported for hits under 100% sequence identity (ROC 100 plots). In (b) the performances of the three indexes are reported for low sequence similarity hits below 40% identity (ROC 40 plots). In (c), (d), and (e) the AC, TS, and InC scores are shown respectively, with comparisons of their trends at low (ROC 40 plots) and high (ROC 100 plots) sequence similarity. In (f) the annotations of up to the first top five BLAST hits are evaluated (TOPBLAST).
http://www.medcomp.medicina.unipd.it/Argot2/
