Course TD5

Biopython: a Python library for bioinformatics
Simple sequence manipulation
from Bio.Seq import Seq

my_seq = Seq("AGTACACTGGT")
print(my_seq)
print(my_seq.complement())          # prints the complementary DNA strand
print(my_seq.reverse_complement())  # prints the reverse-complement DNA strand

Reading a FASTA file
from Bio import SeqIO

for seq_record in SeqIO.parse("SeqFasta.txt", "fasta"):
    # Name of the FASTA sequence: the '>' is not printed and the name stops at the first space
    print(seq_record.id)
    # Nucleotide or protein sequence
    print(seq_record.seq)
    # Sequence length
    print(len(seq_record))

Attributes of the SeqRecord objects returned by SeqIO (FASTA format)
● .seq → the sequence
● .id → the sequence identifier (from the > up to, but not including, the first space)
● .description → human-readable description of the sequence
● .dbxrefs → list of cross-references to the databases that contain the sequence
● .name
● .letter_annotations
● .annotations
● .features
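As a rough illustration of how these attributes are filled in, the sketch below parses an in-memory FASTA record instead of a file. The record text (the identifier seq1, its description and its sequence) is invented for the example and is not part of the course files.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical FASTA record, used only for illustration.
fasta_text = ">seq1 hypothetical test record\nMKTAYIAKQR\n"

# SeqIO.parse accepts any text handle, so a StringIO stands in for a file.
record = next(SeqIO.parse(StringIO(fasta_text), "fasta"))

print(record.id)           # text between '>' and the first space
print(record.description)  # the whole header line, without the '>'
print(record.seq)          # the sequence itself
print(len(record))         # sequence length
print(record.features)     # a FASTA file carries no features
```

For richer formats such as GenBank, the same attributes exist but .annotations and .features are populated from the file.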
Supported sequence formats
● biopython.org/wiki/SeqIO and biopython.org/wiki/AlignIO
– 29 sequence file formats can be read
– FASTA, FASTQ, GenBank ...
– 12 alignment file formats can be read
– Clustal, PHYLIP, ...
Seq objects behave like character strings
...
# Bio.Alphabet was removed in Biopython 1.78, so no alphabet argument is passed to Seq
my_seq = Seq("GATCG")
for index, letter in enumerate(my_seq):
    print("%i %s" % (index, letter))
0 G
1 A
2 T
3 C
4 G
print(len(my_seq))
5
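Beyond iteration and len(), a few other string-like operations can be sketched as follows. This is a minimal example on the same short sequence; the variable names are chosen here for illustration.

```python
from Bio.Seq import Seq

my_seq = Seq("GATCG")

first_letter = my_seq[0]        # indexing returns a single character
sub_seq = my_seq[1:4]           # slicing returns a new Seq object
g_count = my_seq.count("G")     # non-overlapping count, as with str.count
longer = my_seq + Seq("AAA")    # concatenation returns a new Seq
as_text = str(my_seq)           # convert to a plain Python string

print(first_letter, sub_seq, g_count, longer, as_text)
```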
Example of sequence functionality
● Computing the GC%
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction  # replaces GC(), which older Biopython releases provided

my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
100 * gc_fraction(my_seq) → 46.875
Transcribing a DNA sequence
coding_dna = Seq('ATGGCCATTGTAATG')

messenger_rna = coding_dna.transcribe()
print(messenger_rna)
→ AUGGCCAUUGUAAUG
● The function replaces every T with U
● In Biopython releases before 1.78 it also changed the sequence's alphabet (DNA → RNA); alphabets no longer exist in later releases
Translating a sequence
● RNA can be translated, and so can DNA, as in this example:
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTTAG")
coding_dna.translate()
→ Seq('MAIVMGR*KG*')
● Translation is based on the NCBI tables. By default, Biopython uses the standard genetic code, but other genetic codes can be selected:
coding_dna.translate(table="Vertebrate Mitochondrial")
→ Seq('MAIVMGRWKG*')
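The two translations above keep going past the internal stop codon. As a small sketch, the to_stop option of translate() can be used to stop at the first in-frame stop codon instead; the variable names below are illustrative only.

```python
from Bio.Seq import Seq

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTTAG")

# Standard genetic code: TGA is a stop codon, so translation ends there.
standard_protein = coding_dna.translate(to_stop=True)

# Vertebrate mitochondrial code (NCBI table 2): TGA codes for W,
# so translation continues up to the final TAG stop codon.
mito_protein = coding_dna.translate(table=2, to_stop=True)

print(standard_protein)
print(mito_protein)
```

With to_stop=True the stop symbol itself is not included in the returned protein.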
Displaying the standard genetic code
from Bio.Data import CodonTable

standard_table = CodonTable.unambiguous_dna_by_id[1]
print(standard_table)
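Besides printing the whole table, individual codons can be looked up. The sketch below compares how the standard table (id 1) and the vertebrate mitochondrial table (id 2) treat TGA, which is the codon responsible for the difference between the two translations shown earlier.

```python
from Bio.Data import CodonTable

standard_table = CodonTable.unambiguous_dna_by_id[1]
mito_table = CodonTable.unambiguous_dna_by_id[2]

# forward_table maps sense codons to amino acids; stop codons are listed separately.
print(standard_table.forward_table["ATG"])
print(standard_table.stop_codons)

tga_is_stop_in_standard = "TGA" in standard_table.stop_codons
tga_in_mito = mito_table.forward_table["TGA"]
print(tga_is_stop_in_standard, tga_in_mito)
```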
Writing files with Biopython
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Declare and initialise the rec1 and rec2 objects
rec1 = SeqRecord(
    Seq("MMYQQGCFAGGTVLRLAKDLAENHLDSMVGQALSSAC"),
    id="gi|14150838|gb|AAK54648.1|AF376133_1",
    description="chalcone synthase [Cucumis sativus]",
)
rec2 = SeqRecord(
    Seq("YPDYYFRITNREHKAELNPSMCEYMAPSL"),
    id="gi|13919613|gb|AAK33142.1|",
    description="chalcone synthase [Fragaria vesca]",
)
my_records = [rec1, rec2]

# Write the sequences in FASTA format
SeqIO.write(my_records, "my_example.faa", "fasta")
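To see what SeqIO.write actually produces without creating a file, the records can be written to an in-memory handle. This is only a sketch: the two short records below (rec_a, rec_b and their descriptions) are invented for the example.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical records, used only for illustration.
records = [
    SeqRecord(Seq("MKVLA"), id="rec_a", description="first example"),
    SeqRecord(Seq("MTTQR"), id="rec_b", description="second example"),
]

handle = StringIO()
written = SeqIO.write(records, handle, "fasta")  # returns the number of records written
fasta_text = handle.getvalue()
print(written)
print(fasta_text)

# Reading the text back gives the same identifiers.
ids_read_back = [rec.id for rec in SeqIO.parse(StringIO(fasta_text), "fasta")]
print(ids_read_back)
```

Each FASTA header is built from the id followed by the description, which is why the id must not contain spaces if it is to be recovered unchanged.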

Converting between file formats
● Converting a sequence from GenBank format to FASTA format
from Bio import SeqIO

count = SeqIO.convert("orchid.gbk", "genbank", "example.fasta", "fasta")
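Sequence alignment
● PairwiseAligner performs a global alignment by default; with pairwise2 the function name selects the type (align.localXX or align.globalXX)
from Bio import SeqIO
from Bio import pairwise2
from Bio.Align import substitution_matrices

# Bio.SubsMat.MatrixInfo is absent from recent Biopython releases; load BLOSUM62 this way instead
blosum62 = substitution_matrices.load("BLOSUM62")

# Local alignment (gap open -10, gap extend -1)
alignments = pairwise2.align.localds("LSPADKTNVKAA", "PAEKSNV", blosum62, -10, -1)
print(pairwise2.format_alignment(*alignments[0]))

# Global alignment (gap open -10, gap extend -0.5)
seq1 = SeqIO.read("alpha.fa", "fasta")
seq2 = SeqIO.read("beta.fa", "fasta")
alignments = pairwise2.align.globalds(seq1.seq, seq2.seq, blosum62, -10, -0.5)
print(alignments)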
Some alignment parameters
● match score
● mismatch score
● target open gap score
● target extend gap score
● query open gap score
● query extend gap score
● mode: local / global
● aligner.algorithm, e.g. 'Smith-Waterman'
● ...
● Depending on the gap parameters, Biopython's PairwiseAligner chooses the most suitable alignment algorithm
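The parameters listed above can be set as attributes on a PairwiseAligner object. The sketch below is one possible configuration, reusing the BLOSUM62 matrix, the gap penalties and the two short peptides from the pairwise2 example; these scoring values are an assumption for illustration, not a recommendation.

```python
from Bio import Align
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner()
aligner.mode = "local"  # the default mode is "global"
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10     # sets both the target and query open gap scores
aligner.extend_gap_score = -0.5  # sets both the target and query extend gap scores

# The algorithm name is derived from the mode and gap settings chosen above.
print(aligner.algorithm)

alignments = aligner.align("LSPADKTNVKAA", "PAEKSNV")
best = alignments[0]
print(best.score)
print(best)
```

Printing the aligner itself (print(aligner)) lists the current value of every parameter, which is a convenient way to check a configuration.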
Example of reading BLAST files (1)
● The BLAST output file must be in XML format (-outfmt 5)
blastp -query t.fa -db nr -out BLASTp.xml -outfmt 5

from Bio import SearchIO

blast_qresult = SearchIO.read("BLASTp.xml", "blast-xml")
blast_slice = blast_qresult[:3]   # keep the first 3 hits
print(blast_slice)

Program: blastp (2.8.1+)
Query: sp|Q5PP28|NAC3_ARATH (394)
NAC domain-containing protein 3 OS=Arabidopsis thaliana OX=3702 GN=...
Target: nr
Hits: ---- ----- ----------------------------------------------------------
# # HSP ID + description
---- ----- ----------------------------------------------------------
0 1 gi|15217677|ref|NP_171725.1| NAC domain containing pro...
1 1 gi|3258572|gb|AAC24382.1| Hypothetical protein [Arabid...
2 1 gi|297848422|ref|XP_002892092.1| NAC domain-containing...


# Hit at index 10 (numbering starts at 0): prints the numeric information of its HSPs
blast_hsp1 = blast_qresult[10]
print(blast_hsp1)
print('\n')
# First HSP of that hit: prints the same information plus the HSP alignment
blast_hsp2 = blast_qresult[10][0]
print(blast_hsp2)
● Output for blast_hsp1
Query: sp|Q5PP28|NAC3_ARATH
NAC domain-containing protein 3 OS=Arabidopsis thaliana OX=3702 GN=NA...
Hit: gi|1338503851|ref|XP_006305025.2| (389)
NAC domain-containing protein 3 [Capsella rubella]
HSPs: ---- -------- --------- ------ --------------- ---------------------
# E-value Bit score Span Query range Hit range
---- -------- --------- ------ --------------- ---------------------
0 3.5e-137 402.90 404 [0:394] [9:389]

● Output for blast_hsp2
Query: sp|Q5PP28|NAC3_ARATH NAC domain-containing protein 3 OS=Arabidop...
Hit: gi|1338503851|ref|XP_006305025.2| NAC domain-containing protein ...
Query range: [0:394] (0)
Hit range: [9:389] (0)
Quick stats: evalue 3.5e-137; bitscore 402.90
Fragments: 1 (404 columns)
Query - METPVGLRFCPTDEEIVVDYLWPKNSDRDTSHVDRFINTVPVCRLDPWELPCQSRIKLK~~~SISRT
+E VG RF PTDEE V DYL PKN DTSHVDR I+T+ + DPWELP QSRIKLK~~~ ISRT
Hit - LENTVGFRFRPTDEEFVDDYLRPKNLKLDTSHVDRVISTLTISSFDPWELPSQSRIKLK~~~PISRT
from Bio.Blast import NCBIXML

with open('FichiersEtudiant/BLASTp.xml', 'r') as result_handle:
    blast_record = NCBIXML.read(result_handle)

E_VALUE_THRESH = 0.01   # only report HSPs whose E-value is below this threshold
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < E_VALUE_THRESH:
            # Print a few attributes of the alignment and HSP objects
            print("\n****Alignment****")
            print("sequence:", alignment.title)
            print("length:", alignment.length)
            print("e value:", hsp.expect)
            # Show only the first 75 characters of the alignment
            print(hsp.query[0:75] + "...")
            print(hsp.match[0:75] + "...")
            print(hsp.sbjct[0:75] + "...")
Variables available in Biopython BLAST results
● Reading the BLAST file yields a list of objects
● The variable blast_hsp is one element of this list
● blast_hsp.query_range → (0, 61)
● blast_hsp.evalue → 4.91307e-23
● blast_hsp.hit_start → 0 # start coordinate of the hit sequence
● blast_hsp.query_span → 61 # how many residues in the query sequence
● blast_hsp.aln_span → 61 # how long the alignment is
● blast_hsp.gap_num → 0 # number of gaps
● blast_hsp.ident_num → 61 # number of identical residues
Phylogeny and classification with Biopython
from Bio import Phylo

tree = Phylo.read("FichiersEtudiant/simpleFichier.dnd", "newick")
Phylo.draw_ascii(tree)
                                                    ________________________ A
                           ________________________|
                          |                        |________________________ B
  ________________________|
 |                        |                         ________________________ C
 |                        |________________________|
_|                                                 |________________________ D
 |
 |                         ________________________ E
 |                        |
 |________________________|________________________ F
                          |
                          |________________________ G
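The same kind of tree can be explored without a file. In the sketch below, the Newick string is an assumption: it was written to match the topology drawn above and is not taken from simpleFichier.dnd.

```python
from io import StringIO

from Bio import Phylo

# Assumed Newick description of the tree drawn above (no branch lengths).
newick = "(((A,B),(C,D)),(E,F,G));"
tree = Phylo.read(StringIO(newick), "newick")

# Terminal nodes are the leaves of the tree.
leaf_names = [leaf.name for leaf in tree.get_terminals()]
leaf_count = tree.count_terminals()
print(leaf_names)
print(leaf_count)

Phylo.draw_ascii(tree)
```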
Connecting to remote databases
● Download and/or parse NCBI data
– Genome or sequence in FASTA or GenBank format
– Retrieve the taxonomy of a sequence
● Download PubMed citations
● Download and/or parse data from Swiss-Prot, PDB ...
Exercise
● Using Biopython, redo the exercise of extracting the 10th BLAST alignment
