
Big Data in Biology

Opportunities and Challenges

Professor Ewan Birney FRS FMedSci


Director
www.ebi.ac.uk

We have been living through a revolution.

One genome 2003 to 2013

The cost of sequencing a genome in 2003

The cost of sequencing a genome in 2015

All living things are made from the same stuff (DNA, RNA, protein)

There has been a huge impact on biological research

We are starting to have an impact on medicine

And agriculture, and the environment

Solving blue-collar and white-collar problems

Interesting, ground-breaking ideas

Big data management and pipelines

EMBL-EBI: Big Data in Biology


Europe's hub for biological data services, research and training

Home of the ELIXIR hub


570 members of staff from 57 nations
60 Petabytes of data storage
40 Gbits bandwidth via JANET
11 million requests served / month
LFCF-enabled capacity

Impact beyond research

Solving childhood genetic diseases

30% of children with developmental delay get a diagnosis due to exome sequencing

Genomics England

Largest single roll-out of genomics and Big Data into any healthcare system
Public/Private Partnership

Large industry involvement


"A key reason for our substantial strategic investment in the Centre for Therapeutic Target Validation was the capacity and expertise in Big Data in biology at EMBL-EBI."
Patrick Vallance, President of Pharmaceuticals R&D at GSK, on the company's multi-million investment in the CTTV

Smarter farming for food security


Crucial for improving crop yields
bread wheat and barley genomes

Genomes of major domesticated animals


Functional Annotation of Animal Genomes

Plant pathogens: hundreds of species


PhytoPath database

Global plant genome databases


Ensembl Plants, IPG

Big Data infrastructure


Crop genomics, soil metagenomics

Genomes

Sequencing Ebola virus in real time

Metagenomics

Transformative small/medium
technology companies

>180 million pounds raised (main investors IP Group, N. Woodford)
Ewan Birney is a paid
consultant to Oxford
Nanopore

Opportunities

Thank you!

Follow me on twitter:
@ewanbirney
I blog regularly (Google Ewan
Birney)

Life science: many data types


Genes, genomes & variation
Gene, protein & metabolite expression
Protein sequences, families & motifs
Macromolecular structures
Interactions, reactions & pathways
Chemogenomics & metabolomics

Cross-domain tools & resources

Data resources at EMBL-EBI


Genes, genomes & variation: European Nucleotide Archive, 1000 Genomes, Ensembl, Ensembl Genomes, European Genome-phenome Archive, Metagenomics portal

Gene, protein & metabolite expression: ArrayExpress, Expression Atlas, MetaboLights, PRIDE

Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology

Protein sequences, families & motifs: InterPro, Pfam, UniProt

Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank

Chemical biology: ChEMBL, ChEBI

Systems: BioModels, Enzyme Portal, BioSamples

Reactions, interactions & pathways: IntAct, MetaboLights, Reactome

Who uses EMBL-EBI's services?

http://wwwdev.ebi.ac.uk/ebiwebtrafficmap/kmlvector.html

Research
Data-driven discovery

www.ebi.ac.uk/research

Data growth
[Figure: data volume in bytes (log scale, 1E+08 to 1E+16) against date (2002 to 2016) for EGA, ENA, PRIDE, MetaboLights and ArrayExpress, annotated with 3-, 4-, 12- and 18-month doubling times]
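
To give a feel for what the doubling times on this chart imply, here is a minimal sketch that converts a doubling time into an annual growth factor and projects a volume forward under constant exponential growth. The function names and the 1 PB starting volume are assumptions made for illustration; they are not figures taken from the slide.

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Factor by which a quantity grows in 12 months, given its doubling time."""
    return 2.0 ** (12.0 / doubling_months)


def project_bytes(start_bytes: float, doubling_months: float, months: float) -> float:
    """Project a data volume forward, assuming a constant doubling time."""
    return start_bytes * 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Doubling times annotated on the chart.
    for d in (3, 4, 12, 18):
        print(f"{d:>2}-month doubling -> x{annual_growth_factor(d):.2f} per year")

    # Hypothetical example: 1 PB (1e15 bytes) doubling every 12 months for 3 years.
    print(f"{project_bytes(1e15, 12, 36):.1e} bytes")
```

A 3-month doubling time corresponds to sixteen-fold growth per year, whereas an 18-month doubling time corresponds to roughly 1.6-fold, which is why the curves on a log-scale plot diverge so quickly.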

Sequence compression
Encoding of read starts and differences
3.5x to 100x compression over existing formats
Scales favourably with increasing read length and density

Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21 (5), 734-40.
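
The idea of encoding read starts and differences can be illustrated with a toy sketch: a read that has been aligned to a reference is stored as its start position, its length, and only the bases that differ from the reference. This is a simplified illustration, assuming substitutions only and an already-known alignment position; it is not the published method or any real file format, and the names `encode_read` and `decode_read` are hypothetical.

```python
from typing import List, Tuple

Diff = Tuple[int, str]                 # (offset within the read, substituted base)
Encoded = Tuple[int, int, List[Diff]]  # (start on reference, read length, diffs)


def encode_read(reference: str, read: str, start: int) -> Encoded:
    """Store a read as its start position plus its mismatches to the reference."""
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return start, len(read), diffs


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the original read from the reference and the stored differences."""
    start, length, diffs = encoded
    bases = list(reference[start:start + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTACGT"   # toy reference sequence (assumption)
    read = "GTACCTAC"            # differs from the reference at one base
    encoded = encode_read(reference, read, start=2)
    print(encoded)
    assert decode_read(reference, encoded) == read
```

Because a read that matches the reference exactly carries an empty difference list, the stored size per read depends mostly on the number of mismatches rather than the read length, which is consistent with the claim that the approach scales favourably as reads get longer.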