CAP 2016 - Big Data Platform To Manage Genomic Variants

A Novel Big Data Platform to Manage Genomic Variants in the Clinical Laboratory
Wade L. Schulz, MD, PhD; Michael DEletto, MS; Steve Shane, BS; William Byron, MBA; Charles Torre, MS; Thomas JS Durant, MD; Richard Torres, MD, MS; Peter Gershkovich, MD, MHA
Department of Laboratory Medicine, Yale School of Medicine; Department of Pathology, Yale School of Medicine; Department of Information Technology Systems, Yale-New Haven Health System
Context
The role of clinical next-generation sequencing (NGS) continues to increase. A key goal for many
institutions is to bring these genomic results to the bedside to improve patient outcomes.
While the standards for the exchange of clinical genomic data have begun to emerge,
molecular results are often found in disparate data silos, which makes it difficult to obtain a
complete history of genomic testing that can be used in clinical care.
Clinical Laboratory Workflow
Three laboratories perform clinical NGS at Yale/YNHH: the Yale Center for Genome Analysis, the Yale Department of Laboratory Medicine, and the Yale Department of Pathology
Testing covers whole exome/genome, hematologic malignancy, and solid tumors
Custom pipelines and interpretation software supply a PDF report to the LIS/EHR
[Figure: laboratory workflow diagram; labels include VarBase and Downstream]
Data Model Features
Can be represented in either relational or non-relational models
Relational model presented here for visualization, but JSON is recommended for data exchange
Includes specimen and patient metadata
Original VCF data included for data reprocessing
Variant data model based on VCF data model with additional fields
Regions covered includes metadata similar to BED file
[Figure: data model diagram. Legend: subject/patient metadata, specimen metadata, specimen timestamps, subject/patient history, technical metadata, specimen interpretation, original VCF data]
Results - System Architecture and Data Flow
Data Science Platform
Data stored within the Yale-New Haven Health Helix Data Science platform
Secure big data platform with access to traditional clinical data warehouse
Conclusions
With increased interest in precision medicine initiatives, the development of genomic data standards compatible with
modern technologies is needed for efficient data exchange. This novel platform and data format allow for the storage
of large volumes of standardized genomic interpretations from separate laboratories while providing the sub-second
queries needed for clinical use.
Project Scope
Problem definition: Discrete genomic data are present within individual laboratories' in-house interpretation software, but these data are not easily accessible to clinicians and
researchers since only full-text reports are available in the LIS/EHR.
Solution requirements:
1. Develop a JSON object model to store specimen and variant data
2. Create a central data repository to store all clinical genomic variants
3. Implement web service endpoints to accept JVF data objects from laboratories
4. Architect a solution for basic data analysis
5. Identify approaches to integrate data from the traditional data warehouse (Epic Clarity)
6. Integrate with existing systems for cohort identification and analysis
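As a minimal sketch of what requirement 3 might involve, the Python function below checks an incoming JSON payload before it is accepted. The field names (`specimen_id`, `patient_id`, `variants`, `chrom`, `pos`, `ref`, `alt`) are hypothetical placeholders chosen for illustration; they are not the published JVF schema, which is documented in the project repository.

```python
import json
from typing import List

# Hypothetical field names for illustration only; not the actual JVF schema.
REQUIRED_SPECIMEN_FIELDS = ("specimen_id", "patient_id", "variants")
REQUIRED_VARIANT_FIELDS = ("chrom", "pos", "ref", "alt")


def validate_submission(payload: str) -> List[str]:
    """Return a list of validation errors; an empty list means the payload passed."""
    try:
        doc = json.loads(payload)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top-level value must be a JSON object"]

    # Check specimen-level fields first.
    errors = [
        f"missing field: {name}"
        for name in REQUIRED_SPECIMEN_FIELDS
        if name not in doc
    ]
    variants = doc.get("variants", [])
    if not isinstance(variants, list):
        return errors + ["variants must be a list"]

    # Then check each nested variant.
    for i, variant in enumerate(variants):
        if not isinstance(variant, dict):
            errors.append(f"variants[{i}] must be an object")
            continue
        for name in REQUIRED_VARIANT_FIELDS:
            if name not in variant:
                errors.append(f"variants[{i}] missing field: {name}")
    return errors
```

Returning every error at once, rather than stopping at the first, would let a submitting laboratory correct a rejected payload in a single pass.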
Design
Our institution performs clinical next-generation sequencing at three separate, local
laboratories. This project included the development of a JavaScript Object Notation (JSON)-formatted data model, named JSON Variant Format (JVF), and a Hadoop-based platform
(Hortonworks, Santa Clara, CA, USA) for centralized data management.
Data move through the platform in the following steps:
1. Annotated JVF data are submitted to a secured web service interface (hosted in a Docker container) for validation
2. Validated JVF is submitted to a Kafka messaging queue where it is consumed by NiFi
3. NiFi merges original data into Avro files for compression and increased query efficiency
4. Avro files are compressed and placed in the Hadoop Distributed File System (HDFS) for long-term storage
5. Variant data are denormalized
6. Variant-level documents are stored within an Elasticsearch index
7. Kibana interface can be used for real-time visualizations
8. Elasticsearch index can be queried by in-house annotation software (VarBase) to provide variant-level statistics
9. Data stored in HDFS can be used for batch analysis or reprocessing
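Steps 5 and 6 can be illustrated with a small sketch. Assuming a specimen-level document that nests a list of variants, denormalization copies the specimen metadata onto each variant so that every variant becomes a self-contained document suitable for a search index. The field names and the `doc_id` convention are hypothetical, the example values are synthetic, and the sketch only builds the documents without calling Elasticsearch.

```python
from typing import Any, Dict, Iterator


def denormalize(specimen: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat, variant-level document per variant in a specimen record."""
    # Everything except the nested variant list is treated as shared metadata.
    shared = {key: value for key, value in specimen.items() if key != "variants"}
    for index, variant in enumerate(specimen.get("variants", [])):
        doc = dict(shared)   # copy specimen-level metadata
        doc.update(variant)  # variant fields win on a key collision
        # Hypothetical document id: specimen id plus the variant's list position.
        doc["doc_id"] = f"{specimen.get('specimen_id', 'unknown')}-{index}"
        yield doc


# Synthetic example record; values are placeholders, not real results.
specimen = {
    "specimen_id": "S1",
    "test_type": "solid_tumor",
    "variants": [
        {"chrom": "1", "pos": 1000, "ref": "A", "alt": "G"},
        {"chrom": "2", "pos": 2000, "ref": "C", "alt": "T"},
    ],
}
docs = list(denormalize(specimen))
```

Duplicating specimen metadata onto each variant costs storage but avoids joins at query time, which fits the goal of fast variant-level lookups.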
Future Directions
Implement a storage schema in HBase for population analysis

Develop a cohort identification interface for clinical and translational research groups
Integrate cohort identification with Epic SlicerDicer
Results
After interpretation, VCF data are merged with clinical annotations in the JVF
model. This format, based on the existing Variant Call Format, is readily
serializable for web service integration. The model can be easily extended to
support additional data elements.
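To illustrate the idea of merging VCF data with clinical annotations, the sketch below parses the eight fixed columns of a VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and attaches an annotation dictionary before serializing to JSON. The output layout, the `original_vcf` and `annotation` keys, and the annotation contents are assumptions for illustration; the actual JVF model is documented in the project repository.

```python
import json
from typing import Any, Dict

# The eight fixed columns of a VCF data line.
VCF_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def vcf_line_to_record(line: str, annotation: Dict[str, Any]) -> Dict[str, Any]:
    """Parse one tab-delimited VCF data line and merge it with an annotation."""
    raw = line.rstrip("\n")
    fields = raw.split("\t")
    if len(fields) < len(VCF_COLUMNS):
        raise ValueError("expected at least 8 tab-delimited VCF columns")
    record: Dict[str, Any] = dict(zip(VCF_COLUMNS, fields))
    record["POS"] = int(record["POS"])
    # ALT may hold several comma-separated alleles.
    record["ALT"] = record["ALT"].split(",")
    # Keep the raw line so the original VCF data can be reprocessed later.
    record["original_vcf"] = raw
    # Hypothetical key for the clinical interpretation layer.
    record["annotation"] = annotation
    return record


# Synthetic example line; values are placeholders, not real results.
line = "1\t1000\t.\tA\tG,T\t50\tPASS\tDP=100"
record = vcf_line_to_record(
    line, {"gene": "EXAMPLE", "interpretation": "uncertain significance"}
)
serialized = json.dumps(record, sort_keys=True)
```

Because the record is a plain dictionary, adding further data elements only requires adding keys, and the JSON string can be posted to a web service unchanged.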
Acknowledgements
We would like to thank James Knight from the Yale Center for Genome Analysis, Rich Hurley and Aaron
Forstrom from YNHH ITS, and staff and faculty in the Molecular Diagnostics and Tumor Profiling laboratories.
Data model, examples, and documentation can be found at https://github.com/wadeschulz/jvf
