
A Novel Big Data Platform to Manage Genomic Variants in the Clinical Laboratory

Wade L. Schulz, MD, PhD; Michael D'Eletto, MS; Steve Shane, BS; William Byron, MBA; Charles Torre, MS; Thomas JS Durant, MD; Richard Torres, MD, MS; Peter Gershkovich, MD, MHA
Department of Laboratory Medicine, Yale School of Medicine; Department of Pathology, Yale School of Medicine; Department of Information Technology Systems, Yale-New Haven Health System

Context

The role of clinical next-generation sequencing (NGS) continues to increase. A key goal for many
institutions is to bring these genomic results to the bedside to improve patient outcomes.
While the standards for the exchange of clinical genomic data have begun to emerge,
molecular results are often found in disparate data silos, which makes it difficult to obtain a
complete history of genomic testing that can be used in clinical care.

Clinical Laboratory Workflow

Three laboratories perform clinical NGS at Yale/YNHH:
Yale Center for Genome Analysis: Whole Exome/Genome
Yale Department of Laboratory Medicine: Hematologic Malignancy
Yale Department of Pathology: Solid Tumors
Custom pipelines and interpretation software supply a PDF report to the LIS/EHR

[Figure: workflow diagram; surviving labels: VarBase, Downstream]

Data Model Features

Can be represented in either relational or non-relational models
Relational model presented here for visualization, but JSON is recommended for data exchange
Includes specimen and patient metadata
Original VCF data included for data reprocessing
Variant data model based on VCF data model with additional fields
Regions covered includes metadata similar to BED file

Legend
Subject/Patient metadata
Specimen metadata
Specimen timestamps
Subject/Patient history
Technical metadata
Specimen interpretation
Original VCF data

Results - System Architecture and Data Flow

[Figure: system architecture and data flow diagram]
Project Scope
Problem definition: Discrete genomic data are present within individual laboratories' in-house interpretation software, but these data are not easily accessible to clinicians and
researchers since only full-text reports are available in the LIS/EHR.

Data Science Platform

Data stored within the Yale-New Haven Health Helix Data Science platform
Secure big data platform with access to traditional clinical data warehouse

Conclusions

With increased interest in precision medicine initiatives, the development of genomic data standards compatible with
modern technologies is needed for efficient data exchange. This novel platform and data format allow for the storage
of large volumes of standardized genomic interpretations from separate laboratories while providing the sub-second
queries needed for clinical use.

Solution requirements:
1. Develop a JSON object model to store specimen and variant data
2. Create a central data repository to store all clinical genomic variants
3. Implement web service endpoints to accept JVF data objects from laboratories
4. Architect a solution for basic data analysis
5. Identify approaches to integrate data from the traditional data warehouse (Epic Clarity)
6. Integrate with existing systems for cohort identification and analysis
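
Requirement 3 implies that submitted objects are checked before they are accepted. The following is a minimal sketch of such a check, not the project's actual validator: the field names (`specimen_id`, `patient_id`, `variants`, `chrom`, `pos`, `ref`, `alt`) and the function `validate_submission` are assumptions made for illustration, and the real JVF schema may differ.

```python
import json
from typing import List

# Hypothetical required fields, chosen for illustration only.
REQUIRED_SPECIMEN_FIELDS = ("specimen_id", "patient_id", "variants")
REQUIRED_VARIANT_FIELDS = ("chrom", "pos", "ref", "alt")


def validate_submission(payload: str) -> List[str]:
    """Return a list of validation errors; an empty list means the object passed."""
    try:
        doc = json.loads(payload)
    except json.JSONDecodeError as exc:
        return ["invalid JSON: " + exc.msg]
    if not isinstance(doc, dict):
        return ["top-level value must be a JSON object"]

    # Check the specimen-level fields first.
    errors = ["missing field: " + name
              for name in REQUIRED_SPECIMEN_FIELDS if name not in doc]

    variants = doc.get("variants", [])
    if not isinstance(variants, list):
        return errors + ["variants must be a list"]

    # Then check each variant entry.
    for i, variant in enumerate(variants):
        if not isinstance(variant, dict):
            errors.append("variants[%d] must be an object" % i)
            continue
        for name in REQUIRED_VARIANT_FIELDS:
            if name not in variant:
                errors.append("variants[%d] missing field: %s" % (i, name))
    return errors
```

A web service endpoint could reject any submission for which this function returns a non-empty list and return the messages to the submitting laboratory.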

Design

Our institution performs clinical next-generation sequencing at three separate, local
laboratories. This project included the development of a JavaScript Object Notation (JSON)-formatted data model, named JSON Variant Format (JVF), and a Hadoop-based platform
(Hortonworks, Santa Clara, CA, USA) for centralized data management.

1. Annotated JVF data are submitted to a secured web service interface (hosted in a Docker container) for validation
2. Validated JVF is submitted to a Kafka messaging queue where it is consumed by NiFi
3. NiFi merges original data into Avro files for compression and increased query efficiency
4. Avro files are compressed and placed in the Hadoop Distributed File System (HDFS) for long-term storage
5. Variant data are denormalized
6. Variant-level documents are stored within an Elasticsearch index
7. Kibana interface can be used for real-time visualizations
8. Elasticsearch index can be queried by in-house annotation software (VarBase) to provide variant-level statistics
9. Data stored in HDFS can be used for batch analysis or reprocessing
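
Steps 5 and 6 above can be illustrated with a short sketch. It assumes a specimen-level object that carries a list of variants and flattens it into one document per variant, each repeating the specimen metadata, which is the shape a search index typically expects. The field names and the function `denormalize` are hypothetical and are not taken from the JVF specification.

```python
from collections import Counter
from typing import Any, Dict, List


def denormalize(specimen_doc: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten one specimen-level object into variant-level documents.

    Every output document repeats the specimen-level fields, so each variant
    can be queried without a join. If a variant field has the same name as a
    specimen field, the variant value takes precedence.
    """
    shared = {key: value for key, value in specimen_doc.items() if key != "variants"}
    return [{**shared, **variant} for variant in specimen_doc.get("variants", [])]


# Illustrative input; all values are made up.
example = {
    "specimen_id": "S1",
    "patient_id": "P1",
    "variants": [
        {"gene": "GENE1", "chrom": "1", "pos": 12345, "ref": "A", "alt": "G"},
        {"gene": "GENE2", "chrom": "2", "pos": 67890, "ref": "C", "alt": "T"},
    ],
}

variant_docs = denormalize(example)

# A simple in-memory stand-in for a variant-level statistic (variants per gene).
per_gene = Counter(doc["gene"] for doc in variant_docs)
```

Repeating the specimen metadata on every document trades storage for query simplicity, which suits the sub-second lookups described here.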

Future Directions

Implement a storage schema in HBase for population analysis
Develop a cohort identification interface for clinical and translational research groups
Integrate cohort identification with Epic SlicerDicer

Results

After interpretation, VCF data are merged with clinical annotations in the JVF
model. This format, based on the existing Variant Call Format (VCF), is readily
serializable for web service integration. The model can be easily extended to
support additional data elements.
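
As a rough illustration of merging VCF data with clinical annotations, the sketch below converts one VCF data line into a JSON-serializable record and attaches an annotation. The eight fixed VCF columns are standard. The output field names (`original_vcf`, `annotation`) and the function `vcf_line_to_variant` are assumptions for this example and do not reproduce the published JVF model.

```python
import json
from typing import Any, Dict

# The eight fixed, tab-separated columns of a VCF data line.
VCF_COLUMNS = ("chrom", "pos", "id", "ref", "alt", "qual", "filter", "info")


def vcf_line_to_variant(line: str, annotation: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON-serializable variant record from one VCF data line."""
    raw = line.rstrip("\n")
    fields = raw.split("\t")
    if len(fields) < len(VCF_COLUMNS):
        raise ValueError("expected at least 8 tab-separated VCF columns")

    record: Dict[str, Any] = dict(zip(VCF_COLUMNS, fields[:8]))
    record["pos"] = int(record["pos"])   # POS is an integer in VCF
    record["original_vcf"] = raw         # keep the source line for reprocessing
    record["annotation"] = annotation    # clinical interpretation (hypothetical shape)
    return record


# Illustrative values only.
vcf_line = "1\t12345\t.\tA\tG\t50\tPASS\tDP=100"
variant = vcf_line_to_variant(vcf_line, {"gene": "GENE1", "classification": "pathogenic"})
serialized = json.dumps(variant)
```

Because the record is a plain dictionary, extending it with additional data elements only requires adding keys, and the serialized string can be posted to a web service unchanged.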

Acknowledgements
We would like to thank James Knight from the Yale Center for Genome Analysis, Rich Hurley and Aaron
Forstrom from YNHH ITS, and staff and faculty in the Molecular Diagnostics and Tumor Profiling laboratories.
Data model, examples, and documentation can be found at https://github.com/wadeschulz/jvf
