Wade L. Schulz, MD, PhD; Michael D'Eletto, MS; Steve Shane, BS; William Byron, MBA; Charles Torre, MS; Thomas JS Durant, MD; Richard Torres, MD, MS; Peter Gershkovich, MD, MHA
Department of Laboratory Medicine, Yale School of Medicine; Department of Pathology, Yale School of Medicine; Department of Information Technology Systems, Yale-New Haven Health System
Context
The role of clinical next generation sequencing continues to increase. A key goal for many
institutions is to bring these genomic results to the bedside to improve patient outcomes.
While the standards for the exchange of clinical genomic data have begun to emerge,
molecular results are often found in disparate data silos, which makes it difficult to obtain a
complete history of genomic testing that can be used in clinical care.
[Figure: system diagram showing JVF data flow from the Whole Exome/Genome and Solid Tumors laboratories to downstream consumers, including VarBase. Legend: Subject/Patient metadata, Specimen metadata, Specimen timestamps, Subject/Patient history, Technical metadata, Specimen interpretation, Original VCF data.]
Project Scope
Problem definition: Discrete genomic data are present within individual laboratories' in-house interpretation software, but these data are not easily accessible to clinicians and researchers because only full-text reports are available in the LIS/EHR.
Data stored within the Yale-New Haven Health Helix Data Science platform
Secure big data platform with access to traditional clinical data warehouse
Conclusions
With increased interest in precision medicine initiatives, the development of genomic data standards compatible with
modern technologies is needed for efficient data exchange. This novel platform and data format allow for the storage
of large volumes of standardized genomic interpretations from separate laboratories while providing the sub-second
queries needed for clinical use.
Solution requirements:
1. Develop a JSON object model to store specimen and variant data
2. Create a central data repository to store all clinical genomic variants
3. Implement web service endpoints to accept JVF data objects from laboratories
4. Architect a solution for basic data analysis
5. Identify approaches to integrate data from traditional data warehouse (Epic Clarity)
6. Integrate with existing systems for cohort identification and analysis
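As a rough illustration of requirement 1, a JVF-style document might pair specimen and subject metadata with a list of interpreted variants. The field names below are illustrative only; the actual schema is documented in the GitHub repository cited at the end of this poster.

```python
import json

# Hypothetical JVF-style document; field names are illustrative, not
# the published schema (see https://github.com/wadeschulz/jvf).
jvf_doc = {
    "subject": {"mrn": "000000", "sex": "F"},
    "specimen": {
        "id": "SPEC-001",
        "type": "solid_tumor",
        "collected": "2018-01-15T09:30:00Z",
    },
    "technical": {"assay": "targeted_panel", "pipeline_version": "1.0"},
    "variants": [
        {
            "chrom": "7",
            "pos": 140453136,
            "ref": "A",
            "alt": "T",
            "gene": "BRAF",
            "interpretation": "pathogenic",
        }
    ],
}

# JSON serialization makes the object ready for web-service exchange
payload = json.dumps(jvf_doc)
restored = json.loads(payload)
```

Because the model is plain JSON, it round-trips losslessly through serialization, which is what makes the web-service submission in the Design section straightforward.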
Design
Our institution performs clinical next generation sequencing at three separate, local
laboratories. This project included the development of a JavaScript Object Notation (JSON)-formatted data model, named the JSON Variant Format (JVF), and a Hadoop-based platform
(Hortonworks, Santa Clara, CA, USA) for centralized data management.
1. Annotated JVF data are submitted to a secured web service interface (hosted in a Docker container) for validation
2. Validated JVF is submitted to a Kafka messaging queue where it is consumed by NiFi
3. NiFi merges original data into Avro files for compression and increased query efficiency
4. Avro files are compressed and placed in the Hadoop Distributed File System (HDFS) for long-term storage
5. Variant data are denormalized
6. Variant-level documents are stored within an Elasticsearch index
7. Kibana interface can be used for real-time visualizations
8. Elasticsearch index can be queried by in-house annotation software (VarBase) to provide variant-level statistics
9. Data stored in HDFS can be used for batch analysis or reprocessing
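Step 5 above can be sketched as follows: one submitted JVF document, which groups many variants under a single specimen, is flattened into one record per variant so that each Elasticsearch document carries its full specimen and subject context. Field names here are illustrative, not the production mapping.

```python
# Sketch of denormalization (step 5): flatten a JVF document into
# variant-level records suitable for an Elasticsearch index.
def denormalize(jvf_doc):
    specimen = jvf_doc["specimen"]
    subject = jvf_doc["subject"]
    docs = []
    for variant in jvf_doc["variants"]:
        doc = dict(variant)  # copy the variant-level fields
        # Carry specimen/subject context into each variant record
        doc["specimen_id"] = specimen["id"]
        doc["subject_mrn"] = subject["mrn"]
        docs.append(doc)
    return docs

jvf_doc = {
    "subject": {"mrn": "000000"},
    "specimen": {"id": "SPEC-001"},
    "variants": [
        {"chrom": "7", "pos": 140453136, "ref": "A", "alt": "T"},
        {"chrom": "17", "pos": 7577121, "ref": "G", "alt": "A"},
    ],
}

variant_docs = denormalize(jvf_doc)
```

Denormalizing up front trades storage for query speed: a single index lookup can filter on both variant and specimen fields without joins, which supports the sub-second queries noted in the Conclusions.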
Future Directions
Results
After interpretation, VCF data are merged with clinical annotations in the JVF
model. This format, based on the existing variant call format (VCF), is readily
serializable for web service integration. The model can be easily extended to
support additional data elements.
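The merge described above can be sketched in miniature: a single VCF data line (whose fixed columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) is parsed and combined with a post-sign-out clinical annotation into one variant entry. The annotation field name is an assumption for illustration, not the published JVF schema.

```python
# Hedged sketch: merge one standard VCF data line with a clinical
# annotation into a JVF-style variant entry.
vcf_line = "7\t140453136\t.\tA\tT\t60\tPASS\tDP=250"
fields = vcf_line.split("\t")

variant = {
    "chrom": fields[0],       # CHROM
    "pos": int(fields[1]),    # POS
    "ref": fields[3],         # REF
    "alt": fields[4],         # ALT
    # Clinical annotation added after interpretation/sign-out;
    # the field name is illustrative only.
    "interpretation": "pathogenic",
}
```

Because the original VCF columns are preserved alongside the annotation, the source data can be recovered or reprocessed from the merged record.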
Acknowledgements
We would like to thank James Knight from the Yale Center for Genome Analysis, Rich Hurley and Aaron
Forstrom from YNHH ITS, and staff and faculty in the Molecular Diagnostics and Tumor Profiling laboratories.
Data model, examples, and documentation can be found at https://github.com/wadeschulz/jvf