Phenotype Information For Existing GWAS Studies

Phenotype Information Retrieval for Existing GWAS Studies
Neda Alipanah, Ph.D., University of California San Diego, March 2013
Motivation

The database of Genotypes and Phenotypes (dbGaP) archives the results of different Genome-Wide Association Studies (GWAS).
Phenotype variables are not harmonized across studies.
Redundant phenotype identifiers exist for the same phenotype.
dbGaP lacks semantic relations among its variables.
Search on phenotypes is not accurate.
Goals

Standardize dbGaP information to allow accurate, reusable, and quick retrieval of information.
Problem Statement (Example of Redundant Variables)

dbGaP Structure
dbGaP Study phs000007.v18.p7 (both variables)
id=phv00003636.v1, Description=HEART: HYPERTENSIVE HEART DISEASE, name=FK414, version=1, Logical Max=--, Logical Minimum=--, unit=--, type=string
id=phv00008678.v3, Description=CDI: HYPERTENSIVE HEART DISEASE, name=C334, version=3, Logical Max=--, Logical Minimum=--, unit=--, type=text
N. Alipanah, H. Kim, L. Ohno-Machado: Building an Ontology of Phenotypes for Existing GWAS Studies. Healthcare Informatics, Imaging and Systems Biology (HISB), 2012 IEEE Second International Conference on, pp. 111, 27-28 Sept. 2012.
Problem Statement (Example of Semantic Relation)

dbGaP Structure
dbGaP Study phs000284.v1.p1 (both variables)
id=phv00123020.v1, Description=CVD: self report of MD dx of cvd, name=cvd, version=1, value=Yes, No, Not assessed
id=phv00123021.v1, Description=CVD: self report of MD dx of cvd (missing recoded as no), name=cvdx, version=1, value=Yes, No, Not assessed
Proposed Solution
Build an information model.
Index the phenotype variables semantically, with no redundancy.

Example:
[Figure: hierarchy with the concepts Cardiovascular Disease and Heart Disease; the variables phv00124261.v1, phv00123021.v1, phv00123020.v1, and phv00008678.v3 are attached as instances.]
Methods
I. String-based Variables Distance Calculation
II. Semantic Hierarchy Extraction on Revised Clusters
III. Classification and Ontology Creation
IV. sdGaP Information Retrieval
I. String-based Variables Distance Calculation

1. Property Extraction: Name, Description, Type, Unit, and (Max-Min) values.
2. UMLS Expansion: expand the variable Description with MetaMap.
3. Distance Computation:
Description: Vector Space Model matching
Name Similarity: Jaro-Winkler string matcher
Type: exact string match
Unit: exact string match
(Max-Min) values: subset matching
4. Build Distance Matrix: compute the distance between every pair of variables.
5. Cluster Based on Distance Matrix: variables with the same distance to other variables are clustered together.
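The distance computation in steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the presented implementation: the unweighted mean used to combine the per-property similarities, the tokenizer, and the function names are assumptions, the MetaMap expansion and the (Max-Min) subset matching are left out, and the two sample variables are the redundant pair shown in the problem statement.

```python
import math
import re
from collections import Counter


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro-Winkler similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    jaro = (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3
    # Winkler bonus for a shared prefix of up to four characters.
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1 - jaro)


def cosine_similarity(text1, text2):
    """Vector space model: cosine of the term-frequency vectors."""
    v1 = Counter(re.findall(r"[a-z0-9]+", text1.lower()))
    v2 = Counter(re.findall(r"[a-z0-9]+", text2.lower()))
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0


def variable_distance(a, b):
    """1 minus the unweighted mean of the similarities (assumed weighting)."""
    sims = [
        cosine_similarity(a["description"], b["description"]),
        jaro_winkler(a["name"], b["name"]),
        1.0 if a["type"] == b["type"] else 0.0,
        # "--" (missing) units compare as equal here, a simplification.
        1.0 if a["unit"] == b["unit"] else 0.0,
    ]
    return 1.0 - sum(sims) / len(sims)


variables = [
    {"id": "phv00003636.v1", "name": "FK414", "type": "string", "unit": "--",
     "description": "HEART: HYPERTENSIVE HEART DISEASE"},
    {"id": "phv00008678.v3", "name": "C334", "type": "text", "unit": "--",
     "description": "CDI: HYPERTENSIVE HEART DISEASE"},
]

# Pairwise distance matrix over all variables (step 4).
matrix = [[variable_distance(a, b) for b in variables] for a in variables]
for row in matrix:
    print(["%.3f" % d for d in row])
```

In this sketch the two descriptions are close in the vector space model while the names and types differ, so a clustering step would need a threshold or weighting tuned to decide whether the pair counts as redundant.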
II. Semantic Hierarchy Extraction on Revised Clusters

1. Build a string-based distance matrix for the variables in a single assigned cluster.
2. Sub-cluster the variables and calculate semantically relevant (similar) variables.
3. Assign labels to sub-clusters based on the relevant UMLS Concept Unique Identifier (CUI).
4. Perform re-clustering to find smaller groups of relevant variables.
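One way to picture steps 2 and 3 is the sketch below. It is only an assumption about how sub-clustering and labeling could be done: it swaps in threshold-based single-linkage grouping, labels each group with its most frequent CUI, and uses hypothetical variable names, distances, and placeholder CUI strings rather than real dbGaP or UMLS values.

```python
from collections import Counter


def sub_cluster(ids, dist, threshold):
    """Single-linkage grouping: link variables whose distance <= threshold."""
    parent = {v: v for v in ids}

    def find(v):
        # Follow parent links to the group representative.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if dist[(a, b)] <= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for v in ids:
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())


def label_group(group, cuis):
    """Label a sub-cluster with the CUI that annotates most of its members."""
    counts = Counter(c for v in group for c in cuis.get(v, []))
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical variables, distances, and placeholder CUIs (illustrative only).
ids = ["var_a", "var_b", "var_c", "var_d"]
dist = {
    ("var_a", "var_b"): 0.10, ("var_a", "var_c"): 0.80,
    ("var_a", "var_d"): 0.90, ("var_b", "var_c"): 0.70,
    ("var_b", "var_d"): 0.85, ("var_c", "var_d"): 0.20,
}
cuis = {
    "var_a": ["CUI_HEART"], "var_b": ["CUI_HEART", "CUI_OTHER"],
    "var_c": ["CUI_SLEEP"], "var_d": ["CUI_SLEEP"],
}

groups = sub_cluster(ids, dist, threshold=0.3)
labels = [label_group(g, cuis) for g in groups]
for group, label in zip(groups, labels):
    print(label, group)
```

Re-clustering (step 4) would repeat the same grouping inside each labeled group with a smaller threshold.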
III. Classification and (sdGaP) Ontology Creation

1. Start with the UMLS semantic network.
2. Extract the corresponding subclass (PAR/CHD) hierarchy using the UMLS hierarchy table (MRREL table).
3. Instantiate the phenotype variables to the UMLS CUIs (not for higher levels).
4. Populate the related constraints in sdGaP.
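Step 2 could look roughly like the sketch below, which reads pipe-delimited rows in the MRREL column order (CUI1 first, REL fourth, CUI2 fifth) and collects a parent-to-children map. The direction convention (REL describes the relationship CUI2 has to CUI1) is stated here as an assumption to check against the UMLS reference manual, and the rows use placeholder CUI strings built in memory instead of a real MRREL.RRF file.

```python
from collections import defaultdict


def parse_children(lines):
    """Build a parent -> children map from MRREL-style rows."""
    children = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("|")
        cui1, rel, cui2 = fields[0], fields[3], fields[4]
        if rel == "CHD":
            # Assumed reading: CUI2 is a child of CUI1.
            children[cui1].add(cui2)
        elif rel == "PAR":
            # Assumed reading: CUI2 is a parent of CUI1.
            children[cui2].add(cui1)
    return children


def descendants(children, root):
    """All concepts reachable from root through child links."""
    seen, stack = set(), [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def make_row(cui1, rel, cui2):
    """Hypothetical helper: a 16-field row with only CUI1, REL, CUI2 filled."""
    return "|".join([cui1, "", "", rel, cui2] + [""] * 11) + "|"


# Placeholder CUIs, not real UMLS identifiers.
rows = [
    make_row("CUI_CVD", "CHD", "CUI_HEART"),
    make_row("CUI_HYPERTENSIVE_HEART", "PAR", "CUI_HEART"),
    make_row("CUI_CVD", "RO", "CUI_UNRELATED"),  # non-hierarchical, ignored
]

hierarchy = parse_children(rows)
print(sorted(descendants(hierarchy, "CUI_CVD")))
```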
IV. sdGaP Information Retrieval

1. Use the sdGaP ontology structure to expand the query.
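Query expansion over a subclass hierarchy can be sketched as below. The concept names come from the slides, but the subclass link, the variable identifiers, and their assignment to concepts are hypothetical placeholders, and the function names are illustrative.

```python
def expand_query(concept, subclasses):
    """Return the concept followed by all of its transitive subclasses."""
    expanded, stack = [concept], [concept]
    while stack:
        for child in subclasses.get(stack.pop(), []):
            if child not in expanded:
                expanded.append(child)
                stack.append(child)
    return expanded


def retrieve(concept, subclasses, instances):
    """Collect the variables instantiated under any expanded concept."""
    hits = []
    for term in expand_query(concept, subclasses):
        for variable in instances.get(term, []):
            if variable not in hits:
                hits.append(variable)
    return hits


# Hypothetical ontology fragment and placeholder variable identifiers.
subclasses = {"Cardiovascular Disease": ["Heart Disease"]}
instances = {
    "Cardiovascular Disease": ["var_1", "var_2"],
    "Heart Disease": ["var_3"],
}

print(retrieve("Cardiovascular Disease", subclasses, instances))
```

Without expansion the query would only reach the variables attached directly to the queried concept, which is the gap the subclass expansion is meant to close.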

Density Measure (DM)
Example graph 1: Density(A)=3, Density(B)=0, Density(D)=0
Example graph 2: Density(A)=2, Density(D)=1, Density(B)=1, Density(E)=1, Density(C)=0
IV. Result

Dataset: Cleveland Family Study (CFS), phs000284.v1.p1, with 5 data sets and 2,339 phenotype variables.
X-means clustering was run with the Weka tool.
The X-means clustering resulted in 35 clusters of relevant variables.
Domain expert reviewers reorganized these into 23 clusters.
IV. Result of Concept-based Retrieval (Improvement of Subclass Expansion)
Query = Cardiovascular Disease

[Figure: ontology fragment with the concepts Cardiovascular Disease and Heart Disease and the variables phv00123020.v1, phv00123021.v1, phv00122274.v1, phv00122277.v1, phv00122280.v1, phv00122281.v1, phv00122283.v1, phv00122284.v1, phv00122285.v1, and phv00122286.v1 attached to them.]

Query Expansion = {Heart} in the Disease cluster
Recall improvement: from 2/45 = 0.04 to 18/45 = 0.4
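The reported recall figures follow from the usual definition, recall = relevant variables retrieved / all relevant variables. A small check of the arithmetic, using only the counts given on the slide (the function name is illustrative):

```python
def recall(retrieved_relevant, total_relevant):
    """Fraction of the relevant variables that the query retrieved."""
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    return retrieved_relevant / total_relevant


before = recall(2, 45)   # query on "Cardiovascular Disease" alone
after = recall(18, 45)   # query expanded with {Heart}

print(f"before: {before:.2f}, after: {after:.2f}, ratio: {after / before:.1f}")
```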
Conclusion
Extracting a standard, reusable information model from UMLS.
Improving information retrieval by organizing phenotype variables and instantiating them to the data model.
Limitation
Clustering based on distance calculation is a semi-automated computation.
Instantiating variables to lower levels of the hierarchy needs domain expert review.
Only instances of the lower levels of the hierarchies are considered in ontology building.
For large data, distance calculation and clustering need more advanced algorithms.
Acknowledgement
Supported by grants UH2HL108785 (NHLBI) and R01HS019913 (AHRQ).
Under the supervision of Dr. Lucila Ohno-Machado.
