Calibration and validation sets were needed in order to define a threshold between genetic and non-genetic domain citations and to evaluate the performance of our algorithm. Two independent sets of 100 citations each were randomly selected from MEDLINE and manually annotated by two annotators with biological domain knowledge. Their task was to identify citations as being in either the genetic or non-genetic domain. They achieved kappa scores for inter-annotator agreement of 0.81 and 0.91 for the calibration and validation set, respectively. Consensus voting was then used to achieve complete agreement between judges. Both sets are available on request.

To draw a boundary between genetic and non-genetic domain citations, we plotted predictive accuracy against score values, and the threshold parameter was set to maximize accuracy. Predictive accuracy is the overall correctness of the prediction and was calculated as the sum of correct classifications (TP and TN) divided by the total number of classifications (TP + FP + FN + TN). In this formula, TP is the number of true positives (citations are about genetics and are classified into the genetic domain); FP denotes the number of false positives (citations are classified into the genetic domain in the absence of genetic contents); FN refers to the number of false negatives (citations are wrongly classified into the non-genetic domain), and TN is the number of true negatives (citations are correctly classified into the non-genetic domain). The threshold can be adjusted […]

[Figure 1 diagram boxes: Medline Baseline Repository; Entrez Gene gene2pubmed; Background Corpus; Genetic Domain Corpus; MeSH Descriptor Extraction & Frequency Count (two boxes); Frequency Profile Table; Chi-Square Statistic.]
Figure 1. Flow diagram for the preparation of frequency profile table from knowledge sources.

The algorithm proceeds by reading each MEDLINE citation c in turn and assigning a decision score to it as follows:

    Score(c) = 0
    For each MeSH descriptor d
        If d is a positive indicator
            Score(c) = Score(c) + 1
        Else if d is a negative indicator
            Score(c) = Score(c) - 1

The output of this process is a list of scores for all the citations, with the highest total given to those citations containing MeSH descriptors typical for the genetic domain.
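
The scoring loop and the accuracy-based choice of threshold described in the text can be illustrated with a minimal Python sketch. It is not the authors' implementation. The indicator sets, the toy citations, and the function names (`score_citation`, `accuracy`, `best_threshold`) are hypothetical, and the rule that a citation is classified as genetic when its score is greater than or equal to the threshold is an assumption, since the text does not state how ties at the cutoff are handled.

```python
from typing import Iterable, Sequence, Set


def score_citation(descriptors: Iterable[str],
                   positive: Set[str],
                   negative: Set[str]) -> int:
    """Add 1 for each positive indicator, subtract 1 for each negative one."""
    score = 0
    for d in descriptors:
        if d in positive:
            score += 1
        elif d in negative:
            score -= 1
    return score


def accuracy(scores: Sequence[int],
             labels: Sequence[bool],
             threshold: int) -> float:
    """(TP + TN) / (TP + FP + FN + TN) for a given cutoff.

    Assumption: a citation is predicted genetic when score >= threshold.
    """
    tp = fp = fn = tn = 0
    for s, is_genetic in zip(scores, labels):
        predicted_genetic = s >= threshold
        if predicted_genetic and is_genetic:
            tp += 1
        elif predicted_genetic and not is_genetic:
            fp += 1
        elif not predicted_genetic and is_genetic:
            fn += 1
        else:
            tn += 1
    return (tp + tn) / (tp + fp + fn + tn)


def best_threshold(scores: Sequence[int], labels: Sequence[bool]) -> int:
    """Return the observed score value that maximizes accuracy.

    Ties are resolved in favor of the lowest cutoff (an arbitrary choice).
    """
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: accuracy(scores, labels, t))


# Hypothetical indicator sets; in the paper they come from the chi-square step.
POSITIVE = {"Mutation", "Genes", "Chromosome Mapping"}
NEGATIVE = {"Polymers", "Stress, Mechanical"}

# Toy calibration set: (MeSH descriptors, manually assigned "genetic" label).
calibration = [
    (["Mutation", "Genes", "Humans"], True),
    (["Polymers", "Stress, Mechanical", "Models, Chemical"], False),
    (["Genes", "Polymers", "Humans"], False),
    (["Mutation", "Humans"], True),
]

scores = [score_citation(d, POSITIVE, NEGATIVE) for d, _ in calibration]
labels = [label for _, label in calibration]
cutoff = best_threshold(scores, labels)

print("scores:", scores)
print("cutoff:", cutoff, "accuracy:", accuracy(scores, labels, cutoff))
```

On this toy data the scores are 2, -2, 0, and 1, and the accuracy-maximizing cutoff is 1. Only observed score values are tried as candidate cutoffs, because accuracy can change only at those values.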
Results

[Figure residue: axis labels Accuracy, Cutoff, and Recall; marked points A and B.]
Figure 2. Calibration plot.

The following example illustrates the use of our classification algorithm. We retrieved two MEDLINE citations titled “Strain-dependent localization, microscopic deformations, and macroscopic normal tensions in model polymer networks” (PMID: 15697942) and “Recessive motor neuron diseases: mutations in the ALS2 gene and molecular pathogenesis for the upper motor neurodegeneration” (PMID: 15651293) with 8 and 7 corresponding MeSH descriptors, respectively. Three of the descriptors in the latter example were also […]

[…] citations classification.

The graph goes through two points. The first (0,1) is where the classifier detects no genetically relevant citations. In this case it always gets the non-genetic citations right but it gets all genetic domain citations wrong. The second point (1,0) is where all citations are classified as genetically relevant. So the classifier gets all genetically relevant citations right, but it gets all non-genetic citations wrong. As expected, a trade-off between precision and recall exists that can be tuned, mainly by modifying the threshold parameter. The optimal accuracy with recall and precision already mentioned above is defined at point A. For example, if we increase the threshold (B = 5), we classify citations with more precision (Pre = 1.00) but with decreased recall (Rec = 0.18). On the other
We are grateful to Susanne M. Humphrey and Thomas C. Rindflesch for helpful suggestions and comments. This work was supported by Slovenian Research Agency Grant J3-7411.