Brian L. Cairns, MS1, Rodney D. Nielsen, PhD1, James J. Masanz, MS2, James H. Martin,
PhD1, Martha S. Palmer, PhD1, Wayne H. Ward, PhD1, Guergana K. Savova, PhD3
1University of Colorado at Boulder, Boulder, CO; 2Mayo Clinic College of Medicine, Rochester, MN; 3Children's Hospital Boston and Harvard Medical School, Boston, MA
Abstract
The Multi-source Integrated Platform for Answering Clinical Questions (MiPACQ) is a QA pipeline that integrates
a variety of information retrieval and natural language processing systems into an extensible question answering
system. We present the system’s architecture and an evaluation of MiPACQ on a human-annotated evaluation
dataset based on the Medpedia health and medical encyclopedia. Compared with our baseline information retrieval
system, the MiPACQ rule-based system demonstrates 84% improvement in Precision at One and the MiPACQ
machine-learning-based system demonstrates 134% improvement. Other performance metrics including mean
reciprocal rank and area under the precision/recall curves also showed significant improvement, validating the
effectiveness of the MiPACQ design and implementation.
Introduction
The increasing requirement to deliver higher quality care with increased speed and reduced cost has forced
clinicians to adopt new approaches to how diagnoses are made and care is administered. One approach, evidence-
based medicine, focuses on the integration of the current best clinical expertise and research into the ongoing
practice of medicine 1.
One positive side effect of this pressure on clinicians to make better use of medical evidence with less time and effort has been the emergence of practice-enhancing technological solutions. Medical citation databases such as PubMed
have enabled immediate access to a wide array of research over the Internet. In many cases, medical information
retrieval systems allow clinicians to quickly locate articles that are specific to their information need. However,
given that medical information retrieval systems generally return results only at a coarse level, and given that
clinicians have a minimal amount of time to search for answers – often as little as two minutes – there is substantial
room for improvement 2. The adoption of natural language processing (NLP) techniques allows for the creation of
clinical question answering systems that better understand the clinician’s query and can more precisely serve the
user’s information need. Such systems typically accept clinical questions in free-text and produce a fine-grained list
of potential answers to the question. The result is an improvement in retrieval performance and a reduction in the
effort required by the clinician.
The Multi-source Integrated Platform for Answering Clinical Questions (MiPACQ) is an integrated framework for
semantic-based question processing and information extraction. Using NLP and information retrieval (IR)
techniques, MiPACQ accepts free-text clinical questions and presents the user with succinct answers from a variety
of sources 3 such as general medical encyclopedic resources and the patient’s data residing in the Electronic Medical
Record (EMR). We present a design and implementation overview of the MiPACQ system along with an evaluation
dataset that includes clinical questions and human-annotated answers taken from the Clinques 4 question set and
Medpedia 5 online medical encyclopedia. We demonstrate that the MiPACQ system shows significant improvement
over the existing Medpedia search engine and a Lucene 6 baseline information retrieval system on standard question
answering evaluation metrics such as Precision at One and mean reciprocal rank (MRR).
Prior Work
The use of readily-available Internet search engines such as Google has become commonplace in clinical contexts:
Schwartz et al. 7 surveyed emergency medicine residents and found that they considered Internet searches to be
highly reliable and routinely used them to answer clinical questions in the emergency department. Unfortunately, the
same study found that the residents had difficulty distinguishing between reliable and unreliable answers. Worse,
while residents in the study expressed a high degree of confidence in the reliability of answers found through
Internet searches, when tested the residents provided incorrect answers 33% of the time. When Athenikos and Han 8
surveyed biomedical QA systems, they asserted that “current medical QA approaches have limitations in terms of
the types and formats of questions they can process.”
Some projects have attacked the question answering problem using syntactic information and text-based analysis.
Yu et al. created such a system 9 based on Google definitions of terms from the UMLS 10. The system can provide
summarized answers to definitional questions and provides context by citing relevant MEDLINE articles. In
subjective evaluations with physicians, Yu’s system performed well on a variety of satisfaction metrics; however, only definitional questions were tested.
There are a number of clinical question answering systems that make use of semantic information in clinical
documents. Demner-Fushman et al. 11 developed a system of feature extractors that identify subject populations, interventions, and outcomes in medical studies. Demner-Fushman et al. later extended this research to create a
general-purpose QA system 12. This system is similar in principle to the MiPACQ architecture, but smaller in scope
as it retrieves answers only at the document level and targets a more restricted answer corpus (MiPACQ returns
answers at the paragraph level and targets arbitrary information sources). Wang et. al. 13 experimented with UMLS
semantic relationships and created a system that extracts exact answers from the output of a Lucene-based
information retrieval system; however, their approach requires an exact semantic match, which limits performance
on more complex and non-factoid questions. Athenikos et al. 14 propose a rule-based system built on human-developed question/answer UMLS semantic patterns; such an approach is likely to perform well on specific question types but might have difficulty generalizing to new questions. AskHermes 15 finds answers based on keyword
searching.
The MiPACQ system is distinguished from existing clinical QA systems in several ways. MiPACQ is broader in
scope because it integrates numerous NLP components to enable deeper semantic understanding of medical
questions and resources and it is designed to allow integration with a wide range of information sources and NLP
systems. Additionally, unlike most medical QA research, MiPACQ was designed from the start to provide an
extensible framework for the integration of new system components. Finally, MiPACQ uses machine learning (ML)
based re-ranking rather than the fixed rule-based re-ranking used in most other research.
System Architecture
[Figure 1: MiPACQ system architecture. The diagram shows the point-of-care clinician or lab investigator interface, the ClearTK framework, the LexEVS/LexGrid terminology APIs, the Lucene-based IR system, the Clinques corpus, and the data index of features, annotations, and relevance judgments.]
The MiPACQ system integrates multiple NLP tools into a single system that provides query formulation, automatic
question and candidate answer annotation, answer re-ranking, and training for the various components. The system
is based on ClearTK 16, a UIMA-based 17 toolkit developed by the computational semantics research group 18 at the
University of Colorado at Boulder. ClearTK provides common data formats and interfaces for many popular NLP
and ML systems.
Figure 1 provides an overview of the MiPACQ architecture (for further details see Nielsen et al., 2010 3). Questions
are submitted by a clinician or investigator using the web-based user interface. The question is then processed by the
annotation pipeline which adds semantic annotations. MiPACQ then interfaces with the information retrieval (IR)
system which uses term overlap to retrieve candidate answer paragraphs. Candidate answer paragraphs are annotated
using the annotation pipeline. Finally, the paragraphs are re-ordered by the answer re-ranking system based on the
semantic annotations, and the results are presented to the user.
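The four-stage flow described above can be sketched as follows. This is a minimal illustration with toy stand-ins (token overlap in place of Lucene and of the semantic re-ranker), not the actual MiPACQ implementation; all names here are invented for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    text: str
    ir_score: float
    annotations: dict = field(default_factory=dict)

def annotate(text):
    # Stand-in for the annotation pipeline (cTAKES etc.):
    # here we only record a bag of lowercased tokens.
    return {"tokens": set(text.lower().split())}

def retrieve(question, corpus, k=100):
    # Stand-in for the Lucene term-overlap retrieval stage.
    q_tokens = set(question.lower().split())
    scored = [Candidate(p, len(q_tokens & set(p.lower().split()))) for p in corpus]
    return sorted(scored, key=lambda c: c.ir_score, reverse=True)[:k]

def rerank(q_ann, candidates):
    # Stand-in for the semantic re-ranker: prefer candidates that
    # share annotations (here, tokens) with the question.
    for c in candidates:
        c.annotations = annotate(c.text)
        c.rerank_score = len(q_ann["tokens"] & c.annotations["tokens"])
    return sorted(candidates, key=lambda c: c.rerank_score, reverse=True)

def answer(question, corpus):
    q_ann = annotate(question)               # 1. annotate the question
    candidates = retrieve(question, corpus)  # 2. retrieve candidate paragraphs
    return rerank(q_ann, candidates)         # 3-4. annotate candidates, re-rank
```

In the real system each stand-in is replaced by the corresponding component: cTAKES-based annotation, Lucene retrieval, and the rule- or ML-based re-ranker.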
Corpus
Medpedia
Because the MiPACQ system is designed to be used with a wide variety of medical resources, we wanted to test its
performance against a medical article database with a broad focus. Medpedia is a Wiki-style collaborative medical
database that targets both clinicians and the general public. Unlike some other Wiki projects (such as Wikipedia),
Medpedia restricts editing to approved physicians and doctoral-degreed biomedical scientists 19. Despite this
restricted authorship, Medpedia’s volunteer-based collaborative editing model results in a broader range of writing
styles and article quality than traditional medical databases or journals. This variability provides a good test of the
MiPACQ system’s ability to generalize to noisy and diverse data. Additionally, because Medpedia’s text is freely
available under a Creative Commons license, we will be able to distribute our full-text index, annotation guidelines,
and annotations to other researchers late in 2011.
Because Medpedia is a continually evolving corpus, we took a snapshot of Medpedia’s full text on April 26th, 2010.
This snapshot has served as our Medpedia corpus throughout the MiPACQ project. Articles were broken into
paragraphs based on HTML “p” tag boundaries, and all other HTML tags (including formatting, images, and hyperlinks) were stripped.
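The paragraph-splitting step can be sketched with the standard-library HTML parser; this is an illustration of the approach (split on “p” boundaries, strip all other markup), not the project's actual preprocessing code.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Split article HTML into paragraphs on <p> boundaries,
    discarding all other tags (formatting, images, hyperlinks)."""
    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p":
            # Normalize internal whitespace and keep non-empty paragraphs.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

def split_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    return parser.paragraphs
```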
Clinical Questions
The clinical question data is a set of full text questions and Medpedia article and paragraph relevance judgments. We
started with 4654 clinical questions from the Clinques corpus collected by Ely et al. from interviews with
physicians 4. During the interviews, which were conducted between patient visits, the physicians were asked what
non-patient-specific questions came up during the preceding visit, including “vague fleeting uncertainties”.
Given the variability in quality in the Clinques corpus and our annotation resources, we narrowed the question set to
353 questions that were likely to be answered by the Medpedia medical encyclopedia. The exclusions consisted of
questions that were incomprehensible or incomplete, required temporal logic to answer, required qualitative
judgments, or required patient-specific knowledge that was not present in Medpedia 5. The remaining questions were
then randomly divided into training and evaluation sets. A medical information retrieval and annotation specialist
manually formulated and ran queries against the Medpedia corpus and annotated all of the paragraphs in articles that
appeared relevant and in a few of the other top ranked articles.
The human annotator found individual paragraphs in Medpedia that answered 177 of the questions. We discarded
the remaining questions because Medpedia lacked the answer content. The training set contains 80 questions with at
least one paragraph answering the question, and the evaluation set contains 81 questions with at least one answer
paragraph. The annotated questions will also be released in late 2011.
Annotation Pipeline
The main NLP component of the MiPACQ question answering system is the annotation pipeline. The pipeline
incorporates multiple rule-based and ML-based systems that build on each other to produce question and answer
annotations that facilitate re-ranking candidate answers based on semantics (e.g., raising the rank of candidates
containing the expected answer type or that include similar UMLS entities). Although the pipeline stages are the
same for questions and candidate answers (excluding the expected answer type system, which operates only on
questions), all of the ML-based pipeline components use separate models for questions and answers.
cTAKES
The first stage in the annotation pipeline processes the question (or candidate answer) with the cTAKES clinical text
analysis system. cTAKES is a multi-function system 20 that incorporates sentence detection, tokenization, stemming,
part of speech detection, named entity recognition, and dependency parsing. Close collaboration between the
MiPACQ project and the cTAKES project resulted in the addition of several features to cTAKES that are directly
useful in the MiPACQ system. Because MiPACQ builds upon the unified type system used throughout cTAKES,
annotations added by the cTAKES components (such as part-of-speech tags) are readily utilized by later stages in
the MiPACQ annotation pipeline.
cTAKES is designed as a pipeline of connected systems that work together to create useful textual annotations 21.
The pipeline first locates sentences in the document or question using the OpenNLP 22 sentence detector. Individual
sentences are then annotated with token boundaries using a custom rule-based tokenizer that handles details such as
punctuation and hyphenated words. The next pipeline stage wraps the National Library of Medicine (NLM)
SPECIALIST tools 23, providing term canonicalization and lemmatization.
The cTAKES pipeline then performs part-of-speech tagging using a custom UIMA wrapper around the OpenNLP
part of speech tagger. The wrapper adapts OpenNLP’s data format to match the cTAKES unified type system and
provides functionality for building the tag dictionary and tagging model. cTAKES provides a pre-generated standard
tagging model trained on a combination of Penn Treebank 24 and clinical data.
Syntactic parsing is provided in cTAKES using the CLEAR dependency parser 25, a state-of-the-art transition-based
parser developed by the CLEAR computational semantics group at the University of Colorado at Boulder.
Named entity detection in cTAKES is done using a series of dictionary lookup annotators, which identify drug
mentions, anatomical sites, procedures, signs/symptoms, and diseases/disorders using a series of dictionaries based
on the UMLS Metathesaurus. Taken together, these annotators provide good coverage for a variety of medical
named entity types.
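The dictionary lookup idea can be illustrated with a greedy longest-match scan over tokens. The function, dictionary contents, and return format below are invented for illustration; the actual cTAKES annotators use UMLS Metathesaurus-derived dictionaries and more sophisticated matching.

```python
def dictionary_lookup(tokens, dictionary):
    """Greedy longest-match lookup of (possibly multi-word) terms.
    `dictionary` maps a lowercased term to its semantic type,
    e.g. 'disease/disorder' or 'drug'. Returns (start, end, term, type)."""
    max_len = max((len(t.split()) for t in dictionary), default=1)
    mentions, i = [], 0
    while i < len(tokens):
        # Try the longest window first so "myocardial infarction"
        # wins over a hypothetical single-token entry "infarction".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n]).lower()
            if span in dictionary:
                mentions.append((i, i + n, span, dictionary[span]))
                i += n
                break
        else:
            i += 1
    return mentions
```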
The final stage in the cTAKES pipeline annotates the discovered mentions with negation (e.g., “no chest pain”) and status information (such as “history of myocardial infarction”). These annotations are generated using the finite-state-machine-based NegEx algorithm 26.
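A heavily simplified NegEx-style check looks for a negation trigger in a short window before the mention. This sketch omits the real algorithm's phrase triggers, post-mention triggers, and scope terminators; the trigger list and function name are invented for illustration.

```python
NEGATION_TRIGGERS = {"no", "denies", "without", "absent"}

def is_negated(sentence, mention, window=5):
    """Toy NegEx-style check: the mention counts as negated if a
    trigger word appears within `window` tokens before it."""
    tokens = sentence.lower().split()
    m = mention.lower().split()
    for i in range(len(tokens) - len(m) + 1):
        if tokens[i:i + len(m)] == m:
            pre = tokens[max(0, i - window):i]
            return any(t in NEGATION_TRIGGERS for t in pre)
    return False  # mention not found
```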
fewer, when Medpedia did not return at least 50 results). These articles were filtered to include just those that were
present in the original April 26th download. The results were recorded in a database for later analysis.
Document-level baselines were developed just as a means of ensuring that our initial article filter provided
reasonable results. The emphasis of this work is on performing paragraph-level question answering.
part, on the semantic annotations produced by the annotation pipeline. This method is used as a baseline to
demonstrate performance based on a few simple informative features.
There are three components to the rule-based scoring function as described in Equation 1. The first component, S, is
the original score from the paragraph-level baseline system (which is itself the product of document- and paragraph-
level scores from Lucene). This score is then multiplied by the sum of two other components: a bag-of-words
component and a UMLS entity component.
Score(q, a) = S × [ (1 + |W_q ∩ W_a|) / (1 + |W_q|) + (1 + |U_q ∩ U_a|) / (1 + |U_q|) ]          (1)

where W_q and W_a denote the sets of words, and U_q and U_a the sets of UMLS entities, in the question q and candidate answer a respectively.
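Equation 1 can be written directly as a small scoring function. The add-one-smoothed normalization shown here is a reconstruction of the equation; the set definitions are assumptions made explicit in the signature.

```python
def rule_based_score(ir_score, q_words, a_words, q_umls, a_umls):
    """Rule-based re-ranking score (Equation 1 sketch): the paragraph-level
    IR score S, scaled by the sum of an add-one-smoothed bag-of-words
    overlap and an add-one-smoothed UMLS-entity overlap between the
    question (q_*) and candidate answer (a_*) sets."""
    bow = (1 + len(q_words & a_words)) / (1 + len(q_words))
    umls = (1 + len(q_umls & a_umls)) / (1 + len(q_umls))
    return ir_score * (bow + umls)
```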
Results
Evaluation metrics
MRR = (1 / |Q|) × Σ_{i=1}^{|Q|} 1 / rank_i

where |Q| is the number of questions in the evaluation set and rank_i is the rank of the first correct answer returned for question i.
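P@1 and MRR can both be computed from the rank of the first correct answer for each question; the helper below is a straightforward illustration, not the project's evaluation code, and treats a question with no correct answer returned (rank of None) as contributing zero reciprocal rank.

```python
def evaluation_metrics(first_answer_ranks):
    """Compute Precision at One and mean reciprocal rank from a list of
    per-question ranks of the first correct answer (None = no correct
    answer returned)."""
    n = len(first_answer_ranks)
    p_at_1 = sum(1 for r in first_answer_ranks if r == 1) / n
    mrr = sum(1.0 / r for r in first_answer_ranks if r is not None) / n
    return p_at_1, mrr
```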
Document Level
System P@1 MRR AUC
Medpedia 0.123 0.134 0.062
MiPACQ 0.494 0.593 0.530
Table 1: Performance of Document Level QA Systems
The document-level systems (Medpedia search and the MiPACQ document-level baseline system) were evaluated
against the 81 questions in the evaluation set. For the purpose of this evaluation, documents containing at least one
paragraph annotated “answer” were considered valid answer documents.
Given that both the MiPACQ and Medpedia systems are based on shallow textual features, the performance
difference is dramatic (Precision at One increased by 302%). Despite the fact that Lucene is not intended to be a
question answering system, the document-level results are acceptable as a first step in the question answering
process. This is in contrast to the Medpedia search system: while it might be acceptable as a search system, as a
question answering system it fails to locate relevant documents at an acceptable rate.
Paragraph Level
System P@1 MRR AUC
Baseline 0.074 0.140 0.105
Rule-based Re-ranking 0.136 0.212 0.149
ML Based Re-ranking 0.173 0.266 0.141
Table 2: Performance of Paragraph Level QA Systems
The paragraph-level QA system results are presented in Table 2. These results were generated by running the
evaluation system against the same question test set as in the document-level evaluation.
Several interesting facts emerge from the data. Although the document-level and paragraph-level results are not
directly comparable, the paragraph-level results indicate (as would be expected) that finding the answer within the
document is considerably more difficult than simply finding the relevant document. Whereas the Lucene-based
MiPACQ document-level baseline system might be considered acceptable as a general-purpose QA system (an
“answer” result was ranked first for 49% of the questions), the paragraph-level comparisons show that traditional
information retrieval systems are inadequate for the more fine-grained paragraph level question answering task (only
7% of the questions resulted in an answer paragraph being returned as the first result).
Additionally, we can see that even the simple rule-based re-ranking algorithm using UMLS and BOW features
provides a substantial performance boost. The rule-based system returned a correct answer in the first position
13.6% of the time (an 84% improvement relative to the baseline system). Because the rule-based system considers
only UMLS entity matches and word matches between the question and answer, we can infer that these matches are
strong predictors of correct answers. Additionally, despite the fact that the cTAKES system was not trained on the
Medpedia corpus (at the time of this writing, we had not yet created gold-standard syntactic or semantic annotations
for the Medpedia paragraphs) the generated UMLS entities were sufficiently valid to result in a significant
performance improvement, providing additional evidence that the system should work well with new medical
resources.
The ML-based re-ranking results showed further improvement over the rule-based system. The ML based system
did significantly better at ranking answers in the first position (P@1): a correct answer was ranked highest 17.3% of
the time (134% better than the IR baseline system and 27% better than the rule-based system). ML-based re-ranking
also showed improvement on MRR compared to both the IR baseline (90% better) and the rule-based system (26%
better). AUC showed significant improvement relative to the IR baseline, but was slightly lower than in the rule-
based approach.
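The ML-based re-ranking can be sketched as a pointwise classifier that scores each candidate from a small feature vector (e.g. IR score, word overlap, UMLS-entity overlap) and sorts by predicted answer probability. The logistic-regression model and training loop below are illustrative assumptions; the paper does not specify this exact ML formulation here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reranker(examples, epochs=200, lr=0.5):
    """Train logistic-regression weights on (features, label) pairs,
    where label 1 marks an answer paragraph and 0 a non-answer,
    using plain gradient descent on the log loss."""
    dim = len(examples[0][0])
    w = [0.0] * (dim + 1)          # feature weights + bias term
    for _ in range(epochs):
        for x, y in examples:
            p = sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, x)))
            g = p - y              # gradient of log loss w.r.t. the logit
            for i in range(dim):
                w[i] -= lr * g * x[i]
            w[-1] -= lr * g
    return w

def ml_rerank(candidates, w):
    """Sort (features, paragraph) candidates by predicted probability."""
    def score(x):
        return sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, x)))
    return sorted(candidates, key=lambda c: score(c[0]), reverse=True)
```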
For 21 questions (26% of the evaluation set) the paragraph-level baseline system returned no answer paragraphs in
the top 100. The re-ranking systems (both rule and ML based) cannot improve performance on these questions
because paragraphs outside of the top 100 are not currently considered for re-ranking. We expect that adding
keyword expansion will help to mitigate this problem by improving the baseline IR system’s recall. The
accompanying precision loss should be mitigated by the filtering effect of the ML based re-ranking.
Future Work
To improve document/paragraph recall, one strategy is term expansion. Each question is expanded into a series of
queries with different versions of key terms. A standard information retrieval system is used to query the corpus
with each query, and then the ranked results are combined into a single result set. We plan to implement term
expansion using the LexEVS 33 terminology server and the UMLS entity features provided by the cTAKES
annotator.
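The planned term-expansion step can be sketched as follows: generate one query per term-variant substitution, run each through the IR system, and merge the ranked lists by keeping each document's best score. The variant source and `search` interface are assumptions; in the planned system the variants would come from LexEVS and the search from Lucene.

```python
def expand_and_search(question, variants, search, k=100):
    """Term-expansion sketch. `variants` maps a key term to alternative
    phrasings; `search(query, k)` is assumed to return (doc_id, score)
    pairs. The merged result keeps each document's best score across
    all expanded queries."""
    queries = [question]
    for term, alternatives in variants.items():
        for q in list(queries):       # snapshot: expand existing queries only
            if term in q:
                queries.extend(q.replace(term, alt) for alt in alternatives)
    best = {}
    for q in queries:
        for doc, score in search(q, k):
            if score > best.get(doc, float("-inf")):
                best[doc] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```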
Moschitti and Quarteroni demonstrated that semantic role labeling (SRL) of questions and answers with PropBank-
style predicate argument structures 34 can provide significant improvements in question answering performance 28.
Predicate-argument annotations describe the semantic meanings of predicates and identify their arguments and
argument roles. To investigate the effectiveness of adding predicate argument structures, we created a custom SRL
system trained on a mixture of the Clinques corpus and PropBank data 34. Evaluation of our SRL system with the
CoNLL-2009 shared task development dataset 35 indicates that our system is competitive with other state-of-the-art
SRL systems.
Features extracted from UMLS semantic relations may also improve the answer re-ranking and we are completing a
system to perform this classification. Full evaluation of the system will be completed once we have integrated the
predicate argument structures and UMLS relations into the re-ranking systems.
MiPACQ will be released under an Apache v2 open-source license in late 2011. Associated lexical resources
including gold standard annotations will also be made available to the research community.
Conclusion
The increased desire to incorporate medical databases into the clinical practice of medicine has resulted in increased
demand for clinical question answering systems. Traditional information retrieval systems have proven inadequate
for the task. While baseline systems performed reasonably well at retrieving documents pertinent to the question, physicians do not have time to scour an entire document to find the answer, and performance was substantially worse when we attempted to apply IR technologies to obtain answers at the paragraph level.
This is the first evaluation of MiPACQ. It demonstrates that the application of NLP and machine learning to clinical
question answering can provide substantial performance improvements. By annotating both questions and candidate
answers with a variety of syntactic and semantic features and incorporating both existing and new NLP systems into
a comprehensive annotation and re-ranking pipeline, we were able to improve question answering performance
significantly (an improvement of 134% in Precision at One relative to the baseline IR system). This is particularly
significant considering that we have not yet incorporated query re-formulation (keyword expansion). When MiPACQ is run against other corpora, the extent to which re-ranking is boosted by the current machine-learning classifiers remains to be seen. And while the expected answer type classifier is dependent on its training data, it is only one component of the re-ranking; if needed, the classifiers can be retrained easily thanks to our use of the ClearTK framework. Our approach demonstrates the effectiveness of NLP for clinical question answering tasks as well as the utility of integrating multiple layers of annotation with machine-learning based re-ranking systems.
Acknowledgements
The project described was supported by award number NLM RC1LM010608. The content is solely the
responsibility of the authors and does not necessarily represent the official views of the NLM/NIH.
References
1. Sackett DL, Rosenberg WMC, Gray JAM, Haynes RB, Richardson WS. Evidence based medicine: what it is
and what it isn't. BMJ. January 1996;312(7023):71-72.
3. Nielsen RD, Masanz J, Ogren P, et al. An Architecture for Complex Clinical Question Answering. The 1st
ACM International Health Informatics Symposium, 2010:395-399.
4. Ely J, Osheroff J, Chambliss M, Ebell M, Rosenbaum M. Answering Physicians' Clinical Questions: Obstacles
and Potential Solutions. J Am Med Inform Assoc. Mar-Apr 2005;12(2):217–224.
7. Schwartz DG, Abbas J, Krause R, Moscati R, Halpern S. Are Internet Searches a Reliable Source of
Information for Answering Residents’ Clinical Questions in the Emergency Room? Proceedings of the 1st
ACM International Health Informatics Symposium, 2010: 391-394; New York.
8. Athenikos SJ, Han H. Biomedical question answering: a survey. Comput Methods Programs Biomed. July
2010; 99(1):1-24.
10. Lindberg D, Humphreys B, McCray A. The Unified Medical Language System. Meth Inform Med. August
1993;32(4):281-291.
11. Demner-Fushman D, Lin J. Knowledge Extraction for Clinical Question Answering: Preliminary Results.
AAAI-05 Workshop on Question Answering, 2005:1-9; Pittsburgh.
12. Demner-Fushman D, Lin J. Answering Clinical Questions with Knowledge-Based and Statistical Techniques.
Computational Linguistics. 2007;33(1).
13. Wang W, Hu D, Feng M, Liu W. Automatic Clinical Question Answering Based on UMLS Relations. SKG '07
Proceedings of the Third International Conference on Semantics, Knowledge and Grid, 2007:495-498; Xi'an.
14. Athenikos SJ, Han H, Brooks AD. A framework of a logic-based question-answering system for the medical
domain (LOQAS-Med). Proceedings of the 2009 ACM symposium on Applied Computing, 2009:847-851;
New York.
15. Yu H, Cao YG. Automatically extracting information needs from ad-hoc clinical questions. American Medical
Informatics Association (AMIA) Fall Symposium, 2008:96-100; San Francisco.
16. Ogren PV, Wetzler PG, Bethard SJ. ClearTK: A UIMA Toolkit for Statistical Natural Language Processing.
LREC, 2008:865-869; Marrakech.
17. Ferrucci D, Lally A. UIMA: an architectural approach to unstructured information processing in the corporate
research environment. Natural Language Engineering. September 2004;10(3-4):327-348.
20. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System
(cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010(17):568-574.
21. Clinical Text Analysis and Knowledge Extraction System User Guide. Available at:
http://ohnlp.sourceforge.net/cTAKES/#_document_preprocessor. Last accessed: May 10, 2011.
24. Marcus MP, Marcinkiewicz MA, Santorini B. Building a Large Annotated Corpus of English: The Penn
Treebank. Computational Linguistics. June 1993;19(2):313-330.
25. Choi JD, Nicolov N. K-best, Locally Pruned, Transition-based Dependency Parsing Using Robust Risk
Minimization. Collections of Recent Advances in Natural Language Processing;5:205-216.
26. Negex. Available at: http://code.google.com/p/negex/. Last accessed: May 16, 2011.
27. Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval: Cambridge University Press;
2008:118-125.
29. Liu TY. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval
2009;3(3):225-331.
30. Moschitti A, Quarteroni S. Linguistic kernels for answer re-ranking in question answering systems. Information
Processing & Management. 2010. In press.
31. Platt JC. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood
Methods. Advances in Large Margin Classifiers: MIT Press; 1999:61-74.
32. Lin HT, Lin CJ, Weng RC. A note on Platt's probabilistic outputs for support vector machines. Machine
Learning. 2007;68(3):267-276.
34. Kingsbury P, Palmer M. From TreeBank to PropBank. Language Resources and Evaluation, 2002:29-31; Las
Palmas.
35. Hajič J, Ciaramita M, Johansson R, et al. The CoNLL-2009 Shared Task: Syntactic and Semantic
Dependencies in Multiple Languages. Proceedings of the 13th Conference on Computational Natural Language
Learning (CoNLL-2009), 2009:1-18; Boulder.