
A Document Retrieval Method Based on UMLS Similarity in Biomedical Question Answering


Mourad SARROUTI and Said OUATIK EL ALAOUI
Laboratory of Computer Science and Modeling, Faculty of Science Dhar El Mahraz (FSDM), Sidi Mohamed Ben Abdellah University, Fez, Morocco.
mourad.sarrouti@usmba.ac.ma, s_ouatik@yahoo.com

Abstract

Biomedical document retrieval systems play a vital role in biomedical question answering systems: the performance of a question answering system depends directly on the performance of its document retrieval component. The main goal of biomedical document retrieval is to find a set of citations that have a high probability of containing the answers. In this paper, we propose a biomedical document retrieval method that retrieves relevant documents for the biomedical questions (queries) posed by users. In our framework, we first use the GoPubMed search engine to find the top-K results. Then, we re-rank the top-K results by computing the semantic similarity between the question and the title of each document using UMLS similarity. The proposed method is evaluated on the BioASQ 2014 task datasets. The experimental results show that it achieves the best performance (MAP@100) among the compared state-of-the-art document retrieval systems.

Introduction

With the rapid growth of knowledge in the biomedical domain, it has become very difficult even for experts to absorb all the relevant information in their field of interest. Information Retrieval (IR) systems present a list of documents that might contain the relevant information, but most of them leave it to the user to find and extract the required information. Unlike IR systems, Question Answering (QA) systems aim to provide inquirers with direct and precise answers to their questions by employing Information Extraction (IE) and Natural Language Processing (NLP) methods [1]. Typically, an automated QA system consists of three main elements, which can be studied and developed independently [1, 2]: Question Processing, Document Processing and Answer Processing. Figure 1 illustrates the generic architecture of a biomedical QA system.

Figure 1. Generic architecture of biomedical question answering systems. Phase 1, Question Processing: the user's natural language question (e.g. "Is CHEK2 involved in cell cycle control?") goes through question analysis and classification (question types: yes/no, factoid, list or summary) and query formulation. Phase 2, Document Processing: the query is run against PubMed documents for document and passage retrieval, yielding relevant documents. Phase 3, Answer Processing: candidate answers are processed into the answers returned to the user (e.g. "Yes").

In this work, we address the problem of document retrieval, which is an important component of biomedical QA systems. The aim of the biomedical document retrieval task is to find a list of relevant documents that are likely to contain the answer.

Method

The proposed method consists of three main steps: (1) query reformulation, (2) PubMed document retrieval using GoPubMed, and (3) biomedical document re-ranking.
1. Query Reformulation: in this step, we process the biomedical question, written in natural language, to make it suitable for searching. We use MetaMap [2] to map the terms of the question to the Unified Medical Language System (UMLS) in order to extract the Biomedical Entity Names (BENs), and we connect them with the "AND" operator.
2. PubMed Document Retrieval Using GoPubMed: the query generated in the query reformulation step is submitted to the GoPubMed semantic search engine [3] in order to find the top 200 documents.
3. Biomedical Document Re-Ranking: document re-ranking is the main step of the proposed method, since we do not rely completely on the GoPubMed ranking of documents. We re-rank the 200 retrieved documents by computing the similarity between the given question and the title of each document. We use UMLS similarity [4] to obtain the similarity between the biomedical concepts of the question and the concepts of the document title. As the similarity measure we use path length, where the similarity score is inversely proportional to the number of nodes along the shortest path between the concepts.

Experimental Results and Discussion

Dataset
As a test collection, we used the publicly available benchmark datasets provided by the BioASQ challenge [5]. In its 2014 edition, the challenge released five batches of testing data, which are used as test sets to evaluate the participating systems in task b. Each batch contains 100 biomedical questions. The questions are created by a group of biomedical experts and provided by the BioASQ organizers.

Evaluation Metrics
We evaluated the performance of the proposed biomedical document retrieval method using four metrics, namely mean precision, mean recall, mean F1-measure and mean average precision (MAP) [5]. In the BioASQ challenge, MAP is the measure used to rank and compare the participating systems. Moreover, in the BioASQ 2014 test, only the first 100 documents of the resulting list may be submitted.

Results and Discussion
Table 1 presents the comparison between our proposed method and the current state-of-the-art methods on batch 1 of the BioASQ 2014 testing datasets.

Table 1 Comparison in terms of MAP of the proposed biomedical document retrieval method with the current state-of-the-art methods.

Systems             Mean precision   Mean recall   Mean F-measure   MAP
SNUMedinfo1         0.04             0.59          0.08             0.26
Top 100 baseline    0.22             0.43          0.22             0.19
Proposed System     0.23             0.36          0.22             0.27

Overall, Table 1 shows that the proposed method is competitive with the current state-of-the-art methods in terms of MAP: our system reaches a MAP of 0.27, against 0.26 for SNUMedinfo1. Moreover, the proposed method clearly outperforms the baseline system, i.e., Top 100 baseline, in terms of mean average precision, by a margin of 0.0847 MAP.

Conclusion and Future Work
In this paper, we have proposed a biomedical document retrieval method. First, we used MetaMap to extract biomedical named entities and connected them in order to generate queries. Then, the top 200 relevant documents were retrieved with the GoPubMed search engine. Next, we kept only the top 100 documents after re-ranking the top 200 documents by the semantic similarity between the question and each document title. Finally, the experiments on the BioASQ 2014/2015 document retrieval task have shown that the proposed framework is effective and competitive for biomedical document retrieval compared to several state-of-the-art systems. In future work, we will focus on integrating our biomedical document retrieval framework into a biomedical QA system.

References
[1] Athenikos, S.J., Han, H.: Biomedical question answering: A survey. Computer Methods and Programs in Biomedicine 99(1), 1–24 (2010). doi:10.1016/j.cmpb.2009.10.003
[2] Abacha, A.B., Zweigenbaum, P.: MEANS: A medical question-answering system combining NLP techniques and semantic Web technologies. Information Processing & Management 51(5), 570–594 (2015)
[3] Doms, A., Schroeder, M.: GoPubMed: exploring PubMed with the Gene Ontology. Nucl. Acids Res. 33(suppl 2), W783–W786 (2005)
[4] McInnes, B.T., Pedersen, T., Pakhomov, S.V.: UMLS-Interface and UMLS-Similarity: open source software for measuring paths and semantic similarity. In: AMIA Annual Symposium Proceedings, vol. 2009, p. 431. American Medical Informatics Association (2009)
[5] Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M.R., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., Almirantis, Y., Pavlopoulos, J., Baskiotis, N., Gallinari, P., Artiéres, T., Ngomo, A.-C.N., Heino, N., Gaussier, E., Barrio-Alvers, L., Schroeder, M., Androutsopoulos, I., Paliouras, G.: An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16, 1–28 (2015). doi:10.1186/s12859-015-0564-6
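
Steps 1 and 3 of the Method section (joining the extracted entities with "AND", then re-ranking by path-length similarity between question concepts and title concepts) can be illustrated with a minimal Python sketch. It is not the authors' implementation and makes several assumptions: the tiny concept graph `TOY_GRAPH` and the concept names stand in for UMLS and are hypothetical; entity extraction (MetaMap) and retrieval (GoPubMed) are not called, so concepts are passed in directly; and the way concept-pair scores are combined into one title score (average of best matches) is an assumption, since the poster does not specify it.

```python
from collections import deque

# Hypothetical toy concept graph (undirected adjacency lists), standing in for UMLS.
TOY_GRAPH = {
    "protein": ["kinase", "enzyme"],
    "kinase": ["protein", "CHEK2"],
    "CHEK2": ["kinase"],
    "enzyme": ["protein"],
    "cell cycle": ["biological process"],
    "biological process": ["cell cycle", "apoptosis"],
    "apoptosis": ["biological process"],
}


def build_query(entities):
    """Step 1: connect the extracted entity names with the AND operator."""
    return " AND ".join(entities)


def path_similarity(graph, a, b):
    """Return 1 / (number of nodes on the shortest path from a to b).

    Identical concepts score 1.0; unconnected concepts score 0.0.
    """
    if a == b:
        return 1.0
    seen = {a}
    queue = deque([(a, 1)])  # (concept, nodes on the path so far)
    while queue:
        node, count = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == b:
                return 1.0 / (count + 1)
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, count + 1))
    return 0.0


def title_score(graph, question_concepts, title_concepts):
    """Assumed aggregation: average, over question concepts, of the best title match."""
    if not question_concepts or not title_concepts:
        return 0.0
    best = [
        max(path_similarity(graph, q, t) for t in title_concepts)
        for q in question_concepts
    ]
    return sum(best) / len(best)


def rerank(graph, question_concepts, docs, keep=100):
    """Step 3: docs is a list of (doc_id, title_concepts) in search-engine order.

    The sort is stable, so ties keep the original search-engine order.
    """
    ranked = sorted(
        docs,
        key=lambda doc: title_score(graph, question_concepts, doc[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:keep]]


if __name__ == "__main__":
    question = ["CHEK2", "cell cycle"]
    retrieved = [
        ("d1", ["enzyme"]),
        ("d2", ["CHEK2", "cell cycle"]),
        ("d3", ["kinase", "apoptosis"]),
    ]
    print(build_query(question))
    print(rerank(TOY_GRAPH, question, retrieved, keep=2))
```

In the setting described in the poster, `docs` would hold the 200 documents returned by the search engine and `keep` would stay at 100, the submission limit mentioned for BioASQ 2014.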

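
The mean average precision (MAP) named under Evaluation Metrics can be sketched as follows. This uses the common textbook definition, in which the precision values at the ranks of relevant documents are summed and divided by the number of gold relevant documents; whether the official BioASQ evaluation normalises in exactly this way is an assumption here, and the function names and document identifiers are illustrative only. The `cutoff` of 100 mirrors the limit of 100 submitted documents.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=100):
    """Average precision of one ranked list, truncated at `cutoff` documents.

    Assumes ranked_ids contains no duplicates.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs, cutoff=100):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per question."""
    if not runs:
        return 0.0
    scores = [average_precision(r, g, cutoff) for r, g in runs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    runs = [
        (["a", "x", "b"], ["a", "b", "c"]),
        (["x"], ["a"]),
    ]
    print(mean_average_precision(runs))
```

Because the sum is divided by the number of gold relevant documents, a system is penalised for relevant documents it never returns within the cutoff, not only for ranking them low.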