
Vol. 20 no. 1 2004, pages 120–121


BIOINFORMATICS APPLICATIONS NOTE DOI: 10.1093/bioinformatics/btg369

GIS: a biomedical text-mining system for gene information discovery
Jung-Hsien Chiang∗, Hsu-Chun Yu and Huai-Jen Hsu
Department of Computer Science and Information Engineering, National Cheng Kung
University, Tainan 701, Taiwan, ROC

Received on March 7, 2003; revised on June 5, 2003; accepted on June 17, 2003

∗ To whom correspondence should be addressed.

Bioinformatics 20(1) © Oxford University Press 2004; all rights reserved.
Downloaded from bioinformatics.oxfordjournals.org by guest on May 1, 2011

ABSTRACT
Summary: We present a biomedical text-mining system focused on four types of gene-related information: biological functions, associated diseases, related genes and gene–gene relations. The aim of this system is to provide researchers with an easy-to-use bio-information service that rapidly surveys the fast-growing biomedical literature.
Availability: http://iir.csie.ncku.edu.tw/~yuhc/gis/
Contact: jchiang@mail.ncku.edu.tw

INTRODUCTION
In the post-genomic era, a great deal of research is directed toward the analysis and interpretation of genetic sequence data. The results of these biological experiments are reported in text form. This type of information, however, is underutilized by biologists and medical researchers because of the highly unstructured and free-format characteristics of the published information and because of its overwhelming volume. Currently, several systems analyze PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) abstracts and then provide a value-added bio-information service. For example, Suiseki (Blaschke et al., 1999, http://www.pdg.cnb.uam.es/suiseki/) focuses on the extraction and visualization of protein–protein interactions. MedMiner (Tanabe et al., 1999, http://discover.nci.nih.gov/textmining/filters.html) takes advantage of GeneCards (Safran et al., 2002, http://bioinfo.weizmann.ac.il/cards/) as its knowledge source and offers gene information related to specific keywords. XplorMed (Perez-Iratxeta et al., 2001, http://www.bork.embl-heidelberg.de/xplormed/) presents specified information through interaction with the user. EDGAR (Rindflesch et al., 2000, http://www-smi.stanford.edu/projects/helix/psb00/) extracts information about drugs and genes relevant to cancer from the biomedical literature. GENIES (Friedman et al., 2001) extracts and structures information about cellular pathways from the biomedical literature. The goal of our research was to design and develop an information system that can streamline the process of retrieving and analyzing gene-related information from PubMed abstracts, where the information about a gene includes biological functions, associated diseases, related genes and gene–gene relations. The result has been a considerable reduction in the time and effort required to survey the literature on genes.

We have developed the GIS (Gene Information System) to perform high-speed gene information discovery. Our system has two modules. Module I, gene information screening, provides information about biological functions, associated diseases and related genes for a queried gene. Module II, gene–gene relation extraction, extracts the gene–gene relations described in abstracts and estimates whether the relation between a pair of genes is positive, cooperative or negative. In the following we describe the system architecture of the two modules.

Module I: gene information screening
In this module, the user can screen the information on biological functions, associated diseases and related genes for each gene in the input gene list [Fig. 1(a) at http://iir.csie.ncku.edu.tw/~yuhc/gis/figure.htm].

The function of this module is achieved through three agents. The document retrieval agent is responsible for collecting the medical documents from PubMed. The sentence selection agent selects the generally important parts of the abstracts, the title and conclusion section, for processing. If an abstract does not contain a section labeled ‘CONCLUSION:’, the system selects the last three sentences. The lexicon analysis agent is the core of this module. It finds, counts and indexes, with the domain-specific lexicon, the keywords for biological functions, associated diseases and related genes that occur in the abstracts. The domain-specific lexicon contains six parts: biological function lexicon, disease lexicon, gene lexicon, relational keyword lexicon, negative keyword lexicon and stopword lexicon. We compile the lexicon from some on-line dictionaries and suggestions from professional biomedical researchers. The lexicon contains synonyms to handle lexical variation.

Module II: gene–gene relation extraction
In this module, users can acquire information about the relation between a pair of genes [Fig. 1(b) at http://iir.csie.ncku.edu.tw/~yuhc/gis/figure.htm]. The relations here are classified into three categories, positive, cooperative and negative, according to gene function. For example, gene A can activate gene B’s function or expression (a positive relation); gene A and gene B can bind to a complex or cooperate with each other (a cooperative relation; example keywords are ‘associate with’ and ‘is conjugated to’); or gene A can suppress gene B’s function or expression (a negative relation).

The function of this module is achieved through four agents: document retrieval, data preprocessing, learning process and relation prediction. Here we introduce only the two core agents, learning process and relation prediction. The learning process agent is responsible for generating sentence expression patterns from training samples consisting of sentences describing gene–gene relations. Sentence expression patterns stand for the patterns of wording and term distribution in describing relations, and they are represented as a variant of a decision tree. This agent operates offline beforehand. The relation prediction agent judges the relations described in sentences according to sentence expression patterns and determines whether the relations are positive, cooperative or negative. Please visit the website of our system for the algorithms used in this module.

These two modules can be integrated in the gene information discovery process. A medical researcher first queries the information on a specified gene in Module I, picks a related gene for the input gene from the result, and then queries the relation between these two genes in Module II. After querying gene information in Module I, the user may find that the query gene and another gene frequently co-occur in the same sentences and have similar biological functions. At this point, the user can proceed to use Module II to query the relation between these two possibly related genes, and this information can be used in further study.

System features
The GIS system provides the following features. (1) Finding, counting and indexing the keywords for biological functions, associated diseases and related genes in the abstracts. (2) Browsing the sentences and abstracts by clicking the keywords of biological functions, associated diseases and related genes, instead of showing only a list of titles. (3) Using colors to mark important keywords, to help the user discover important information easily and to facilitate understanding. (4) Editing the domain-specific lexicon of biological functions, diseases and genes to meet the user’s needs, and thus to expand the lexicon. (5) Filtering the content of the abstracts and retaining only the conclusions or last three sentences of the abstract.

We have presented here a biomedical text mining system that screens gene information and extracts gene–gene relations described in the text. We extract the information about biological functions, associated diseases and related genes through a domain-specific lexicon. We use three kinds of relations (positive, cooperative and negative) to represent the relation between a pair of genes. The extraction performance is 0.840 for precision and 0.767 for recall. In brief, we have designed and implemented a new system architecture for the discovery of gene information in biomedical texts.

REFERENCES
Blaschke,C., Andrade,M.A., Ouzounis,C. and Valencia,A. (1999) Automatic extraction of biological information from scientific text: protein–protein interactions. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB’99), pp. 60–67.
Friedman,C., Kra,P., Yu,H., Krauthammer,M. and Rzhetsky,A. (2001) GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17, S74–S82.
Perez-Iratxeta,C., Bork,P. and Andrade,M.A. (2001) XplorMed: a tool for exploring MEDLINE abstracts. Trends Biochem. Sci., 26, 573–575.
Rindflesch,T.C., Tanabe,L., Weinstein,J.N. and Hunter,L. (2000) EDGAR: extraction of drugs, genes and relations from the biomedical literature. Pac. Symp. Biocomput., 5, 517–528.
Safran,M., Solomon,I., Shmueli,O., Lapidot,M., Shen-Orr,S., Adato,A., Ben-Dor,U., Esterman,N., Rosen,N., Peter,I. et al. (2002) GeneCards 2002: towards a complete, object-oriented, human gene compendium. Bioinformatics, 18, 1542–1543.
Tanabe,L., Scherf,U., Smith,L.H., Lee,J.K., Hunter,L. and Weinstein,J.N. (1999) MedMiner: an internet text-mining tool for biomedical information, with application to gene expression profiling. BioTechniques, 27, 1210–1217.
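Illustrative sketch (Module I). The note describes the sentence selection and lexicon analysis agents only in prose, so the following Python is a minimal sketch under stated assumptions and not the GIS implementation. The lexicon entries, the function names and the naive sentence splitter are all hypothetical. What it takes from the text is the selection rule (title plus the section labeled 'CONCLUSION:', otherwise the last three sentences) and the idea of counting keywords per lexicon category while folding synonyms into one canonical term.

```python
import re
from collections import Counter

# Hypothetical miniature lexicon: category -> canonical term -> synonyms.
# The real GIS lexicon is not given in the note; these entries are invented.
LEXICON = {
    "function": {"apoptosis": ["apoptosis", "programmed cell death"]},
    "disease": {"breast cancer": ["breast cancer", "breast carcinoma"]},
    "gene": {"TP53": ["tp53", "p53"], "BRCA1": ["brca1"]},
}


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.

    Abbreviations such as 'et al.' would be split incorrectly.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def select_sentences(title, abstract):
    """Keep the title plus the CONCLUSION section, else the last three sentences."""
    marker = "CONCLUSION:"
    pos = abstract.find(marker)
    if pos != -1:
        tail = split_sentences(abstract[pos + len(marker):])
    else:
        tail = split_sentences(abstract)[-3:]
    return [title] + tail


def count_keywords(sentences, lexicon=LEXICON):
    """Count canonical terms per category, matching each synonym as a whole word."""
    counts = {category: Counter() for category in lexicon}
    for sentence in sentences:
        lowered = sentence.lower()
        for category, terms in lexicon.items():
            for canonical, synonyms in terms.items():
                for synonym in synonyms:
                    pattern = r"\b" + re.escape(synonym) + r"\b"
                    hits = len(re.findall(pattern, lowered))
                    if hits:
                        counts[category][canonical] += hits
    return counts
```

Calling `count_keywords(select_sentences(title, abstract))` returns one `Counter` per lexicon category. Keeping a canonical term as the key is one simple way to let synonyms such as 'p53' and 'TP53' contribute to the same count, which is how the sketch interprets "the lexicon contains synonyms to handle lexical variation".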

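Illustrative sketch (Module II). The note states that relations are predicted from learned sentence expression patterns, represented as a variant of a decision tree, and refers to the system website for the algorithms. Those algorithms are not reproduced here. As a deliberately simpler stand-in, the Python below labels a sentence by looking for a hand-written cue phrase between two gene mentions. Only the cue phrases 'associate with' and 'is conjugated to' come from the note; the other phrases, the function name and the data layout are assumptions made for illustration.

```python
import re

# Hypothetical cue phrases per relation type. Only 'associate with' and
# 'is conjugated to' appear in the note; every other entry is invented.
RELATION_CUES = {
    "positive": ["activates", "induces", "upregulates"],
    "cooperative": ["associate with", "associates with", "is conjugated to"],
    "negative": ["suppresses", "inhibits", "represses"],
}


def label_relation(sentence, gene_a, gene_b, cues=RELATION_CUES):
    """Return the label whose cue phrase lies between the two gene mentions.

    Returns None when either gene is missing or no cue phrase is found.
    This is a keyword lookup, not the learned pattern matcher of the note.
    """
    lowered = sentence.lower()
    start_a = lowered.find(gene_a.lower())
    start_b = lowered.find(gene_b.lower())
    if start_a == -1 or start_b == -1:
        return None
    # Order the two mentions by position and take the text between them.
    first, second = sorted([(start_a, gene_a), (start_b, gene_b)])
    between = lowered[first[0] + len(first[1]):second[0]]
    for label, phrases in cues.items():
        for phrase in phrases:
            if re.search(r"\b" + re.escape(phrase) + r"\b", between):
                return label
    return None
```

The sketch ignores the direction of the relation and any negation, which the negative keyword lexicon mentioned in the note presumably addresses. It is meant only to make the three categories concrete.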

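Note on the reported figures. The note gives precision 0.840 and recall 0.767 without defining the metrics or reporting the underlying counts. Assuming the standard definitions, with TP, FP and FN the numbers of correctly extracted, wrongly extracted and missed relations, the two values can be combined into a balanced F-measure. The F-measure below is derived here from the reported values and is not stated in the note.

```latex
\[
P = \frac{TP}{TP + FP} = 0.840, \qquad
R = \frac{TP}{TP + FN} = 0.767
\]
\[
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.840 \times 0.767}{0.840 + 0.767}
    = \frac{1.28856}{1.607} \approx 0.802
\]
```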