Improving Medical Decision Trees by Combining Relevant Health-Care Criteria

Expert Systems with Applications 39 (2012) 11782–11791

Joan Albert López-Vallverdú *, David Riaño, John A. Bohada

Research Group on Artificial Intelligence (BANZAI), Departament d’Enginyeria Informàtica i Matemàtiques, Universitat Rovira
i Virgili, Av. Països Catalans 26, 43007 Tarragona, Spain
Article info
Keywords: Medical decision making; Decision trees; Background knowledge
Abstract
Through the years, decision trees have been widely used both to represent and to conduct decision processes. They can be
automatically induced from databases using supervised learning algorithms which usually aim at minimizing the size of the tree.
When inducing decision trees in a medical setting, the induction process should consider the background knowledge used by
health-care professionals to make decisions in order to produce decision trees that are medically and clinically comprehensible
and correct. Comprehensibility measures the medical coherence of the sequence of questions represented in the tree, and
correctness rates how irrelevant the errors of the decision tree are from a medical or clinical point of view. Some
algorithms partially solve these problems by pursuing alternative objectives such as reducing the economic cost or improving the
adherence of the decision process to medical standards. However, from a clinical point of view, none of these criteria is valid
when it is considered alone, because real medical decisions are made by attending to a combination of them, and also other
health-care criteria, simultaneously. Moreover, this combination of criteria is not static and may vary if the decision tree is made
for different purposes such as screening, diagnosing, prognosing or drug and therapy prescription. In this paper, a decision tree
induction algorithm that uses combinations of health-care criteria is presented and used to generate decision trees for screening
and diagnosing in four medical domains. The mechanisms to formalize and to combine these criteria are also presented. The
results have been analyzed from both a statistical and a medical point of view, and they suggest that our algorithm obtains
decision trees that physicians evaluated as more comprehensible and correct than the decision trees obtained by previous
approaches, while keeping an equivalent accuracy.
© 2012 Elsevier Ltd. All rights reserved.
1. Introduction
In medicine, decision processes may be of several kinds and for different purposes (Fauci et al., 2009): screening, diagnosing,
prognosing, drug and therapy prescription, etc. Through the years, multiple computer-based structures have been proposed to
formalize these decision processes. They range from statistical approaches such as Bayesian Networks (Arsene, Dumitrache, & Mihu,
2011; Lucas, van der Gaag, & Abu-Hanna, 2004; Velikova, de Carvalho Ferreira, & Lucas, 2007) or probabilistic models
(Husmeier et al., 2004) to symbolic approaches such as decision trees (Chapman & Sonnenberg, 2003; Podgorelec, Kokol, Stiglic, &
Rozman, 2002), decision tables (Shiffman, 1997) or decision rules (Clark & Niblett, 1989; Yeh, Cheng, & Chen, 2011). Among
them, decision trees have been particularly successful and widely used both to represent and to conduct decision processes.
Medical decision trees can be provided by experts (Candell Riera, 2003; Fauci et al., 2009) or automatically induced from
medical databases (Ling, Yang, Wang, & Zhang, 2004; López-Vallverdú, Riaño, & Collado, 2007; Quinlan, 1986). In
computer science, three of the most cited algorithms to induce decision trees are ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993)
and C5.0 (Quinlan, 2003). They aim at minimizing the size of the tree and therefore shortening the decision process while
maintaining the quality of the final decision. The main drawback of the trees produced with these algorithms is that the final
trees only consider the information that can be extracted from the medical databases and so they do not necessarily satisfy
medical and clinical comprehensibility and correctness. Comprehensibility is a measure of the medical coherence of the
sequence of questions of the decision processes represented in the tree according to the health-care experts (e.g., asking for the
age of the patient before obtaining the thyroid-stimulating hormone value can be accepted in a patient screening process but not
in diagnosing thyroid malfunctions). Correctness rates how irrelevant the errors of the decision tree are from a medical or
clinical point of view (e.g., the medical error of sending a patient to the Intensive Care Unit rather than to a general hospital
floor is lower than sending him home by mistake). Providing efficient, but also comprehensible and correct decision
mechanisms is a priority in medical decision making.
* Corresponding author. Tel.: +34 977558516; fax: +34 977559710. E-mail addresses: joanalbert.lopez@urv.net (J.A.
López-Vallverdú), david.riano@urv.net (D. Riaño), john.bohada@urv.net (J.A. Bohada).
0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.eswa.2012.04.073
In the past, some approaches to the induction of medical decision trees have pursued alternative objectives such as reducing the
economic cost (Chai, Deng, Yang, & Ling, 2004; Ling et al., 2004; Ling, Sheng, & Yang, 2006) or improving the adherence
(Horning, Hoehns, & Doucette, 2007) of the decision process to medical standards (López-Vallverdú et al., 2007). However,
these approaches do not guarantee medical comprehensibility and correctness. On the one hand, none of the previous criteria on
the length, the economic cost and the adherence to clinical standards is useful when it is considered alone, because real medical
decisions are taken attending not only to these criteria but also to many others that are combined, simultaneously. On the other
hand, the induction of decision trees with those criteria cannot differentiate among the different possible application purposes.
This differentiation is important because, for example, a comprehensible and correct decision tree for diagnosis can be
completely wrong for screening purposes.
In this context, in Section 2 we formalize the concept of medical decision process. In Section 3 we propose the mechanisms to
formalize medical criteria in order to include them in a decision tree induction algorithm, and in Section 4 we propose a
methodology to combine them. In Section 5 we present a general algorithm to induce decision trees, identifying the points
where medical decision criteria can be introduced as background knowledge. These are called choice points. In Section 6, the
measures of accuracy, comprehensibility and correctness for the evaluation of the induced decision trees are formalized. The
inductive algorithm is used in Section 7 to generate decision trees for the purposes of screening and diagnosing in four medical
domains. The results have been analyzed from a statistical and a medical point of view, and the conclusions reported in Section 8.
2. Formalizing a decision process
In medicine there are many different descriptions of what a decision process is (Fauci et al., 2009), therefore it is mandatory to
define the concept of medical decision process in this paper. Here, a decision process is a sequence of medical questions or
observations that lead to a concrete medical decision. In a particular domain, if $Q = \{q_1, q_2, \ldots, q_m\}$ is the set of valid questions, $D = \{d_1, d_2, \ldots, d_n\}$ the set of possible decisions and $q_i(p)$ the answer to the question $q_i \in Q$ for a certain patient $p$, then the finite sequence $(q_{i_1}(p), q_{i_2}(p), \ldots, q_{i_k}(p); d_p)$ represents a medical decision process for patient $p$ in which a health-care professional takes decision $d_p \in D$ after having asked the questions $q_{i_1}, q_{i_2}, \ldots, q_{i_k} \in Q$ in this exact order. Observe that the questions represent patient signs and symptoms but also consultation to the patient record or to an expert. Individual decision processes can be
generalized and structured as decision mechanisms that do not only capture the medical knowledge supporting each individual
decision process, but also provide the way of conducting new decisions under other circumstances or for other patients. Among
the existing decision mechanisms (see, for example Arsene et al., 2011; Clark & Niblett, 1989; Chapman & Sonnenberg, 2003;
Husmeier et al., 2004; Podgorelec et al., 2002; Shiffman, 1997) here we choose decision trees because they are structured,
explicit, and easy to understand and to interpret, which are compelling requirements of a medical process. A decision tree (DT) is
a decision mechanism that describes decision processes that always start with the same question and concatenate questions in
such a way that each possible answer to a question is followed by a new question or by a final decision.
Decision mechanisms, such as DTs, can be automatically obtained by applying induction algorithms. These algorithms
start from a set of data represented in the form $(q_1(p), q_2(p), \ldots, q_m(p); d_p)$ where $p$ are the different patients, the $q$'s are questions whose
answers can be known or not for each patient $p$, and $d_p$ is the decision taken for patient $p$. Observe that the order of the questions is the
same for all the patients since it defines the description of the case rather
than a medical decision process in which the questions to the different patients can be asked in a different order.
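As an illustration of the definitions above, the following minimal sketch represents a DT as nested nodes and recovers the decision process followed for one patient. The class names `Question` and `Decision`, the function `decision_process`, and the toy tree and patient are assumptions made for this example; they are not taken from the paper or from Fig. 1.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class Decision:
    """Terminal node: a final decision d in D."""
    label: str


@dataclass
class Question:
    """Internal node: a question q in Q with one branch per possible answer."""
    name: str
    branches: Dict[str, Union["Question", Decision]] = field(default_factory=dict)


def decision_process(tree, patient: Dict[str, str]) -> Tuple[List[Tuple[str, str]], str]:
    """Follow the tree for one patient; return the asked (question, answer) pairs and d_p."""
    asked = []
    node = tree
    while isinstance(node, Question):
        answer = patient[node.name]      # q_i(p), assumed to be known here
        asked.append((node.name, answer))
        node = node.branches[answer]
    return asked, node.label


# Hypothetical toy tree and patient record, for illustration only.
toy_tree = Question("chest_pain", {
    "no": Decision("healthy"),
    "yes": Question("ecg_abnormal", {
        "no": Decision("healthy"),
        "yes": Decision("heart_disease"),
    }),
})
patient = {"chest_pain": "yes", "ecg_abnormal": "yes", "age": "61"}
print(decision_process(toy_tree, patient))
```

The patient record holds the answers in a fixed order, as in the dataset format, whereas the returned sequence holds only the questions actually asked, in the order imposed by the tree.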
Fig. 1 shows a DT induced using an information gain based algorithm to identify patients with heart disease (questions are
represented as ellipses and decisions as boxes). It does not consider any medical background knowledge, so medical
comprehensibility and correctness are not guaranteed. For example, the question about the number of major vessels in the root requires
an invasive test for all the patients. The systematic application of this test as the first step of the decision process makes no medical
sense, and it is completely wrong for certain medical decision processes such as screening. The algorithms to induce comprehensible
and correct decision mechanisms from a set of data must then be based on one or more medical decision criteria that extend the
statistical sense of asking one or another question with a clinical sense. These medical decision criteria and their formalization are
introduced in the next section.
3. Decision criteria in health-care and their formalization
In medicine, the list of criteria which may be combined to make decisions is very large and diverse. A systematic approach to
the organization of such criteria and their representation using cost functions and layered partial orders (LPOs) is proposed in
López-Vallverdú and Riaño (2012). In this section, we explain how these criteria can be formalized in order to decide about the
appropriate questions and decisions in a decision process. This appropriateness is used to determine the best order of questions
and decisions in medical decision processes.
3.1. Criteria on the questions
The order in which questions are asked in a decision process is decided according to the criteria on the questions. They are
used to determine whether a question is more or less adequate than another one in a given context. For example, in diabetes
screening, we may use the decision time criterion to decide to perform an oral glucose tolerance test rather than obtaining the
longer 2-h serum insulin value. When formalizing the criteria on the questions, cost functions or LPOs are defined over the set of
questions $Q$. For example, the expert may choose to represent the economic cost criterion as a cost function $f_e : Q \to [0,1]$ and the script criterion, which measures the adherence of the procedure to the sequence specified
by medical standards (López-Vallverdú & Riaño, 2012), as a LPO $\leq_s$ over $Q$.
Criteria on the questions can be contextual or context-free. A criterion is said to be contextual when it depends on the context
(related disease, medical purpose, etc.) of the medical decision process. In a certain context, the answer to a question may be
important in order to make a decision, but in another context, this question may be totally unnecessary. For example, the script
value of answering the question stability_of_blood_pressure is greater if we are deciding where a post-operative patient must be
sent, than if we are determining whether the patient is hypothyroid or not. Script and granularity (López-Vallverdú & Riaño,
2012) are examples of contextual criteria.
Context-free criteria do not change when they are used in different contexts because they depend on the health-care test
needed to obtain the answer of the question. For example, economic cost is a context-free criterion. The question
sodium_on_blood has no economic cost itself but its economic cost is related to the blood test that provides the answer for this
question. The economic cost of a regular blood test is always the same regardless of the context. Moreover, a health-care test can
provide simultaneous answers to several questions of the decision process. For example, a regular
blood test informs about the levels of sodium, urea, creatinine, etc. providing an answer to the question sodium_on_blood, but
also to urea_on_blood, creatinine_on_blood, etc. Decision time, economic cost, health risk and physical comfortability
(López-Vallverdú & Riaño, 2012) are examples of context-free criteria. Notice that once a health-care test is performed to
answer one of the questions, it does not have to be performed again to answer the rest of the questions related to that health-care
test. Let $t$ be a health-care test that provides the answer to a set of questions $Q' \subseteq Q$. When a question $q' \in Q'$ is asked in the
decision process, the values for context-free criteria of the questions in $Q'$ change, so that $\forall q' \in Q'$, $f(q') = 0$ (for cost
functions), and each question $q' \in Q'$ is moved to the first layer of $\leq$ (for LPOs).
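The update of a context-free cost function after a test has been performed can be sketched as follows. This is only an illustration for the cost-function case: the mapping `TEST_OF`, the cost values and the function name `ask` are hypothetical, and the LPO case (moving questions to the first layer) is not covered.

```python
from typing import Dict

# Hypothetical mapping from each question to the health-care test that answers it.
TEST_OF: Dict[str, str] = {
    "sodium_on_blood": "blood_test",
    "urea_on_blood": "blood_test",
    "creatinine_on_blood": "blood_test",
    "chest_xray_finding": "xray",
}

# Hypothetical normalized economic cost f_e : Q -> [0, 1].
economic_cost: Dict[str, float] = {
    "sodium_on_blood": 0.2,
    "urea_on_blood": 0.2,
    "creatinine_on_blood": 0.2,
    "chest_xray_finding": 0.5,
}


def ask(question: str, cost: Dict[str, float]) -> Dict[str, float]:
    """Return the cost function after `question` is asked.

    Every question answered by the same test (the set Q') gets cost 0,
    because the test does not have to be performed again.
    """
    test = TEST_OF[question]
    return {q: (0.0 if TEST_OF[q] == test else c) for q, c in cost.items()}


updated = ask("sodium_on_blood", economic_cost)
print(updated)
```

Returning a new dictionary instead of modifying `economic_cost` keeps the original values available for other branches of the tree, where the test may not have been performed yet.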
3.2. Criteria on the decisions
A decision process concludes with a final decision that can be right or wrong. The relevance of the error in wrong decisions is
evaluated with the criteria on the decisions. For example, according to the health risk criterion, it is safer to wrongly send a
post-operative patient to the Intensive Care Unit (ICU) than sending him home by mistake. Some previous works have
evaluated the possible wrong decisions performed in a decision process (Ling et al., 2004, 2006; Turney, 2000). In these
approaches, an expert has to provide a cost function $error(d_1, d_2)$ which returns the error of performing $d_2$ when the correct decision is $d_1$, for each pair of decisions $d_1, d_2$ in the set of possible decisions $D$. This approach has the inconvenience that the expert is required to provide a value
for each one of the $\#D \cdot (\#D - 1)$ possible errors in the decision process (where $\#D$ is the cardinality of $D$). For medium and
large sets of decisions this is a large amount of information that experts must provide. In order to reduce this effort, here we use a different
approach that divides the error into type I and type II medical errors which are concepts that medical doctors are familiar with.
Type I error represents the relevance of taking a wrong decision (e.g., the economic cost if we send a patient to ICU when this is
a wrong decision).
Fig. 1. Decision tree to identify patients with heart disease.
Type II error represents the relevance of not taking a correct decision (e.g., the risk on the health of a patient who is not sent to
ICU when this is the correct decision).
When formalizing the criteria on the decisions, the cost functions or the LPOs are defined over the set of decisions $D$. This
means that two cost functions $f : D \to [0,1]$ (or two LPOs $\leq$ over $D$) are needed for each criterion considered: one for type I error and
another one for type II error. For example, the expert may choose to represent the type I error of the health risk criterion ($h$) as a
LPO $\leq_h$ and the type II error as a cost function $f_h$. This approach requires the expert to provide only $2 \cdot \#D$ values.
For each decision $d \in D$, this value is $f_c(d)$ or $\ell_c(d)$ when the criterion $c$ is represented with a cost function or a LPO, respectively.
Compared with previous approaches (Ling et al., 2004, 2006; Turney, 2000) our proposal requires much less information and it
is easier for experts to provide. This appreciation was confirmed by the health-care professionals that evaluated the results of this
work. According to them, our approach is much closer to the way they objectively measure medical errors.
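The two counts quoted above can be compared directly. The short sketch below only evaluates the formulas $\#D \cdot (\#D - 1)$ and $2 \cdot \#D$ for one criterion; the function names are chosen for this illustration.

```python
def pairwise_error_values(n_decisions: int) -> int:
    """Values needed for error(d1, d2) over all ordered pairs with d1 != d2."""
    return n_decisions * (n_decisions - 1)


def type_error_values(n_decisions: int) -> int:
    """Values needed for one type I and one type II value per decision."""
    return 2 * n_decisions


for n in (3, 5, 10):
    print(n, pairwise_error_values(n), type_error_values(n))
```

Under this count the two approaches require the same number of values at $\#D = 3$, and the saving grows with the size of the decision set (20 values instead of 90 for ten decisions).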
4. Combination of criteria
In a decision process, questions and final decisions are not chosen based on a unique criterion but on the simultaneous
application of a set of medical criteria. This combination can be very complex and it may involve criteria with different levels
of priority and relevance. In this section we present a means to include a combination of the formalized criteria in the induction
of DTs. We first explain how the inductive algorithm selects criteria according to their priority, and then we present a method to
combine them, considering their relevance.
4.1. Selection of criteria considering their priority
In a decision process, medical and clinical criteria are arranged in different levels of priority. The priority of a criterion is
defined as the relative position of this criterion in the set of criteria when it is
used in medical decision making. This priority is represented by a positive number, 1 being the highest priority. Health-care
professionals may use priorities to rank the relevance of the criteria in the decision problem that they are trying to solve. For
example, in the selection of questions for screening patients with diabetes, the expert may consider script, economic cost and
physical comfortability criteria of higher priority than health risk or decision time. The expert can also avoid the use of priorities
just by stating that all the criteria have the same level of priority.
The criteria in the first level of priority are those which are used to guide the sequence of questions or to make the final
decisions in the decision process. Only in the case that these criteria are not able to identify the best question or decision in the
process, the criteria of the second level of priority are considered. If these also fail, then the criteria in the third level are used,
and so on. If none of the levels is useful to choose the best question or decision, then any remaining question $q_i \in Q$ or decision $d_i \in D$ is appropriate and the one with the lowest index $i$ is selected.
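The level-by-level selection described above behaves like a lexicographic comparison with an index-based tie-break. The following sketch assumes that each priority level has already been reduced to a single cost function (LPOs would first have to be transformed); the function name `select_by_priority`, the tolerance `tol` and the example costs are hypothetical.

```python
from typing import Callable, List, Sequence


def select_by_priority(candidates: Sequence[str],
                       level_costs: List[Callable[[str], float]],
                       tol: float = 1e-9) -> str:
    """Pick one candidate using cost functions ordered by priority level.

    level_costs[0] is the cost for priority 1, level_costs[1] for priority 2,
    and so on. A lower-priority level is consulted only to break ties.
    """
    remaining = list(candidates)          # keeps the original index order
    for cost in level_costs:
        best = min(cost(c) for c in remaining)
        remaining = [c for c in remaining if cost(c) - best <= tol]
        if len(remaining) == 1:
            break
    return remaining[0]                   # lowest index among the ties


questions = ["q1", "q2", "q3"]
level1 = {"q1": 0.4, "q2": 0.1, "q3": 0.1}   # hypothetical priority-1 costs
level2 = {"q1": 0.0, "q2": 0.7, "q3": 0.3}   # hypothetical priority-2 costs
chosen = select_by_priority(questions, [level1.get, level2.get])
print(chosen)
```

In this example `q2` and `q3` tie at priority 1, so priority 2 decides between them; `q1` is never reconsidered even though it is the cheapest at priority 2.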
4.2. Combination of criteria considering their relevance
After having considered priorities, the criteria of the same priority are combined according to their relevances. The relevance
of a criterion is defined as its weight within the combination of criteria used in medical decision making and it is represented by a value $\alpha_c \in [0,1]$ such that $\sum_{c \in C_i} \alpha_c = 1$, where $C_i$ contains those criteria with priority $i$. Given $\alpha_c > \alpha_{c'}$, we say that criterion $c$ is more relevant than criterion $c'$. Health-care professionals must provide the
relevance of the decision criteria as a means of weighting the relative importance of each criterion in the decision problem that
they are trying to solve. When combining $n$ criteria represented as cost functions or LPOs we deal with three cases:
Case 1: Combination of $n$ cost functions ($f_{c_1}, \ldots, f_{c_n}$): We apply a linear combination $g = \alpha_1 f_{c_1} + \cdots + \alpha_n f_{c_n}$, with $\alpha_i$ the relevance of criterion $c_i$.
Case 2: Combination of $n$ LPOs ($\leq_{c_1}, \ldots, \leq_{c_n}$): We apply the procedure of combination of LPOs described in López-Vallverdú and Riaño (2011a).
Case 3: Combination of $m$ cost functions and $n - m$ LPOs ($f_{c_1}, \ldots, f_{c_m}, \leq_{c_{m+1}}, \ldots, \leq_{c_n}$): We apply the procedure of combination of LPOs in (López-Vallverdú & Riaño, 2011a) to the $n - m$ LPOs, obtaining a single LPO $\leq'$. Then we transform $\leq'$ into a cost function $f'$ (López-Vallverdú & Riaño, 2011a) and finally combine the $m + 1$ cost functions $f_{c_1}, \ldots, f_{c_m}, f'$ as in case 1, with the relevance of $f'$ calculated as $\alpha' = \sum_{i=m+1}^{n} \alpha_i$.
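Cases 1 and 3 can be sketched as follows. The LPO combination and the LPO-to-cost transformation belong to the cited work and are not reproduced here, so the sketch takes the transformed function $f'$ as an already available input; all function names and numeric values are hypothetical.

```python
from typing import Callable, List

CostFn = Callable[[str], float]


def combine_cost_functions(fs: List[CostFn], alphas: List[float]) -> CostFn:
    """Case 1: g = alpha_1 * f_c1 + ... + alpha_n * f_cn."""
    if len(fs) != len(alphas):
        raise ValueError("one relevance per cost function is required")
    if abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("relevances of one priority level must sum to 1")
    return lambda x: sum(a * f(x) for f, a in zip(fs, alphas))


def combine_mixed(fs: List[CostFn], alphas_f: List[float],
                  f_from_lpos: CostFn, alphas_lpo: List[float]) -> CostFn:
    """Case 3: f' (obtained elsewhere from the combined LPOs) gets relevance
    alpha' = sum of the relevances of the original LPOs."""
    alpha_prime = sum(alphas_lpo)
    return combine_cost_functions(fs + [f_from_lpos], alphas_f + [alpha_prime])


# Hypothetical cost functions over two questions.
economic = {"blood_test": 0.5, "biopsy": 1.0}.get
risk = {"blood_test": 1.0, "biopsy": 0.0}.get
g = combine_cost_functions([economic, risk], [0.6, 0.4])
print(g("blood_test"), g("biopsy"))

f_prime = {"blood_test": 0.2, "biopsy": 0.8}.get   # assumed output of the LPO transformation
g_mixed = combine_mixed([economic], [0.5], f_prime, [0.3, 0.2])
print(g_mixed("blood_test"), g_mixed("biopsy"))
```

Checking that the relevances sum to 1 keeps the combined function within $[0,1]$ whenever the individual cost functions are.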
5. Induction of decision trees based on medical criteria
The three most successful and widely applied algorithms to induce DTs are ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993) and
C5.0 (Quinlan, 2003) with more than 1800 publications in medical informatics since 2000 (see footnote 1). These are greedy algorithms that
produce DTs as a result of a top-down partitioning process that starts with a dataset which contains descriptions of past decision
processes. In medical informatics (Podgorelec et al., 2002), these cases represent decisions on patients that are expressed as $(q_1(p), q_2(p), \ldots, q_m(p); d_p)$ where $p$ are the different patients considered, the $q$'s are
questions on particular conditions of the patients whose answer can be known or not, and $d_p$ is the decision taken for patient $p$. In spite of significant differences, the baseline of ID3, C4.5 and C5.0 is
equivalent: partition the dataset into subsets using the best possible question, until the decision of the remaining cases can be
considered equivalent, then
take the most appropriate decision. This behavior is described in Algorithm 1 where three choice points have been identified.
These are points in which background knowledge can be considered in order to improve the medical and clinical
comprehensibility and correctness of the DT induced.
Choice point one, in line 2, sets a condition for placing a decision node (or not). For the current dataset this condition
determines whether the situation $(q_i(p) \Rightarrow \cdots)$ is better represented with a decision $(q_i(p) \Rightarrow d_p)$ or if more questions have to be asked $(q_i(p) \Rightarrow q_j(p) \Rightarrow \cdots)$. Choice point two, in line 3, is the condition to select the best decision $d_p \in D$ for the current decision process. Choice point three, in line 7, is the condition to select the best question $q_j \in Q$ for the current decision process.
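The role of the three choice points can be illustrated with a generic greedy, top-down induction procedure in which they are passed as parameters. This is a sketch of the general scheme only, not a reproduction of Algorithm 1: the function names, the nested-dictionary tree representation and the toy data are assumptions, and unknown answers are not handled.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Case = Tuple[Dict[str, str], str]   # (answers q(p), decision d_p)


def induce(cases: List[Case], questions: List[str],
           should_stop: Callable[[List[Case]], bool],
           best_decision: Callable[[List[Case]], str],
           best_question: Callable[[List[Case], List[str]], str]):
    """Greedy top-down induction with the three choice points as parameters."""
    # Choice point one: place a decision node, or keep asking?
    if not questions or should_stop(cases):
        # Choice point two: which decision labels this node?
        return best_decision(cases)
    # Choice point three: which question partitions the current cases?
    q = best_question(cases, questions)
    rest = [x for x in questions if x != q]
    subsets: Dict[str, List[Case]] = {}
    for answers, d in cases:
        subsets.setdefault(answers[q], []).append((answers, d))
    return {q: {a: induce(sub, rest, should_stop, best_decision, best_question)
                for a, sub in subsets.items()}}


# Baseline plug-ins without any medical background knowledge (hypothetical).
pure = lambda cs: len({d for _, d in cs}) == 1
majority = lambda cs: Counter(d for _, d in cs).most_common(1)[0][0]
first = lambda cs, qs: qs[0]

cases = [
    ({"fever": "yes", "cough": "yes"}, "flu"),
    ({"fever": "yes", "cough": "no"}, "cold"),
    ({"fever": "no", "cough": "yes"}, "healthy"),
    ({"fever": "no", "cough": "no"}, "healthy"),
]
tree = induce(cases, ["fever", "cough"], pure, majority, first)
print(tree)
```

Replacing `pure`, `majority` and `first` with functions that consult cost functions or LPOs is where background knowledge would enter such a procedure.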
5.1. Introducing background knowledge in the induction of DTs
In order to improve the medical comprehensibility and correctness of the trees induced by ID3, C4.5 or C5.0 and also to be
able to produce trees with a concrete medical orientation (e.g., screening, diagnosis, treatment, etc.), the medical background
knowledge is included in Algorithm 1 (see Fig. 2). This knowledge comes represented by cost functions and LPOs related to
each one of the criteria taking part in the decision process. For each criterion, three cost functions (or LPOs) are defined: one for
questions and two more for type I and type II errors on the decisions. These cost functions and LPOs, together with the priorities
and relevances of the criteria, define the background knowledge required to produce decision trees with a medical sense.
A representation of all the background knowledge required is shown in Table 1 where $c_1, \ldots, c_k$ are the criteria selected. For each criterion $c_i$ (i.e., table row), the background knowledge provides the priority and relevance ($p_{qi}$ and $\alpha_{qi}$) when the criterion is used to select the questions, and the priority and relevance ($p_{Ii}$ and $\alpha_{Ii}$ for type-I error, and $p_{IIi}$ and $\alpha_{IIi}$ for type-II error) when it is used to select the proper decision. Each criterion $c_i$ may be represented as a cost function or a LPO, for questions, and for type I and
type II errors. Table 1 is a central component of the process described in Fig. 2.
The criteria in Table 1 are combined using the methodology described in Section 4 obtaining three global cost functions or
LPOs for each level of priority $j$: one for criteria on the questions ($g_{qj}$ or $\leq_{qj}$), another one for criteria on the decisions related to type I errors ($g_{Ij}$ or $\leq_{Ij}$) and a third one for criteria on the decisions related to type II errors ($g_{IIj}$ or $\leq_{IIj}$). With the aim of inducing DTs that are medically and clinically comprehensible and correct and, at the
same time, adapted to the health-care purpose the DT must serve, we propose an implementation for each one of the choice
points of Algorithm 1 that uses the different global cost functions and LPOs.
Footnote 1: Bibliographic search in ScienceDirect with keywords medicine AND (id3 OR c4.5 OR c5.0).
5.2. Condition for placing a decision node
In medicine, deciding whether a decision process has reached a final decision or if new questions are recommended is a trade-off
between type I and type II errors. Here, these errors are respectively represented with the cost functions $g_{Ij}$ and $g_{IIj}$ obtained for each level of priority $j$ (see Fig. 2). If we have global LPOs,
they are transformed into the cost functions $g_{Ij}$ and $g_{IIj}$ (López-Vallverdú & Riaño, 2011a). Therefore, for each priority level $j$, $g_{Ij}$ and $g_{IIj}$ provide the global cost of accepting a wrong decision and the
global cost of rejecting a correct decision over a decision process $(q_{i_1}(p), q_{i_2}(p), \ldots, q_{i_k}(p); d_p)$ on a patient $p$. Given a set of patients $P'$, if $P_{P'}(d)$ is the proportion of patients in $P'$ on which the final decision was $d$, then, considering criteria with priority $j$, the cost of placing a decision node $Dec_j(d, P')$ is calculated using Eq. (1). The condition for placing a decision node is reached if one of
the total costs for making a decision $d$ over the current dataset, considering criteria with priority 1, is lower than a threshold $\varepsilon$ (i.e., $\min_{d \in D}(Dec_1(d, P')) < \varepsilon$).
$$Dec_j(d, P') = (1 - P_{P'}(d)) \cdot g_{Ij}(d) + \sum_{d' \in D,\, d' \neq d} \left( P_{P'}(d') \cdot g_{IIj}(d') \right) \qquad (1)$$
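We compare the costs for making a decision with a threshold rather than with the costs of making a question because
questions and decisions depend on different criteria and thus they are not comparable. If a decision is correct enough for the current dataset (its cost is lower than the threshold) it can be placed in the DT with no need to calculate the cost of making a question.
This procedure considers both the information in the database (proportion of patients for each decision) and the medical background
knowledge (type I and II error cost for each wrong decision).
Fig. 2. Introducing background knowledge in Algorithm 1.
Table 1. Representation of the input background knowledge.
| Criteria | Questions: $p$ | Questions: $\alpha$ | Questions: formalization | Decisions, type I error: $p$ | Decisions, type I error: $\alpha$ | Decisions, type I error: formalization | Decisions, type II error: $p$ | Decisions, type II error: $\alpha$ | Decisions, type II error: formalization |
|---|---|---|---|---|---|---|---|---|---|
| $c_1$ | $p_{q1}$ | $\alpha_{q1}$ | $f_{q1}$ or $\leq_{q1}$ | $p_{I1}$ | $\alpha_{I1}$ | $f_{I1}$ or $\leq_{I1}$ | $p_{II1}$ | $\alpha_{II1}$ | $f_{II1}$ or $\leq_{II1}$ |
| $c_2$ | $p_{q2}$ | $\alpha_{q2}$ | $f_{q2}$ or $\leq_{q2}$ | $p_{I2}$ | $\alpha_{I2}$ | $f_{I2}$ or $\leq_{I2}$ | $p_{II2}$ | $\alpha_{II2}$ | $f_{II2}$ or $\leq_{II2}$ |
| ··· | ··· | ··· | ··· | ··· | ··· | ··· | ··· | ··· | ··· |
| $c_k$ | $p_{qk}$ | $\alpha_{qk}$ | $f_{qk}$ or $\leq_{qk}$ | $p_{Ik}$ | $\alpha_{Ik}$ | $f_{Ik}$ or $\leq_{Ik}$ | $p_{IIk}$ | $\alpha_{IIk}$ | $f_{IIk}$ or $\leq_{IIk}$ |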
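A minimal sketch of Eq. (1) and of the condition for placing a decision node is given below. It assumes that the global type I and type II cost functions of one priority level are available as dictionaries over $D$; the function names, the cost values, the patient subset and the threshold values are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def dec(d: str, decisions_in_subset: List[str],
        g_I: Dict[str, float], g_II: Dict[str, float]) -> float:
    """Eq. (1): cost of placing a decision node labelled d on the subset P'."""
    n = len(decisions_in_subset)
    prop = {k: v / n for k, v in Counter(decisions_in_subset).items()}
    # Patients for whom d is wrong, weighted by the type I cost of d.
    type_I = (1.0 - prop.get(d, 0.0)) * g_I[d]
    # Patients whose correct decision d' is not taken, weighted by its type II cost.
    type_II = sum(prop.get(other, 0.0) * g_II[other] for other in g_II if other != d)
    return type_I + type_II


def should_place_decision(decisions_in_subset: List[str], g_I: Dict[str, float],
                          g_II: Dict[str, float], threshold: float) -> bool:
    """Choice point one: is the minimum over d of Dec(d, P') below the threshold?"""
    return min(dec(d, decisions_in_subset, g_I, g_II) for d in g_I) < threshold


# Hypothetical post-operative example: 8 patients sent to ICU, 2 sent home.
subset = ["ICU"] * 8 + ["home"] * 2
g_I = {"ICU": 0.6, "floor": 0.3, "home": 0.1}
g_II = {"ICU": 0.9, "floor": 0.4, "home": 0.1}
print(dec("ICU", subset, g_I, g_II), dec("home", subset, g_I, g_II))
```

With these invented values, labelling the node `ICU` is much cheaper than labelling it `home`, because missing a patient who needed the ICU carries a high type II cost.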
5.3. Select the best decision: correctness
From a medical point of view, the most correct decision to be made over a certain set of patients must be determined
considering type I and type II errors (see $g_{Ij}$ and $g_{IIj}$ in Fig. 2). Therefore the selection of the best decision is done using Eq. (1). The best
decision to be selected is the one which minimizes $Dec_1$. If several decisions minimize $Dec_1$ then we select the one of them which minimizes $Dec_2$. The procedure is repeated for each level of priority until there is only one optimal decision. If the lowest priority
level is reached and there is not a single optimal decision selected, then the remaining decision $d_i$ with the lowest index $i$ is taken.
5.4. Select the best question: comprehensibility
A decision process is medically comprehensible if the questions are asked in an order similar to the criteria of the health-care
experts. Therefore, criteria on the questions are involved in the selection of the best question for a certain patient (see $g_{qj}$ and $\leq_{qj}$ in Fig. 2). Nevertheless, from a medical point of view the
most comprehensible question is not necessarily the question that leads to the best situation to make a final decision. In order to
select comprehensible questions which are also useful to make a final decision, we use the concept of expected cost (EC). For
each question $q_i$, the EC represents the cost of making a decision in the next step of the decision process after asking the question $q_i$. This is the average of the costs of placing decision nodes for each of the
subsets obtained when a certain set of patients $P' \subseteq P$ is partitioned using $q_i$. EC is calculated with Eq. (2), where $P'_a = \{p \in P' : q_i(p) = a\}$ and $A_i = \{a : q_i(p) = a,\ p \in P'\}$.
$$EC(q_i, P') = \frac{1}{\#A_i} \sum_{a \in A_i} \min_{d \in D} \left( Dec_1(d, P'_a) \right) \qquad (2)$$
We compute EC for each question and we select those questions whose EC is lower than a threshold $\delta$. The best question is
the one which minimizes the global cost function $g_{q1}$ (or which is in the lowest layer of the LPO $\leq_{q1}$) for criteria on the questions of level of priority 1. If several questions minimize $g_{q1}$ (or are in the lowest layer of $\leq_{q1}$) then we select the one of them which minimizes $g_{q2}$ (or which is in the lowest layer of $\leq_{q2}$). The procedure is repeated for each level of priority until there is only one optimal
question. If none of the levels is useful to select one of these questions, then the remaining question $q_i$ with the lowest index $i$ is selected. The use of the expected cost together with the criteria on the
questions guarantees a trade-off between the information in the database and the medical background knowledge when selecting
the best question.
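The expected cost of Eq. (2) and the subsequent filtering step can be sketched as follows. The sketch uses only the priority-1 cost function on the questions, assumes all answers are known, and returns `None` when no question passes the threshold (a case the text does not specify); names and values are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Case = Tuple[Dict[str, str], str]   # (answers q(p), decision d_p)


def dec(d: str, labels: List[str], g_I: Dict[str, float], g_II: Dict[str, float]) -> float:
    """Eq. (1) for one priority level, on the decisions observed in a subset."""
    prop = {k: v / len(labels) for k, v in Counter(labels).items()}
    return ((1 - prop.get(d, 0.0)) * g_I[d]
            + sum(prop.get(o, 0.0) * g_II[o] for o in g_II if o != d))


def expected_cost(q: str, cases: List[Case], g_I, g_II) -> float:
    """Eq. (2): mean, over the observed answers a, of min_d Dec_1(d, P'_a)."""
    by_answer = defaultdict(list)
    for answers, d in cases:
        by_answer[answers[q]].append(d)
    return sum(min(dec(d, labels, g_I, g_II) for d in g_I)
               for labels in by_answer.values()) / len(by_answer)


def best_question(questions: List[str], cases: List[Case], g_I, g_II,
                  g_q1: Dict[str, float], delta: float) -> Optional[str]:
    """Keep questions with EC below delta, then minimise the priority-1 cost."""
    useful = [q for q in questions if expected_cost(q, cases, g_I, g_II) < delta]
    return min(useful, key=lambda q: g_q1[q]) if useful else None


cases = [
    ({"invasive_test": "pos", "symptom": "yes", "age_group": "old"}, "sick"),
    ({"invasive_test": "pos", "symptom": "yes", "age_group": "young"}, "sick"),
    ({"invasive_test": "neg", "symptom": "no", "age_group": "old"}, "healthy"),
    ({"invasive_test": "neg", "symptom": "no", "age_group": "young"}, "healthy"),
]
g_I = {"sick": 0.2, "healthy": 0.6}
g_II = {"sick": 0.8, "healthy": 0.1}
g_q1 = {"invasive_test": 0.9, "symptom": 0.1, "age_group": 0.0}
questions = ["invasive_test", "symptom", "age_group"]
print(best_question(questions, cases, g_I, g_II, g_q1, delta=0.1))
```

In this toy dataset `age_group` is the cheapest question but is filtered out because it does not help to decide, while `symptom` is preferred over the equally informative but costlier `invasive_test`.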
6. Evaluation of medical decision trees
The accuracy of a DT is defined as the percentage of correct decisions over the total number of decisions made. Accuracy is a
statistical measure like sensitivity, specificity and positive and negative predictive values (Lang & Secic, 2006), which
numerically compares the decisions represented in the DT with the cases in the training dataset.
These measures are not based on any kind of medical background knowledge, so they are not a valid way to assess the
medical comprehensibility and correctness of the DTs. Let $path(p, DT) = \{q^p_1, q^p_2, \ldots, q^p_k\}$ be the sequence of questions asked to patient $p$ if we follow the decision tree $DT$. Comprehensibility is calculated with Eq. (3) and evaluates the sequence of
questions in $path(p, DT)$ for all the patients $p \in P$ following the indications of the decision tree $DT$. Comprehensibility takes into
account the global cost function $g_{q1}$ of the criteria on questions with priority 1. If the medical background knowledge is represented with a
global LPO $\leq_{q1}$, this has to be transformed into a cost function $g_{q1}$ (López-Vallverdú & Riaño, 2011a) before Eq. (3) is
applied.
$$comprehensibility(P, DT) = \frac{1}{\#P} \cdot \left( \#P - \sum_{p \in P} \frac{\sum_{q \in path(p, DT)} g_{q1}(q)}{\#path(p, DT)} \right) \qquad (3)$$
Let $DN$ be the set of decision nodes in a decision tree $DT$ (i.e., the terminal nodes of the DT), and let $d_n$ and $P_n$ be the decision made and the set of patients in a decision node $n \in DN$,
respectively. Correctness is calculated with Eq. (4) and it evaluates all the final decisions made in a DT with the function $Dec_1$ which returns the cost of placing a decision node considering criteria with priority 1.
$$correctness(P, DT) = \frac{1}{\#DN} \cdot \left( \#DN - \sum_{n \in DN} Dec_1(d_n, P_n) \right) \qquad (4)$$
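Eqs. (3) and (4) can be evaluated with a few lines of code once the paths and the decision-node costs are known. The sketch below assumes that every patient's path contains at least one question and that the values $Dec_1(d_n, P_n)$ have already been computed; the function signatures and the numbers are hypothetical.

```python
from typing import Dict, List


def comprehensibility(paths: List[List[str]], g_q1: Dict[str, float]) -> float:
    """Eq. (3): 1 minus the mean, over patients, of the average question cost
    along each patient's path (paths must be non-empty)."""
    total = sum(sum(g_q1[q] for q in path) / len(path) for path in paths)
    return (len(paths) - total) / len(paths)


def correctness(node_costs: List[float]) -> float:
    """Eq. (4): node_costs holds Dec_1(d_n, P_n) for each decision node n."""
    return (len(node_costs) - sum(node_costs)) / len(node_costs)


# Hypothetical example: two patients and three decision nodes.
g_q1 = {"symptom": 0.1, "invasive_test": 0.9}
paths = [["symptom"], ["symptom", "invasive_test"]]
print(comprehensibility(paths, g_q1))
print(correctness([0.14, 0.0, 0.26]))
```

Both measures increase as the costs decrease, so a tree that asks cheap questions and places low-cost decisions scores close to 1.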
7. Tests and results
In this section, we detail the tests carried out on the induction of medically comprehensible and correct DTs and the results
obtained with our algorithm on four medical domains from the UCI Repository of Machine Learning (Frank & Asuncion,
2010). The domains are diabetes with 768 patients, 8 questions and 2 decisions; heart disease with 303 patients, 13 questions
and 2 decisions;
post-operative with 90 patients, 8 questions and 3 decisions, and thyroid with 3772 patients, 20 questions and 3 decisions.
The background knowledge about the different decision criteria in all four domains has been provided by physicians of the
Clinical Hospital in Barcelona (CHB) (Spain) and the SAGESSA Health Care Group (Spain). For each domain, these
professionals selected some medical criteria and provided the background knowledge according to Table 1 and for the purposes
of patient screening and patient diagnosis.
7.1. The tests
With the aim of finding evidence that our approach (MEDBK) provides comprehensible and correct DTs which are useful to
represent medical decision processes and, at the same time, showing the limitations of the information gain based algorithms
(IG) such as ID3, C4.5 or C5.0 in the induction of medical DTs,² we have performed the following two types of test on the previous
four medical domains.
Test type 1: to show evidence that MEDBK generates comprehensible and correct medical DTs, with no loss of accuracy with
respect to IG.
Test type 2: to show evidence about the suitability of MEDBK to produce decision mechanisms for different
purposes (screening and diagnosis) for the same datasets.
The first type of test has been performed by generating DTs to screen patients in the four domains. MEDBK required the
professionals of the two health-care institutions to agree on the criteria to be used and also on the priorities and relevances of
such criteria for a screening decision process. Table 2 summarizes the selected criteria extracted from the list in
López-Vallverdú and Riaño (2012a) (column 1), their respective priorities (columns p), relevances (columns a) and their
formalization as cost functions or LPOs, for questions, and type I and type II errors on the decisions. The cost functions and
LPOs are not provided here because each medical domain tested has its own. There are 25 cost functions and 15 LPOs in
total, which are provided in López-Vallverdú and Riaño (2011b).
According to physicians, some of the criteria in Table 2 are not appropriate for selecting questions or considering type I or
type II errors. These appear as ‘–’ in the table meaning that they are not part of the background knowledge.
All these tests have been performed with and without cross-validation, and with and without pruning. Cross-validation is
used to analyze the robustness of the DTs and, in our case, it consisted of repeating the following procedure 10 times: we
randomly separated 90% of the patients of the initial dataset and used them to generate the DT, which was then tested using
the remaining 10% of the patients. Pruning is used to reduce the overfitting of DTs and to remove sections of a DT that may be
based on noisy or erroneous data. Pruning is based on a prefixed percentage of DT node representativity: during the induction
process, if a node of the DT represents less than this percentage of patients, it becomes a decision node. We used a
representativity ratio of 2%. We compared the results of these tests with the DTs obtained with IG.
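The validation and pruning procedure described above can be sketched as follows. This is only an illustrative Python sketch under stated assumptions: the names `repeated_holdout` and `becomes_decision_node` are hypothetical, the seed is arbitrary, and the representativity of a node is assumed to be measured against the total number of patients used for induction. As described in the text, each repetition draws a fresh random 90%/10% split, so the test sets of different repetitions may overlap.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def repeated_holdout(
    patients: Sequence[T],
    repetitions: int = 10,
    train_fraction: float = 0.9,
    seed: int = 0,
) -> List[Tuple[List[T], List[T]]]:
    """Return `repetitions` random (train, test) splits of the patients."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    splits = []
    for _ in range(repetitions):
        shuffled = list(patients)
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        splits.append((shuffled[:cut], shuffled[cut:]))
    return splits


def becomes_decision_node(
    node_patients: int,
    total_patients: int,
    representativity_ratio: float = 0.02,
) -> bool:
    """Pruning rule: a node covering less than the ratio becomes a decision node."""
    return node_patients < representativity_ratio * total_patients


if __name__ == "__main__":
    # 768 is the size of the diabetes domain; patients are stand-in integers.
    for train, test in repeated_holdout(range(768)):
        print(len(train), len(test))
    print(becomes_decision_node(15, 768), becomes_decision_node(16, 768))
```

In each repetition a tree would be induced from `train` and evaluated on `test`; the induction step itself is outside the scope of this sketch.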
The second type of test focused on the thyroid domain and consisted of the generation of DTs with both the IG and the
MEDBK algorithms for the decision processes of patient screening and patient diagnosis.
The results of the two types of test were analyzed by physicians of the two previously mentioned health-care institutions
and their main conclusions are summarized in Section 7.2. We also compared the accuracy, comprehensibility and correctness
of the DTs induced by MEDBK with those of the DTs generated with IG. This comparison is detailed in Section 7.3.

² In the following tests we used as IG the Weka J48 implementation of the C4.5 algorithm (Witten & Frank, 2005).
7.2. Decision trees obtained and medical analysis
With MEDBK, we have induced DTs to screen patients in the medical domains of diabetes, heart disease, post-operative, and
thyroid. Several physicians proposed the criteria, priorities and relevances in order to avoid as much as possible the presence of
questions based on risky, uncomfortable or expensive medical tests (see Table 2). In Fig. 3 we provide one of the DTs induced
with MEDBK. Contrary to the DT obtained with IG (see Fig. 1), this one is based on low-invasive questions such as age, sex, chest
pain type, resting blood pressure, resting electrocardiogram and maximum heart rate, rather than on questions based on
invasive tests such as the number of major vessels.

Table 2. Priorities, relevances and formalization of the medical criteria to perform screening decision processes.

                  Criteria on the questions    Criteria on the decisions
                                               Type I error                Type II error
Criteria          p    a     Formalization     p    a     Formalization    p    a     Formalization
Script            1    1     ≤_qs              –    –     –                –    –     –
Health risk       2    1     ≤_qh ᵃ            1    0.9   f_Ih ᵃ           1    1     f_IIh ᵃ
Physical comf.    3    0.4   ≤_qc              1    0.1   f_Ic             –    –     –
Economic cost     3    0.4   f_qe              2    0.5   f_Ie             –    –     –
Decision time     3    0.2   f_qt              2    0.5   f_It             –    –     –

ᵃ For the post-operative domain, it was formalized with an LPO.

Fig. 3. DT induced for the screening of heart disease using MEDBK.

Observe that the DT induced with
MEDBK uses the questions age and sex (highest priority according to the criterion script; López-Vallverdú & Riaño, 2011b)
before asking other questions. However, the trade-off that our method makes between the information in the database and the medical
background knowledge means that the latter does not always determine the sequence of questions. For example, in
one branch the question maximum heart rate is used to make a final decision without having asked other questions with a
higher priority, such as resting blood pressure, fasting blood sugar and serum cholesterol. The physicians qualified the behavior of
this DT as in accordance with normal practice, whereas the one depicted in Fig. 1 was rejected as inappropriate for decision making in
the screening of patients with heart disease.
This interpretation is the same for all the DTs obtained in the four medical domains tested and it is corroborated by the
numerical results discussed in Section 7.3. All the DTs obtained with IG represent medical decision processes that are more risky,
uncomfortable or expensive than the ones obtained with MEDBK.

Fig. 4. LPO over the questions to diagnose thyroid malfunctioning.
MEDBK was also used to induce different DTs for the same input data. This was possible by adjusting the set of selected
criteria and their priorities and relevances to the sort of medical decision desired (i.e., screening or diagnosis). Focusing on the
thyroid problem, MEDBK was used to generate DTs to screen and to diagnose patients. The criteria were again the ones in
Table 2 for the screening process, and script for the diagnosis process. The script criterion was represented with the LPO in
Fig. 4.³ MEDBK proposed a DT to screen patients with thyroid problems, and another DT to diagnose thyroid malfunctioning (see
Fig. 5). Both DTs were accepted as correct by the team of physicians supporting this work. The DT that was obtained with IG
was not accepted for screening purposes, but was considered acceptable for diagnosis. However, even though the DT proposed by IG was quite
similar to the one in Fig. 3 (and therefore appropriate for diagnosis⁴), the physicians concluded that even in a diagnosis there
is always a set of medical criteria guiding the selection of questions. Since these criteria cannot be incorporated into IG, this
algorithm is also unable to guarantee DTs representing good diagnosis processes. This fact has been observed in several of the
domains studied, such as diabetes, whose DTs incorporated questions related to blood pressure or pregnancy which are irrelevant
for making final diagnostic decisions.
7.3. The quality of the results
The quality of medical DTs is measured in terms of their accuracy and their medical comprehensibility and correctness.
Table 3 shows these values for the MEDBK DTs when they are used to screen patients in the domains of diabetes, heart disease,
post-operative, and thyroid. The average of the IG DTs is also provided for the sake of comparison.
The quality of a medical DT is also related to the capability of this tree to remain unchanged and still represent good medical
decisions (i.e., DT robustness) and the ability not to represent chance decisions (i.e., DT overfitting). In Table 3 we provide the
results before and after applying cross-validation in order to analyze the robustness of the DT obtained, and also the results
before and after applying pruning in order to analyze overfitting.
7.3.1. Accuracy of DTs
We observe that the mean difference between the average accuracies of the DTs without cross-validation obtained with IG
and MEDBK is 3.9% (4.3% with pruning and 3.5% without pruning). This difference can be explained by the fact that MEDBK
is not designed to maximize accuracy but to maximize comprehensibility and correctness. On the contrary, IG is an algorithm
oriented to accuracy maximization, but it obtains DTs whose accuracies are not
significantly better than the ones obtained with MEDBK. At the same time cross-validation shows that the accuracy of IG DTs
diminishes more quickly than the accuracies obtained with MEDBK DTs (15.5% and 10.7%, respectively). Therefore IG obtains
slightly more accurate but less robust DTs.

³ The 16 other questions that do not appear in the LPO are in layer 4 but they were omitted for space reasons (see
López-Vallverdú & Riaño, 2011b).
⁴ The physicians argued that some cases of thyroid problems could not be diagnosed with the IG and MEDBK DTs because
there were not instances of such cases in the input database.

Fig. 5. DT induced by MEDBK for the diagnosis of thyroid.

Table 3. Results obtained for DTs to screen patients in four medical domains with MEDBK.

                            With pruning                      Without pruning
                            Acc. (%)   Com. (%)   Cor. (%)    Acc. (%)   Com. (%)   Cor. (%)
With cross-validation
  Diabetes                  71.4       78.0       77.9        74.0       78.5       79.6
  Heart disease             77.7       92.7       87.7        74.2       90.8       85.5
  Post-operative            64.4       90.9       90.0        57.8       82.5       84.2
  Thyroid                   95.4       85.4       95.5        95.9       88.4       95.9
  Average                   77.2       86.8       87.8        75.5       85.1       86.3
  Average IG                75.5       76.2       85.4        75.3       44.9       85.4
Without cross-validation
  Diabetes                  78.5       81.8       84.0        83.1       81.0       86.9
  Heart disease             82.5       92.0       90.5        91.7       88.6       95.4
  Post-operative            75.6       83.6       94.7        92.2       83.2       98.3
  Thyroid                   95.5       83.9       95.5        97.5       81.0       97.5
  Average                   83.0       85.3       91.2        91.1       83.5       94.5
  Average IG                87.3       39.0       91.6        94.6       42.4       95.8
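The accuracy figures quoted in Section 7.3.1 can be re-derived from the "Average" and "Average IG" rows of Table 3. The short Python sketch below is only a worked check of that arithmetic; the variable names are hypothetical, and each pair holds the average accuracy (%) with pruning and without pruning, copied from the table.

```python
# Average accuracies (%) from Table 3 as (with pruning, without pruning).
avg_acc = {
    ("MEDBK", "cv"): (77.2, 75.5),
    ("MEDBK", "no_cv"): (83.0, 91.1),
    ("IG", "cv"): (75.5, 75.3),
    ("IG", "no_cv"): (87.3, 94.6),
}


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Accuracy gap IG - MEDBK without cross-validation, per pruning setting.
gap_with_pruning, gap_without_pruning = (
    ig - medbk
    for ig, medbk in zip(avg_acc[("IG", "no_cv")], avg_acc[("MEDBK", "no_cv")])
)
mean_gap = mean([gap_with_pruning, gap_without_pruning])

# Average accuracy lost when moving from no cross-validation to cross-validation.
loss = {
    alg: mean(
        no_cv - cv
        for no_cv, cv in zip(avg_acc[(alg, "no_cv")], avg_acc[(alg, "cv")])
    )
    for alg in ("IG", "MEDBK")
}

if __name__ == "__main__":
    print(gap_with_pruning, gap_without_pruning, mean_gap)
    print(loss["IG"], loss["MEDBK"], loss["IG"] - loss["MEDBK"])
```

Up to floating-point rounding, the gaps come out at about 4.3 and 3.5 (mean 3.9), and the average losses at about 15.55 for IG and 10.7 for MEDBK, which matches the values reported in the text to one decimal place; their difference, about 4.85, corresponds to the figure of 4.9% quoted in Section 7.3.4.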
7.3.2. Comprehensibility of DTs
The results of comprehensibility are clearly favorable to MEDBK, whose average comprehensibility is 43.7% better. Thyroid
is a clear example in which comprehensibility is more than 60% better with respect to IG trees, for all the tests performed. In all
four domains, the results show that the order of the questions in the DTs produced with MEDBK is more coherent from a
medical point of view.
7.3.3. Correctness of DTs
The strong relation between accuracy (i.e., percentage of correct decisions) and correctness (i.e., quality of the decisions)
means that the results obtained by IG in terms of mean correctness are often good. Nevertheless, when comparing IG and
MEDBK DTs we find cases where IG DTs are better in accuracy but worse in correctness. This means that MEDBK makes more
mistakes than IG (1.4% on average) but these mistakes are less important. This happens in
several cases as, for example, in the DTs for screening of post-operative patients with pruning. According to accuracy, IG
obtains a better DT than MEDBK (with respective accuracies 82.2% and 75.6%), but medical correctness indicates that the errors
of the DTs induced with MEDBK are less critical from a medical or clinical point of view (this is represented with the respective
correctness values 89.7% and 94.7%).
7.3.4. Robustness of DTs
The results in Table 3 suggest that MEDBK DTs are better at making decisions over new patients. With cross-validation, the
average loss of accuracy is 4.9% lower with MEDBK than with IG, with respect to the DTs generated without cross-validation.
The differences in the loss of comprehensibility and correctness are smaller but also favorable to MEDBK (1.6% and 2.5%,
respectively). This means that the DTs generated with MEDBK are more robust than the trees generated with IG.
7.3.5. Overfitting of DTs
Pruning is a satisfactory procedure because it obtains smaller DTs, which reduces overfitting without a significant loss
of accuracy, correctness and comprehensibility. Both MEDBK and IG obtain DTs with a similar average loss of accuracy and
correctness when applying pruning (always below 3.5%). As far as comprehensibility is concerned, DTs of MEDBK are
medically better after pruning (1.8% on average), while those of IG are significantly worse (6.1% on average).
8. Conclusions
The information gain based algorithms to induce decision trees in complex domains cannot always guarantee acceptable
results from an expert point of view. Specifically, in the medical domain, these algorithms do not consider health-care criteria and
therefore important aspects such as the risks of the clinical procedures or patient discomfort can be left out of their decision
processes. Moreover, medical errors in the final decisions can be critical, and therefore the recommendations of these algorithms
cannot be taken as medically correct. For the same dataset, these algorithms always produce the same DT regardless of its final medical purpose or
intentionality. This is not correct because, for example, a good DT for diagnosing is not necessarily a good DT for other medical
decision processes like screening or disease treatment.
Here, we have proposed an algorithm to induce medical DTs that uses a combination of some relevant health-care criteria.
The chosen criteria and their respective priorities and relevances allow the algorithm to produce DTs oriented to different medical
purposes.
The tests performed in the medical domains of diabetes, heart disease, post-operative and thyroid malfunctioning for the
purposes of screening and diagnosing show that the medical DTs generated with the new algorithm are medically
comprehensible and correct, that their accuracy is not significantly worse than the one obtained with information gain based
algorithms, and that they are more robust to new data. The sequences of questions of the trees in these domains are medically comprehensible
and do not imply unnecessarily risky, uncomfortable or expensive medical tests. With respect to correctness, the presence of
critically wrong decisions is avoided. Cross-validation and pruning tests indicate that the DTs obtained by our algorithm are
robust and resistant to overfitting.
In the future, this work will be continued following three lines. The first line is the exploitation of health-care databases about
different medical decision processes like prevention, screening, diagnosing and patient treatment, in order to automatically
adjust the relevances that produce the most accurate, comprehensible and
correct DTs with respect to the medical decisions contained in the data. Our aim is to consider all the criteria and let the
optimization algorithm determine the relevances, which will approach zero for those criteria that are not used in each specific
decision process. In the end, we expect to have a family of criterion-relevance pairs describing each medical process and we
will use them to compare the way of working of different medical centres.
The second line will adapt the current induction of DTs to the induction of clinical algorithms (Bohada, Riaño, &
López-Vallverdú, 2012; Riaño, López-Vallverdú, & Tu, 2008). A clinical algorithm (CA) is a flow diagram consisting of
branching-logic pathways which represent sequences of clinical decisions, for teaching clinical decision making, and for
guiding patient care. These branching-logic pathways can be represented with DTs, therefore they can be induced with the
algorithm in Section 5. Considering this, we will aim to induce medically comprehensible and correct CAs from hospital
databases by including medical background knowledge.
The third line will face the induction of medical DTs following a different approach. We can accept that medical criteria are
found implicit in the data available about medical decisions. Starting with databases containing decision processes as
$(q_{i_1}(p), q_{i_2}(p), \ldots, q_{i_k}(p), d_p)$, we will study the possibilities of generating accurate, comprehensible
and correct DTs without considering an explicit representation of medical criteria (Torres, López-Vallverdú, & Riaño, 2011).
Acknowledgements
We would like to thank Dr. Collado and Dr. Alonso for their continuous support leading the groups of health-care
professionals from the SAGESSA Health Care Group (Spain) and the Clinical Hospital in Barcelona (Spain), respectively.
References
Arsene, O., Dumitrache, I., & Mihu, I. (2011). Medicine expert system dynamic Bayesian network and ontology based. Expert Systems with Applications, 38, 15253–15261.
Bohada, J. A., Riaño, D., & López-Vallverdú, J. A. (2012). Automatic generation of clinical algorithms within the state-decision-action model. Expert Systems with Applications. <http://dx.doi.org/10.1016/j.eswa.2012.02.196>.
Candell Riera, J. (2003). Estratificación pronóstica tras infarto agudo de miocardio. Revista Espanola de Cardiologia, 56(3), 303–313.
Chai, X., Deng, L., Yang, Q., & Ling, C. X. (2004). Test-cost sensitive Naïve Bayesian classification. In Proceedings 4th IEEE international conference on data mining.
Chapman, G. B., & Sonnenberg, F. A. (Eds.). (2003). Decision making in health care: Theory, psychology and applications. Cambridge series on judgement and decision making. Cambridge University Press.
Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3(4), 261–283.
Fauci, A. S., Braunwald, E., Kasper, D. L., Hauser, S. L., Longo, D. L., & Jameson, J. L., et al. (Eds.). (2009). Featuring the complete contents of Harrison’s principles of internal medicine (17th ed.). McGraw Hill. Harrison’s Online.
Frank, A., & Asuncion, A. (2010). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. <http://archive.ics.uci.edu/ml>.
Horning, K. K., Hoehns, J. D., & Doucette, W. R. (2007). Adherence to clinical practice guidelines for 7 chronic conditions in long-term-care patients who received pharmacist disease management services versus traditional drug regimen review. Journal of Managed Care Pharmacy, 13(1), 28–36.
Husmeier, D., Dybowski, R., & Roberts, S. (Eds.). (2004). Probabilistic modelling in bioinformatics and medical informatics. Springer.
Lang, T. A., & Secic, M. (2006). How to report statistics in medicine (2nd ed.). American College of Physicians.
Ling, C. X., Yang, Q., Wang, J., & Zhang, S. (2004). Decision trees with minimal costs. In Proceedings 21st international conference on machine learning.
Ling, C. X., Sheng, V. S., & Yang, Q. (2006). Test strategies for cost-sensitive decision trees. IEEE Transaction on Knowledge and Data Engineering, 18(8), 1055–1067.
López-Vallverdú, J. A., & Riaño, D. (2011a). Cost functions and partial orders as medical background knowledge: formalization and operations. Research report DEIM-RR-11-003. Spain: Universitat Rovira i Virgili. <http://deim.urv.cat/recerca/reports/DEIM-RR-11-003.pdf> Accessed March 2012.
López-Vallverdú, J. A., & Riaño, D. (2011b). Repository of background knowledge. <http://banzai-deim.urv.net/repositories/repository.pdf> Accessed March 2012.
López-Vallverdú, J. A., & Riaño, D. (2012a). Decision criteria in health-care and their representation. Research report DEIM-RR-12-001. Spain: Universitat Rovira i Virgili. <http://deim.urv.cat/recerca/reports/DEIM-RR-12-001.pdf> Accessed March 2012.
López-Vallverdú, J. A., Riaño, D., & Collado, A. (2007). Increasing acceptability of decision trees with domain attributes partial orders. In Proceedings of the 20th IEEE international symposium on computer-based medical systems, Maribor, Slovenia.
Lucas, P., van der Gaag, L., & Abu-Hanna, A. (2004). Bayesian networks in biomedicine and health-care. Artificial Intelligence in Medicine, 30(3), 201–214.
Podgorelec, V., Kokol, P., Stiglic, B., & Rozman, I. (2002). Decision trees: An overview and their use in medicine. Journal of Medical Systems, 26(5), 445–463.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA, USA: Morgan Kaufmann.
Quinlan, J. R. (2003). C5.0 Online tutorial. <http://www.rulequest.com> Accessed March 2012.
Riaño, D., López-Vallverdú, J. A., & Tu, S. (2008). Mining hospital data to learn SDA* clinical algorithms. LNAI (Vol. 4924, pp. 46–61).
Shiffman, R. N. (1997). Representation of clinical practice guidelines in conventional and augmented decision tables. Journal of the American Medical Informatics Association, 4, 382–393.
Torres, P., López-Vallverdú, J. A., & Riaño, D. (2011). Inducing decision trees from medical decision processes. LNAI (Vol. 6512, pp. 40–55).
Turney, P. D. (2000). Types of cost in inductive concept learning. In Workshop on cost-sensitive learning at the 7th international conference on machine learning. California: Stanford University.
Velikova, M., de Carvalho Ferreira, N., & Lucas, P. (2007). Bayesian network decomposition for modeling breast cancer detection. In Artificial intelligence in medicine, AIME 2007, Amsterdam, The Netherlands. LNAI (Vol. 4594, pp. 346–350). Springer.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). Morgan Kaufmann.
Yeh, D., Cheng, C., & Chen, Y. (2011). A predictive model for cerebrovascular disease using data mining. Expert Systems with Applications, 38(7), 8970–8977.
