economic cost (Chai, Deng, Yang, & Ling, 2004; Ling et al., 2004; Ling, Sheng, & Yang, 2006) or improving the adherence (Horning, Hoehns, & Doucette, 2007) of the decision process to medical standards (López-Vallverdú et al., 2007). However, these approaches do not guarantee medical comprehensibility and correctness. On the one hand, none of the previous criteria (length, economic cost, and adherence to clinical standards) is useful when considered alone, because real medical decisions attend not only to these criteria but also to many others that are combined simultaneously. On the other hand, the induction of decision trees with those criteria cannot differentiate among the different possible application purposes. This differentiation is important because, for example, a comprehensible and correct decision tree for diagnosis can be completely wrong for screening purposes.

In this context, in Section 2 we formalize the concept of medical decision process. In Section 3 we propose the mechanisms to formalize medical criteria in order to include them in a decision tree induction algorithm, and in Section 4 we propose a methodology to combine them. In Section 5 we present a general algorithm to induce decision trees, identifying the points where medical decision criteria can be introduced as background knowledge. These are called choice points. In Section 6, the measures of accuracy, comprehensibility and correctness for the evaluation of the induced decision trees are formalized. The inductive algorithm is used in Section 7 to generate decision trees for the purposes of screening and diagnosing in four medical domains. The results have been analyzed from a statistical and a medical point of view, and the conclusions are reported in Section 8.

2. Formalizing a decision process

In medicine there are many different descriptions of what a decision process is (Fauci et al., 2009); it is therefore necessary to define the concept of medical decision process used in this paper. Here, a decision process is a sequence of medical questions or observations that lead to a concrete medical decision. In a particular domain, if $Q = \{q_1, q_2, \ldots, q_m\}$ is the set of valid questions, $D = \{d_1, d_2, \ldots, d_n\}$ the set of possible decisions and $q_i(p)$ the answer to the question $q_i \in Q$ for a certain patient $p$, then the finite sequence $(q_{i_1}(p), q_{i_2}(p), \ldots, q_{i_k}(p), d_p)$ represents a medical decision process for patient $p$ in which a health-care professional takes decision $d_p \in D$ after having asked questions $q_{i_1}, q_{i_2}, \ldots, q_{i_k} \in Q$ in this exact order. Observe that the questions represent patient signs and symptoms, but also consultation of the patient record or of an expert.

Individual decision processes can be generalized and structured as decision mechanisms that not only capture the medical knowledge supporting each individual decision process, but also provide a way of conducting new decisions under other circumstances or for other patients. Among the existing decision mechanisms (see, for example, Arsene et al., 2011; Clark & Niblett, 1989; Chapman & Sonnenberg, 2003; Husmeier et al., 2004; Podgorelec et al., 2002; Shiffman, 1997) here we choose decision trees because they are structured, explicit, and easy to understand and to interpret, which are compelling requirements of a medical process. A decision tree (DT) is a decision mechanism that describes decision processes that always start with the same question and concatenate questions in such a way that each possible answer to a question is followed by a new question or by a final decision. Decision mechanisms such as DTs can be obtained automatically by applying induction algorithms.
These algorithms start from a set of data represented in the form $(q_1(p), q_2(p), \ldots, q_m(p); d_p)$, where $p$ ranges over the different patients, the $q$'s are questions whose answers may or may not be known for each patient $p$, and $d_p$ is the decision taken for patient $p$. Observe that the order of the questions is the same for all the patients, since it defines the description of the case rather than a medical decision process in which the questions to the different patients can be asked in a different order.

J.A. López-Vallverdú et al. / Expert Systems with Applications 39 (2012) 11782–11791

Fig. 1 shows a DT induced using an information gain based algorithm to identify patients with heart disease (questions are represented as ellipses and decisions as boxes). It does not consider any medical background knowledge, so medical comprehensibility and correctness are not guaranteed. For example, the question about the number of major vessels in the root requires an invasive test for all the patients. The systematic application of this test as the first step of the decision process lacks medical sense, and it is completely wrong for certain medical decision processes such as screening.

The algorithms to induce comprehensible and correct decision mechanisms from a set of data must then be based on one or more medical decision criteria that extend the statistical sense of asking one or another question with a clinical sense. These medical decision criteria and their formalization are introduced in the next section.

3. Decision criteria in health-care and their formalization

In medicine, the list of criteria which may be combined to make decisions is very large and diverse. A systematic approach to the organization of such criteria and their representation using cost functions and layered partial orders (LPOs) is proposed in López-Vallverdú and Riaño (2012).
In this section, we explain how these criteria can be formalized in order to decide about the appropriate questions and decisions in a decision process. This appropriateness is used to determine the best order of questions and decisions in medical decision processes.

3.1. Criteria on the questions

The order in which questions are asked in a decision process is decided according to the criteria on the questions. They are used to determine whether a question is more or less adequate than another one in a given context. For example, in diabetes screening, we may use the decision time criterion to decide to perform an oral glucose tolerance test rather than obtaining the longer 2-h serum insulin value. When formalizing the criteria on the questions, cost functions or LPOs are defined over the set of questions $Q$. For example, the expert may choose to represent the economic cost criterion as a cost function $f_e : Q \to [0,1]$ and the script criterion, which measures the adherence of the procedure to the sequence specified by medical standards (López-Vallverdú & Riaño, 2012), as a LPO $\leq_s$ over $Q$.

Criteria on the questions can be contextual or context-free. A criterion is said to be contextual when it depends on the context (related disease, medical purpose, etc.) of the medical decision process. In a certain context, the answer to a question may be important in order to make a decision, but in another context, this question may be totally unnecessary. For example, the script value of answering the question stability_of_blood_pressure is greater if we are deciding where a post-operative patient must be sent, than if we are determining whether the patient is hypothyroid or not. Script and granularity (López-Vallverdú & Riaño, 2012) are examples of contextual criteria.

Context-free criteria do not change when they are used in different contexts because they depend on the health-care test needed to obtain the answer of the question.
For example, economic cost is a context-free criterion. The question sodium_on_blood has no economic cost itself but its economic cost is related to the blood test that provides the answer for this question. The economic cost of a regular blood test is always the same regardless of the context. Moreover, a health-care test can provide simultaneous answers to several questions of the decision process. For example, a regular
blood test informs about the levels of sodium, urea, creatinine, etc., providing an answer to the question sodium_on_blood, but also to urea_on_blood, creatinine_on_blood, etc. Decision time, economic cost, health risk and physical comfortability (López-Vallverdú & Riaño, 2012) are examples of context-free criteria. Notice that once a health-care test is performed to answer one of the questions, it does not have to be performed again to answer the rest of the questions related to that health-care test. Let $t$ be a health-care test that provides the answer to a set of questions $Q' \subseteq Q$. When a question $q' \in Q'$ is asked in the decision process, the values for context-free criteria of the questions in $Q'$ change, so that $\forall q' \in Q'$, $f(q') = 0$ (for cost functions), and each question $q' \in Q'$ is moved to the first layer of $\leq$ (for LPOs).

3.2. Criteria on the decisions

A decision process concludes with a final decision that can be right or wrong. The relevance of the error in wrong decisions is evaluated with the criteria on the decisions. For example, according to the health risk criterion, it is safer to wrongly send a post-operative patient to the Intensive Care Unit (ICU) than to send him home by mistake. Some previous works have evaluated the possible wrong decisions made in a decision process (Ling et al., 2004, 2006; Turney, 2000). In these approaches, an expert has to provide a cost function $\mathrm{error}(d_1, d_2)$ which returns the error of performing $d_2$ when the correct decision is $d_1$, for each pair of decisions $d_1$, $d_2$ in the set of possible decisions $D$. This approach has the inconvenience that the expert is required to provide a value for each one of the $\#D \cdot (\#D - 1)$ possible errors in the decision process (where $\#D$ is the cardinality of $D$). For medium and large sets of decisions this is a large amount of information for experts to provide.
In order to reduce this effort, here we use a different approach that divides the error into type I and type II medical errors, which are concepts that medical doctors are familiar with. Type I error represents the relevance of taking a wrong decision (e.g., the economic cost if we send a patient to ICU when this is a wrong decision). Type II error represents the relevance of not taking a correct decision (e.g., the risk on the health of a patient who is not sent to ICU when this is the correct decision).

Fig. 1. Decision tree to identify patients with heart disease.

When formalizing the criteria on the decisions, the cost functions or the LPOs are defined over the set of decisions $D$. This means that two cost functions $f : D \to [0,1]$ (or two LPOs $\leq$ over $D$) are needed for each criterion considered: one for type I error and another one for type II error. For example, the expert may choose to represent the type I error of the health risk criterion ($h$) as a LPO $\leq_h$ and the type II error as a cost function $f_h$.

This approach requires the expert to provide only $2 \cdot \#D$ values. For each decision $d \in D$, this value is $f_c(d)$ or $\ell_c(d)$ when the criterion $c$ is represented with a cost function or a LPO, respectively. Compared with previous approaches (Ling et al., 2004, 2006; Turney, 2000) our proposal requires much less information and is easier for experts to provide. This assessment was confirmed by the health-care professionals that evaluated the results of this work. According to them, our approach is much closer to the way they objectively measure medical errors.

4. Combination of criteria

In a decision process, questions and final decisions are not chosen based on a unique criterion but on the simultaneous application of a set of medical criteria. This combination can be very complex and it may involve criteria with different levels of priority and relevance.
In this section we present a means to include a combination of the formalized criteria in the induction of DTs. We first explain how the inductive algorithm selects criteria according to their priority, and then we present a method to combine them, considering their relevance.

4.1. Selection of criteria considering their priority

In a decision process, medical and clinical criteria are arranged in different levels of priority. The priority of a criterion is defined as the relative position of this criterion in the set of criteria when it is used in medical decision making. This priority is represented by a positive number, 1 being the highest priority. Health-care professionals may use priorities to rank the relevance of the criteria in the decision problem that they are trying to solve. For example, in the selection of questions for screening patients with diabetes, the expert may consider script, economic cost and physical comfortability criteria of higher priority than health risk or decision time. The expert can also avoid the use of priorities just by stating that all the criteria have the same level of priority.

The criteria in the first level of priority are those which are used to guide the sequence of questions or to make the final decisions in the decision process. Only in the case that these criteria are not able to identify the best question or decision in the process are the criteria of the second level of priority considered. If these also fail, then the criteria in the third level are used, and so on. If none of the levels is useful to choose the best question or decision, then any remaining question $q_i \in Q$ or decision $d_i \in D$ is appropriate and the one with the lowest index $i$ is selected.

4.2. Combination of criteria considering their relevance

After having considered priorities, the criteria of the same priority are combined according to their relevances. The relevance of a criterion is defined as its weight within the combination of criteria used in medical decision making, and it is represented by a value $\alpha_c \in [0,1]$ such that $\sum_{c \in C_i} \alpha_c = 1$, where $C_i$ contains those criteria with priority $i$. Given $\alpha_c > \alpha_{c'}$ we say that criterion $c$ is more relevant than criterion $c'$. Health-care professionals must provide the relevance of the decision criteria as a means of weighting the relative importance of each criterion in the decision problem that they are trying to solve.
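The level-by-level rule of Section 4.1 can be illustrated with a short sketch. It assumes that each priority level has already been reduced to a single cost function over the candidates (lower is better). The function name `select_by_priority`, the tolerance `tol` used to decide ties, and the sample costs are illustrative assumptions and are not taken from the paper.

```python
from typing import Callable, List, Sequence


def select_by_priority(
    candidates: Sequence[str],
    cost_per_level: List[Callable[[str], float]],
    tol: float = 1e-12,
) -> str:
    """Return one candidate, consulting priority levels in order (level 1 first)."""
    if not candidates:
        raise ValueError("no candidates")
    remaining = list(candidates)  # the original order encodes the index i
    for cost in cost_per_level:
        best = min(cost(c) for c in remaining)
        # Keep only the candidates that are optimal at this priority level.
        remaining = [c for c in remaining if cost(c) - best <= tol]
        if len(remaining) == 1:
            break
    # Still tied after the last level: the lowest index is selected.
    return remaining[0]


# Hypothetical costs: level 1 cannot separate q2 from q3, level 2 can.
level1 = {"q1": 0.5, "q2": 0.2, "q3": 0.2}
level2 = {"q1": 0.0, "q2": 0.7, "q3": 0.1}
chosen = select_by_priority(
    ["q1", "q2", "q3"], [level1.__getitem__, level2.__getitem__]
)
```

Passing the candidates in index order makes the final fallback (lowest index) a simple `remaining[0]`.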
When combining $n$ criteria represented as cost functions or LPOs we deal with three cases:

Case 1: Combination of $n$ cost functions $(f_{c_1}, \ldots, f_{c_n})$: We apply a linear combination $g = \alpha_1 f_{c_1} + \cdots + \alpha_n f_{c_n}$, with $\alpha_i$ the relevance of criterion $c_i$.

Case 2: Combination of $n$ LPOs $(\leq_{c_1}, \ldots, \leq_{c_n})$: We apply the procedure of combination of LPOs described in López-Vallverdú and Riaño (2011a).

Case 3: Combination of $m$ cost functions and $n - m$ LPOs $(f_{c_1}, \ldots, f_{c_m}, \leq_{c_{m+1}}, \ldots, \leq_{c_n})$: We apply the procedure of combination of LPOs (López-Vallverdú & Riaño, 2011a) to the $n - m$ LPOs, obtaining a single LPO $\leq'$. Then we transform $\leq'$ into a cost function $f'$ (López-Vallverdú & Riaño, 2011a) and finally combine the $m + 1$ cost functions $f_{c_1}, \ldots, f_{c_m}, f'$ as in case 1, with the relevance of $f'$ calculated as $\alpha' = \sum_{i=m+1}^{n} \alpha_i$.

5. Induction of decision trees based on medical criteria

The three most successful and widely applied algorithms to induce DTs are ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993) and C5.0 (Quinlan, 2003), with more than 1800 publications in medical informatics since 2000.¹ These are greedy algorithms that produce DTs as a result of a top-down partitioning process that starts with a dataset which contains descriptions of past decision processes. In medical informatics (Podgorelec et al., 2002), these cases represent decisions on patients that are expressed as $(q_1(p), q_2(p), \ldots, q_m(p); d_p)$, where $p$ ranges over the different patients considered, the $q$'s are questions on particular conditions of the patients whose answer may or may not be known, and $d_p$ is the decision taken for patient $p$.

In spite of significant differences, the baseline of ID3, C4.5 and C5.0 is equivalent: partition the dataset into subsets using the best possible question, until the decision of the remaining cases can be considered equivalent, then take the most appropriate decision. This behavior is described in Algorithm 1, where three choice points have been identified. These are points in which background knowledge can be considered in order to improve the medical and clinical comprehensibility and correctness of the DT induced.

Choice point one, in line 2, sets a condition for placing a decision node (or not). For the current dataset this condition determines whether the situation $(q_i(p) \Rightarrow \cdots)$ is better represented with a decision $(q_i(p) \Rightarrow d_p)$ or if more questions have to be asked $(q_i(p) \Rightarrow q_j(p) \Rightarrow \cdots)$.

Choice point two, in line 3, is the condition to select the best decision $d_p \in D$ for the current decision process.

Choice point three, in line 7, is the condition to select the best question $q_j \in Q$ for the current decision process.

5.1. Introducing background knowledge in the induction of DTs

In order to improve the medical comprehensibility and correctness of the trees induced by ID3, C4.5 or C5.0, and also to be able to produce trees with a concrete medical orientation (e.g., screening, diagnosis, treatment, etc.), the medical background knowledge is included in Algorithm 1 (see Fig. 2). This knowledge is represented by cost functions and LPOs related to each one of the criteria taking part in the decision process. For each criterion, three cost functions (or LPOs) are defined: one for questions and two others for type I and type II errors on the decisions. These cost functions and LPOs, together with the priorities and relevances of the criteria, define the background knowledge required to produce decision trees with a medical sense. A representation of all the background knowledge required is shown in Table 1, where $c_1, \ldots, c_k$ are the criteria selected.
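To make the role of the three choice points concrete, the following is a minimal sketch of a generic top-down induction loop. It is not the authors' Algorithm 1, and its structure does not correspond to the line numbers cited above. The three choice points are passed in as functions (`should_stop`, `best_decision` and `best_question`, all hypothetical names), so that either purely statistical rules or rules based on background knowledge can be plugged in. The sketch assumes that every answer is known for every case.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple, Union

Case = Tuple[Dict[str, str], str]                  # (answers by question, decision)
Tree = Union[str, Tuple[str, Dict[str, "Tree"]]]   # leaf decision or (question, branches)


def induce(
    cases: List[Case],
    questions: List[str],
    should_stop: Callable[[List[Case]], bool],
    best_decision: Callable[[List[Case]], str],
    best_question: Callable[[List[Case], List[str]], str],
) -> Tree:
    # Choice point 1: is a decision node placed here?
    if not questions or should_stop(cases):
        # Choice point 2: which decision is placed?
        return best_decision(cases)
    # Choice point 3: which question is asked next?
    q = best_question(cases, questions)
    rest = [x for x in questions if x != q]
    branches: Dict[str, Tree] = {}
    for answer in sorted({a[q] for a, _ in cases}):
        subset = [(a, d) for a, d in cases if a[q] == answer]
        branches[answer] = induce(subset, rest, should_stop, best_decision, best_question)
    return (q, branches)


# Purely statistical placeholders, used here only to exercise the loop.
def pure(cases: List[Case]) -> bool:
    return len({d for _, d in cases}) == 1


def majority(cases: List[Case]) -> str:
    return Counter(d for _, d in cases).most_common(1)[0][0]


def first_question(cases: List[Case], questions: List[str]) -> str:
    return questions[0]


toy_cases: List[Case] = [
    ({"fever": "yes", "cough": "no"}, "treat"),
    ({"fever": "no", "cough": "no"}, "home"),
    ({"fever": "yes", "cough": "yes"}, "treat"),
]
toy_tree = induce(toy_cases, ["fever", "cough"], pure, majority, first_question)
```

Keeping the choice points as parameters means that the recursion itself never needs to change when the medical purpose (for example, screening or diagnosis) changes.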
For each criterion $c_i$ (i.e., table row), the background knowledge provides the priority and relevance ($p_{qi}$ and $\alpha_{qi}$) when the criterion is used to select the questions, and the priority and relevance ($p_{Ii}$ and $\alpha_{Ii}$ for type I error, and $p_{IIi}$ and $\alpha_{IIi}$ for type II error) when it is used to select the proper decision. Each criterion $c_i$ may be represented as a cost function or a LPO, for questions, and for type I and type II errors. Table 1 is a central component of the process described in Fig. 2.

Fig. 2. Introducing background knowledge in Algorithm 1.

Table 1. Representation of the input background knowledge.

| Criteria | Questions: p | Questions: α | Questions: formalization | Type I error: p | Type I error: α | Type I error: formalization | Type II error: p | Type II error: α | Type II error: formalization |
|---|---|---|---|---|---|---|---|---|---|
| $c_1$ | $p_{q1}$ | $\alpha_{q1}$ | $f_{q1}$ or $\leq_{q1}$ | $p_{I1}$ | $\alpha_{I1}$ | $f_{I1}$ or $\leq_{I1}$ | $p_{II1}$ | $\alpha_{II1}$ | $f_{II1}$ or $\leq_{II1}$ |
| $c_2$ | $p_{q2}$ | $\alpha_{q2}$ | $f_{q2}$ or $\leq_{q2}$ | $p_{I2}$ | $\alpha_{I2}$ | $f_{I2}$ or $\leq_{I2}$ | $p_{II2}$ | $\alpha_{II2}$ | $f_{II2}$ or $\leq_{II2}$ |
| ··· | ··· | ··· | ··· | ··· | ··· | ··· | ··· | ··· | ··· |
| $c_k$ | $p_{qk}$ | $\alpha_{qk}$ | $f_{qk}$ or $\leq_{qk}$ | $p_{Ik}$ | $\alpha_{Ik}$ | $f_{Ik}$ or $\leq_{Ik}$ | $p_{IIk}$ | $\alpha_{IIk}$ | $f_{IIk}$ or $\leq_{IIk}$ |

The criteria in Table 1 are combined using the methodology described in Section 4, obtaining three global cost functions or LPOs for each level of priority $j$: one for criteria on the questions ($g_{qj}$ or $\leq_{qj}$), another one for criteria on the decisions related to type I errors ($g_{Ij}$ or $\leq_{Ij}$), and a third one for criteria on the decisions related to type II errors ($g_{IIj}$ or $\leq_{IIj}$). With the aim of inducing DTs that are medically and clinically comprehensible and correct and, at the same time, adapted to the health-care purpose the DT must serve, we propose an implementation for each one of the choice points of Algorithm 1 that uses the different global cost functions and LPOs.

¹ Bibliographic search in ScienceDirect with keywords medicine AND (id3 OR c4.5 OR c5.0).

5.2. Condition for placing a decision node

In medicine, deciding whether a decision process has reached a final decision or if new questions are recommended is a trade-off between type I and type II errors. Here, these errors are respectively represented with the cost functions $g_{Ij}$ and $g_{IIj}$ obtained for each level of priority $j$ (see Fig. 2). If we have global LPOs, they are transformed into the cost functions $g_{Ij}$ and $g_{IIj}$ (López-Vallverdú & Riaño, 2011a). Therefore, for each priority level $j$, $g_{Ij}$ and $g_{IIj}$ provide the global cost of accepting a wrong decision and the global cost of rejecting a correct decision over a decision process $(q_{i_1}(p), q_{i_2}(p), \ldots, q_{i_k}(p), d_p)$ on a patient $p$. Given a set of patients $P'$, if $P_{P'}(d)$ is the proportion of patients in $P'$ on which the final decision was $d$, then, considering criteria with priority $j$, the cost of placing a decision node $\mathrm{Dec}_j(d, P')$ is calculated using Eq. (1). The condition for placing a decision node is reached if one of the total costs for making a decision $d$ over the current dataset, considering criteria with priority 1, is lower than a threshold (i.e., $\min_{d \in D} \mathrm{Dec}_1(d, P')$ falls below that threshold).

\[ \mathrm{Dec}_j(d, P') = \left(1 - P_{P'}(d)\right) \cdot g_{Ij}(d) + \sum_{d' \in D,\, d' \neq d} \left( P_{P'}(d') \cdot g_{IIj}(d') \right) \tag{1} \]

We compare the costs for making a decision with a threshold rather than with the costs of making a question because questions and decisions depend on different criteria and thus they are not comparable. If a decision is correct enough for the current dataset (its cost is lower than the threshold) it can be placed in the DT with no need to calculate the cost of making a question.

This procedure considers both the information in the database (proportion of patients for each decision) and the medical background knowledge (type I and II error cost for each wrong decision).

5.3. Select the best decision: correctness

From a medical point of view, the most correct decision to be made over a certain set of patients must be determined considering type I and type II errors (see $g_{Ij}$ and $g_{IIj}$ in Fig. 2). Therefore the selection of the best decision is done using Eq. (1). The best decision to be selected is the one which minimizes $\mathrm{Dec}_1$. If several decisions minimize $\mathrm{Dec}_1$ then we select the one of them which minimizes $\mathrm{Dec}_2$. The procedure is repeated for each level of priority until there is only one optimal decision. If the lowest priority level is reached and there is not a single optimal decision selected, then the remaining decision $d_i$ with the lowest index $i$ is taken.

5.4. Select the best question: comprehensibility

A decision process is medically comprehensible if the questions are asked in an order consistent with the criteria of the health-care experts. Therefore, criteria on the questions are involved in the selection of the best question for a certain patient (see $g_{qj}$ and $\leq_{qj}$ in Fig. 2). Nevertheless, from a medical point of view the most comprehensible question is not necessarily the question that leads to the best situation to make a final decision. In order to select comprehensible questions which are also useful to make a final decision, we use the concept of expected cost (EC). For each question $q_i$, the EC represents the cost of making a decision in the next step of the decision process after asking the question $q_i$. This is the average of the costs of placing decision nodes for each of the subsets
obtained when a certain set of patients $P' \subseteq P$ is partitioned using $q_i$. EC is calculated with Eq. (2), where $P'_a = \{p \in P' : q_i(p) = a\}$ and $A_i = \{a : q_i(p) = a,\ p \in P'\}$.

\[ \mathrm{EC}(q_i, P') = \frac{1}{\#A_i} \sum_{a \in A_i} \min_{d \in D} \left( \mathrm{Dec}_1(d, P'_a) \right) \tag{2} \]

We compute EC for each question and we select those questions whose EC is lower than a threshold $\delta$. The best question is the one which minimizes the global cost function $g_{q1}$ (or which is in the lowest layer of the LPO $\leq_{q1}$) for criteria on the questions of level of priority 1. If several questions minimize $g_{q1}$ (or are in the lowest layer of $\leq_{q1}$) then we select the one of them which minimizes $g_{q2}$ (or which is in the lowest layer of $\leq_{q2}$). The procedure is repeated for each level of priority until there is only one optimal question. If none of the levels is useful to select one of these questions, then the remaining question $q_i$ with the lowest index $i$ is selected.

The use of the expected cost together with the criteria on the questions guarantees a trade-off between the information in the database and the medical background knowledge when selecting the best question.

6. Evaluation of medical decision trees

The accuracy of a DT is defined as the percentage of correct decisions over the total number of decisions made. Accuracy is a statistical measure like sensitivity, specificity and positive and negative predictive values (Lang & Secic, 2006), which numerically compares the decisions represented in the DT with the cases in the training dataset.

These measures are not based on any kind of medical background knowledge, so they are not a valid way to assess the medical comprehensibility and correctness of the DTs.

Let $\mathrm{path}(p, DT) = \{q^p_1, q^p_2, \ldots, q^p_k\}$ be the sequence of questions asked to patient $p$ if we follow the decision tree $DT$. Comprehensibility is calculated with Eq. (3) and evaluates the sequence of questions in $\mathrm{path}(p, DT)$ for all the patients $p \in P$ following the indications of the decision tree $DT$.
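Since the evaluation measures of this section reuse the cost $\mathrm{Dec}_1$ of Eq. (1), it may help to see Eqs. (1) and (2) written out as code. The following is a minimal sketch for a single priority level. It assumes that the global type I and type II costs are given as plain dictionaries over the decisions, that the patient set is non-empty, and that every answer is known. The names `dec` and `expected_cost` and the numeric values are illustrative and do not come from the paper.

```python
from collections import Counter
from typing import Dict, List, Mapping, Sequence, Tuple

Patient = Tuple[Dict[str, str], str]  # (answers by question, final decision)


def dec(
    d: str,
    patients: List[Patient],
    decisions: Sequence[str],
    g_I: Mapping[str, float],
    g_II: Mapping[str, float],
) -> float:
    """Cost of placing decision d over the given patients, following Eq. (1)."""
    n = len(patients)
    counts = Counter(dp for _, dp in patients)
    # Type I term: share of patients for whom d is wrong, weighted by g_I(d).
    type_I = (1 - counts[d] / n) * g_I[d]
    # Type II term: each other decision d' that is missed, weighted by g_II(d').
    type_II = sum(counts[o] / n * g_II[o] for o in decisions if o != d)
    return type_I + type_II


def expected_cost(
    q: str,
    patients: List[Patient],
    decisions: Sequence[str],
    g_I: Mapping[str, float],
    g_II: Mapping[str, float],
) -> float:
    """Average, over the answers to q, of the cheapest decision cost, following Eq. (2)."""
    answers = {a[q] for a, _ in patients}
    total = 0.0
    for ans in answers:
        subset = [(a, dp) for a, dp in patients if a[q] == ans]
        total += min(dec(d, subset, decisions, g_I, g_II) for d in decisions)
    return total / len(answers)


# Hypothetical post-operative example with two decisions.
decisions = ["icu", "home"]
g_I = {"icu": 0.2, "home": 0.8}    # cost of wrongly taking each decision
g_II = {"icu": 0.9, "home": 0.1}   # cost of wrongly not taking each decision
patients: List[Patient] = [
    ({"sat": "low"}, "icu"),
    ({"sat": "low"}, "icu"),
    ({"sat": "low"}, "icu"),
    ({"sat": "ok"}, "home"),
]
cost_icu = dec("icu", patients, decisions, g_I, g_II)
cost_home = dec("home", patients, decisions, g_I, g_II)
ec_sat = expected_cost("sat", patients, decisions, g_I, g_II)
```

In this toy dataset the question `sat` separates the two decisions perfectly, so every subset admits a decision of zero cost and the expected cost of asking it is zero.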
Comprehensibility takes into account the global cost function $g_{q1}$ of the criteria on questions with priority 1. If the medical background knowledge is represented with a global LPO $\leq_{q1}$, this has to be transformed into a cost function $g_{q1}$ (López-Vallverdú & Riaño, 2011a) before Eq. (3) is applied.

\[ \mathrm{comprehensibility}(P, DT) = \frac{1}{\#P} \cdot \left( \#P - \sum_{p \in P} \frac{\sum_{q \in \mathrm{path}(p, DT)} g_{q1}(q)}{\#\mathrm{path}(p, DT)} \right) \tag{3} \]

Let $DN$ be the set of decision nodes in a decision tree $DT$ (i.e., the terminal nodes of the DT), and let $d_n$ and $P_n$ be the decision made and the set of patients in a decision node $n \in DN$, respectively. Correctness is calculated with Eq. (4) and it evaluates all the final decisions made in a DT with the function $\mathrm{Dec}_1$, which returns the cost of placing a decision node considering criteria with priority 1.

\[ \mathrm{correctness}(P, DT) = \frac{1}{\#DN} \cdot \left( \#DN - \sum_{n \in DN} \mathrm{Dec}_1(d_n, P_n) \right) \tag{4} \]

7. Tests and results

In this section, we detail the tests carried out on the induction of medically comprehensible and correct DTs and the results obtained with our algorithm on four medical domains from the UCI Repository of Machine Learning (Frank & Asuncion, 2010). The domains are diabetes with 768 patients, 8 questions and 2 decisions; heart disease with 303 patients, 13 questions and 2 decisions; post-operative with 90 patients, 8 questions and 3 decisions; and thyroid with 3772 patients, 20 questions and 3 decisions.

The background knowledge about the different decision criteria in all four domains has been provided by physicians of the Clinical Hospital in Barcelona (CHB) (Spain) and the SAGESSA Health Care Group (Spain). For each domain, these professionals selected some medical criteria and provided the background knowledge according to Table 1 and for the purposes of patient screening and patient diagnosis.

7.1. The tests

With the aim of finding evidence that our approach (MEDBK) provides comprehensible and correct DTs which are useful to represent medical decision processes and, at the same time, showing the limitations of the information gain based algorithms (IG) such as ID3, C4.5 or C5.0 in the induction of medical DTs,² we have performed the following two types of test on the previous four medical domains.

Test type 1 to show evidence that MEDBK generates comprehensible and correct medical DTs, with no loss of accuracy with respect to IG.

Test type 2 to show evidence about the suitability of MEDBK to produce decision mechanisms for different purposes (screening and diagnosis) for the same datasets.

The first type of test has been performed by generating DTs to screen patients in the four domains. MEDBK required the professionals of the two health-care institutions to agree on the criteria to be used and also on the priorities and relevances of such criteria for a screening decision process. Table 2 summarizes the selected criteria extracted from the list in López-Vallverdú and Riaño (2012) (column 1), their respective priorities (columns p), relevances (columns α) and their formalization as cost functions or LPOs, for questions, and type I and type II errors on the decisions. The cost functions and LPOs are not provided here because each medical domain tested has its own. These are 25 cost functions and 15 LPOs in total, which are provided in López-Vallverdú and Riaño (2011b). According to physicians, some of the criteria in Table 2 are not appropriate for selecting questions or considering type I or type II errors. These appear as '–' in the table, meaning that they are not part of the background knowledge.

All these tests have been performed with and without cross-validation, and with and without pruning. Cross-validation is used to analyze the robustness of the DTs and, in our case, it consisted in repeating the following procedure 10 times.
We randomly separated 90% of the patients of the initial dataset and we used them to generate the DT, which was then tested using the remaining 10% of the patients. Pruning is used to reduce the overfitting of DTs and to remove sections of a DT that may be based on noisy or erroneous data. Pruning is based on a prefixed percentage of DT node representativity. So, during the induction process, if a node of the DT represents less than this percentage of patients, it becomes a decision node. For the representativity ratio we used 2%. We compared the results of these tests with the DTs obtained with IG.

The second type of test was centered on the thyroid domain and consisted in the generation of DTs with both the IG and the MEDBK algorithms for the decision processes of patient screening and patient diagnosis.

The results of the two types of test were analyzed by physicians of the two previously mentioned health-care institutions and their main conclusions are summarized in Section 7.2. We also compared the accuracy, comprehensibility and correctness of the DTs induced by MEDBK with those of the DTs generated with IG. This comparison is detailed in Section 7.3.

² In the following tests we used as IG the Weka J48 implementation of the C4.5 algorithm (Witten & Frank, 2005).

7.2. Decision trees obtained and medical analysis

With MEDBK, we have induced DTs to screen patients in the medical domains of diabetes, heart disease, post-operative, and thyroid. Several physicians proposed the criteria, priorities and relevances in order to avoid as much as possible the presence of questions based on risky, uncomfortable or expensive medical tests (see Table 2).

Table 2. Priorities, relevances and formalization of the medical criteria to perform screening decision processes.

| Criteria | Questions: p | Questions: α | Questions: formalization | Type I error: p | Type I error: α | Type I error: formalization | Type II error: p | Type II error: α | Type II error: formalization |
|---|---|---|---|---|---|---|---|---|---|
| Script | 1 | 1 | $\leq_{qs}$ | – | – | – | – | – | – |
| Health risk | 2 | 1 | $\leq_{qh}$ ^a | 1 | 0.9 | $f_{Ih}$ ^a | 1 | 1 | $f_{IIh}$ ^a |
| Physical comf. | 3 | 0.4 | $\leq_{qc}$ | 1 | 0.1 | $f_{Ic}$ | – | – | – |
| Economic cost | 3 | 0.4 | $f_{qe}$ | 2 | 0.5 | $f_{Ie}$ | – | – | – |
| Decision time | 3 | 0.2 | $f_{qt}$ | 2 | 0.5 | $f_{It}$ | – | – | – |

^a For the post-operative domain, it was formalized with a LPO.

Fig. 3. DT induced for the screening of heart disease using MEDBK.

In Fig. 3 we provide one of the DTs induced with MEDBK. Contrary to the DT obtained with IG (see Fig. 1), this one is based on low-invasive questions such as age, sex, chest pain type, resting blood pressure, resting electrocardiogram and maximum heart rate rather than on other questions based on invasive tests such as, for example, the number of major vessels. Observe that the DT induced with MEDBK uses the questions age and sex (highest priority according to the criterion script; López-Vallverdú & Riaño, 2011b) before asking other questions.
However, the trade-off that our method makes between the information in the database and the medical background knowledge means that the latter does not always determine the sequence of questions. For example, in one branch the question maximum heart rate is used to make a final decision without having asked other questions with a higher priority, such as resting blood pressure, fasting blood sugar and serum cholesterol. The physicians judged the behavior of this DT to be in accordance with normal practice, whereas the one depicted in Fig. 1 was rejected as inappropriate for decision making in the screening of patients with heart disease.

This interpretation is the same for all the DTs obtained in the four medical domains tested, and it is corroborated by the numerical results discussed in Section 7.3. All the DTs obtained with IG represent medical decision processes that are riskier, more uncomfortable or more expensive than the ones obtained with MEDBK.
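The trees discussed above were obtained under the protocol of Section 7.1: a random 90%/10% split of the patients and pruning by node representativity with a 2% ratio. The following Python sketch illustrates those two steps only. The function names, the fixed seed and the choice of measuring representativity against the training patients are assumptions made for this example; they are not taken from the MEDBK implementation.

```python
import random


def holdout_split(patients, train_fraction=0.9, seed=0):
    """Randomly separate patients into a training part and a test part."""
    shuffled = list(patients)
    random.Random(seed).shuffle(shuffled)  # seeded only to make the example repeatable
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def becomes_decision_node(node_patients, total_patients, min_ratio=0.02):
    """True when a node represents less than min_ratio of the patients.

    Assumption: the ratio is taken over the patients used for induction.
    """
    return node_patients < min_ratio * total_patients


# Hypothetical dataset of 300 patient identifiers.
train, test = holdout_split(range(300))
print(len(train), len(test))                 # sizes of the two parts
print(becomes_decision_node(5, len(train)))  # 5 patients out of 270 is below 2%
print(becomes_decision_node(6, len(train)))  # 6 patients out of 270 is not
```

With 270 training patients the 2% threshold corresponds to 5.4 patients, so a node covering five patients would stop growing while a node covering six could still be split.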
Fig. 4. LPO over the questions to diagnose thyroid malfunctioning.

MEDBK was also used to induce different DTs for the same input data. This was possible by adjusting the set of selected criteria, and their priorities and relevances, to the sort of medical decision desired (i.e., screening or diagnosis). Focusing on the thyroid problem, MEDBK was used to generate DTs to screen and to diagnose patients. The criteria were again the ones in Table 2 for the screening process, and script for the diagnosis process. The script criterion was represented with the LPO in Fig. 4 (see footnote 3). MEDBK proposed a DT to screen patients with thyroid problems, and another DT to diagnose thyroid malfunctioning (see Fig. 5). Both DTs were accepted as correct by the team of physicians supporting this work. The DT that was obtained with IG was not accepted for screening purposes, but was judged acceptable for diagnosis. However, even though the DT proposed by IG was quite similar to the one in Fig. 3 (and therefore appropriate for diagnosis; see footnote 4), the physicians concluded that even in a diagnosis there is always a set of medical criteria guiding the selection of questions. Since these criteria cannot be incorporated into IG, this algorithm is also unable to guarantee DTs representing good diagnosis processes. This fact has been observed in several of the domains studied, such as diabetes, whose DTs incorporated questions related to blood pressure or pregnancy, which are irrelevant for making final diagnostic decisions.

7.3. The quality of the results

The quality of medical DTs is measured in terms of their accuracy and their medical comprehensibility and correctness. Table 3 shows these values for the MEDBK DTs when they are used to screen patients in the domains of diabetes, heart disease, post-operative, and thyroid. The average of the IG DTs is also provided for the sake of comparison.
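The "Average" rows of Table 3 appear to be the unweighted means of the four domains. A minimal check in Python is sketched below, using the rows reported without cross-validation. The column order (accuracy, comprehensibility, correctness; first with pruning, then without) and the helper names are assumptions of this example.

```python
from decimal import Decimal

# Rows of Table 3 without cross-validation. Assumed column order:
# (Acc, Com, Cor) with pruning, then (Acc, Com, Cor) without pruning.
MEDBK_ROWS = {
    "Diabetes":       "78.5 81.8 84.0 83.1 81.0 86.9",
    "Heart disease":  "82.5 92.0 90.5 91.7 88.6 95.4",
    "Post-operative": "75.6 83.6 94.7 92.2 83.2 98.3",
    "Thyroid":        "95.5 83.9 95.5 97.5 81.0 97.5",
}
REPORTED_AVERAGE = "83.0 85.3 91.2 91.1 83.5 94.5"


def parse(row):
    """Turn a whitespace-separated row into exact decimal values."""
    return [Decimal(value) for value in row.split()]


def column_means(rows):
    """Unweighted mean of each column over all rows."""
    parsed = [parse(row) for row in rows]
    return [sum(column) / len(parsed) for column in zip(*parsed)]


means = column_means(MEDBK_ROWS.values())
reported = parse(REPORTED_AVERAGE)

# Each recomputed mean should agree with the reported one up to
# rounding to one decimal place.
matches = [abs(m - r) <= Decimal("0.05") for m, r in zip(means, reported)]
print(means)
print(all(matches))
```

Decimal arithmetic is used instead of floats so that the comparison against the one-decimal values printed in the table is exact.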
The quality of a medical DT is also related to the capability of this tree to remain unchanged and still represent good medical decisions (i.e., DT robustness) and to its ability not to represent chance decisions (i.e., DT overfitting). In Table 3 we provide the results before and after applying cross-validation in order to analyze the robustness of the DTs obtained, and also the results before and after applying pruning in order to analyze overfitting.

Footnote 3: The 16 other questions that do not appear in the LPO are in layer 4, but they were omitted for space reasons (see López-Vallverdú & Riaño, 2011b).

Footnote 4: The physicians argued that some cases of thyroid problems could not be diagnosed with the IG and MEDBK DTs because there were no instances of such cases in the input database.

Fig. 5. DT induced by MEDBK for the diagnosis of thyroid.

Table 3
Results obtained for DTs to screen patients in four medical domains with MEDBK.

                           With pruning                    Without pruning
                           Acc. (%)  Com. (%)  Cor. (%)    Acc. (%)  Com. (%)  Cor. (%)
With cross-validation
  Diabetes                 71.4      78.0      77.9        74.0      78.5      79.6
  Heart disease            77.7      92.7      87.7        74.2      90.8      85.5
  Post-operative           64.4      90.9      90.0        57.8      82.5      84.2
  Thyroid                  95.4      85.4      95.5        95.9      88.4      95.9
  Average                  77.2      86.8      87.8        75.5      85.1      86.3
  Average IG               75.5      76.2      85.4        75.3      44.9      85.4
Without cross-validation
  Diabetes                 78.5      81.8      84.0        83.1      81.0      86.9
  Heart disease            82.5      92.0      90.5        91.7      88.6      95.4
  Post-operative           75.6      83.6      94.7        92.2      83.2      98.3
  Thyroid                  95.5      83.9      95.5        97.5      81.0      97.5
  Average                  83.0      85.3      91.2        91.1      83.5      94.5
  Average IG               87.3      39.0      91.6        94.6      42.4      95.8

7.3.1. Accuracy of DTs

We observe that the mean difference between the average accuracies of the DTs without cross-validation obtained with IG and MEDBK is 3.9% (4.3% with pruning and 3.5% without pruning). This difference can be explained by the fact that MEDBK is not designed to maximize accuracy but to maximize comprehensibility and correctness.
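The accuracy figures quoted in this subsection can be reproduced from the "Average" and "Average IG" rows of Table 3. The sketch below does so in Python; the dictionary layout and function names are illustrative assumptions, and only the accuracy columns (with pruning, without pruning) are used.

```python
from decimal import Decimal as D

# Average accuracies (%) read from Table 3: (with pruning, without pruning).
ACC = {
    ("MEDBK", "no_cv"): (D("83.0"), D("91.1")),
    ("MEDBK", "cv"):    (D("77.2"), D("75.5")),
    ("IG", "no_cv"):    (D("87.3"), D("94.6")),
    ("IG", "cv"):       (D("75.5"), D("75.3")),
}


def mean(values):
    """Arithmetic mean of an iterable of decimals."""
    values = list(values)
    return sum(values) / len(values)


# Accuracy gap between IG and MEDBK without cross-validation,
# first with pruning and then without pruning.
gaps = [ig - medbk
        for ig, medbk in zip(ACC[("IG", "no_cv")], ACC[("MEDBK", "no_cv")])]
mean_gap = mean(gaps)


def mean_cv_drop(algorithm):
    """Average accuracy lost when cross-validation is applied."""
    before = ACC[(algorithm, "no_cv")]
    after = ACC[(algorithm, "cv")]
    return mean(b - a for b, a in zip(before, after))


print(gaps, mean_gap)         # gaps of 4.3 and 3.5 points, mean 3.9
print(mean_cv_drop("IG"))     # 15.55, quoted in the text as 15.5%
print(mean_cv_drop("MEDBK"))  # 10.7
```

The differences are absolute differences between percentages (percentage points), which is how the 3.9%, 15.5% and 10.7% values in the text should be read.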
On the contrary, IG is an algorithm oriented to accuracy maximization, but it obtains DTs whose accuracies are not significantly better than the ones obtained with MEDBK. At the same time, cross-validation shows that the accuracy of IG DTs diminishes more than that of MEDBK DTs (15.5% and 10.7%, respectively). Therefore IG obtains slightly more accurate but less robust DTs.

7.3.2. Comprehensibility of DTs

The results of comprehensibility are clearly favorable to MEDBK, whose average comprehensibility is 43.7% better. Thyroid is a clear example in which comprehensibility is more than 60% better with respect to IG trees, for all the tests performed. In all four domains, the results show that the order of the questions in the DTs produced with MEDBK is more coherent from a medical point of view.

7.3.3. Correctness of DTs

The strong relation between accuracy (i.e., percentage of correct decisions) and correctness (i.e., quality of the decisions) means that the results obtained by IG in terms of mean correctness are often good. Nevertheless, when comparing IG and MEDBK DTs we find cases where IG DTs are better in accuracy but worse in correctness. This means that MEDBK makes more mistakes than IG (1.4% on average) but these mistakes are less important. This happens in
several cases, for example in the DTs for screening of post-operative patients with pruning. According to accuracy, IG obtains a better DT than MEDBK (with respective accuracies of 82.2% and 75.6%), but medical correctness indicates that the errors of the DTs induced with MEDBK are less critical from a medical or clinical point of view (as reflected by the respective correctness values of 89.7% and 94.7%).

7.3.4. Robustness of DTs

The results in Table 3 suggest that MEDBK DTs are better at making decisions about new patients. With cross-validation, the average loss of accuracy with respect to the DTs generated without cross-validation is 4.9% lower with MEDBK than with IG. The differences in the loss of comprehensibility and correctness are less relevant but also favorable to MEDBK (1.6% and 2.5%, respectively). This means that the DTs generated with MEDBK are more robust than the trees generated with IG.

7.3.5. Overfitting of DTs

Pruning is a satisfactory procedure because it obtains smaller DTs, which reduces overfitting, without a significant loss of accuracy, correctness or comprehensibility. Both MEDBK and IG obtain DTs with a similar average loss of accuracy and correctness when applying pruning (always below 3.5%). As far as comprehensibility is concerned, the DTs of MEDBK are medically better after pruning (1.8% on average), while those of IG are significantly worse (6.1% on average).

8. Conclusions

Information-gain-based algorithms to induce decision trees in complex domains cannot always guarantee acceptable results from an expert point of view. Specifically, in the medical domain, these algorithms do not consider health-care criteria, and therefore important aspects such as the risks of the clinical procedures or patient discomfort can be left out of their decision processes. Moreover, medical errors in the final decisions can be critical, and therefore their recommendations cannot be taken as medically correct.
For the same dataset, these algorithms always produce the same DT regardless of its final medical purpose or intention. This is not correct because, for example, a good DT for diagnosing is not necessarily a good DT for other medical decision processes like screening or disease treatment.

Here, we have proposed an algorithm to induce medical DTs that uses a combination of some relevant health-care criteria. The chosen criteria and their respective priorities and relevances allow the algorithm to produce DTs oriented to different medical purposes.

The tests performed in the medical domains of diabetes, heart disease, post-operative and thyroid malfunctioning for the purposes of screening and diagnosing indicate that the medical DTs generated with the new algorithm are medically comprehensible and correct, while their accuracy is not significantly worse than that obtained with information-gain-based algorithms and is more robust to new data. The sequences of questions of the trees in these domains are medically comprehensible and do not imply unnecessarily risky, uncomfortable or expensive medical tests. With respect to correctness, the presence of critically wrong decisions is avoided. Cross-validation and pruning tests indicate that the DTs obtained by our algorithm are robust and resistant to overfitting.

In the future, this work will be continued along three lines. The first line is the exploitation of health-care databases about different medical decision processes like prevention, screening, diagnosing and patient treatment, in order to automatically adjust the relevances that produce the most accurate, comprehensible and correct DTs with respect to the medical decisions contained in the data.
Our aim is to consider all the criteria and let the optimization algorithm determine the relevances, which will approach zero for those criteria that are not used in each concrete decision process. In the end, we expect to have a family of criterion-relevance pairs describing each medical process, and we will use them to compare the ways of working of different medical centres.

The second line will adapt the current induction of DTs to the induction of clinical algorithms (Bohada, Riaño, & López-Vallverdú, 2012; Riaño, López-Vallverdú, & Tu, 2008). A clinical algorithm (CA) is a flow diagram consisting of branching-logic pathways which represent sequences of clinical decisions, for teaching clinical decision making, and for guiding patient care. These branching-logic pathways can be represented with DTs, therefore they can be induced with the algorithm in Section 5. Considering this, we will aim to induce medically comprehensible and correct CAs from hospital databases by including medical background knowledge.

The third line will face the induction of medical DTs following a different approach. We can accept that medical criteria are found implicit in the data available about medical decisions. Starting with databases containing decision processes as $(q_{i_1}(p), q_{i_2}(p), \ldots, q_{i_k}(p), d_p)$, we will study the possibilities of generating accurate, comprehensible and correct DTs without considering an explicit representation of medical criteria (Torres, López-Vallverdú, & Riaño, 2011a).

Acknowledgements

We would like to thank Dr. Collado and Dr. Alonso for their continuous support leading the groups of health-care professionals from the SAGESSA Health Care Group (Spain) and the Clinical Hospital in Barcelona (Spain), respectively.

References

Arsene, O., Dumitrache, I., & Mihu, I. (2011). Medicine expert system dynamic Bayesian network and ontology based. Expert Systems with Applications, 38, 15253–15261.
Bohada, J. A., Riaño, D., & López-Vallverdú, J. A. (2012). Automatic generation of clinical algorithms within the state-decision-action model. Expert Systems with Applications. <http://dx.doi.org/10.1016/j.eswa.2012.02.196>.
Candell Riera, J. (2003). Estratificación pronóstica tras infarto agudo de miocardio. Revista Española de Cardiología, 56(3), 303–313.
Chai, X., Deng, L., Yang, Q., & Ling, C. X. (2004). Test-cost sensitive Naïve Bayesian classification. In Proceedings 4th IEEE international conference on data mining.
Chapman, G. B., & Sonnenberg, F. A. (Eds.). (2003). Decision making in health care: Theory, psychology and applications. Cambridge series on judgement and decision making. Cambridge University Press.
Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3(4), 261–283.
Fauci, A. S., Braunwald, E., Kasper, D. L., Hauser, S. L., Longo, D. L., & Jameson, J. L., et al. (Eds.). (2009). Featuring the complete contents of Harrison's principles of internal medicine (17th ed.). McGraw Hill. Harrison's Online.
Horning, K. K., Hoehns, J. D., & Doucette, W. R. (2007). Adherence to clinical practice guidelines for 7 chronic conditions in long-term-care patients who received pharmacist disease management services versus traditional drug regimen review. Journal of Managed Care Pharmacy, 13(1), 28–36.
Husmeier, D., Dybowski, R., & Roberts, S. (Eds.). (2004). Probabilistic modelling in bioinformatics and medical informatics. Springer.
Lang, T. A., & Secic, M. (2006). How to report statistics in medicine (2nd ed.). American College of Physicians.
Ling, C. X., Yang, Q., Wang, J., & Zhang, S. (2004). Decision trees with minimal costs. In Proceedings 21st international conference on machine learning.
Ling, C. X., Sheng, V. S., & Yang, Q. (2006). Test strategies for cost-sensitive decision trees. IEEE Transactions on Knowledge and Data Engineering, 18(8), 1055–1067.
López-Vallverdú, J. A., & Riaño, D. (2011a). Cost functions and partial orders as medical background knowledge: formalization and operations. Research report DEIM-RR-11-003. Spain: Universitat Rovira i Virgili. <http://deim.urv.cat/recerca/reports/DEIM-RR-11-003.pdf> Accessed March 2012.
López-Vallverdú, J. A., & Riaño, D. (2011b). Repository of background knowledge. <http://banzai-deim.urv.net/repositories/repository.pdf> Accessed March 2012.
López-Vallverdú, J. A., & Riaño, D. (2012a). Decision criteria in health-care and their representation. Research report DEIM-RR-12-001. Spain: Universitat Rovira i Virgili. <http://deim.urv.cat/recerca/reports/DEIM-RR-12-001.pdf> Accessed March 2012.
López-Vallverdú, J. A., Riaño, D., & Collado, A. (2007). Increasing acceptability of decision trees with domain attributes partial orders. In Proceedings of the 20th IEEE international symposium on computer-based medical systems, Maribor, Slovenia.
Lucas, P., van der Gaag, L., & Abu-Hanna, A. (2004). Bayesian networks in biomedicine and health-care. Artificial Intelligence in Medicine, 30(3), 201–214.
Frank, A., & Asuncion, A. (2010). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. <http://archive.ics.uci.edu/ml>.
Podgorelec, V., Kokol, P., Stiglic, B., & Rozman, I. (2002). Decision trees: An overview and their use in medicine. Journal of Medical Systems, 26(5), 445–463.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA, USA: Morgan Kaufmann.
Quinlan, J. R. (2003). C5.0 Online tutorial. <http://www.rulequest.com> Accessed March 2012.
Riaño, D., López-Vallverdú, J. A., & Tu, S. (2008). Mining hospital data to learn SDA* clinical algorithms. LNAI (Vol. 4924, pp. 46–61).
Shiffman, R. N. (1997). Representation of clinical practice guidelines in conventional and augmented decision tables. Journal of the American Medical Informatics Association, 4, 382–393.
Torres, P., López-Vallverdú, J. A., & Riaño, D. (2011). Inducing decision trees from medical decision processes. LNAI (Vol. 6512, pp. 40–55).
Turney, P. D. (2000). Types of cost in inductive concept learning. In Workshop on cost-sensitive learning at the 7th international conference on machine learning. California: Stanford University.
Velikova, M., de Carvalho Ferreira, N., & Lucas, P. (2007). Bayesian network decomposition for modeling breast cancer detection. In Artificial intelligence in medicine, AIME 2007, Amsterdam, The Netherlands. LNAI (Vol. 4594, pp. 346–350). Springer.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). Morgan Kaufmann.
Yeh, D., Cheng, C., & Chen, Y. (2011). A predictive model for cerebrovascular disease using data mining. Expert Systems with Applications, 38(7), 8970–8977.