
Workshop on Data Mining for Business Applications

Held in conjunction with the KDD conference, August 20, 2006

Workshop Chairs: Rayid Ghani, Carlos Soares

Preface
Data mining in its various forms is becoming a major component of business operations. Almost every business process today involves some form of data mining. Customer Relationship Management, Supply Chain Optimization, Demand Forecasting, Assortment Optimization, Business Intelligence, and Knowledge Management are just some examples of business functions that have been impacted by data mining techniques. Even though data mining has become critical to businesses, most academic research in data mining is conducted on publicly available data sources. This is mainly due to two reasons: 1) the difficulty academic researchers face in getting access to large, new, and interesting sources of data, and 2) limited access to domain experts who can provide a practical perspective on existing problems and supply a new set of research problems. Corporations are typically wary of releasing their internal data to academics, and in most cases there is limited interaction between industry practitioners and academic researchers working on related problems in similar domains. The goals of this workshop are: 1. to bring together researchers (from both academia and industry) as well as practitioners from different fields to talk about their different perspectives and to share their latest problems and ideas; 2. to attract business professionals who have access to interesting sources of data and business problems but not the expertise in data mining to solve them effectively. We would like to focus on the following topics in the workshop:

- Novel business applications of data mining
- New classes of research problems motivated by real-world business problems
- Data mining as a component of existing business processes
- Selling data mining technology/projects inside an organization or to customers
- Integration of data mining technologies with other kinds of technologies that already exist inside corporations
- Lessons learned from practical experiences with applying data mining to business applications

Rayid Ghani, Accenture Technology Labs
Carlos Soares, University of Porto
http://labs.accenture.com/kdd2006_workshop/

Acknowledgements
We would like to acknowledge the support of Accenture Technology Labs and the University of Porto in organizing this workshop. We would also like to thank all the people who submitted papers to this workshop, the Program Committee members who helped us in reviewing the submissions, the members of the panels, and the attendees of the workshop for making this a successful event.

Program Committee:
Chid Apte, IBM Research
Paul Bradley, Apollo Data Technologies
Pavel Brazdil, University of Porto
Doug Bryan, KXEN
Raul Domingos, SPSS
Robert Engels, CognIT
Andrew Fano, Accenture Technology Labs
Usama Fayyad, Yahoo
Ronen Feldman, Clearforest
Marko Grobelnik, Jozef Stefan Institute
Robert Grossman, Open Data Partners and University of Illinois at Chicago
Alípio Jorge, University of Porto
Tom Khabaza, SPSS
Jörg-Uwe Kietz, Kdlabs AG
Arno Knobbe, Kiminkii/University of Utrecht
Dragos Margineantu, Boeing Company
Gabor Melli, PredictionWorks
Natasa Milic-Frayling, Microsoft Research
Dunja Mladenic, Jozef Stefan Institute
Gregory Piatetsky-Shapiro, KDNuggets
Katharina Probst, Accenture Technology Labs
Foster Provost, New York University
Peter van der Putten, Chordiant Software
Galit Shmueli, University of Maryland
Gary Weiss, Fordham University
Luís Torgo, University of Porto
Alexander Tuzhilin, New York University

Table of Contents
Preface
Acknowledgements
Workshop Committee

Discovering Telecom Fraud Situations through Mining Anomalous Behavior Patterns
Ronnie Alves, Pedro Ferreira, Orlando Belo, João Lopes, Joel Ribeiro, Luís Cortesão

Interactivity Closes the Gap: Lessons Learned in an Automotive Industry Application
Axel Blumenstock, Jochen Hipp, Carsten Lanquillon, Rüdiger Wirth

The Business Practitioner's Viewpoint: Discovering and Resolving Real-Life Business Concerns through the Data Mining Exercise
Richard Boire

Customer Validation of Commercial Predictive Models
Tilmann Bruckhaus, William Guthrie

A Boosting Approach for Automated Trading
Germán Creamer, Yoav Freund

Zen and the Art of Data Mining
T. Dasu, E. Koutsofios, J. Wright

Data mining in the real world: What do we need and what do we have?
Françoise Soulié Fogelman

Forecasting Online Auctions using Dynamic Models
Wolfgang Jank, Galit Shmueli, Shanshan Wang

Business Event Advisor: Mining the Net for Business Insight with Semantic Models, Lightweight NLP, and Conceptual Inference
Alex Kass, Christopher Cowell-Shah

Mining and Querying Business Process Logs
Akhil Kumar

Driving High Performance for a Large Wireless Communications Company through Advanced Customer Insight
Ramin Mikaili, Lynette Lilly

Quantile Trees for Marketing
Claudia Perlich, Saharon Rosset

A Decision Management Approach to Basel II Compliant Credit Risk Management
Peter van der Putten, Arnold Koudijs, Rob Walker

Resolving the Inherent Conflicts of Value Definition in Academic-Industrial Collaboration
David Selinger, Tyler Kohn

Using Data Mining in Procurement Business Transformation Outsourcing
Moninder Singh, Jayant R. Kalagnanam

Discovering Telecom Fraud Situations through Mining Anomalous Behavior Patterns


Ronnie Alves, Pedro Ferreira, Orlando Belo, João Lopes, Joel Ribeiro
University of Minho, Campus de Gualtar, 4710-057 Braga, PORTUGAL
{ronnie, pedrogabriel, obelo}@di.uminho.pt

Luís Cortesão
Portugal Telecom Inovação, SA, Rua Eng. José Ferreira Pinto Basto, 3810-106 Aveiro, PORTUGAL
lcorte@ptinovacao.pt

Filipe Martins
Telbit, Lda, Rua Banda da Amizade, 38, 3810-059 Aveiro, PORTUGAL
fmartins@telbit.pt

ABSTRACT


In this paper we tackle the problem of superimposed fraud detection in telecommunication systems. We propose two anomaly detection methods based on the concept of signatures. The first method relies on a signature deviation-based approach, while the second relies on dynamic clustering analysis. Experiments carried out with real data (voice call records from an entire week, corresponding to approximately 2.5 million CDRs and 700 thousand signatures processed per day) allowed us to detect several anomalous situations. The fraud analysts provided us with a small list of 12 customers for whom fraudulent behavior was detected during this week; 9 and 11 of these fraud situations were discovered by the two methods, respectively. Preliminary results and discussion with fraud analysts have already shown that our methods are a valuable tool to assist them in fraud detection.

1. INTRODUCTION
In superimposed fraud situations, the fraudsters make an illegitimate use of a legitimate account by different means. In this case, some abnormal usage is blurred into the characteristic usage of the account. This type of fraud is usually more difficult to detect and poses a bigger challenge to the telecommunications companies. Since the 1990s, telecommunications companies have used several kinds of approaches based on statistical analysis and heuristic methods to assist them in the detection and categorization of fraud situations. Recently, they have been adopting data mining and knowledge discovery techniques for this task. In this paper we tackle the problem of superimposed fraud detection in telecommunication systems. Two methods for discovering fraud situations through mining anomalous customer behavior patterns are presented. These methods are based on the concept of signature [3], which has already been used successfully for anomaly detection in many areas like credit card usage [1], network intrusion [2] and, in particular, telecommunications fraud [3]. Our goal was to detect deviant behaviors in useful time, giving analysts a better basis to be more accurate in their decisions when establishing potential fraud situations.

2. THE ROLE OF SIGNATURES ON DETECTING FRAUD
Our technique has as its core concept the notion of signature. We emphasize the work of Cortes and Pregibon [3], since it was the main inspiration for the use of signatures, although we have redefined their notion of signature. A signature of a user corresponds to a vector of feature variables whose values are determined during a certain period of time. The variables can be simple, if they consist of a single atomic value (e.g., an integer or a real), or complex, if they consist of two co-dependent statistical values, typically the average and the standard deviation of a given feature.


Table 1. Description of the feature variables (fv) used in signatures and summaries.
Description                                Type
Duration of Calls                          Complex
N. of Calls Working Days                   Complex
N. of Calls Weekends and Holidays          Complex
N. of Calls Working Time (8h-20h)          Complex
N. of Calls Night Time (20h-8h)            Complex
N. of Calls to Diff. National Networks     Simple
N. of Calls as Caller (Origin)             Simple
N. of Calls as Called (Destination)        Simple
N. of International Calls                  Simple
N. of Calls as Caller in Roaming           Simple
N. of Calls as Called in Roaming           Simple

The choice of the type of the variables depends on several factors, like the complexity of the feature described or the data available to perform such a calculation. A feature like the duration of calls shows a significant variability, which is much better expressed through an average/standard-deviation pair. A feature like the number of international calls is typically much less frequent, and thus an average value is sufficient to describe it. Table 1 lists the complete set of feature variables (fv) used in the context of this work. A signature S is then obtained by applying a function to the CDRs accumulated over a given temporal window. We consider a time unit to be the amount of time during which the CDRs are accumulated and at the end of which they are processed. A summary C has the same information structure as a signature, but it is used to resume the user behavior over a smaller time period. Typically, a signature reflects the usage patterns of a week, a month or even half a year, whereas a summary reflects periods of an hour, a half day or a complete day. In this work, we considered a period of one day for the summary and a week for the signature.
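As a concrete illustration, the sketch below shows one possible in-memory representation of these feature variables and how a one-day summary could be accumulated from CDRs. The field names and the CDR layout are illustrative assumptions, not the authors' schema.

```python
from dataclasses import dataclass

@dataclass
class ComplexFeature:
    mean: float   # average of the feature over the time window
    std: float    # standard deviation over the time window

def build_daily_summary(cdrs):
    # Accumulate a one-day summary from a list of CDR dicts; only two
    # illustrative feature variables are shown (one complex, one simple).
    durations = [c["duration"] for c in cdrs] or [0.0]
    mean = sum(durations) / len(durations)
    var = sum((d - mean) ** 2 for d in durations) / len(durations)
    return {
        "call_duration": ComplexFeature(mean, var ** 0.5),                    # complex
        "intl_calls": float(sum(1 for c in cdrs if c.get("international"))),  # simple
    }

print(build_daily_summary([
    {"duration": 120, "international": False},
    {"duration": 640, "international": True},
]))
```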

3. DEVIATING PATTERNS
3.1 Evaluating Similarities among Signatures
3.1.1 Similarity of Simple Feature Variables
A simple feature is defined by a unique variable, which corresponds to the average value of the considered feature. For simple feature variable comparison we make use of a ratio-scaled function. This type of function makes a positive measurement on a non-linear scale, in this case the exponential scale. The function is defined in the range [0, 1] according to equation (1):

d(S_x, S_y) = e^{-|S_x - S_y| / (B \cdot Amp)}    (1)

In equation (1), S_x and S_y are the two values under comparison, B is a constant, and Amp is the amplitude (difference between the maximum and the minimum value) of the respective feature variable over the whole signature space.

3.1.2 Similarity of Complex Feature Variables
Complex feature variables are defined by two co-dependent variables, which correspond respectively to the average and the standard deviation of the considered feature. For two complex variables, C_x = (M_x, \sigma_x) and C_y = (M_y, \sigma_y), the similarity function is defined in equation (2) and takes values within the range [0, 1]:

d(C_x, C_y) = d(M_x, M_y) \cdot \frac{|C_x \cap C_y|}{|C_x \cup C_y|}    (2)

Equation (2) is the result of the combination of two formulas: the similarity function for simple variables (Eq. 1), applied to the averages, and the ratio |C_x \cap C_y| / |C_x \cup C_y|. This ratio is also within the range [0, 1] and provides the overlap degree of the two complex feature variables by measuring the intersection of the intervals [M_x - \sigma_x, M_x + \sigma_x] and [M_y - \sigma_y, M_y + \sigma_y].
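As an illustration of equations (1) and (2), the following sketch implements both similarity measures in Python. The handling of the constant B, the amplitude Amp and the degenerate cases is an assumption, since the paper does not fix these details.

```python
import math

def simple_similarity(sx, sy, amp, b=1.0):
    # Eq. (1): exponential, ratio-scaled similarity of two simple feature
    # values; returns a value in [0, 1].
    if amp == 0:               # degenerate feature: identical everywhere (assumption)
        return 1.0
    return math.exp(-abs(sx - sy) / (b * amp))

def interval_overlap_ratio(m1, s1, m2, s2):
    # |Cx ∩ Cy| / |Cx ∪ Cy| for the intervals [m - σ, m + σ].
    lo1, hi1, lo2, hi2 = m1 - s1, m1 + s1, m2 - s2, m2 + s2
    inter = max(0.0, min(hi1, hi2) - max(lo1, lo2))
    union = max(hi1, hi2) - min(lo1, lo2)
    return inter / union if union > 0 else 1.0

def complex_similarity(cx, cy, amp, b=1.0):
    # Eq. (2): similarity of two complex feature values, each a (mean, std) pair.
    (mx, sx), (my, sy) = cx, cy
    return simple_similarity(mx, my, amp, b) * interval_overlap_ratio(mx, sx, my, sy)

# Example: comparing the call-duration feature of a signature and a summary
print(complex_similarity((120.0, 40.0), (300.0, 60.0), amp=600.0))
```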

3.2 Calculating the Distance among Signatures
Since the feature variables in the signature have different types, each variable has to be evaluated according to a distinct sub-function. Thus, the dist function is composed of several sub-functions: dist = (f_1, f_2, ..., f_n). Consider as an example a simplified signature S = {(M_a, \sigma_a); b; c; (M_d, \sigma_d)}, where the first and the last feature variables are complex (compared by Eq. 2) and the second and the third are simple (compared by Eq. 1), and let C be a summary with the same structure. Since we are interested in deviation detection from a probabilistic point of view, the distance measure between a signature S and a summary C should correspond to the probability of C being different from S. The proposed distance function is:

D(S, C) = \sqrt{\omega_1 f_1(S_1, C_1)^2 + \cdots + \omega_n f_n(S_n, C_n)^2}    (3)

Different distance functions can be provided by the fraud analyst by setting the weighting factors \omega_i to different values. The use of different distance functions allows detecting deviations in different scenarios. The overall distance function is then defined as the maximum over the m distance functions configured by the analyst:

Dist(S, C) = MAX{dist_1(S, C), dist_2(S, C), ..., dist_m(S, C)}    (4)

If, according to the distance function, a threshold value defined by the analyst is exceeded, an alarm is raised for future examination of the respective user. Otherwise, the user is considered to be within his normal behavior.

3.3 Anomaly Detection Procedure
The anomaly detection procedure based on signature deviation consists of several steps. It starts with a loading step, which imports the information into the local database of the system. This information refers to the signature and summary information of each user. The signatures are imported only once, when the system is started. All the signatures of a user are kept through time; such information is also useful for posterior analysis. A signature may have one of two statuses, "Active" or "Expired". For each user only one signature can be in the Active state, and it is the most up-to-date one. The processing step is described by the algorithm presented in [5], and follows the previous equations for calculating the distances and similarities among signatures. According to equation (4), if an alarm is raised, the user is put on a blacklist. This is performed in the alarm-triggering step, which is based on the calculation of the whole set of distance functions over the signatures. At the end, all raised alarms have to pass through the analyst's verification in order to determine whether each alarm corresponds to a fraud situation. The evaluation of the alarms is supported by the interface of the system, which employs features of dashboard systems, providing a complete set of valuable information [5].

3.4 Signature Updating
The updating process of the signatures follows the ideas presented in [3]. The update of a signature S_t at instant t+1, S_{t+1}, through a set of processed CDRs (a summary) C, is given by the formula:

S_{t+1} = \beta \cdot S_t + (1 - \beta) \cdot C    (5)

The constant \beta determines the relative weight of the accumulated signature and of the new actions C in the values of the new signature; depending on the size of the time window, this constant can be adjusted [3]. In contrast to the system in [3], the value of the signature is always updated. If Dist(S_t, C) does not exceed the threshold, the user is considered to have a normal behavior; if it does, an alarm is triggered, but the signature nevertheless continues to be updated. The reason for this is that the alarm still needs to pass through the analysis of the company's fraud analyst, who may consider it a false alarm. The continuous update of that user's signature avoids the loss of the information gathered between the moment when the alarm was triggered and the moment the analyst gives his verdict.
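The sketch below ties equations (3)-(5) together: per-feature deviation scores are combined into analyst-weighted distances, the maximum is compared against a threshold, and the signature is refreshed by exponential smoothing. The feature names, weights, threshold and the choice beta = 0.85 are illustrative assumptions, not values from the paper.

```python
import math

def weighted_distance(deviations, weights):
    # Eq. (3): D(S, C) = sqrt(sum_i w_i * f_i(S_i, C_i)^2); one simple choice
    # for the sub-functions (an assumption) is f_i = 1 - similarity_i.
    return math.sqrt(sum(w * f ** 2 for w, f in zip(weights, deviations)))

def overall_distance(per_feature_deviation, configs):
    # Eq. (4): Dist(S, C) = MAX over several analyst-defined weightings.
    # per_feature_deviation maps feature name -> f_i; each config maps
    # feature name -> weight w_i.
    return max(
        weighted_distance([per_feature_deviation[k] for k in cfg],
                          [cfg[k] for k in cfg])
        for cfg in configs
    )

def update_signature(signature, summary, beta=0.85):
    # Eq. (5): S_{t+1} = beta * S_t + (1 - beta) * C, applied field by field.
    # Features are scalars here for brevity; complex (mean, std) features
    # would be updated component-wise.
    return {k: beta * signature[k] + (1 - beta) * summary[k] for k in signature}

# Usage with deviations assumed to be computed as 1 - similarity:
devs = {"duration": 0.8, "intl_calls": 0.95, "night_calls": 0.1}
configs = [{"duration": 1.0, "intl_calls": 2.0},   # international-usage scenario
           {"duration": 1.0, "night_calls": 1.5}]  # night-usage scenario
threshold = 1.2                                    # illustrative
dist = overall_distance(devs, configs)
print(dist, "ALARM" if dist > threshold else "normal")
```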

4. CHANGING PATTERNS
4.1 Clustering Signatures
The analysis of changes in the clusters' topology over a period of time provides valuable information for a better understanding of the usage patterns of the telecommunications services. In particular, the detection of abrupt changes in cluster membership may provide strong evidence of a fraud situation. We propose the application of dynamic clustering analysis techniques over signature data. Our aim is that these changes will also provide evidence to fraud analysts for establishing potential fraud situations.

4.1.1 Similarity of Signatures
Signatures are composed of simple and complex variables. Traditional similarity measures, like the Euclidean distance, the Pearson correlation or the Jaccard measure, are not applicable for signature comparison. Therefore, we need to devise a new similarity measure which allows us to determine similarities among signatures. We define the similarity between two signatures as the combination of the variable similarity measures defined in Section 3.1. For two signatures X and Y, where X_i and Y_i are respectively the feature variable i of X and Y, and for n possible variables, the similarity measure can be defined as in equation (6).

D(X, Y) = \sqrt{W_1 d_1(X_1, Y_1)^2 + \cdots + W_n d_n(X_n, Y_n)^2}    (6)

where D(X, Y) \in [0, 1] and W_i defines the weight of feature i, with \sum_{i=1}^{n} W_i = 1. With this signature similarity measure we can compare all signatures. This provides an N x N matrix that summarizes the similarities among the N signatures. The clustering solution can then be obtained by taking the previously calculated matrix as input.

4.2 Clustering Migration Analysis
According to the moment of the week, different usage patterns can be found [6]. These usage profiles are provided by means of signature clustering analysis, according to the method described previously in Section 4.1. Therefore, for each day of the week a cluster topology is provided. This topology describes the customers' usage patterns during that period. Each cluster is described by the characteristics of its centroid, and the centroid is itself defined as a signature. This allows making direct comparisons between signatures and cluster centroids; the comparison is made according to the similarity formula (6). The assignment of a signature to a cluster is done by comparing the signature against each cluster centroid and assigning it to the cluster for which the distance is smallest.

4.2.1 Absolute and Relative Similarity
In order to compare signatures against cluster centroids, two types of similarity measures can be defined: absolute and relative similarity. Absolute similarity is the similarity value between the signature and the centroid at a given time moment t; this value is calculated according to formula (6). Relative similarity relates the absolute similarities at instants t and t+1, providing the percentage of variation of the signature between two consecutive time instants. This value is obtained through the formula:

relative similarity = \left(1 - \frac{D(S_i, SignCl[S_i]_{t+1})}{D(S_i, SignCl[S_i]_t)}\right) \times 100\%    (7)

In formula (7), S_i corresponds to a signature, and SignCl[S_i]_t to the centroid of the cluster that S_i belongs to at moment t. Figure 1 shows a positive variation, where the signature S_i is closer to the centroid of cluster 0 at instant t than at instant t+1.

Figure 1. Positive variation of the relative similarity of the signature.

Figure 2. Negative variation of the relative similarity of the signature and change of cluster membership.

A negative value of the relative similarity at instant t+1 indicates that the signature S_i is now closer to the centroid of the cluster it fitted at instant t. Nevertheless, we can detect a cluster membership change, since S_i is now closer to another cluster (cluster 1 in Figure 2). We define a cluster membership change as follows: a signature S changes its cluster membership to cluster C_j at instant t+1 if it belongs to cluster C_i at instant t, at instant t+1 the distance D(S, C_j) is minimal over all clusters, and D(S, C_j)_{t+1} < D(S, C_i)_t. All the data relative to the cluster membership of the signatures is kept for posterior analysis. These data, which we call historical data, make it possible to assess the evolution of the customer behavior through time. In order to offer the analyst a tool for a better examination of the changing behavior of the customers during a defined interval, analysis reports can be generated [6]. These reports provide the identification of all the conditions used, as well as the average and standard deviation of the signature variations and the maximum, minimum and average values for all the signature feature variables. The deviating signatures detected are included in a blacklist for further analysis.
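A minimal sketch of the migration analysis described above, assuming the absolute similarities between each signature and every cluster centroid (Eq. 6) have already been computed for two consecutive days; the variation limit is illustrative, as in the paper it is configured per analysis report.

```python
def assign_cluster(similarities):
    # Assign a signature to the centroid with the highest absolute similarity (Eq. 6).
    return max(similarities, key=similarities.get)

def relative_similarity(sim_prev, sim_curr):
    # Eq. (7): positive when the signature drifted away from its centroid,
    # negative when it moved closer to it.
    return (1.0 - sim_curr / sim_prev) * 100.0

def daily_alarms(day_t, day_t1, variation_limit=200.0):
    # day_t, day_t1: {signature_id: {cluster_id: absolute similarity}}.
    # Flag a signature if its cluster assignment changed or its relative
    # variation (Eq. 7) falls outside the allowed band.
    alarms = []
    for sid, sims_t in day_t.items():
        sims_t1 = day_t1[sid]
        c_old, c_new = assign_cluster(sims_t), assign_cluster(sims_t1)
        variation = relative_similarity(sims_t[c_old], sims_t1[c_old])
        if c_new != c_old or abs(variation) > variation_limit:
            alarms.append({"signature": sid, "from": c_old, "to": c_new,
                           "variation_pct": round(variation, 1)})
    return alarms
```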

5. EVALUATING REAL FRAUD SITUATIONS
In order to assess the quality of our methodology in detecting anomalous behaviors, we have examined data corresponding to a week of voice calls from a Portuguese mobile telecommunications network. The complete set of CDRs corresponds to approximately 2.5 million records and 700 thousand signatures processed per day. Up to now, there is no accurate database with previous cases of fraud. Thus, the settings of our methods were guided by a small list of 12 customers (fraudsters in the referenced week) provided by the fraud analysts, in order to detect other similar behaviors. In this first stage of detecting anomalous situations, we are interested in the effectiveness of our methods. Therefore, we worked on a subset of the previous data: a sample of approximately 5 thousand summaries per day and the respective signatures for the whole week. The detection process was carried out by applying the method described in Section 3. Several thresholds were used, and basically four main distance functions were designed, combining different feature variables and weights. An illustration of the alarms generated by the deviation-based approach is given in Table 2. Pay attention to the rightmost column: further investigation of those alarms showed that some of them were real fraud situations.

Table 2. Different thresholds and the alarms generated for three particular days of the week.
Day    0.8     1.0     1.2     1.6     2.0
Tue    2141    649     139     50      25
Wed    3029    1145    251     103     56
Sat    1006    560     150     39      23

Figure 3. Example of fraud situations in the blacklist.

Figure 3 shows an example of a real fraud situation detected by our methods in the evaluation study. The first line contains a header with the temporal reference, the analysis report description, and the limits of the range [-2, +2], which indicates that any variation outside this limit is considered an abnormal situation. The following lines list the moment when the anomaly was detected, the signature identification (phone number), the cluster to which the signature belongs, a flag indicating a cluster membership change (1 in the positive case), the absolute similarity between the signature and the cluster, the relative similarity (variation) and, in the last column, the description of the respective analysis report. More detailed information about the methods, as well as scalability issues regarding their application, can be obtained in [5, 6].

To gain more understanding of the circumstances in which those alarms were generated, one must investigate the impact of each variable on the MAX distance function (Eq. 4). In Figure 4 we exemplify such an evaluation by allowing top-k queries over the complete set of alarms. We also verified that the 10 most imperative anomalous situations ranged from 2.76 up to 3.33 in their distance function values. The feature variable with the most impact on the distance calculation is the number of international calls (originated ones). On the other hand, in Figure 5 we can see that the working-hours variable has great importance in the distance calculation over the whole period.

Figure 4. The impact of each feature variable on the top-10 highest alarms.

Figure 5. An overall picture of the feature variable distribution over the max distance (>= 2).

It is important to mention that both methods provide just insights into what could be recognized as anomalous situations. In fact, the characteristics of the data provided by the analysts do not allow us to apply any classification technique. Therefore, it is quite hard to evaluate precisely the rates of false positives and false negatives. Nevertheless, given the small list provided by the analysts, we can report a recall of 75%. In order to complement the previous results, we further make use of a dynamic clustering approach to detect suspect changes in cluster membership over the whole week. The identification of those changes triggers alarms for future inspection. After several executions, the quality of the clusters was maximized with 8 clusters. The distribution of the alarms raised by this method is given in Table 3.

Table 3. Alarms raised per cluster for three particular days of the week.
Cluster    Tue    Wed    Sat
1          3      9      1
2          9      7      123
3          3      12     71
4          5      17     16
5          23     21     22
6          20     31     40
7          8      11     26
8          52     72     0

By using dynamic clustering we can now report a recall of 91%. As one can see, this method is somewhat more sensitive in detecting anomalous situations than the previous one. This is explained by the relative similarity measure (Eq. 7), which provides a fine-tuning of the clustering migration method by exploring the signatures' relative variation over time (the whole week). Finally, the overlap rate of both methods corresponds to approximately 62% for the whole sample used, and 66% for the blacklist provided by the fraud analysts. Meanwhile, the remaining cases, other anomalous situations with the same behavior as the previously detected cases, are under inspection by the company analysts. Thus, the next efforts will be directed to the development of a database of fraud cases, as well as a rule induction engine to help analysts in the evaluation of the alarms. Concerning scalability, preliminary results showed us that the most costly step is the calculation of the summaries and signatures. It requires several aggregation functions over CDR records with the purpose of grouping information by customer. At this time, this is done by several SQL scripts over a Microsoft SQL Server 2005 database. Once this information is available, we can make use of each method discussed in this work, in no pre-defined order, to detect anomalies. When dealing with such huge data we have realized that working with chunks of information (summaries and signatures) plus clustered index structures improves the processing time by at least one order of magnitude without losing quality of the results. On the other hand, it introduces a new problem, in the sense that sliding the window from one time unit to the next requires rebuilding all the respective indexes. Finally, in the case of dynamic clustering we had to divide the original chunk of data D into a set of mutually exclusive partitions Di, in order to make the processing of each partition feasible. After all partitions have been processed, the last step is to merge all the clustering information resulting from each chunk processed. The parameters that describe the cluster topology obtained for each block are gathered in a unique set Df; these parameters are the data objects for the further processing that yields the final K clusters. In future work, we intend to report several scenarios of utilization and optimization of both elements discussed in this work for detecting anomalous situations.

The bottom (gray) line in Table 3 shows the cluster with the highest number of calls. Figure 6 shows examples of changes in cluster membership which represent real fraud situations identified by this method. The first and second customers pass from clusters with a lower average number of calls (clusters 1 and 2) to the cluster with the highest number of calls (cluster 8), in days 4 and 3 respectively. The third customer in this example, although always in the same cluster, registered a significant variation between days 5 and 6.

Figure 6. Example of anomalous situations related to the increase in the number of calls.

6. FINAL DISCUSSION
In this work we have presented two methods for detecting telecom fraud situations. Both methods rely on the concept of signature to summarize the customer behavior over a certain period of time. In the first approach, the user signature is used as a comparison basis: a differentiation between the actual behavior of the user and his signature may reveal an abnormal situation. The second approach uses dynamic clustering analysis in order to evaluate changes in cluster membership over time. The clear advantage of these detection methods is that they complement each other in reporting anomalous situations; for instance, in Section 5 we show an overlap of 66% among the fraud situations raised by the proposed methods. The experimental evaluation performed with data from a week of voice calls, and the respective comparison with a list of previously detected fraud cases, allowed us to conclude that the proposed methods detect a high rate of true positives (91%). Additionally, they discovered other fraud situations which had not been reported previously by the analysts. Preliminary discussion with fraud analysts gave us feedback about the promising capabilities of the proposed methodologies.

7. REFERENCES
[1] Y. Kou, T. Lu, S. Sirwongwattana, and Y. Huang. Survey of fraud detection techniques. In Proceedings of the IEEE International Conference on Networking, Sensing and Control, March 2004.
[2] T. F. Lunt. A survey of intrusion detection techniques. Computers and Security, (53):405-418, 1999.
[3] Corinna Cortes and Daryl Pregibon. Signature-based methods for data streams. Data Mining and Knowledge Discovery, (5):167-182, 2001.
[4] Myers and Myers. Probability and Statistics for Engineers and Scientists. Prentice Hall, 6th edition.
[5] Pedro Ferreira, Ronnie Alves, Orlando Belo, and Luís Cortesão. Establishing Fraud Detection Patterns Based on Signatures. In Proceedings of the Industrial Conference on Data Mining 2006, July 2006.
[6] Pedro Ferreira, Orlando Belo, Ronnie Alves, and Joel Ribeiro. Fratelo - Fraud in Telecommunications: Technical Report. Tech Report 1, University of Minho, Department of Informatics, May 2006.

Interactivity Closes the Gap


Lessons Learned in an Automotive Industry Application
Axel Blumenstock
Department of Applied Information Processing University of Ulm, Germany

Jochen Hipp, Steffen Kempe, Carsten Lanquillon, Rüdiger Wirth


DaimlerChrysler Group Research Ulm, Germany

axel.blumenstock@uni-ulm.de

{jochen.hipp, steffen.kempe, carsten.lanquillon, ruediger.wirth}@dcx.com

ABSTRACT


After nearly two decades of data mining research there are many commercial mining tools available, and a wide range of algorithms can be found in the literature. One might think there is a solution to most of the problems practitioners face. In our application of descriptive induction on warranty data, however, we found a considerable gap between many standard solutions and our practical needs. Confronted with challenging data and requirements such as understandability and support of existing workflows, we tried many things that did not work, ending up with simple solutions that do. We feel that the problems we faced are not so uncommon, and would like to advocate that it is better to focus on simplicity (allowing domain experts to bring in their knowledge) rather than on complex algorithms. Interactivity and simplicity turned out to be key features for success.

1. INTRODUCTION
An air bellow bursts: this happens on one truck, on another it does not. Is this random coincidence, or the result of some systematic weakness? Questions like these have ever been keeping experts busy at DaimlerChrysler's After Sales Services. Recently, they have attracted even more attention, when Chrysler's CEO LaSorda introduced the so-called tag process: a rigorous quality enhancement initiative that once more mirrors the enormous business relevance of fast problem resolution [3]. This primary goal of quality enhancement entails several tasks to be solved:
- detecting upcoming quality issues as early as possible
- explaining why some kind of quality issue occurs and feeding this information back into engineering
- isolating groups of vehicles that might suffer a certain defect in the future, so as to make service actions more targeted and effective.
Our research group picks up common data mining methods and adapts them to the practical needs of our engineers and domain experts. This contribution reports on the lessons learned. In particular, we elaborate on our experience that the right answer to domain complexity need not be algorithmic complexity, but rather simplicity. Simplicity opens ways to create an interactive setup which involves experts without overwhelming them. And if truly involved, an expert will understand the results and turn them into action. We will outline the problem setting in Section 2. The subsequent sections respectively discuss the theoretical aspects, tool selection and model building methods, each answering the questions of what we tried and what finally worked.

2. DOMAIN AND REQUIREMENTS
2.1 The Data
Most of the data at hand is warranty data, providing information about diagnostics and repairs at the garage. Further data is about vehicle production, configuration and usage. All these sources are heterogeneous, and the data was not collected for the purpose of causal analyses. This raises questions about reliability, appropriateness of scale, and level of detail. Apart from these concerns, our data has some properties that make it hard to analyze, including:
- Imbalanced classes: The class of interest, made up of all instances for which a certain problem was reported, is very small compared to its contrast set. Often, the proportion is way below 1%.
- Multiple causes: Sometimes, a single kind of problem report can be traced back to different causes that produced the same phenomenon. Therefore, on the entire data set, even truly explanatory rules show only modest qualities in terms of statistical measures.


- Semi-labeledness: The counterpart of the positives is not truly negative. If there is a warranty entry for some vehicle, it is (almost) sure that it indeed suffered the problem reported on. For any non-positive example, however, it is unclear whether it carries problematic properties and may fall defective in the near future.
- High-dimensional space of influence variables (1000s).
- Influence variables interact strongly: Some quality issues do not occur until several influences coincide. And, if an influence exists in the data, many other non-causal variables follow, showing positive statistical dependence with the class as well.
- True causes not in data: By chance, they are concludable from other, influenced variables.

2.2 The Domain Experts and Their Tasks
Our users are experts in the field of vehicle engineering, specialized in various subdomains such as engine or electrical equipment. They keep track of what goes on in the field, mainly by analyzing warranty data, and try to discover upcoming quality issues as early as possible. If they recognize a problem, they strive to find out the root causes in order to address it most accurately. They have been doing these investigations successfully over years. Now, data mining can help them to better meet the demands of fast reaction, well-founded insight and targeted service. But any analysis support must fit into the users' mindset, their language, and their workflow. The structure of the problems to be analyzed varies substantially. This task requires inspection, exploration and understanding for every case anew. Ideally, the engineers should be enabled to apply various exploration and analysis methods from a rich repository. And it is important that they do it themselves, because no one else could decide quickly enough whether a certain clue is relevant and should be pursued, and ask the proper questions. Finding out the reasons for strange phenomena requires both comprehensive and detailed background knowledge. Yet, the engineers are not data mining experts. They could make use of data mining tools out of the box, but common data mining suites already require a deeper understanding of the methods. Further, the users are reluctant to accept any system-generated hypothesis if the system cannot give exact details that justify this hypothesis. The bottom line is that penetrability and, again, interactivity are almost indispensable features of any mining system in our field.

3. UNDERSTANDING THE TASK
Let us first have a theoretical look at the problem. It is noteworthy that we will meet the following arguments again when we investigate individual methods.

3.1 What we tried
A great portion of the task can be seen as a classification problem. We would like to separate the good from the bad. It may be possible to tell for any vehicle whether it might encounter problems in the future. And if we choose a symbolic method, we can use the model to explain the problem. As stated above, however, the data is semi-labeled, and the problem behind the positive class may have multiple causes. These properties act as if there were a strong inherent noise that changes the class variable in either direction. Classifier induction tries to separate the classes in the best possible way but can return unpredictable, arbitrary results when noise increases. For our application, it suffices to grab the most explainable part of the positives and leave the rest for later investigation or, finally, ascribe it to randomness. In other words, we experienced that anything beyond partial description is not adequate here (confer Hand's categorization into mining the complete relation versus local patterns [5]). So we came up with subgroup discovery (e.g., [8]). It means to identify subsets of the entire object set which show some unusual distribution with respect to a property of interest, in our case the binary class variable. Results from subgroup discovery approaches need not be restricted to knowledge acquisition, but can be re-used for picking out objects of interest. This is the partial classification we want, where a statement about the contrast set is not adequate or required. Still, data properties make subgroup discovery results unusable most of the time. There are many candidate influences, and they interact strongly. Therefore, even if the cause could be described by a sole variable, it would be hard to find it among the set of variables influenced by it. All these variables, including the causal one, would refer to roughly the same subset of vehicles with an increased proportion of positives.

3.2 What works
Opposing it to mere discovery, we would rather like to talk about subgroup description. It is to identify the very same subgroups, but in a way as comprehensive and informative as possible. The rationale is, even if subgroup discovery results are presented in a human-readable form, the users are left alone to map these results to synonyms that can be more meaningful in the context of the application. In a domain with thousands of influence variables, however, the users cannot be expected to bear all the (possibly even multivariate) interactions in their minds. Vehicle configuration, for instance, contains hundreds of strongly interrelated variables, dependent as well on type, production date and destination region. Subgroup description is thus required to provide any reasonable explanation as long as there is no evidence that the finding is void or unjustified.

4. A TOOL THAT SUITS THE EXPERTS
4.1 What we tried
We had a look at several commercially available data mining suites and tools. Unfortunately, all of these fell short of the requirements outlined in Section 2.2. As an overall observation, they were rather inaccessible and often did not allow for interaction at the model building level. Even if they did, they could not present information (like measures) in the non-statistician user's language. Tools of this kind offer their methods in a very generic fashion, so that the typical domain expert does not know where to start. In short, we believe that the goal conflict between flexibility and guidance can hardly be solved by any general-purpose application, where the greatest simplification potential, namely domain adaption, remains unexploited.

Figure 1: Coarse usage model of our tool (boxes: Prepare Data, Explore, Build Tree Model, Build Rule Model; complexity upon request). There is a fixed process skeleton corresponding to the original workflow. The user can just go through, or gain more flexibility (and complexity) upon request.

4.2 What works
We ended up programming a tool of our own. Figure 1 shows a simplified view of our tool's process model. It emerged as the union of our experts' workflows and thus offers guidance even for users not overly literate in data mining. At the same time, it does not constrain the user to a single process but allows going deeper and gaining flexibility wherever the user is able and willing to. For example, the users start with extracting data for further analysis. We tried to keep this step simple and hide the complexities as much as possible. The user just selects the vehicle subset and the influence variables he likes to work with. A metadata-based system takes care of joins, aggregations, discretizations or other data transformation steps. This kind of preprocessing is domain specific, but still flexible enough to adapt to changes and extensions. In the course of their analyses, the experts often want to derive variables of their own. That way, they can materialize concepts otherwise spread over several other conditions. This is an important point where they introduce case-specific background knowledge. The system allows them to do so, up to the full expressiveness of mathematical formulas. A similar fashion of multi-level complexity is offered for the Explore box in Figure 1: the system offers both standard reports, suiting the experts' needs in most of the cases, up to individually configurable diagrams. For the sake of model induction, our tool currently offers two branches that interact and complement each other: decision trees and rule sets.

5. INTERACTIVE DECISION TREES
Subgroup discovery (and description) can be mapped to partitioning the instance set into multiple decision tree leaves. Paths to leaves with a high share of the positive class provide a description of an interesting subgroup. In fact, decision tree induction roughly corresponds to what our experts had been doing even before getting in touch with data mining. Hence, decision trees were the first method we chose.

5.1 What we tried
To quickly provide the users with explanation models, an obvious first step was to build decision trees automatically, as is typically done when inducing tree-based classifiers ([2, 6, 11]). However, the experts deemed the results unusable most of the time, because the split attributes that had been selected by any of the common top-down tree induction algorithms were often uninformative or meaningless to them: the top-ranked variable was rarely the most relevant one. For some time, we experimented with different measures. The literature suggests measures such as information gain, information gain ratio, χ² p-value, or gini index, to mention the most important ones. However, in an exemplary analysis case, there was a variable that gave the actual hint for the expert to discover the quality issue's cause. This variable was ranked 27th by information gain, 41st by gain ratio, 36th by p-value and 33rd by gini index. We conclude that an automatic induction process hardly could have found a helpful tree.

5.2 What works
This is where interactivity comes into action. This is close to what Ankerst proposed [1], except for the mining goal. Building trees interactively relieves the measure of choice from the burden of selecting the single best split attribute. The idea is almost trivial: present the attributes in an ordered list and let the expert make tentative choices until he finds one he considers plausible. What remains is the problem of how to rank the attributes in a reasonable way. But even for ranking, the aforementioned statistical measures proved little helpful. We explain this by the fact that they are measures designed for classifier induction, trying to separate the classes in the best possible way. But as illustrated in Section 3, this is not the primary goal in our application. Most of the time, we deal with two-class problems anyway: the positive class versus the contrasting rest. Hence, we can use the measure lift (the factor by which the positive class rate in a given node is higher than the positive class rate in the root node). To complement the lift value of a tree node, we use the recall (the fraction covered) of the positive class. Both lift and recall are readily understandable for the users as they have immediate analogies in their domain. Now, focusing on high-lift paths, the users can successively split tree nodes to reach a lift as high as possible while maintaining nodes with substantial recall. In order to condense this into a suitable attribute ranking, we group attribute values (or, the current node's children). We require the resulting split to create at most k children, where typically k = 2 so as to force binary splits. This ensures both that the split is handy and easily understood by the user, and that the subsequent attribute ranking can be based consistently on the child node with the highest lift. To group the children in a reasonable way, we simply sort them by lift. Then, keeping their linear order, we cluster them using several heuristics: merge smallest nodes first, merge adjacent nodes with lowest lift difference. Lift and recall of the resulting highest-lift node are finally combined into a one-dimensional measure (weighted lift, or explanational power) in order to create the ranking. Grouping is automatically performed during attribute assessment. Still, the users can interactively undo and redo the grouping or even arrange the attribute values into any form that they desire. This is important to further incorporate background knowledge, e.g. with respect to ordered domains, geographical regions, or, in particular, components that are used in certain subsets of vehicles and should, thus, be considered together. As an alternative to a ranked list, the user can still get the more natural two-dimensional presentation of the split attributes (Figure 2). Similar to within a ROC space, every such attribute is plotted as a point. We use recall and lift as the two dimensions.

Figure 2: Quality space for the assessment of split attributes. Each dot represents an attribute, plotted over recall (x axis) and lift (y axis) of the best (possibly clustered) child that would result. Dots are plotted bold if there is no other dot that is better in both dimensions. The curves are isometrics according to the recall-weighted lift.
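A minimal sketch of the ranking step described in Section 5.2, under the assumption that the one-dimensional score is the product of lift and recall of the highest-lift child; the authors' tool uses richer grouping heuristics and lets the expert override the grouping.

```python
def lift_and_recall(node_labels, all_labels):
    # lift: positive rate in the node divided by the positive rate at the root;
    # recall: fraction of all positives covered by the node (assumes positives exist)
    root_rate = sum(all_labels) / len(all_labels)
    pos = sum(node_labels)
    return (pos / len(node_labels)) / root_rate, pos / sum(all_labels)

def rank_attributes(rows, labels, attributes):
    # Rank candidate split attributes; the ranking is based on the child node
    # with the highest lift, scored by lift * recall (an assumed weighting).
    ranking = []
    for att in attributes:
        children = {}
        for row, y in zip(rows, labels):
            children.setdefault(row[att], []).append(y)
        kids = sorted(children.values(),
                      key=lambda ys: lift_and_recall(ys, labels)[0],
                      reverse=True)
        lift, recall = lift_and_recall(kids[0], labels)
        ranking.append((lift * recall, att, round(lift, 2), round(recall, 2)))
    return sorted(ranking, reverse=True)

# Toy usage: 1 = vehicle with the problem report, 0 = rest
rows = [{"region": "N", "engine": "A"}, {"region": "N", "engine": "B"},
        {"region": "S", "engine": "A"}, {"region": "S", "engine": "A"}]
labels = [1, 1, 0, 0]
print(rank_attributes(rows, labels, ["region", "engine"]))
```

In the interactive setting, the expert then picks a plausible attribute from such a ranking rather than accepting the top-scored one automatically.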

6. INTERACTIVE RULE SETS
As an important data property we mentioned that influences interact in a way that some quality issues do not occur until several influences coincide. While decision tree building is intuitive, its search is greedy and thus may miss interesting combinations. So the experts asked for an automatic, more comprehensive search. This led us to rule sets.

6.1 What we tried
Among others (e.g., [7, 13, 12]), a well-known subgroup discovery algorithm is CN2-SD [9]. It induces rules by sequential covering: by heuristic search, find a rule that is best according to some statistical measure; reduce the weights of the covered examples, and re-iterate until no reasonable rule can be found any more. The first handicap of this procedure is the same as with decision trees: there is no measure that could guarantee to select the best influence, here: rule. But even the hope that a good rule will be among the subsequently mined ones need not hold: imagine there are two rules describing exactly the same example set. CN2-SD will never find both, because by modifying the examples' weights, the two rules' ranks will change simultaneously. This, however, runs counter to the idea of subgroup description, in other words, comprehensiveness at the textual level rather than mere subset identification.

6.2 What works
We thus came up with an exhaustive search (within constraints). It is realized by an association rule miner with fixed consequence. This is not new, and like us, many research groups think about how to handle redundancy within the results (e.g., [4, 10, 14]). What we like to point out here is that, once again, the idea of interactivity produced a simple but effective solution. The expert is enabled to control a CN2-SD-like sequential covering. He picks a rule he recognizes as interesting or already known. This is comparable to selecting a decision tree split attribute. Several measures, fitting into his mindset, support him in his choice. The instance set is then modified so as to remove the marked influence, and the expert can re-iterate to find the next interesting rule.
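A toy version of this expert-guided covering loop, assuming categorical data and using lift as the only measure; the bounded exhaustive enumeration below stands in for the association rule miner with fixed consequence mentioned above.

```python
from itertools import combinations

def mine_rules(rows, labels, max_len=2, min_cover=3):
    # Enumerate conjunctions of (attribute, value) conditions with the fixed
    # consequence "positive class", scored by lift; the length bound and the
    # minimum coverage keep the exhaustive search tractable.
    conditions = sorted({(a, v) for row in rows for a, v in row.items()})
    base_rate = sum(labels) / len(labels)
    rules = []
    for k in range(1, max_len + 1):
        for conds in combinations(conditions, k):
            covered = [y for row, y in zip(rows, labels)
                       if all(row.get(a) == v for a, v in conds)]
            if len(covered) >= min_cover:
                lift = (sum(covered) / len(covered)) / base_rate
                rules.append((round(lift, 2), conds, len(covered)))
    return sorted(rules, reverse=True)

def remove_covered(rows, labels, conds):
    # Covering step controlled by the expert: drop the instances matched by the
    # rule he marked as interesting or already known, then mine again.
    keep = [i for i, row in enumerate(rows)
            if not all(row.get(a) == v for a, v in conds)]
    return [rows[i] for i in keep], [labels[i] for i in keep]
```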

7. MODULE INTERACTION
The key property that makes a tool more than the sum of its components, however, is the facility of interaction between its exploration and modelling components. This is still only partly implemented, but our users strongly request it. Indeed, module interaction is the feature that allows them to flexibly apply the methods offered and to take the respective best out of them. Such sometimes trivial but practically important features include:
- Extracting instance subsets as covered by a rule or tree path and exchanging them within the modules for deeper analyses or visualization.
- Building a tree with a path as described by a rule in order to take a closer look at the respective contrast sets.
- Deriving new variables from tree paths or rules.

8. CONCLUSION
We reported on our experiences of applying data mining methods in a domain where data is difficult, analysis tasks change structurally case by case, and thus a great amount of background knowledge is indispensable. Many approaches suggested in the literature turned out either too constrained or too complex to be offered without major adaption. In such a setting, we consider it best to stick to simple methods, provide these in a both flexible and understandable way, and settle on interactivity. Still, there is a wide field to explore. At many points of the process, there is much room for methods that support the experts and reduce their routine work load as much as possible.

9. REFERENCES
[1] M. Ankerst, C. Elsen, M. Ester, and H.-P. Kriegel. Visual classification: an interactive approach to decision tree construction. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
[2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman & Hall, 1984.
[3] M. Connelly. Chrysler's LaSorda on quality: Fix it now. Automotive News, May 9th 2005.
[4] F. Gebhardt. Choosing among competing generalizations. Knowledge Acquisition, (3), 1991.
[5] D. J. Hand. Data mining: reaching beyond statistics. Research in Official Statistics, (2):5-17, 1998.
[6] G. V. Kass. An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29:119-127, 1980.
[7] W. Klösgen. EXPLORA: A multipattern and multistrategy discovery assistant. In Advances in Knowledge Discovery and Data Mining, pages 249-271. 1996.
[8] W. Klösgen. Applications and research problems of subgroup mining. In Proceedings of the 11th International Symposium on Foundations of Intelligent Systems, 1999.
[9] N. Lavrač, P. A. Flach, B. Kavšek, and L. Todorovski. Rule induction for subgroup discovery with CN2-SD. In ECML/PKDD'02 Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning, 2002.
[10] B. Liu, M. Hu, and W. Hsu. Multi-level organization and summarization of the discovered rules. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.
[11] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[12] G. I. Webb. OPUS: An efficient admissible algorithm for unordered search. Journal of Artificial Intelligence Research, 3, 1995.
[13] S. Wrobel. An algorithm for multi-relational discovery of subgroups. In Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, 1997.
[14] X. Yan, H. Cheng, J. Han, and D. Xin. Summarizing itemset patterns: a profile-based approach. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005.

The Business Practitioner's Viewpoint: Discovering and Resolving Real-Life Business Concerns through the Data Mining Exercise
Richard Boire
Boire Filler Group 1020 Brock Rd South, Suite 2008 Pickering, ON L1W 3H2, Canada +1-905-837-0005

Purpose of Discussion

Recognizing that data mining has become a more common process for the effective use and dissemination of information in all sectors of society, it is now quite common for data mining to be a critical process within a business organization. Through the use of various techniques and approaches, businesses have achieved huge gains that positively impact the profit and loss statements of the organization. Yet, in many ways, the typically large benefits and gains seen within the scientific and academic environments fail to materialize within the business environment. The imperfect world of customer behaviour and the resulting information requires a more practical approach that in effect may minimize the benefits of data mining. However, even with these limitations, significant business gains would never be realized without the use of data mining. With close to 25 years of business experience in both marketing and credit risk, the presenter will attempt to relay his real-life experience in applying data mining to a wide variety of business scenarios. At the end of this exercise, attendees should be able to understand and appreciate the limitations of applying data mining techniques within many business environments. Yet, readers will also appreciate how data mining practitioners leverage their business application experience along with their data mining knowledge to extract as much business advantage as possible.


Background

The presenter's background in data mining began in 1983 at Reader's Digest Canada, where he built predictive models to target both prospects and subscribers for Reader's Digest products. His experience at American Express Canada provided insights regarding the application of data mining techniques in trying to optimize both marketing behaviour and credit behaviour. As a top Canadian consultant within the marketplace for the last 10 years, the author has been fortunate to both broaden and deepen his knowledge in such sectors as:
- Retail
- Pharmaceutical
- Insurance
- Non-Profit
- Investment Industry
- Telecommunications
- Technology
- Financial Services
Attached to this proposal are the general credentials of the presenter's organization (the presenter is currently one of two partners in the business), which contain more input regarding his skill and expertise.

Approach to Discussion

The discussion will outline a concern or issue and the business case where this concern arose. The discussion of a particular data mining issue and the particular business case will be focused around the four main stages of a data mining exercise:
- Problem/Business Challenge Identification
- Creating the Analytical File/Data Environment
- Applying the Right Analytical Tools
- Implementation
As with many data mining projects, practitioners often identify a problem or challenge at the beginning of a project, through some fact-finding exercise, that can be resolved through certain data mining techniques. In all of these projects, even so-called tactical projects such as the development of a specific predictive model, there is a significant amount of data discovery that occurs within each project. This data discovery often leads to issues and concerns that were completely unknown prior to the project's commencement. In dealing with these new issues and concerns, data miners will display the flexibility that is required to achieve solutions which are often sub-optimal yet still better than doing nothing. Timelines in many of these cases necessitate the delivery of a sub-optimal solution, recognizing that a subsequent phase of the project will more fully address the concern. But it is the discovery of these concerns that further reinforces the cyclical nature of the data mining process, with learning being the ultimate objective.

[Figure: the cyclical data mining process: Track and Obtain Results -> Analyze Information -> Derive New Learning -> Apply New Learning within the Decision-Making Process]

Some Key Business Issues to Discuss

1. Overstatement

This financial institution example represents a case where the initial data mining solution purported to provide excellent results. Upon application within a direct marketing program, the results were horrible. We will demonstrate why the results were horrible and, even more importantly, why they should have been expected given the data environment at the time.

More detail on 1 - Overstatement

A good example of the above was a case where a mortgage insurance response model was built to identify bank mortgage holders who were most likely to purchase mortgage insurance. In reviewing the initial model, we found that the gains or lift was very strong, in particular within the top 5% of customers that were promoted. The results below are produced once we have developed a model. Once the model is developed, it is applied against a validation or hold-out sample in order to measure how the model would have performed had it been applied against this group of hold-out names. You are seeing the results of the model against these hold-out names.

% of names promoted    % of all mortgage insurance buyers captured
0-5%                   80%
5%-10%                 5%
10%-15%                4%
15%-20%
...
95%-100%               1%

These types of results appear, on the surface, to be very good. However, our practical experience indicates that they may be too good to be true. At this point, we delve into some statistics to better understand what is going on. We conduct a simple stepwise regression in order to understand the % contribution of each variable within the overall model. Within this five-variable model, we realize that the variable "ever bought insurance" accounts for 85% of the power of the model.
Model Variable               % contribution to Model
Ever Bought Insurance        85%
1 or more lending products   8%
Have an investment product   5%
Have a credit card           1%
Live in Ontario              1%

Now, these results could simply tell us that we should conduct upfront segmentation: perhaps we should separate the universe into those who bought insurance and those who did not, and develop models for each group. However, we decide that further forensics should be conducted on the data by looking at the correlation of all the potential independent variables against the objective function of buying insurance. Listed below is a summary of these correlation results. The table is produced by ranking the variables by their absolute correlation with response, giving a univariate perspective on how each potential model variable impacts response.

Variable                        Correlation Coefficient
Ever Bought Insurance           0.75
1 or more lending products      0.20
Have a line of credit product   0.18
Have a car loan                 0.17
Have an RRSP                    0.16
Have an RESP                    0.15
Have an investment product      0.16
Live in Toronto                 0.15
Live in Ontario                 0.14
...
Have a chequing account         0.0002
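A minimal sketch of how such a correlation report might be produced, assuming the analytical file is available as one flat table of binary 0/1 columns with a binary response column (the file and column names below are hypothetical, not the client's actual fields):

import pandas as pd

# Hypothetical analytical file: one row per customer, binary 0/1 columns.
df = pd.read_csv("analytical_file.csv")

response = df["bought_mortgage_insurance"]
candidates = df.drop(columns=["bought_mortgage_insurance"])

# Correlation of each candidate variable against the response,
# ranked by absolute value, as in the report above.
report = (
    candidates.corrwith(response)
    .rename("correlation")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(report.head(20))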

For many of our clients, this type of report usually comprises around 100 to 150 variables, and it can comprise several hundred for our data-rich clients. In reviewing these numbers, the red flag is the magnitude of the difference between the "ever bought insurance" correlation coefficient and the other variables: the ratio is almost 4 to 1. Clearly, there appears to be something going on with this variable that we need to better understand, and it warrants further exploration within the data itself. Specifically, we need to better understand how this variable was created. We then go back to the analytical file which was created during this exercise. We observe 100 records and all the variables that were created during this exercise. Upon review of the variables and the information contained within these 100 records, we can begin to diagnose the real problem. Listed below is a sample of 11 mortgage insurance responders and 11 mortgage insurance non-responders.

Response   Ever bought insurance   1 or more lending products   Have an investment   Have a credit card   Live in Ontario
yes        yes                     yes                          yes                  yes                  yes
yes        yes                     yes                          no                   yes                  no
yes        yes                     yes                          yes                  no                   yes
yes        yes                     yes                          yes                  no                   no
yes        yes                     yes                          no                   yes                  yes
yes        yes                     no                           yes                  yes                  no
yes        yes                     yes                          yes                  no                   yes
yes        yes                     yes                          no                   no                   no
yes        yes                     yes                          yes                  yes                  yes
yes        yes                     yes                          yes                  yes                  no
yes        yes                     no                           yes                  no                   yes
no         no                      no                           no                   no                   no
no         no                      yes                          no                   no                   no
no         no                      no                           yes                  no                   no
no         no                      yes                          no                   no                   yes
no         yes                     no                           no                   no                   yes
no         yes                     no                           no                   no                   no
no         no                      no                           no                   yes                  no
no         no                      no                           no                   yes                  no
no         yes                     no                           yes                  yes                  no
no         no                      yes                          yes                  yes                  yes
no         no                      no                           no                   yes                  no

In the above report, all responders have also bought insurance. This should not happen in reality, as there should always be some non-insurance buyers who purchase mortgage insurance as their first insurance product. We then ask how this variable was created and discover that the variable "ever bought insurance" was calculated at the current time rather than prior to the event of purchase. Of course, anyone purchasing mortgage insurance will always be recorded on the database as having bought insurance. This example serves to highlight the importance of crafting the analytical file, or data file, in the right manner. In this case, we should craft our variables within an analytical file that contains a pre-period and a post-period.
[Schematic: analytical file layout: pre-period contains all independent variables; post-period contains the dependent variable]

In the above schematic diagram, the variable "mortgage insurance response" should be created in the post-period, while the variable "ever bought insurance" should be created in the pre-period along with all the other independent variables.
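A sketch of how such an analytical file could be assembled with an explicit observation date, so that every independent variable (including "ever bought insurance") is computed strictly before the window in which the response is measured; the dates, file, and column names are illustrative assumptions, not the bank's actual data:

import pandas as pd

obs_date = pd.Timestamp("2005-06-30")   # end of the pre-period
resp_end = pd.Timestamp("2005-12-31")   # end of the post-period

cust = pd.read_csv("transactions.csv", parse_dates=["purchase_date"])

# Independent variables: built only from transactions dated on or before obs_date.
pre = cust[cust["purchase_date"] <= obs_date]
ever_bought_insurance = (
    pre[pre["product"] == "insurance"]
    .groupby("customer_id").size().gt(0)
    .rename("ever_bought_insurance")
)

# Dependent variable: mortgage insurance purchases strictly inside the post-period.
post = cust[(cust["purchase_date"] > obs_date) & (cust["purchase_date"] <= resp_end)]
response = (
    post[post["product"] == "mortgage_insurance"]
    .groupby("customer_id").size().gt(0)
    .rename("response")
)

analytical_file = (
    pd.DataFrame(index=cust["customer_id"].unique())
    .join(ever_bought_insurance)
    .join(response)
    .fillna(False)
)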

From the outset, looking merely at the statistics and output in their purest form might lead us to believe that we have a very good solution. But the experienced practitioner's perspective would question the validity of these results and might suggest that they are being overstated. The experienced data miner would conduct further forensics, as seen in the above case, with the objective of truly understanding what these results mean. The data miner has to roll up his or her sleeves and get into the data to discover the real reason behind these results. This type of unglamorous detective work often goes unnoticed, as these situations are never presented to the senior-level business stakeholders; yet, as in any scientific discipline, there is an intrinsic satisfaction for the data miner in having solved a problem. It is this disciplined approach of always questioning results that will ultimately lead to highly effective data mining solutions.

2. No Data Environment

This retail example demonstrates the creativity of the data mining practitioner's mindset when building a solution with no customer-level data from the company. Through market-research learning and a large portion of creativity on behalf of the data miner, a solution was developed in lieu of doing nothing.

More detail on 2 - No Data Environment

In this example, no behavioural or demographic data was available at the individual level. However, a market research study of a sample of the population revealed that the company's products appealed to people with the following characteristics:
-High income
-Recent immigrants
-Females

Despite the lack of data at the individual level, the data miner will seek out other alternatives rather than doing nothing. Using his or her knowledge of what is available externally, the data miner realizes that Statistics Canada data might provide a viable alternative in this situation. Stats Can Census data provides an array of variables (over 2,000) which look at ethnicity, language, immigration patterns, religion, and a range of other demographics. This data is collected at the enumeration area (EA) level. Currently there are approximately 50,000 EAs in Canada. These EAs are then mapped to postal codes through a conversion table which contains approximately 800,000 postal codes. At the end of this process, an 800,000-record postal code table can be created with all its associated characteristics. A simplified example containing a few postal codes and some variables is listed below.
Postal Code   % of population speaking French   % of population with college education   Average Household Size
M5A1A3        11%                               34%                                      1.5
H4B2E5        81%                               22%                                      4
V636J4        9%                                21%                                      3

As you can see from the above, we do have information which we can action at the postal code level. However, the limitation of the data is that it is at the aggregate level. For instance, if Richard Boire and John Smith are in the same census area, their information on all census characteristics will be identical. This loss of granularity from using aggregate-level data still provides some capability to produce good solutions, but not at the same level of performance as if we had individual-level data. Going back to our market research example, the data miner would focus on three Stats Can Census elements: % of females in the Census Area, median income in the Census Area, and % of the population who immigrated in the last 5 years in the Census Area.

The data miner would then create an index of each of the above variables and create an overall composite index which weights each variable equally. Listed below is an example of the index that would be calculated for one postal code.

                      Income    % Female   % Landed Immigrants
Average               $40,000   52%        5%
Postal Code M5A 1J2   $50,000   60%        10%
Index                 1.25      1.15       2.00
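A sketch of the indexing step, under the assumption (consistent with the example above) that each element's index is the postal-code value divided by the average, and that the composite is a simple equal-weight mean; the file and column names are hypothetical:

import pandas as pd

pc = pd.read_csv("postal_code_census.csv")   # one row per postal code

# The three census elements identified by the market research.
targets = ["median_income", "pct_female", "pct_recent_immigrants"]
averages = pc[targets].mean()

# Index each element against its average, then combine with equal weights.
indexed = pc[targets] / averages
pc["composite_index"] = indexed.mean(axis=1)

# Rank postal codes so prospects can be selected from the top of the list
# until the acquisition budget is exhausted.
ranked = pc.sort_values("composite_index", ascending=False)[["postal_code", "composite_index"]]
print(ranked.head())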

With each postal code containing a composite index, we now have a postal code scoring system. The utility of such a system is that prospects for this retail company can now be ranked based on where they live. See simplified example below:

% of File    # of Postal Codes   Minimum Index in Interval   # of Prospects
0-5%         40,000              5.5                         80,000
5%-10%       40,000              5.0                         60,000
10%-15%      40,000              4.8                         90,000
...
95%-100%     40,000              0.05                        30,000
Total        800,000                                         3,000,000

Presuming that there are limited budgets for acquiring new customers, this information can help determine which prospects to promote given a fixed amount of dollars: names can be selected based on where they live (postal code).

3. Poor Data Quality

This retail organization was attempting to use data mining for the first time in order to develop effective direct marketing programs. The organization, though, had no idea of the current state of its information environment. Our data discovery process actually led us to provide recommendations on improving their existing data quality before conducting any data mining exercise. Once these data quality recommendations were made, we will discuss how data mining was used within their data-challenged environment.

More detail on 3 - Poor Data Quality

As with any organization beginning its foray into CRM or database marketing programs, it is critical to understand the existing data environment. A rigorous process is conducted which is often referred to as a data audit. The process involves frequency distributions on each field of each file within the organization's information environment. Since the objective for the organization is to conduct database marketing programs, we wanted to focus on the information that was relevant for marketing. This meant that many files related to finance/accounting, human resources, etc. were not used. The purpose of these extensive and exhaustive frequency distribution reports is to obtain insights into the integrity and quality of the data. For example, are there too many missing values within a given field? Are there values within a certain field that don't make sense? The results of this type of exercise provide a "state of the nation" concerning the organization's data environment. This indicates to the organization what needs to be done to improve the data environment and what can be done regarding database marketing programs given the current data limitations. Our initial investigation focused on their AS/400 system, which had a customer file and a billing file. Frequency distributions were conducted on each field in each file. Our key findings were as follows. See example below.

Region              # of Customers   % of Total Customers
Prairie Provinces   25 M             2.5%
Quebec              100 M            10%
Ontario             350 M            35%
West                25 M             2.5%
Missing Values      500 M            50%
Total               1 MM             100%

From these results, we found that 50% of the file had no postal code, implying that direct mail programs would have limited impact. Yet we also found that customers had more than one customer I.D., which after investigation meant that the same person going into two different stores would have two different I.D.s.
Customer ID   Last Name   First Name   Phone Number
1000009       Boire       Richard      905-509-8053
1000115       Boire       Richard      905-509-8053
1000119       Boire       Richard      905-509-8053
1000125       Boire       Richard      905-509-8053

But our investigation also revealed that each customer did have a unique phone number. This meant that if we were going to conduct customer-level analysis, we would use phone number rather than customer I.D. as our unique customer-level identifier. Accordingly, the company set up processes to create new customer I.D.s based on phone number and, going forward, built systems that connected the phone number to the right customer I.D. Yet our investigation went beyond the data itself and into researching suppliers that could help with the company's challenge of a high level of missing postal codes. Our investigation revealed that there was a company with software that could link phone numbers to name and address. Recognizing that the phone number information in the database was 100% accurate, we recommended this organization as a means to repopulate missing names and addresses on the file. With the above data audit completed and recommendations in place, the company could begin to conduct database marketing programs that would produce meaningful results.
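A sketch of the kind of consolidation step implied above, assigning one surrogate customer I.D. per unique phone number; the file and column names are hypothetical:

import pandas as pd

cust = pd.read_csv("as400_customer_file.csv")

# The phone number is the reliable customer-level key, so derive one
# surrogate I.D. per unique phone number and attach it to every record.
cust["new_customer_id"] = cust.groupby("phone_number").ngroup() + 1

# All legacy I.D.s that map to the same person are now linked together.
xref = cust[["customer_id", "phone_number", "new_customer_id"]].drop_duplicates()
print(xref.sort_values("new_customer_id").head())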

4. Using Statistics Blindly

Examples will be used here to illustrate the danger of applying raw statistical results without fully understanding their practical impact. Given the nature of the data environment within most businesses, we will discuss how to apply these results in practice. For example, in many cases multicollinearity between variables can cause a predictive model variable to switch sign. Age may have a negative relationship with response within a predictive model, but when analyzing the correlation of age against response, we observe that the relationship is a positive one. Some statisticians will argue that this inconsistency is unimportant and that the solution should be left intact. Meanwhile, according to the practitioner, leaving the solution unadjusted represents a scenario where one is likely to receive sub-optimal results.
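A small simulated illustration of this sign-flip effect (not the author's data): age is positively correlated with response on its own, but once a strongly collinear variable enters the regression, the coefficient on age comes out negative.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 20000
age = rng.normal(45, 10, n)
tenure = 0.9 * age + rng.normal(0, 5, n)              # strongly collinear with age
p = 1 / (1 + np.exp(-(-10 + 0.35 * tenure - 0.12 * age)))
y = rng.binomial(1, p)

# Univariate view: age is positively correlated with response...
print("corr(age, response):", np.corrcoef(age, y)[0, 1])

# ...but inside the model, with the collinear tenure variable present,
# the age coefficient is negative.
X = sm.add_constant(np.column_stack([age, tenure]))
print(sm.Logit(y, X).fit(disp=0).params)              # [intercept, age, tenure]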

5. Creating Solutions that Combine Statistics and Business Sense

For this travel-related company, targeting efforts for acquiring new customers had always been confined to simply purchasing the best list sources. However, the organization wanted to further increase its targeting capabilities beyond just ordering the right lists. Through this increased granularity, some names were included from traditionally poor list sources while some names were excluded from traditionally strong-performing lists. Through our data discovery process, both statistical and non-statistical techniques were utilized to build solutions which significantly improved their existing results (a 90% ROI improvement). (Note: our company received a data mining award for this work: NAMMU Awards in Canada, Nov. 2, 2005.)

6. Automating the Data Discovery Process

In many data mining exercises, the analyst is constantly faced with new data sources which need to be considered as potential sources of new business intelligence. Before the analyst can even analyze this information and its impact on the business, an assessment needs to be done in order to evaluate the quality and integrity of the data. In business terms, this overall assessment is often referred to as a data audit. The tasks required in this process are as follows: loading of the data; evaluation of initial data diagnostics; and frequency distributions of variables.

Loading of the Data: reports are generated which show the number of records in each file as well as the format of each variable. A side report also prints a random sample of 10 records, with all the variables and their associated values within each file. This is done to give the analyst an initial glimpse of the data and its quality.

Evaluation of Initial Data Diagnostics: these reports convey the number of missing values as well as the number of unique values, which helps to provide insights regarding the usefulness of variables in future data mining exercises.

Frequency Distribution of Variables: these reports look at how values distribute within a given variable. The type of distribution can provide insights on creating derived variables.
Typically, in the past, the creation of all these reports was very manual: separate programs were created for each new file in order to produce them. We not only automated the ETL process (Extract, Transform, and Load) but also produced routines to automate some of the initial analytical reports, such as the frequency distribution and data diagnostic reports. With these automated routines, built using SAS macros, tasks that took at least a day or two can now be run in less than an hour.
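The authors implemented these routines as SAS macros; the sketch below shows the same idea in generic form (record counts, a 10-record glimpse, missing-value and distinct-value diagnostics, and frequency distributions), with hypothetical file names:

import pandas as pd

def data_audit(path, max_levels=20):
    """Produce the basic data-audit reports described above for one file."""
    df = pd.read_csv(path, low_memory=False)
    print(f"{path}: {len(df)} records, {df.shape[1]} variables")
    print(df.sample(min(10, len(df))))                 # random 10-record glimpse

    # Initial diagnostics: missing values and distinct values per variable.
    diag = pd.DataFrame({
        "missing": df.isna().sum(),
        "distinct": df.nunique(dropna=True),
        "dtype": df.dtypes.astype(str),
    })
    print(diag)

    # Frequency distributions for low-cardinality variables.
    for col in df.columns:
        if df[col].nunique(dropna=True) <= max_levels:
            print(df[col].value_counts(dropna=False))

for f in ["customer_file.csv", "billing_file.csv"]:    # hypothetical AS/400 extracts
    data_audit(f)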

Customer Validation of Commercial Predictive Models


Tilmann Bruckhaus
Numetrics Management Systems, Inc 20863 Stevens Creek Blvd Cupertino, CA 95014 +1-408-351-5818

William E. Guthrie
Numetrics Management Systems, Inc 20863 Stevens Creek Blvd Cupertino, CA 95014 +1-408-351-5811

tilmannb@numetrics.com

billg@numetrics.com

ABSTRACT
A central need in the emerging business of model-based prediction is to enable customers to validate the accuracy of a predictive product. This paper discusses how data mining models and their inferences can be evaluated from the customer viewpoint, where the customer is not particularly knowledgeable in data mining. To date, academia has focused primarily on the validation of algorithms through mathematical metrics and benchmarking studies. This type of validation is not sufficient in the business context where a specific model must be validated in terms that customers can quickly and effortlessly understand. We describe our predictive business and our customer validation needs. To that end, we discuss examples of customer needs, review issues associated with model validation, and point out how academic research may help to address these business needs.

projects, such as electrical and transistor properties, and use data mining technology to predict key performance indicators for new projects. These key performance indicators include cost, time-to-market and productivity, and they also cover related issues, such as design complexity, effort, staffing and milestones. Numetrics is a pioneer in the emerging market for predictive software and services, and we began assembling our industry database in 1996. Since then, we have accumulated a rich history of experiences with creating predictive products, as well as with selling and supporting them. Customers must be confident that our applications give accurate results to rely on them for business-critical decisions. Validating the predictions is therefore an essential step to acceptance. Customer validation is similar to the traditional mathematical validation of data mining algorithms and predictive models. However, in many ways, customer validation comprises a superset of the difficulties and challenges of mathematical validation. In our experience with applying data mining technology to real industry data and actual business problems, data mining currently focuses predominantly on a small fraction of the entire problem. Kohavi & Provost (2001) capture our own assessment of the situation well when they state: "It should also be kept in mind that there is more to data mining than just building an automated [...] system. [...] With the exception of the data mining algorithm, in the current state of the practice the rest of the knowledge discovery process is manual. Indeed, the algorithmic phase is such a small part of the process because decades of research have focused on automating it - on creating effective, efficient data mining algorithms. However, when it comes to improving the efficiency of the knowledge discovery process as a whole, additional research on efficient data mining algorithms will have diminishing returns if the rest of the process remains difficult and manual. [...] In sum, [...] there still is much research needed - mostly in areas of the knowledge discovery process other than the algorithmic phase." In this paper, we will explore the specific research needs which are related to customer validation of predictive models.

1. INTRODUCTION
Fielded applications of data mining usually require careful validation before deployment. In the case of commercial applications validation is particularly important, and when an entire company is built on predictive modeling technology the need for validation becomes a question of survival. In this emerging business of predictive products and services, a new class of research problems, motivated by real-world business needs, is materializing in the guise of validation by the customer. Once the business relevance of a predictive model has been established, customer validation of model accuracy is arguably the most critical challenge in selling products and services that are based on data mining technology to customers. With this premise, we share observations and lessons learned from practical experiences with a business that is entirely focused on predictive modeling. Numetrics Management Systems serves the semiconductor industry with products and services based on predictive data mining technology to help our customers evaluate, plan, control and succeed with semiconductor and system design projects. Our products capture critical design parameters of finished

2. BACKGROUND
Data mining experts and customers of data mining technology do not necessarily share the same training and background. Data miners typically have thorough knowledge of data mining as well as statistical training. For example, some of the more widely read overview texts on data mining are Berry & Linoff (1997), Han & Kamber (2005), Mitchell (1997), Quinlan (1993), Soukup & Davidson (2002), and Witten & Frank (2005). A good introduction into the more specialized field of model validation can be provided by two insightful papers which compare various methods of evaluating machine learning algorithms on a broad set of benchmarking data sets: Caruana & Niculescu-Mizil (2004) and Caruana & Niculescu-Mizil (2006). There is also a rich field of research into cost-sensitive learning; for example, see Bruckhaus, Ling, Madhavji & Sheng (2004), Chawla, Japkowicz & Kolcz (2004), Domingos (1999), Drummond & Holte (2003), Elkan (2001), Fan, Stolfo, Zhang & Chan (1999), Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy (1996), Japkowicz (2001), Joshi, Agarwal & Kumar (2001), Ling & Li (1998), Ling, Yang, Wang & Zhang (2004), Niculescu-Mizil & Caruana (2001), Ting (2002), Weiss & Provost (2003), and Zadrozny, Langford & Abe (2003). For customers who are familiar with such literature, model validation techniques that are used in academia may be wholly appropriate. However, customers with expertise in data mining are an exception rather than the rule. Most of our customers are engineers and managers who do not necessarily have the background and skills which are taken for granted within the data mining community. Customers will therefore ask questions that reflect the terminology and approaches of their own field of expertise, and data miners may be hard-pressed to provide answers.

These two sets of questions illustrate that customers think in terms of their field of application rather than in terms of data mining, and, more importantly, it is often not clear how to translate one language into the other. In addition to a desire for familiar terminology, there are other peculiarities about customer validation. Some customers are more interested in white box validation whereas others may be more interested in black box validation, where white box validation considers how the model operates, and black box validation only considers the behavior displayed by the application. Both types of customers would like to receive answers to their questions. A further complication, however, is that customers who are interested in white box validation may want to understand the model in terms of engineering equations they are familiar with, and they might want to see a formula which describes how a specific input of interest affects the predictive output of the model. A related question arises in the context of planning: how accurate is the application's estimate of effort for a specific new project that will be critical to the customer's future success? This question cannot be answered on the basis of a single project (one observation), because during the project's planning stage it is obviously impossible to compare its predicted effort to its actual effort. Moreover, model accuracy is measured for a population of cases, and although the model may be very accurate across an entire population, it may provide less accurate results for a single case. Generally, model accuracy has to be measured for a population, and it is not clear whether and how it might be possible to obtain accuracy estimates for specific cases. Such individual cases, or use cases, may be pivotal to the customer's business and to their decision of adopting a predictive application. Another key caveat is that semiconductor design projects evolve throughout their life cycle. What is estimated at the planning stage may change during the project, and it would be useful to account for this evolution in modeling and validation.

3. CUSTOMER APPROACH TO MODEL VALIDATION


The customer view of model validation is different from the academic view because few assumptions can be made as to the statistical and data mining savvy of the customer. Customers must focus on their business needs, and in our case, these needs are those of the semiconductor business. Our product is not so much viewed as a predictive model but rather as a tool that can answer questions about cost, time-to-market, and productivity of a semiconductor design project. Our customers do not concern themselves primarily with the predictive model that is the engine inside the application. Instead they focus on the controls of the application, and in order to receive value from the product, they need the application to address their business needs directly and immediately. When our customers ask "how accurate is your model?" they have a broad and diverse mental picture of what accuracy means. Many of our customers are engineers, so they expect a meticulous response. Some of the questions our customers are likely to ask when they validate a predictive model are listed in Table 1. It is apparent that our customers' questions are mostly specific to the domain of semiconductor design. Even when their questions are not explicitly asked in the terminology of the semiconductor business, our customers would still like to obtain answers in semiconductor design project terminology. For example, when our customers ask "How accurate is the application, and how is accuracy measured?" they would prefer an answer that uses their terminology, like "X% of projects complete within Y% of the predicted completion date", to an answer which does not use their terminology, like "The F-Score is X". For comparison, Table 2 lists questions which are focused on data mining technology.
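One way to report accuracy in the customer's own terms ("X% of projects complete within Y% of the predicted completion date") is sketched below for a hold-out set of completed projects; the file and field names are assumptions, not Numetrics' actual schema:

import pandas as pd

def within_tolerance_rate(validation: pd.DataFrame, tolerance: float = 0.10) -> float:
    """Share of projects whose actual duration falls within +/- tolerance
    of the predicted duration (both expressed in days)."""
    rel_err = (validation["actual_days"] - validation["predicted_days"]).abs() / validation["predicted_days"]
    return (rel_err <= tolerance).mean()

# e.g. "87% of projects completed within 10% of the predicted completion date"
projects = pd.read_csv("validation_projects.csv")
print(f"{within_tolerance_rate(projects, 0.10):.0%} of projects completed "
      f"within 10% of the predicted duration")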

4. ANALYSIS
The issue of customer-oriented evaluation has two components. The first is how to place evaluations in the customer's language, which we will refer to as the domain language requirement. The second component is perhaps more relevant to data mining research: how to make sure that evaluations actually answer the wide range of questions that will be asked by customers. We will refer to this requirement as the evaluation completeness requirement. One avenue to address the domain language requirement might be to use a generic language that can be mapped more-or-less easily into the language for a particular domain. How then do the different questions in Table 1 correspond to more general questions in such a generic language? For example, question 1 is a question of applying the model to particular cases that the customer has in mind. Question 5 talks about the accuracy of the model on a particular subspace of the domain. Question 8 talks about how to deal with missing values when the model is being used, as opposed to when it is being built. Question 9 talks about the inputs actually "used" by the model.

Table 1: Questions focused on the Application Domain


1. Our organization has collected effort metrics on 50 completed projects. How accurately does the application rank the expected effort for those projects?
2. I have experience with version N of the model. With version N+1 available, should I migrate to version N+1? Is it better? How much better is it?
3. How accurate is the application, and how is accuracy measured?
4. I have experienced a challenging situation where the target frequency of a chip was doubled in order to better address market needs. How does the doubling of frequency affect the predicted effort for the design?
5. My organization operates in the automotive semiconductor field, which is a very specialized market. Can the application predict accurately in this specific environment?
6. Can the application predict exactly how long my radio frequency design project will take?
7. I know from experience that the number of capacitors on an analog design is related to effort. How can the application predict effort accurately when the number of capacitors is not entered into the application?
8. The application asks for clock speed but my design is pure analog and has no clocks. What should I enter? Is this model useful to me?
9. I do not track Ring Oscillator Delay, but the application requires this input. Will the application still be useful without this input, and how sensitive is the application to inaccurately entered data?

Table 2: Questions focused on Data Mining Technology


10. What is the area under the Receiver-Operating-Characteristic Curve?
11. What is the optimal number of boosting operations?
12. What is the Lift for this model?
13. What is the F-Score?
14. What is the Cross Entropy of the model?
15. How well would the application perform on the Iris Data Set, Anderson (1935)?
16. How imbalanced was the training data set?
17. Where is the precision/recall break-even point?
18. Does the application use a Support Vector Machine?

Table 3: Analysis of Customer Validation Needs (columns: Customer's Domain-Specific Concepts | Analysis | Key Machine-Learning Concepts)

Customer concept: Estimations based on first-hand experience and intuition (Table 1, items 1, 4, 7, 9).
Analysis: Reference to potential use cases or background expert knowledge. To achieve customer acceptance and win their business, it may be particularly important that the model perform well on these use cases. It may be possible to improve model performance by incorporating appropriate background expert knowledge or by capturing additional training cases.
Key machine-learning concepts: Use Case, Background Expert Knowledge, Training Cases, Validation Cases.

Customer concept: Knowledge of relative actual outcomes (item 1).
Analysis: Statistical analysis of ranking, such as Spearman's rank correlation coefficient, may be a good tool for evaluating model performance.
Key machine-learning concepts: Rank Correlation.

Customer concept: Concern over risk associated with improvement vs. stability (item 2).
Analysis: Specific alternative models to be compared in terms of their performance and quality.
Key machine-learning concepts: Model Comparison.

Customer concept: Business risk due to potentially inaccurate estimations (items 3, 5, 6).
Analysis: References to model quality. Some are more generic, while others are more specific and identify project duration as a target variable.
Key machine-learning concepts: Model Quality, Target Variables.

Customer concept: Intuitions about expected model behavior in response to changes in input values (items 4, 7, 8).
Analysis: Consider use cases where a specific input, such as frequency, is modified and review the impact in terms of the sensitivity of the model.
Key machine-learning concepts: Model Sensitivity, Specific Inputs.

Customer concept: Awareness that unique cases within the domain require special treatment (items 5, 6, 7).
Analysis: There are reportedly clusters of cases in the input space where the model should perform differently from how it performs in other ranges. It may be helpful to use unsupervised learning to discover such clusters and to offer cluster membership as an input. Or, different models or sub-models may be built to address different sub-domains.
Key machine-learning concepts: Case Clustering, Sub-Models, Stratification.

Customer concept: Insights about which parameters should be used for estimation (items 4, 7).
Analysis: A variable considered important by the customer is not an input to the model. Are there one or more proxy variables in the model which account for some of the missing information? Is there an opportunity to build a better model with additional inputs?
Key machine-learning concepts: Missing Variables, Proxy Variables, Adding Inputs.

Customer concept: Required data cannot be collected or does not apply (items 8, 9).
Analysis: Estimation based on partial inputs. Dealing with missing values. Inapplicable inputs.
Key machine-learning concepts: Missing Data in Scoring Records.

Table 3 provides such a mapping from customer language to machine learning language for all sample questions listed in Table 1. In the left-hand column of Table 3, we list customers' domain-specific concepts related to the questions from Table 1. We then match these fragments to approximately comparable ideas and approaches in machine-learning language in the middle column. We also suggest some key machine-learning concepts in the right-hand column which appear to be related to the customer concern in question. Certainly, our list of customer needs and questions and our mapping into the machine learning domain is not exhaustive. For example, some topics which we have not addressed but which are equally important to customer validation of data mining technology are the explanatory power of data mining models (Pazzani 2000), the financial impact of predictive models (Bruckhaus 2006), and information retrieval-related customer needs (Joachims 2002). Understanding customer concerns is a prerequisite for validating practical, commercial data mining applications. In some cases it may be best to address customer validation needs by analyzing the output of a model, while in other cases it may be possible to address customer validation needs directly inside of the data mining algorithm. For example, it may be possible to build predictive models while specifically taking into account the sensitivity of the model to variations of the inputs, as suggested by Engelbrecht (2001) and Castillo et al. (2006). As customer validation needs are better understood, researchers and practitioners will be able to more fully address the evaluation completeness requirement. One method may be to select algorithms and model evaluation procedures which are designed to address specific customer validation needs.
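For the ranking concern in the first rows of Table 3 (a customer with 50 completed projects who wants to know how well the application ranks expected effort), a rank-correlation check such as Spearman's coefficient is straightforward to compute; the column names below are assumptions:

import pandas as pd
from scipy.stats import spearmanr

projects = pd.read_csv("customer_completed_projects.csv")   # e.g. 50 completed projects

rho, p_value = spearmanr(projects["predicted_effort"], projects["actual_effort"])
print(f"Spearman rank correlation between predicted and actual effort: {rho:.2f} (p={p_value:.3f})")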

It is generally not practical to train customers in data mining validation, and what is needed instead is technology for supporting customer validation in practical terms. It appears that customers are interested in model-level accuracy, the effect of specific inputs on model output, as well as in a great variety of domain-specific use cases. The customer view of model validation is at once very similar and very different from the data miners view, and it is our hope that technologies will evolve that will make it easy to cross the chasm between the two.

6. REFERENCES
[1] Anderson, E. 1935, "The irises of the Gasp peninsula", Bulletin of the American Iris Society 59, 2-5. [2] Berry, M.J.A., and Linoff, G. 1997. Data Mining Techniques: For Marketing, Sales, and Customer Support. John Wiley & Sons. [3] Bruckhaus, T. 2006 (forthcoming). The Business Impact of Predictive Analytics. Book chapter in Knowledge Discovery and Data Mining: Challenges and Realities with Real World Data. Zhu, Q, and Davidson, I., editors. Idea Group Publishing, Hershey, PA [4] Bruckhaus, T., Ling, C.X., Madhavji, N.H., and Sheng, S. 2004. Software Escalation Prediction with Data Mining. Workshop on Predictive Software Models (PSM 2004), A STEP Software Technology & Engineering Practice. [5] Caruana, R., and Niculescu-Mizil, A. 2004, Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22-25, 2004. [6] Caruana, R., and Niculescu-Mizil, A., 2006, An Empirical Comparison of Supervised Learning Algorithms. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA. [7] Castillo, E., Guijarro-Berdias, B., Fontenla-Romero, O., Alonso-Betanzos, A., 2006, A Very Fast Learning Method for Neural Networks Based on Sensitivity Analysis, Journal of Machine Learning Research, 7(Jul), pp 1159-1182. [8] Chawla, N.V., Japkowicz, N., and Kolcz, A. eds. 2004. Special Issue on Learning from Imbalanced Datasets. SIGKDD, 6(1): ACM Press. [9] Domingos, P. 1999. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, 155-164, ACM Press. [10] Drummond, C., and Holte, R.C. 2003. C4.5, Class Imbalance, and Cost Sensitivity: Why under-sampling beats over-sampling. Workshop on Learning from Imbalanced Datasets II. [11] Elkan, C. 2001. The Foundations of Cost-sensitive Learning. In Proceedings of the International Joint Conference of Artificial Intelligence (IJCAI 2001), 973-978. [12] Engelbrecht, A. P., 2001, Sensitivity Analysis for Selective Learning by Feedforward Neural Networks", Fundamenta Informaticae, 45(4), pp 295-328.

5. CONCLUSIONS
In this paper, we have reviewed customer requirements for the evaluation of data mining models. Common themes in customer model validation include: sensitivity, the model's response to changes in input values, in sign and magnitude; range, the specific range where the model's inputs are valid; parameters used, which of the thousand possible factors the model incorporates directly, and which are covered by proxies; and robustness, how much of the real-world domain space is covered by the model. There are more, but these points need systematic explanation as to how to apply tests and how to interpret results. Defining accuracy in mathematical terms is very simple, but capturing the various needs and ideas that customers connect to accuracy is much more difficult. There are machine learning techniques available to address customer validation needs, but a comprehensive framework is lacking to date. Certainly one complication is the need to get the evaluation into a form that customers can use. The customer will have to work through various issues to validate new models, and support from the vendor is needed. In fact, what the vendor has to do to build and validate a model is exactly what a customer has to do to evaluate the resulting model. It would be attractive to use a common tool set for model building and to support customer validation. It seems reasonable to invest in formalizing this process and sharing it with customers. Although it would require analysis across domains, it would clearly be interesting to understand how often these different sorts of evaluation questions arise.
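A simple black-box sensitivity probe of the kind mentioned above (perturb one input, observe the sign and magnitude of the change in the prediction) can be run against any scoring function; this sketch assumes a predict(features: dict) -> float callable exposed by the application, and the feature names are placeholders:

def sensitivity(predict, base_case: dict, feature: str, rel_step: float = 0.10):
    """Relative change in the prediction when one input is increased by rel_step."""
    perturbed = dict(base_case)
    perturbed[feature] = base_case[feature] * (1 + rel_step)
    base_pred = predict(base_case)
    return (predict(perturbed) - base_pred) / base_pred

# Example: how does a 10% increase in target frequency move the effort estimate?
# (predict and the feature names stand in for the real application's interface.)
# print(sensitivity(predict, {"frequency_mhz": 800, "gate_count": 2.5e6}, "frequency_mhz"))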

[13] Fan, W., Stolfo, S.J., Zhang, J., and Chan, P.K. 1999. AdaCost: Misclassification Cost-sensitive Boosting. In Proceedings of the Sixteenth International Conference on Machine Learning, 97-105. [14] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. (editors). 1996. Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press. [15] Han, J and Kamber, M. 2005. Data Mining, Second Edition : Concepts and Techniques (The Morgan Kaufmann Series in Data Management Systems) [16] Japkowicz. N. 2001. Concept-Learning in the Presence of Between-Class and Within-Class Imbalances, In Proceedings of the Fourteenth Conference of the Canadian Society for Computational Studies of Intelligence (AI'2001). [17] Joachims, T., 2002. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, Edmonton, Alberta, Canada, pp 133 - 142 [18] Joshi, M.V., Agarwal, R.C., and Kumar, V. 2001. Mining needles in a haystack: classifying rare classes via two-phase rule induction. In Proceedings of the SIGMOD01 Conference on Management of Data. [19] Kohavi, R., and Provost, F., January 2001, Applications of Data Mining to E-commerce (editorial), Applications of Data Mining to Electronic Commerce. Special issue of the International Journal Data Mining and Knowledge Discovery. [20] Ling, C.X., and Li, C. 1998. Data Mining for Direct Marketing: Specific Problems and Solutions. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), 73-79.

[21] Ling, C.X., Yang, Q., Wang, J., and Zhang, S. 2004. Decision trees with minimal costs. In Proceedings of International Conference on Machine Learning (ICML). [22] Mitchell, T. 1997. Machine Learning.McGraw-Hill Science / Engineering / Math; 1 edition. [23] Niculescu-Mizil, A., and Caruana, R. 2001. Obtaining Calibrated Probabilities from Boosting. AI Stats. [24] Pazzani, M. J. (2000), Knowledge Discovery from Data?, IEEE Intelligent Systems, March/April 2000, 10-13. [25] Quinlan, J.R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann. [26] Soukup, T. and Davidson, I. 2002. Visual Data Mining: Techniques and Tools for Data Visualization and Mining. Wiley. [27] Ting, K.M. 2002. An Instance-Weighting Method to Induce Cost-sensitive Trees. IEEE Transactions on Knowledge and Data Engineering, 14(3):659-665. [28] Weiss, G., and Provost, F. 2003. Learning when Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research 19: 315-354. [29] Witten, I.H., and Frank, E. 2005. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco. 2nd Edition. [30] Zadrozny, B., Langford, J., and Abe, N. 2003. Cost-sensitive Learning by Cost-Proportionate Example Weighting. In Proceedings of International Conference of Data Mining (ICDM).

A boosting approach for automated trading


German Creamer
Center for Computational Learning Systems Columbia University 475 Riverside MC 7717 New York, NY 10115

Yoav Freund
Department of Computer Science University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0114

gcreamer@cs.columbia.edu

ABSTRACT
This paper describes an algorithm for short-term technical trading. The algorithm was tested in the context of the Penn-Lehman Automated Trading (PLAT) competition. The algorithm is based on three main ideas. The first idea is to use a combination of technical indicators to predict the daily trend of the stock; the combination is optimized using a boosting algorithm. The second idea is to use constant rebalanced portfolios within the day in order to take advantage of market volatility without increasing risk. The third idea is to use limit orders rather than market orders in order to minimize transaction costs.

yfreund@cs.ucsd.edu
Based on this information, the trader anticipates the direction of the market using a boosting algorithm, and then takes a long or short position if it expects that the market will go up or down, respectively. The second idea is to use constant rebalanced portfolios [1] within the day in order to take advantage of market volatility without increasing risk. This part of the trading algorithm places limit orders to ensure that there is a constant mix between the value of the stocks and of the portfolio. The third idea is to use limit orders rather than market orders in order to minimize transaction costs. The trader accesses the order book and places limit orders outside the bid-ask spread to capture the rebates that ECNs such as ISLAND pay to the trader whose submission was in the order book at the moment of execution.2 The rest of the paper is organized as follows: section 2 introduces boosting; section 3 presents the PLAT competition and our trading strategy; section 4 presents the results of the participation of our trading algorithm in the PLAT competition; section 5 introduces improvements to our algorithm, such as the integration of the market maker strategy; and section 6 concludes and discusses future lines of research.

1. INTRODUCTION

The recent development of electronic communication networks (ECNs), or electronic financial markets, has allowed direct communication between investors, avoiding the additional cost of intermediaries such as the specialists of the New York Stock Exchange (NYSE). A very important aspect of the ECNs is the access and publication of the real-time limit order book. For many years such access was not available to most traders. For example, in the NYSE only specialists could observe the entries of the limit order book. Other investors could only see the price and number of shares of the executed orders. Electronic markets maintain a centralized order book for each traded stock. This book maintains lists of all active limit orders and is used as the basis for matching buyers and sellers. By making the content of this book accessible to traders, electronic markets provide a very detailed view of the state of the market and allow for new and profitable trading strategies. For example, Kakade, Kearns, Mansour, and Ortiz in [14] present a competitive algorithm using volume weighted average prices (VWAP).1 Kavajecz and Odders-White [17] study how technical analysis indicators can capture changes in the state of the limit order book. In this paper we present an automated trading algorithm that was tested in the context of the Penn-Lehman Automated Trading (PLAT) competition. The algorithm is based on three main ideas. The first idea is to use a combination of technical indicators to predict the daily trend of the stock. The trading algorithm uses the stock price of the previous ninety days, and the open price of the current trading day, to calculate a set of well-known technical analysis indicators.
1 VWAP is calculated using the volumes and prices present on the order book.

2. METHODS

2.1 Boosting

Adaboost is a machine learning algorithm invented by Freund and Schapire [12] that classifies its outputs by applying a simple learning algorithm (weak learner) to several iterations of the training set, where the misclassified observations receive more weight. Friedman et al. [13], followed by Collins, Schapire, and Singer [6], suggested a modification of Adaboost called Logitboost. Logitboost can be interpreted as an algorithm for step-wise logistic regression. This modified version of Adaboost assumes that the labels y_i were stochastically generated as a function of the x_i. It then includes F_{t-1}(x_i) in the logistic function to calculate the probability of y_i, and the exponent of the logistic function becomes the weight of the training examples. Figure 1 describes Logitboost.
2 A market order is an order to buy an asset at the current market price. A buy (sell) limit order is executed only at a price less (greater) or equal than the limit price. The ECNs register the orders in the order book which is continuously updated with new orders or when an order is executed.

The bid-ask spread refers to the difference between the bid price, or the highest price that a trader is willing to pay for an asset, and the ask price, or the lowest price at which a trader is willing to sell an asset. A long position is the result of buying a security expecting that the value of the underlying asset goes up. A short position is the result of selling a borrowed security expecting that the value of the underlying asset goes down. Technical analysis is a method to forecast security prices and trends using patterns of prices, volumes, or volatility (see the appendix).


For t = 1, \dots, T:
    w_i^t = \frac{1}{1 + e^{y_i F_{t-1}(x_i)}}
    Get h_t from weak learner
    \alpha_t = \frac{1}{2} \ln \frac{\sum_{i:\, h_t(x_i)=1,\, y_i=1} w_i^t}{\sum_{i:\, h_t(x_i)=1,\, y_i=-1} w_i^t}
    F_{t+1} = F_t + \alpha_t h_t

2. Traders do not have a limit in terms of the number of shares that they can hold. However, positions must be liquidated at the end of the day. Any long position will completely lose its value, and any short position must pay a penalty of twice its market value.

3. Transaction costs follow ISLAND's fee/rebate policy: when a trade is executed, the party whose order was in the order book receives a rebate of $0.002, and the party that submitted the incoming order pays a transaction fee of $0.003.

During this competition, participants were split into two groups: red and blue. Our agent was team 1 in the red group. The competition also included an agent per team that bought and sold a large number of shares each day following the volume weighted average price (VWAP).

Figure 1: The Logitboost algorithm. y_i is the binary label to be predicted, x_i corresponds to the features of an instance i, w_i^t is the weight of instance i at time t, and h_t and F_t(x) are the prediction rule and the prediction score at time t, respectively.

We implemented boosting with a decision tree learning algorithm called an alternating decision tree (ADT) [11]. In this algorithm, boosting is used to obtain the decision rules and to combine them using a weighted majority vote (see Creamer and Freund [9] for a previous application to several finance problems). The importance of features used to predict earnings surprises and cumulative abnormal returns may change significantly in different periods of time. As we do not know in advance what the most important features are, and because of boosting's feature selection capability, its error bound proofs [12], its interpretability, and its capacity to combine quantitative and qualitative variables, we decided to use boosting as our learning algorithm.
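A compact sketch of the Logitboost loop in Figure 1, using decision stumps as the weak learner; this is an illustrative re-implementation under those assumptions, not the MLJAVA/ADT code used by the authors:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def logitboost(X, y, T=50):
    """Sketch of the loop in Figure 1; y must be coded in {-1, +1}."""
    y = np.asarray(y, dtype=float)
    F = np.zeros(len(y))
    ensemble = []
    for _ in range(T):
        w = 1.0 / (1.0 + np.exp(y * F))          # instance weights (Figure 1)
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        pos = w[(pred == 1) & (y == 1)].sum() + 1e-12
        neg = w[(pred == 1) & (y == -1)].sum() + 1e-12
        alpha = 0.5 * np.log(pos / neg)          # alpha_t from Figure 1
        F += alpha * pred                        # F_{t+1} = F_t + alpha_t h_t
        ensemble.append((alpha, stump))
    return ensemble

def score(ensemble, X):
    return sum(alpha * h.predict(X) for alpha, h in ensemble)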

3. TRADING STRATEGIES AND PLAT COMPETITION

3.1 Automated trading PLAT

Our trading algorithm is tested in the Penn-Lehman Automated Trading Project3 (see Kearns and Ortiz [18]). This project, which is a partnership between the University of Pennsylvania and the quantitative trading division of Lehman Brothers, simulates ISLAND, one of the major ECNs, and has held trading competitions since the Fall of 2002. The simulator that supports PLAT captures price and volume information from ISLAND about every 3 seconds, and provides an architecture where clients can connect and submit limit orders. During the competition of April-May 2004, Microsoft (MSFT) was the only stock traded. The simulator creates its own order book, receiving the information of ISLAND and mixing it with the orders of its clients. The simulator generates detailed information about the position of each trader: market and simulator prices, outstanding shares, present value, and profit and loss position. PLAT is different from the well-known trading agent competition (TAC) run at the University of Michigan [24] because of PLAT's strict limitation to the financial market and because only one stock is traded: Microsoft. The classic TAC game is based on the travel industry market and, since 2003, it has also included a supply chain management game. Wellman et al. in [24] report recent results of TAC. Both competitions, PLAT and TAC, are similar in terms of offering a platform and software for agents to develop their trading strategies.

3.2 Trading algorithm

Our basic approach is to separate our analysis of the market into two time scales. The long time scale is on the order of days or hours; the short time scale is on the order of seconds or minutes. When operating on the long time scale, we use a variety of technical indicators (see the appendix) to predict price trends. In other words, we try to predict whether the stock price will go up or down in the next day or next hour. When operating on the short time scale, we stick to the prediction given by the long time scale analysis and place orders in a way that takes maximal advantage of volatility and minimizes transaction costs. In more detail, our long time scale analysis is based on an adaptive combination of technical analysis indicators. The combination is optimized using the boosting learning algorithm and the past month as training data. The short time-scale trading is based on constant rebalanced portfolios, with a time-based profile selected according to the long-term analysis. Finally, the actual market orders are generated in a way designed to take advantage of the transaction cost policy used in ISLAND, one of the major ECNs and the one used as the data source for the PLAT competition. We call our trading algorithm CRP TA because it implements a hybrid strategy of a) forecasting the daily stock price with Logitboost using technical indicators (TA), and b) intra-day trading following a constant rebalanced portfolio (CRP) strategy.

3.1.1 PLAT Competition

We designed the trading algorithm CRP TA that participated in the PLAT competition run in the period April 26 to May 7, 2004. The rules used during this competition were:4

1. The performance of each trader is measured by the Sharpe ratio, calculated from the mean return and standard deviation of the 10-day profit and loss positions.
3 This description of PLAT refers to Spring 2004, when we participated in the competition.
4 Further explanation of the PLAT project can be found at <http://www.cis.upenn.edu/~mkearns/projects/plat.html>.

3.2.1 Applying Logitboost to the selection of technical trading rules

The trading algorithm CRP TA forecasts the direction of the stock price using ADTs, which are implemented with Logitboost. We introduced this algorithm in section 2.1. CRP TA trains ADTs using the following technical analysis indicators of the previous ninety days, described in the appendix: simple moving average, average directional movement index, directional movement index, Bollinger bands, moving average convergence divergence, relative strength index, stochastic indicators, and money flow index. We calculated these indicators using R and its financial engineering package called Rmetrics.5 The instances are labeled using the following rules:

Buy,  if P^c >= P^o + delta
Sell, if P^c <= P^o - delta
Hold, otherwise

where delta is a constant that at least covers the transaction costs ($0.003), and P^o and P^c are the open and close price, respectively.
5 Information about R and Rmetrics can be found at <http://cran.r-project.org> and <http://www.rmetrics.org>, respectively.

Logitboost generates a new set of trading rules. Hence, instead of using the rules that each technical analysis indicator suggests, Logitboost defines the appropriate rules based on market conditions and the combination of a list of very well-known technical indicators.

3.2.2 Constant rebalanced portfolio

Constant rebalanced portfolio (CRP), known in the financial world as constant mix, is a well-known strategy in the investment community. Kelly [19] showed that for individuals who invest the same proportion of their money in a specific asset (the constant rebalanced portfolio), their portfolio value will increase (or decrease) exponentially. Kelly introduced the log-optimal portfolio as the one which achieves the maximum exponential rate of growth. Algoet and Cover [1] showed that if the market is stationary ergodic, the maximum capital growth rate of a log-optimal portfolio is equivalent to the maximum expected log return. Cover [8] and later many other researchers, such as Vovk and Watkins [23], Cover and Ordentlich [7], Blum and Kalai [2], and Kalai and Vempala [15], extended CRP to the concept of universal constant rebalanced portfolio. CRP simply requires that traders maintain a fixed proportion of stocks to portfolio value. If the stock price increases (decreases), the stock to portfolio value ratio increases (decreases), and part of the stocks must be liquidated (bought). This strategy works better when the stock price is unstable, so the trader is able to sell when the price is high and buy when the price is low. We tested the trading algorithm CRP TA in the PLAT competition run between April 26 and May 7, 2004. Every day of the competition, CRP TA trains an ADT with Logitboost using the information of the last ninety days and then, using P^o, takes a long position (50% of the portfolio invested in MSFT), a short position (-25% of the portfolio), or does not trade. During the first half hour CRP TA builds its position, and during the half hour before the market closes, CRP TA liquidates its position. There is an asymmetry between the long position (50%) and the short position (-25%) because of the higher penalty that a trader with a short position would pay during the competition. The training of ADTs was done using the MLJAVA package.6 The trading algorithm CRP TA trades during the day, balancing the portfolio according to a goal mix, as Figure 2 explains. CRP TA aims to increase revenues by sending limit orders, expecting that these orders arrive before the counterparty's orders when the orders are executed. In this case, the trader receives rebates and avoids paying fees.
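A minimal sketch of the constant-rebalancing bookkeeping described above (and formalized in Figure 2 below): given a goal mix, compute the dollar amount to trade so that the stock-to-portfolio-value ratio returns to the goal. The threshold and limit-order placement details of CRP TA are omitted:

def rebalance_order(shares: float, price: float, cash: float, q_goal: float) -> float:
    """Dollar value of stock to buy (positive) or sell (negative) so that
    stock value / total portfolio value equals q_goal after the trade."""
    stock_value = shares * price
    wealth = stock_value + cash          # net portfolio value W
    q_now = stock_value / wealth         # current mix q_t
    return (q_goal - q_now) * wealth     # dollars to trade to reach the goal mix

# Example: long target of 50% after the stock has rallied intra-day.
print(rebalance_order(shares=1000, price=27.0, cash=20000, q_goal=0.50))
# A positive number means buy; a negative number means sell part of the position.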

Input: Set of price series (open (P_o), close (P_c), high (P_h), low (P_l)) and volume.
ε is a constant that at least covers the transaction costs ($0.003).
q_g is the goal mix of stocks and cash for MSFT.

Forecast with machine learning algorithm (Logitboost) and technical indicators (TA):
1. At the beginning of the day, train an ADT with Logitboost using a training set with technical analysis indicators and labels (see the appendix), calculated with the price and volume series of the last 90 days.
2. Forecast the trend of P_c using P_o and the technical analysis indicators for the trading day, and take one of the following positions for the single stock (MSFT) in the first half hour of trading:
   Long (q_g = 50%), if E(P_c) >= P_o + ε
   Short (q_g = -25%), if E(P_c) <= P_o - ε
   Hold, otherwise

Intra-day constant rebalanced portfolio (CRP):
3. Send simultaneously buy and sell limit orders for δ according to:
   Submit a buy limit order for δ, if q_t < q_g - δ/W
   Submit a sell limit order for δ, if q_t > q_g + δ/W
   Hold, otherwise
   where W is the net portfolio value, q_t is the current mix of stocks and cash for MSFT, and δ is the amount of dollars to buy or sell in order to reach q_g.
4. If q_t != q_g after 60 ticks (about one minute), cancel the limit orders, submit market orders to reach the goal mix, and submit new limit orders.
5. Liquidate the position in the last half hour before the market closes.

Output: Profit/loss of the algorithm

Figure 2: The CRP TA algorithm.

To understand the intra-day dynamics, we present the results of trading days when the market is up and when it is down (Figure 4). May 3rd was a very volatile day and the market was up, while CRP TA took a short position. The losses of the short position were partially compensated by the benefits of intraday trading thanks to the CRP strategy. On April 28th the market went down. CRP TA assumed a short position that led to a profitable result. This last result is evident in the top panel of Figure 4, which shows an important difference between the portfolio value index and the index price or buy-and-hold (B&H) position.

4.

PLAT COMPETITION RESULTS

After ten trading days of participating in the PLAT competition, CRP TA obtained a return of $27,686 and a Sharpe ratio of 0.83. Its performance was the second best in its group, as Figure 3 shows. CRP TA forecasted a short or long position correctly on eight out of the ten days of the PLAT competition. These results were better than the results of a simulation over a sample of 840 days in which the predictor was trained with the information of the last 90 days; in that case the test error was 48.8%. This difference could be explained because the optimization of the parameters used to calculate the technical indicators at the beginning of the competition might not have been adequate for other periods. We spent a significant amount of time fine-tuning the parameters used for the forecast. Additionally, the trader did not get its position at the open price as the above simulation did; it reached its position after the first half hour of trading.
6 If interested in using MLJAVA, please contact yfreund@cs.ucsd.edu.

During each trading day there were a large number of trading operations. However, the process of adjusting the portfolio to reach the goal mix affected the results, because the trader CRP TA paid more in fees than it received in rebates, as the bottom of Figure 4 shows. The winner of CRP TA's group during the PLAT competition, team 3, acted as a market maker, placing limit orders outside the current spread. Hence, an important share of CRP TA's orders were plausibly traded with this team; however, this trader did not pay fees and only received rebates, because its orders were limit orders that most of the time arrived before CRP TA's orders. If CRP TA could incorporate this market-maker strategy, its results would probably improve, as we show in the next section.

5.

IMPROVED ORDER STRATEGY

After the PLAT competition, we integrated the market maker strategy into CRP TA; we call the modified version of the algorithm the Market maker CRP TA. The most important aspect of the revised version of the algorithm is that the orders should be executed as limit orders, and not as market orders, as follows: Market maker CRP TA starts with a balanced position according to the proportion of shares over portfolio value established as a goal (q_g). Then it simultaneously sends a buy limit order at a price slightly below ($0.005) the price at the top of the buy order book (P_BuyB), and a sell limit order at a price slightly above ($0.005) the price at the top of the sell order book (P_SellB). If an order is not completely filled within ten minutes of being issued, the existing limit orders are canceled and limit orders are reissued. In all cases, orders are reissued for the amount necessary to reach the goal mix of stocks and cash (see Figure 5).

Team     Sharpe ratio   26/4     27/4     28/4     29/4     30/4     3/5      4/5      5/5      6/5      7/5      Total
Team 1      0.8334      2249     -151     7527     7198     6628    -2523     1567     2238     1885     1068     27687
Team 2     -0.1619        27     -513    -3062     1219     3204     -153      327       15       61    -4601     -3476
Team 3      1.1221      3574     7083     -127    -2832     2040     6691     4335     6108     5915     3061     35847
Team 4     -0.4232    -44962     3147    -1185    -1832     -988   -88302      946     1129     1907     2316   -127825
Team 5    -12.6200    -9.E+06  -8.E+06  -9.E+06  -8.E+06  -9.E+06  -7.E+06  -8.E+06  -8.E+06  -8.E+06  -7.E+06   -8.E+07
Team 6      0.7213      1045     4729      243    -6694    12508    11065    -2377     5708     9271    11755     47252
Team 7      2.4963      3433     1374     2508     2928     3717     3444     1322     3300     2199      966     25190
Team 8      0.7559       271      538     -242     -248       13      636      386      452      461      121      2387
Team 9      0.5778      1307     2891    -1563    -1349    -1339     3230     1850     2037     2465     1041     10569
Team 10     0.0432     -4655    -1370     2178     2820     2766     2961     2665    -5746     2402    -2545      1475
Team 11   -12.5931    -9.E+06  -8.E+06  -7.E+06  -8.E+06  -8.E+06  -7.E+06  -8.E+06  -8.E+06  -8.E+06  -8.E+06   -8.E+07

Figure 3: Profit and loss of the PLAT competition for all players. The competition was split into the first five teams (red group) and the next five teams (blue group). The first column shows the Sharpe ratio for each team during the whole competition. The remaining columns show the daily profits or losses for each team expressed in US$. CRP TA is team 1. Teams 5 and 11 are artificial traders that bought and sold large volumes of shares following the VWAP.

[Figure 4, panel (a): Market up. Intraday plots for May 3, 2004: index price and portfolio value index (base = 1); constant mix vs. updated mix of stock shares; and transaction costs (fees and rebates) over the trading hours 9 to 16.]
[Figure 4, panel (b): Market down. Intraday plots for April 28, 2004: index price and portfolio value index (base = 1); constant mix vs. updated mix of stock shares; and transaction costs (fees and rebates) over the trading hours 9 to 16.]
We ran this new trading strategy and the original CRP TA strategy during the period January 5-9, 2004. We present the results of January 8th for the Market maker CRP TA strategy and for the CRP TA agent in Figure 6. During the week of January 5-9, the Sharpe ratio was 0.03 for the Market maker CRP TA strategy and -0.28 for the CRP TA strategy. The bottom of Figure 6 shows that Market maker CRP TA received more in rebates than the amount it had to pay in fees. This difference helped to improve the financial result of the algorithm, which is the major shortcoming of the CRP TA strategy. Another shortcoming of the CRP TA strategy is that it takes a high risk when it keeps only a short or long position during the day. A variation of the CRP TA strategy could be the creation of a portfolio that holds a long and a short position simultaneously. The scores obtained from Logitboost to forecast the stock price could be used to weight the long and short positions; hence, the position with the higher score would have a higher weight. A market-neutral portfolio could also be obtained by using the same proportion of stocks to portfolio value for the short and long positions. We also tried this final alternative for the week of January 5-9, 2004, and the Sharpe ratio deteriorated to -2.06. Obviously, this alternative misses the benefit of market forecasting using ADTs.
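
The sketch below illustrates the limit-order placement used by the revised strategy: rest a buy order slightly below the top of the buy book and a sell order slightly above the top of the sell book, so that orders add liquidity and earn rebates rather than pay fees. The order-book fields and classes are hypothetical; this is not the PLAT simulator API.

```python
# Sketch of the market-maker order placement described above. The order-book
# and order objects are hypothetical; only the pricing rule is the point.
from dataclasses import dataclass

OFFSET = 0.005  # dollars away from the top of the order book

@dataclass
class LimitOrder:
    side: str      # "buy" or "sell"
    price: float
    dollars: float

def make_order(best_bid: float, best_ask: float, dollars_to_trade: float,
               current_mix: float, goal_mix: float):
    """Return the limit order (if any) that moves the mix toward the goal."""
    if current_mix < goal_mix:
        return LimitOrder("buy", round(best_bid - OFFSET, 3), dollars_to_trade)
    if current_mix > goal_mix:
        return LimitOrder("sell", round(best_ask + OFFSET, 3), dollars_to_trade)
    return None  # already at the goal mix: no order

# Example: under-invested relative to the goal, so rest a buy below the bid.
print(make_order(best_bid=26.50, best_ask=26.52, dollars_to_trade=1000,
                 current_mix=0.40, goal_mix=0.50))
```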

Figure 4: Representative intraday results of the PLAT competition for CRP TA when the market is up (a) and down (b). Top graphs compare the portfolio value index with an index price, or a simple buy-and-hold (B&H) position. Middle graphs compare the goal or constant mix of stocks and cash with the updated mix according to the trading algorithm. The steeper curve at the beginning and at the end of the trading day is the period when CRP TA builds and liquidates its goal position. Bottom graphs include fees (> 0) and rebates (< 0). The differences between rebates and fees are the transaction costs.

6.

CONCLUSIONS

In this paper we show that the constant rebalanced portfolio or constant mix strategy can improve if a classifier can anticipate the direction of the market: up, down, or no change. Additionally, transaction costs play a central role in improving performance. Instead of an automatic rebalance of the portfolio, the results of the PLAT competition indicate that if the CRP strategy is implemented only with limit orders, its results improve because of the rebates. We used very well-known technical indicators such as moving averages or Bollinger bands. Therefore, the capacity to anticipate unexpected market movements is reduced, because many other traders might be trying to profit from the same indicators. In our case, this effect is reduced because we tried to discover new trading rules using Logitboost instead of following the trading rules suggested by each indicator. However, we are aware that our predictor may improve if we transform the technical indicators into more accurate ratios or select more informative indicators, such as the effect of current news on stock prices.

Our experience in adapting boosting to a trading algorithm is that a simple and straightforward application of boosting to financial time series does not bring a significant improvement in forecasting. There are other well-known methods used for finance problems, such as logistic regression, that have a similar performance to boosting [9]. However, boosting can work with a mixture of quantitative and qualitative indicators, and also with non-linear time series. Furthermore, boosting can be used to understand the non-linear relationship between the variables, and can automatically select the best features. Our experiments showed that the boosting approach is able to improve the predictive capacity when indicators are combined and aggregated as a single predictor. Additionally, we recognize that boosting or other learning algorithms used to forecast time series may have predictive ability for only a certain period of time. The randomness and continuous change of the financial market may render ineffective a trading strategy based on boosting or any other predictor. Hence, our algorithm can be enriched by the introduction of risk management mechanisms in order to change strategy or liquidate its position if the market behaves in unexpected ways.

Input: Set of price series (open (P_o), close (P_c), high (P_h), low (P_l)) and volume.
ε is a constant that at least covers the transaction costs ($0.003).
q_g is the goal mix of stocks and cash for MSFT.
The minimum amount above or below the top price of the order books is $0.005.

Forecast with machine learning algorithm (Logitboost) and technical indicators (TA):
1. At the beginning of the day, train an ADT with Logitboost using a training set with technical analysis indicators and labels (see the appendix), calculated with the price and volume series of the last 90 days.
2. Forecast the trend of P_c using P_o and the technical analysis indicators for the trading day, and take one of the following positions for the single stock (MSFT) in the first half hour of trading:
   Long (q_g = 50%), if E(P_c) >= P_o + ε
   Short (q_g = -25%), if E(P_c) <= P_o - ε
   Hold, otherwise

Intra-day market maker constant rebalanced portfolio (CRP):
3. Send simultaneously buy and sell limit orders for δ according to:
   Buy limit order for δ at P_B = P_BuyB - $0.005, if q_t < q_g - δ/W
   Sell limit order for δ at P_S = P_SellB + $0.005, if q_t > q_g + δ/W
   Hold, otherwise
   where W is the net portfolio value, q_t is the current mix of stocks and cash for MSFT, δ is the amount of dollars to buy or sell in order to reach q_g, P_B and P_S are the prices of the long and short limit orders, and P_BuyB and P_SellB are the prices at the top of the buy and sell order books, respectively.
4. If q_t != q_g after 600 ticks (about 10 minutes), cancel and resubmit the limit orders to reach the goal mix.
5. Liquidate the position in the last half hour before the market closes.

Output: Profit/loss of the algorithm

Figure 5: The Market maker CRP TA algorithm.

[Figure 6, panel (a): Market maker CRP TA, January 8, 2004. Intraday plots of index price and portfolio value index (base = 1); constant mix vs. updated mix of stock shares; and fees and rebates over the trading hours 9 to 16.]

[Figure 6, panel (b): CRP TA, January 8, 2004. Same intraday plots as panel (a).]

Figure 6: Representative intraday results for Market maker CRP TA (a) and CRP TA (b) on January 8th, 2004. Top graphs compare the portfolio value index with an index price, or a simple buy-and-hold position. Middle graphs compare the goal or constant mix of stocks and cash with the updated mix according to the trading algorithm. The steeper curve at the beginning and at the end of the trading day is the period when the trading algorithms build and liquidate their goal positions. Bottom graphs present fees (> 0) and rebates (< 0). The differences between fees and rebates are the transaction costs.

Acknowledgements
The authors thank Michael Kearns for inviting us to participate in the PLAT competition and for his invaluable comments about our current work. GC also thanks Sal Stolfo for comments about the trading strategy, and Luis Ortiz and Berk Kapicioglu for their help and suggestions about running the experiments in PLAT.

7.

REFERENCES

[1] P. H. Algoet and T. M. Cover. Asymptotic optimality and asymptotic equipartition properties of log-optimum investment. Annals of Probability, 16:876-898, 1988.
[2] A. Blum and A. Kalai. Universal portfolios with and without transaction costs. Machine Learning, 35(3):193-205, 1999. Special Issue for COLT 97.
[3] J. A. Bollinger. Bollinger on Bollinger Bands. McGraw-Hill, New York, 2001.
[4] T. S. Chande and S. Kroll. The New Technical Trader: Boost Your Profit by Plugging into the Latest Indicators. John Wiley & Sons, Inc., New York, 1994.
[5] J. F. Clayburg. Four Steps to Trading Success: Using Everyday Indicators to Achieve Extraordinary Profits. John Wiley & Sons, Inc., New York, 2001.
[6] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. In Computational Learning Theory, pages 158-169, 2000.
[7] T. Cover and E. Ordentlich. Universal portfolios with side information. IEEE Transactions on Information Theory, 42(2), March 1996.
[8] T. M. Cover. Universal portfolios. Mathematical Finance, 1(1):1-29, 1991.
[9] G. Creamer and Y. Freund. Predicting performance and quantifying corporate governance risk for Latin American ADRs and banks. In Proceedings of the Financial Engineering and Applications Conference, MIT-Cambridge, 2004.
[10] J. F. Ehlers. Rocket Science for Traders: Digital Signal Processing Applications. John Wiley & Sons, Inc., New York, 2001.
[11] Y. Freund and L. Mason. The alternating decision tree learning algorithm. In Machine Learning: Proceedings of the Sixteenth International Conference, pages 124-133, 1999.
[12] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[13] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 38(2):337-374, April 2000.
[14] S. M. Kakade, M. Kearns, Y. Mansour, and L. E. Ortiz. Competitive algorithms for VWAP and limit order trading. In Proceedings of the 5th ACM Conference on Electronic Commerce, pages 189-198. ACM Press, 2004.
[15] A. Kalai and S. Vempala. Efficient algorithms for universal portfolios. Journal of Machine Learning Research, 3:423-440, 2003.
[16] J. Katz and D. McCormick. The Encyclopedia of Trading Strategies. McGraw-Hill, New York, 2000.
[17] K. A. Kavajecz and E. R. Odders-White. Technical analysis and liquidity provision. Review of Financial Studies, 2004.
[18] M. Kearns and L. Ortiz. The Penn-Lehman automated trading project. IEEE Intelligent Systems, 2003.
[19] J. L. Kelly. A new interpretation of information rate. Bell System Technical Journal, 35:917-926, 1956.
[20] M. Pring. Technical Analysis Explained. McGraw-Hill, New York, 4th edition, 2002.
[21] C. J. Sherry. The New Science of Technical Analysis. Probus Publishing, Chicago, 1994.
[22] T. Stridsman. Trading Systems and Money Management. McGraw-Hill, New York, 2003.
[23] V. Vovk and C. Watkins. Universal portfolio selection. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT-98), pages 12-23, New York, July 1998. ACM Press.
[24] M. P. Wellman, A. Greenwald, P. Stone, and P. R. Wurman. The 2001 trading agent competition. Electronic Markets, 13(1), 2002.
[25] W. Wilder. New Concepts in Technical Trading Systems. Trend Research, 1978.
[26] E. Zivot and J. Wang. Modeling Financial Time Series with S-PLUS. Springer, New York, 2003.

Appendix. Technical analysis indicators used during the PLAT competition

Technical indicators are statistics of the market that quantify market trends. Most technical indicators have been developed by professional traders using trial and error. It is common practice to use rules based on technical indicators to choose the timing of buy and sell orders. These rules are called buy and sell signals. In this work we use a combination of market indicators and trading signals. We define these indicators in this appendix and provide the basic intuition that motivates them. Throughout this section we assume a single fixed stock.

We start with some basic mathematical notation. We index the trading days by t = 1, 2, .... We denote by P^o_t, P^c_t, P^uc_t, P^h_t, and P^l_t the open, adjusted close, unadjusted close,7 high, and low price of the t-th trading day. We drop the lower index when we wish to refer to the whole sequence, i.e. P^c refers to the whole sequence P^c_1, P^c_2, .... Using this notation we define the median price P^med = (P^h + P^l)/2, the typical or average price P^typ = (P^h + P^l + P^uc)/3, and the weighted close price P^wc = (P^h + P^l + 2 P^uc)/4.

Many of the technical indicators incorporate time averages of prices or of other indicators. We use two types of time averages, the simple moving average and the exponentially weighted moving average.8 Let X denote a time sequence X_1, X_2, .... The simple moving average is defined as

SMA_t(X, n) = (1/n) * Σ_{s=0}^{n-1} X_{t-s},

and the exponentially weighted moving average is defined as

EMA_t(X, n) = λ * Σ_{s=0}^{∞} (1-λ)^s X_{t-s},   where λ = 2/(n+1).

A useful property of EMA_t(X, n) is that it can be calculated using a simple update rule:

EMA_t(X, n) = λ X_t + (1-λ) EMA_{t-1}(X, n).

In the following table we describe the technical indicators. The parameters of each indicator are given in parentheses. Most of the parameters refer to the length of the period (n) selected to calculate the indicator. In the case of the exponential moving average, the parameter used is λ, which also depends on n. We have assigned parameters which are typically used in the industry for each indicator.
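
As an illustration only (the paper computed its indicators with R and Rmetrics), the following sketch spells out the two averages, using the recursive EMA update with λ = 2/(n+1). pandas' rolling().mean() and ewm() provide equivalent functionality.

```python
# Minimal sketch of the two moving averages defined above.
from typing import List

def sma(x: List[float], n: int) -> float:
    """Simple moving average of the last n observations."""
    return sum(x[-n:]) / n

def ema(x: List[float], n: int) -> float:
    """Exponentially weighted moving average with lambda = 2 / (n + 1)."""
    lam = 2.0 / (n + 1)
    value = x[0]                                  # seed the recursion with the first observation
    for obs in x[1:]:
        value = lam * obs + (1 - lam) * value     # EMA_t = lam*X_t + (1-lam)*EMA_{t-1}
    return value

prices = [10.0, 10.2, 10.1, 10.4, 10.3, 10.5]
print(sma(prices, 3), ema(prices, 3))
```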

7 Unadjusted close prices are the actual published prices at the end of the trading day. The adjusted stock price removes the effect of stock splits and dividend payments. Our goal is to predict P^c_t, the adjusted close price.
8 We follow Zivot and Wang [26] in describing the technical analysis indicators. Additional useful references about technical analysis and trading are [16, 20, 4, 21, 22, 10, 5].

Technical indicators used in the PLAT competition

Price indicators:

SMA^c_t(n): Simple moving average of the last n observations of the time series P^c.
Calculation: SMA_t(P^c, n), where n = 3 and 6.

Bollinger bands: Using the moving average or median band (Boll^m_t(n)) as the reference point, the upper and lower Bollinger [3] bands (Boll^u_t(n) and Boll^d_t(n), respectively) are calculated as a function of s standard deviations. When the price crosses above (below) the upper (lower) Bollinger band, it is a sign that the market is overbought (oversold). Technical analysts typically calculate Bollinger bands using 20 days for the moving average and 2 standard deviations.
Calculation: Boll^m_t(n) = SMA^c_t(n), where n = 6;
Boll^u_t(n) = Boll^m_t(n) + s * σ_t(n), where s = 2.6 [Katz [16]];
Boll^d_t(n) = Boll^m_t(n) - s * σ_t(n), where s = 2.6 [Katz [16]],
and σ_t(n) is the standard deviation of P^c over the last n observations.

ADX_t(n): Average directional movement index: indicates whether there is a trend and the overall strength of the market [25]. Values range from 0 to 100; a high number indicates a strong trend and a low number a weak trend. The directional movement index (DX_t) is the percentage of the true range (TRange_n) that is up (+DI_t(n)) or down (-DI_t(n)). The true range determines the trading range of an asset.
Calculation: ADX_t(n) = (ADX_{t-1}(n) * (n-1) + DX_t) / n, where
DX_t = |(+DI_t(n)) - (-DI_t(n))| / ((+DI_t(n)) + (-DI_t(n))),
TRange_n = max(P^h_n) - min(P^l_n), n = 5,
P^h_n = (P^h_{t-n}, P^h_{t-n+1}, P^h_{t-n+2}, ..., P^h_t),
P^l_n = (P^l_{t-n}, P^l_{t-n+1}, P^l_{t-n+2}, ..., P^l_t).

Momentum and oscillation indicators:

MACD_t(s, f): Moving average convergence divergence: difference between two moving averages of different periods (s, f), where s stands for a slow period and f for a fast period. MACD_t(s, f) is regularly calculated using 26 (s) and 12 (f) periods.
Calculation: EMA_t(P^c, s) - EMA_t(P^c, f), where s = 26 and f = 12.

MACDS_t(s, f, n): MACD signal line: moving average of MACD_t(s, f) over the past n periods. A buy (sell) signal is generated when MACD_t(s, f) crosses above (below) the signal line or a threshold.
Calculation: EMA_t(MACD_t(s, f), n), where f = 12, n = 9, and s = 26.

MACDH_t(n, l): MACD histogram: difference between the fast MACD line and the MACD signal line.
Calculation: EMA_t(P^c, l) - MACDS_t(n), where l = 26.

RSI_t(n): Relative strength index: compares the periods in which stock prices finish up against those in which stock prices finish down. Technical analysts calculate this indicator using 9, 14 or 25 periods. A buy signal occurs when RSI_t(n) crosses below a lower band of 30 (oversold), and a sell signal when RSI_t(n) crosses above an upper band of 70 (overbought).
Calculation: RSI_t(n) = 100 - 100 / (1 + SMA_t(P^up_n, n) / SMA_t(P^dn_n, n)), where n = 5 and n is the length of the time series;
P^up_t = P^c_t if P^c_t > P^c_{t-1}, empty otherwise;
P^dn_t = P^c_t if P^c_t < P^c_{t-1}, empty otherwise;
P^up_n = (P^up_{t-n}, P^up_{t-n+1}, P^up_{t-n+2}, ..., P^up_t);
P^dn_n = (P^dn_{t-n}, P^dn_{t-n+1}, P^dn_{t-n+2}, ..., P^dn_t).

Stochastic oscillator: Compares the close price to a price range in a given period to establish whether the market is moving to higher or lower levels or is just in the middle. The oscillator indicators are:

FAST%K_t(n): Percent measure of the last close price in relation to the highest high and lowest low of the last n periods (true range). Typically a period (n) of 5 is used for FAST%K_t(n) and 3 for the rest of the stochastic indicators; we follow this convention.
Calculation: (P^uc_t - min(P^l_n)) / (max(P^h_n) - min(P^l_n)), where
P^l_n = (P^l_{t-n}, P^l_{t-n+1}, P^l_{t-n+2}, ..., P^l_t) is the vector with low prices of the last n periods, and
P^h_n = (P^h_{t-n}, P^h_{t-n+1}, P^h_{t-n+2}, ..., P^h_t) is the vector with high prices of the last n periods.

FAST%D_t(n): Moving average of FAST%K_t(n).
Calculation: SMA_t(FAST%K_t(n), 3).

SLOW%K_t(n): Identically calculated to FAST%D_t(n), using a 3-period moving average of FAST%K_t(n).
Calculation: SMA_t(FAST%K_t(n), 3).

SLOW%D_t(n): Moving average of SLOW%K_t(n); typically a period of 3 is used. A buy (sell) signal is generated when any oscillator (either %K or %D) crosses below (above) a threshold and then crosses above (below) the same threshold. Typically a threshold of 80 is used for the upper threshold and 20 for the lower threshold. Buy and sell signals are also generated when FAST%K_t(n) or SLOW%K_t(n) crosses above or below FAST%D_t(n) or SLOW%D_t(n), respectively.
Calculation: SMA_t(SLOW%K_t(n), 3).

MFI_t(n): Money flow index: measures the strength of money flow (MF_t) in and out of a stock. Unlike RSI_t(n), which is calculated using stock prices, MFI_t(n) is calculated using volume. When MFI_t(n) crosses above (below) 70 (30), this is a sign that the market is overbought (oversold).
Calculation: MFI_t(n) = 100 - 100 / (1 + PMF_t(n) / NMF_t(n)), where n = 15;
MF_t = P^typ_t * VOL_t, with VOL_t the volume of day t;
PMF_t(n) = SMA_t(MF_t, n) when MF_t > 0 (positive money flow);
NMF_t(n) = SMA_t(MF_t, n) when MF_t < 0 (negative money flow).
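
For illustration, the sketch below computes two of the indicators in the table with pandas; the authors used R and Rmetrics. Note that the RSI shown here is the common variant based on average up-moves and down-moves, whereas the table's definition averages the closing prices themselves on up and down days; the parameter defaults follow the table.

```python
# Sketch of two indicators from the table above (illustrative, not the
# authors' Rmetrics code).
import pandas as pd

def bollinger_bands(close: pd.Series, n: int = 6, s: float = 2.6):
    """Median band = SMA(n); upper/lower bands = median +/- s standard deviations."""
    mid = close.rolling(n).mean()
    sd = close.rolling(n).std()
    return mid, mid + s * sd, mid - s * sd

def rsi(close: pd.Series, n: int = 5) -> pd.Series:
    """Relative strength index from average up-moves vs. average down-moves."""
    delta = close.diff()
    up = delta.clip(lower=0).rolling(n).mean()
    down = (-delta.clip(upper=0)).rolling(n).mean()
    return 100 - 100 / (1 + up / down)
```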

Zen and the Art of Data Mining


T. Dasu, AT&T Labs Research, 180 Park Avenue, Florham Park, NJ 07932, 973 360 8327, tamr@research.att.com
E. Koutsofios, AT&T Labs Research, 180 Park Avenue, Florham Park, NJ 07932, 973 360 8642, ek@research.att.com
J. R. Wright, AT&T Labs Research, 180 Park Avenue, Florham Park, NJ 07932, 973 360 8359, jrw@research.att.com

ABSTRACT
In this paper, we draw upon our fifteen years of experience with clients from a wide range of technical and management backgrounds to outline the must-have characteristics of a successful data mining project. Sometimes it is human and organizational factors that determine success, not technical or scientific considerations. We present two real-life data mining applications each of which plays a critical role in a complex business process. We conclude with open problems that are in need of practical solutions.

Categories and Subject Descriptors


G.3 [Probability and Statistics]: Multivariate statistics, nonparametric statistics, statistical computing

Keywords
Business operations, data mining applications, anomaly detection.

1. INTRODUCTION

Business applications of data mining have been well documented, from customer behavior (CRM) analysis to market segmentation, to provisioning. Recently, however, data mining algorithms are being incorporated into the business process itself as key components. Many business operations and applications generate data that can be mined to monitor, manage and improve the efficiency and reliability of the business process. For example, in order to support a telecommunications billing application, large and complex data feeds from multiple sources (call detail, pricing plans, and customer data) have to be collected, synchronized, stored, scrubbed and integrated. To ensure safety, the data has to be replicated and archived. At every stage, device and process logs indicate the status of the processing of various feeds and files. Each feed and file itself is described by characteristics like time of arrival, size, number of columns and rows when applicable, and other attributes. The process logs and feed descriptions together constitute a data store that can be mined to detect and report anomalies like damaged files and incomplete process runs. Another example where data mining is a key component of a business process is the monitoring of servers and computing clusters that support critical business functions like online sales. The data generated for this purpose typically consists of CPU and disk usage, machine loads, and device and application logs generated by the computing framework.

2. CHALLENGES, CHARACTERISTICS

In both of the above examples, a clean data matrix or data stream that can be fed into a data mining algorithm does not exist. The data feeds, processes, definitions and the underlying business needs are in a constant state of change, frequently out of sync with each other. Metadata and domain knowledge play a crucial role in parsing and understanding the data, but are seldom available with any degree of completeness or accuracy. In addition to the flexibility to navigate such a challenging environment, a data mining application that supports a business process or business operations must have certain characteristics. Technical characteristics of a successful data mining algorithm arise from the requirements of generality, scalability and rigor. We invariably require the following characteristics:

Generality: the method should be widely applicable. We have numerous applications where the underlying data mining tasks are similar, whether characterizing the data by capturing multivariate data distributions through multivariate histograms, or finding abnormalities such as outlier and anomaly detection. Any method we develop for a data mining task should


be portable from application to application with only minor changes. That is, the method should be free of model assumptions and distributional assumptions (e.g. that the data are distributed as a multivariate Gaussian).

Scalability: the method should gracefully handle large numbers of dimensions as well as data points. Many data mining methods, particularly temporal mining and anomaly/change detection, tend to be based on a single attribute; such results can be misleading.

Rigor: the method should provide statistical guarantees on the results. Most data mining applications provide performance guarantees but seldom address goodness-of-fit or confidence guarantees.

In addition, there are non-technical criteria that play a critical part in a project's success. We have drawn on our fifteen years of experience with, and meditation upon, our dealings with clients to motivate these characteristics through two real-life applications.

3. APPLICATIONS

The two applications described below illustrate the key role of data mining in business operations that support critical corporate functions. We have built our applications from scratch, making sure that they fit within our clients' resource constraints. Commercial data mining applications, while helpful in a well-defined, structured, stable environment, do not have the flexibility to function in an environment beset with ambiguity and in a constant state of flux. The focus of commercial software is predominantly on snapshots of static data, addressing the temporal nature in a limited way. We have found that commercial software seldom fits our needs exactly and does not allow us to make changes to adapt it to our requirements. Furthermore, the driving force behind our applications is scale, where we need to quickly process and analyze terabytes of data. We at AT&T Labs have had to create our own database and data warehousing software (Daytona [3], Gigascope [4]) as well as specialized languages (Hancock [1]) for processing our massive data feeds and streams. The same holds true for data mining applications as well. The first application is a mature project for an internal client. The second supports an external client and is work-in-progress.

3.1 Feed Management

A business application related to telecommunications billing requires the gathering, integration, synchronization, cleaning and processing of hundreds of data feeds, each consisting of multiple files. The feeds are gathered on two servers and relayed to the appropriate computing clusters for processing. Simultaneously, the data are archived for safety and for future reference. It is a stringent requirement that data should not be lost or dropped. There is a small window of time during which lost or mangled data can be retransmitted; beyond this window the data are lost forever. Throughout the entire process, logs are generated by various systems, applications and devices, recording the state of each file and feed as they are consumed by this business process. The goal is to ensure that the feeds are received in their entirety in a timely manner and are properly preprocessed according to specifications for downstream data mining applications that use the actual content of the feeds. Processing errors (including incomplete processing) require reruns, resulting in loss of valuable time. Worse, if the rate of data processing is less than the rate of data accumulation, the project will be mired in backlogs, reducing the value of the results to the clients. Using data mining to monitor the processes and servers enables us to maintain a fine balance between data accumulation and data processing by alerting on anomalies in a timely fashion for quick recovery.

The data fed into the data mining application consists of extracted attributes like feed type, number of files received during a time interval, total size of the files, number and type of actions performed (e.g. files received, files cleaned, written to tape, files consumed by application A) and so on. The data mining application consists of an anomaly detection algorithm based on a suite of statistical tests that use a variety of test statistics, ranging from the well-known 3-sigma bounds to the less used Hampel bounds; see [5] for details. The baseline parameters of expected values and deviations are computed from weighted historical data and are updated when there is a significant change in the baselines. The outcome of each significance test is weighted using a feedback mechanism of empirical validation as well as input from subject matter experts. The ultimate decision is a weighted average of the 6 tests. By using a suite of tests and weighting them based on a feedback mechanism, we are able to automatically select the best tests for the data, without any prior knowledge or assumptions about the data distribution.

The application is made attractive and accessible to its users (process engineers, data center operators) using a visual device called a switchboard, shown below in Figure 1. The Y-axis shows the test ID of each of the six tests. The X-axis denotes time. We plot a dot at each time period in which a test detects an outlier in terms of file size. The more tests that agree, the greater our confidence in the anomaly alert. An email is sent to the operator in charge when we have a high-confidence alarm. Ideally, the switchboard should be clear of any dots. In the example below, the right side of the switchboard is ablaze with dots. Further investigation by the engineers in charge of the process traced the flood of alarms to a new version of a software application which caused the files to be truncated, resulting in smaller files. It was quickly traced and fixed.
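
As a rough illustration of the suite-of-tests idea (the production system uses six tests with baselines from weighted historical data and feedback-driven weights [5]), the sketch below combines a 3-sigma test and a Hampel (median/MAD) test with made-up weights and a made-up decision threshold.

```python
# Illustrative test-suite sketch: each test flags an observation, and the
# flags are combined with weights. Weights and threshold are invented here.
import numpy as np

def three_sigma_flag(history: np.ndarray, x: float, k: float = 3.0) -> bool:
    mu, sigma = history.mean(), history.std()
    return abs(x - mu) > k * sigma

def hampel_flag(history: np.ndarray, x: float, k: float = 3.0) -> bool:
    med = np.median(history)
    mad = np.median(np.abs(history - med))       # median absolute deviation
    return abs(x - med) > k * 1.4826 * mad       # 1.4826 scales MAD to sigma

def high_confidence_alarm(history: np.ndarray, x: float,
                          weights=(0.5, 0.5), threshold: float = 0.5) -> bool:
    flags = (three_sigma_flag(history, x), hampel_flag(history, x))
    score = sum(w for w, f in zip(weights, flags) if f)
    return score >= threshold                    # notify the operator if True

sizes = np.array([100, 102, 98, 101, 99, 103, 100, 97])   # historical file sizes (MB)
print(high_confidence_alarm(sizes, 40.0))                  # truncated file -> True
```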

Figure 1 Alarms (denoted by a blue dot) set off by a suite of 6 statistical tests while monitoring the file sizes in a data feed.

Figure 2 Overlap between a bivariate confidence region (green ellipse) and a component-wise confidence region defined by the intersection of two univariate confidence intervals.

While we have shown a simple univariate test (based on a single attribute), the underlying tests can be complex ones based on time series models, change detection algorithms in data streams, or any other appropriate test. In our application, we employ both multivariate and univariate tests, because univariate tests, though easy to use and understand, can sometimes be misleading. For example, in Figure 2, the green elliptical region represents the confidence region corresponding to a true distribution of the X and Y attributes, say a bivariate Gaussian distribution. The blue rectangular region represents the intersection of the confidence regions computed for the two attributes individually. The non-overlapping regions of the ellipse and rectangle correspond to either false alarms or genuine alarms that are overlooked, i.e. misclassified instances. It is important to consider multivariate tests in order to avoid such errors of omission and commission.

Implementing the data mining application to manage the complex feeds resulted in the following improvements:
- Automation of the feed monitoring process: saved many project hours previously spent manually reconciling and cross-checking data feeds, and reduced human intervention, resulting in fewer human errors.
- Automatic documentation of process history and anomalies.
- Reduced cycle times: errors are immediately reported and recovery is easy.
- Better quality data: recovery of lost data is possible since anomalies are flagged instantly, and incomplete processes are immediately notified before downstream processes can use corrupt data.
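
The toy example below makes the point of Figure 2 concrete: with correlated attributes, a point can fall inside both univariate 95% intervals yet lie far outside the joint confidence region measured by the Mahalanobis distance. The numbers are invented for illustration.

```python
# Toy example of the Figure 2 effect: a pair (x, y) passes both
# one-dimensional tests but is flagged by a joint (Mahalanobis) test.
import numpy as np
from scipy import stats

mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.9],      # strongly correlated attributes
                [0.9, 1.0]])
point = np.array([1.5, -1.5])    # each coordinate is only 1.5 sigma from its mean

# Component-wise check: inside both 95% univariate intervals (|z| < 1.96).
univariate_ok = np.all(np.abs(point - mean) < 1.96 * np.sqrt(np.diag(cov)))

# Joint check: squared Mahalanobis distance vs. a chi-square(2) cutoff.
d2 = (point - mean) @ np.linalg.inv(cov) @ (point - mean)
bivariate_ok = d2 < stats.chi2.ppf(0.95, df=2)

print(univariate_ok, bivariate_ok)   # True, False -> missed by univariate tests
```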

3.2 Monitoring Application Infrastructure (Work in Progress)


The second project involves monitoring the servers and routers that support e-commerce sites. The data mining project has several goals, including: (a) identifying and characterizing longitudinal trends in transactions, and (b) preempting any problems that could bring the site down by detecting changes and anomalies. Most commercial monitoring systems use thresholds to decide when to create alarms. An example would be: if the CPU utilization exceeds 90%, create a severity 1 (critical) alarm. The problem with this approach is that if the thresholds are set too high, the alarm will be generated when it is already too late and the system performance has already degraded. If the thresholds are set too low, the system generates too many false alarms, which annoys the operators, who then stop paying attention to these alarms. By using profile-based alarming, alarms are generated only when the parameter values exceed the expected values for the specific timeframe. So, for example, if a server is configured to run level-0 backups every Tuesday at 2am, the high I/O usage associated with the backups will become part of the profile and alarms will not be created for it.

The business application is supported by a complex collection of web servers, application servers, database servers, and various networking devices such as load balancers that divide the traffic between several servers to balance the load. Various traffic measurements (packets, payloads) and machine statistics (number of processes, system usage, disk and memory usage, web hits) are collected at regular 5-minute intervals. Figure 3 below shows the system usage of two servers over a one-month period. Server 13, depicted in blue, is typical. Server 11, depicted in red, is unusual. Typically, a suitably selected cut-off is used to generate an alarm. Note that the operators will receive many such alarms, often on an attribute-by-attribute basis, as shown in Figure 3. In reality, these alarms are invariably related and a big proportion can be traced back to a single root cause, but such analysis is quite manual and time consuming. Often, the operators guess at the root cause and ignore all alarms for that period, leaving potentially catastrophic alarms unaddressed. Our goal is to apply a multidimensional method that will detect changes in distributions (both short term and long term) and alert on them long before the threshold-based alarms are set off. This will give the operators time to take preemptive action, rather than wait until the alarms are set off to put out the fire. We believe that we can associate different types of changes in distributions with different types of alarms, allowing the operators to prioritize and act upon the changes.
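
A minimal sketch of profile-based alarming, under the assumption that history is bucketed by a time slot such as hour-of-week; the class, thresholds and example readings are illustrative, not the deployed system.

```python
# Sketch of profile-based alarming: compare each new measurement with the
# historical mean/std for the same time slot, so the Tuesday 2am backup spike
# is part of the profile and does not raise an alarm.
import numpy as np
from collections import defaultdict

class Profile:
    def __init__(self, k: float = 3.0):
        self.history = defaultdict(list)   # time slot -> past measurements
        self.k = k

    def update(self, slot: int, value: float):
        self.history[slot].append(value)

    def is_alarm(self, slot: int, value: float) -> bool:
        past = np.array(self.history[slot])
        if len(past) < 5:                  # not enough history for this slot yet
            return False
        return abs(value - past.mean()) > self.k * (past.std() + 1e-9)

profile = Profile()
for v in [88, 92, 90, 89, 91, 90, 92, 88]:     # eight weeks of "Tuesday 2am" I/O readings
    profile.update(slot=26, value=v)
print(profile.is_alarm(slot=26, value=91))      # high but expected for this slot -> False
print(profile.is_alarm(slot=26, value=20))      # far from the slot profile -> True
```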

Figure 3 System usage of an aberrant server (#11, red) compared to a normal server (#13, blue).

The existing monitoring systems use simple univariate control charts based on means and standard deviations, or other one-dimensional summary statistics. With a large number of attributes, updating and monitoring these models becomes unmanageable. Furthermore, as mentioned earlier, univariate tests ignore attribute interactions and can produce misleading results. We use an information theoretic approach that is widely applicable (handles many attributes, large amounts of data, and both continuous and categorical data), nonparametric (distribution and assumption free) and provides statistical guarantees for the results. The method represents the data distributions as multidimensional histograms (in two sliding windows) and flags anomalies by detecting changes in the data streams. Roughly, the steps involved are as follows:
- Multidimensional histograms are created for the two comparison windows using kdq-trees.
- The difference between the two histograms is measured using the Kullback-Leibler distance.
- The statistical significance of the distance is determined using resampling techniques like the bootstrap.

The algorithm we use is capable of gracefully handling large data streams, multiple dimensions and both numerical and categorical variables; please see [2] for details. In addition, the algorithm can pinpoint the problem sections in the data streams where we detect the most significant difference (Figure 4). We accomplish this using the Kulldorff spatial scan statistic, which we compute at the same time as the kdq-tree histograms. The Kulldorff spatial scan statistic is a special case of the Kullback-Leibler distance. See [2] for details.
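
The simplified sketch below captures the windowed change-detection idea with ordinary fixed-width histograms and a permutation-based significance test; the actual method uses kdq-tree histograms, the bootstrap and the Kulldorff spatial scan statistic [2].

```python
# Simplified sketch of windowed change detection: histogram two windows,
# measure their Kullback-Leibler distance and judge significance by resampling.
import numpy as np

def kl_distance(p: np.ndarray, q: np.ndarray, eps: float = 1e-9) -> float:
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))

def change_detected(window_a, window_b, bins=20, n_resamples=500, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    edges = np.histogram_bin_edges(np.concatenate([window_a, window_b]), bins=bins)
    observed = kl_distance(np.histogram(window_a, edges)[0].astype(float),
                           np.histogram(window_b, edges)[0].astype(float))
    pooled = np.concatenate([window_a, window_b])
    n = len(window_a)
    null = []
    for _ in range(n_resamples):               # shuffle the pooled data to build the null
        rng.shuffle(pooled)
        null.append(kl_distance(np.histogram(pooled[:n], edges)[0].astype(float),
                                np.histogram(pooled[n:], edges)[0].astype(float)))
    p_value = np.mean(np.array(null) >= observed)
    return p_value < alpha

rng = np.random.default_rng(1)
print(change_detected(rng.normal(0, 1, 500), rng.normal(0.5, 1, 500)))  # True: mean shift
```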

Figure 4 Kdq-tree histogram; areas of greatest difference found using the Kulldorff spatial scan statistic.

4. THE ZEN OF DATA MINING

Our fifteen years of experience providing data mining services to both external and internal clients have given us insights beyond the technical and scientific. We find that human and organizational factors play a big role in the success of a data mining project, occasionally eclipsing even technical considerations. We summarize below:

The data mining algorithm has to perform demonstrably and considerably better than an existing system. A system or method that is already in place, however primitive, is tough competition. Most customers have an inertia to making changes: it costs money, is disruptive, and there are no guarantees the new technique will work better. And never underestimate the power of politics and turf battles.

The data mining algorithm has to be conceptually simple and transparent. A simple, easily understood method, even if it performs a little worse, has a better chance with a customer than a complex, opaque (black box) technique. The clients and the ultimate users of the tool are often stretched too thin to invest in long training sessions and steep learning curves. A gain in accuracy of 5%-10% might not necessarily make a complicated algorithm attractive. The likelihood that a customer will buy and use a product increases if it is based on something they are familiar with and can relate to.

The algorithm has to be tailored to the resources of the customer. Clients often do not have the computational power, hardware or special software to run expensive algorithms. A data mining project has a greater chance of success if it can be handed off to a client who can take control, use it autonomously and make minor modifications if needed. The technique has to be flexible and nimble (not monolithic) and easily updated to keep up with the business requirements, process specifications and data feeds that are in a constant state of change.

5. OPEN PROBLEMS

Data mining, over the past two decades, has focused on the analysis of massive data sets, with data contents ranging from numeric and string attributes, to text, to audio/video, images, web pages, and any other form of information that can be captured. There has been considerable interest in data streams and temporal data mining as well. In our experience, the use of data mining as a part of business operations to support business processes is perhaps the most challenging and useful application. Business operations are messy, badly documented and require enormous amounts of domain knowledge. They change and morph almost on a daily basis. The data generated are complex (sys logs, streams of measurements, free text comments) and are beset with data quality issues. Even a partial solution can lead to big gains in efficiency and performance, highly valued by management. We list below what we consider to be major research issues in implementing data mining algorithms as a part of business processes.

5.1 Data Quality


Data mining and data quality are often closely linked: data quality issues often show up as outliers or interesting patterns in data mining exercises. Similarly, data mining is often preceded by a data cleaning stage to ensure the integrity of the data mining results. Even after all these years, data quality is often a one-shot, ad hoc data preparation step. But given the dynamic nature of most data, it is important to interleave data quality algorithms with data mining algorithms on an iterative basis, each improving upon the other. Some interesting problems include:
- How can one tell the difference between a data glitch and a genuine outlier, both of which have been unearthed by the data mining algorithm?
- What kind of accuracy/confidence guarantees can one give if a significant portion of the data is missing, truncated or censored? At what point do the results become unreliable and inconsistent?
- What are effective ways of validating data integrated from different data sources using soft join keys? What kind of accuracy/confidence guarantees can one give?
- What are effective ways of capturing domain knowledge and incorporating it into metadata? Often, data dictionaries and metadata are gathered incrementally over time from different sources and experts. How does one quantify and evaluate the effect of the incremental knowledge on the accuracy of the results, i.e. a kind of utility curve for metadata?
- Can data quality issues be classified into categories, perhaps within a hierarchy?

There are some dispersed works that address these questions obliquely, but not in any systematic way.

5.2 Goodness-of-Fit

Goodness-of-fit is a statistical concept that evaluates how well a model explains the variability in the data. A good model will account for most of the variability, so that the deviations from predicted values are caused primarily by random events. Statisticians place quite an emphasis on goodness-of-fit. In contrast, the data mining community typically focuses on computational performance and comparison with a benchmark. Classification algorithms do talk about error rate and accuracy of prediction, but in a self-referential way. Some issues that need to be addressed:
- How do you define goodness-of-fit for a data mining algorithm? In statistics, the goodness-of-fit criteria often arise from distributional and model assumptions.
- Is it possible to automate the selection of an appropriate data mining algorithm for a given data set using a goodness-of-fit criterion?

In the context of feed management, it would be useful to have a goodness-of-fit number that can be used to detect changes in data distributions that require updating of the model parameters.

5.3 Grouping Multiple Metrics

In the second application, some of the assets are members of groups: web servers behind a load balancer, all serving the same web site, or multiple routers, all connected to the backbone as a load-balanced group. Such groups can be treated as assets, with metrics derived by combining the metrics of the individual assets. These super-assets represent the architectural components of the system: the "web service" or the "backbone connection". The individual assets represent the implementation of the architecture. The primary question is: Is there a systematic way to group related units and related attributes for a clear, unbiased analysis?

5.4 Resource Constraints

Practical deployment of data mining applications always results in a few surprises. Operational systems are designed to make the best use of their hardware, and are not always capable of supporting data mining applications that are CPU or memory intensive. Therefore it sometimes happens that common assumptions made about the resources available for data mining applications cannot be met in an operational setting. Today's technology is capable of generating immense amounts of data, and there may be real-time or near real-time requirements placed on the data mining application. The best example is the requirement to alert users when anomalous conditions arise. High-speed networks are capable of generating data streams that strain the fastest processors, and systems that support critical operations such as customer relationship management or banking transactions can produce data at a similarly high rate. Data may accumulate at a rate that makes it difficult to transmit to an appropriate computing platform. To be truly effective in operational settings, data mining applications need to be both statistically sound and computationally efficient. This implies the need for collaborative research between computer scientists and statisticians. To be practical, data mining algorithms must be efficient, and for their output to be useful, they must be statistically sound. Muthukrishnan (2004) provides an excellent discussion of this issue in the context of data stream processing, and Dasu et al. (2006) provide a working example of a fruitful collaborative effort between algorithmicists and statisticians.

6. REFERENCES

[1] C. Cortes, K. Fisher, D. Pregibon and A. Rogers. Hancock: a language for extracting signatures from data streams. In Proc. of Knowledge Discovery and Data Mining, 9-17, 2000.
[2] T. Dasu, S. Krishnan, S. Venkatasubramanian, and K. Yi. An information-theoretic approach to detecting changes in multi-dimensional data streams. To appear in Proc. of the 38th Symposium on the Interface of Statistics, Computing Science, and Applications (Interface '06), Pasadena, CA, May 27, 2006.
[3] R. Greer. Daytona and the fourth-generation language Cymbal. In Proc. of the ACM SIGMOD Conf. on Management of Data, pages 525-526, 1999.
[4] T. Johnson, C. D. Cranor, and O. Spatscheck. Gigascope: a stream database for network applications. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, June 9-12, 2003.
[5] J. R. Wright, D. Majumder, T. Dasu and G. T. Vesonder. Statistical ensembles for managing complex data streams. In Proceedings of the International Workshop on Statistical Modeling, 2005, 431-436.

Data Mining in the real world


What do we need and what do we have?

Françoise Fogelman Soulié
KXEN, 25 Quai Gallieni, 92158 Suresnes cedex
francoise@kxen.com

ABSTRACT
In this paper, we describe the constraints for successful deployment of data mining in the real world. Because the volume of available data is constantly increasing, and competition is ever stronger, companies wanting to get value out of their data asset are turning to data mining to produce models for every business process and every decision. Deploying data mining on a large scale, which we call Extreme Data Mining, poses specific constraints, which we discuss. We then show that the KXEN Analytic Framework makes it possible to develop solutions for companies with a strongly demanding agenda. We present some examples of such solutions and conclude with some views on what the future of Extreme Data Mining could be.

Categories and Subject Descriptors
G.3 [Probability and Statistics]: Nonparametric statistics, Robust regression, Statistical computing; I.2.6 [Learning]: Knowledge acquisition; I.5 [Pattern Recognition]: I.5.1 Models, Statistical; J. [Computer Applications]

General Terms
Algorithms, Experimentation, Performance, Reliability.

Keywords
Data Mining, Industrial applications.

1. INTRODUCTION

Historically, data mining has been in the hands of small teams of expert statisticians who produce a few models per year. However, companies have recently invested heavily in building huge data warehouses (from a few terabytes to petabytes) that contain millions of records and thousands of variables; for example, 5,000 variables on 150 million customers and prospects. That has changed the economics of data mining. Now businesses want a return on that investment and are looking well beyond reporting and basic statistics [1]. When they review their business activities, they see the need for hundreds or thousands of predictive models per year. Of course, very few companies can produce that many today, due to a lack of expert staff and appropriate tools. However, some actually do generate that many models, for example:
- A broadband communications company moved from 5 cross-sell models per year to 1600;
- A wireless communications company produces 700 CRM models per year;
- A national retailer cut time-to-model by 90% and scores 75M households in 30 minutes;
- A marketing research firm built 370 propensity-to-buy models on a PC in an afternoon.

We will analyze model production from an industrial viewpoint and review constraints such as large data volumes (records and variables), the ability to produce robust models with little intervention, and fast, automatically parameterized algorithms. Typical model training would involve 300,000 records with 600 variables, trained in less than one hour total on a personal computer, including data coding, variable selection, and modeling; and application of the model on a few million records in an additional hour. We will review how we achieve this level of productivity through the extensive application of Vapnik's SRM framework [2]. We will discuss some examples. Lastly, we will discuss future challenges, including the automatic coding of multimedia data (text, images, audio, and movies) and the integration of model training and application into vertical software packages. We feel that the data mining role in businesses is on a fast track today, and Machine Learning practitioners will play a major role, provided we take into account key industrial constraints.

2. DATA MINING IN COMPANIES TODAY


Data mining is widely used in companies today, mostly for CRM, fraud detection, credit scoring, and web mining (see [3] for example). Yet data mining is still mostly seen as an art to be practiced only by experts (statisticians, data miners, analysts). The development of data mining-based applications is strongly linked to the ever-increasing availability of large volumes of data. In the last ten years, companies have invested very heavily in the implementation of very large data warehouses; the largest databases nowadays reach 20 to 100 TBytes and comprise millions of customers and thousands of variables. Sizes typically keep tripling every two years (Fig. 1).


Figure 1. Data base size in TBytes


(from http://www.wintercorp.com)
100 2005 80 2003 2001

The IT department is in charge of maintaining the data bases and executing the models transferred to them by analysts to produce the results (e.g. scores) used by the business users. Figure 3. People involved in a Data Mining project

60

40

20

0 1 2 3 4 5 6 7 8 9 10
Top Ten

After such investments (in the $ 100 M range), companies need to see returns. Data mining can bring them just this, if the company is willing to put in place the right organization. Successful analytics competitors [1] need to invest in collecting lots of data of course, but also to commit to basing all decisions on data, which will require top management commitment and employees capability in handling data and producing models for every business process in the company (which will probably mean hundreds of models per year in the company). Enterprise Performance Management approaches such as e.g. Six Sigma, Baldridge, Balanced Score Card, Kaizen [4] provide systematic methodologies for improving key business processes. Actually investigating all key business processes may lead to a very large number of required models, for example, Vodafone D2 [5] has identified the needs for 716 models per year (Fig. 2). Figure 2. Number of models needed at Vodafone D2

This process will result in a typical 3-8 weeks delivery time, which is hardly compatible with the reactivity needed in extreme data mining. What we need is industrial data mining at a greately speeded pace we call it Extreme Data Mining This requires that a company put in place a model factory capable of churning out hundreds of models per year on terabytes of data (millions of lines customers and thousands of variables). The right data mining technology is indeed a key ingredient and it will have to provide employees flexibility, ease-of-use and productivity. Unfortunately, most of the time, companies fall short of achieving such feats : their data collection processes are not efficient, and lots of data keep sitting in spreadsheets or Access files on employees PCs instead of being shared. Moreover, data mining tools are so complex that only a few expert analysts can handle them, producing only a few models per year, with the result that most decisions are based not on data-driven analysis, but simple rules of thumb. We will not discuss the first issue here (data collection), but will focus on the second one. We now describe features required for a data mining tool to support extreme data mining.

Domains                        # Analysis / Year
Segmentations                  2*2*10 = 40
Churn in general               2*3*2*3 = 36
Churn per product              2*3*2*4*10 = 480
Cross sell: segments*offers    2*4*10 = 80
Acquisition                    2*4*10 = 80
Total                          716
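As a quick sanity check on the table (ours, not part of the original paper), the per-domain counts multiply out to the 716 models per year cited above:

# Hypothetical check of the Vodafone D2 model counts cited in the text.
domains = {
    "Segmentations": 2 * 2 * 10,                    # 40
    "Churn in general": 2 * 3 * 2 * 3,              # 36
    "Churn per product": 2 * 3 * 2 * 4 * 10,        # 480
    "Cross sell (segments * offers)": 2 * 4 * 10,   # 80
    "Acquisition": 2 * 4 * 10,                      # 80
}
print(sum(domains.values()))  # 716 analyses per year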

The production of a data mining model usually involves a triangular relationship (Fig. 3). The business users have an operational role (for example, they are in charge of designing, planning and executing marketing campaigns); they list their requirements and use the model results in their business processes. The Analytics Department serves the needs of 20-50 business users, producing 5-10 models per year with 5-10 data mining experts; data mining experts are usually a very scarce resource in the company, and they produce the models in line with the business users' requirements. The IT department is in charge of maintaining the data bases and executing the models transferred to them by the analysts, to produce the results (e.g. scores) used by the business users.

Figure 3. People involved in a Data Mining project

3. REQUIREMENTS FOR EXTREME DATA MINING


Extreme data mining needs three ingredients: the right culture, the right people and the right technology [1]. Critical to the latter is the right data mining tool, which should have the following characteristics.

1. Data manipulation
Data sources usually come in heterogeneous formats (continuous, nominal, ordinal, text, image, speech, video, ...). The data mining tool should be able to handle all these formats with as little intervention as possible from the user; it might be necessary to automatically recode some of the variables.

Data quality is often poor, especially with large volumes, where multiple sources may have very different quality levels. The DM tool has to handle missing values automatically and produce results that are robust with respect to outliers. Data volume is increasing fast: one should be able to handle millions of rows (e.g. customers) and thousands of variables. Such volumes must not require the user to hand-pick relevant variables (this takes time and requires expert knowledge, and is thus not adequate for thousands of variables), to look individually at each variable and study its statistical properties (for the same reason), or to duplicate or move data to produce models; ideally, DM tools should be able to access the general data warehouse directly.

2. Model production
Model calibration should be reasonably fast. Algorithms requiring days of computation should be ruled out, as should algorithms that do not scale well with the number of variables and examples. The DM tool implementation of suitable algorithms should be fast (linear in the number of variables and examples is a good target). Model application: for some applications, real time or right-time [5] execution is necessary. For example, for web behavior scoring, a model has to be applied in real time to data extracted from the click-stream, and the result transferred on the fly as the visitor goes on clicking; DM tool scoring must therefore execute fast. Model industrialization. Export: after a model is produced, it may have to be used on a regular basis, so it must be exported to a production environment. This has to be done fast and accurately, and not as an additional big project in itself. Velocity is always king: "good enough and deployed" is always better than "perfect and in the lab" [6]. Control: when a model is regularly applied to new data, one first has to make sure that the data have not changed structure. Theoretically, a model Y = f(x1, x2, ..., xn) is built from a sample drawn from an unknown distribution P(X, Y) and can only be applied to data drawn from the same distribution. The DM tool has to provide ways to easily check for deviations in the data distribution P(X, Y); a minimal sketch of such a check is given below. Methodology: to obtain growth, profitability and customer satisfaction, and to adapt processes systematically and continuously, we need a strong methodology well suited to Extreme Data Mining and well supported by the DM tool.
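To make the Control requirement concrete, here is a minimal sketch (ours, not any vendor's method) of a distribution-deviation check comparing the training sample with newly scored data, using a two-sample Kolmogorov-Smirnov test per variable; the column name and threshold are illustrative assumptions.

import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def check_deviation(train_df, new_df, columns, alpha=0.01):
    """Flag variables whose distribution has drifted between training and scoring data."""
    drifted = []
    for col in columns:
        stat, p_value = ks_2samp(train_df[col].dropna(), new_df[col].dropna())
        if p_value < alpha:  # small p-value: the two distributions likely differ
            drifted.append((col, stat, p_value))
    return drifted

# Illustrative usage with synthetic data standing in for samples of P(X, Y):
rng = np.random.default_rng(0)
train = pd.DataFrame({"revenue": rng.normal(100, 10, 5000)})
new = pd.DataFrame({"revenue": rng.normal(115, 10, 5000)})   # shifted distribution
print(check_deviation(train, new, ["revenue"]))              # reports the drift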

3. Users
Users know their business. They know which issues and problems are key, what data are used or generated by their activity, and what the business value of the result delivered by a model is. The DM tool must allow them to easily express their business questions in non-statistical jargon. Users do not know statistics (and do not want to learn!). So they cannot tell which algorithm is best, how to manipulate or recode data, how to select variables according to statistical significance, how to handle outliers and missing data, how to decode a model's results, or how to evaluate the statistical validity of a result. The DM tool must do all of this for them (automatic algorithm selection, data coding and selection, missing data and outlier handling, guaranteed validity of results), and the model must be self-explaining, with business-explicit reports that avoid statistical jargon.

4. Integration
Analytic modeling is usually only a part of a more global process (e.g. a marketing campaign). The DM tool will thus be integrated into an architecture including data bases and various other tools. To make that integration easy, it needs to be compliant with IT architecture standards (COM/DCOM, C++, Corba, Java API, Web Services, J2EE) and with data mining standards (JDM, PMML) in case various DM tools are used. Model results can be produced in two ways: either within the tool (which executes the model on a given data set), or the model is exported for execution on a different tool. In the first case, a list manager will be needed to manipulate the results produced and possibly import them into the data warehouse. In the second case, the DM tool should produce code in any format (SQL, UDF, Java, C, XML, HTML, SAS, ...) so that the exported code can easily execute within the target IT platform (for example, execute directly in the data base through SQL or UDF); a sketch of such code generation appears at the end of this section. A component architecture will allow users to build their analytics architecture progressively, including components only when they are needed. The data access API should allow connection to any data format (text files; SQL, DB2, Oracle, Teradata data bases; SAS, SPSS, Microsoft files). Operational process control and workflow: the DM tool will work within a general IT environment and will thus be inserted into a general process managed through the usual IT tools (job scheduler, version control, user rights management, workflow tool); the DM tool must therefore be open and provide generic, well-documented APIs.

5. Value
Of course, models must deliver useful, accurate results, so that users can exploit them in their business processes. Exploratory modeling: the model is used to understand the data, for example what the key drivers of customer churn are, or what characterizes a given segment. The need here is for the DM tool to provide easy-to-understand, business-oriented reports and graphics. Predictive modeling: after the model is calibrated on a data sample, it is applied to new data. The performance evaluated a priori on the training data is expected to be of the same order on the new data: robustness is thus needed here. The DM tool must provide both automated and easy-to-understand quality assessment, and guaranteed robustness.
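As an illustration of the model-export idea (ours, not any specific tool's actual export module), the following sketch turns the coefficients of a fitted logistic scoring model into a standalone SQL expression that can run inside the warehouse; the table and column names are invented for the example.

import numpy as np
from sklearn.linear_model import LogisticRegression

def model_to_sql(model, feature_names, table="customer_mart"):
    """Generate a SQL SELECT that reproduces the model's score in-database."""
    coefs = model.coef_.ravel()
    terms = " + ".join(f"({c:.6f}) * {name}" for c, name in zip(coefs, feature_names))
    score = f"{model.intercept_[0]:.6f} + {terms}"
    # The logistic transform is applied in SQL as 1 / (1 + EXP(-score)).
    return f"SELECT customer_id, 1.0 / (1.0 + EXP(-({score}))) AS churn_score FROM {table};"

# Illustrative usage on synthetic data:
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=1000) > 0).astype(int)
features = ["tenure_months", "num_complaints", "monthly_spend"]
clf = LogisticRegression().fit(X, y)
print(model_to_sql(clf, features))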

4. EXISTING EXTREME DATA MINING EXAMPLES


At KXEN, we have taken the Extreme Data Mining challenges at face value. Our tool, KXEN Analytic Framework (KXEN AF), has been developed to answer most of the previous requirements, as follows. Data preparation/recoding and data quality handling: KXEN AF provides fully automatic coding and handles large data volumes, because it does not duplicate data and only performs a few reads. Model production is automated: KXEN AF does not provide algorithms but functions (very much in line with the JDM standard), so there is no need to choose among algorithms. Model production is de-skilled through a simple, click-through interface. Model production (including variable encoding) is fast: for example, on a PC, 9 seconds for 50,000 observations and 15 variables, and 800 seconds for 1,000,000 observations and 220 variables. Model quality is guaranteed through built-in robustness, based upon Vapnik's Structural Risk Minimization theory [2]; KXEN AF provides a simple robustness indicator, KR. Model export is automated through a dedicated module, KMX, capable of exporting the model code to almost any format. Model control can be automated through the deviation-detection functionality. Model integration into the IS is automated: KXEN AF offers access to all data formats and export to all formats, and is compliant with DM standards such as JDM [8, 9].

Many customers of KXEN AF are now performing Extreme Data Mining. For them, introducing the tool has been a big step towards fully exploiting their data and getting a real competitive advantage. We now give some examples, omitting protected commercial details.

Cox Communications [10, 11] started using KXEN in September 2002 in its marketing department, to analyze its customer data base. It now produces hundreds of models for marketing campaigns in 26 regional markets from a data base of 10 million customers and 800 variables. Cox believes that the ease of using intuitive click-through menus lets analysts focus on creating models without extensive data preparation. Cox has found that only 4 senior analysts were needed to manage the work, with one regular analyst per region self-sufficient in creating and supporting analytic models. Time to produce models, from start to finish, has been reduced by approximately 80 percent, bringing model building time from three weeks to one. Since Cox started using the tool, results have improved: direct mail response rates have risen from 1.5 to 5.5%, and churn has been reduced by a percentage point.

By using this tool, the company realized a return on its investment within the first two months it was in service [10].

Sears [12, 13] was an early adopter of predictive analytics technology, but its initial mainframe-based system became too expensive and inflexible. Sears wanted to improve cost, performance and quality in its catalog business with low expense and a small staff performing all duties (modeling/analytics, data operations and marketing). First, they integrated data from Sears' multiple channels, brands, credit data, market demographics and external sources, resulting in a data mart with more than 900 attributes, integrated into the corporate Teradata warehouse. They then used KXEN to automate their data preparation process, including attribute importance and nominal attribute encoding, among others. They now build more models with better model quality, while reducing model development time and costs; for example, it now takes a few hours to create robust models where it used to take weeks. Finally, Sears uses the KXEN analytics engine to automatically generate model deployment code directly within the data warehouse, which removes the need for an IT person to implement models in the warehouse and allows small changes in minutes instead of hours or days. Sears can now score 75 million customer records in 30 minutes. The results and benefits identified by Sears are: creating and implementing models now takes 1-2 days; expert statisticians are not needed to conduct a campaign or analyze customers; Sears has cut more than 50% of its operational costs; Sears has cut more than 90% of its modeling and scoring time; and they can react to operational changes very quickly.

Barclays [7] wanted to leverage its hundreds of millions of inbound contacts, having determined that this was where the greatest growth potential lay. They therefore incorporated events, triggers and predictive models to drive actions in the inbound channels, which required acting in a timely way with a relevant contact (< 72 hours). They found that it all had to start with data. They considered using KXEN to de-skill the modeling process a potential competitive advantage: business users now had the ability to build robust models and interpret the output easily, and by empowering the business user, the fixed cost per model decreased for the organization. Twenty predictive models were built, tested and operationally deployed by 2 business users within a month. But they also found that training was absolutely necessary to gain adoption of their CRM platforms: timely data, great analytics and intuitive CRM platforms were not sufficient.

Vodafone wanted a complete analytic environment to support the easy creation and deployment of churn and cross-sell/up-sell models [5].

They built a Teradata warehouse in which their Customer Analytic Record (CAR) contains over 2,500 meaningful variables, with monthly aggregates updated on every billing cycle. By using Teradata, they reduced the data preparation step from 70% or more of the work effort to almost nothing. They identified the need to build more than 700 models per year (Fig. 2) for various business activities (marketing, customer price sensitivity analysis, channel preferences, ...). They implemented KXEN and found that using it gives Vodafone the ability to move from creating a small number of models per year to hundreds of models per year, that consistently high-quality models can be produced by less experienced analysts, and that models can be automatically deployed into production and customers rescored whenever necessary.

5. NEXT STEPS
The examples we have just discussed are only the beginning of the real revolution that Extreme Data Mining is bringing. With the exponential increase in data base volumes and in Web usage and sources, we will have more and more massive data sets available, either in data bases or in data streams. Extreme Data Mining will thus become more and more pervasive as people try to extract value from these data sets. However, tapping the full value hidden in these massive data sets will still require progress on various issues.

Coding. Automatic multimedia coding (image, speech, audio, video) is needed if one wants to mine increasingly heterogeneous data sets.

Data mining. Unlabelled data: in massive data sets, the ratio of labelled to unlabelled data can be very small (1 to 1,000,000). Extreme Data Mining thus needs mechanisms for training models in this situation while still producing robust models. Theoretical tools: manipulating large data sets through mostly linear models requires specific tools for handling samples and decomposing matrices [14].

Production. Extreme data mining on massive data sets requires efficient tools to fully integrate the analysis process, from data collection and sampling to storage, model calibration, model application and model control. Only if and when this process is fully automated will Extreme Data Mining really be deployed in industrial Model Factories. Integration into vertical packages will allow vendors to introduce optimized solutions so that business users will routinely use data mining without even knowing it.

With what we already have and all these developments under way, Extreme Data Mining is bringing data mining into every company where data exist, helping to get value out of the data asset, in all processes and all activities: "Data mining is not really an 'end' per se, but a means to an end. These 'means' will become progressively submerged in the infrastructure of the products they serve until they are as natural to use as standard arithmetic and graphical techniques" [15].

References
1. Davenport, Thomas (2006) Competing on Analytics. Harvard Business Review, January.
2. Vapnik, Vladimir (1995) The Nature of Statistical Learning Theory. Springer.
3. Piatetsky, Gregory (2006) Poll Results: Top Industries for Data Mining Applications. www.kdnuggets.com/news/2006/n13/1i.html
4. Six Sigma and Baldrige resources: http://www.motorola.com/motorolauniversity.jsp, http://www.quality.nist.gov/

5. West, Andreas and Bayer, Judy (2005) Creating a Modeling Factory at Vodafone D2: Using Teradata and KXEN for Rapid Modeling. Teradata Conference, Orlando. http://www.teradata.com/teradata-partners/conf2005/
6. Herschel, Gareth (2005) Right Timing Customer Analysis. Teradata Conference, Orlando. http://www.teradata.com/teradata-partners/conf2005/
7. Harris, Matt (2005) The Journey from Product to Customer-Centricity. Teradata Conference, Orlando. http://www.teradata.com/teradata-partners/conf2005/
8. Java Community Process (2005) JSR 73: Data Mining API. www.jcp.org/en/jsr/detail?id=73
9. Hornick, Mark F., Liu, Lei, Marcade, Erik, Venkayala, Sunil and Yoon, Hankil (to appear) Java Data Mining: Strategy, Standard, and Practice. A Practical Guide for Architecture, Design, and Implementation. Morgan Kaufmann.
10. Douglas, Seymour (Feb. 2003) Product Review: KXEN Analytic Framework. DM Review.
11. Ericson, Jim (Dec. 2005 - Jan. 2006) Perfect Pitch. Business Intelligence Review.
12. Bibler, Paul and Bryan, Doug (Sept. 2005) Sears: A Lesson in Doing More With Less. TM Tipline. http://ga1.org/tmgroup/notice-description.tcl?newsletter_id=1960075&r=#6
13. Bibler, Paul (2005) Lifting Predictive Analytics Productivity at Sears. Teradata Conference, Orlando. http://www.teradata.com/teradata-partners/conf2005/
14. MMDS 2006 (2006) Workshop on Algorithms for Modern Massive Data Sets. http://mmds.stanford.edu
15. Nisbet, Robert A. (March 2006) Data Mining Tools: Which One is Best for CRM? DM Direct Special Report. http://www.dmreview.com/editorial/newsletter_article.cfm?articleId=1050627

Business Event Advisor: Mining the Net for Business Insight with Semantic Models, Lightweight NLP, and Conceptual Inference
Alex Kass
Accenture Technology Labs 1661 Page Mill Road Palo Alto, California 94304 alex.kass@accenture.com

Christopher Cowell-Shah
Accenture Technology Labs 1661 Page Mill Road Palo Alto, California 94304 c.w.cowell-shah@accenture.com
ABSTRACT
The Business Event Advisor is a prototype corporate radar kit that can be used to create customized solutions for monitoring the external business environment in which a company operates. It does this in a way that is loosely analogous to the way in which enterprise Business Intelligence systems monitor internal operations, exploiting Internet-based information sources to help decision makers systematically detect and interpret external events relevant to their business concerns. To accomplish this, the prototype integrates text mining components with a semantic model of a company's business environment. Applications built with the system can produce structured descriptions of events that may be relevant to a given company; the system can then aggregate those events around its model of the company's business environment and suggest what impact(s) these events might have on that company. We discuss our motivation for creating this prototype, the architecture of the system, the kinds of information it can provide, the kinds of models it requires, and some of the challenges we've encountered.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing---text analysis; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods---representation languages; H.4.2 [Information Systems Applications]: Types of Systems---decision support; J.1 [Administrative Data Processing]---business

Keywords
Business Applications, performance support, NLP, Web, Inference, Business Intelligence, Competitive and Market Intelligence.

1. OVERVIEW
Enterprise Business Intelligence (BI) systems currently focus mostly on exploiting the information that is flowing through a company's data systems to help managers understand what's going on within their operation. But many executives wish they had a more systematic means of using technology to help them better identify the external events that represent potential threats or opportunities to their companies. A wealth of information that decision makers can, in theory, use to monitor their competitive ecosystem is now available online. In practice, however, it is very difficult for a person to systematically scan, digest, and interpret the information that's available in order to derive the potential insights contained in that large information stream. Automated clipping services can help filter the information stream, and thus represent a step in the right direction, but they don't go far enough in helping decision makers easily see the potential implications of new pieces of information for their organization's specific concerns. Often, for instance, it is only by putting together several pieces of raw data from disparate sources that an important threat or opportunity can be detected. Decision makers need more than a filtered news source; they need news and data related explicitly to the forces and issues that matter to their businesses.

The premise of the work described in this paper is that the frequently updated, easy-to-access information available on the Internet can greatly extend the limited, mostly inward-looking scope of enterprise BI systems, if that information stream can be automatically related to a semantic model of the entities, relationships, and forces that make up the competitive ecosystem in which a user's organization operates. Our working hypothesis is that a corporate radar system that (1) has even a relatively simple representation of the entities and relationships that make up a company's competitive ecosystem, and (2) can process online data (especially unstructured text) well enough to detect weak signals of events that might impact that ecosystem, can provide the company's decision makers with a valuable means to spot threats and opportunities more quickly and consistently than they otherwise would.

To begin exploring this hypothesis, we've developed a prototype corporate radar kit called the Business Event Advisor, which can be used to create customized solutions to monitor any company's external business environment in a way that is loosely analogous to the way in which enterprise BI systems monitor internal operations. Figure 1 depicts the conceptual framework we've been developing. (Note that to simplify formatting, all figures have been placed at the end of the text.)

The high-level concept is to consume several forms of Internet data, including both unstructured text and other data in a variety of structured forms, and to run it through an engine that turns this cacophonous data stream into a regularized stream of structured event types that can be used to generate alerts, populate a decision-support portal, or be integrated with enterprise data systems. The output stream includes both events that are directly detected and the output of the event interpretation module, which generates those events and implications that might be indirectly signaled by the directly detected events. Detection and interpretation are, as we shall discuss, guided by a set of hand-engineered models. (Keep in mind that Figure 1 represents a somewhat broader vision than has currently been implemented. For example, we have so far focused entirely on unstructured text, particularly news and press releases. We have implemented prototypes of both the detection and interpretation modules, but they are driven by only the first three models depicted; the system currently has no explicit model of the information sources, and treats all sources as equally important. Finally, while we do generate alerts and populate a portal-like GUI, we have not integrated our output with any enterprise data systems.)

Of course, there are many technical and logistical challenges in doing this type of analysis. As we set out to attack this problem, we determined that many of the individual pieces of the puzzle already existed, either as research prototypes or, in some cases, commercial products. For instance, off-the-shelf solutions existed to capture news feeds, find news related to certain keywords, classify texts into meaningful categories, and (to some extent) extract the relevant event parameters from unstructured text and build semantic representations. Within Accenture's own laboratory were several useful components developed in the context of earlier research projects, including systems for measuring online buzz and the public's sentiment associated with a certain topic, finding relevant quotes about various business topics, and pulling down product price information from online merchants. Our ambition in creating the Business Event Advisor prototype has been to develop an integration engine that pulls the right components together and helps a user make sense of the results. Our primary focus has been on exploring how the various technologies required to go from raw data to business-relevant event descriptions can be made to fit together around an appropriate semantic model, rather than on inventing new algorithms to accomplish the component tasks.

Our system is still a rather early-stage prototype. For instance, we have yet to integrate many of the most promising assets from within our own lab. Furthermore, the system has not yet been tested at scale or rolled out to a pilot user base. But a small-scale proof-of-concept version of the system does run; it generates structured descriptions of some of the events relevant to a particular business that are being reported in unstructured news stories on the Web, and it produces an organized portal containing both those reported events and potential implications for the customer's business. In this paper we describe the overall application vision we are working toward, the approach we are taking, and some of the challenges faced and lessons learned.

2. INSPIRATION
An episode recounted in a recent Fortune Magazine story (Vogelstein 2005) about Microsoft and Google was part of what inspired this project, and speaks to the long-term ambition of our work. The episode involved Bill Gates' use of the Web to enhance his insight into the serious nature of the threat that Google could pose to Microsoft. While poking around on the Google corporate website, Gates glanced at the listing of open positions. When he did, he was gripped by a stark and worrisome realization. Google, which Gates had pigeonholed as essentially a search company, was recruiting for all kinds of expertise that had nothing to do with search. In fact, Gates noted, Google's recruiting goals seemed to mirror Microsoft's! Google seemed to be planning a future in which it occupied much of the turf that Microsoft now dominates. It was time to make defense against Google a top priority for Microsoft.

This story about using the Web to recognize a competitive threat hints at how useful the Net can be for that purpose, but it also illustrates how random and unsystematic the process of developing such Internet-derived insight still generally is. The dominant mode of information gathering still involves individuals browsing or searching on keywords related to issues that happen to occur to them. An automatic data-mining system that helps users more consistently detect and interpret events relevant to their businesses must deal with three inter-related realities of Web-based business information. 1) As we have mentioned, the information signaling events that are relevant to a user may be in broadly varied formats, including various forms of unstructured text. 2) The information will likely be spread across a large number of high-volume sources, requiring a quick means of filtering the relevant from the rest, and of coordinating disparate signals to detect potential meaning. 3) Once translated into a standardized, structured form, relevant information will still often be in the form of weak signals that require the application of business knowledge to interpret; only a system that models Microsoft's current niche and product mix would be able to detect the relevance of Google's recruiting priorities. Without such a model, a system cannot analyze the indirect relationships between the events it detects and the business objectives of the company it seeks to inform, leading it either to ignore important events or to cast its net too broadly.

3. USING HAND-ENGINEERED MODELS TO DRIVE EVENT INTERPRETATION
While the above story provided inspiration, and we would eventually like to extend our system to automate this kind of insight from want-ads, want-ads were not actually our first priority, so we have not pointed our prototype at want-ads or modeled job skills. Instead, we have focused our early effort on news stories and the market events described in them. To understand how semantic models of the competitive ecosystem are used to drive processing in our prototype, consider a simple, abstract example that is closer to what the system currently does. Imagine that you manage a manufacturer and attempt to use the Net as a radar by running a system that monitors news stories and price data. If your radar is merely able to notice that one of your competitors has lowered its prices, this may be of some value, though it might be too late to react once the threat is that immediate. At any rate, company personnel are likely already to have noticed something so directly relevant to the business. Now suppose instead that it's the price of a raw material that has changed, rather than the price of widgets. Perhaps it's a raw material that your company doesn't use in any of its products. Such price shifts happen all the time, and humans trying to track and interpret all of them may quickly become overwhelmed.

But if this raw material is used by one of your competitors, the price change might have an important, though indirect, impact on your business. Suppose we have a system with a model describing who your competitors are, which of their products compete with which of yours, and what raw materials each manufacturer uses in each product. A relatively simple model like that, combined with basic rules about cost/price relationships, can enable a corporate radar to see that although you don't use the raw material in question, a drop in the price of that material may mean that your competitor can lower its prices on widgets, thereby putting price pressure on you (see Fig. 2; a minimal sketch of such a model appears at the end of this section).

As we have illustrated above, our approach relies on a set of hand-engineered semantic models of the entities and relationships that make up the competitive ecosystem in which a particular company operates. It also relies on models of the event types that can impact that ecosystem, and a rule base representing a (highly simplified) causal model relating events to each other. We expect that this reliance on hand-engineered models will raise alarm bells for some readers within the KDD community, where it appears to be a common premise that the automated discovery of models via data-driven statistical techniques has superseded the use of hand-engineered rules and representations, and that hand-built models are essentially an out-of-date technique. It is thought by some that hand-built models are too expensive to be of use in real-world systems. We don't buy this view; we see the use of statistical techniques and hand-engineered models as complementary. As the statistical techniques mature, we believe that progress will increasingly come from the development of techniques to marry those techniques to sophisticated knowledge representations.

In considering this position, it is worth noting, for instance, that within enterprise IT systems it is human-engineered models, developed by business analysts, that drive the vast bulk of data processing. Before a business process is automated, it is explicitly modeled by human business analysts. These models form the basis of communication between business owners, developers, and sometimes end users. It is true that creating these models is a painstaking process, but not an impossible one; furthermore, we appear to be a rather long way from any sort of automated system that could replace the business analyst's hand-engineered models of either the as-is or to-be state of a business process. Until recently, the models developed by business analysts typically began as diagrams created in a drawing program, designed for human consumption, and were often not made machine readable until a programmer translated the process diagrams into code. So there was no explicit, machine-readable model on which any meta-reasoning about the process, or automated analysis of process-related events, could be done. But with the emergence of business-modeling tools and standard languages for representing these processes (such as BPML/BPMN and BPEL), we are beginning to see more business processes modeled explicitly, so that these models can be used directly to orchestrate the services that implement the business processes they represent, and to serve as the basis upon which BI systems interpret data in terms of the underlying business process.

Just as business analysts create explicit models of internal business processes, industry analysts create models of the entities, events, and relationships that make up a company's external environment, to support business owners' (and investors') understanding of the influence that various forces may have on a business. These models are, by and large, still created for human consumption in diagram-drawing tools (or, in certain limited cases, for processing by special-purpose forecasting engines), and there has not been as much motivation to create modeling tools and languages for representing external relationships in machine-readable form. However, with the explosion in the amount of data about external events now available online, we expect to see a rise in the demand for systems that make sense of that data, and we expect the tools that industry analysts use to model market and competitive processes to follow a path similar to that of business-process modeling tools; that is, we expect tools for creating more explicit, machine-interpretable semantic models of the competitive forces influencing a company to emerge. We consider the prototype GUI business-environment modeling tools that are part of the Business Event Advisor to be a step in that direction.

Of course, we welcome the notion that automated techniques might be able to enhance or even eventually replace the hand-engineered models we currently use. For instance, an interesting topic for future research would be to explore how the system might generate rules automatically, based on emerging patterns of directly detected events. It is possible to imagine, for instance, that repeating patterns of product introduction events coming from a competitor's supplier, followed by product feature change events coming from that competitor, might be recognized by a future version of the Business Event Advisor and automatically used to create a rule whereby product introductions imply feature changes in appropriately related entities. A rule-learning system of this type could not only dramatically reduce the burden of initial configuration, but could also improve the performance of a particular application of the system over time, with new implied events emerging as recurring event patterns are detected and analyzed. However, there seem to us to be very significant hurdles to achieving this. Beyond the standard complexities of engineering the learning techniques, there is a problem with the quality of the available data. The data available online seems to be rich enough to provide useful alerts based on a model, but may not yet be rich or consistent enough to allow for automatic learning of the model itself. In the shorter term, we expect that the labor involved in modeling will be reduced more by sharing model components, or by automatically incorporating them from existing databases, than by learning them directly from underlying data. For instance, there are structured information sources, available on a commercial basis, that provide business information such as market caps, management names, and certain competitive relationships. Our model-building tools are capable of tentatively populating some portions of the needed models automatically by drawing on this data. However, it is still necessary for the analyst to review those automatically retrieved components, and to build many from scratch, because the actual business relationships tend to be more complex than what is captured in these databases.

One more point on this topic: keep in mind that our goal is to advise a decision maker, not to produce a system that fully automates the reasoning process. This places a greatly reduced burden on the model builder. The system does not need to maintain certainty in its reasoning, or even to determine probabilities. The goal is merely to alert a human decision maker to events and implications that might be relevant.
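To make the ecosystem-model idea concrete, here is a minimal sketch, in Python, of how the widget example above might be represented and how a single cost/price rule could flag an implied threat. This is our illustration under the example's assumed names (Competitor Y, Product Z, raw material X), not the Business Event Advisor's actual model format.

from dataclasses import dataclass, field

@dataclass
class Product:
    name: str
    raw_materials: set = field(default_factory=set)
    competes_with: set = field(default_factory=set)   # names of rival products

@dataclass
class Company:
    name: str
    products: list = field(default_factory=list)

# Hand-built ecosystem model for the abstract example in the text.
competitor_y = Company("Competitor Y", [Product("Product Z", {"raw material X"}, {"Product N"})])
your_company = Company("Your Company", [Product("Product N")])

def interpret_price_drop(material, competitors, focus_company):
    """Simplified cost/price rule: a cheaper input for a rival's competing product
    implies possible price pressure on the focus company's own product."""
    implications = []
    for rival in competitors:
        for rival_product in rival.products:
            if material in rival_product.raw_materials:
                for own_product in focus_company.products:
                    if own_product.name in rival_product.competes_with:
                        implications.append(
                            f"{rival.name} may lower the price of {rival_product.name}, "
                            f"putting price pressure on {own_product.name}")
    return implications

print(interpret_price_drop("raw material X", [competitor_y], your_company))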

4. HOW OUR PROTOTYPE WORKS


Our prototype includes a simple version of this type of GUI modeling tool (as mentioned above), designed to enable an industry analyst to create these models for a customer, and a run-time text-and-data-mining engine that uses those models to drive the detection and interpretation of events relevant to the customer's business. Detection involves capturing relevant data from the Net and using lightweight NLP to convert unstructured portions of that data, such as news reports, into structured event descriptions. What we call interpretation involves a rules engine that infers potential business implications from appropriate patterns of detected events. By combining these technologies, the prototype can provide a structured view of both the relevant events that are being reported on the Web and important possibilities that are not being explicitly reported: events that represent plausible but uncertain inferences based on the weak signals that have been detected.

To get more concrete about what we've built, let's tour an example application of our Business Event Advisor, designed for an executive of a manufacturing company such as the Ford Motor Company. An executive using our system can navigate around the competitive ecosystem to view displays of events affecting any portion of that ecosystem. For instance, the user might choose to view a summary of all events involving any of his company's suppliers. This view (Figure 3) groups events according to the suppliers they involve, categorizes events into one of several predefined event type classes, assigns an importance level to each event, and presents key information about each event (extracted from the source text from which each event was detected). The system produces this display by scanning all the data broadcast on its monitored news sources, filtering out stories that don't mention at least one of the entities in the ecosystem model, classifying the events described in the remaining stories into the categories defined in the event model, and extracting as many of the event parameters as it can. The system also derives implied events from the directly detected events, and displays these implied events in the same fashion, marked with a special symbol.

In Fig. 3 we see a number of events that involve Ford's suppliers. Each event's classification is displayed on the first line of the event; this information is also represented by a colored bar at the left edge of each event, where different colors represent different event types. The bars on the right side of each event box represent the system's estimate of the importance level assigned to that event (based on the event type and the entities involved). Events with white diamonds in their colored bars are implied; that is, they are events that the system has inferred from one or more directly detected events. The user can zoom in on any event by clicking on its colored bar. This pops up a window that provides more information about that event. Fig. 4 shows a pop-up that displays additional information about a new product introduction by Denso, one of Ford's suppliers. This window displays the event type, estimated importance level, and whatever event parameters (as specified by the system's event model) the system was able to extract from the text describing the event. The buttons in the top right corner allow the user to correct the event type or any of the extracted pieces of information, should the system process an event incorrectly. This also provides additional training data for future rounds of category learning.

We believe that this approach to improving classification through end-user feedback could provide a valuable method, in the spirit of Web 2.0, of harnessing end-user effort to produce a system that captures knowledge from a broad community to accomplish what would otherwise be an arduous training exercise. If several news sources had contributed to an event, the user could view a summarized version of the text from each source by clicking the Next and Prev buttons, or could view the original web-based news story by clicking on the View Source button.

5. INFERRING UNSEEN EVENTS


The last text box on the pop-up displays descriptions of any implied events that the system has inferred from this directly detected event. In this case, the system realizes that new products coming from one of the user's suppliers mean that the user's company may be able to expand its product line or change features on its existing products. It's worth noting that the Business Event Advisor's rules consider the relationship between the entity involved in a detected event and the application's focus entity (in this case, Ford) when inferring events from the detected event. For example, a product introduction involving one of Ford's competitors would produce an entirely different set of implied events than would a product introduction involving one of Ford's suppliers. The former might imply that Ford could face price pressure on its products, while the latter might suggest that Ford could expand its product range or change features on existing products.

Because these relationships represent weak signals, not logical implications, many of the implied events generated by the system will turn out to be false alarms. The appropriate interpretation of the system's rules is as a suggestion that the implied event could be happening, not that it actually is. In the end, it's up to the user to decide how likely the implication actually is; however, the system helps the user make that judgment by reporting how many other signals of the same implication have been detected. A weak signal combined with a number of others that point in the same direction can form a pattern to which a user may give much more credence than a lone signal without corroboration. In the next version of the system, users will be able to view a collection of all the signals that point to a particular implication, in order to decide how strong the overall pattern is. In more distant future systems we would like the system itself to weigh the evidence, but at this point we consider that to be the user's job; the system's job is to help the user see the relevant patterns, not to make the final assessment of likelihood.

6. APPROACH AND CHALLENGES


As we have illustrated, we've defined the job of a corporate radar to be monitoring a set of information streams, distilling their varied, often unstructured content into a series of structured descriptions of business-relevant events, and identifying potential implications of these events for the user's organization. We will now look briefly at the main stages of the processing pipeline implemented by our prototype, and at how the overall architecture challenges the data-mining technology we've been using. The major stages of the pipeline are as follows:
1. Data Capture: grab the raw information from the Net.
2. Model-Based Filtering: throw out texts that don't contain references to any entities in our ecosystem model.
3. Model-Driven Event Classification: classify each article into one of the categories defined in the event model.
4. Determination of Event Parameters: extract key pieces of information and assign roles within an event template.
5. Event Implication Inference: generate implied events from the directly detected events by running the system's rules engine on the event descriptions generated by the previous step.

The data-capture challenge involves engineering more than any deep theory. Ads and white space need to be stripped, links need to be followed, and other data cleaning needs to be performed. The easiest source for capturing raw data is RSS feeds, and this is the data source we've worked with the most. These feeds offer the benefit of breaking out titles, and sometimes isolated summary paragraphs; this light structure can facilitate reliable classification. Unfortunately, not everything is found in RSS feeds, and much data can be captured only with painstaking, brittle scraping of pages not designed for computer consumption. We experimented with using scraping routines to capture a much broader range of news stories. When integrating these into our system, we discovered another challenge: the form of the texts captured by our RSS reader and those captured by our scrapers are sufficiently different (the latter being much more detailed, for instance) that our classification techniques had trouble working on both at the same time; a classifier tuned and trained to work fairly well on the feeds broke when given the full-text stories captured by our scrapers. It is possible to engineer a solution to this problem: we could divert the streams to different classifiers depending on their source, or, with enough training data, perhaps a single classifier could handle both types of streams. But this challenge is the tip of a larger iceberg that must be navigated when creating these types of systems: we'd like to take information from a wide range of sources, cleanse it, and send it downstream for classification and parameter extraction without the downstream modules knowing or caring about the source of the captured data. However, we realize that this may not be a realistic form of modularization for current text-processing techniques; systems that handle very heterogeneous input sources are likely to stress classifiers, which work best with less variance within classes. In any case, because the data capture module is network-bound, the speed with which it consumes events from the network is unpredictable. In addition, care needs to be taken to buffer the module against network outages or server delays. As these considerations suggest, the module is most effectively run in batch mode at a time of day when the data sources traditionally experience light traffic. The system, in other words, is updated daily, not in real time.
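For readers who want a concrete picture of the capture stage, here is a minimal sketch of batch RSS capture using the feedparser library; the feed URL and field handling are illustrative assumptions, not the authors' actual implementation.

import feedparser

def capture_feed(feed_url):
    """Grab title and summary text from an RSS feed, the lightly structured
    source described above, for downstream filtering and classification."""
    parsed = feedparser.parse(feed_url)
    items = []
    for entry in parsed.entries:
        items.append({
            "title": entry.get("title", ""),
            "summary": entry.get("summary", ""),  # not every feed provides one
            "link": entry.get("link", ""),
        })
    return items

# Run in batch mode, e.g. once per night per monitored source:
# stories = capture_feed("http://example.com/business-news.rss")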
We scan the data with hand-engineered regular expressions that are stored in the ecosystem model, and remove events that don't contain any of the expressions associated with any entities defined in the model. After this filtering is done, the input stream is cleaned of any duplicate articles. These two steps shrink the input stream enormously, and take only a negligible amount of time to complete.
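A minimal sketch of this model-based filtering and de-duplication step (our illustration; the entity names and regular expressions are invented, not taken from the authors' ecosystem model):

import re

# Hypothetical entity regexes of the kind stored in the ecosystem model.
ENTITY_PATTERNS = {
    "Ford": re.compile(r"\bFord\b", re.IGNORECASE),
    "Denso": re.compile(r"\bDenso\b", re.IGNORECASE),
}

def filter_and_dedupe(stories):
    """Keep only stories mentioning a modeled entity, then drop duplicates."""
    kept, seen = [], set()
    for story in stories:
        text = story["title"] + " " + story["summary"]
        if not any(p.search(text) for p in ENTITY_PATTERNS.values()):
            continue                     # no modeled entity mentioned: irrelevant
        key = story["title"].strip().lower()
        if key in seen:
            continue                     # duplicate article
        seen.add(key)
        kept.append(story)
    return kept

# Example: only the first two stories survive.
stories = [
    {"title": "Denso introduces new sensor", "summary": "..."},
    {"title": "Ford expands plant", "summary": "..."},
    {"title": "Ford expands plant", "summary": "..."},      # duplicate
    {"title": "Unrelated market news", "summary": "..."},   # no modeled entity
]
print(len(filter_and_dedupe(stories)))   # 2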

Once the filtering stage is complete, the surviving input is classified into one of the event types defined in the event model (e.g., hire, product recall, or award). We currently use Rainbow, an open-source natural language processing application, as our classification engine. Rainbow makes it fairly easy to test different classification algorithms. We were surprised to find that (with our admittedly small training sets) the Naive Bayes algorithm produces unacceptably low rates of recall (too many events are left unclassified). The Probabilistic Indexing algorithm (described in Fuhr 1989) produces much higher recall rates (around 80%) and tolerable precision rates (around 65%) when events are classified into one of seven event types. With more or longer training examples, Naive Bayes might produce superior results. With all algorithms tested, we use stemming and a 524-word stoplist to improve classification results. Rainbow classifies thousands of items per minute, so performance is not a concern.

After texts are classified, we use the event type model to drive the determination of the event's key parameters. Some of those parameters (such as date, or, as described earlier, event importance) can be computed or estimated, but most must be extracted from the text. First, we use a combination of commercial and custom-coded named entity recognition software to collect instances of various data types within the event text (e.g., personal names, job titles, and dates). ClearForest Tags is the commercial package; although its recognition algorithms are proprietary, it seems to use a strategy that augments lexical lookup with some amount of syntactic parsing. Our own named entity recognition code, used to supplement the results of the commercial system, is based purely on regular expression matching. We chose this approach because the simple regular expressions we use are easy to understand, easy to maintain, and provide more than adequate performance. Named entity recognition presents the greatest performance challenge. For the sake of simplicity, we currently search for all possible entity types in each story, rather than pass information into the extraction system about what types of entities to look for. However, since classification has already occurred before we begin extraction, it would be possible to search only for the relevant entity types. This approach would require a significantly tighter integration between the extraction module and the rest of the system, which would add complexity, but as the system scales up, performance demands may increase to the point where this approach becomes necessary.

Next, the system determines which instances of the named entity types should be used as values for the attributes associated with the event's type. For example, a hire event would need to have the new-employer attribute filled with a single company name, even if the event text refers to two or more companies. Accomplishing this role assignment makes heavy use of matching patterns associated with a specific event type (for instance, specifying terms that will typically precede or follow the entity that fills a particular role). Consistent with the work reported in forums such as the Message Understanding Conference proceedings, we've found no way to avoid this custom-tailored pattern matching. The simplicity and speed of our pattern matching scheme made it a useful choice for the proof-of-concept implementation, but a more sophisticated custom approach might produce better accuracy.
Once the input has been translated into a structured event description, the final information-processing step is to generate implications using a simple conceptual inference engine of the sort originally associated with Charles Rieger (see, for instance, Schank and Rieger) and others in artificial intelligence. Rules created with our graphical model-building tools contain rule-firing conditions specifying event types and constraints that must be met by key event attributes. When a rule's criteria are met, an implied event is generated and populated with attribute values as specified in the rule. Any implied events that are created are fed back into the processing loop, where they are processed just like directly detected events. The inference mechanism currently deployed in the system is extremely simple, and supplies purely qualitative insights without any quantitative estimate of probability. Furthermore, we currently have no inference-control heuristics implemented, so we cut the system off at one level removed from directly detected facts. While this conceptual inference technique has been in the AI toolkit for quite some time, we think it is exciting that the Internet (along with the text- and data-mining techniques required to turn the unstructured information on the Net into structured event descriptions) is now finally able to provide large amounts of data to exercise these engines. We believe that integrating these symbolic AI techniques with the results of text and data mining is a promising approach to creating a next generation of business-aware Internet processors. A system that is still in the pre-pilot stage and has not been tested at scale can only be suggestive; it certainly can't be said to prove any hypotheses. Nevertheless, we believe that what our early prototype suggests is quite exciting: putting these pieces together can lead to a system that provides practical business value by analyzing business-relevant events being reported on the Net in a way that the individual components could not do on their own.
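To illustrate the rule-firing and feedback scheme just described, here is a small sketch of ours with invented event types and a single rule; the real system uses its graphical model-building tools and richer attribute constraints.

def run_inference(detected_events, rules, max_depth=1):
    """Fire rules on detected events and feed implied events back in,
    cutting off one level from directly detected facts, as in the text."""
    all_events, frontier = list(detected_events), list(detected_events)
    for _ in range(max_depth):
        implied = []
        for event in frontier:
            for rule in rules:
                if (rule["event_type"] == event["type"]
                        and rule["relationship"] == event["relationship"]):
                    implied.append({
                        "type": rule["implied_type"],
                        "relationship": "self",
                        "implied": True,
                        "source": event,
                    })
        all_events.extend(implied)
        frontier = implied   # would drive deeper inference if max_depth > 1
    return all_events

# One illustrative rule: a supplier's product introduction implies a possible
# product-line expansion for the focus company.
rules = [{"event_type": "product_introduction", "relationship": "supplier",
          "implied_type": "possible_product_line_expansion"}]
detected = [{"type": "product_introduction", "relationship": "supplier",
             "entity": "Denso", "implied": False}]
for e in run_inference(detected, rules):
    print(e["type"], "(implied)" if e.get("implied") else "(detected)")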

Fig. 1. The conceptual framework for the Business Event Advisor

Fig. 2. Inference derived from an event not directly related to you. (Diagram: a significant change in the price of raw material X, used by Competitor Y in its Competing Product Z, implies the potential threat that Competitor Y may be about to lower the price of Product Z, cutting into your market share on Product N.)

Fig. 3. A summary of events relevant to Ford's suppliers.

Fig. 4. A pop-up window displaying more information about a Denso product introduction event

REFERENCES
ALLEN, J. 1995. Natural Language Understanding. Benjamin/Cummings, Redwood City, CA.
DOMINGOS, P. AND PAZZANI, M. 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29, 103-137.
FUHR, N. 1989. Models for retrieval with probabilistic indexing. Information Processing and Management 25(1), 55-72.
MANNING, C. AND SCHUTZE, H. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
Proceedings of the 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998. http://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_toc.html
SCHANK, R.C. AND RIEGER, C.J. 1974. Inference and the computer understanding of natural language. Artificial Intelligence 5(4), 373-412.
VOGELSTEIN, F. 2005. Gates vs. Google: Search and Destroy. Fortune 151(9). http://money.cnn.com/magazines/fortune/fortune_archive/2005/05/02/8258478/index.htm

Forecasting Online Auctions using Dynamic Models


Wolfgang Jank
Department of Decision and Information Technologies Robert H. Smith School of Business University of Maryland

Galit Shmueli
Department of Decision and Information Technologies Robert H. Smith School of Business University of Maryland

Shanshan Wang
Statistics Program Department of Mathematics University of Maryland

shanshan@math.umd.edu

wjank@rhsmith.umd.edu

gshmueli@rhsmith.umd.edu
ABSTRACT
We propose a dynamic model for forecasting price in online auctions. One key feature of our model is that it operates during the live auction. Another is that it incorporates price dynamics, in the form of a price's velocity and acceleration, via the use of novel functional data methodology. We illustrate our model on a diverse set of eBay auctions across many different book categories and find significantly higher prediction accuracy compared to standard approaches.

1. INTRODUCTION
eBay (www.eBay.com) is the world's largest Consumer-to-Consumer (C2C) online auction house. On eBay, an identical (or near-identical) product is often sold in numerous, often simultaneous auctions. For instance, a simple search under the key words "iPod shuffle 512MB MP3 player" returns over 300 hits for auctions that close within the next 7 days. A more general search under the less restrictive key words "iPod MP3 player" returns over 3,000 hits. Clearly, it would be challenging, even for a very dedicated eBay user, to make a purchasing decision that takes into account all of these 3,000 auctions. The decision-making process can be supported via price forecasts. Given a method to predict the outcome of an auction ahead of time, one could create an auction ranking (from lowest predicted price to highest) and select for further inspection only those auctions with the lowest predicted price. One of the difficulties with such an approach is that information in the online environment changes at every moment: every minute, new auctions enter the market, closed auctions drop out, and even within the same auction the price changes constantly with every new bid. Thus, a well-functioning forecasting system has to be adaptive to change. We propose a dynamic forecasting model that can adapt to certain aspects of this changing environment.

Price forecasts can generally be produced in two different ways, static and dynamic. A static model relates information that is known before the start of the auction to information that becomes available after the auction closes. This is the basic principle of some of the existing proposed models [Ghani and Simmons, 2004, Ghani, 2005, Lucking-Reiley et al., 2000, Bajari and Hortacsu, 2003]. For instance, one could relate the opening bid, the auction length and a seller's reputation to the final price. Notice that opening bid, auction length, and seller reputation are all known at the auction start. Training a model on a suitable set of past auctions, one can obtain static forecasts of the final price in that fashion. However, this approach does not take into account important information that arrives only during the auction. The number of competing bidders right now, or the current price level, are factors that are only revealed during the ongoing auction and that are important in determining the future price. Moreover, the current change in price can also have a huge impact on the future price. If, for instance, the price had increased at an extremely fast rate over the last several hours, causing bidders to drop out of the bidding process or to revise their bidding strategies, then this could have an immense impact on the evolution of price in the next few hours and, subsequently, on the final price. We refer to models that account for newly arriving information, and for the rate at which this information changes, as dynamic models.

Forecasting price in online auctions dynamically is challenging for a variety of reasons. Traditional methods for forecasting time series, such as exponential smoothing or moving averages, cannot easily be applied to the auction context, since bidding data arrive at very unevenly spaced time intervals, which is not amenable to these methods. Moreover, online auctions, even for the same product, can experience price paths with very heterogeneous price dynamics [Jank and Shmueli, 2005, Shmueli and Jank, 2006]. By price dynamics we mean the speed at which price travels and the rate at which it changes. Traditional models do not account for instantaneous change and its effect on the price forecast. This calls for new methods that can measure and incorporate this important information. In this work we propose a new approach for forecasting price in online auctions. The approach allows for dynamic forecasts in that it incorporates information from the ongoing auction. It overcomes the uneven spacing of the data, and also incorporates change in the price dynamics. Our forecasting approach is housed within the principles of functional data analysis [Ramsay and Silverman, 2005]. In Section 2 we briefly explain the principles of functional data analysis and derive our functional forecasting model in Section 3. We apply our model to a set of bidding data for a variety of book auctions in Section 4.

2. FUNCTIONAL DATA MODELS

A functional data set consists of a collection of continuous functional objects, such as the price curves in an online auction. Despite their continuous nature, limitations in human perception and measurement capabilities allow us to observe these curves only at discrete time points. Thus, the first step in a functional data analysis is to recover, from the observed data, the underlying continuous functional object [Ramsay and Silverman, 2005]. This is usually done with the help of smoothers. A variety of different smoothers exist. One very flexible and computationally efficient choice is the penalized smoothing spline, a piecewise polynomial spline of order p whose parameters are estimated using a roughness penalty in order to control local variability. The result is a curve that balances data fit and smoothness [Ruppert et al., 2003]. Smoothing splines are not the only option; alternatives include monotone splines and kernel methods [Ramsay and Silverman, 2005]. Consider Figure 1 for illustration. The circles in the top panel of Figure 1 correspond to a scatterplot of bids (on the log scale) versus their timing. The continuous curve in the top panel shows a smoothing spline of order m = 4 using a smoothing parameter of 50.
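As an illustration of this smoothing step, the short sketch below recovers a smooth log-price curve and its first two derivatives from a handful of made-up bids using scipy's UnivariateSpline. It is a hedged stand-in for the penalized smoothing spline described above, not the authors' implementation; the bid times, prices and smoothing factor are invented.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Made-up bid history for one 7-day auction: bid times (days) and bid amounts ($).
bid_times = np.array([0.05, 0.4, 1.2, 1.9, 4.6, 5.8, 6.7, 6.95, 6.99])
bids = np.array([1.0, 4.5, 9.0, 12.5, 14.0, 21.0, 28.5, 33.0, 36.0])

# Smoothing spline of the log-price curve. UnivariateSpline's smoothing factor s is
# not the same roughness penalty as in the paper, and k=3 (cubic) stands in for the
# paper's order m = 4 spline; both choices are for illustration only.
spline = UnivariateSpline(bid_times, np.log(bids), k=3, s=0.5)

grid = np.linspace(0, 7, 200)
price_curve = spline(grid)                 # recovered continuous curve y_t
velocity = spline.derivative(1)(grid)      # D^(1) y_t: price velocity
acceleration = spline.derivative(2)(grid)  # D^(2) y_t: price acceleration
print(price_curve[-1], velocity[-1], acceleration[-1])
```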
[Figure 1 about here: three panels for a selected auction showing the current price (LogPrice, roughly 3.3 to 3.7, versus day of auction), the price velocity (first derivative of LogPrice, roughly 0.00 to 0.15), and the price acceleration (second derivative of LogPrice).]

Figure 1: Current price, price velocity (first derivative) and price acceleration (second derivative) for a selected auction. The first graph shows the actual bids together with the fitted curve.

One of our modeling goals is to capture the dynamics of an auction. While the smoothing spline y_t describes the magnitude of the current price, it does not reveal how fast the price is changing or moving. Attributes that we typically associate with a moving object are its velocity (or speed) and its acceleration. Notice that we can compute the price velocity and price acceleration via the first and second derivatives, D^(1) y_t and D^(2) y_t, respectively. Consider again Figure 1. The middle panel corresponds to the price velocity, D^(1) y_t; the bottom panel shows the price acceleration, D^(2) y_t. The price velocity has several interesting features. It starts out at a relatively high mark, which is due to the starting price that the first bid has to overcome. After the initial high speed, the price increase slows down over the next several days, reaching a value close to zero mid-way through the auction. A close-to-zero price velocity means that the price increase is extremely slow. In fact, there are no bids between the beginning of day 2 and the end of day 4, and the price velocity reflects that. This is in stark contrast to the price increase on the last day, where the price velocity picks up pace and the price jumps up. The bottom panel in Figure 1 represents the price acceleration. Acceleration is an important indicator of dynamics, since a change in velocity is preceded by a change in acceleration: a positive acceleration today will result in an increase of velocity tomorrow, and conversely, a decrease in velocity must be preceded by a negative acceleration (or deceleration). The bottom panel in Figure 1 shows that the price acceleration is increasing over the entire auction duration, which implies that the auction is constantly experiencing forces that change its price velocity. The price acceleration is flat during the middle of the auction, where no bids are placed. With every new bid, the auction experiences new forces, and the magnitude of the force depends on the size of the price increment: smaller price increments result in a smaller force, while a large number of small consecutive price increments results in a large force. For instance, the last 2 bids in Figure 1 arrive during the final moments of the auction; since the increments are relatively small, the price acceleration is only moderate. A more systematic investigation of auction dynamics can be found in [Jank and Shmueli, 2005; Shmueli and Jank, 2006].

3. DYNAMIC FORECASTING MODEL

[Figure 2 about here: schematic of an ongoing auction with price (from $0 to $100) on the vertical axis and auction day on the horizontal axis, showing the observed price path up to the current time and three possible predicted price paths labelled A, B and C.]

Figure 2: Schematic of the dynamic forecasting model of an ongoing auction.

As pointed out earlier, the goal is to develop a dynamic forecasting model. Consider Figure 2 for illustration. Assume that we observe the price path from the start of the auction until time t (solid black line). We now want to forecast the continuation of this price path (broken grey lines, labelled A, B, and C).

The difficulty in producing this forecast is the uncertainty about the price dynamics in the future. If the dynamics level off, the price increase slows down and we might see a price path similar to A. If the dynamics remain steady, the price path might look like the one in B. And if the dynamics sharply increase, a path like the one in C could be the consequence. Either way, knowledge of the future price dynamics is a key factor.

Our dynamic forecasting model consists of two parts. First, we develop a model for the price dynamics. Then, using the estimated dynamics together with other relevant covariates, we derive an econometric model of the final price and use it to forecast the outcome of an auction. We want to point out that the modeling mechanics rely on well-established methods from the time-series literature (autoregressive models, in our case); the novelty lies in the use of price dynamics. In particular, standard time-series models do not take process dynamics into account; we overcome this deficiency by proposing a two-step approach. An alternative approach is principal differential analysis (PDA) [Ramsay and Silverman, 2005]. However, PDA is based on nonlinear operations, which can slow down practical implementation on large data sets; moreover, model identifiability and the inclusion of covariates are much more involved in PDA.

Let D^(m) y_t denote the m-th derivative of the price y_t at time t. We model the derivative curve as a polynomial in time t with autoregressive (AR) residuals,

    D^(m) y_t = a_0 + a_1 t + ... + a_k t^k + x(t)'β + u_t,    (1)

where x(t) is a vector of predictors (see below), β is a corresponding vector of parameters, and u_t follows an autoregressive model of order p,

    u_t = φ_1 u_{t-1} + φ_2 u_{t-2} + ... + φ_p u_{t-p} + ε_t,    ε_t ~ N(0, σ²).    (2)

We allow (1) to depend on the vector x(t), which results in a very flexible model that can accommodate different dynamics due to differences in the auction format or the product category.

After forecasting the price dynamics, we use these forecasts to predict the auction price over the next time periods, up to the auction end. Many factors can affect the price in an auction, such as information about the auction format, the product, the bidders and the seller. Let x(t) again denote the vector of all such predictors. Notice that x(t) could be the same as in equation (1) or it could be a superset. In our case, x(t) contains the book category and the shipping costs for the price-dynamics model, while for the price model it also contains bidder and seller ratings, the opening bid, and the number of bids (see Tables 1 and 2). Let d(t) = (D^(1) y_t, D^(2) y_t, ..., D^(p) y_t) denote the vector of price dynamics, i.e., the vector of the first p derivatives of y at time t. The price at t can be affected by the price at t-1 and potentially also by its values at times t-2, t-3, etc. Let l(t) = (y_{t-1}, y_{t-2}, ..., y_{t-q}) denote the vector of the first q lags of y_t. We then write the general dynamic forecasting model as

    y_t = x(t)'α + d(t)'η + l(t)'γ + ε_t,    (3)

where α, η and γ denote the parameter vectors and ε_t ~ N(0, σ_ε²).

4. EMPIRICAL RESULTS

Our data set is diverse and contains 768 eBay book auctions from October 2004. All auctions are 7 days long and span a variety of categories. Prices range from $0.10 to $999 and are, not unexpectedly, highly skewed. Prices also vary significantly across the different book categories. This data set is challenging due to its diversity in products and price. We use 70% of these auctions (538 auctions) for training; the remaining 30% (230 auctions) are kept as the validation sample. Our model-building investigations suggest that only the velocity is significant for forecasting price in our data, so we estimate model (1) only for m = 1. Using a quadratic polynomial (k = 2) in time t and predictor variables for book category (x_1(t)) and shipping costs (x_2(t)) results in an AR(1) process for the residuals u_t (i.e., p = 1 in (2)). The rationale behind using book category and shipping costs in model (1) is that we would expect the dynamics to depend heavily on these two variables. For instance, the category of antiquarian and collectible books typically contains items that are rare and that appeal to a market that is less price sensitive and strongly interested in obtaining the item. A similar argument applies to the shipping costs: shipping costs are determined by the seller and act as a hidden price premium. Bidders are often deterred by excessively high shipping costs, and as a consequence auctions may experience differences in price dynamics. Table 1 summarizes the estimated coefficients, averaged across all auctions in the training set.

Table 1: Estimates for the velocity model D^(1) y_t in (1). The first column indicates the part of the model design that the predictor is associated with. The third column reports the estimated parameter values and the fourth column the associated significance levels. Values are averaged across the training set.

Des    Predictor         Coef      P-Val
Time   t                -0.012     0.055
Time   t^2               0.004     0.041
x(t)   Intercept         0.041     0.004
x(t)   Book Category     1.418     0.038
x(t)   Shipping Costs    1.684     0.036
AR     u_t               1.442     -
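To make the first modeling step concrete, the sketch below estimates a velocity model in the spirit of (1) and (2): a pooled least-squares fit of a quadratic time trend plus one auction-level covariate, followed by an AR(1) fit on the within-auction residuals. The data, the covariate and all numeric values are made up; this is only an illustration of the two-step idea, not the authors' estimation procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: price-velocity curves for several auctions sampled on a
# common grid, each with one auction-level covariate (log shipping cost).
grid = np.linspace(0, 7, 29)
n_auctions = 40
shipping = rng.uniform(0.5, 2.5, n_auctions)

rows, y = [], []
for s in shipping:
    u = np.zeros(grid.size)
    for i in range(1, grid.size):                 # AR(1) noise with phi = 0.6
        u[i] = 0.6 * u[i - 1] + rng.normal(0, 0.01)
    vel = 0.12 - 0.03 * grid + 0.004 * grid**2 + 0.02 * s + u
    rows.append(np.column_stack([np.ones_like(grid), grid, grid**2,
                                 np.full_like(grid, s)]))
    y.append(vel)
X, y = np.vstack(rows), np.concatenate(y)

# Step 1: pooled least-squares fit of the quadratic trend plus covariate (model (1)).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 2: AR(1) coefficient estimated from the within-auction residuals (model (2)).
resid = (y - X @ coef).reshape(n_auctions, grid.size)
phi = np.sum(resid[:, 1:] * resid[:, :-1]) / np.sum(resid[:, :-1] ** 2)

print(coef.round(3), round(float(phi), 2))        # trend/covariate estimates and AR(1) phi
```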

After modeling the price dynamics, we estimate the price forecasting model (3). Recall that (3) contains three model components, x(t), d(t) and l(t). Among all reasonable price lags only the first lag is influential, so we have l(t) = y_{t-1}. Also, as mentioned earlier, among the different price dynamics we only find the velocity to be important, so d(t) = D^(1) y_t. The first two rows of Table 2 display the corresponding estimated coefficients. Notice that both l(t) and d(t) are predictor variables derived from price, either from its lag or from its dynamics. We also use 8 non-price predictor variables x(t) = (x_1(t), x_2(t), ..., x_8(t))'. Specifically, the first 3 predictor variables are time-varying: the average rating of all bidders until time t (which we refer to as the current average bidder rating at time t and denote x_1(t)), the current number of bids at time t (x_2(t)), and the current winner rating at time t (x_3(t)). We also consider 5 time-constant predictors: the opening bid (x_4(t)), the seller rating (x_5(t)), the seller's positive ratings only (x_6(t)), the shipping costs (x_7(t)), and the book category (x_8(t)), where x_i(t) again denotes the influence-weighted variables.

Table 2: Estimates for the price forecasting model (3). The first column indicates the part of the model design that the predictor is associated with. The third column reports the estimated parameter values and the fourth column the associated significance levels. Values are again averaged across the training set.


Des    Predictor                         Coef       P-Val
d(t)   Price Velocity D^(1) y_t           0.592     0.049
l(t)   Price Lag y_{t-1}                  4.824     0.044
x(t)   Intercept                          5.909     0.110
x(t)   Cur. Avg. Bidder Rating x_1(t)     0.414     0.012
x(t)   Cur. Number of Bids x_2(t)        -0.008     0.027
x(t)   Cur. Winner Rating x_3(t)          0.197     0.027
x(t)   Opening Bid x_4(t)                 0.051     0.031
x(t)   Seller Rating x_5(t)             -11.534     0.070
x(t)   Pos. Seller Rating x_6(t)          1.518     0.093
x(t)   Shipping Cost x_7(t)               0.008     0.215
x(t)   Book Category x_8(t)               3.950     0.107

Table 2 shows the estimated parameter values for the full forecasting model. It is interesting to note that book category and shipping costs have low statistical significance; the reason is that their effects have likely already been captured in the model for the price velocity. Also notice that the model is estimated on the log scale for better model fit: the response y_t and all numeric predictors (x_1(t), ..., x_7(t)) are log-transformed. The implication lies in the interpretation of the coefficients. For instance, the value 0.051 implies that for every 1% increase in the opening bid, the price increases by about 0.05%, on average.

We estimate the forecasting model on the training data and use the validation data to investigate its forecasting accuracy. To that end, we assume that for the 230 auctions in the validation data we only observe the price until day 6, and we want to forecast the remainder of the auction. We forecast the price over the last day in small increments of 0.1 days: from day 6 we forecast day 6.1, i.e., the price after the first 2.4 hours of day 7; from day 6.1 we forecast day 6.2, and so on until the auction end at day 7. The advantage of an incremental approach is the possibility of feedback-based forecast improvements: as the auction progresses over the last day, the true price level can be compared with its forecasted level, and deviations can be channelled back into the model for real-time forecast adjustments.

Figure 3 shows the forecasting accuracy on the validation sample, measured as the mean absolute percentage error (MAPE). The solid line in Figure 3 corresponds to MAPE for our dynamic forecasting model. We benchmark the performance of our method against double exponential smoothing, a popular short-term forecasting method which assigns exponentially decreasing weights as observations become less recent and also takes into account a possible (changing) trend in the data. The dashed line in Figure 3 corresponds to MAPE for double exponential smoothing. We notice that for both approaches MAPE increases as we predict further into the future. However, while for our dynamic model MAPE increases to only about 5% at the auction end, exponential smoothing incurs an error of over 40%. This difference in performance is somewhat surprising, especially given that exponential smoothing is a well-established (and powerful) tool in time series analysis. One reason for the underperformance is the rapid change in price dynamics, especially at the auction end. Exponential smoothing, despite its ability to accommodate changing trends in the data, cannot account for the price dynamics. This is in contrast to our dynamic forecasting model, which explicitly models price velocity. As pointed out earlier, a change in a function's velocity precedes a change in the function itself, so it seems only natural that modeling the dynamics makes a difference for forecasting the final price.

[Figure 3 about here: MAPE over the last auction day (horizontal axis roughly days 6.2 to 7.0, vertical axis roughly 0.1 to 0.4) for the dynamic forecasting model (solid line) and double exponential smoothing (dashed line).]
Figure 3: Mean absolute percentage error (MAPE) of the forecasted price over the last auction day.
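The incremental evaluation over the last auction day can be sketched as follows. The coefficients, the observed price path and the velocity values below are invented placeholders (they are not the estimates from Table 2); the point is only to show the rolling one-step-ahead forecast and the absolute-percentage-error calculation behind MAPE.

```python
import numpy as np

# Hypothetical fitted pieces of model (3), collapsed to scalars for one auction:
# a lag-1 price coefficient, a velocity coefficient, and a combined covariate effect.
gamma_lag, eta_vel, covariate_effect = 0.95, 0.50, 0.15

# Made-up validation auction: observed log price and estimated velocity over the
# last day, on the 0.1-day grid used in the paper.
steps = np.round(np.arange(6.1, 7.01, 0.1), 1)
obs_log_price = np.array([3.30, 3.32, 3.35, 3.37, 3.40, 3.45, 3.52, 3.60, 3.72, 3.90])
velocity = np.array([0.02, 0.03, 0.03, 0.04, 0.05, 0.07, 0.09, 0.12, 0.18, 0.25])

forecasts, last = [], 3.28                    # observed log price at day 6.0
for i, v in enumerate(velocity):
    forecasts.append(covariate_effect + gamma_lag * last + eta_vel * v)
    last = obs_log_price[i]                   # feedback: use the realized price next step
forecasts = np.array(forecasts)

# Absolute percentage errors on the price scale; MAPE averages these over all
# validation auctions at each step (here there is only one made-up auction).
ape = np.abs(np.exp(forecasts) - np.exp(obs_log_price)) / np.exp(obs_log_price)
for d, e in zip(steps, ape):
    print(f"day {d:.1f}: APE = {e:.1%}")
```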

5. CONCLUSIONS

In this paper we develop a dynamic price forecasting model that operates during the live auction. Forecasting price in online auctions can benefit different auction parties. For instance, price forecasts can be used to dynamically score auctions for the same (or a similar) item by their predicted price. On any given day there are several hundred, or even several thousand, open auctions available, especially for very popular items such as Apple iPods or Microsoft Xboxes. Dynamic price scoring can lead to a ranking of the auctions with the lowest expected price which, subsequently, can help bidders decide which auctions to participate in. Auction forecasting can also be beneficial to the seller or the auction house. For instance, the auction house can use price forecasts to offer insurance to the seller. This is related to the ideas of Ghani [2005], who suggests offering sellers an insurance that guarantees a minimum selling price; to do so, it is important to forecast the price correctly, at least on average. While Ghani's method is static in nature, our dynamic forecasting approach could potentially allow more flexible features such as an "Insure-It-Now" option, which would allow sellers to purchase insurance either at the beginning of the auction or during the live auction (coupled with a time-varying premium). Price forecasts can also be used by eBay-driven businesses that provide brokerage services to buyers or sellers.

A final comment: for dynamic forecasting to work in practice, it is important that the method be scalable and efficient. All components of our model are based on linear operations: estimating the smoothing spline and fitting the AR model are both done in ways very similar to least squares. In fact, the total runtime (estimation on the training data plus validation on the holdout data) for our dataset is less than a minute, using program code that is not (yet) optimized for speed.

6. REFERENCES

Bajari, P. and Hortacsu, A. (2003). The winner's curse, reserve prices and endogenous entry: Empirical insights from eBay auctions. Rand Journal of Economics, 3:2, 329-355.
Ghani, R. (2005). Price prediction and insurance for online auctions. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, 2005.
Ghani, R. and Simmons, H. (2004). Predicting the end-price of online auctions. In Proceedings of the International Workshop on Data Mining and Adaptive Modelling Methods for Economics and Management, Pisa, Italy, 2004.
Jank, W. and Shmueli, G. (2005). Profiling price dynamics in online auctions using curve clustering. Technical report, Smith School of Business, University of Maryland.
Lucking-Reiley, D., Bryan, D., Prasad, N., and Reeves, D. (2000). Pennies from eBay: the determinants of price in online auctions. Technical report, University of Arizona.
Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis. Springer Series in Statistics. Springer-Verlag, New York, 2nd edition.
Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric Regression. Cambridge University Press, Cambridge.
Shmueli, G. and Jank, W. (2006). Modeling the Dynamics of Online Auctions: A Modern Statistical Approach. In Economics, Information Systems and E-commerce Research II: Advanced Empirical Methods. M.E. Sharpe, Armonk, NY. Forthcoming.
Shmueli, G., Russo, R. P., and Jank, W. (2005). The Barista: A model for bid arrivals in online auctions. Technical report, Smith School of Business, University of Maryland.
Wang, S., Jank, W., and Shmueli, G. (2006). Forecasting eBay's online auction prices using functional data analysis. Journal of Business and Economic Statistics. Forthcoming.

Mining and Querying Business Process Logs

Akhil Kumar
Penn State University, Smeal College of Business, University Park, PA 16802, USA
akhil@psu.edu

ABSTRACT
We describe a new algorithm to extract a model of a structured workflow process from a log consisting of actual process execution sequences. The algorithm works by first creating a relationship matrix between all pairs of tasks and then matching rows of this matrix pair-wise to find the closest match. The matching pair represents the pair of tasks or blocks to be combined into a structure. This is repeated until the complete workflow is discovered. An additional benefit of our algorithm is that it can also tell us easily when no block structured model exists, because in this case a pair of matching rows cannot be found in the relationship matrix. This paper describes the basic algorithm and gives some results of testing. Further, we also show how to query a process. Useful process knowledge is thus acquired by analyzing process instances, and it can aid in better understanding of real-world processes, ensure process compliance with an idealized process, and lead to better process design.

Keywords
Process knowledge; process mining, model extraction, querying a process, workflow.

1. INTRODUCTION
In sensitive applications it is imperative that tasks occur in the correct order. The requirements of an accounting application might state that: (1) an invoice must be approved before it is paid; (2) the goods must be received before the invoice is approved for payment; and (3) the goods must be inspected before the payment is made. Similar needs arise in applications in patient care, immigrant processing, insurance claims, etc. A process log may contain literally thousands of instances of such a process, and it is important to verify that the instances conform to the process requirements. A process log contains semantic information pertaining to the relationships between tasks performed in a workflow, such as whether two tasks occur in a sequence, parallel, loop, or exclusive relationship. The goal of process mining [1, 2] is to examine actual instance logs of processes for the relationships between tasks and thus extract, or discover, a process model. The purpose of process querying is to run queries against process models and execution logs, posing questions relating to the relationships between tasks. For example, a query could ask if there are instances where the constraints listed for the accounting example above are violated. Further, by knowing its process model, one can look for process improvement opportunities.

Consider a simple log with four instances: {ABCD, ACBD, ABCBCD, AED}. The instances consist of 5 tasks, A through E, placed in the order of their completion times. Based on heuristic reasoning, one can conjecture that the process model might look like the one shown in Figure 1. The five tasks are organized using and, or, start and end constructs. After the process starts, A is the first task. Then there are two alternative paths, one via E to D, while the other includes B and C in parallel; in fact, B and C can also be performed multiple times because of the loop. Finally, D is performed and the workflow ends. Thus, a process log consists of traces of actual executions of a process and, given a log, the problem is to extract the actual process schema. The log may actually contain start and end times of each task; however, for simplicity we assume here that tasks are instantaneous and that a log records just the start of each task. The main objective is to develop algorithms that can analyze logs and extract correct models that reflect the actual workflow. Consequently, research efforts have focused on extracting graphs [3] and Petri nets [2] from logs. In this paper, we describe a new technique for mining process patterns from workflow logs based on building a task relationship matrix between all pairs of tasks, then matching pairs of rows of this matrix to find the closest match and combining the corresponding tasks according to the relationship between them.

This paper is organized as follows. The next section gives background about how block structured workflows are constructed using sequence, parallel, choice and loop structures. Section 3 describes the algorithm and results from implementing and running test cases. Section 4 shows how to answer queries against a process. Section 5 wraps up the paper with a discussion of related work and conclusions.
2. PRELIMINARIES
A block structured process is created using four building structures, or operators: sequence, parallel, choice and loop. Two tasks can be combined in a sequence using the sequence structure, as shown in Figure 2(a); this can be expressed as Seq(T1, T2), where T1 and T2 are the two tasks. Figure 2(b) shows a parallel structure where two tasks are combined in parallel, meaning that the two tasks may be performed in either order; this can be expressed as Par(T1, T2). The two circles marked 'and' in the figure denote the AND-split and AND-join, respectively. Figure 2(c) shows a choice structure where two tasks are combined with an OR-split, meaning that the two tasks are exclusive and only one of them may be performed in any workflow instance; this can be expressed as Choice(T1, T2). Figure 2(d) shows a loop structure where two tasks are combined into a loop using OR-splits and OR-joins; this can be expressed as Loop(T1, T2), where T1 and T2 are the two tasks being repeated. These four operators can be applied successively and in any order to tasks or blocks to create a workflow process. The boxes in Figure 2 can represent atomic tasks or blocks that are themselves aggregates of tasks.
[Figure 1 about here: the conjectured process model, with Start, task A, then an OR-split leading either to E or to a loop in which B and C run in parallel (AND-split/AND-join), followed by an OR-join, task D, and End.]
Figure 1. A process extracted from a log

In addition to the four structures of Figure 2, a workflow process also has a "Start" and an "End", to respectively denote the start and end of an instance. In general, single tasks (i.e. atomic tasks) could be combined using the four basic building blocks of Figure 2 in a recursive manner until one large aggregate block is formed containing all the tasks and sub-blocks. A workflow constructed by combining tasks into blocks using these structures in this manner is called a block structured workflow. Thus, we assume that unstructured workflows (see for instance, [7]) are not allowed. Next we describe our extraction algorithm.

"L" (only for a diagonal element Rel(i, i)) task i repeats more than once in a row and is part of a loop. ("NL": No loop) The main algorithm is given in Figure 3. First, the algorithm scans the log, determines the number of unique tasks, and sets the number of blocks equal to the number of tasks. Next, it iterates, successively reducing the number of blocks by one until a single block is left at which point the model extraction is complete. In each iteration, the four main steps are: build the Rel matrix again, find a matching pair of tasks, and merge this pair into a larger block. The matching pair of tasks is combined into one block based on the relationship between them as indicated in the Rel matrix by 'S', 'P', 'X' or 'L'. If a matching pair is not found, it means the process is not block-structured. Algorithm Extract_Model() Begin Scan log: set T = # of unique task names Num_blocks = T; Do While (Num_blocks >1) Build_Rel_matrix(); Compare_t_vectors(Rel, i, j); /*return 2 blocks i, j, to be merged*/ Merge_blks(i,j, T - Num_blocks + 1); Num_blocks = Num_blocks - 1; End-Do End Figure 3. Algorithm Extract_model
The Rel matrix is built by scanning the log record-wise. In each record scan, every pair of entries, i, j, (i < j), is compared in increasing order and the appropriate entry is made in Rel by applying some simple rules. If a task appears multiple times in a log, its diagonal entry in Rel is set to L. A row of this matrix is called a t-vector (or task vector) for the corresponding task. To compare t-vectors, we need an equality operator =tv as follows: Definition 1: (t-vector equality, =tv) We say two t-vectors tvi, tvj corresponding to tasks i and j are equal (denoted as =tv) if:
(i) t-vector(tvi, k) = t-vector(tvj, k), k = 1, T, k i, k j (ii) t-vector(tvi, i) = t-vector(tvj, j) Thus, all elements in the same column position of the two vectors except the elements representing the diagonal and the relationship with each other element must be equal. Further, the two diagonal elements though in different column positions must be equal. If the two t-vectors are equal in this way, then tasks i and j are candidates for merging. A tie is broken by first merging tasks (or blocks) which have equal occurrence counts in the entire log, or an 'X' relationship between them. Loops are merged last. The

and

and

(a) Sequence structure


or

(b) Parallel structure


or

or

or

(c) Choice structure

(d) Loop structure

Figure 2. Basic structures (or operators) of block structured workflows

3. ALGORITHM OVERVIEW & RESULTS


The algorithm makes some straightforward assumptions as follows: (1) A log of actual workflow instances (or records in a log) is provided from a valid workflow; (2) Each workflow instance in the log represents a complete instance from start to finish; (3) the log is sufficiently descriptive of different execution scenarios involving sequence, parallel, choice and loop structures. By this we mean that if there is choice structure, then each of its branches must be taken in at least one log record or instance. Similarly, for a loop there must be at least one instance where the loop is executed more than one time; (4) Every task in the process must appear in at least one log record.

merge step replaces the two merging tasks with a new block name. Table 1. A relationship matrix, Rel, between all pairs of tasks
A A B P1 P2 NL S S S B +S L S S P1 +S +S L P P2 +S +S P L

The pictorial representation of this query is shown in Figure 6. Table 2 shows the Rel matrix for the query, while Table 3 is the truncated Rel matrix for the process. Finally, Table 4 gives the difference or matrix between Rel-p' and Rel-q. The matrix shows that the relationships in the query also hold in the original process. The answer to the query is true. Note that in step 5, the diagonal elements of the Rel-q matrix are 'NL'. This approach can also be further extended for more complex queries than the simple ones considered above. For instance consider query Q4 below: Q4: Is there always another task between A and D?

The algorithm was implemented and tested quite extensively by generating test cases. We first tested logs with the basic structures, and then considered different combinations of nesting of these structures to exercise various parts of the algorithm. As an example, consider the log in Figure 4. The algorithm produces the process model shown in Figure 5. The complexity of the algorithm is T*I*log_size^2, for T tasks, I instances, and average instance size log_size.

A B P1 P2 P3 D R1 B P1 P3 P1 P2 D A B P2 P1 P3 D R1 B P1 P3 P1 P2 D A B P1 P3 P2 D R1 B P1 P3 P2 P1 D A B P1 P2 P3 D R1 B P1 P3 P2 D R1 B P1 P2 P3 P1 D A B P2 P1 P3 P1 D A B P2 P1 P3 P1 P3 P1 P3 P1 D Figure 5. Extracted model for the example log of Figure 4 Figure 4. Example log for test example 1 Q4 can be expressed formally as: "Seq (A, X, D)?", where X is a variable. To solve this query, we would again create a rel matrix as before. However, steps 4 and 5 above would be modified such that in the truncated Rel matrix of the process, Relp', the algorithm would have to: (a) Assign task names to X from the list of tasks (b) Compute the Rel-p'(X) matrix for each value of X (c) Compute =Rel-q Rel-p'(X) (d) If = , then answer is yes, else the answer is no. Thus, it is possible to extend our approach to more complex queries involving variables. However, a full description of a formal process query language (PQL) and more detailed algorithms are left as a future exercise. The nice feature of this approach for querying is that it is simple and intuitive, and it also pinpoints the deviation between the query and the process model. Finally, one can also query actual instances instead of the model. For this, a model is first created on-the-fly from just the instances of interest using algorithm extract_model, and then the queries are run against the model.
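A minimal sketch of the Rel-matrix construction and the t-vector comparison described above is given below. It follows the stated rules only approximately (for instance, it orders repeated tasks by their first occurrence) and uses the small log from the introduction as input; it is an illustration, not the implementation evaluated in this section.

```python
from itertools import combinations

# Each string is one process instance; tasks are single characters (illustrative log).
log = ["ABCD", "ACBD", "ABCBCD", "AED"]

tasks = sorted({t for rec in log for t in rec})
rel = {i: {j: "X" for j in tasks} for i in tasks}   # default: never co-occur
for i in tasks:
    rel[i][i] = "NL"                                # default diagonal: no loop

for rec in log:
    for i, j in combinations(tasks, 2):
        if i in rec and j in rec:
            # Simplified ordering rule: compare first occurrences only.
            new = "+S" if rec.index(i) < rec.index(j) else "-S"
            old = rel[i][j]
            rel[i][j] = new if old in ("X", new) else "P"   # conflicting orders => parallel
            rel[j][i] = {"+S": "-S", "-S": "+S", "P": "P"}[rel[i][j]]
    for t in tasks:                                  # repeated task in a record => loop
        if rec.count(t) > 1:
            rel[t][t] = "L"

def t_vectors_equal(a, b):
    """Definition 1: rows a and b of Rel match, ignoring their mutual entries."""
    return rel[a][a] == rel[b][b] and all(
        rel[a][k] == rel[b][k] for k in tasks if k not in (a, b))

print(rel["B"]["C"], t_vectors_equal("B", "C"))      # 'P' and True for this log
```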

4. QUERYING THE PROCESS MODEL

In this section we discuss how the relationship-matrix approach used to extract a process model can also help with processing queries against it. Users may wish to pose queries against the process model such as:

Q1: Do tasks P1 and P2 occur in parallel?
Q2: Does task A always appear before task D?
Q3: Are P1 and P2 in parallel, but both after A and before D?

An algorithm for processing such queries is as follows (a code sketch of the matching step is given after the list):
1) Make a Rel matrix for the process model (Rel-p).
2) Make a process model for the query.
3) Make a Rel matrix for the query (Rel-q).
4) Truncate the Rel matrix for the process by keeping only the rows and columns for tasks that are present in the query (Rel-p').
5) If (Rel-q == Rel-p'), or if (Rel-q(i, j) == Rel-p'(i, j) for all i != j) and (Rel-q(i, i) = 'NL' for all i), then the answer is yes; otherwise the answer is no.

As an example, let us pose query Q3 to the process model of Figure 5. This query is expressed formally as: Q3: Seq(A, Par(P2, P3), D)?
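A small sketch of the matching step (steps 4 and 5 above) might look as follows, assuming Rel matrices have already been built for the process (rel_p) and for the query (rel_q), for example with the helper sketched in Section 3. The function and variable names are illustrative.

```python
def query_holds(rel_p, rel_q, query_tasks):
    """Return True if every pairwise relationship asserted by the query also
    holds in the truncated process matrix Rel-p'. Simplified: assumes the
    query's diagonal entries are 'NL' (the second case of step 5)."""
    for i in query_tasks:
        for j in query_tasks:
            if i == j:
                continue                      # diagonals are not compared here
            if rel_q[i][j] != rel_p[i][j]:
                return False
    return True

# Example: Q1 ("Do P1 and P2 occur in parallel?") reduces to a single cell lookup:
# answer_q1 = rel_p["P1"]["P2"] == "P"
```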

The pictorial representation of this query is shown in Figure 6. Table 2 shows the Rel matrix for the query, while Table 3 is the truncated Rel matrix for the process. Finally, Table 4 gives the difference matrix Δ between Rel-q and Rel-p'. The Δ matrix shows that the relationships in the query also hold in the original process, so the answer to the query is true. Note that in step 5, the diagonal elements of the Rel-q matrix are 'NL'.

[Figure 6 about here: the process model corresponding to query Q3.]
Figure 6: Process model for query Q3

Table 2: Rel matrix for query Q3, Rel-q3

       A    P2   P3   D
A      NL   -S   -S   -S
P2     +S   NL   P    -S
P3     +S   P    NL   -S
D      +S   +S   +S   NL

Table 3. Truncated Rel matrix, Rel-p'

       A    P2   P3   D
A      NL   -S   -S   -S
P2     +S   L    P    -S
P3     +S   P    L    -S
D      +S   +S   +S   L

Table 4. The difference matrix Δ = Rel-q3 - Rel-p'

       A    P2     P3     D
A      0    0      0      0
P2     0    NL,L   P      0
P3     0    0      NL,L   0
D      0    0      0      NL,L

This approach can also be extended to more complex queries than the simple ones considered above. For instance, consider query Q4:

Q4: Is there always another task between A and D?

Q4 can be expressed formally as "Seq(A, X, D)?", where X is a variable. To solve this query, we would again create a Rel matrix as before. However, steps 4 and 5 above would be modified such that, in the truncated Rel matrix of the process, Rel-p', the algorithm would have to: (a) assign task names to X from the list of tasks; (b) compute the Rel-p'(X) matrix for each value of X; (c) compute Δ = Rel-q - Rel-p'(X); and (d) if Δ = 0, the answer is yes, else the answer is no. Thus, it is possible to extend our approach to more complex queries involving variables. However, a full description of a formal process query language (PQL) and more detailed algorithms are left as future work. A nice feature of this approach to querying is that it is simple and intuitive, and it also pinpoints the deviation between the query and the process model. Finally, one can also query actual instances instead of the model: a model is first created on the fly from just the instances of interest using algorithm Extract_Model, and the queries are then run against that model.

5. RELATED WORK AND CONCLUSIONS

Researchers have focused on techniques for extracting three kinds of models: graph-oriented models [3], Petri nets [1, 2], and block structured models [8]. The approach described in [3] extracts dependency graphs from task logs based on a dependency analysis between tasks; these algorithms were applied in the context of the FlowMark workflow system from IBM. In [1, 2] an Alpha algorithm for extracting Petri net models from logs was proposed. This algorithm and its variants are based on extracting causal dependencies from logs and combining tasks that depend on one another by treating them as transitions and placing tokens to connect them. The initial version of the Alpha algorithm had difficulty with loops of size one and two because they were confused with parallel relationships, but later versions circumvent the problem by treating them in a special manner. The approach of Herbst [6] starts with a very general initial model and applies a series of merges and splits using probabilistic approaches to induce a final model. The technique for block structured models proposed by Schimm [8] defines an algebra for workflow traces, groups traces into classes, extracts precedence relationships, creates sub-models and finally synthesizes them using the algebra. It is also a bottom-up approach like ours, which starts at the instance level and, by applying rewriting rules, clusters the instances until no further rewriting is possible and a block-structured model is returned. It handles loops by relabeling, which is not required in our approach, and it assumes that tasks have a start and end time, while we assume they are atomic. Another technique, based on clustering of traces to identify global constraints, is discussed in [5]. A probabilistic model and a learning algorithm for workflow mining are discussed in [9]. Research on querying processes has been quite limited. One method based on XML Query was described in [4]; however, it does not capture relationships between tasks in as much depth and is unable to point out discrepancies between a given process and a query process.

The distinguishing features of our approach are that it creates a Rel table that captures relationships between all pairs of tasks; it matches t-vectors to identify merging tasks or blocks; and it indicates when a workflow is unstructured. An additional benefit of our algorithm is that it can easily tell us when no block structured model exists, because in this case a pair of matching rows cannot be found in the relationship matrix. Finally, it lends itself easily to querying of process models and instances. In summary, actual execution logs of business processes contain valuable process knowledge. Process mining is a way to extract and query that knowledge, and hence it has many potential applications in the real world. Effective and fast process mining algorithms can help organizations better understand their processes and improve them. Existing data mining techniques are not adequate for this because they do not capture the semantic relationships between tasks. In this paper, we described a novel algorithm to extract a model of a block structured workflow from a log of process execution sequences. The advantage of block structured models is that they are easier for business analysts and end users to understand. Further, there are several tools and languages (including WS-BPEL) that primarily support block structured workflow processes. In future work, we expect to conduct more evaluations to study the robustness of the extraction algorithm in the presence of noise and to compare the algorithm more rigorously with other methods. Further, the query language can be developed more formally into a process query language (PQL), and the query processing algorithms should be implemented and tested.

Acknowledgements: This research was partially supported by a Summer Research Grant from the Smeal College of Business at Penn State University.

REFERENCES
1. W.M.P. van der Aalst, B.F. van Dongen, J. Herbst, L. Maruster, G. Schimm, A.J.M.M. Weijters. Workflow mining: A survey of issues and approaches. Data and Knowledge Engineering (DKE) 47 (2003).
2. W.M.P. van der Aalst, A.J.M.M. Weijters, and L. Maruster. Workflow Mining: Discovering Process Models from Event Logs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 16(9), 1128-1142, 2004.
3. R. Agrawal, D. Gunopulos, and F. Leymann. Mining Process Models from Workflow Logs. Proc. Sixth Int'l Conf. on Extending Database Technology, pp. 469-483, 1998.
4. V. Christophides, R. Hull, A. Kumar. Querying and Splicing of XML Workflows. CoopIS 2001, 386-402.
5. G. Greco, A. Guzzo, L. Pontieri, and D. Sacca. Mining expressive process models by clustering workflow traces. Proc. 8th Pacific-Asia Conference (PAKDD'04), 52-62, 2004.
6. J. Herbst and D. Karagiannis. Integrating Machine Learning and Workflow Management to Support Acquisition and Adaptation of Workflow Models. Proc. Ninth Int'l Workshop on Database and Expert Systems Applications, pp. 745-752, 1998.
7. R. Liu and A. Kumar. An Analysis and Taxonomy of Unstructured Workflows. Third International Conference on Business Process Management (BPM 2005), Springer-Verlag, Lecture Notes in Computer Science, 268-284.
8. G. Schimm. Mining most specific workflow models. Proc. International BPM Conference, June 2003, 25-40.
9. R. Silva, J. Zhang, J.G. Shanahan. Probabilistic workflow mining. Proc. ACM KDD Conference, 275-284, 2005.

Driving High Performance for a Large Wireless Communication Company Through Advanced Customer Insight
Ramin Mikaili and Lynnette Lilly, Accenture Consulting, CRM Customer Insight

CLIENT COMPANY OVERVIEW

The client company offers a comprehensive range of wireless and wireline communications services to consumer, business, and government customers. The organization is widely recognized for developing, engineering, and deploying innovative technologies, including two robust wireless networks offering industry-leading mobile data services; instant national and international walkie-talkie capabilities; and an award-winning, global Tier 1 Internet backbone.

BUSINESS CHALLENGE

Over the past few years, this communications company made significant advances in using customer insight and analytics to improve customer satisfaction and reduce customer churn. The company has used advanced business intelligence techniques and tools to gain a deeper understanding of different customer segments and the reasons for customer churn within those segments. Armed with that knowledge, the company launched more than two dozen programs aimed specifically at improving retention within several important customer segments. The next strategic customer relationship management priority became to increase customer loyalty and retention by providing improved customer experiences and by customizing treatment to the needs of each customer. The company also looked to increase revenue from its existing customer base through more effective cross-selling and up-selling by its customer service agents. As with many wireless companies today, this company had previously looked at customer care primarily as a cost of doing business. Now, the company wanted to do more: to treat customer interactions as opportunities to increase the number of products and services sold to its existing customer base. By implementing this vision, the company would begin its journey to transform its customer care function from a cost center into a revenue center. In choosing Accenture to help realize its vision, the company sought a provider that could help strategize and also implement new customer insight initiatives.
ACCENTURE METHODOLOGY
To meet the various challenges in this initiative, Accenture assembled a diverse team that included resources from its wireless and strategy practices, as well as deeply skilled professionals in customer relationship management and business intelligence. This team worked closely with their client teammates to identify, analyze and capture growth and revenue-generating opportunities. Analyses of customer behavioral data were used to support decisions at both strategic and tactical levels; the complexity of the analysis was not strongly tied to the nature of the decision. To identify the best opportunities, the team used a simple segmentation to understand the relationship between usage behavior of communication services and customer growth, that is, the adoption of new services or the addition of users to an account. This segmentation was based on two dimensions, the profitability of the customer versus a growth index, creating a nine-cell segmentation. Each of the cells was profiled based on usage and service behavioral history. With that information, the team developed strategies that defined the optimal offer and channel for up-sell opportunities. A series of precision targeting models for various products and services were developed and used to calculate each customer's propensity to adopt. These scores were used to drive the proactive and reactive initiatives: for the proactive initiatives, the score identified the targeted audience, and in the reactive mode (customers calling care centers) the scores determined the proper offer to present to the customer.

For example, the simple segmentation analysis during value targeting showed that mid-market accounts with 5 to 250 subscribers in the retail industry tend to grow in number of subscribers during the winter season, and that they predominantly required additional lines on the account or direct-connect services. The best offers for the targeted audience were identified by optimizing the value of the incentive for the customer based on the potential average revenue over the anticipated tenure. The targeted small business audience was then flagged with the appropriate offer within the SAS-based campaign management mart, which in turn was used to instantiate a pop-up window on the care representative's screen when triggered by the targeted customer's needs. This particular campaign resulted in 30% incremental sales of one or more units to the account. During the pilot we launched several campaigns similar to the one described. The team then launched a set of reactive dynamic programs within care as well as a number of proactive cross-sell and up-sell pilot programs, thereby creating and executing a plan to apply the results of those pilots to the company's entire customer base from both a strategic and an operational perspective. To help ensure the success of the pilot programs, a joint business intelligence team analyzed existing operations and capabilities at customer care centers to identify the highest-value growth opportunities. Many companies err in this regard by force-fitting a technology or tool (a data mart, for example) into their business environment, rather than matching available tools to business needs and strategy. Instead, the team targeted the value that would be realized through a comprehensive business solution. Accenture also helped ensure that all components of the overall solution, including technology-enabled capabilities, were usable at the customer service agent level. The mantra of the combined team was to rapidly pilot ideas and follow success with investment in technology and process. Taking this value-based approach, Accenture performed a detailed analysis of customer data to demonstrate to the company's executives how their highest-value customers behaved over time: their initial package of services, the products they purchased over time, and the value to the company produced by those purchases. Accenture created a compelling view of the customer life cycle from a growth perspective, highlighting key opportunities. Accenture research into the characteristics of high-performance businesses has found that high performers are more effective than their peers at translating insight into marketing and sales productivity. By working with this company to expand its existing capability

to use advanced analytics to transform data into useful information, Accenture helped the company obtain the information it needs to make sound marketing and customer management decisions. To conclude the first phase of the project, Accenture proposed a portfolio of four to five cross-sell and up-sell pilot programs to be tested at a selected customer care center for small- to medium-sized business customers. Leveraging Accenture Communications Solutions' advanced business intelligence and customer analytic assets, as well as the insight and analytics work previously conducted, the programs helped the team fine-tune the capabilities of the existing desktop software solution. The team then kicked off the second phase of the project, aimed at capturing value by launching a revenue generation pilot. The result: each incoming call now presents a targeted opportunity for a customer service agent to improve the customer experience by recommending the product the customer needs most, based on statistical evidence of buying patterns. In the final stage of the project, the team moved toward applying the results achieved in the pilots to the entire network of customer care centers. The team developed and executed a customer service agent training program, which included implementing incentive programs to encourage uptake among customer service agents. To capture additional value, the joint team piloted a campaign to increase awareness of 411, an information directory service, among the customer base. After a targeted mailing campaign, the team found that awareness levels had increased significantly.

RESULTS
The pilot programs, rolled out in only six months, immediately led to several million dollars in additional incremental lifetime revenue for the company. The company exceeded its annual revenue goal from customer care by more than 100 percent during the year following the pilot program. By working with Accenture to execute its customer relationship management vision, the company sold more than 50,000 additional units to existing accounts, 1.2 million value-added services and 40,000 phone accessories during 2005. As the company's senior executive reflects: "This project has delivered significant value for our company. By helping us refine, support and execute our strategy, Accenture enabled us to realize our vision. Through our collaborative, self-funding programs, Accenture helped us transform our customer care function from a cost center to a revenue-generating machine."

Quantile Trees for Marketing

Claudia Perlich, Saharon Rosset
IBM T.J. Watson Research Center, P. O. Box 218, Yorktown Heights, NY 10598
{srosset, perlich}@us.ibm.com

ABSTRACT
This work on quantile estimation was motivated by the task of estimating the realistic IT wallets of IBM customers. The goal of quantile regression modeling is to estimate a quantile of the conditional distribution of the response, rather than the mean, which is the implicit goal of most standard regression approaches. High-quantile modeling is of vital importance in marketing, since it can provide realistic estimates of sales opportunities that help focus efforts on customers with high growth potential. We propose a straightforward adaptation of a regression-tree approach that can be implemented efficiently as a wrapper around any existing implementation of a regression tree algorithm such as CART.

1. INTRODUCTION

In standard regression modeling, we are given n observations of a continuous numeric variable Y and a set of p explanatory variables, or features, x = (x_1, ..., x_p)', and we try to estimate the dependence of Y on x, so that in the future we can observe x only and predict what Y may be. This typically leads us to build a model for a conditional central tendency of Y|x, usually the mean E(Y|x). In this paper we consider situations where we are not interested in estimating a conditional mean, but rather a high quantile of the conditional distribution P(Y|x). Consider for instance the 0.9th quantile of P(Y|x), which is the function c(x) such that P(Y > c(x)|x) = 0.1.

Our primary motivating application is the problem of customer wallet estimation, which is of great practical interest to us at IBM. A customer's wallet for a specific product category (for example, Information Technology) is the total amount this customer can spend in this product category. As an IT vendor, IBM observes what the companies that are our customers actually spend with us, but does not typically have access to the customers' budget allocation decisions, their spending with competitors, etc. Information about our customers' wallets, as an indicator of their potential for growth, is considered extremely valuable for marketing, resource planning and other tasks. For a detailed survey of the motivation, problem definition, and some alternative solution approaches, see [10]. In that paper we propose the definition of a customer's REALISTIC wallet as the 0.9th or 0.95th quantile of their conditional spending; this can be interpreted as the quantity that they may spend buying IT from IBM "if we get lucky." This task of modeling what we can hope for, rather than what we should expect, turns out to be of great interest in multiple other business domains, including:

- When modeling sales prices of houses, cars or any other product, the seller may be very interested in the price they may aspire to get for their asset if they are successful in negotiations. This is clearly different from the average price for the asset and is more in line with a high quantile of the price distribution of equivalent assets. Similarly, the buyer may be interested in the symmetric problem of modeling a low quantile.

- In outlier and fraud detection applications, we may have a specific variable (such as total amount spent on a credit card) whose degree of outlyingness we want to examine for each one of a set of customers or observations. This degree can often be well approximated by the quantile of the conditional spending distribution given the customer's attributes. For identifying outliers we may just want to compare the actual spending to an appropriate high quantile, say 0.95.

In the next section we adapt one of the most common regression approaches used in data mining, regression trees, to estimation of high quantiles instead of conditional means. We also show how such a model can easily be implemented as a wrapper around an existing tree implementation, and report a few empirical results.

2. ADJUSTING REGRESSION TREES FOR QUANTILE PREDICTION

Tree-induction algorithms are very popular in predictive modeling and are known for their simplicity and efficiency when dealing with domains with a large number of variables and cases. Regression trees are obtained using a fast divide-and-conquer greedy algorithm that recursively partitions the training data into neighborhoods. Work on tree-based regression models traces back to Morgan and Sonquist [7], but the major reference is the book on classification and regression trees (CART) by Breiman et al. [1]. We will limit our discussion to this particular algorithm; additional regression tree implementations include RETIS [3], CORE [9], M5 [8] and RT [12]. A tree-based modeling approach is determined predominantly by three components: the splitting criterion, which is used to select the next split in the recursive partitioning; the pruning method, which shrinks the overly large tree to an optimal size after the partitioning has finished, in order to reduce variance; and the estimation method, which determines the prediction within a given leaf.

The most common choice for the splitting criterion is the least squares error (LSE). While this criterion is consistent with the objective of finding the conditional expectation, and therefore not necessarily with the quantile, it can also be interpreted as a measure of the improvement in the quality of the approximation of the conditional distribution. One can interpret the tree induction as a search for local neighborhoods that provide good approximations of the true conditional distribution P(Y|x). An alternative interpretation of the LSE splitting criterion is thus as a measure of dependency between Y and a variable x_i, evaluating the decrease in uncertainty (as measured by variance) through conditioning. In addition, the use of LSE leads to implementations with high computational efficiency, based on incremental estimates of the errors for all possible splits.

Pruning is the most common strategy to avoid overfitting in tree-based models. The objective is to obtain a smaller sub-tree of the initial overly large tree, excluding those lower-level branches that are unreliable. CART uses an error-complexity pruning approach, which finds an optimal sequence of pruned trees by sequentially eliminating the subtree (i.e., a node and all of its descendants) that minimizes the increase in error weighted by the number of leaves in the eliminated subtree:

    g(t, T) = (E(t) - E(T_t)) / (S(T_t) - 1),    (1)

where E(T_t) is the error of the subtree T_t rooted at t, E(t) is the error if T_t were replaced by a single leaf, and S(T_t) is the number of leaves in the subtree. E(.) is measured in terms of the splitting criterion (for standard CART, squared error loss). Given an optimal pruning sequence, one still needs to determine the optimal level of pruning, and Breiman et al. suggest cross-validation on a holdout set. Finally, CART estimates the prediction for a new case that falls into leaf node l as the mean over the set of training responses D_l in the leaf:

    ŷ_l(x) = (1 / n_l) Σ_{y_i ∈ D_l} y_i.    (2)

Given our objective of quantile estimation, the most obvious adjustment to CART is to replace the sample mean estimate in the leaves with a quantile estimate based on the empirical local estimate of P(Y|x): simply sort the observations in the leaf and pick the y value at the appropriate quantile of the training examples in the leaf. A more interesting question is whether the LSE splitting (and pruning) criterion should be replaced by a quantile loss. On one hand, finding splits that minimize the supposedly correct performance measure of quantile loss [10],

L_p(y, y_hat) = p * (y - y_hat)          if y >= y_hat,
                (1 - p) * (y_hat - y)    otherwise,    (3)

on the training sample in the leaves corresponds directly to our prediction objective. On the other hand, having the best possible approximation of the conditional distribution can be expected to result in the best quantile estimates of that distribution. Therefore minimizing the distribution variance could lead to a better approximation than direct optimization of quantile loss, in particular for very high quantiles. In addition, changing the splitting criterion to quantile loss causes severe computational problems. The evaluation of a split now requires the explicit construction of the two sets of predictions in each leaf, sorting both of them in order to find the correct quantile, and the calculation of the loss. We will not consider the issue of efficiency any further, as there is already some related work by Torgo [12] on efficient implementations of trees that minimize mean absolute deviation (MAD), i.e., quantile loss for p = 0.5. In initial empirical experiments, we have investigated the two splitting criteria and concluded that using quantile loss does not improve the performance significantly.
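As a concrete illustration (not from the paper), the quantile loss of Equation 3 can be written as a small function. Below is a minimal Python sketch, assuming NumPy; the exponential sample is invented for the example:

```python
import numpy as np

def quantile_loss(y_true, y_pred, p=0.9):
    """Pinball (quantile) loss of Equation 3, averaged over the sample.

    Under-predictions are weighted by p, over-predictions by (1 - p), so
    minimizing this loss pushes the prediction towards the p-th quantile.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred
    return float(np.mean(np.where(diff >= 0, p * diff, (p - 1) * diff)))

# For data drawn from one leaf, the constant minimizing this loss is
# (approximately) the empirical p-th quantile, not the mean.
rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=10_000)
print(quantile_loss(y, np.mean(y)), quantile_loss(y, np.quantile(y, 0.9)))
```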

2.1 Wrappers for Quantile Prediction


In related work, Meinshausen [6] and Chaudhuri and Loh [2] have devised tree-based algorithms for quantile estimation. While tree induction algorithms are not that difficult to implement, it may still be of interest that quantile prediction can be achieved using a conventional decision tree implementation, as long as no variables are missing. In order to predict the appropriate quantile we need access to the distribution of the target values of the training examples within a particular leaf. This information is typically not provided as optional output. However, it is easily recreated by scoring both the training set and the holdout/prediction dataset. For all observations in the training set, l = 1, ..., N, we have a vector of pairs (y_l, y_hat_l), where y_hat_l is the prediction, calculated as the mean over all target values in the leaf according to Equation 2. For a given prediction y_hat_u on the holdout, we can find all training examples in the same leaf from the pairs (y_l, y_hat_l) with y_hat_l = y_hat_u. The probability of two different leaves having exactly the same mean value is exceedingly low. Therefore, we can calculate the quantile prediction jointly for all holdout observations with the same value y_hat_u as the quantile over all y_l with y_hat_l = y_hat_u, leaving the original algorithm untouched.
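A minimal sketch of this wrapper, assuming scikit-learn's DecisionTreeRegressor as the conventional tree implementation (function names and parameters are illustrative, not from the paper):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def quantile_tree_predict(X_train, y_train, X_new, p=0.9, **tree_kwargs):
    """Wrap a standard regression tree to produce quantile predictions.

    The tree is trained as usual (leaf prediction = leaf mean). Training
    examples are then grouped by their predicted leaf mean, and for each new
    observation we return the empirical p-th quantile of the training targets
    that share its leaf, identified via the identical mean prediction.
    """
    y_train = np.asarray(y_train, dtype=float)
    tree = DecisionTreeRegressor(**tree_kwargs).fit(X_train, y_train)
    train_pred = tree.predict(X_train)

    # Empirical p-th quantile of training targets per leaf, keyed by leaf mean.
    leaf_quantile = {}
    for mean_value in np.unique(train_pred):
        leaf_quantile[mean_value] = np.quantile(y_train[train_pred == mean_value], p)

    return np.array([leaf_quantile[m] for m in tree.predict(X_new)])
```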


2.2 Initial Empirical Observations


We have run some initial experiments using the outlined approach. As mentioned before, the choice of the splitting criterion was irrelevant across multiple datasets. The difference between predicting a conditional mean and predicting a conditional high quantile does indeed seem to be very significant from a practical perspective. The degree of divergence is strongly affected by the distribution of the target variable; for exponential-type (highly skewed) distributions, the mean prediction and the high-quantile predictions can differ substantially.


3. CONCLUSION AND FUTURE WORK

High customer spending quantiles are very valuable quantities for marketing and strategic decision making. It is possible to obtain such estimates using standard regression tree approaches with minor adjustments, resulting in reliable, interpretable, and efficient solutions. The immediate next step is an extensive empirical evaluation of the suggested algorithm in comparison to existing quantile estimation approaches ([11], [5], [4]). We are also interested in the statistical properties of quantile estimation methods, including robustness and the bias-variance decomposition as a function of the distribution of the dependent variable and the choice of quantile.

Acknowledgments
We thank Bianca Zadrozny, Sholom Weiss, Rick Lawrence, Paulo Costa, Alexey Ershov and John Waldes from IBM for useful discussions. We are also very grateful to the reviewer for pointing out additional related work.

4. REFERENCES

[1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984.
[2] P. Chaudhuri and W. Loh. Nonparametric estimation of conditional quantiles using quantile regression trees. Bernoulli, 8:561-576, 2002.
[3] A. Karalic. Employing linear regression in regression tree leaves. In Proceedings of the European Conference on Artificial Intelligence, pages 440-441. John Wiley & Sons, 1992.
[4] R. Koenker. Quantile Regression. Econometric Society Monograph Series. Cambridge University Press, 2005.
[5] J. Langford, R. Oliveira, and B. Zadrozny. Predicting the median and other order statistics via reduction to classification. In UAI, 2006.
[6] N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983-999, 2006.
[7] Morgan and Sonquist. Problems in the analysis of survey data and a proposal. JASA, 58:415-434, 1963.
[8] R. Quinlan. Combining instance-based and model-based learning. In Proceedings of the Tenth International Conference on Machine Learning, pages 236-243. Morgan Kaufmann, 1993.
[9] M. Robnik-Sikonja. CORE - a system that predicts continuous variables. In Proceedings of ERK, 1997.
[10] S. Rosset, C. Perlich, B. Zadrozny, S. Merugu, S. Weiss, and R. Lawrence. Wallet estimation models. In International Workshop on Customer Relationship Management: Data Mining Meets Marketing, 2005.
[11] I. Takeuchi, Q. V. Le, T. Sears, and A. Smola. Nonparametric quantile regression. NICTA Technical Report, 2005.
[12] L. Torgo. Inductive Learning of Tree-based Regression Models. Ph.D. thesis, Faculty of Sciences, University of Porto, 1999.

A Decision Management Approach to Basel II Compliant Credit Risk Management


Peter van der Putten, Arnold Koudijs, Rob Walker
Chordiant Software
De Lairessestraat 150, 1075 HL Amsterdam, the Netherlands
peter.van.der.putten@chordiant.com, arnold.koudijs@chordiant.com, rob.walker@chordiant.com

ABSTRACT
In this paper we highlight some high level requirements for Basel II compliant credit risk management and share a decision management approach to this problem as a case example of a real world business data mining application.

Keywords
Credit risk management, scoring models, decisioning, predictive analytics, monitoring, Basel II

1. INTRODUCTION

The new Basel Accord (Basel II) aims to make regulatory capital requirements more sensitive to risk in order to improve the overall stability of the financial market. Basel II covers credit, market and operational risk. For calculating and managing credit risk, banks can follow the so-called Internal Ratings-Based (IRB) approach, which allows risk ratings to be calculated with predictive models developed by the bank on its own data [1,2]. As such, predictive data mining has a huge business impact in this application area. This is a position paper; its goal is to present Basel II and credit risk management in general as an interesting application area for predictive data mining applications and research, and to identify challenges and opportunities. Data mining research often focuses on the core model development step in the KDD process; however, we claim that other steps are also of key importance for credit risk management, such as integrating predictions with business rules, policies and strategies, safe deployment, and the monitoring of business performance, decisions and predictive models.

1.1 Outline of the paper

The procedures for developing, deploying and monitoring models must meet certain requirements, some of which are specified explicitly by the Basel II regulations and some of which are implied by them. In Section 2 we highlight a selection of these requirements. Next, we present the Chordiant Decision Management approach as a case example of a state-of-the-art industry solution for implementing a managed Basel II credit risk management process, and outline opportunities for data mining research (Section 3). In addition, we identify some opportunities that are related to Basel II, beyond mere compliancy (Section 4). Section 5 concludes the paper.

2. The Internal Ratings Based Approach


The Internal Ratings Based (IRB) approach depends on so-called risk components to calculate the expected loss for each exposure; these expected losses in turn sum up to the total credit risk. Basel II dictates that the capital requirements are based not only on the expected losses but also on the sophistication of the methods used to estimate these losses. The more advanced the methods used, the better a bank can do in reducing its minimum capital requirements, which in turn allows the bank to free up capital for other purposes such as investments. Estimates of exactly how much can be freed up vary from 7% for large international banks (Group 1 banks) to 27% (small and midsize banks) and 50% (high-quality mortgage portfolios) [3,4].

2.1 Requirements
As discussed, a central requirement is to calculate the expected loss (EL) for each exposure, defined as:

EL = PD x LGD x EAD x M    (1)


with PD the probability of default (not paying back a loan or equivalent completely), LGD the loss given default, EAD the exposure at default and M the maturity of the exposure. Typically, under the so-called IRB Foundation approach, banks provide their own estimates of PD and use supervisory estimates for the other risk components; in the Advanced IRB approach, all components are estimated by the bank itself. Basel-compliant ratings must be calculated on the basis of at least two years of data, so the availability of historical data over that period is a must-have. The data must be analyzed to calculate ratings that properly reflect the risk in a portfolio. Traditionally, classical statistical methods like regression are used for estimating risk components. However, the Basel II standard does not require a specific modeling algorithm to be used; rather, the modeling process must meet a number of criteria. Merely being predictive is not good enough. The models must be proven to be stable under varying economic conditions and over various points in time. A priori risk assessment knowledge that may improve the bottom-up risk models must be made explicit and incorporated where possible. A safe and auditable process must be followed to develop, deploy and monitor credit models and strategies.
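A minimal sketch of the expected loss calculation, following Equation 1 literally (the figures are made up for illustration, and the full Basel capital formulas treat maturity in a more involved way than a simple multiplier):

```python
def expected_loss(pd, lgd, ead, m=1.0):
    """Expected loss per Equation 1: EL = PD x LGD x EAD x M.

    pd  - probability of default (0..1)
    lgd - loss given default, as a fraction of the exposure (0..1)
    ead - exposure at default, in currency units
    m   - maturity of the exposure (kept here as a plain multiplier)
    """
    return pd * lgd * ead * m

# Illustrative exposure: 2% default probability, 45% loss given default,
# 1,000,000 exposure at default, 1-year maturity.
print(expected_loss(pd=0.02, lgd=0.45, ead=1_000_000, m=1.0))  # 9000.0
```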

3. A Decision Management Approach

In this section we outline the various steps involved in managing the Basel II credit risk management process, present our decision management approach as an example of a real-world application of data mining, and discuss various challenges and opportunities. Historically, the focus of our tools and methodology was purely on predictive model development for credit risk management using genetic algorithms [10]. Later this evolved into a full decision management platform for optimizing the overall customer relationship, across issues like risk, cross-sell, service and attrition (next best action or next best activity). Decision Management provides managed process support for the steps after a model has been developed: linking models with business rules into a decision logic (strategy management), batch and real-time decision logic deployment, and monitoring of decisions and models [3,4,6,7].

3.1 Risks and Losses Datamarts

As discussed, Basel II requires that at least two years of risk and loss data be available. The data is usually out there somewhere in the company; getting reliable access to it is generally the issue. However, this is generally less of a problem than in other business areas, because banks have been obliged to keep many risk-related records for financial or accounting regulations. Furthermore, periodic snapshots of a simple flattened customer table are usually sufficient, and explorative predictive analytics can be used to guide the search for relevant risk indicators that should be stored on an ongoing basis in the risk data mart. Risk management is an area where everyone is aware of the risk and reality of poor data quality. From a data mining research point of view it could be interesting to adapt algorithms specifically for mining dirty data, or to further develop methods that continuously monitor data quality or detect anomalies [5,9].

3.2 Safe Model Development Process


Basel II requires that best practices for developing rating models are followed, to minimize the scope for errors and ensure consistent quality. In our opinion, the best and safest way to ensure this is to hard-code this process in the model development tool and to provide automated support where possible, without sacrificing user control. This should not be limited to the core model algorithms, but should also include project definition, data preparation and model evaluation. Because of this model factory approach, model accuracy, robustness and process compliancy are optimized, and a full audit trail of the development process can be provided. As core modeling algorithms, logistic regression, additive scorecards, decision trees and genetic algorithms are used, to cover the whole spectrum between simple, understandable models and more powerful, complex models.

Figure 1: Decision logic examples: combining models for subpopulations, logic aware of the economic cycle, logic implementing the core IRB formula.

There are various interesting opportunities for data mining research here. Of course there will always be a need for better scoring algorithms, especially ones that reduce the variance component of the error, which is generally a major source of error in real-world problems (see also [8]). However, as discussed above, providing more (semi-)automated support for the entire model development process may have even more impact. For instance, evaluation of scoring models is a topic of its own. Simple measures such as accuracy do not suffice because of the relatively low default rates; measures such as the area under the ROC curve are more useful. In addition, extensive testing usually needs to take place to demonstrate the stability of these models on different samples and over time periods (back testing). Often simulations are used as well to check under what conditions models may break down, by generating inputs over a range of distributions and studying the behavior of the model, for instance through Monte Carlo simulations.
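To illustrate the evaluation point, a minimal sketch (assuming scikit-learn; the default rate and scores are simulated, not real portfolio data) comparing accuracy with the area under the ROC curve on a low-default-rate sample:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(42)
n = 100_000
defaults = rng.random(n) < 0.01            # ~1% default rate
# A weak score that is only slightly higher for defaulters.
scores = rng.normal(loc=defaults * 0.5, scale=1.0)

# A trivial "never default" classifier already reaches ~99% accuracy...
print(accuracy_score(defaults, np.zeros(n, dtype=bool)))   # ~0.99
# ...while AUC reflects the actual ranking power of the score.
print(roc_auc_score(defaults, scores))                     # ~0.64
```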

3.3 Business Policies, Rules and Strategies


The core idea of the decision management approach is that often not just a prediction is needed, but rather a decision. In our approach we allow rating models to be complemented by decision rules. This facility is typically used to express risk assessment knowledge based on information that is outside the scope of the model. Examples are exceptions that lead to an increase or decrease in rating, or simply the implementation of the core IRB formulae. Another use case is a comprehensive rating system that requires multiple rating models. An example would be a system that rates small and medium enterprises using several sector-specific rating models, or a rating logic that combines models for different economic cycles (Figure 1). These are relatively simple examples; a real-world decision logic generally contains tens to hundreds of decision components or more. The marriage between models and rules is an area that deserves more attention in data mining research. Within AI, at least, data mining and knowledge-based systems are usually treated as quite separate areas. It is also often claimed that the integration of models into applications is specific to each application and does not allow the development of generic methods. The decision management approach of combining rules and models can be seen as a counterexample to this claim. The same methodology is also used for marketing (real-time next best activity, marketing campaigns) and medical applications (implementing decision support and medical protocols).
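As a simple illustration of combining a model score with decision rules, here is a sketch; the thresholds, overrides and rating scale are invented for the example and are not taken from any Chordiant product:

```python
def assign_rating(model_pd, economic_cycle, in_arrears):
    """Combine a model-based PD estimate with rule-based overrides.

    model_pd       - probability of default from the (sector-specific) model
    economic_cycle - 'expansion' or 'recession'; stresses the PD in downturns
    in_arrears     - hard policy override, independent of the model
    """
    # Rule: stress the model estimate when the economic cycle is unfavorable.
    pd = model_pd * (1.5 if economic_cycle == "recession" else 1.0)

    # Rule: customers already in arrears get the worst rating regardless of PD.
    if in_arrears:
        return "D"

    # Map the (possibly adjusted) PD onto an internal rating scale.
    for rating, threshold in [("A", 0.01), ("B", 0.03), ("C", 0.10)]:
        if pd <= threshold:
            return rating
    return "D"

print(assign_rating(0.02, economic_cycle="recession", in_arrears=False))  # 'B'
```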

Figure 2: Example of a monitoring dashboard

3.4 Monitoring
To be Basel II compliant, product and customer portfolios, rating logic and models must be monitored continuously and consistently. Example analyses include the calculation of rating and score distributions, comparisons of predicted and actual defaults and losses, and population drift. This is achieved without the need to actually design or build a dedicated data mart or equivalent repository; again, this is a feature of process automation. In model development all expectations are stored within the model, and in model deployment all the produced ratings, along with the inputs, are written to the monitoring data mart, which contains a unified Basel II data model. Obviously, the cost of the data mart is not an issue from a regulatory perspective; however, the automated approach again minimizes the risk of errors (Figure 2). The monitoring of models (and decisions) is another topic that is very useful but also interesting (and under-studied) from a data mining research point of view. For instance, how can we detect that a model is getting tired and needs to be replaced? What are methods to discover the root cause of a change in score distributions? Did the portfolio change, or are the models no longer valid? How can we detect a sudden external change in the environment? How could we use a set of credit models and strategies and adaptively apply the best ones given the current environment, or in other words, how can we use monitoring to adaptively steer the rating environment in a safe manner?
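One common way to quantify the score-distribution drift mentioned above is the population stability index (PSI); the paper does not prescribe a particular measure, so the following is just an illustrative sketch with simulated scores:

```python
import numpy as np

def population_stability_index(expected_scores, actual_scores, n_bins=10):
    """PSI between a reference (development) sample and a recent score sample.

    Scores are bucketed on quantiles of the reference sample; PSI sums
    (actual% - expected%) * ln(actual% / expected%) over the buckets.
    Rule of thumb: < 0.1 stable, 0.1-0.25 some drift, > 0.25 major drift.
    """
    expected_scores = np.asarray(expected_scores, dtype=float)
    actual_scores = np.asarray(actual_scores, dtype=float)
    # Interior bucket edges from quantiles of the reference sample.
    edges = np.quantile(expected_scores, np.linspace(0, 1, n_bins + 1))[1:-1]
    eps = 1e-6
    expected_pct = np.bincount(np.digitize(expected_scores, edges), minlength=n_bins) / len(expected_scores) + eps
    actual_pct = np.bincount(np.digitize(actual_scores, edges), minlength=n_bins) / len(actual_scores) + eps
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(1)
dev_scores = rng.normal(600, 50, 50_000)      # scores at model development time
recent_scores = rng.normal(585, 55, 50_000)   # scores observed recently
print(population_stability_index(dev_scores, recent_scores))
```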

4. Basel II Related Applications


Banks that implement an IRB-compliant process will be rewarded considerably: changes in capital requirements of only a couple of percent have a significant impact on the bottom line. However, Basel II is not just an exercise in compliancy, but also an opportunity to build on the rating environment and implement or improve related applications. The first candidate applications to profit from a Basel II infrastructure are other credit risk management applications. Examples are loan acceptance models (for new clients), behavioral models (during the client lifetime) and collection models (after arrears or default). This is actually not limited to the banking industry, but is relevant for non-Basel sectors like telecommunications, insurance and retail as well. Note that these application areas lead to interesting new research problems too. For instance, in application scoring the outcome information (loan repaid or not) is missing for a non-random sample (the rejected applicants). This is known as the reject inference or outcome inference problem. Secondly, the risk dimension may be included in any situation where the right offer has to be made to the right client, i.e., risk can be integrated as one of the dimensions when deciding the next best action for a given customer. Customer relationship management should become risk sensitive, so that high-risk clients are not targeted, or offers for a client are configured so that both risk and revenue are optimized. An example is risk-based pricing, which allows a bank to accept customers that would normally be rejected. From a client perspective, the communication channel or customer touch point should not necessarily influence the treatment, offers and value he gets, so client rating must be possible both offline (batch scoring for outbound marketing campaigns, for example) and online (contacts through inbound call centers or web site visits).


5. Conclusion
In this paper we have discussed Basel II compliant credit risk management as an interesting application area for data mining. We have presented a decision management approach to credit risk management that puts heavy emphasis on the steps and processes beyond the core modeling step. In our view this opens up new areas for data mining research as well.

6. REFERENCES
[1] Basel Committee on Banking Supervision. Overview of The New Basel Capital Accord. Third Consultative Paper (CP3), April 2003.
[2] Basel Committee on Banking Supervision. International Convergence of Capital Measurement and Capital Standards: A Revised Framework, Comprehensive Version. http://www.bis.org/publ/bcbs128.pdf, June 2006.
[3] A Solution For Basel II Compliant Credit Risk Management. Chordiant white paper, 2005.
[4] Applying Chordiant Solutions In Risk Management And Basel II Compliancy. Chordiant white paper, 2005.
[5] I. Davidson, A. Grover, A. Satyanarayana, and G. K. Tayi. A general approach to incorporate data quality matrices into data mining algorithms. In KDD 2004, pages 794-798, 2004.
[6] A. Koudijs. Putting Decision-Making in Consumer Credit into Action. Credit Risk International, September-October 2002, pp. 25-26.
[7] P. van der Putten, A. Koudijs, and R. Walker. Basel II Compliant Credit Risk Management: the OMEGA Case. 2nd EUNITE Workshop on Smart Adaptive Systems in Finance: Intelligent Risk Analysis and Management, Rotterdam, The Netherlands, May 19, 2004.
[8] P. van der Putten and M. van Someren. A Bias-Variance Analysis of a Real World Learning Problem: The CoIL Challenge 2000. Machine Learning, vol. 57, no. 1-2, pp. 177-195, October 2004.
[9] Z. Sun. EQPD, A Way to Improve the Accuracy of Mining Fused Data? MSc Thesis, Leiden University, 2005.
[10] R. F. Walker, E. W. Haasdijk, and M. C. Gerrets. Credit Evaluation Using a Genetic Algorithm. In S. Goonatilake and P. Treleaven (eds), Intelligent Systems for Finance and Business, John Wiley & Sons, Chichester, England, pp. 39-59, 1995.

Resolving the Inherent Conflicts in Academic-Commercial Data Mining Collaboration


David Selinger, Tyler Kohn
FortisForge, San Francisco, CA
{dselinger, tkohn}@fortisforge.com

Wendy Liu, Scott Burk
Overstock.com, Salt Lake City, UT
{wliu, sburk}@overstock.com

ABSTRACT
There are three dimensions defining the points of friction in a collaborative effort between a commercial and an academic party: the definition of valuable contribution, intellectual property (of the data set and of the results), and privacy. Further, such an effort is also subject to the resource-allocation problems faced by any joint undertaking. These conflicts affect knowledge discovery in any discipline, from genomics to operational business intelligence to web behavior mining; further, the lack of a structural solution to these problems prevents industry from tapping into academic resources and academia from leveraging real data. The dimensions of intellectual property and privacy have been identified and addressed by numerous efforts. Privacy has been addressed by obfuscation techniques which remove individually identifiable data while retaining patterns [1], while intellectual property has been addressed by parties such as Amazon.com, eBay, and Google by opening APIs/data sets which relinquish rights to the resulting research. However, the intellectual property ownership issues of other, more strategic data sets, of methods and results developed from tighter forms of collaboration, of resource commitments, and of the definition of valuable contribution have not been sufficiently explored. The contribution of this work is to define valuable contribution from both perspectives, analyzing the impact of intellectual property in such a relationship and identifying clear resource-allocation pitfalls. Further, this work proposes process modifications based on CRISP-DM to overcome these risks.

1. INTRODUCTION
The creation of a symbiotic relationship between a corporate and an academic partner may seem to leverage the strengths of each party (large data sets and opportunity for the corporation, and academic prowess for the academic), but further analysis and experience suggest that this type of joint venture has a number of complications. The three dimensions along which conflict may arise in such an arrangement are: the definition of valuable contribution, intellectual property of the data set and the results, and privacy. Further complications arise from resource-allocation conflicts on both sides of the relationship. This work examines both the academic and corporate constituents and the problems they may face in attempting to collaborate in data analysis.

2. CONSTITUENT ANALYSIS
The Academic and Commercial parties have differing terms of success and differing interests in engaging in a collaborative effort.

The Academic. The academic participants in a collaborative effort are likely focused on the goal of developing and publishing new methods of discovering patterns in data, or of verifying the validity of particular methods on data, all in a never-ending effort to find and secure funding to support further research. They will describe the success state using terms such as: funding, grant, sponsorship, interesting problem, publication, paper, method, lift, predictive, descriptive, novel, and contribution. An academic party is likely attracted to such a collaborative effort because of the difficulty in finding interesting, motivating problems and the associated access to large volumes of real-world data.

The Commercial. The commercial participants are interested in making a financial or strategic advancement. The terms of their success state include: ROI, performance, promotion, profit, customer satisfaction, privacy, strategic advantage, proprietary, and patent. Furthermore, there are likely numerous participants on each side, whose personal interests may introduce the invisible hand of politics into such a collaborative partnership (see Table 1).

Table 1: Goals of the individual parties involved in a collaborative project.

Academic roles and goals:
Professor/Advisor - advancement towards tenure, academic recognition, financial support
Student - degree progress, resume material

Commercial roles and goals:
Business Manager/Sponsor - career promotion via ROI recognition or ingenuity
Operational Managers (Data Mining, Marketing, Engineering) - career promotion via execution and ROI; recognition, support and defense of the existence of their team
Data Analyst - career promotion via individual contribution recognition

3. CONFLICTS IN PARTNERSHIP

As is clear from Section 2, both Industry and Academia have something to gain from successful analysis of commercial data sets. However, the parties' interests do not align in the metrics used to define success or in the application of the results that are found, a conflict bound to come to light in a collaborative effort. Resource allocation, a problem not unique to this space, also poses a risk to the success of a project. These problems are accentuated by the almost totalitarian power of the Commercial party over decision-making.

3A. VALUE CONFLICTS

A business values insight that can provide increased revenue or reduced costs, hopefully in a manner which is a proprietary strategic advantage. This inherently includes a requirement to keep both the results and the process used private in order to better compete in its market. Further, no business would be interested in releasing strategic data (e.g., sales information, experiments and results) which might reveal to its competitors confidential information about its internal operations. Academics value publishing results and making public their data sets and methods in order to gain recognition. In effect, businesses compete by keeping their reasons for success private, while academics compete by making their reasons for success public.

3B. RESOURCE-ALLOCATION

Business parties may be distracted by the numerous corporate priorities, internal politics or strategic shifts in the business. This is not unique to this joint venture; however, because the company does not have a cash investment in the project, such contextual shifts are uniquely likely to lead to de-prioritization of the project as opposed to re-thinking or canceling it. This resource starvation can doom the project to a drawn-out and painful termination in numerous ways:

Meta-Data: the availability of the Data Analyst to provide contextual information about the data.
Re-Inventing the Wheel: a business may not invest sufficient time to reveal to their academic partner what insights have already been made. Further, because there is no cost to the business of the Academic parties flailing, the Commercial parties may not feel an incentive to share such sensitive information with their partners.
Strategic Value: the Business Sponsor is required to provide regular feedback, differentiating between strategically valuable and valueless insights.
ROI Guidance: the implementation of an insightful process or method may be too expensive to execute or operationally support to be viable.

Similar resource allocation problems may plague a project on the Academic side. Changes in funding, shifts in the academic environment or early matriculation may pull Students or Professors away from a project.

3C. POWER

The business owns the data, and can therefore dictate terms as to what happens with these data and the results of any analysis on the data. The disparity in power between the Commercial and Academic parties may result in Commercial party abuse of the Academic party, exacerbating the resource-allocation and value conflicts by dictating terms which border on unacceptable. Worse, the Commercial partner may not state terms (i.e., ignore the problems) until much work has already been done and the issues come to a head, at which point the business may unilaterally determine what is to be done with the results.

4. FAILURE CASE NARRATIVES


Below are four hypothetical situations, synthesized from the joint experiences of the authors, which exemplify the analysis of Sections 2 and 3.

Case Study I: Business wants to retain advantage
An academic study determines that applying a specific process to a company's sales and marketing data results in the ability to greatly increase revenue. The academics have discovered a novel way to analyze these data sets and are eager to publish. The company cites privacy and SEC regulations that prevent disclosing the data, in addition to prohibiting the academics from publishing the algorithms with sanitized data for fear of a competitor leveraging the methodology.

Case Study II: A 25% improvement is not always an improvement
An academic analysis of a company's sales data reveals that efficiencies can be made to decrease shipping time by 25%. The student is thrilled, but the company declines to test this hypothesis because current shipping times are already rated as excellent by customers and the costs associated with testing and implementation are prohibitive.

Case Study III: Academic achieves 2% lift
A dataset is provided by the business to the academics, under the hypothesis that adjusting an existing method significantly increases the ability to predict a target business variable or identify a business pattern. A novel approach is developed, improving on an age-old method and achieving 2% lift. The student writes a paper describing the process and detailing the results gleaned from experimenting on this proprietary dataset. However, the new method takes 20 times the computational power of the existing method to implement operationally. The increase in systems requirements results in a net-negative implementation ROI due to higher system and maintenance costs and lower system availability. The business does not want to invest subsequent time into auditing and reviewing the paper, especially since the method requires real data to prove the lift.

Case Study IV: Recognition for Participation
The business is interested in making its first foray into academic publishing with a partner in an attempt to attract top data analysis talent, and releases a somewhat risky data set to its partner for analysis. Over time the partners achieve impressive results which are deemed by the business as suitable for publishing. The academic party combines its results from this analysis with those of five other studies for publication. The business is appalled at the minor role its results play in the overall paper and is unwilling to risk such an important dataset to publication.

4. ELEMENTS OF A SUCCESSFUL COLLABORATIVE PROJECT

Academic and business joint ventures have great potential but are not without pitfalls. The risks in a project can be mitigated by picking an appropriate project with the right partner(s).

The Right Project. A successful project considers both the data sets used and the importance of the project. If a project can use a data set which does not reveal strategic business value, it is more likely to be publishable, or at least usable in validating results. An example of this would be data released by eBay or Amazon.com via their web APIs. A publicly available data set may be even more attractive; however, this may undermine the academic interest in collaborating in the first place. There is a balance between the importance of a project to the company and the risk of revealing strategic advantages. A project which is too far from the company's core competency has a limited ability to yield a profit, whereas a project too close to the core competency would produce results definitely applicable by competitors. Finding the right balance is critical. Further, projects closest to the core competency are likely to require data sets which contain confidential strategic information.

The Right Partner. There are particular commercial organizations and data sets which may be more suited to finding positive answers to the above questions. A corporate collaborator that is a non-profit, a government body, or an effective monopoly in the arena the academic wishes to investigate may be a good candidate. These parties have less to lose if their data are made public and may be more willing to invest sufficient time in a collaborative knowledge discovery effort.

5. CRISP-DM MODIFICATIONS TO ACCOMMODATE COLLABORATION

CRISP-DM is a widely accepted standard process for data mining [2]. It was developed by a consortium of industry leaders and practitioners to provide a framework that is flexible and rich enough to be utilized on a variety of data mining projects. Although the standard represents the best process to date for structuring a data mining project, it cannot ensure the success of collaborative efforts between commercial and academic parties. The hierarchical breakdown of CRISP-DM emphasizes four tiers of abstraction, from general to specific (Phases -> Generic Tasks -> Specialized Tasks -> Process Instances). At the highest, most general level it is not difficult to find agreement in definitions between parties, but as you move down the hierarchy there is less chance that both parties will find agreement.

Figure 1: The CRISP-DM process cycle

The CRISP-DM reference model consists of a cycle of interdependent phases: Business Understanding -> Data Understanding -> Data Preparation -> Modeling -> Evaluation -> Deployment. The business understanding phase includes the tasks of determining the business and data mining objectives, and is an opportunity to flesh out the goals of each constituent. Because of the differences in the definitions of value contribution between academic and industry partners, these tasks will be difficult. However, as we have seen, a project without such understanding may be bound for failure. The output of these tasks should include a contract agreement between the parties. The contract should include specific elements from the above analysis:

The goals of each party: both the spirit and specific terms of success should be laid out. What is the end goal of this relationship?
The restrictions of each party: what will be unacceptable in this relationship? In contrast with the goals, these are things which explicitly cannot be done.
Resource dedication: who will be dedicated to the project? How much time will each resource provide?
Business Context: what are the specific tasks which will ensure transfer of contextual meta-data and strategic knowledge?

This modification to the process may require significantly more time to be spent upfront in the relationship. Because of the numerous risks and the high cost of these risks, this investment of time will pay off. Similar to the development of software specifications, high-risk projects necessitate upfront investments of time in requirements gathering and mutual understanding of terms and goals. If the outcome of the contract negotiations is that no agreement can be reached, it is better that such an impasse is discovered before a significant investment of time.

6. CONCLUSION

Despite the apparent symbiotic opportunity of collaborative Commercial-Academic data analysis projects, they face serious risks from differing goals, intellectual property rights and resource allocation. These risks are exacerbated by a disparity in power between the two parties, with numerous scenarios where these risks overwhelm the potential value of the relationship, resulting in wasted time and effort on the part of both parties. Understanding these risks and vetting them in a collaborative relationship can prevent such losses. Specifically, the risks can be mitigated by identifying the elements of a successful project and by following the proposed modifications to the CRISP-DM data mining process.

7. REFERENCES

[1] R. Agrawal and R. Srikant. Privacy-Preserving Data Mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, 2000.
[2] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R. Wirth. CRISP-DM 1.0: Step-by-step data mining guide. Downloadable from http://www.crispdm.org.

Using Data Mining in Procurement Business Transformation Outsourcing


Moninder Singh, Jayant R. Kalagnanam
IBM Thomas J. Watson Research Center 1101 Kitchawan Road Yorktown Heights, NY 10598, U.S.A

ABSTRACT
Business Transformation Outsourcing (BTO) is rapidly becoming a popular way for enterprises (BTO clients) to streamline operations and reduce costs by offloading non-core operations to external BTO service providers. A commonly outsourced business operation is procurement, wherein a BTO service provider takes over the task of acquiring goods and services for the BTO client. The BTO service provider in such engagements can then do strategic sourcing by aggregating the spend of multiple BTO clients with its own spend, thereby increasing sales volumes, consolidating the supplier base, and negotiating better pricing deals by redirecting the bigger volumes to fewer, preferred suppliers. The BTO service provider shares a part of the savings generated with the BTO clients while retaining the bulk of the savings itself, making it a win-win situation for all parties. However, there is a significant technical effort required in aggregating the spend information, since it involves merging transactional spend data from heterogeneous data sources across various functional and geographic organizations across multiple enterprises. As such, the transactions are not cross-indexed, different taxonomies are used to categorize the transactions, data comes from multiple sources such as invoices, purchase orders and payments, the same suppliers are referred to differently, and different ways are used to represent information about the same commodity, mostly as unstructured textual descriptions. This paper discusses the use of data mining for performing such spend aggregation in IBM's procurement BTO practice, including the problems faced, techniques used, lessons learnt and open issues that still need to be addressed.

1. INTRODUCTION

Over the past few years, more and more enterprises have been outsourcing some of their non-core business processes to external parties, a practice called Business Transformation Outsourcing (BTO). One commonly outsourced business operation is procurement, wherein a BTO service provider takes over the task of acquiring goods and services for the BTO client. Since enterprises invest a significant amount of resources in procurement activities, such outsourcing leads to immediate direct benefits on two fronts. First, by offloading non-core business processes to an external party, the enterprise can devote more resources to its core operations (in which it has maximum expertise), reduce the complexity and overhead of its business operations, and streamline core business processes by eliminating the processes in which it has little or no expertise. Second, substantial savings are immediately generated via a reduction in human as well as non-human (e.g., procurement applications and the hardware to run them) resources.

However, an additional significant chunk of savings comes indirectly via the BTO service provider, who is able to acquire the same goods and services at substantially lower prices for three main reasons. First, the service provider normally has significant expertise in procurement, and can utilize specialized and more efficient procurement processes. Second, the service provider can take advantage of economies of scale by taking on multiple procurement BTO clients. Third, and most significantly, the service provider in such engagements can do substantial strategic sourcing by aggregating the spend of the multiple BTO clients with its own spend, thereby increasing sales volumes, consolidating the supplier base, and negotiating better pricing deals by redirecting the bigger volumes to fewer, preferred suppliers. For a BTO service provider such as IBM, which itself has a significant procurement spend, this allows substantial savings to be negotiated. Note that aggregation of spend across multiple BTO clients enables such strategic sourcing to be substantially enhanced over what could be done solely within a single enterprise. Figure 1 shows a cost function for volume-based deals that are typically negotiated with vendors. The function is commonly a step function where the cost per unit decreases with increasing volume. Thus, if there are three different BTO clients that procure the same commodity with estimated volumes v1, v2 and v3, albeit from different suppliers, then the BTO service provider can negotiate a significantly better price by combining that volume (v = v1 + v2 + v3) and directing it all to a common, preferred supplier. The service provider shares a part of the savings generated with the BTO clients while retaining the bulk of the savings itself, making it a win-win situation for all parties. In order to do this procurement on behalf of its BTO clients, the BTO service provider must be able to develop a consistent supplier base and a consistent commodity base across all clients, so that an accurate view can be developed of exactly what is being procured and from whom. Once that is done, the BTO service provider can evaluate all suppliers from which a particular commodity is acquired, and negotiate better deals with one or more of them based on the combined volume of that commodity. However, there is a significant effort required in aggregating the spend information, since it involves merging transactional spend data from heterogeneous data sources across various functional and geographic organizations, across multiple enterprises and, potentially, different industries.


Moreover, such data comes from multiple sources such as invoices, purchase orders and payments. As such, the transactions are not cross-indexed, different taxonomies are used to categorize the transactions, the same suppliers are referred to differently, and different ways are used to represent information about the same commodity. Doing the aggregation manually is an extremely costly and time-intensive process. A few independent software vendors (ISVs), such as Emptoris [2] and Zycus [11], do provide services/solutions for aggregating intra-company spend (traditionally referred to as spend analysis); however, there are no solutions for extending that to the aggregation of inter-company spend, which presents a host of problems not encountered while aggregating intra-company spend. A leader in the BTO space, IBM has an extensive procurement BTO practice with several large clients, including United Technologies, Unilever and Goodyear. Traditionally, IBM has used a combination of ISV services and internal manual efforts for aggregating spend during procurement BTO engagements. However, we have been developing and deploying a system internally for automatically aggregating spend, both within and across multiple enterprises. This paper describes the use of data mining in this system to enable fast and accurate spend aggregation across multiple BTO clients, facilitating the enhanced strategic sourcing described previously. An earlier paper, Singh et al. [9], provides a system-level description of much of the design and the basic functional and architectural features of the system, and the interested reader is referred to it. However, [9] addressed primarily the problem of aggregating intra-company spend; in this paper, we provide a technical view and discuss the tasks involved and the technical and business issues faced in aggregating spend across multiple enterprises, plus the approach we adopted to address these issues.
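To make the volume-aggregation effect of Figure 1 concrete, here is a small illustrative sketch; the price tiers and client volumes are invented for the example and are not from the paper:

```python
def unit_price(volume, tiers=((0, 10.00), (1_000, 9.00), (5_000, 7.50), (20_000, 6.00))):
    """Step-function pricing: the unit price drops once volume passes each tier."""
    price = tiers[0][1]
    for threshold, tier_price in tiers:
        if volume >= threshold:
            price = tier_price
    return price

client_volumes = {"client_A": 3_000, "client_B": 4_000, "client_C": 2_500}

# Cost if each client buys separately at its own (smaller) volume...
separate = sum(v * unit_price(v) for v in client_volumes.values())
# ...versus the cost when the BTO provider aggregates the volume with one supplier.
total_volume = sum(client_volumes.values())
aggregated = total_volume * unit_price(total_volume)

print(separate, aggregated, separate - aggregated)  # savings from aggregation
```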
Figure 1: Example demonstrating cost-savings in procurement BTO

2. PROBLEM DESCRIPTION

As described in Section 1, a significant technical effort is required in aggregating the spend information across multiple enterprises. Even within a single enterprise, transactional spend data often resides in multiple, heterogeneous data sources across various functional and geographic organizations. Data in these repositories comes from a variety of sources and applications, such as invoices, purchase orders and account ledgers. As such, this data is generally inconsistent, with no cross-indexing between transactions and with the same supplier or commodity being described differently in different transactions. This complexity is multiplied manifold when spend across multiple enterprises has to be aggregated, since different supplier bases as well as multiple commodity taxonomies have to be reconciled to a common supplier base and taxonomy, respectively. In order to aggregate spend across multiple enterprises, two main tasks normally have to be performed. First, the supplier names have to be normalized (e.g., IBM and International Business Machines have to be recognized as a single entity). This has to be done both within an enterprise and across the multiple enterprises. It enables the BTO service provider to determine exactly who the main suppliers are and the amount of total spend across the multiple enterprises that is being channeled to those suppliers. The normalization of supplier names involves the mapping of multiple names for the same entity to a single, common name for that entity. Multiple names arise due to different locations, different businesses undertaken by the same enterprise, and parent-child relationships due to acquisitions, as well as errors and noise in the transactional data. Since enterprises have demographic information about suppliers, such as addresses, contact information and tax identifiers, this data can be used in conjunction with the names to normalize the supplier names. Second, the commodity taxonomies of the individual enterprises have to be mapped to a common commodity taxonomy to enable spend to be aggregated across commodities (e.g., hazardous waste handling expense and pollution control expense need to be mapped to the hazardous waste control commodity). This enables the BTO service provider to accurately determine the total procurement volume of any given commodity. Moreover, in the case of BTO engagements where the procurement spend of multiple clients has to be merged, the spend categories of all the enterprises (BTO clients and BTO service provider) need to be mapped to a uniform taxonomy. In such cases, a standard taxonomy such as the United Nations Standard Products and Services Code (UNSPSC) [10] is well suited, since it spans all industry verticals, thus enabling a BTO service provider to host procurement for all kinds of clients. In fact, a third kind of mapping may also be needed in some cases to facilitate spend aggregation across multiple enterprises. This happens when a particular enterprise does not have (or enforce) a formal spend or commodity taxonomy to categorize individual spend transactions. In such cases, the individual spend transactions have to be mapped to the target commodity taxonomy, such as the UNSPSC, based on unstructured textual descriptions in the transactions (such as from invoices or purchase orders). Another situation in which such transactional mapping becomes necessary is when the client spend taxonomy is not specific enough, i.e., spend categories are at a higher level than the commodity level needed for the aggregation. In such cases, the transactional-level descriptions may provide more information about the actual commodities purchased to allow such mapping to be done. (Our focus has been primarily on commodity taxonomy mapping; nevertheless, the system also has the complete capability, functionality and tools needed to do such transactional mapping using the same techniques used for taxonomy mapping, since both problems are essentially the same, with some differences that we discuss later.)

These mapping tasks are complicated for a variety of reasons. The various complicating factors can be summarized as follows:

1. Commodity descriptions are normally very short. As such, each word is significant. However, distinguishing between different potential matches becomes correspondingly harder, since the items in a taxonomy often number in the tens of thousands, of which the best one has to be selected based on a couple of words.

2. Commodity descriptions often contain significant amounts of domain-specific terminology as well as abbreviations.

3. The order of words in commodity descriptions becomes an important issue, one that is not considered in traditional information retrieval methods that use a bag-of-words approach.

4. The UNSPSC taxonomy is extremely broad, covering all industrial groups, but is often not very deep (commodities are often very general). Moreover, it is quite big, with almost 20K commodities. On the other hand, company-specific taxonomies are generally much smaller and less broad, but often contain more specific codes. Therefore, the mapping between many enterprise taxonomies and the UNSPSC is many-to-many rather than one-to-one.

5. The UNSPSC taxonomy is a true hierarchical taxonomy in which an is-a relationship exists across different levels. Thus, a UNSPSC Family includes similar Classes of Commodities, each Class further including related Commodities. However, enterprise commodity taxonomies are seldom organized this way. More often than not, they reflect functional or spend categorizations (such as business travel expenses, direct procurement related, etc.). Therefore, multiple commodities that are children of the same class in an enterprise taxonomy may map to very different areas of the UNSPSC taxonomy, and vice versa.

6. In cases where transactional mapping needs to be done, for the reasons highlighted earlier, the problems are compounded by the fact that transactional descriptions are often noisier than taxonomy descriptions, often have substantially more domain-specific terminology, and also entail the need to resolve potentially conflicting matches resulting from multiple descriptions in the same transaction (arising from different sources such as POs and invoices).

7. Computational complexity also becomes a significant factor, and many techniques that perform well on simple problems become unusable when applied to spend cleansing tasks.

8. An important factor we found with regard to human users taking such a system seriously and using it for the mapping was their perception of how good it was. For supplier name normalization, users expressed strong disapproval of false negatives (such as when suppliers that clearly looked to be the same were considered by the tool to be different). However, in the case of commodity mapping they viewed false positives in a bad light, especially if the mapped or suggested commodity was too far from the item being mapped. For example, an item such as tax software being mapped to software tax was regarded as extremely bad, even though the two terms are very similar under some approaches (e.g., tf-idf).
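To illustrate points 3 and 8 above, a small sketch (assuming scikit-learn; the example strings are taken from the text): a bag-of-words tf-idf representation considers "tax software" and "software tax" identical, even though they denote very different commodities.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = ["tax software", "software tax", "hazardous waste handling"]
tfidf = TfidfVectorizer().fit_transform(descriptions)

# Cosine similarity of the first description against the other two.
print(cosine_similarity(tfidf[0], tfidf[1]))  # 1.0 -- word order is lost
print(cosine_similarity(tfidf[0], tfidf[2]))  # 0.0 -- no shared tokens
```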

3. TECHNICAL SOLUTION
For the two mapping problems, we looked at techniques from the information retrieval literature, such as string similarity methods, as well as supervised and unsupervised learning methods from the machine learning literature. One often limiting factor in BTO engagements is the absence of any mapped data. For example, there is generally no data that explicitly marks multiple supplier names as being different names for the same physical enterprise. Thus, we are normally precluded from using supervised machine learning methods to learn classifiers that predict the normalized supplier name for a given supplier or that map one commodity taxonomy to another. While we have had a high degree of success using methods such as SVMs [3], maximum entropy models [7] and naive Bayesian classifiers [5] on small datasets of manually normalized supplier names and mapped taxonomies, we are mostly forced to use unsupervised (clustering) approaches in conjunction with rules based on similarity methods. While string similarity measures, such as Levenshtein distance [1,4] and term-frequency/inverse-document-frequency (tf-idf) [8], could be used directly to compare supplier names or taxonomy descriptions, each approach was found to have weaknesses that made it unsuitable for use by itself. For example, Levenshtein distance calculation is computationally expensive, and its usage on real data with tens to hundreds of thousands of supplier names makes the mapping process computationally prohibitive. Moreover, such edit distances account only for character differences and do not distinguish where in a name the differences occur; in supplier names, however, position is very important. For example, differences at the beginning of names are more likely to imply that the names belong to different enterprises than differences towards the end of the names. Even relatively cheaper methods like tf-idf have similar problems, since edit-distance computations still need to be done at the token level. Moreover, tf-idf does not take into consideration the order of the tokens in a textual string.
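A minimal sketch of the Levenshtein (edit) distance mentioned above, in pure Python to make its cost visible; the supplier names are invented examples:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance: O(len(a) * len(b)) per pair,
    which is why applying it across a very large supplier base is costly."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# The distance treats a difference at the start of the name the same as one
# at the end, even though the former is a stronger hint of a different company.
print(levenshtein("IBM Corp", "IBM Corp."))   # 1 -- likely the same supplier
print(levenshtein("ABM Corp", "IBM Corp"))    # 1 -- likely a different supplier
```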

3.1 Supplier Name Normalization


Thus, for supplier name normalization, we built a repository of rules based on the application of various string similarity methods to tokenized supplier names. These rules include exact and fuzzy matches on whole or parts of supplier names along with demographic data (street address, city, phone, etc.). They are further enhanced by techniques such as stop word elimination, removal of special characters, transformation of numbers to a uniform format, and abbreviation generation and comparison. Furthermore, we use tf-idf based indexes, dictionaries, and standard company name databases (such as the Fortune 500 list) to assign different weights to different words and tokens in a name. Thus, differences towards the beginning of names are considered more important than differences towards the end of the name, proper nouns are weighted more heavily than other types of words, and so on. Simpler rules are evaluated first; more complex rules and methods are applied later. For cases where address information is present as unstructured text instead of attribute-value pairs, regular expressions are used to extract attributes such as zip codes, street addresses, and city.

Furthermore, these rules are used in conjunction with a clustering approach to increase the efficiency of the process, using the canopy-based clustering method of McCallum et al. [6]: the idea is to first use computationally cheap methods to form loose clusters, called canopies, and then apply more computationally intensive methods to refine the canopies into appropriate clusters (a simplified sketch of this two-stage approach is given at the end of this subsection). Due to the extremely large supplier bases encountered for many enterprises, this clustering approach is particularly attractive. Nevertheless, even this approach ran into problems for large supplier bases when used with the string similarity methods, since very small canopies led to large errors in the data while big canopies resulted in unacceptable computational expense. As such, for big datasets we use various kinds of simple rules on names, name tokens, and address fields, and restrict expensive string similarity approaches to a minimum. For smaller datasets this is not a problem, and more stringent similarity methods are used. To create canopies, cheap methods including zip code matches, phone number matches, and name and/or address token matches are used in various rules. Once the canopies have been formed, the more expensive techniques consisting of the elaborate rules described above are used to form clusters. Once the supplier base of the client under consideration has been clustered, it is merged with the cumulative normalized supplier base formed from all previously normalized clients' data. Note that by incrementally building up such a repository of normalized suppliers, and mining the repository for subsequent clients' normalization tasks, the accuracy and performance of the system improve with each additional client. In fact, this behavior was quite apparent in our application of the system to the spend data of several clients.

Thus, while the complete details of the method (especially the rules repository) are IBM proprietary, the supplier-normalization approach can generally be described as follows:

1. For each non-normalized supplier, format the supplier name by eliminating stop words, removing special characters, transforming numbers to a uniform format, etc.

2. Create metadata for each supplier, such as its first few tokens, potential abbreviations, and the presence of tokens in dictionaries or of the supplier name in databases such as the Fortune 500.

3. Use regular-expression based extractors to break up address fields into more specific information such as street name, street number, and PO box number.

4. Segment the supplier base of the current BTO client:
   a. Create canopies using cheap methods such as zip code matches, phone number matches, and first-token matches (using n-grams), as well as exact and inclusion name matches.
   b. Use more stringent methods to create clusters from the canopies. Use cheaper rules first, followed by more expensive rules. Rules include checking for non-dictionary words, Fortune 500 words, similarity in name and address fields, abbreviation matches, etc.

5. Merge the set of clusters with the current normalized supplier base consisting of all client data that has already been normalized:
   a. For each cluster, find possible matches in the normalized supplier base using weak rules (as described above for canopy formation).
   b. Use stronger rules to determine whether an exact match exists.
   c. If a match is found, add the new cluster to the normalized database with the normalized name of the match; otherwise, add it as a new supplier with a normalized name chosen from among the set of supplier names in the cluster.
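The following is a minimal, self-contained sketch of the two-stage canopy idea referenced above, in the spirit of McCallum et al. [6]: a cheap blocking key (here, simply zip code plus the first name token) forms canopies, and a more expensive fuzzy comparison refines each canopy into clusters. The specific blocking key, threshold, and use of Python's standard difflib matcher are illustrative assumptions, not the proprietary rule repository described in the text.

```python
# Illustrative two-stage (canopy -> cluster) sketch for supplier normalization.
# The blocking key, similarity threshold, and difflib-based matcher are assumptions
# chosen for clarity; the production system uses a proprietary rule repository.
from collections import defaultdict
from difflib import SequenceMatcher

suppliers = [
    {"name": "Acme Corp",         "zip": "10001"},
    {"name": "Acme Corporation",  "zip": "10001"},
    {"name": "ACME Corp.",        "zip": "10001"},
    {"name": "Apex Industries",   "zip": "10001"},
    {"name": "Beta Supplies Inc", "zip": "60601"},
]

def canopy_key(s):
    # Cheap blocking: zip code + first token of the (lowercased) name.
    return (s["zip"], s["name"].lower().split()[0].strip(".,"))

def similar(a, b, threshold=0.7):
    # More expensive fuzzy comparison, applied only within a canopy.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio() >= threshold

# Stage 1: group records into canopies using the cheap key.
canopies = defaultdict(list)
for s in suppliers:
    canopies[canopy_key(s)].append(s)

# Stage 2: within each canopy, link records that pass the expensive comparison.
clusters = []
for members in canopies.values():
    assigned = []
    for s in members:
        for cluster in assigned:
            if similar(s, cluster[0]):
                cluster.append(s)
                break
        else:
            assigned.append([s])
    clusters.extend(assigned)

for c in clusters:
    print([s["name"] for s in c])
```

Because the expensive comparison runs only within canopies, its cost grows with canopy size rather than with the full supplier base, which is the property that makes the approach attractive for very large client datasets.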

3.2 Commodity Taxonomy and Transactional Mapping


On the other hand, for mapping commodity taxonomies to the UNSPSC commodity taxonomy, we primarily use string similarity based methods augmented with WordNet [12], and rules based on positional differences between tokens in the query and candidate descriptions. One significant approach we took, which also proved quite beneficial, was to mine previously mapped company-proprietary taxonomies for similarities to the commodity description in question, and to use them to acquire the proper UNSPSC mapping when substantial similarities were found. That said, the actual approach we have taken so far is also motivated by the various issues discussed previously in Section 2. These include the computational expense of edit-distance methods, the absence of positional differentiation in all the string similarity methods, and the prevalence of domain-specific terminology. Compounding all this is the fact that the source and target taxonomy may have wide structural differences. As a case in point, consider the UNSPSC code. It has roughly 20K commodities in a four-level taxonomy. However, while the taxonomy is very broad, and includes commodities and services in almost all industrial sectors, it is not very deep in any given sector. Company taxonomies, on the other hand, are not very broad but are generally far more specific in terms of commodities, especially in the case of items used in production. For example, while the UNSPSC has commodity codes for desktop and notebook computers, companies are much more specific about the particular types of desktop and notebook computers. This is especially so in the case of production parts, but also occurs in the case of services. As such, there is often a many-to-many mapping that needs to be done between the two taxonomies. Another important factor, also pointed out in Section 2, is to determine exactly what the commodity description is referring to. For example, "software tax" is a sort of tax while "tax software" is a type of software. To enable the system to do this mapping properly, we have used various techniques from the classical information retrieval literature, including stop word removal, stemming, and tokenization using words and n-grams, coupled with dictionaries and domain-specific vocabulary. Moreover, we have integrated the system with WordNet to enable the use of synonyms, sense determination,
morphological analysis, and part-of-speech determination in the creation of rules and methods for better identifying the main keyword(s) in a description and for ranking the mapping results. Finally, we developed a set of post-filtering and ranking rules which assign weights to tokens in the queries and candidate descriptions based on such importance, and re-rank the candidate results to produce a more accurate match list.

The method employed can be described as follows:

1. For the UNSPSC taxonomy, as well as each previously mapped company taxonomy, do:
   a. For descriptions belonging to company taxonomies, remove stop words and apply simple transformations such as stemming and term normalization.
   b. For all descriptions, use WordNet to generate synonyms for individual tokens.
   c. Generate tf-idf indexes for each taxonomy.

2. For each commodity description in the to-be-mapped taxonomy, do:
   a. Remove stop words and apply simple transformations such as stemming and term normalization.
   b. Generate synonyms for individual tokens using WordNet.
   c. If an exact match is found with an entry in the UNSPSC taxonomy or a previously mapped taxonomy, stop and use the matches. Otherwise, go to (d).
   d. Use the tf-idf (cosine distance) method to determine a candidate list of possible matches from each taxonomy.
   e. Take a union of all candidate lists to generate a final candidate list.
   f. For the query description:
      i. Use WordNet to determine the part of speech of each token.
      ii. Identify the main object of the description, as well as the related qualifiers. Thus, for "software tax", the object would be "tax" and the qualifier would be "software". For "application software tax", there would be an additional qualifier, "application".
   g. For each candidate match, do the same process as in (f).

3. Use the set of weighting rules to re-rank the candidate matches based on the objects, qualifiers, and their relative positions. While the exact set of rules is IBM proprietary and we are unable to discuss them, the general idea is as follows: in addition to the presence or absence of various tokens in the query and candidate match, weights are assigned to tokens based on their relative and absolute position, as well as their importance to the query (object, immediate qualifier, distant qualifier, etc.). Thus, for example, if the objects matched in value and position, the candidate would be ranked higher than a candidate in which the tokens matched but their relative positions did not. If the query was "software tax", a candidate "tax" would be ranked higher than a candidate "tax software", even though the latter is a perfect token-based match. Similarly, "application software tax" would be ranked higher than "tax software" but lower than "software tax" (a simplified illustration of this kind of positional re-ranking is sketched at the end of this subsection).

For transactional mapping, we have used essentially the same techniques and algorithms with some extensions. First, we use the same clustering techniques as for supplier name normalization to cluster similar transactions based on their transactional descriptions. Second, we extend the taxonomy mapping algorithm to also use transactional descriptions from previously mapped companies' data. Third, we use simple methods (such as majority voting) to combine mapping results arising from multiple descriptions, either for the same transaction or for different transactions in the same cluster. Finally, we are building better repositories and techniques for filtering out the noise from such descriptions, mainly using stop words and better keyword indices.
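The sketch below illustrates the kind of object/qualifier re-ranking just described, using a naive heuristic (treat the last token of a description as its object and the remaining tokens as qualifiers) in place of the WordNet-based analysis. The scoring weights and the heuristic itself are our own illustrative assumptions, not the proprietary IBM rules.

```python
# Illustrative re-ranking sketch: candidates whose *object* (head noun) matches the
# query's object outrank candidates that merely share tokens in a different role.
# The last-token-as-object heuristic and the weights below are assumptions for
# illustration; the production system derives objects/qualifiers with WordNet
# and a proprietary rule set.

def parse(description: str):
    tokens = description.lower().split()
    return {"object": tokens[-1], "qualifiers": set(tokens[:-1])}

def score(query: str, candidate: str) -> float:
    q, c = parse(query), parse(candidate)
    s = 0.0
    if q["object"] == c["object"]:
        s += 3.0                                        # matching objects matter most
    if q["object"] in c["qualifiers"]:
        s -= 1.0                                        # query object demoted to a qualifier
    s += 1.0 * len(q["qualifiers"] & c["qualifiers"])   # shared qualifiers help
    if c["object"] in q["qualifiers"]:
        s += 0.5                                        # weaker credit: roles are swapped
    s -= 0.5 * len(c["qualifiers"] - q["qualifiers"])   # penalize unrelated extra qualifiers
    return s

query = "software tax"
candidates = ["software tax", "application software tax", "tax", "tax software"]
for cand in sorted(candidates, key=lambda c: score(query, c), reverse=True):
    print(f"{score(query, cand):5.1f}  {cand}")
# Resulting order: "software tax" > "application software tax" > "tax" > "tax software",
# matching the ranking behavior described in the text.
```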

4. SYSTEM PERFORMANCE
The system has been used for a variety of companies across a wide spectrum of industry verticals, including, in addition to IBM itself, an electronics manufacturer (company B), a transport company (company C), a materials company (company D), a consumer discretionary products manufacturer (company E), and a procurement services company (company F, since acquired by IBM). Each company presented a different set of problems and issues. Some of them had only spend data linked to suppliers without any commodity information (B, E, and F), while one had only commodity taxonomy information and no supplier data (D). Of these five, only one has been an actual BTO spend aggregation engagement; the other four have been used to test and further develop the system. We mainly measured performance for the two main tasks: supplier name normalization and commodity taxonomy mapping. Commodity transactional mapping (based on invoice descriptions) was tested only on some of IBM's transactional spend data. For commodity taxonomy mapping, we mapped each taxonomy against the UNSPSC taxonomy to enable spend to be aggregated across the multiple enterprises by commodity. For supplier name normalization, we first normalized the IBM supplier base, and then mapped each client's supplier base against the (incrementally) expanding supplier base. For both tasks, since we did not have any mapped data, there was no direct way to evaluate the quality of the mappings. As such, we measured performance approximately using the web-based review/sign-off tools, whereby humans examined random subsets of the mappings and manually corrected mappings that were incorrect.
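Because accuracy was assessed from reviewed random samples rather than from fully labelled data, the reported figures are estimates. A minimal sketch of how such a sample-based estimate (with a simple normal-approximation confidence interval) can be computed is shown below; the sample sizes and counts used are made-up placeholders, not the actual review data.

```python
# Minimal sketch: estimating mapping accuracy from a manually reviewed random sample.
# The counts below are made-up placeholders; the real figures came from the
# web-based review/sign-off tools described in the text.
import math

def accuracy_estimate(n_reviewed: int, n_correct: int, z: float = 1.96):
    """Point estimate and approximate 95% confidence interval (normal approximation)."""
    p = n_correct / n_reviewed
    half_width = z * math.sqrt(p * (1.0 - p) / n_reviewed)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

p, lo, hi = accuracy_estimate(n_reviewed=500, n_correct=440)
print(f"estimated accuracy: {p:.1%}  (95% CI roughly {lo:.1%} - {hi:.1%})")
```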

The quality of supplier name normalization increased rapidly with each subsequent client. The worst performance was on the IBM vendor base, primarily due to its size (487K unique vendors). As such, the use of computationally intensive methods such as edit distances turned out to be prohibitive, and we were forced to restrict such methods to very limited situations. Moreover, a significant number of the vendors were in fact unique. Nevertheless, the 487K supplier base was reduced to 204K normalized vendors (42% of the original). The accuracy of the mapping was estimated to be between 85% and 90%, with a majority of the errors being due to positional character differences, something that would have been caught had we been able to use edit-distance methods. Sure enough, the accuracy steadily increased to well over 90% for subsequent clients, since the supplier base in each case was significantly smaller and we were able to use the string similarity methods more freely. In each case, the reduction in the original supplier base was quite large: company B was reduced from 25K to 13K (52% of the original), company C from 16K to 13K (81%), company E from 14K to 12K (86%), and company F from 12K to under 1K (8%). As expected, the reduction was greater for companies B and F (similar industry vertical as IBM) and smaller for companies C and E, which had significantly different supplier bases. More dramatic were individual supplier normalizations. For example, in the case of IBM, almost 500 unique vendors were normalized to IBM, 64 to Hewlett Packard, and 68 to Sanmina. Similarly, in the case of company B, there were 200 unique vendors that were name variations or different locations of company B itself, as well as 58 mappings to Hewlett Packard and 124 mappings to Tyco. Large suppliers often had high tens to low hundreds of mappings for each of the companies whose supplier base we normalized.

Note, however, that the system does not directly identify a special kind of mapping: those arising from mergers and acquisitions, where the names and addresses of different suppliers are very different although they are in fact the same enterprise. As an example, consider IBM and Lotus Development Corporation. The only way the relationship between these two would be correctly detected is if they had the exact same address in different records. While this information can be purchased from external vendors, such as Dun and Bradstreet (D&B), it is a very expensive proposition, and even then it is based upon an exact match of supplier name and address with D&B's databases. To handle this, we have incorporated our own database of mappings resulting from mergers and acquisitions, built from information in the public domain. This database, however, is still relatively small, and we did not consider such mappings while evaluating the system. Including these in the performance evaluation would significantly reduce the measured performance, but this type of mapping is not really addressable by data mining techniques unless similar mapped data is available. As we use the system for more and more clients, this mapped data is increasing, as is the database of mappings described above, both of which are resulting in gradual improvements for such mappings as well.

For commodity taxonomy mapping, the results were far more mixed. This was due to the fact that none of the companies had a pure commodity taxonomy. Each one of them had categories related to functional or business areas (such as accounting categories). As such, it was not possible to map them to UNSPSC commodities merely by considering the descriptions in the two taxonomies.
Moreover, mapping direct commodities was sometimes difficult due to the paucity of domain-specific knowledge (as in the case of company C). However, for indirect commodities, which are described similarly across different industry verticals, the performance was significantly better. As in the case of the supplier data, we did not have any mapped data, and the only way to assess the quality of mapping a taxonomy to the UNSPSC taxonomy was to have human experts review samples of the results and either mark each mapping as correct or correct it. Given the diversity of the companies to which we have applied the system, and the tediousness of manually reviewing the mappings, we are still in the process of completing these evaluations and do not have exact performance results yet. However, the general patterns seem quite apparent.

The sizes of the taxonomies varied substantially from company to company, with IBM having 632 categories, company C having 229, company D having 717, and company B having 394. The performance was worst on the IBM taxonomy, since it had a significant number of non-commodity categories, such as "tool cleaning of manufacturing equipment", "delegated sponsorships", and "streaming webcasts". Moreover, some IBM categories are so broad (such as "monitors") that they map to multiple UNSPSC commodities (LCD displays, plasma panel displays, CRT monitors), while other categories were so specific that multiple IBM categories mapped to a single UNSPSC code. As such, for most of the categories, the system was unable to map an IBM category to a single UNSPSC code but instead recommended a list of typically 5-10 candidates from which the user had to choose manually using the sign-off tools. However, the system performed much better than when we used only standard retrieval methods (tf-idf without WordNet and the positional weighting rules), where the recommended lists often did not contain the correct codes, often included irrelevant categories, were significantly larger (due to the same weights being assigned to a large number of commodities), and yielded an accuracy of less than 50%. When we augmented the methods with WordNet and the rules based on positional weighting and word senses, the accuracy improved significantly, especially for indirect commodities. Similar results were seen with the other commodity taxonomies. Company D had the cleanest taxonomy in the sense of having a significant fraction of true commodities, and the performance was correspondingly better (and again significantly better with the enhanced retrieval methods than with traditional retrieval methods). For company C, the results were mixed: for indirect commodities the accuracy was again quite high, but overall it deteriorated substantially due to a large number of non-commodity categories.

The limited testing we have done so far on transactional commodity mapping, on some of the IBM transactional data, met with very limited success. The problems outlined above in mapping the IBM spend taxonomy were even more pronounced at the transactional level. Another reason was that transactional descriptions often contained fairly domain-specific information and described commodities at a level even more specific than that of the UNSPSC. For example, IBM categorizes chemicals (in its spend taxonomy) at a fairly high level (such as "lab chemicals"), while the transactions often described exact chemicals with associated nomenclature, strengths, etc. (such as "ammonium chloride" or "hydrochloric acid"). The UNSPSC, on the other hand, lies somewhere in between (such as "organic acids" or "inorganic acids"). So, while the system was able to shortlist viable candidates, it also picked up a large number of invalid matches, primarily due to this specificity mismatch (such as "acid anhydrides", etc.).

5. OPEN ISSUES
We have applied the system to aggregate spend for a variety of companies across a wide spectrum of industry verticals, with varying degrees of success. While results for supplier normalization were generally good (often 80-90% accurate), commodity taxonomy mapping results were far more mixed (ranging from under 50% to generally around 70%). We have identified several areas for improvement, some with known possible solutions and some without. First, there is a need to develop domain dictionaries to enable better matching for taxonomies across a wide spectrum of industry verticals. Second, we need to further enhance the ability to handle non-commodity taxonomy items, such as the accounting categories often found in company taxonomies. Third, we can likely get better performance by using the multiple applications of the system to various clients to develop a repository of mapped data and using it to learn supervised models, as discussed previously, in conjunction with the unsupervised methods and IR techniques (a simplified sketch of this idea is given at the end of this section). Fourth, we are now also focusing on improving the transactional commodity mapping capabilities of the system by building better indices and representations, improving our matching algorithms to better identify useful data in textual descriptions, and making better use of multiple descriptions of the same transaction to yield better matches.

Some other issues, however, are far more challenging and unresolved. First, the system currently needs people skilled in data mining techniques to do the spend aggregation, especially for enterprises in yet-unseen industry groups or with dramatically different commodity taxonomies. The target user for this system, however, is a BTO client representative who is more business oriented, though with some technical background. Bridging this gap is one big unsolved issue that is severely limiting the overall acceptance of the system within IBM. Second, from a technical viewpoint, better mapping algorithms are needed for commodity taxonomy mapping that take into account positional differences between strings (as opposed to the bag-of-words approaches of traditional IR methods), improving upon the rather ad hoc, weight-based rules approach we have currently taken.
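The following is a minimal sketch of the third improvement direction mentioned above: once enough reviewed mappings have accumulated across clients, a supervised text classifier can be trained to suggest UNSPSC codes directly. The tiny training set, the choice of a tf-idf plus linear SVM pipeline, and scikit-learn itself are illustrative assumptions; references [3], [5], and [7] describe the classifier families mentioned earlier in the text.

```python
# Illustrative sketch: learning to suggest UNSPSC codes from previously reviewed
# (description -> code) pairs accumulated across clients. The toy data, the tf-idf
# + linear SVM pipeline, and scikit-learn are assumptions for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical reviewed mappings (commodity description -> UNSPSC-style commodity label).
descriptions = [
    "notebook computer 14 inch",
    "desktop computer workstation",
    "tax preparation software license",
    "payroll accounting software",
    "hydrochloric acid 1l bottle",
    "ammonium chloride reagent grade",
]
labels = [
    "Notebook computers",
    "Desktop computers",
    "Software",
    "Software",
    "Inorganic acids",
    "Inorganic salts",
]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(descriptions, labels)

# Suggest a code for a new, unseen commodity description.
print(model.predict(["laptop computer 15 inch"])[0])
```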

6. REFERENCES
[1] Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval. Addison Wesley, 1999.
[2] Emptoris Spend Analysis Module. www.emptoris.com/solutions/spend_analysis_module.asp
[3] Joachims, T. Text categorization with support vector machines: learning with many relevant features. 10th European Conference on Machine Learning, 1998.
[4] Levenshtein, V.I. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10 (8): 707-710, 1966.
[5] McCallum, A. and Nigam, K. A comparison of event models for naive Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization, 1998.
[6] McCallum, A., Nigam, K., and Ungar, L.H. Efficient clustering of high-dimensional data sets with application to reference matching. Proc. of the 6th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 169-178, 2000.
[7] Nigam, K., Lafferty, J., and McCallum, A. Using maximum entropy for text classification. IJCAI-99 Workshop on Machine Learning for Information Filtering, 61-67, 1999.
[8] Salton, G. and Buckley, C. Term weighting approaches in automatic text retrieval. Tech. Report No. 87-881, Dept. of Computer Science, Cornell University, Ithaca, NY, 1987.
[9] Singh, M., Kalagnanam, J., Verma, S., Shah, A., and Chalasani, S. Automated cleansing for spend analytics. Proceedings of the 14th ACM International Conference on Information and Knowledge Management, 437-445, 2005.
[10] UNSPSC, The United Nations Standard Products and Services Code. http://www.unspsc.org
[11] Zycus Spend Data Management solution. www.zycus.com/solution/spend-data-management-solutions.html
[12] WordNet: A lexical database for the English language. Cognitive Science Laboratory, Princeton University, Princeton, NJ. http://wordnet.princeton.edu
