
International Journal of Forecasting 18 (2002) 647–671

www.elsevier.com/locate/ijforecast

A hybrid system-identification method for forecasting telecommunications product demands
Louis A. Cox Jr. a, *, Douglas A. Popken b

a Cox Associates, 503 Franklin St., Denver, CO 80218, USA
b Systems View, 9139 S. Roadrunner St., Highlands Ranch, CO 80129, USA

Abstract

A crucial challenge for telecommunications companies is how to forecast changes in demand for specific products over the next 6 to 18 months, the length of a typical short-range capacity-planning and capital-budgeting horizon. The problem is especially acute when only short histories of product sales are available. This paper presents a new two-level approach to forecasting demand from short-term data. The lower of the two levels consists of adaptive system-identification algorithms borrowed from signal processing, especially Hidden Markov Model (HMM) methods [Hidden Markov Models: Estimation and Control (1995) Springer-Verlag]. Although they have primarily been used in engineering applications such as automated speech recognition and seismic data processing, HMM techniques also appear to be very promising for predicting probabilities of individual customer behaviors from relatively short samples of recent product-purchasing histories. The upper level of our approach applies a classification tree algorithm to combine information from the lower-level forecasting algorithms. In contrast to other forecast-combination algorithms, such as weighted averaging or Bayesian aggregation formulas, the classification tree approach exploits high-order interactions among error patterns from different predictive systems. It creates a hybrid forecasting algorithm that out-performs any of the individual algorithms on which it is based. This tree-based approach to hybridizing forecasts provides a new, general way to combine and improve individual forecasts, whether or not they are based on HMM algorithms. The paper concludes with the results of validation tests. These show the power of HMM methods to forecast what individual customers are likely to do next. They also show the gain from classification tree post-processing of the predictions from lower-level forecasts. In essence, these techniques enhance the limited techniques available for new product forecasting.
© 2002 International Institute of Forecasters. Published by Elsevier Science B.V. All rights reserved.

Keywords: Classification; Markov models; Telecommunications; Combining forecasts; Market forecasting; State-space models; Transition probabilities

* Corresponding author. Tel.: +1-303-388-1778; fax: +1-303-388-0609.
E-mail addresses: tony@cox-associates.com (L.A. Cox Jr.), dpopken@systemsview.com (D.A. Popken).

1. Introduction

Telecommunications companies increasingly must forecast demand growth for products for which they have little or no historical data (Fildes, 2002). For example, many local exchange carriers (LECs) are offering new products such as high-speed internet access, wireless, and video services to create new revenue streams and to retain market share for their core


local access services. Several high-profile start-ups have launched competing offers in an effort to quickly establish and build share in the expanding markets for these services. In such settings, insufficient product sales data are available to make reliable forecasts of future sales using standard time series and market-forecasting techniques.

Yet, capital and expense planning, capacity forecasting, and marketing plans for the immediate term require useful sales forecasts for at least the next 12 to 24 months. Even experienced carriers are frequently surprised by the actual demand growth over this horizon. An enthusiastic customer response to a new marketing initiative (e.g. AT&T Wireless Services' introduction of the Digital One Rate price plan in 1998) can easily exceed anticipated growth and swamp existing and planned capacity. Alternatively, demand forecasts that over-estimate actual growth in demand can lead to excess, possibly stranded, capacity and create financial strain for the company that invested in it. The Iridium satellite network is a recent example.

A similar problem of short-run forecasting from limited historical data arises for current service providers who keep only a small amount of historical data (e.g. a year or less) on product sales available for analysis and prediction. While corporate data warehouses and data marts are starting to be deployed to solve this problem (Strouse, 1999), many companies still face the challenge of using only a few quarters' worth of sales data to forecast likely future sales, by individual product, for market areas and for individual customers. Marketing pressure to forecast what individual customers are likely to do and to identify which customers are most (and least) likely to respond to specific product offers also drives a need for improved micro-forecasting methods (Schober, 1999). Forecasting probable future product-purchasing behaviors down to the level of individual households or accounts with enough accuracy to be useful in target-marketing requires a fundamentally different approach to predictive modeling than the aggregate state-space and time series models that have long been used for high-level planning and budgeting.

This paper introduces a new, two-level approach to forecasting individual purchase behaviors to meet these needs. It addresses the need to forecast from relatively short histories by quantifying instantaneous transition rates of individuals among product-ownership states (Aoki, 1996), rather than trying to forecast aggregate product sales levels from their own past values and from values of other macro-level variables. The lower-level forecasting algorithms are based on estimating state-dependent transition rates from short-duration data. Small transition probabilities (per customer per month) do not impair the ability to estimate short-term transition rates by maximum-likelihood estimation from observed transitions per unit time, provided that population sizes are large enough so that multiple transitions (e.g. five or more) are observed in each period (Bhat, 1984, p. 140). This is the case in telecommunications applications with millions of customers. However, short observation sequences (typically less than a year) do imply that transition rates must be re-estimated frequently as new data become available to test for possible non-stationarity or seasonality. (In principle, if states truly carry all relevant information for making predictions, then state-dependent transition rates should be constant. However, this modeling idealization should be checked with data). In applications to telecommunications product purchases over 4 years, we have found that short-term transition rates estimated from only a few months of data can be used to forecast 1-year cumulative transitions with small error bands (less than 3% difference between estimated and observed cumulative transitions). This level of

predictive accuracy is achieved by choosing states carefully using the methods described in this paper.

The higher-level analysis treats the predictions from different algorithms as explanatory variables and observed transitions as response variables. A classification tree analysis (Breiman, Olshen, Friedman, & Stone, 1984; Biggs, De Ville, & Suen, 1991) then yields a classification tree that combines the predictions from the lower-level algorithms to create a new, hybrid prediction that out-performs any of the initial predictions.

For computational convenience, we use two levels of state variables to describe each individual at any moment: a macro-state that summarizes his current product profile and a micro-state that summarizes aspects of history (i.e. information about past states and times spent in them) and any other covariates that are found to significantly affect transition rates among macro-states.

This paper describes and illustrates the two-level forecasting approach using Hidden Markov Models (HMMs) as predictive algorithms. Individuals are treated as the units of analysis. Their transition rates among product-ownership states are response variables. The explanatory variables and covariates that affect individual transition rates are incorporated into state definitions.

Table 1
Raw data from which to predict customer demands for products
Customer ID MONTH CW ADD CF TWY VM TP CR CWID CID CC PCS
1 1 0 1 0 0 1 0 0 0 1 0 0
1 2 0 1 0 0 1 0 0 0 1 0 0
1 3 0 1 0 0 1 0 0 0 1 0 0
1 4 0 1 0 0 1 0 0 0 1 0 0
1 5 0 1 0 0 1 0 0 0 1 0 0
1 6 0 1 0 0 1 0 0 0 1 0 0
1 7 0 1 0 0 1 0 0 0 1 0 0
1 8 1 1 0 0 1 0 0 0 1 0 0
1 9 1 1 0 0 1 0 0 0 1 0 0
1 10 1 1 0 0 0 0 0 0 1 0 0
1 11 1 1 0 0 0 0 0 0 1 0 0
1 12 1 1 0 0 0 0 0 0 1 0 0
1 13 1 1 0 0 0 0 0 0 1 0 0
2 1 1 1 0 1 1 0 0 0 0 0 0
2 2 1 1 0 1 1 0 0 0 0 0 0
...
2 12 1 1 0 1 1 0 1 0 0 0 0
2 13 1 1 0 1 1 0 1 1 1 0 0
3 1 0 0 0 0 1 0 0 0 1 0 0
3 2 0 0 0 0 1 0 0 0 1 0 0
...
3 12 0 0 0 0 1 0 0 0 1 0 0
3 13 0 0 0 0 1 0 0 0 1 0 0
Key: CW = call waiting, ADD = additional line, CF = call forwarding, TWY = three-way calling, VM = voice messaging, TP = toll plan, CR = custom ring, CWID = call waiting ID, CID = caller ID, CC = custom choice, PCS = wireless.

The methodology is illustrated and tested by application to telecommunications customer data provided by the US WEST Communications Marketing Intelligence and Decision Support (MIDS) group. In the final section we present evidence for the value of the approach.

2. Data and basic methods of analysis

2.1. From raw data to state transition models

Table 1 shows example data for three customers and a selection of telecommunications products over a 13-month observation period. The products owned by each customer at the end of each month are recorded, with 1 meaning that the product was owned and 0 meaning that it was not. Thus, for example, customer 1 adds Call Waiting (CW) in month 8 and drops voice messaging (VM) in month 10. Customer 2 adds Call Rejection (CR) in month 12, Call Waiting ID (CWID) and Caller ID (CID) in month 13. Customer 3 had no product transitions over the period of observation. From such data, we wish to predict the change in total demand for each product (i.e. the aggregate of product adds and drops from approximately 11 million customers, by month) over the 24 months following the end of the observation period.

To create a forecasting model from these data, we adopt the framework of econometric state transition modeling (Lancaster, 1990). Initially, and somewhat naively, the entire product profile that a customer has at the end of a month (a row in Table 1) might tentatively be viewed as a single state. Adds and drops of products would then be viewed as transitions among states. All transition rates are non-negative. Quantifying the rates at which these transitions take place would provide a Markov model for predicting the probable distribution of customers among states over time, starting from any initial distribution, via the recursion:

E[X(t + 1) | X(t)] = A X(t)

where

X(t) = frequency distribution of customers among states at time t, represented as a column N-vector with one component for each of the N possible states. X_i(t) is the number of customers in state i at time t and E[X_i(t)] is its expected value. Equivalently, after dividing through by the population size (ignoring customer arrivals and departures for the moment), X_i(t) may be interpreted as the probability that a randomly selected customer at time t will be in state i.

A = N × N matrix of one-step transition rates (i.e. probability-per-month of making a transition from each possible state to each possible next state).
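To make the recursion concrete, the following is a minimal Python sketch of propagating a state distribution forward with a column-stochastic transition matrix. The 2-state matrix and starting distribution are invented for illustration; they are not estimates from the paper's data.

```python
import numpy as np

# Hypothetical N x N one-step transition matrix A whose columns sum to 1:
# A[i, j] = probability-per-month of moving from state j to state i.
A = np.array([[0.95, 0.02],
              [0.05, 0.98]])

# X(0): initial frequency distribution of customers among the N states.
x = np.array([0.60, 0.40])

# E[X(t+1) | X(t)] = A X(t); propagate the expected distribution 12 months.
for _ in range(12):
    x = A @ x

print(x)  # expected distribution of customers among states after a year
```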

Attractive though this Markov modeling approach may be in principle, there are at least two things wrong with it in practice. The first is that there are too many possible states: 2^n of them for n products, even if all variables are only binary. For the 11 products shown in Table 1, this gives 2048 states, making A a matrix of 2048 × 2048 = 4,194,304 elements, which is far too large to estimate comfortably from available data. The other problem is that product combinations may not contain all the relevant information needed to be a state in Markov modeling theory. The defining requirement for a state variable is that the probability density of the next (and hence all future) states must be conditionally independent of the past history of states (i.e. the initial state and the subsequent state trajectory leading from it to the present state), given the current state. Intuitively, the present state must make the future conditionally independent of the past. It is not especially plausible that the current set of products satisfies this requirement. Indeed, it seems likely that, if a customer has already tried a particular product and dropped it, then she may be less likely to add it again. So, a customer's current set of products may have to be augmented with some additional historical information to create a set of information that truly functions as a state variable for modeling purposes. This increases the combinatorial profusion of potential states even further. Finally, state variables need not be binary. For example, monthly minutes of use for services with usage-dependent costs (e.g. local toll products) and variables such as account age or time since last transition may be included among potential state variables. Therefore, a general-purpose method for dealing with large combinatorial sets and mixed variable types (e.g. nominal, ordinal, ordered-categorical, and continuous) is required.

2.2. Learning predictively useful state-transition models from data

Classification tree analysis (Breiman et al., 1984; Biggs et al., 1991) can be adapted to simultaneously address the problems of combinatorial explosion, mixed variable types, and conditional independence. Fig. 1 helps to illustrate the main ideas. The steps for obtaining a state transition model from product ownership history data using classification trees are as follows.

1. CREATE TRANSITION RATE INDICATORS. For each transition of interest (i.e. for each product and account add and drop), create a transition indicator variable indicating the rate (per customer per unit time) at which the specified transition took place over the period of observation. This variable is defined as: 1 for a customer who was eligible to make the specified transition in a time period and did so; 0 for a customer who was eligible to make the transition but did not; and missing if the customer was not eligible to make the transition then, e.g. because she has already made it or was not a customer then. (This definition allows for interval-censored data). As an example, the transition indicator variable for adding Call Waiting, denoted by CWA, has a value of 1 in month 8 for Customer 1 in Table 1. It has a value of 0 for all other months for all three of the customers in Table 1. (A code sketch of this step follows the list below.)
2. IDENTIFY STATE VARIABLES VIA CLASSIFICATION TREE ANALYSIS. Ideally, the response variable at the top (root) of a classification tree is conditionally independent of all variables not in the tree, given the values of variables within it. (Otherwise, the tree would continue to grow). This is precisely the property needed to search for potential

state variables. In practice, of course, limitations on sample sizes and classification tree algorithms make real trees approximations to this ideal; but, in our experience, they are very useful approximations. One may define states using classification trees as follows:
(a) Generate potential state variables by growing a classification tree for each transition rate indicator variable, using all product indicator variables as explanatory variables. The initial list of candidate state variables consists of product indicator variables that appear in at least one tree.
(b) Prune the set of candidate state variables by eliminating ones that are well-predicted by the rest or that are seldom used.
(c) Iteratively refine the remaining set of state variables by introducing other (e.g. historical) variables until no further improvements can be found.
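The following pandas sketch shows one way to implement Step 1 for the Add Call Waiting indicator (CWA), as promised in the list above. The toy data, column names, and the exact eligibility rule are our own illustrative assumptions, not the authors' code.

```python
import pandas as pd

# Table 1-style monthly ownership data (0/1 per product) for two customers.
df = pd.DataFrame({
    "Customer_ID": [1]*4 + [3]*4,
    "MONTH":       [6, 7, 8, 9, 1, 2, 3, 4],
    "CW":          [0, 0, 1, 1, 0, 0, 0, 0],
}).sort_values(["Customer_ID", "MONTH"])

prev = df.groupby("Customer_ID")["CW"].shift(1)   # ownership last month

# Eligible this month only if the customer did not have CW last month.
# CWA = 1 if the add occurred, 0 if eligible but no add, missing otherwise.
df["CWA"] = pd.NA
df.loc[prev.eq(0), "CWA"] = df.loc[prev.eq(0), "CW"]
print(df)
```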
Each of these steps is described next.

To initialize the process, each transition rate indicator in turn is treated as a response variable and a tree for it is grown using all of the product indicator variables in Table 1 as candidate explanatory variables. Thus, Table 1 is augmented with a column of 0/1/missing values for the transition rate indicator variable. This column (i.e. variable) is then split on the other variables in Table 1. Only variables that appear in one or more trees need be considered further.
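As a rough illustration of this initialization, the sketch below grows a single transition-rate tree with scikit-learn's CART-style learner standing in for the CHAID/CART algorithms cited above, then reads off the products the tree actually uses. The synthetic data frame is hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic Table 1-style rows with a 0/1 CWA indicator column
# (rows where CWA is missing, i.e. ineligible customers, are excluded).
data = pd.DataFrame({
    "VM":  [0, 0, 1, 1, 0, 1, 0, 1] * 50,
    "CF":  [0, 1, 0, 1, 1, 0, 1, 0] * 50,
    "ADD": [1, 0, 0, 1, 1, 0, 0, 1] * 50,
    "CID": [0, 1, 1, 0, 0, 1, 1, 0] * 50,
    "CWA": [1, 0, 0, 0, 1, 0, 0, 1] * 50,
})

products = ["VM", "CF", "ADD", "CID"]
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50)
tree.fit(data[products], data["CWA"])
print(export_text(tree, feature_names=products))

# Products appearing in at least one tree become candidate state variables.
candidates = {products[i] for i in tree.tree_.feature if i >= 0}
print(candidates)
```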

Fig. 1. A classification tree for predicting customers who will Add Call Waiting.

Fig. 1 shows a classification tree for the Add Call Waiting (CWA) transition rate variable, as an example, based on a sample of 97 098 customer records extracted for purposes of illustration. (This constitutes a simple random sample from a customer data base of over 10 million customers, each having a 1% probability of being selected). Of the 11 products in Table 1, only four are retained in the tree. Only customers who are eligible to add Call Waiting, i.e. who do not already have it, are considered. Already, several fragments of potential states can be discerned, as in Table 2.

Each row in Table 2 represents a 'partial state', i.e. a set of values for some of the explanatory variables, together with 'don't care' conditions (indicated by dashes) for the rest. These partial states suffice to make the CWA variable conditionally independent of the values of the 'don't care' variables and of the variables (e.g. TWY, TP, PCS, etc.) not shown in Fig. 1.

Table 2
Partial states derived from classification tree

Partial state   Frequency (in 97 098)   VM CF ADD CID   CWA
0               33 233                  0 0             0.008
1               17 275                  0 0 1           0.014
2               24 388                  0 1 0           0.004
3               10 576                  0 1 1           0.009
4                2218                   1 0 1           0.008
5                6726                   1 1 0           0.008
6                2682                   1 1 1           0.003

The tree in Fig. 1 predicts CWA from the current product configuration. To account for serial dependence among monthly observations, trees may also be grown with other candidate explanatory variables derived from those in Table 1. Examples of derived variables that have proved useful in several retail applications include: account age, products subscribed to when an account was first opened, 'ever had' and 'never had' indicators for each product in each period, products owned prior to the most

recent transition, and time since last transition. To include relevant history for accounts of different ages, it is also useful to create derived variables for the number of product purchases and the number of adds and drops of each product in the past 3 months, 4–6 months ago, 7–9 months ago, and so forth. When all of these variables are submitted to a tree analysis for the telecommunications products in Table 1, only account age, time since last transition, and selected current products turn out to be strong predictors of product add and drop rates. In other words, a time-varying (non-homogeneous) semi-Markov process well describes the transitions of individuals among product profiles over time. Had other information been available and turned out to be relevant, such as variables describing customer demographics and customer service events history, it would be included using the same tree-growing process.

Once a tree has been grown for each transition of interest, the set of all candidate state variables can be restricted to those that appear in at least one tree. The trees automatically discretize any continuous explanatory variables, such as account age, by partitioning them into intervals. The final set of information used to construct states is the Cartesian product of values of variables from the transition rate trees. For continuous and ordered categorical variables, possible values are discretized into a finite number of consecutive intervals, namely, the join of the intervals for those variables in individual transition trees. If the number of combinations of values for the predictor variables is sufficiently small (e.g. a few hundred), then no further processing is needed. Otherwise, the information in the trees may be further reduced to a smaller core set of facts that suffice to make future transitions approximately conditionally independent of information not in the core.

A complete state is defined as a combination of values for all of the state variables such that all possible transitions from it are conditionally independent of information not in the state definition. (Thus, 'don't care' conditions are removed by filling in all logically possible combinations for their values). This paper will focus on how to use complete states to make predictions. The final set of states should be closed under state transitions, i.e. each add or drop of a product (or account) that takes place from one state should move the process to another state. This is necessary to prevent a transition from carrying the process outside the range of the model.

A computationally practicable approximation to the ideal of a complete set of states closed under transitions may be constructed by partitioning the full set of products into 'core' products and 'peripheral' products. Only core products are used in the state definitions. A value is required for each core product in order to define a complete state. 'Don't care' conditions are allowed for peripheral products, thus reducing the number of combinations that must be considered explicitly. The set of states is thus closed only under additions and deletions of core products. However, the current state can still be used to predict the rates at which peripheral products (or entire accounts) are likely to be added or dropped, as well as to predict expected transition rates to other states. To make these predictions, transition rates are estimated by conditioning observed transition rates on the states from which the transitions occurred. This gives the MLE for state-dependent transition rates (Bhat, 1984).
We have found several different heuristics to be approximately equally effective for selecting products to include in the core, when effectiveness is assessed by ability to predict the next transition of a randomly selected customer (via lift curves, as discussed below). The simplest is to select the products that appear in the greatest number of transition rate trees. In effect, each transition rate tree, such as Fig. 1,

casts one vote for each product that appears in it. Core products are those with the most votes; other products are pruned. The number of products included in the core is determined by adding products until including further ones makes no significant improvement in lift. Then, products are dropped (i.e. pruned) from the core if doing so creates no significant deterioration in lift. Finally, pair-wise swaps of products in the core with products outside it are tried to see whether a further improvement in lift can be obtained. This greedy approach requires very little computational effort. A more CPU-intensive search procedure that yields similar results in the problems we have worked on is to evaluate each of the C(n, k) subsets of size k, for k = 1, 2, . . . , up to some small number (e.g. k = 7) for which it is practical to enumerate possibilities and then to choose the smallest subset with predictive power not significantly different from that of the best subset found.
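The voting heuristic and the greedy "add until lift stops improving" rule can be sketched as follows (only the forward add phase is shown; the drop and swap phases described above are analogous). The lift evaluator is a hypothetical stand-in for the cross-validated lift computation.

```python
# trees_products: the set of products used by each transition-rate tree.
# lift_eval: hypothetical function returning cross-validated lift for a core set.
def select_core(trees_products, lift_eval, min_gain=0.01):
    votes = {}
    for used in trees_products:              # each tree votes for its products
        for p in used:
            votes[p] = votes.get(p, 0) + 1
    ranked = sorted(votes, key=votes.get, reverse=True)

    core, best = [], 0.0
    for p in ranked:                         # add products by vote count
        lift = lift_eval(core + [p])
        if lift - best < min_gain:           # stop when improvement is negligible
            break
        core, best = core + [p], lift
    return core

core = select_core(
    [{"CW", "CID"}, {"CW", "VM"}, {"ADD", "CID", "CW"}],
    lambda prods: 1.0 + 0.3 * len(prods),    # toy lift function for illustration
)
print(core)
```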
Interestingly, traditional factor analysis (with Varimax or Quartimax rotation, for example) and scree plots often identify factors whose dominant components are the core products identified by other methods, even though the usual assumptions for factor analysis (multivariate linearity and normality) may not hold.

Table 3 shows an example of an initial set of complete states formed from the five core products ADD, CW, CF, VM, and CID. Column 1 is the state number, ranging from 0 = no core products to 31 = all core products. Column 2 gives the frequency distribution of the 2^5 = 32 logically possible states. From this frequency distribution, it is apparent that states 4–6, 20–23, 28, and 30 could be eliminated (i.e. replaced with a single 'other' state) with little loss of model accuracy, as these states are seldom used. Thus, once n core products have been identified and used to generate 2^n logically possible states, some of the possibilities may be pruned because they occur too seldom to make a significant contribution to predictions. Columns ADD to CID define the states. The entries in these columns give the average number of each product (measured in USOCs, i.e. Universal Service Order Codes) for customers in each state. A single household may have more than one copy of a product (i.e. more than one USOC) if, for example, it subscribes to two different lines with two different voice messaging or other USOCs. For this reason, some of the average numbers in Table 3 exceed 1 USOC per household. Rounding to 0 and 1 values gives the logical definition of each state, where we simplify by only considering whether each customer has each product and not how many USOCs of it he or she has. For example, a customer in state 19 has ADD, VM, and CID, but not CW or CF.

The remaining columns of Table 3 give the average number of USOCs for each of several peripheral products (TWY, TP, CR, etc.) for each state. These illustrate how states may be used to make predictions. For example, Table 3 predicts that a randomly sampled customer in state 13 is 10 times more likely to subscribe to three-way calling (TWY) than a randomly sampled customer in state 8 (1.0 vs. 0.1). Similarly, Table 4 provides a look-up table to predict the rates at which individual core products are expected to be added in each of the eight most frequent states. (Blanks indicate products that are already owned in the corresponding states, and that therefore cannot be added again). In this table, DISC is the rate of account disconnects. It is included to illustrate the use of a state look-up table in predicting account attrition or 'churn'. Table 5 shows the average transition rates (per customer per month) among the eight most frequent states. For example, starting from state 0, the most common transition is to 'Drop Acct.', i.e. account disconnect. The two most frequent transitions for customers who stay are to state 1 (adding Caller ID) and to state 9 (adding both Caller ID and Call Waiting).

Table 3
Initial states, frequencies, and look-up table for product penetrations
State % ADD CW CF VM CID TWY TP CR CWID CC NET
0 31.4 0.0 0.0 0.0 0.0 0.0 0.0 0.05 0.01 0.00 0.00 0.01
1 9.7 0.0 0.0 0.0 0.0 1.0 0.0 0.06 0.01 0.00 0.01 0.01
2 3.9 0.0 0.0 0.0 1.0 0.0 0.0 0.05 0.02 0.00 0.00 0.03
3 2.05 0.0 0.0 0.0 1.0 1.0 0.1 0.04 0.01 0.00 0.05 0.04
4 0.11 0.0 0.0 1.0 0.0 0.0 0.2 0.11 0.22 0.00 0.22 0.00
5 0.38 0.0 0.0 1.0 0.0 1.0 0.8 0.07 0.18 0.04 0.64 0.07
6 0.04 0.0 0.0 1.0 1.0 0.0 0.5 0.00 0.00 0.00 0.50 0.00
7 1.1 0.0 0.0 1.0 1.0 1.0 1.0 0.03 0.09 0.00 0.74 0.07
8 11.3 0.0 1.0 0.0 0.0 0.0 0.1 0.05 0.01 0.03 0.01 0.01
9 8.8 0.0 1.0 0.0 0.0 1.0 0.2 0.03 0.01 0.59 0.08 0.01
10 3.1 0.0 1.0 0.0 1.0 0.0 0.1 0.05 0.02 0.06 0.01 0.03
11 2.45 0.0 1.0 0.0 1.1 1.0 0.3 0.06 0.03 0.80 0.19 0.01
12 0.75 0.0 1.0 1.0 0.0 0.0 0.7 0.08 0.05 0.38 0.52 0.05
13 6.3 0.0 1.0 1.0 0.0 1.0 1.0 0.03 0.08 0.68 0.83 0.03
14 0.24 0.0 1.0 1.0 1.0 0.0 0.9 0.09 0.16 0.64 0.79 0.02
15 3.4 0.0 1.0 1.0 1.0 1.0 1.0 0.05 0.10 0.73 0.89 0.04
16 4.5 1.0 0.0 0.0 0.0 0.0 0.0 0.06 0.02 0.00 0.00 0.04
17 0.88 1.0 0.0 0.0 0.0 1.0 0.0 0.04 0.02 0.00 0.00 0.00
18 0.77 1.1 0.0 0.0 1.0 0.0 0.0 0.04 0.06 0.00 0.00 0.06
19 0.56 1.1 0.0 0.0 1.1 1.0 0.1 0.03 0.09 0.00 0.00 0.03
20 0.17 1.0 0.0 1.0 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.00
21 0.13 1.3 0.0 1.5 0.0 1.0 0.3 0.00 0.25 0.00 0.00 0.00
22 0.14 1.3 0.0 1.0 1.0 0.0 0.7 0.00 0.67 0.00 0.33 0.00
23 0.47 1.1 0.0 1.1 1.0 1.0 0.8 0.08 0.32 0.00 0.64 0.04
24 1.8 1.0 1.1 0.0 0.0 0.0 0.1 0.03 0.01 0.02 0.01 0.02
25 1.4 1.1 1.1 0.0 0.0 1.0 0.2 0.06 0.04 0.60 0.10 0.07
26 0.79 1.1 1.1 0.0 1.1 0.0 0.2 0.06 0.09 0.12 0.03 0.00
27 0.80 1.0 1.1 0.0 1.2 1.0 0.3 0.04 0.10 0.85 0.15 0.06
28 0.30 1.1 1.0 1.0 0.0 0.0 0.5 0.05 0.15 0.20 0.25 0.10
29 1.1 1.1 1.1 1.1 0.0 1.0 1.0 0.05 0.06 0.71 0.72 0.06
30 0.15 1.2 1.0 1.1 1.0 0.0 0.7 0.00 0.27 0.53 0.40 0.13
31 0.96 1.1 1.1 1.1 1.1 1.0 1.2 0.04 0.13 0.75 0.63 0.05

Table 4
State look-up table for product add and account disconnect rates
State   ADD add   CW add (CWA)   CF add   VM add   CID add   DISC
0 0.0018 0.0062 0.0018 0.0019 0.008 0.018
1 0.0011 0.01245 0.0029 0.0022 0.019
2 0.0009 0.0136 0.0018 0.011 0.025
8 0.0041 0.00815 0.0048 0.028 0.022
9 0.0032 0.012 0.0049 0.024
13 0.0028 0.0148 0.025
15 0.0053 0.036
16 0.0072 0.0024 0.0080 0.0087 0.023
All groups 0.0024 0.0081 0.0043 0.0043 0.013 0.021

Table 5
Part of a state-transition intensity matrix
State Drop acct. 0 1 2 8 9 13 15 16
0 1.8% 96.98% 0.34% 0.13% 0.10% 0.32% 0.13% 0.01% 0.17%
1 1.95% 0.43% 96.40% 0.00% 0.00% 0.97% 0.22% 0.04% 0.00%
2 2.5% 0.98% 0.09% 96.06% 0.18% 0.09% 0.00% 0.09% 0.00%
8 2.3% 0.46% 0.03% 0.03% 94.53% 2.04% 0.53% 0.12% 0.00%
9 2.5% 0.24% 0.71% 0.00% 0.40% 95.08% 1.07% 0.04% 0.00%
13 2.5% 0.28% 0.22% 0.00% 0.39% 0.11% 95.09% 1.38% 0.00%
15 3.0% 0.00% 0.10% 0.00% 0.00% 0.00% 0.31% 95.99% 0.00%
16 2.3% 0.08% 0.00% 0.08% 0.00% 0.00% 0.00% 0.00% 97.55%

Such information can help marketers design bundles of products to offer or promote together.

States can be used to forecast future product and account adds and drops for a population of customers. The state transition matrix (the 32 × 32 expansion of Table 5) yields direct predictions of how customers move among core states over time, via the equation:

E(X_{t+1} | X_t) = A X_t.

Here, X is the 32-component column vector giving the fraction of the population in each state. A denotes the 32 × 32 state transition matrix. This equation represents the fact that the number of customers in any state at the start of period t + 1 is the number there at the start of t plus the number that flowed in from other states minus the number that flowed out to other states. The elements of A quantify the rates of flow (i.e. expected transitions per customer per period) among the states. (If customers enter or leave states from outside the system, then corresponding terms are added to the right-hand side of the above equation, as in Cox, 2001). Dividing by the total size of the customer population to normalize gives the above equation.

Peripheral product add and drop rates and account disconnect rates are predicted from the core product state-vector via the equation:

E[D(t) | X(t)] = X^T(t) L.

Here, D(t) is an indicator vector. D_i(t) = 1 for a customer if transition i (e.g. an add or drop of a specific product) occurs, and 0 otherwise. L_{ij} = Pr(D_j = 1 | X_i(t) = 1). Table 4 shows a portion of L, the state look-up matrix mapping states to non-core transition probabilities. This equation simply says that the probability of a specific transition for a randomly selected customer is the sum over all states, s, of the conditional probability that the transition occurs if the customer is in state s (the sth component of L) times the probability that the customer is in state s (the sth component of X^T(t)).
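A numerical sketch of E[D(t) | X(t)] = X^T(t) L, using a few entries loosely transcribed from Tables 3 and 4 (only three states and three transitions are shown, so the state vector below deliberately does not sum to 1):

```python
import numpy as np

x = np.array([0.314, 0.097, 0.039])   # fractions in states 0, 1, 2 (cf. Table 3)
L = np.array([                        # rows: states; columns: transitions
    [0.0018, 0.0062, 0.018],          # state 0: ADD add, CW add, disconnect
    [0.0011, 0.0124, 0.019],          # state 1
    [0.0009, 0.0136, 0.025],          # state 2
])

# Probability of each transition for a randomly selected customer:
# sum over states of P(transition | state) * P(state).
print(x @ L)
```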
2.3. Lift charts for comparing and evaluating models

The above predictive modeling framework can be applied equally easily whether the states have been chosen well or badly. However, the quality of the predictions will vary depending on how well the defined states suffice to predict the future. Predictions from Table 3 were tested via cross-validation in samples of customers not included in the data used to construct it. The criterion for predictive usefulness was the 'lift' provided by the model in predicting which customers are most likely to undergo attrition, buy specific products next, etc.

'Lift' is a marketer's term. Suppose that we wish to predict the 1% of customers who are most likely to buy an Additional Line (ADD)

next month, based on the data in Table 1. Table 4 indicates that state 15 contains these customers and that, while randomly selecting 1% of the customers would yield an average of only 0.0024 Additional Line purchasers per sampled customer, selecting the customers from among those in state 15 more than doubles the expected yield, from 0.0024 to 0.0053. The ratio of 0.0053/0.0024 ≈ 2.21 is the estimated lift ratio achieved by using Table 4 compared to random selection of customers.

Repeating this calculation for many random samples of customers gives a frequency distribution of estimates for the lift ratio, called the cross-validation estimate. (In 100 cross-validation replicates, the mean lift ratio for this application is also slightly greater than 2). As long as the fraction of the population is less than about 3.4% (the % of customers in state 15, see Table 3), customers can be selected from state 15. Once all customers in state 15 have been selected, however, the next highest-yield state in Table 4 becomes state 8, with a lift ratio of only 0.0041/0.0024 ≈ 1.71. Continuing in this way creates an entire lift chart or lift curve, showing the lift obtained (compared to randomly selecting customers) by using states to predict the x% of the population that is most likely to add Additional Line in the next month, for all 0 ≤ x ≤ 100%. The lift curve consists of consecutive straight-line segments, with each segment corresponding to a specific state and with the slopes of the segments indicating the yield rate of the corresponding state for the transition being analyzed. One set of state definitions is predictively more useful than another for predicting a specific transition if it yields a higher lift curve for that transition.
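The construction of the lift curve can be sketched as follows; the third "everyone else" state is an invented catch-all so that the toy population is complete.

```python
# (fraction of population, yield rate) per state, cf. Table 4.
states = [
    (0.034, 0.0053),   # state 15
    (0.113, 0.0041),   # state 8
    (0.853, 0.0019),   # all remaining states, lumped (illustrative)
]
base_rate = sum(f * r for f, r in states)    # ~0.0023 purchasers per customer

def lift_at(x):
    """Lift from targeting the best x% of customers, states consumed greedily."""
    chosen, expected = 0.0, 0.0
    for frac, rate in sorted(states, key=lambda s: -s[1]):
        take = min(frac, x - chosen)         # consume the state until x% is reached
        expected += take * rate
        chosen += take
        if chosen >= x:
            break
    return (expected / x) / base_rate

print(lift_at(0.01), lift_at(0.10))
```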
In practice, lifts are usually only quantified at one or a few points, with x = 10% being the most common value. Table 6 shows average lifts for several products, evaluated at x = 10% and x = 50%, using the state look-up table methodology developed in this paper. All numbers are based on 100-fold cross-validation estimates. The numbers in the 'State-based' rows give the expected number of sales of each product among the x% of customers predicted to be most likely to purchase them, where all predictions are made using a state look-up table. The numbers in the 'Random' rows show the expected number of purchases among a random x% of the population. The ratios of state-based to random numbers (shown in the 'Lift' rows) indicate how successful the model is in identifying the x% of the population who are most likely to buy each product. For comparison, lifts achieved for these products and data using logistic regression, business rules, and neural net classifiers are typically in the range of 1.3–1.8 at x = 10%, well below the 2.2–5.2 range shown in Table 6.

Table 6
Average lifts for several products

                      ADD    CC       CF      CID      CR      CW      CWID
10%  State-based      73.9   580.80   253.24  478.13   58.33   227.02  348.06
     Random           29.3   123.19    67.69  213.28   11.14    92.12  100.61
     Lift at x = 10%  2.52   4.71     3.74    2.24     5.23    2.46    3.46
50%  State-based      202.0  1168.07  589.47  1283.46  100.39  571.77  872.98
     Random           146.4  615.98   338.44  1066.44   55.72  460.63  503.07
     Lift at x = 50%  1.39   1.90     1.74    1.20     1.80    1.24    1.74

Since the same set of states can be used to predict multiple transitions, choosing a best set of states may require reducing multiple criteria to one. We have used as evaluation criteria: (a) The average lift, evaluated at x = 5% (the size of population that might be addressed by targeted promotions or mail campaigns); this is the lift from use of state-based predictions, averaged over all transitions of interest (i.e. product and account adds and drops) weighted by their relative frequencies. (b) The maximin lift, which prescribes choosing state definitions to maximize the smallest lift obtained for any of the transitions of interest. Both criteria lead to the state definitions in Table 3.
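Both criteria reduce to a few lines once per-transition lifts and frequencies are available (the numbers below are invented):

```python
lifts = {"ADD": 2.5, "CWA": 2.2, "DISC": 1.8}     # lift at x = 5%, per transition
freqs = {"ADD": 0.30, "CWA": 0.50, "DISC": 0.20}  # relative transition frequencies

avg_lift = sum(lifts[t] * freqs[t] for t in lifts)   # criterion (a)
maximin_lift = min(lifts.values())                   # criterion (b)
print(avg_lift, maximin_lift)
```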
2.4. Refining state definitions: micro-states

As previously mentioned, to be completely satisfactory, a set of states should make the probable next transition for each customer conditionally independent of all other information, given the current state. As soon as a set of states has been proposed, therefore, its adequacy can be diagnosed by creating a new variable representing the 'Next State' and growing a tree for it. Ideally, only the 'Current State' should appear in the tree for 'Next State', i.e. the tree should have only one split. If this ideal is not achieved, then the additional information that helps to determine the next state can be incorporated into the state definitions, yielding a refined set of state definitions. Thus, splitting the next state variable on other variables helps to both assess and, if needed, improve the current state definitions.

Fig. 2 shows the results of this step for the states in Table 3. 'To' is the next-state variable identifying the next state. The code 1 (the top entry in each box) represents an account disconnect. 'From' is the current state variable. For brevity, only the most frequent states are shown, with the less frequent ones being lumped into a single 'other' category. This tree confirms that none of the products excluded from the core appears to offer useful information for predicting the probable next state. (The test data set used in Fig. 2 is disjoint from the training data set used to identify core products and states in Table 3). Thus, the choice of core products passes this diagnostic test. However, Fig. 2 suggests that some additional information should be added to the state definitions. Specifically, it reveals that time since last transition (TSLT) is an important predictor of the probable next state. Account age is also relevant for some states. Therefore, the 32 macro-states defined by logical combinations of the five core products must be augmented with TSLT and account age information to create an expanded set of states that makes the next state conditionally independent of all other information. Refining the macro-state definitions with these new items of information is easy: each leaf in the state-refinement tree in Fig. 2 represents one of the new states. In other words, each leaf corresponds uniquely and specifically to a combination of (a) one of the original macro-states; and (b) the additional information required to refine it so that the conditional independence criterion will be satisfied. The states defined by these leaves will be called 'micro-states'. Thus, the state-refinement tree-growing step provides a technique for refining the initial set of macro-states until no further improvement can be discovered.

We note that the term 'macro-state' is also often used to describe a population frequency distribution of individuals among micro-states (Aoki, 1996). Here, however, the forecasting horizon is short enough so that it is useful to treat state-dependent transition rates as being approximately constant, rather than modeling them as functions of the (slowly changing) distribution of individuals among states. Therefore, we focus on estimating the state-dependent rates and partition the descriptions of individual states into product-ownership (macro) and other (micro) information, such as the time spent in

the macro-state, without regard for the frequency distribution of individuals among states.

Fig. 2. Refining macro-states to obtain micro-states.

Our procedure for defining the states of a state-transition predictive model from historical data such as that in Table 1 can be summarized as follows.

SUMMARY OF STATE-DEFINITION PROCESS
1. Identify core products using any of the heuristics (e.g. classification tree voting) previously described. This step eliminates from the core any products that can be well predicted from the products in the core.
2. Create initial macro-states. Generate all logically possible combinations of the core products. Find the frequencies of these combinations and prune (or combine into an 'other' category) any combinations that occur too infrequently to significantly affect state-based predictions of transition rates. The surviving combinations are the initial macro-states.
3. Refine the initial macro-states by augmenting them (via a state-refinement tree-growing step that iteratively splits the next-state frequency distributions on other variables) with the information needed to make transition rates among them conditionally independent of the

data, given the augmented state definitions. These new states are called (core) micro-states.
4. Make predictions. The refined set of states can be used to create state look-up prediction tables (Table 3) for predicting probable next core states and for predicting probable product and account adds and drops from each core micro-state.
5. Evaluate predictions with lift charts. The average or minimax lift (or other criteria, depending on the decisions to be supported) from the state look-up prediction tables measures the utility of the defined states in predicting customer behaviors. If desired, this evaluation can be fed back to step 1 to guide the search for the most useful set of core products to be used in defining states. This iterative loop is relatively CPU-intensive compared to using classification-tree voting or other non-iterative heuristics to identify core products.
In applications to U S WEST Communications data, this procedure yielded lift ratios (evaluated at x = 10% of the population) of between 2 and 6 for most product adds. Most of these lift ratios were more than twice as great as the ones obtained from previous predictive models, which used logistic regression rather than a state transition framework to predict probabilities of customer behaviors.

3. Exploiting unobserved information: the HMM framework

The state-transition framework presented above offers potential advantages in predictive accuracy over simpler regression-based approaches. It does so by paying close attention to the information in historical observations that appears to be most useful for making predictions of interest, and discarding all other information. The information retained for use is organized into a relatively small number of discrete combinations of variable values (or ranges of values, for continuous variables), namely, the micro-states. Within this framework, predictions follow a simple canonical form: each transition of interest has a rate that is conditioned on the relevant state (and on nothing else). All predictions are made from these conditional rates.

A potentially valuable innovation is to extend the above approach to determine whether the predictive power of the state-transition framework can be improved by conditioning on unobserved information. The main idea is that transitions may depend on more than just the observed variables. For example, propensity to drop a product or account will typically depend on factors such as offers from competitors, advertising impacts, and recent service experiences, which are not captured in the raw data in Table 1. From an individual's observed pattern of transitions over time, one may be able to draw inferences about the existence and effects of such hidden variables. Inferred probabilities for the values of unobserved state variables can potentially help to improve the predictions made by conditioning on observed state variables alone. Hence, in principle, the main challenge is to use historical data to identify the probable values of unobserved state variables that affect the observed variables (specifically, transition rates), and then to use the joint distribution of the observed and unobserved components of the state variable to predict likely future transitions. The main technical tool used is Hidden Markov Model (HMM) technology, which provides constructive algorithms for identifying both observed and unobserved state variables and probabilities from observed data.

Suppose that observed customer behaviors (such as rates of adding and dropping products or accounts) are probabilistic functions of an unobserved underlying 'true' state. Define a matrix of conditional probabilities for observed behaviors given states as follows. If a customer's true state is s, the probability that she will exhibit observable response r is denoted by C_rs, a number between 0 and 1. Let C denote the

entire matrix of such conditional observable response probabilities: C = [C_rs]. Moreover, let X_t now indicate a customer's state at time t and let Y_t indicate the customer's observed behavior at time t. More precisely, X_t is a state indicator vector, having a 1 in exactly one of its positions (to indicate the state) and with zeros elsewhere. If there are N possible states, then X_t is a binary N-vector. In notation,

X_t ∈ {e_1, e_2, . . . , e_N};  e_r = (0, . . . , 1, 0, . . . , 0)'
(vector of 0s with a 1 in the rth position).

Similarly, Y_t is an indicator vector for observations, with

Y_t ∈ {f_1, f_2, . . . , f_M};  f_r = (0, . . . , 1, 0, . . . , 0)'
(vector of 0s with a 1 in the rth position).

Thus, if a customer's underlying state at time t is X_t, then the probability of observing each possible response for period t is given by the vector:

E(Y_{t+1} | X_t) = C X_t.

(The choice of t + 1, rather than t, as the time index for Y is a convention reflecting the common experience that data collected by the end of one month or quarter are usually compiled and reported, i.e. observed, only in the next month or quarter. Since Y_t denotes the observed behavior at time t, we adopt the convention that it is the actual behavior from period t − 1. The entire formulation works equally well if observations are made as behaviors occur, in which case E(Y_t | X_t) = C X_t. This is essentially the distinction between the Moore machine and Mealy machine representations of a finite-state machine and is incidental to subsequent results).

For readers not previously familiar with the indicator vector formulation of HMMs, it may be worth emphasizing that the linear representation E(Y_{t+1} | X_t) = C X_t is not an assumption about the form of the relationship between states and observations, e.g. stating that observations are linear transformations of state vectors. Rather, the linear representation follows without loss of generality as a consequence of the preceding notation and definitions. Any finite set of states and any finite set of observations such that the state determines the probability of each observation can be represented in this linear format by using indicator vectors rather than the actual state vectors and observation vectors.

Now, the product C X_t gives the expected value of the observation indicator variable for period t + 1. But the actual value will typically differ from the expected value. The difference, which may be loosely interpreted as an observation noise vector for period t, is the increment:

W_{t+1} = Y_{t+1} − C X_t.

This increment has zero mean, since E(W_{t+1}) = E(Y_{t+1} − C X_t) = C X_t − C X_t = 0. This and other properties of the observation noise are detailed in Appendix A.

A customer's underlying state may change over time. HMM techniques assume that these changes take place according to a Markov transition process. If A denotes the state transition matrix for this process, then the stochastic transitions are described by:

E(X_{t+1} | X_t) = A X_t.

Again, the gap between expected and actual values can be denoted by a noise term:

V_{t+1} = X_{t+1} − A X_t.

The entire (state dynamics plus observations) process can now be represented by stochastic processes X_t and Y_t with dynamics:

State transition equations: X_{t+1} = A X_t + V_{t+1};  t = 1, 2, . . . , T
Observation equations: Y_{t+1} = C X_t + W_{t+1};  t = 1, 2, . . . , T

A and C are matrices with Σ_i a_ij = Σ_i c_ij = 1. V_t and W_t are martingale increments whose properties are described in more detail in Appendix A. As already mentioned, these linear equations are not consequences of an assumed linear

relation between states and observations or assumed linear dynamics for the evolution of the state. Rather, they reflect the use of indicator variables in the formulation.
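These dynamics are easy to simulate: drawing the next state from column j of A and the observation from column j of C reproduces X_{t+1} = A X_t + V_{t+1} and Y_{t+1} = C X_t + W_{t+1} in expectation. The matrices below are the illustrative A and C introduced in the Voice Messaging / Home Office example that follows.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0.90, 0.01, 0.02, 0.01],     # A[i, j] = P(next = i | current = j);
              [0.03, 0.85, 0.09, 0.02],     # columns sum to 1
              [0.04, 0.01, 0.88, 0.02],
              [0.03, 0.13, 0.01, 0.95]])
C = np.array([[0.94, 0.02, 0.90, 0.03],     # row 0: no Voice Messaging observed
              [0.06, 0.98, 0.10, 0.97]])    # row 1: Voice Messaging observed

state, states, obs = 1, [], []              # start in state 2 (VM, no Home Office)
for _ in range(13):
    obs.append(rng.choice(2, p=C[:, state]))   # observation given current state
    state = rng.choice(4, p=A[:, state])       # Markov transition
    states.append(state)

print(states)
print(obs)
```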
To illustrate the framework, suppose that the data in the format of Table 1 constitute all observed customer-specific data. Then, the number of observable states, M, is equal to 2^n, where n is the number of products. To simplify, we will model only one product, Voice Messaging (n = 1), and one unobservable or hidden customer attribute, which might be interpreted as whether the customer operates a Home
Office. The state indicator vector, x, then has N = 4 elements to indicate the status of both Voice Messaging and Home Office. The observation indicator vector, y, requires only M = 2 elements to indicate the status of Voice Messaging. Let the elements of the state vector be enumerated as follows:

State 1 = No Voice Messaging, No Home Office
State 2 = Voice Messaging, No Home Office
State 3 = No Voice Messaging, Home Office
State 4 = Voice Messaging, Home Office

Initial estimates of the matrices, A and C, are as follows:

           1      2      3      4
A:   1   0.90   0.01   0.02   0.01
     2   0.03   0.85   0.09   0.02
     3   0.04   0.01   0.88   0.02
     4   0.03   0.13   0.01   0.95

               1      2      3      4
C:   no VM   0.94   0.02   0.90   0.03
     VM      0.06   0.98   0.10   0.97

Here the first row of C is the sum of the first and third rows of A, while the second row of C is the sum of the second and fourth rows of A. The observation data for Customer 1 (see Table 1) would be:

Y = [Y_1, Y_2, . . . , Y_T];  for Customer 1:

     no VM:  0 0 0 0 0 0 0 0 0 1 1 1 1
     VM:     1 1 1 1 1 1 1 1 1 0 0 0 0

(Voice Messaging is observed in months 1–9 and absent in months 10–13.)

3.1. Recursive estimation algorithms for HMM models

Given initial estimates of A, C, and an initial state probability distribution, how should estimates of A and C be updated based on the observation sequence Y_1, Y_2, . . . , Y_T?

An optimal estimation procedure developed by Elliott, Aggoun, and Moore (1995), generalizing ideas from Kalman filtering for LQG systems, computes time-varying recursive estimates for the state indicator probability distribution, state-to-state transition rates (A), state occupation times, number of transitions among states, and state-to-observation conditional probabilities (C). In essence, hidden variables are treated as missing data, and a version of the famous EM (expectation-maximization) algorithm is applied to obtain joint maximum-likelihood estimates of unobserved quantities and model parameters (Baum & Petrie, 1966). The (conditional pseudo log-)likelihood function is brought into a tractable form via a change of probability measure (calculated via a Radon–Nikodym derivative) that allows the observation indicator variables to be treated as independent random variables uniformly distributed over {f_1, f_2, . . . , f_M}. After the recursive equations have been solved to obtain approximate maximum-likelihood estimates for all quantities, an inverse change of measure (from the inverse Radon–Nikodym

derivative) is used to obtain estimates in terms of the original variables. The computational formulas corresponding to this approach are straightforward, as detailed in Elliott et al. (1995, cf. pp. 3–40) and summarized in Appendix A.
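As a rough stand-in for the Elliott et al. measure-change recursions (which we do not reproduce here), the sketch below performs one classical Baum-Welch EM re-estimation of A and C from a single observation sequence; both procedures return maximum-likelihood re-estimates, but the numbers will not match the recursive estimators exactly. The prior q0 and the observation sequence follow the Customer 1 example discussed next.

```python
import numpy as np

def baum_welch_step(A, C, q0, y):
    """One EM re-estimation of (A, C) from one observation sequence y.

    Column-stochastic convention: A[i, j] = P(next = i | current = j),
    C[r, j] = P(observe r | state = j). No scaling; fine for short sequences.
    """
    y = np.asarray(y)
    N, T = A.shape[0], len(y)
    alpha, beta = np.zeros((T, N)), np.zeros((T, N))

    alpha[0] = q0 * C[y[0]]
    for t in range(1, T):                      # forward pass
        alpha[t] = C[y[t]] * (A @ alpha[t - 1])
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):             # backward pass
        beta[t] = A.T @ (C[y[t + 1]] * beta[t + 1])

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)  # P(state at t | y)

    xi = np.zeros((N, N))                      # expected transition counts
    for t in range(T - 1):
        m = (C[y[t + 1]] * beta[t + 1])[:, None] * A * alpha[t][None, :]
        xi += m / m.sum()

    A_new = xi / gamma[:-1].sum(axis=0)        # normalize by time in each state
    C_new = np.array([gamma[y == r].sum(axis=0) for r in range(C.shape[0])])
    return A_new, C_new / gamma.sum(axis=0)

A0 = np.array([[0.90, 0.01, 0.02, 0.01],
               [0.03, 0.85, 0.09, 0.02],
               [0.04, 0.01, 0.88, 0.02],
               [0.03, 0.13, 0.01, 0.95]])
C0 = np.array([[0.94, 0.02, 0.90, 0.03],
               [0.06, 0.98, 0.10, 0.97]])
q0 = np.array([0.0, 0.4, 0.0, 0.6])  # VM known at t = 0; P(Home Office) = 0.6
y = [1] * 9 + [0] * 4                # Customer 1: VM observed, then dropped
print(np.round(baum_welch_step(A0, C0, q0, y)[0], 4))
```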
To illustrate the approach, suppose that in time period zero, Customer 1 has Voice Messaging and that it is estimated that there is a 60% chance that Customer 1 also has a Home Office. (At U S WEST, such estimates can be obtained from special-purpose survey data in which small samples of a few hundred customers are asked about what products they currently have, whether they have home offices, fax machines, on-line accounts, and so forth. The 60% probability estimate is the prior probability estimate needed to initialize the recursion). Using data Y and applying the Elliott et al. recursive estimators yields the following revised estimates Â, Ĉ of the initial matrices A and C:

           1        2        3        4
Â:   1   0.9463   0.0383   0.0239   0.0497
     2   0.0076   0.7925   0.0232   0.0153
     3   0.0383   0.0303   0.9502   0.0783
     4   0.0077   0.1389   0.0026   0.8567

               1        2        3        4
Ĉ:   no VM   0.9847   0.0021   0.9745   0.0044
     VM      0.0153   0.9979   0.0255   0.9956

Note that the updated state transition matrix, Â, has higher probabilities for transitions from Voice Messaging states to No Voice Messaging states than did A. This reflects Customer 1's observed transition from Voice Messaging to No Voice Messaging in period 10 (see Table 1).

4. Application to forecasting customer transitions

This section reports the results of applying the preceding HMM techniques to purchasing behaviors for the five core products defined in Section 2, i.e. Additional Line (ADD), Call Waiting (CW), Call Forwarding (CF), Voice Messaging (VM), and Caller ID (CID). The analysis includes all 32 possible combinations of these products as macro-states. Based on Fig. 2, the length of time that a customer has held a particular product combination (the Time Since Last Transition, i.e. TSLT in Fig. 2, defined as 1, 2, 3, or 4 or more quarters) is used to specify customer micro-states within macro-states. (Account age could also have been used, but TSLT makes a larger difference and suffices to illustrate the HMM approach). The data consisted of 13 months of observations for each customer, allowing us to obtain five quarterly values by using months 1, 4, 7, 10, and 13.

Each state is further partitioned into 'hidden' and 'revealed' components reflecting the censoring of the complete product history for each customer by the five-quarter window of observation. For each of the five quarters, the product combination macro-state is known with certainty. The state occupation time (TSLT) is known with certainty from the first observed transition on, but is censored (i.e. missing or hidden) at the beginning of the five quarters. These considerations lead to 4 × 2 = 8 micro-states (4 occupation times × hidden/revealed) for each product combination macro-state. Thus, for the four possible product combinations, there are 32 micro-states. A customer in a hidden micro-state can remain hidden by jumping to the next occupation time within that product combination, but a revealed customer stays revealed.

4.1. Prior estimates

Values of the prior matrix, Â_sr, are found as follows. For transitions from revealed states to revealed states, prior estimates are obtained directly from the historical customer data. They are merely the historical transition rates between observable (revealed)

states for all customers in the sample. Note that Table 1 data do not provide observable transitions until a customer changes product combinations.¹ For example, Customer 1 in Table 1 begins in a state that includes Additional Line, Voice Messaging, and Caller ID. However, the length of time spent in that product combination prior to month 1 is unknown. It is not until month 8 (quarter 4), when Customer 1 adds Call Waiting, that both product combination and occupation time are known with certainty.

¹ The exception is those customers who start out as new customers after quarter 1. These customers begin in the first occupation time micro-state for the observed starting product combination.

Some transitions may not be observed in the sample data. For example, the data are not of sufficient duration to include revealed transitions from micro-state 4 (four or more quarters in a product combination) to any other state. In these cases, an unconditional macro-state transition rate, ignoring the occupation time micro-state, may be used instead. The unconditional probability of transitioning from product-combination i to product-combination j in any given time period is estimated by the fraction of all observed transitions out of i that take place to j.²

² Cases where the denominator of this estimator was zero occasionally occur (no transitions from a given product combination to another). In that case we filled in the rth column of the prior matrix with zeros, with the exception of row s, which was set to 1.0.

Transitions from hidden states to revealed states are assumed to take place at the same rates as corresponding revealed transitions. Transitions from hidden states to hidden states are also assumed to take place at the same rates as the corresponding revealed transitions. In both cases, the logic is that customers don't know and don't care what is hidden (e.g. they have no way to know that records are sent to tape storage after 13 months), so this status should not affect their behaviors. Transition rates from revealed states to hidden states are all zero since a customer cannot become hidden once revealed.

Most customers start in a hidden state (the exception is new customers) where the product combination is observed, but not the occupation time (TSLT). The prior estimate of the state indicator probability vector, denoted by q0, must reflect an assumed or estimated marginal distribution of TSLT. A prior based on the geometric distribution is motivated by the fact that customers at the beginning of the observation window will be in an occupation time micro-state k if and only if they have been in that macro-state for at least k periods, and they happen to be observed at 'age' k.
tion rate, ignoring the occupation time micro- Initial values for C are needed in the compu-
state, may be used instead. The unconditional tations. The first N / 2 rows of the initial C
probability of transitioning from product-combi- matrix are identical to those in A (the first 16
nation i to product-combination j in any given rows in our example). The next N / 8 rows of C
time period is estimated by the fraction of all map the hidden micro-states to the observable
observed transitions out of i that take place to j product combination macro-states (four rows in
2
. our example). Therefore M 5 N / 2 1 N / 8. The
Transitions from hidden states to revealed values of the elements in the C matrix follow
states are assumed to take place at the same from the fact that each micro-state (i.e. macro-
rates as corresponding revealed transitions. state1duration information, TSLT) determines
Transitions from hidden states to hidden states a unique corresponding macro-state.
are also assumed to take place at the same rates
as the corresponding revealed transitions. In
both cases, the logic is that customers dont 4.2. HMM algorithm
know and dont care what is hidden (e.g. they
have no way to know that records are sent to The following algorithm utilizes estimated
prior matrices, A and C,
and a prior estimate,
q0 , of the state vector. It applies the HMM
1
The exception is those customers who start out as new algorithm to each individual customer in the
customers after quarter 1. These customers begin in the sample, starting with these common prior val-
first occupation time micro-state for the observed starting
ues. The updated state transition matrices, A,
product combination. resulting from calculations upon each customer,
2
Cases where the denominator of this estimator was zero
are aggregated by a weighted averages tech-
occasionally occur (no transitions from a given product
nique into an overall composite estimate, A.
combination to another). In that case we filled in the rth
column of with zeros, with the exception of row s, which Further details of the HMM algorithm and
was set to 1.0. estimators may be found in the Appendix.
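The following sketch is our own illustration, not the authors' code: it assembles the Section 4.1 prior inputs from data assumed to be laid out as per-customer lists of observed quarterly macro-state indices. The function names, the data layout, and the geometric parameter p are all illustrative assumptions.

import numpy as np

# Assemble the Section 4.1 priors. `histories` is assumed to map each
# customer to a list of observed macro-state indices, one per quarter.
# Columns of the result index the "from" state, so each column sums to 1,
# matching the column-stochastic convention used in the paper.

def macro_transition_prior(histories, n_states):
    counts = np.zeros((n_states, n_states))
    for states in histories:
        for r, s in zip(states[:-1], states[1:]):   # observed r -> s moves
            counts[s, r] += 1.0
    prior = np.zeros_like(counts)
    for r in range(n_states):
        total = counts[:, r].sum()
        if total > 0:
            prior[:, r] = counts[:, r] / total
        else:
            prior[r, r] = 1.0   # no observed exits from r (cf. footnote 2)
    return prior

def geometric_tslt_prior(p, n_times=4):
    # Geometric marginal over occupation-time micro-states k = 1..n_times:
    # weight proportional to (1 - p)**(k - 1); this parameterization is an
    # assumption, not taken from the paper.
    w = (1.0 - p) ** np.arange(n_times)
    return w / w.sum()

A flat prior for q̂0 (the HMM_Flat variant of Section 5) would simply replace the geometric weights with equal weights.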
Step 0. Set the current estimates $\hat{a}_{sr} = \hat{A}_{sr}$ (1 ≤ r ≤ N; 1 ≤ s ≤ N) and $\hat{c}_{sr} = \hat{C}_{sr}$ (1 ≤ r ≤ N; 1 ≤ s ≤ M), as well as the prior estimate, $\hat{q}_0$, of the state vector. Set the composite transition matrix, $\bar{A}$, to zeros. Set the total observation counts numObs(r) = 0 (1 ≤ r ≤ N).

For j = 1 to [the number of sample customers]:

{Step j1. Initialize the recursive estimator vectors for customer j:

$$\gamma_{0,0}(O_0^r) = 0_N \ (1 \leq r \leq N), \quad \gamma_{0,0}(J_0^{rs}) = 0_N \ (1 \leq r \leq N;\ 1 \leq s \leq N), \quad \hat{q}_0^j = \hat{q}_0, \quad \text{occupy}(r) = 0 \ (1 \leq r \leq N)$$

($\gamma_{0,0}(O_0^r)$ and $\gamma_{0,0}(J_0^{rs})$ are the estimator vectors at time 0 for the occupation time counts and the state-to-state transition counts. occupy is used in computing a weighted average matrix below.)

Step j2. For $t = 1, 2, \ldots, T_j$ (each observable time period for this customer), recursively update the estimators $\gamma_{t,t}(J_t^{rs})$, $\gamma_{t,t}(O_t^r)$, and $\hat{q}_t^j$.

Step j3. Update the final period, $T_j$, estimates of $\hat{A}$:

$$\hat{a}_{sr}(T_j) = \begin{cases} \gamma_{T_j}(J_{T_j}^{rs}) / \gamma_{T_j}(O_{T_j}^r) & \gamma_{T_j}(O_{T_j}^r) > 0 \\ 0_N & \text{otherwise} \end{cases}$$

$$\text{occupy}(r) = 1 \qquad r = 1, 2, \ldots, N;\ \gamma_{T_j}(O_{T_j}^r) > 0$$

(The ratios above, derived from an Expectation Maximization algorithm, provide final updates to each element of the customer-specific state transition matrix. A value of 1 for the variable occupy(r) indicates an occupation of state r by customer j with non-zero probability.)

Step j4. Add the updated estimates for the customer state-transition matrix to the composite state-transition matrix. For each column r = 1, 2, ..., N, perform the following updates:

$$\bar{a}_r = \bar{a}_r + \hat{a}_r(T_j)$$
$$\text{numObs}(r) = \text{numObs}(r) + \text{occupy}(r)\}$$

Final Step. Compute the final weighted average composite matrix by columns:

$$\bar{a}_r = \begin{cases} \bar{a}_r / \text{numObs}(r) & \text{numObs}(r) > 0 \\ \bar{a}_r & \text{otherwise} \end{cases} \qquad r = 1, 2, \ldots, N$$

(We divide the values in each column (state) of the composite state-transition matrix by the estimated number of instances of customers possibly in that state to obtain the weighted average values.)
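In code, the outer loop of this algorithm is short. The sketch below is our own illustration, assuming a helper smooth_customer that runs the per-customer recursions of Steps j1-j3 (one possible rendering of that helper appears in the Appendix) and returns the customer's updated matrix $\hat{A}_j$ together with the 0/1 occupy vector; the names and data layout are illustrative.

import numpy as np

# A sketch of Steps 0 through the Final Step, under the assumptions stated
# above; `customers` is an iterable of per-customer observation sequences.

def composite_transition_matrix(customers, A_hat, C_hat, q0, smooth_customer):
    N = A_hat.shape[0]
    A_bar = np.zeros((N, N))              # composite transition matrix
    num_obs = np.zeros(N)                 # total observation counts, numObs(r)

    for obs in customers:
        A_hat_j, occupy = smooth_customer(obs, A_hat, C_hat, q0)
        A_bar += A_hat_j                  # Step j4: accumulate column-wise
        num_obs += occupy

    for r in range(N):                    # Final Step: weighted column average
        if num_obs[r] > 0:
            A_bar[:, r] /= num_obs[r]
    return A_bar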
5. Results of HMM experiments

We predicted all possible product combinations of up to five products using 31 experiments, allocated as follows:

1 product, 2 product combinations, 16 states: 5 experiments
2 products, 4 product combinations, 32 states: 10 experiments
3 products, 8 product combinations, 64 states: 10 experiments
4 products, 16 product combinations, 128 states: 5 experiments
5 products, 32 product combinations, 256 states: 1 experiment

(Recall that having n products implies $2^n$ product combinations, with eight micro-states (four revealed and four hidden occupation times) for each. The number of possible experiments for n products is $\binom{5}{n}$; a quick check of these counts appears below.)
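The state and experiment counts above follow mechanically from the micro-state construction; the following few lines (our own check, not from the paper) reproduce them.

from math import comb

# states = (product combinations) x (4 occupation times) x (hidden/revealed)
for n in range(1, 6):
    print(n, 2 ** n, 2 ** n * 8, comb(5, n))
# prints: 1 2 16 5 | 2 4 32 10 | 3 8 64 10 | 4 16 128 5 | 5 32 256 1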
Table 7
Forecasting techniques evaluated

Technique    Description of technique                        Algorithm                     State space            A prior             q0
HMM_Flat     HMM, uniform prior for initial state, q0        HMM                           full                   historical          flat (uniform)
HMM_Best     HMM, geometric prior for q0                     HMM                           full                   historical          best (geometric)
NNLS         Non-negative Least Squares                      Non-negative Least Squares³   product-combinations   NA                  NA
AHAT         Forecast with prior matrix                      None                          full                   historical          flat (forecast only)
AHAT_Small   Forecast with prior matrix, macro-states only   None                          product-combinations   historical p_ij's   NA
HMM_Small    HMM, macro-states only                          HMM                           product-combinations   historical p_ij's   known

³ This technique requires at least as many periods of data as there are states. Therefore it could be used for only up to two products with the available data.

Table 8
Summary of results from HMM state-forecasting experiments (Mean Absolute Percent Errors averaged over all states and runs)

                      Number of products
Technique       1       2       3       4       5     Average
HMM_Flat      3.31    3.72    5.71    4.31   11.01     7.02
HMM_Best      2.70    3.18    5.28    4.23   11.20     6.65
NNLS          1.48    3.42     N/A     N/A     N/A     2.45
AHAT          2.65    4.20    6.56    4.91   13.24     7.89
AHAT_Small    1.18    2.02    4.05    3.66   14.79     6.42
HMM_Small     1.20    1.95    3.91    3.53   14.33     6.23
Average       2.25    3.42    4.75    4.11   13.91     7.11
Runs            5      10      10       5       1

Table 7 summarizes the techniques used to estimate a revised Â matrix in these experiments. We evaluated the performance of each technique on a second random sample of 4000 customers by using the Â matrix estimated by each technique from the first set of 4000 customers (the training set) to predict future states of the second set of 4000 customers (the test set), given their observed starting states. For the techniques utilizing hidden states (the first, second, and fourth in Table 7), the distributions of occupation times for period 1 were estimated from the associated q̂0. Predicted vs. actual transitions in the test set were compared via the Mean Absolute Percent Error (MAPE) of the state values for each period (quarter) and each product combination. MAPE is the absolute percentage difference between the predicted and actual fractions of the population in each macro-state, averaged over all macro-states and periods.
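A minimal sketch of this error measure, assuming predicted and actual population fractions arranged as (periods × macro-states) arrays (our own rendering, not the authors' code):

import numpy as np

def mape(predicted, actual):
    """MAPE averaged over all macro-states and periods.

    predicted, actual: arrays of shape (n_periods, n_macro_states) holding
    the predicted and actual fractions of the population in each macro-state.
    """
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    mask = actual != 0                      # skip empty macro-states
    errors = np.abs(predicted[mask] - actual[mask]) / actual[mask]
    return 100.0 * errors.mean()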
Table 8 summarizes the main experimental results. The MAPE increases for larger numbers of products (columns in Table 8) since we are
subdividing the same customer set into smaller groups. That is, there are fewer customers for each product combination, providing a smaller, less representative sample for transitions out of that product combination.

The HMM techniques utilizing hidden states, HMM_Flat and HMM_Best, have relatively high average error rates (although HMM_Best was superior to all others in six of the 31 experiments). Note that the two best techniques, HMM_Small and AHAT_Small, apply the HMM algorithm to a state space consisting only of the macro-states. HMM_Small was superior to all other techniques in 13 of 31 experiments. For each of the five product number levels, the MAPE was smaller for HMM_Small than for AHAT. The next best after HMM_Small was AHAT_Small, that is, using historical transition frequencies among macro-states as the basis for the forecast. It was best in seven of the 31 experiments. Thus, the HMM algorithm does improve predictions compared to what can be achieved using historical frequencies alone. There was no significant change in the relative ranking of the techniques as the number of products changed.

These average error rates show that HMM_Small is only slightly better than AHAT_Small. It is reasonable to hypothesize that knowing one might make the other redundant. However, as shown next, this is not the case. Each contributes independent information useful for forecasting, and the two together can yield better predictions than either alone. Indeed, three of the six predictors in Table 8 can be hybridized to yield an even better predictor.

Finally, although this evaluation has shown that HMM algorithms can improve on (and complement) other techniques, it appears that estimating hidden information adds significant predictive value only some of the time, at least in this application. More specifically, HMM_Best is the best single technique in only 20% of the computational experiments performed. As shown next, however, it does appear in the hybrid forecaster, reflecting the fact that it is not dominated by any other technique. Indeed, no single technique dominates the rest.

6. Hybridizing individual forecasting techniques

If a specific state or transition is to be predicted, then the predictions from different techniques can be combined to reduce forecast errors. Although combining forecasts from different sources has been much discussed in the forecasting literature, the following technique appears to be original.

First, each prediction source (or technique) is treated as a new variable, with a value for each customer, given the specific state or transition of interest. Namely, the value for a customer is the predicted probability that that customer will be in the specified state or will make the specified transition. Next, the true value of the predicted variable is recorded for each customer in a training set. Finally, the true value is treated as a response variable and a classification tree is grown using the predictions from different sources as the explanatory variables. The resulting prediction-combination tree shows how to combine the predictions from the different sources. Each of its leaves, corresponding to a combination of predictions from multiple sources, has associated with it a conditional frequency distribution for the true value of the predicted variable. This constitutes the hybrid prediction obtained by combining information from the different sources. The sources considered may include predictors for variables other than the variable to be predicted, in case multivariate associations among the variables can help to improve prediction for the one of interest. A minimal sketch of this combination step appears below.
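As a sketch of the combination step (ours, using scikit-learn's DecisionTreeClassifier as a generic stand-in for the tree-growing algorithm used in the paper; the names and the leaf-size setting are illustrative assumptions):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_combination_tree(technique_probs, y_true, min_leaf=200):
    """Grow a prediction-combination tree.

    technique_probs: (n_customers, n_techniques) array; each column is one
    technique's predicted probability of the target state or transition.
    y_true: the observed 0/1 outcome for each training-set customer.
    min_leaf: an assumed pruning parameter, not taken from the paper.
    """
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf)
    tree.fit(technique_probs, y_true)
    return tree

# The hybrid forecast for a customer is the conditional frequency of the true
# outcome in that customer's leaf, i.e. tree.predict_proba(X)[:, 1].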
Fig. 3 illustrates the process for a subset of customers for whom we wish to predict future penetration of Call Forwarding. The best single predictor automatically selected by the classification tree algorithm for the fraction of these 4000 customers who will have CF by the end of the forecast horizon is given by CF_HMMS, i.e. the Call Forwarding prediction from the HMM_Small technique. This might have been expected from Table 8. Interestingly, predictions of CW from the AHAT_Small and HMM_Best (HMMB) techniques, namely, CW_AHATS and CW_HMMB, are also included in the hybrid prediction represented by the tree. They are not redundant, given the predictions from HMM_Small, but rather add additional useful information that the hybrid prediction is able to exploit. Including them reduces forecasting error for CF.

Fig. 3. A prediction-combination tree for Call Forwarding.

As an example of the use of such a tree, suppose that one wanted to target a CF promotion or direct mail campaign to only those customers who have at least a 95% probability of owning CF by the end of the forecast horizon. Then, based on Fig. 3, one should select those customers for whom CF_HMMS = 0.994 (which Fig. 3 suggests may be a slight over-estimate, since CF_HMMS is not a perfectly calibrated predictor) or for whom CF_HMMS = 0.976 to 0.992 and for whom CW_HMMB ≥ 0.997.

7. Conclusions

This paper has shown that HMM algorithms provide smaller prediction errors of future product-combination macro-states (e.g. as measured by MAPEs or classification tree prediction error rates) than transition models based on historical transition rates alone. Even simple transition models, however, are more powerful (e.g. as measured by lift charts) than logistic regression and dynamic regression models for forecasting the population frequency distribution of product combination macro-states over time. Moreover, the different specific forecasting algorithms considered here complement each other, in that several of them appear in the forecast-combination tree (Fig. 3) used to obtain best and final forecasts for specific products.

This is the first paper that we know of that combines: (a) the digital signal processing algorithms of HMM with (b) classification tree techniques for extracting state definitions from data to better forecast individual behavior probabilities. We are currently extending the analytic approach and the computational experiments reported here to allow other causal drivers of individual behaviors (including commercially available demographic information, competitor intelligence, and advertising and promotion schedules) to be incorporated into the state-transition forecasting framework.

Appendix A. Derivation of HMM estimates

The state-space formulation via Elliott et al. (1995) describes HMM models in terms of
stochastic processes $X_t$ and $Y_t$ with dynamics:

State transition equations: $X_{t+1} = A X_t + V_{t+1}$; $t = 1, 2, \ldots, T$

Observation equations: $Y_{t+1} = C X_t + W_{t+1}$; $t = 1, 2, \ldots, T$

A and C are matrices of transition probabilities, such that $\sum_i a_{ij} = \sum_i c_{ij} = 1$. The vectors $X_t$ are the state indicator vectors, while the vectors $Y_t$ are the observation indicator vectors, formed as follows:

$X_t \in S_X = \{e_1, e_2, \ldots, e_N\}$; $e_r = (0, \ldots, 1, 0, \ldots, 0)$ (vector of 0s with 1 in rth position)

$Y_t \in S_Y = \{f_1, f_2, \ldots, f_M\}$; $f_r = (0, \ldots, 1, 0, \ldots, 0)$ (vector of 0s with 1 in rth position).

Define $0_N$ as a vector of 0s of length N. $V_{t+1}$ and $W_{t+1}$ are driving noise and measurement noise vectors in the form of Martingale increments satisfying:

$$E[V_{t+1}] = 0_N, \qquad E[W_{t+1}] = 0_N$$
$$E[V_{t+1} V_{t+1}' \mid X_t] = \mathrm{diag}(A X_t) - A\, \mathrm{diag}(X_t)\, A'$$
$$E[W_{t+1} W_{t+1}' \mid X_t] = \mathrm{diag}(C X_t) - C\, \mathrm{diag}(X_t)\, C'$$

where diag(z) denotes the diagonal matrix with vector z on its diagonal. Note that a stochastic process $\{Z_n, n \geq 1\}$ is said to be a Martingale process if $E[|Z_n|] < \infty$ for all n and $E[Z_{n+1} \mid Z_1, Z_2, \ldots, Z_n] = Z_n$ (Ross, 1983).

The following vector processes are defined:

$J_t^{rs}$ = the number of jumps from state $e_r$ to state $e_s$ up to time t (state-to-state transitions)

$O_{t+1}^r$ = the number of occasions up to time t for which the process has been in state $e_r$ (occupation time)

$T_t^{rs}$ = the number of times up to time t that the observation process is in state $f_s$ given the process at the preceding time is in state $e_r$ (state-to-observation transitions).

$\hat{q}_t$ = the un-normalized conditional probability distribution for state r at time t.

The derivation employs a discrete time change of probability measure from the probability space, $P$, of the processes defined above, to a computational probability space, $\bar{P}$. Under $\bar{P}$ define:

$\gamma_t(H_t)$ = the un-normalized expectation under $\bar{P}$ of the random variable (vector process) $H_t$.

Now let:

$c_j \equiv$ the jth column of C

$a_j \equiv$ the jth column of A, and

$$c_s(Y_t) = M \prod_{r=1}^{M} c_{rs}^{Y_t^r}$$

where A is $N \times N$ and C is $M \times N$. Define $1_N$ as a vector of 1s of length N.

We can now define $\gamma_t(J_t^{rs})$, $\gamma_t(O_t^r)$, $\gamma_t(T_t^{rs})$ via the recursive functions:

$$\gamma_{0,0}(J_0^{rs}) = 0_N$$

$$\gamma_{t+1,t+1}(J_{t+1}^{rs}) = \sum_{j=1}^{N} c_j(Y_{t+1})\,(\gamma_{t,t}(J_t^{rs})' e_j)\, a_j + c_r(Y_{t+1})\,(\hat{q}_t' e_r)\, a_{sr}\, e_s, \qquad t = 0, 1, 2, \ldots, T-1$$

$$\gamma_{t+1}(J_{t+1}^{rs}) = 1_N \cdot \gamma_{t+1,t+1}(J_{t+1}^{rs}),$$

$$\gamma_{0,0}(O_0^r) = 0_N$$
$$\gamma_{t+1,t+1}(O_{t+1}^r) = \sum_{j=1}^{N} c_j(Y_{t+1})\,(\gamma_{t,t}(O_t^r)' e_j)\, a_j + c_r(Y_{t+1})\,(\hat{q}_t' e_r)\, a_r, \qquad t = 0, 1, 2, \ldots, T-1$$

$$\gamma_{t+1}(O_{t+1}^r) = 1_N \cdot \gamma_{t+1,t+1}(O_{t+1}^r),$$

$$\gamma_{0,0}(T_0^{rs}) = 0_N$$

$$\gamma_{t+1,t+1}(T_{t+1}^{rs}) = \sum_{j=1}^{N} c_j(Y_{t+1})\,(\gamma_{t,t}(T_t^{rs})' e_j)\, a_j + M\,(\hat{q}_t' e_r)(Y_{t+1}' f_s)\, c_{sr}\, a_r, \qquad t = 0, 1, 2, \ldots, T-1$$

$$\gamma_t(T_{t+1}^{rs}) = 1_N \cdot \gamma_{t+1,t+1}(T_{t+1}^{rs})$$

where

$$\hat{q}_{t+1} = \sum_{j=1}^{N} c_j(Y_{t+1})\,(\hat{q}_t' e_j)\, a_j; \qquad \hat{q}_t = (\hat{q}_t(e_1), \ldots, \hat{q}_t(e_N))', \qquad t = 0, 1, 2, \ldots, T-1$$

The $\hat{q}_t(e_r)$ are the un-normalized conditional probability distribution for state r at time t. The normalized estimates are:

$$p_t(e_r) = \frac{\hat{q}_t(e_r)}{\sum_{r=1}^{N} \hat{q}_t(e_r)}$$

It can be shown that by using the above estimators in an expectation maximization algorithm we can obtain revised estimates for the parameters of A and C as follows:

$$\hat{a}_{sr}(t) = \frac{\gamma_t(J_t^{rs})}{\gamma_t(O_t^r)} \qquad 1 \leq r \leq N;\ 1 \leq s \leq N$$

$$\hat{c}_{sr}(t) = \frac{\gamma_t(T_t^{rs})}{\gamma_t(O_t^r)} \qquad 1 \leq r \leq N;\ 1 \leq s \leq M-1$$

$$\hat{c}_{Mr}(t) = 1 - \sum_{s=1}^{M-1} \hat{c}_{sr}(t) \qquad 1 \leq r \leq N$$
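For concreteness, the sketch below is our own rendering of the recursions and the EM re-estimate of A in code, not the authors' implementation; it could serve as the per-customer smooth_customer helper assumed in the Section 4.2 sketch. Here A is column-stochastic with A[s, r] the probability of moving from state r to state s, C[m, r] is the probability of observing symbol m in state r, q0 is the un-normalized prior state vector, and obs is a sequence of observed symbol indices.

import numpy as np

def hmm_em_update(A, C, q0, obs):
    N, M = A.shape[0], C.shape[0]
    q = q0.copy()                       # un-normalized state estimate, q_hat
    J = np.zeros((N, N, N))             # gamma_{t,t}(J^{rs}) vectors
    O = np.zeros((N, N))                # gamma_{t,t}(O^r) vectors

    for y in obs:
        c = M * C[y, :]                 # c_j(Y_{t+1}) = M * c_{y j}
        J_new = np.einsum('j,rsj,ij->rsi', c, J, A)
        O_new = np.einsum('j,rj,ij->ri', c, O, A)
        for r in range(N):
            O_new[r] += c[r] * q[r] * A[:, r]            # occupation increment
            for s in range(N):
                J_new[r, s, s] += c[r] * q[r] * A[s, r]  # jump increment
        J, O = J_new, O_new
        q = A @ (c * q)                 # q_{t+1} = sum_j c_j (q' e_j) a_j

    A_hat = np.zeros_like(A)            # EM update: a_sr = 1'J^{rs} / 1'O^r
    occupy = np.zeros(N)
    for r in range(N):
        denom = O[r].sum()
        if denom > 0:
            occupy[r] = 1.0
            A_hat[:, r] = J[r].sum(axis=1) / denom
    return A_hat, occupy, q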
References

Aoki, M. (1996). New Approaches to Macroeconomic Modeling. New York: Cambridge University Press.
Baum, L. E., & Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat., 37, 1554-1563.
Bhat, U. N. (1984). Elements of Applied Stochastic Processes, 2nd ed. New York: Wiley.
Biggs, D., De Ville, B., & Suen, E. (1991). A method of choosing multiway partitions for classification and decision trees. J. Appl. Stat., 18(1), 49-62.
Breiman, L., Olshen, R., Friedman, J., & Stone, C. (1984). Classification and Regression Trees. Wadsworth Publishing.
Cox, Jr. L. A. (2001). Forecasting demand for telecommunications products from cross-sectional data. Telecommun. Sys., 16(3/4), 437-454.
Elliott, R. J., Aggoun, L., & Moore, J. B. (1995). Hidden Markov Models: Estimation and Control. New York: Springer Verlag.
Fildes, R. (2002). Telecommunications demand forecasting: a review. International Journal of Forecasting, 18(4), 489-522.
Lancaster, T. (1990). The Econometric Analysis of Transition Data. New York: Cambridge University Press.
Ross, S. M. (1983). Stochastic Processes. New York: Wiley.
Schober, D. (1999). Data detectives: What makes customers tick? Telephony, 237(9), 21-24.
Strouse, K. A. (1999). Weapons of mass marketing. Telephony, 237(9), 26-28.

Biographies: Tony COX is President of Cox Associates, an independent applied research company specializing in data mining, operations research modeling and network optimization for telecommunications companies. Cox Associates' scientists develop and apply machine-learning algorithms, computer simulation and optimization models, statistical and epidemiological risk analysis methods, and decision analysis models to improve business and engineering decisions. Dr. Cox is on the Faculty of the Center for Computational Mathematics at the University of Colorado at Denver, where he is Honorary Full Professor of Mathematics. He has a Ph.D. in Risk Analysis (1986) and an S.M. in Operations Research (1985), both from M.I.T.'s Department of Electrical Engineering and Computer Science.
Douglas POPKEN provides management science consulting and software development through his company, Systems View. Specialty areas include probabilistic modeling techniques, simulation, and logistics. He received his Ph.D. in Operations Research from UC Berkeley (1988) and M. Engr. and B.S. degrees in Operations Research from Cornell University.
