James Max Kanter
CSAIL, MIT
Cambridge, MA 02139
kanter@mit.edu

Kalyan Veeramachaneni
CSAIL, MIT
Cambridge, MA 02139
kalyan@csail.mit.edu
Abstract—In this paper, we develop the Data Science Machine, which is able to derive predictive models from raw data automatically. To achieve this automation, we first propose and develop the Deep Feature Synthesis algorithm for automatically generating features for relational datasets. The algorithm follows relationships in the data to a base field, and then sequentially applies mathematical functions along that path to create the final feature. Second, we implement a generalizable machine learning pipeline and tune it using a novel Gaussian Copula Process-based approach. We entered the Data Science Machine in 3 data science competitions that featured 906 other data science teams. Our approach beats 615 teams in these competitions. In 2 of the 3 competitions we beat a majority of competitors, and in the third, we achieved 94% of the best competitor's score. In the best case, with an ongoing competition, we beat 85.6% of the teams and achieved 95.7% of the top submission's score.
I. INTRODUCTION
Fig. 1. A typical data science endeavor. It starts with an analyst positing a question: could we predict whether x or y is correlated with z? These questions are usually defined based on some need of the business or researcher holding the data. Second, given a prediction problem, a data engineer posits explanatory variables and writes scripts to extract those variables. Given these variables/features and representations, the machine learning researcher builds models for the given predictive goals and iterates over different modeling techniques, making choices within the space of modeling approaches to identify the best generalizable approach for the prediction problem at hand. A data scientist can carry out the entire process: positing the question, forming variables, and building, iterating on, and validating the models.
(Figure: an example e-commerce relational schema with the entities Customers, Orders, Products, and OrderProducts.)
how much does the total order price vary for the customer? Or: does this customer typically buy luxurious or economical products? These questions can be turned into features by following relationships, aggregating values, and calculating new features. Our goal is to design an algorithm capable of generating the calculations that result in these types of features, or in quantities that can act as proxies for them.
Fig. 3. Illustration of computational constraints when synthesizing each feature type. Both rfeat and dfeat features can be synthesized independently, while
efeat features depend on both rfeat and dfeat features. One instantiation of an approach for Deep Feature Synthesis is given in Algorithm 1.
a specific entry as x^k_{i,j}, which is the value of feature j for the i-th instance of the k-th entity.
Next, given entities, their data tables, and relationships, we define a number of mathematical functions that are applied at two different levels: the entity level and the relational level. We present these functions below. Consider an entity E^k for which we are assembling the features. For notational convenience, we drop the k used to denote a specific entity.
The first set of features is calculated by considering the
features and their values in the table corresponding to the entity
k alone. These are called entity features, and we describe them
below.
Entity features (efeat): Entity features compute a value for each entry x_{i,j}. These features can be based on a computation function applied element-wise to the array x_{:,j}. Examples include functions that translate an existing feature in an entity table into another type of value, such as conversion of a categorical string data type to a pre-decided unique numeric value, or rounding of a numerical value. Other examples include translation of a timestamp into 4 distinct features: weekday (1-7), day of the month (1-30/31), month of the year (1-12), and hour of the day (1-24).
These features can also be derived by applying a function to the entire set of values of the j-th feature, x_{:,j}, together with the index i, given by:

x_{i,j'} = efeat(x_{:,j}, i).    (1)
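As an illustration, here is a minimal Python sketch of both kinds of efeat; the function names and the percentile-rank example are ours, not part of the system:

```python
import datetime

def timestamp_efeats(ts):
    """Element-wise efeat: expand one timestamp into 4 distinct features."""
    return {
        "weekday": ts.isoweekday(),  # 1-7
        "day": ts.day,               # 1-31
        "month": ts.month,           # 1-12
        "hour": ts.hour,             # 0-23
    }

def rank_efeat(column, i):
    """Column-level efeat in the sense of Eq. (1): a value for entry i computed
    from the whole column x_{:,j}; here, the fraction of entries below it."""
    return sum(1 for v in column if v < column[i]) / len(column)

print(timestamp_efeats(datetime.datetime(2015, 5, 18, 14, 30)))
print(rank_efeat([100.0, 200.0, 200.0, 300.0], 3))  # 0.75
```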
(Figure 4 contents: Customers, Orders, OrderProducts, and Products tables; a dfeat column Product.Price, the per-order rfeat SUM(Product.Price), and the depth-2 feature AVG(Orders.SUM(Product.Price)) per customer. The aggregation functions shown are AVG, SUM, MAX, MIN, and STD; the Price column serves as the base column.)
Fig. 4. An example of a feature that could be generated by Deep Feature Synthesis. The illustration shows how features are calculated at different depths, d,
by traversing relationships between entities.
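Using toy tables (the values are illustrative, not the figure's), this depth-2 feature can be sketched in a few lines of Python:

```python
from collections import defaultdict

price = {1: 100.0, 2: 200.0, 3: 300.0}                    # Products
order_products = [(1, 1), (1, 2), (2, 2), (2, 3), (3, 1)] # (OrderID, ProductID)
order_customer = {1: 1, 2: 1, 3: 2}                       # Orders -> Customers

# Depth 1: SUM(Product.Price) per order (a dfeat brings Price over the
# relationship; an rfeat sums it up to the Orders level).
order_total = defaultdict(float)
for oid, pid in order_products:
    order_total[oid] += price[pid]

# Depth 2: AVG(Orders.SUM(Product.Price)) per customer (an rfeat over Orders).
totals = defaultdict(list)
for oid, cid in order_customer.items():
    totals[cid].append(order_total[oid])
customer_feat = {cid: sum(v) / len(v) for cid, v in totals.items()}

print(customer_feat)  # {1: 400.0, 2: 100.0}
```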
1:  function MAKE_FEATURES(E^i, E^{1:K}, E_V)
2:      E_V = E_V ∪ E^i
3:      E_B = BACKWARD(E^i, E^{1:K})
4:      E_F = FORWARD(E^i, E^{1:K})
5:      for E^j ∈ E_B do
6:          MAKE_FEATURES(E^j, E^{1:K}, E_V)
7:          F^i = F^i ∪ RFEAT(E^i, E^j)
8:      for E^j ∈ E_F do
9:          if E^j ∈ E_V then
10:             CONTINUE
11:         MAKE_FEATURES(E^j, E^{1:K}, E_V)
12:         F^i = F^i ∪ DFEAT(E^i, E^j)
13:     F^i = F^i ∪ EFEAT(E^i)
The algorithm pseudocode for MAKE_FEATURES is presented above to make features, F^i, for the i-th entity. The organization of recursive calls and the calculation of each feature type are in accordance with the constraints explained above. The RFEAT, DFEAT, and EFEAT functions in the pseudocode are responsible for synthesizing their respective feature types based on the provided input. The algorithm stores and returns information to assist with later uses of the synthesized features. This information includes not only feature values, but also metadata about the base features and the functions that were applied.
In our algorithm, we choose to calculate rfeat features before dfeat features, even though there are no constraints between them. Additionally, E_V keeps track of the entities that we have "visited". On line 10, we make sure not to include features from entities that have already been visited, which prevents infinite recursion over cyclic relationships.
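The recursion can be sketched in Python (our skeleton; RFEAT, DFEAT, and EFEAT are stubbed out to merely record which feature types would be synthesized for which entity):

```python
def make_features(entity, backward, forward, visited, features):
    """entity: a name; backward/forward: dicts mapping an entity to its
    backward- and forward-related entities; features: dict of results."""
    visited.add(entity)
    # rfeat features first: aggregations over backward relations.
    for child in backward.get(entity, []):
        make_features(child, backward, forward, visited, features)
        features.setdefault(entity, []).append(("rfeat", child))
    # dfeat features: direct features over forward relations, skipping
    # entities already visited (the CONTINUE on line 10).
    for parent in forward.get(entity, []):
        if parent in visited:
            continue
        make_features(parent, backward, forward, visited, features)
        features.setdefault(entity, []).append(("dfeat", parent))
    # efeat features last, once related features are in place.
    features.setdefault(entity, []).append(("efeat", entity))

# Example: Customers has backward relation Orders; Orders points forward
# to Customers (already visited) and Products.
backward = {"Customers": ["Orders"]}
forward = {"Orders": ["Customers", "Products"]}
features = {}
make_features("Customers", backward, forward, set(), features)
print(features["Customers"])  # [('rfeat', 'Orders'), ('efeat', 'Customers')]
```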
Expanding the recurrence for z_i, the number of features synthesized at depth i, we get

z_i = O(z_{i-2} · p^2 + q · p + q).

Continuing the expansion till z_0 we get

z_i = O(z_0 · p^i + q · p^{i-1} + q · p^{i-2} + ... + q).

Replacing z_0 = O(e · j) = q in the above equation we get

z_i = O(q · p^i + q · p^{i-1} + ... + q) = O( q · Σ_{u=0}^{i} p^u ) = O( (r · m + n) · Σ_{u=0}^{i} (e + 1)^u ).
TABLE I. SUMMARY OF PARAMETERS IN THE MACHINE LEARNING PIPELINE AND THE OPTIMIZED VALUES FROM RUNNING GCP BY DATASET.

Param   Default   Range       Function
k       1         [1, 6]      The number of clusters to make
nc      100       [10, 500]   The number of SVD dimensions
-       100       [10, 100]   The percentage of top features selected
rr      1         [1, 10]     The ratio to re-weight underrepresented classes
n       200       [50, 500]   The number of decision trees in a random forest
md      None      [1, 20]     The maximum depth of the decision trees
-       50        [1, 100]    The maximum percentage of features used in decision trees
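As a sketch, the defaults and ranges above can be captured in a small config; `pct_top` and `max_feat_pct` are placeholder names of our own for the two parameters whose short names are not listed here:

```python
# Table I as a config: (default, allowed range) per tunable parameter.
# "pct_top" and "max_feat_pct" are our placeholder names, not the paper's.
PARAMS = {
    "k":            (1,    (1, 6)),     # number of clusters to make
    "nc":           (100,  (10, 500)),  # number of SVD dimensions
    "pct_top":      (100,  (10, 100)),  # % of top features selected
    "rr":           (1,    (1, 10)),    # re-weight ratio for rare classes
    "n":            (200,  (50, 500)),  # decision trees in the random forest
    "md":           (None, (1, 20)),    # max tree depth (None = unlimited)
    "max_feat_pct": (50,   (1, 100)),   # max % of features per tree
}

def clip_to_range(name, value):
    """Clamp a proposed parameter value into its allowed range."""
    lo, hi = PARAMS[name][1]
    return max(lo, min(hi, value))

print(clip_to_range("n", 9999))  # 500
print(clip_to_range("k", 0))     # 1
```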
Many stages of the machine learning pipeline have parameters that could be tuned, and this tuning may have a noticeable impact on model performance. A naive grid search over all possible combinations of parameter values would have to explore 6 × 490 × 90 × 10 × 450 × 20 × 100 = 2,381,400,000,000 (two trillion, three hundred eighty-one billion, four hundred million) possibilities.
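The count can be checked directly; the per-parameter grid sizes below are those implied by the ranges in Table I:

```python
import math

# Number of grid points per parameter, in Table I order.
grid_sizes = [6, 490, 90, 10, 450, 20, 100]
total = math.prod(grid_sizes)
print(f"{total:,}")  # 2,381,400,000,000
```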
To aid in the exploration of this space, we use a Gaussian
Copula Process (GCP). GCP is used to model the relationship
f between parameter choices and the performance of the whole
pathway (Model). Then we sample new parameters and predict
what their performance would be (Sample). Finally, we apply
selection strategies to choose what parameters to use next
(Optimize). We describe each of these steps in detail below.
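To make the loop concrete, here is a minimal Model-Sample-Optimize sketch in which an ordinary Gaussian process surrogate (not a Gaussian Copula process) tunes a single toy parameter; the objective, kernel length-scale, and upper-confidence-bound rule are all illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(p):
    # Hypothetical "pipeline performance" as a function of one parameter.
    return -(p - 0.3) ** 2

def rbf(a, b, ls=0.2):
    # Squared-exponential kernel between two 1-D point sets.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

tried = list(rng.uniform(0, 1, 3))        # initial random parameter choices
scores = [objective(p) for p in tried]
for _ in range(10):
    X, y = np.array(tried), np.array(scores)
    K = rbf(X, X) + 1e-6 * np.eye(len(X))
    cand = rng.uniform(0, 1, 50)          # Sample: candidate parameters
    Ks = rbf(cand, X)
    mu = Ks @ np.linalg.solve(K, y)       # Model: GP posterior mean
    var = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    ucb = mu + 2.0 * np.sqrt(np.maximum(var, 0))
    nxt = cand[np.argmax(ucb)]            # Optimize: pick the best candidate
    tried.append(nxt)
    scores.append(objective(nxt))

print(round(max(scores), 4))  # best score found, close to the optimum 0.0
```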
The GCP-optimized parameter values for each dataset (the per-dataset columns of Table I, in the row order of that table) are: KDD cup 2014: 1, 389, 32, 4, 387, 19, 32; IJCAI: 1, 271, 65, 8, 453, 10, 51; KDD cup 2015: 2, 420, 18, 1, 480, 10, 34.

Model: Typically, a Gaussian process is used to model f such that, for any finite set of N points p_1, ..., p_N in X, {f(p_i)}_{i=1}^{N} has a multivariate Gaussian distribution on R^N.
The properties of such a process are determined by the mean function (commonly taken as zero for centered data) and the covariance function K : P × P → R. For an extensive review of Gaussian processes, see [3]; their use in Bayesian optimization is explained at length in [4], and their use in tuning the parameters of classifiers in [2].
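The finite-dimensional marginal in the definition above can be sampled directly; the squared-exponential kernel and its length-scale below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.linspace(0.0, 1.0, 5)  # five points p_1..p_5 in X
# Covariance matrix K from a squared-exponential kernel (length-scale 0.1).
K = np.exp(-0.5 * (P[:, None] - P[None, :]) ** 2 / 0.1**2)
# One draw of (f(p_1), ..., f(p_5)) ~ N(0, K); tiny jitter keeps K numerically PSD.
f = rng.multivariate_normal(np.zeros(5), K + 1e-9 * np.eye(5))
print(f.shape)  # (5,)
```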
EXPERIMENTAL RESULTS
DISCUSSION
Fig. 7. The maximum cross validation AUC score found by iteration for all
three datasets. From top to bottom: KDD cup 2014, IJCAI, KDD cup 2015
Fig. 6. Three different data science competitions held during the period of 2014-2015 from left to right: KDD cup 2014, IJCAI, and KDD cup 2015. A total
of 906 teams took part in these competitions. Predictions from the Data Science Machine solution were submitted and the results are reported in this paper. We
note that two out of three competitions are ongoing at the time of writing.
TABLE II. THE AUC SCORES ACHIEVED BY THE DATA SCIENCE MACHINE. THE "DEFAULT" SCORE USES THE DEFAULT PARAMETERS. THE "LOCAL" SCORE IS THE RESULT OF RUNNING K-FOLDS (K=3) CROSS VALIDATION, WHILE THE "ONLINE" SCORE IS BASED ON SUBMITTING PREDICTIONS TO THE COMPETITION.

Parameter Selection   KDD cup 2014       IJCAI              KDD cup 2015
                      Local    Online    Local    Online    Local    Online
Default               .6059    .55481    .6439    .6313     .8444    .5000
GCP Optimized         .6321    .5863     .6540    .6606     .8672    .8621

TABLE III. THE NUMBER OF ROWS AND THE NUMBER OF SYNTHESIZED FEATURES PER ENTITY IN EACH DATASET. THE UNCOMPRESSED SIZES OF THE KDD CUP 2014, IJCAI, AND KDD CUP 2015 DATASETS ARE APPROXIMATELY 3.1 GB, 1.9 GB, AND 1.0 GB, RESPECTIVELY.
KDD cup 2014:
Entity          # Rows       # Features
Projects        664,098      935
Schools         57,004       430
Teachers        249,555      417
Donations       2,716,319    21
Donors          1,282,092    4
Resources       3,667,217    20
Vendors         357          13
Outcomes        619,326      13
Essays          664,098      9

IJCAI:
Entity          # Rows       # Features
Merchants       4,995        43
Users           424,170      37
Behaviors       54,925,330   147
Categories      1,658        12
Items           1,090,390    60
Brands          8,443        43
ActionType      5            36
Outcomes        522,341      82

KDD cup 2015:
Entity          # Rows       # Features
Enrollments     2,000,904    450
Users           112,448      202
Courses         39           178
Outcomes        120,542      3
Log             13,545,124   263
Objects         26,750       304
ObjectChildren  26,033       3
EventTypes      7            2
Fig. 8. AUC scores vs. the % of participants achieving that score. The vertical line indicates where the Data Science Machine ranked; from top to bottom: KDD cup 2014, IJCAI, and KDD cup 2015.
TABLE IV. HOW THE DATA SCIENCE MACHINE COMPARES TO HUMAN EFFORTS. WE DO NOT HAVE DATA ON THE NUMBER OF SUBMISSIONS FOR IJCAI. KDD CUP 2015 WAS AN ONGOING COMPETITION AT THE TIME OF WRITING, SO THIS IS A SNAPSHOT FROM MAY 18TH, 2015.

Dataset        # Teams   % of Teams Worse   Cumulative submissions   # Submissions worse
KDD cup 2014   473       69.3%              1.2 x 10^4               3873
IJCAI          156       32.7%              -                        -
KDD cup 2015   277       85.6%              -                        1319
RELATED WORK
Similarly, in natural language processing, there are generalized feature generation techniques such as term frequency-inverse document frequency (tf-idf), which is the ratio of how frequently a word appears in a document to how often it appears in the whole corpus of documents. This feature type is easy to calculate and performs well [9]. Others include modeling techniques like Latent Dirichlet Allocation (LDA), which transforms a corpus of documents into document-topic mappings [10]. LDA has proven to be general-purpose enough to be useful in many document classification tasks, including spam filtering [11] and article recommendation [12].
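A bare-bones tf-idf computation (one of several common weighting variants; the toy corpus is ours):

```python
import math

# Toy corpus: each document is a list of tokens.
docs = [["data", "science", "machine"],
        ["data", "pipeline"],
        ["feature", "synthesis"]]

def tf_idf(term, doc, corpus):
    """tf = count/len(doc); idf = log(N / document frequency)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

print(tf_idf("pipeline", docs[1], docs))  # rare term, higher weight
print(tf_idf("data", docs[0], docs))      # common term, lower weight
```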
Importantly, while we consider these generalized algorithms to be useful, they still have to be tuned for best performance. For example, in LDA, choosing the number of topics in the corpus is challenging enough to encourage further research [10]. The importance of these algorithms is that they are techniques to generically capture information about a type of data.
CONCLUSION
We have presented the Data Science Machine: an end-to-end system for doing data science with relational data. At its
core is Deep Feature Synthesis, an algorithm for automatically
synthesizing features for machine learning. We demonstrated
the expressiveness of the generated features on 3 datasets from
different domains. By implementing an autotuning process,
we optimize the whole pathway without human involvement,
enabling it to generalize to different datasets. Overall, the
system is competitive with human solutions for the datasets
we tested on. We view this success as an indicator that the
Data Science Machine has a role in the data science process.
A. Future work
We see the future direction of the Data Science Machine
as a tool to empower data scientists. To this end, there are