http://hamzehal.blogspot.com/2014/06/adaboost-sparse-input-support.html
To reproduce the same prediction behavior as on dense target data, first a sparse class distribution function was written to get the classes of each column in the sparse matrix; second, a random sampling function was created to provide a sparse matrix of values randomly drawn from a user-specified distribution. Read the blog post to see detailed results of the sparse output dummy pull request.
PR #3350 - Sparse Output KNN Classifier
Status: Nearing Completion
Summary of the work done: In the predict function of the classifier, the dense target data is indexed one column at a time. The main improvement made here is to leave the target data in sparse format and only convert a column to a dense array when necessary. This results in lower peak memory consumption; the improvement is proportional to the sparsity and overall size of the target matrix.
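The column-at-a-time idea can be sketched as follows. This is a minimal illustration with scipy.sparse, not the actual pull request code, and `columns_as_dense` is a hypothetical helper name:

```python
import numpy as np
from scipy import sparse

def columns_as_dense(Y):
    """Yield one dense column at a time from a sparse target matrix."""
    Y = sparse.csc_matrix(Y)  # CSC format makes column slicing cheap
    for j in range(Y.shape[1]):
        # only this single column is ever dense in memory
        yield Y[:, j].toarray().ravel()

Y = sparse.csc_matrix(np.array([[0, 1], [1, 0], [0, 0]]))
cols = list(columns_as_dense(Y))
```

Because only one dense column exists at any moment, peak memory scales with n_samples rather than n_samples * n_outputs.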
Future Directions
My goal for the fall semester is to support the changes I have made to the scikit-learn code base as best I can. I also hope to finalize the remaining two pull requests.
The scikit-learn dummy classifier is a simple way to get naive predictions based only on the target data of your dataset. It has four strategies of operation.
constant - always predict a value manually specified by the user
uniform - label each example with a label chosen uniformly at random from the target data given
stratified - label the examples with the class distribution seen in the training data
most_frequent - always predict the mode of the target data
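The four strategies can be sketched for a single output column in plain Python. This is a toy illustration of what each strategy computes, with a hypothetical `dummy_predict` helper, and is not scikit-learn's implementation:

```python
import random
from collections import Counter

def dummy_predict(y_train, n_predictions, strategy, constant=None, seed=0):
    """Toy single-output version of the four dummy strategies."""
    rng = random.Random(seed)
    classes = sorted(set(y_train))
    if strategy == "constant":
        # always return the user-specified value
        return [constant] * n_predictions
    if strategy == "uniform":
        # each class is equally likely, regardless of frequency
        return [rng.choice(classes) for _ in range(n_predictions)]
    if strategy == "stratified":
        # sample classes with the probabilities seen in training
        counts = Counter(y_train)
        weights = [counts[c] / len(y_train) for c in classes]
        return rng.choices(classes, weights=weights, k=n_predictions)
    if strategy == "most_frequent":
        # always return the mode of the training targets
        return [Counter(y_train).most_common(1)[0][0]] * n_predictions
    raise ValueError(strategy)

y = [0, 0, 0, 1]
mode_preds = dummy_predict(y, 3, "most_frequent")
const_preds = dummy_predict(y, 2, "constant", constant=1)
```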
The dummy classifier has built-in support for multilabel-multioutput data. This week I made pull request #3438, which introduces support for sparsely formatted output data. This is useful because memory consumption can be vastly improved when the data is highly sparse. Below is a benchmark of these changes, with two memory consumption results graphed for each of the four strategies: once with sparsely formatted target data and once with densely formatted data as the control.
Results Visualized
Constant Results: Dense 1250 MiB, Sparse 300 MiB
The constants used in the fit have a level of sparsity similar to the data because they were chosen as an arbitrary row
from the target data.
Conclusions
We can see that in all cases except Uniform we get significant memory improvements by supporting sparse matrices. The sparse matrix implementation for uniform is not useful because of the dense nature of the output, even when the input shows high levels of sparsity. It is possible this case will be revised to warn the user or even throw an error.
Remaining Work
There is work to be done on this pull request to make the predict function faster in the stratified and uniform cases when using sparse matrices. Although the uniform case is not important in itself, the underlying code for generating sparse random matrices is also used in the stratified case, so any improvements to uniform will come for free if the stratified case's speed is improved.
Another upcoming focus is to return to the sparse output KNN pull request and make some improvements. The code written in the sparse output dummy pull request for gathering a class distribution from a sparse target matrix can be abstracted into a utility function and reused in the KNN pull request.
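The core of such a utility can be sketched like this. It counts the implicit zeros from the column length instead of densifying, so only stored entries are ever touched; this is an illustration of the idea, with a hypothetical function name, not the helper that ended up in scikit-learn:

```python
import numpy as np
from scipy import sparse

def sparse_class_distribution(Y):
    """For each column of sparse target matrix Y, return (classes, counts).

    Assumes no explicitly stored zeros; the count of the zero class
    is inferred from the column length, so Y is never densified.
    """
    Y = sparse.csc_matrix(Y)
    n_samples = Y.shape[0]
    out = []
    for j in range(Y.shape[1]):
        # stored (non-zero) entries of column j
        data = Y.data[Y.indptr[j]:Y.indptr[j + 1]]
        classes, counts = np.unique(data, return_counts=True)
        n_zeros = n_samples - len(data)
        if n_zeros > 0:  # account for the implicit zero class
            classes = np.concatenate(([0], classes))
            counts = np.concatenate(([n_zeros], counts))
        out.append((classes, counts))
    return out

Y = sparse.csc_matrix(np.array([[1, 0],
                                [1, 2],
                                [0, 2]]))
dist = sparse_class_distribution(Y)
```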
Going forward with sparse support for various classifiers, I have been working on a pull request for sparse one-vs-rest classifiers that will allow sparse target data formats. This will result in a significant improvement in memory usage when working with large amounts of sparse target data; a benchmark is given below. Ultimately, what this means for users is that with the same amount of system memory it will be possible to train and predict with an OvR classifier on a larger target data set. A big thank you to both Arnaud and Joel for the close inspection of my code so far and the suggestions for improving it!
Implementation
The One vs. Rest classifier works by binarizing the target data and fitting an individual classifier for each class. The implementation of sparse target data support improves memory usage because it uses a sparse binarizer to give a binary target data matrix that is highly space efficient.
By avoiding a dense binarized matrix, we can slice out the one column at a time required by each classifier and densify only when necessary. At no point will the entire dense matrix be present in memory. The benchmark that follows illustrates this.
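The fit loop can be sketched as follows: the binarized target stays sparse, and each per-class column is densified only at the moment its classifier is fit. This is a hypothetical simplification of the one-vs-rest structure, not the pull request code:

```python
import numpy as np
from scipy import sparse

def ovr_fit_columns(Y_bin, fit_one):
    """Call fit_one(column) once per class; Y_bin is a sparse binary matrix."""
    Y_bin = sparse.csc_matrix(Y_bin)
    estimators = []
    for j in range(Y_bin.shape[1]):
        # densify only this one column, fit, then let it be freed
        y_col = Y_bin[:, j].toarray().ravel()
        estimators.append(fit_one(y_col))
    return estimators

Y_bin = sparse.csc_matrix(np.eye(3))
# A stand-in "classifier" that just records the number of positives:
models = ovr_fit_columns(Y_bin, fit_one=lambda col: int(col.sum()))
```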
Benchmark
A significant part of the work on this pull request has involved devising benchmarks to validate intuition about the improvements provided; Arnaud contributed the benchmark presented here to showcase the memory improvements.
Using the memory_profiler module, we can see how the fit and predict functions of the OvR classifier affect memory consumption. In the following examples we initialize a classifier and fit it to the training dataset in one step, then predict on a test dataset. We first run a control benchmark, which shows the state of one-vs-rest classifiers as they are without this pull request. The second benchmark repeats the same steps, but instead of using dense target data it passes the target data to the fit function in a sparse format.
The dataset is generated with scikit-learn's make_multilabel_classification, using the following call:
from sklearn.datasets import make_multilabel_classification
X, y = make_multilabel_classification(sparse=True, return_indicator=True,
                                      n_samples=20000, n_features=100,
                                      n_classes=4000, n_labels=4,
                                      random_state=0)
This results in a densely formatted target dataset in which only about 0.1% of the entries are non-zero.
Control Benchmark
est = OneVsRestClassifier(MultinomialNB(alpha=1)).fit(X, y)
consumes 179.824 MB
est.predict(X)
consumes -73.969 MB. The negative value indicates that memory has been freed.
est.predict(X)
consumes 0.180 MB
Improvement
Considering the peak memory consumption in the two cases as roughly 180 MB and 30 MB, we see a 6x improvement in peak memory consumption on the data set we benchmarked.
Thank you Arnaud, Joel, Olivier, Noel, and Lars for the time taken to give the constructive criticism that has vastly improved my implementation of these changes to the code base.
PR - Sparse Metrics
Status: To be started
Summary of the work done: Modify the metrics and some miscellaneous tools to support sparse target data so sparsity can be maintained throughout the entire learning cycle. The tools to be modified include precision score, accuracy score, parameter search, and the other metrics listed in scikit-learn's model evaluation documentation under the classification metrics header.
PR - Decision Tree and Random Forest Sparse Output
Status: To be started
Summary of the work done: Make revisions in the tree code to support sparsely formatted target data and update
the random forest ensemble method to use the new sparse target data support.
In list form this data would look like the following, listing each image's label one after the other:
Y = [2, 1, 3, 2, 4]
The label binarizer will give a matrix where each column is an indicator for a class and each row is an image/example.

    [0, 1, 0, 0]
    [1, 0, 0, 0]
Y = [0, 0, 1, 0]
    [0, 1, 0, 0]
    [0, 0, 0, 1]
Before my pull request, all conversions from the label binarizer would give the above matrix in dense format, exactly as it appears. My pull request makes it so that the user can specify that they would like the matrix returned in sparse format; if so, the result is a sparse matrix, which has the potential to save a lot of space and runtime depending on how sparse the target data is.
These two calls to the label binarizer illustrate how sparse output can be enabled: the first call prints a dense matrix, the second returns a sparse matrix.
Input:
Y_bin = label_binarize(y, classes=[1, 2, 3, 4])
print(type(Y_bin))
print(Y_bin)
Output:
<type 'numpy.ndarray'>
[[0 1 0 0]
 [1 0 0 0]
 [0 0 1 0]
 [0 1 0 0]
 [0 0 0 1]]
Input:
Y_bin = label_binarize(y, classes=[1, 2, 3, 4], sparse_output=True)
print(type(Y_bin))
print(Y_bin)
Output:
<class 'scipy.sparse.csr.csr_matrix'>
  (0, 1)    1
  (1, 0)    1
  (2, 2)    1
  (3, 1)    1
  (4, 3)    1
The upcoming pull request for sparse One vs. Rest support is what motivated this update: we want to overcome the extreme runtime and space requirements that arise on datasets with large numbers of labels.
Thank you to the reviewers Arnaud, Joel, and Olivier for their comments this week, and to Rohit for starting the code on which I based my changes.
This week, as part of my work on the scikit-learn code base, I implemented sparse input support for AdaBoost. This work is being done in pull request 3161. I will give a demonstration of the value of AdaBoost and how my contributions improved the scikit-learn implementation of the classifier. In addition, with the goal of implementing sparse output support in scikit-learn, I have been working on pull request 3203 for sparse label binarization, building off code written previously by Rohit Sivaprasad. Of course I had help, and I would like to thank Arnaud Joly, Joel Nothman, and Olivier Grisel for reviewing my code to help finalize and verify its correctness!
What is AdaBoost?
AdaBoost is a meta-classifier: it operates by repeatedly training many base classifiers that are individually not very accurate and pooling their results to make a more accurate classifier. This is a common ensemble method known as boosting. In addition, AdaBoost looks for examples that most base classifiers are having trouble getting right and increases the focus on these examples in hopes of improving overall prediction accuracy.
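The reweighting step that "increases the focus" on hard examples can be sketched in a few lines. This is a toy single-round illustration of the classic AdaBoost update (misclassified samples are up-weighted, then weights are renormalized), with hypothetical names, not scikit-learn's implementation:

```python
import math

def adaboost_reweight(sample_weights, is_wrong):
    """One AdaBoost-style reweighting round (toy sketch).

    Samples the base classifier got wrong have their weight increased,
    correct ones decreased; the weights are then renormalized to sum to 1.
    """
    # weighted error of the base classifier on this round
    err = sum(w for w, wrong in zip(sample_weights, is_wrong) if wrong)
    alpha = 0.5 * math.log((1 - err) / err)  # the classifier's "say"
    new = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(sample_weights, is_wrong)]
    total = sum(new)
    return [w / total for w in new]

# Four equally weighted samples; the base classifier misses the first one.
w = adaboost_reweight([0.25] * 4, [True, False, False, False])
```

After one round the misclassified sample carries half of the total weight, which is exactly the "focus" that later rounds exploit.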
We can demonstrate AdaBoost homing in on hard samples by running an experiment where we train AdaBoost to recognize the integer value in an image of a handwritten digit. After running AdaBoost, we can see which examples it had the most trouble with by examining the sample weights: images with high sample weight are harder for the classifier to get right.
The idea behind this experiment is that samples with high final sample weights, after AdaBoost has finished training, were more commonly misclassified; the reasons for misclassification could be subtle or very obvious. One possible reason is that the image does not look much like the digit it is supposed to represent, so it gets classified as another, incorrect digit. It is plausible that images with high sample weights will be malformed examples, since so many classifiers are getting them wrong.
To test this, I trained AdaBoost on the digits dataset. I then retrieved the sample weights from the AdaBoost classifier, sorted them, and took the four highest. These correspond to the four samples in the top row of the following image. I also found the four samples with the lowest weights and put them in the bottom row.
In line with our intuition, there is a very sloppy and vague example in the top row. The third image would be very hard to identify as a two; the AdaBoost training process identified this sample and gave it a high sample weight. What is interesting is that 1) most of the other digits in the top row look easy to identify, 2) the digits in the top row are all twos and threes, and 3) the bottom row is all eights.
The way I interpret this is that in this data set the eights are the most consistently portrayed digit. They all look the same in the bottom row, which is very important for accurate classification, since the classifiers make their predictions based only on data they have seen before. In the top row alone, however, we see two very different-looking twos and two very different-looking threes. Understandably, this makes these two digits hard to label correctly if the variation seen here is represented throughout the entire data set.
Low training and prediction time is important because it allows us to rerun experiments more rapidly, but it is also necessary for quick real-time applications of prediction, such as facial recognition or handwriting recognition from a live video stream.
Entry   Index
David   1
likes   2
to      3
paint   4
Mary    5
sail    6
also    7
write   8
We now look back over each original document and count the number of times each entry occurs in it. We represent each document as a vector of eight numbers, one for the count of each dictionary entry.
[1, 2, 2, 1, 1, 1, 0, 0]
[1, 0, 1, 0, 0, 0, 1, 1]
[1, 2, 2, 1, 1, 1, 0, 0]
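The whole dictionary-and-counts procedure fits in a few lines of Python. The toy documents below are hypothetical stand-ins (the original sentences are not reproduced in this post), and `count_vectors` is an illustrative helper, not a library function:

```python
from collections import Counter

def count_vectors(docs):
    """Build a dictionary over all docs, then one term-count vector per doc."""
    # the dictionary: every distinct word across all documents, sorted
    vocab = sorted({w for d in docs for w in d.split()})
    # one count per dictionary entry for each document
    return vocab, [[Counter(d.split())[w] for w in vocab] for d in docs]

# Hypothetical toy documents, not the ones from the example above:
vocab, vecs = count_vectors(["david likes to paint",
                             "mary likes to sail"])
```

Note how each vector already contains zeros for the dictionary entries a document does not use; with longer documents and a bigger dictionary these zeros come to dominate, which is the sparsity discussed next.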
We now look at a second example designed to illustrate data sparsity. Our three documents this time will be full Wikipedia articles; instead of pasting the text of the articles here, I will put a link in place.
Tardigrade
Cinco de Mayo
Southeast Asia
Using the same approach as above we build a dictionary and find it has 2820 words.
Entry      Index
A          1
A.D        2
API        3
ASEAN      4
Abagatan   5
Abackay    6
...        ...
Zone       2820
Each document will now have significantly more zeros in its vector representation, since there are many words in the dictionary that a given document does not use. As we keep adding more and more documents we will have a larger dictionary, but the number of non-zero entries in each document's vector will remain the same, since a document still only uses the words it contains.
[877 non-zero entries, 1943 zeros]
[706 non-zero entries, 2114 zeros]
[1790 non-zero entries, 1030 zeros]
This means the parts of our numeric data that encode the information in each document have become a smaller portion of the numerical representation. As the data grows with more documents, this trend continues until the number of empty entries, or zeros, overwhelms the non-zero data that represents the words present.
In practical contexts, where these documents correspond to text-heavy encyclopedia entries or news articles, we also have some metadata about each document, such as the topics or categories it falls under. One document could have the labels biology, nature, and wilderness. Another could fall under history, Britain, and nature.
With machine learning, the process of a computer improving with respect to a task as it sees more data, we would want to teach software to recognize documents and accurately label them. The matrix we generated above to describe the documents would be called the input data. The output data, some of which we would provide initially to give examples of correct answers, would be the labels or categories each document falls under.
A classifier is the piece of code we train to place data into categories. One such classifier is a decision tree, which labels new instances using the most influential features it has identified from previous examples. In machine learning, a common way to get the best performance is to train many classifiers and then combine them as a population to make one final decision; this is referred to as ensemble learning. One method for doing this is boosting, which builds a strong classifier by combining weak classifiers that do better than random guessing. Bagging is another ensemble method, where classifiers are trained independently on random pieces of the data and combined, resulting in increased accuracy.
Data Density
Consider a set of approximately 2.4 million documents, where each document can be labeled with a set of labels drawn from a collection of 325,000 labels. This is exactly the data set behind the Large Scale Hierarchical Text Classification (LSHTC) competition held on Kaggle, a host for machine learning competitions.
With data sets this large we accumulate a massive dictionary, because of the vast vocabulary encountered across all 2.4 million documents. Consequently, each document only uses a small subset of the entries in the dictionary, and a document's term counts for most of the words in the dictionary are zero. This means our data will consist mostly of zeros. Sparsity is the fraction of zeros with respect to all elements.
The LSHTC training data set consists of 3.8 trillion elements, only 100 million of which are non-zero. Assuming 4 bytes per element represented as an integer, a computer would need 15.31 TB to load this data set into memory. Since this is not at all realistic, it is useful to represent the data in a sparse format by excluding all the zeros. This format only records the non-zero elements and their locations. Since the data is approximately 99.9% zeros, this shrinks the memory requirement way down, to roughly 700 MB.
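The same effect can be observed on a small scale by comparing the bytes a mostly-zero matrix occupies in dense form against a CSR sparse representation, which stores only the non-zero values plus their column indices and row pointers. The matrix below is a synthetic example for illustration:

```python
import numpy as np
from scipy import sparse

# A 1000 x 1000 matrix with roughly 0.1% non-zero entries.
rng = np.random.RandomState(0)
dense = np.zeros((1000, 1000), dtype=np.int32)
idx = rng.randint(0, 1000, size=(1000, 2))
dense[idx[:, 0], idx[:, 1]] = 1

csr = sparse.csr_matrix(dense)
dense_bytes = dense.nbytes
# CSR stores three arrays: values, column indices, and row pointers.
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
```

On this toy matrix the dense form needs 4 MB while the CSR arrays fit in a few kilobytes, mirroring the TB-to-MB reduction described above.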
Dense Format vs. Sparse Format
The same code is run twice, once with the data in dense format and once in sparse format; only the format of X_train and X_test differs:
# Initialize K-nn
neigh = KNeighborsClassifier(n_neighbors=3)
# Begin Timing
start = time.clock()
# Train on data
neigh.fit(X_train, Y_train)
# Predict
neigh.predict(X_test[index])
return (time.clock() - start)
The sparse format shows a speedup of close to two orders of magnitude on the same data. To see how sparse and dense performance vary with data size, we rerun the above experiment with different sizes of training data and plot the progression of performance.
Sparsity also occurs in images, video, and audio given an appropriate data representation: images can be encoded in wavelets, and all of these formats can be encoded sparsely through key feature extraction.