
MINING HIGH-SPEED DATA STREAMS

Presented by:

Yumou Wang

Dongyun Zhang
Hao Zhou

INTRODUCTION
The world's information is doubling every two years.
From 2006 to 2011, the amount of information grew by a factor of 9 in just five years.

INTRODUCTION
By 2020 the world will generate 50 times the amount of information and 75 times the number of "information containers".
However, the IT staff available to manage it will grow by less than a factor of 1.5.
Current algorithms can only deal with small amounts of data, often less than a single day's data of many applications.
For example: banks, telecommunication companies.

INTRODUCTION
Problem: when new examples arrive at a higher rate than they can be mined, the amount of unused data grows without bound as time progresses.
Today, dealing with these huge amounts of data in a responsible way is very important.
Mining these continuous data streams brings unique opportunities, but also new challenges.

BACKGROUND
Design criteria for mining high-speed data streams:
It must be able to build a model using at most one scan of the data.
It must use only a fixed amount of main memory.
It must require small constant time per record.

BACKGROUND
Usually, a KDD system operates on these examples as they arrive.
Shortcoming: the learned model can be highly sensitive to example ordering compared to the batch model.
Other systems can produce the same model as the batch version, but much more slowly.

CLASSIFICATION METHOD
Input:
Examples of the form (x, y), where y is the class label and x is the vector of attributes.
Output:
A model y = f(x) that predicts the classes y of future examples x with high accuracy.

DECISION TREE
One of the most effective and widely-used classification methods.
A decision tree is a decision support tool that uses a tree-like graph or model.
Decision trees are commonly used in machine learning.

BUILDING A DECISION TREE


1. Start at the root.
2. Test all the attributes and choose the best one according to some heuristic measure.
3. Split the node into branches and leaves.
4. Recursively replace leaves by test nodes.

EXAMPLE OF DECISION TREE

EXAMPLE OF DECISION TREE

PROBLEMS
Traditional decision tree learners have some problems.
Some of them assume that all training examples can be stored simultaneously in main memory.
Disadvantage: this limits the number of examples that can be learned from.
Disk-based decision tree learners keep the examples on disk and repeatedly read them.
Disadvantage: expensive when learning complex trees.

HOEFFDING TREES
Designed for extremely large datasets.
Main idea: find the best attribute at a given node by considering only a small subset of the training examples that pass through that node.
Question: how many examples are sufficient?

HOEFFDING BOUND
Definition: the statistical result that decides how many examples n each node needs is called the Hoeffding bound.
Assume: R is the range of the variable r, and we have n independent observations with observed mean r̄.
With probability 1 - δ, the true mean of r is at least r̄ - ε, where

ε = sqrt( R² ln(1/δ) / (2n) )

HOEFFDING BOUND

ε = sqrt( R² ln(1/δ) / (2n) )

ε is a decreasing function of n: the larger n is, the smaller ε becomes.
ε bounds the difference between the true mean and the observed mean of r.
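
A minimal sketch of the bound as code (the function name and the sample values of R, δ, and n are illustrative, not from the slides):

import math

def hoeffding_bound(value_range, delta, n):
    # epsilon = sqrt(R^2 * ln(1/delta) / (2n))
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Example: information gain lies in [0, 1] (R = 1), delta = 1e-7.
for n in (200, 1000, 5000):
    print(n, hoeffding_bound(1.0, 1e-7, n))

As n grows, the printed ε shrinks, which is exactly why waiting for more examples eventually makes the split decision safe.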

HOEFFDING TREE ALGORITHM

HOEFFDING TREE ALGORITHM
Inputs:
S -> a sequence of examples,
X -> a set of discrete attributes,
G(.) -> a split evaluation function,
δ -> one minus the desired probability of choosing the correct attribute at any given node.
Outputs:
HT -> a decision tree.

HOEFFDING TREE ALGORITHM
Goal:
Ensure that, with high probability, the attribute chosen using n examples is the same as the one that would be chosen using infinitely many examples.
Let Xa be the attribute with the highest observed G and Xb the attribute with the second-highest G, after seeing n examples.
Let ΔG = G(Xa) - G(Xb).
Split when ΔG > ε.
Thus a node needs to accumulate examples from the stream until ε becomes smaller than ΔG.
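
This split test can be written compactly as a sketch; it reuses hoeffding_bound from the earlier example, and the dictionary of observed gains is an illustrative assumption:

def should_split(gain_by_attribute, delta, n, value_range=1.0):
    # Compare the two best attributes' observed gains against the Hoeffding bound.
    ranked = sorted(gain_by_attribute.items(), key=lambda kv: kv[1], reverse=True)
    (best_attr, g_a), (_, g_b) = ranked[0], ranked[1]
    epsilon = hoeffding_bound(value_range, delta, n)
    return best_attr if (g_a - g_b) > epsilon else None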

HOEFFDING TREE ALGORITHM
The algorithm constructs the tree using the same procedure as ID3: it calculates the information gain for the attributes and determines the best attribute.
At each node it checks the condition ΔG > ε.
If the condition is satisfied, it creates child nodes based on the test at the node.
If not, it streams in more training examples and carries out the calculations until the condition is satisfied.
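
Putting the pieces together, the per-example processing might look roughly like this sketch (the Leaf class, its methods, and should_split from the earlier sketch are assumptions for illustration, not the paper's code):

def hoeffding_tree_update(root, example, label, delta):
    leaf = root.sort_to_leaf(example)       # follow the current tests down to a leaf
    leaf.update_counts(example, label)      # store only sufficient statistics, not the example
    gains = leaf.information_gains()        # G(.) for each attribute still available at this leaf
    attr = should_split(gains, delta, leaf.n_seen)
    if attr is not None:
        leaf.split_on(attr)                 # replace the leaf by a test node with fresh leaves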

HOEFFDING TREE ALGORITHM
Memory cost:
d: number of attributes
c: number of classes
v: number of values per attribute
l: number of leaves in the tree
The memory cost for each leaf is O(dvc).
The memory cost for the whole tree is O(ldvc).
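As a rough illustration with made-up numbers: for d = 100 attributes, v = 10 values per attribute, c = 2 classes, and l = 1,000 active leaves, each leaf stores on the order of 100 x 10 x 2 = 2,000 counts, and the whole tree on the order of 2 million counts; only these sufficient statistics, not the examples themselves, need to stay in memory.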

ADVANTAGES OF HOEFFDING TREES
1. Can deal with extremely large datasets.
2. Each example is read at most once, in a small constant time. This makes it possible to mine online data sources.
3. Builds very complex trees with acceptable computational cost.

VFDT: VERY FAST DECISION TREE
Breaking ties
  Reduces wasted examples when two attributes have nearly identical G and the bound cannot separate them.
  Split anyway once the difference is below a user-specified tie threshold (ΔG < ε < τ).
Recomputing G only periodically (every n_min examples)
  The choice of split is unlikely to change with a single new example.
  Significantly reduces the time spent on recomputation.
Memory cleanup
  Measure how promising each leaf is.
  Deactivate the least promising leaves when memory is full.
  Option to reactivate them later.
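
A hedged sketch of the tie-breaking and periodic recomputation refinements (tau, n_min, and the leaf bookkeeping fields are illustrative assumptions, not the paper's code; hoeffding_bound is from the earlier sketch):

def try_split(leaf, delta, tau=0.05, n_min=200, value_range=1.0):
    if leaf.n_seen - leaf.n_at_last_check < n_min:
        return None                               # recompute G only every n_min examples
    leaf.n_at_last_check = leaf.n_seen
    ranked = sorted(leaf.information_gains().items(), key=lambda kv: kv[1], reverse=True)
    (best, g_a), (_, g_b) = ranked[0], ranked[1]
    epsilon = hoeffding_bound(value_range, delta, leaf.n_seen)
    if (g_a - g_b) > epsilon or epsilon < tau:    # clear winner, or a tie broken early
        return best
    return None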

VFDT: VERY FAST DECISION TREE
Filtering out poor attributes
  Drop clearly poor attributes early.
  Reduces memory consumption.
Initialization
  Can be initialized with an existing tree built by another learner.
  Gives the online learner a head start.
Rescans
  Previously seen examples can be rescanned when data arrives slowly enough or the dataset is small.

TESTS: CONFIGURATION
14 concepts
  Generated by random decision trees
  Number of leaves: 2.2k to 61k
  Noise level: 0 to 30%
50k examples for testing
Available memory: 40MB
Legacy processors

TESTS: SYNTHETIC DATA

(result plots on the synthetic data; figures not reproduced in this text version)

TESTS: SYNTHETIC DATA
Time consumption
  20m examples: VFDT takes 5752s to read and 625s to process.
  100k examples: C4.5 takes 36s; VFDT takes 47s.

TESTS: PARAMETERS

With vs. without over-pruning

TESTS: PARAMETERS
With ties vs. without ties
  65 nodes vs. 8k nodes for VFDT
  805 nodes vs. 8k nodes for VFDT-boot
  72.9% vs. 86.9% accuracy for VFDT
  83.3% vs. 88.5% accuracy for VFDT-boot

Recomputing G after every example vs. only every n_min examples
  VFDT: +1.1% accuracy, +3.8x time
  VFDT-boot: -0.9% accuracy, +3.7x time
  5% more nodes

TESTS: PARAMETERS
40MB vs. 80MB memory
  7.8k more nodes
  VFDT: +3.0% accuracy
  VFDT-boot: +3.2% accuracy
vs.
  30% fewer nodes
  VFDT: +2.3% accuracy
  VFDT-boot: +1.0% accuracy

TESTS: WEB DATA
For predicting web page accesses
  1.89m examples
  61.1% accuracy achievable by always predicting the most common class
  276,230 examples for testing

TESTS: WEB DATA
Decision stump
  64.2% accuracy
  1277s to learn
C4.5 with 40MB memory
  74.5k examples
  2975s to learn
  73.3% accuracy
VFDT bootstrapped with C4.5
  1.61m examples
  1450s to learn after initialization (983s to read)

TESTS: WEB DATA

MINING TIME-CHANGING DATA STREAMS

WHY IS VFDT NOT ENOUGH?
VFDT assumes the training data is a sample drawn from a stationary distribution.
Most large databases or data streams violate this assumption.
Concept drift: the data is generated by a time-changing concept function, e.g.
  Seasonal effects
  Economic cycles
Goal:
  Mine continuously changing data streams
  Scale well

WHY IS VFDT NOT ENOUGH?
Common approach: when a new example arrives, reapply a traditional learner to a sliding window of the w most recent examples.
Sensitive to window size:
  If w is small relative to the concept shift rate, the model is assured to reflect the current concept.
  But too small a w may leave too few examples to learn the concept.
If examples arrive at a rapid rate, or the concept changes quickly, the computational cost of reapplying a learner may be prohibitively high (see the sketch below).
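
A minimal sketch of that naive sliding-window baseline (the window size, the batch learner, and the deque-based bookkeeping are illustrative assumptions); it makes the cost visible: every arrival retrains from scratch over w examples:

from collections import deque

def sliding_window_learner(stream, train_batch_model, w=10000):
    # Naive baseline: retrain a batch learner on the last w examples at every arrival.
    window = deque(maxlen=w)
    for x, y in stream:
        window.append((x, y))
        model = train_batch_model(list(window))   # O(w) or worse work per example
        yield model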

CVFDT
CVFDT (Concept-adapting Very Fast Decision Tree learner)
Extends VFDT:
  Maintains VFDT's speed and accuracy
  Detects and responds to changes in the example-generating process

CVFDT (CONTD.)
With a time-changing concept, the current splitting attribute of some nodes may no longer be the best.
An outdated subtree may still be better than the best single leaf, particularly if it is near the root.
Grow an alternative subtree, with the new best attribute at its root, when the old attribute starts to look out-of-date.
Periodically use a batch of examples to evaluate the quality of the trees.
Replace the old subtree when the alternate one becomes more accurate.
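
The bookkeeping might look roughly like this sketch (the Node fields, accuracy helper, and new_leaf are illustrative assumptions, not CVFDT's actual code; should_split is from the earlier sketch):

def check_node_for_drift(node, delta):
    # If the split chosen at this node no longer looks best, start an alternate subtree.
    best_now = should_split(node.information_gains(), delta, node.n_seen)
    if best_now is not None and best_now != node.split_attribute and node.alternate is None:
        node.alternate = new_leaf()      # grows in parallel on the same incoming examples

def maybe_swap(node, validation_batch):
    # Periodically compare the old subtree with its alternate on recent examples
    # and keep whichever is more accurate.
    if node.alternate is None:
        return node
    if accuracy(node.alternate, validation_batch) > accuracy(node, validation_batch):
        return node.alternate
    return node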

HOW CVFDT WORKS

EXAMPLE

SAMPLE EXPERIMENT RESULT

CONCLUSION AND FUTURE WORK
CVFDT is able to keep a decision tree up-to-date with a window of examples, using only a small constant amount of time for each new example that arrives.
Empirical studies show that CVFDT can effectively keep its model up-to-date with a massive data stream, even in the face of large and frequent concept shifts.
Future work: CVFDT currently discards subtrees that are out-of-date, but some concepts change periodically and these subtrees may become useful again. Identifying these situations and taking advantage of them is another area for further study.

THANK YOU
