
MINING HIGH-SPEED DATA STREAMS

Presented by:

Yumou Wang

Dongyun Zhang
Hao Zhou

INTRODUCTION
The world's information is doubling every two years.
From 2006 to 2011, the amount of information grew by a factor of 9 in just five years.

INTRODUCTION
By 2020 the world will generate 50 times the amount of information and 75 times the number of "information containers".
However, the IT staff available to manage it will grow by less than a factor of 1.5.
Current algorithms can only deal with small amounts of data, often less than a single day's data of many applications.
For example: banks, telecommunication companies.

INTRODUCTION
Problem: when new examples arrive at a higher rate than they can be mined, the amount of unused data grows without bound as time progresses.
Today, dealing with these huge amounts of data in a responsible way is very important.
Mining these continuous data streams brings unique opportunities, but also new challenges.

BACKGROUND
Design criteria for mining high-speed data streams:
It must be able to build a model using at most one scan of the data.
It must use only a fixed amount of main memory.
It must require small constant time per record.

BACKGROUND
Usually, a KDD system operates on these examples as they arrive.
Shortcoming: the learned model can be highly sensitive to example ordering compared to the batch model.
Other systems can produce the same model as the batch version, but much more slowly.

CLASSIFICATION METHOD
Input:
Examples of the form (x, y), where y is the class label and x is the vector of attributes.
Output:
A model y = f(x) that predicts the classes y of future examples x with high accuracy.

DECISION TREE
One of the most effective and widely-used classification methods.
A decision tree is a decision support tool that uses a tree-like graph or model.
Decision trees are commonly used in machine learning.

BUILDING A DECISION TREE


1. Start at the root.
2. Test all the attributes and choose the best one according to some heuristic measure.
3. Split the node into branches and leaves.
4. Recursively replace leaves by test nodes.

EXAMPLE OF DECISION TREE

EXAMPLE OF DECISION TREE

PROBLEMS
Traditional decision tree learners have some problems.
Some of them assume that all training examples can be stored simultaneously in main memory.
Disadvantage: this limits the number of examples that can be learned from.
Disk-based decision tree learners keep the examples on disk and repeatedly read them.
Disadvantage: expensive when learning complex trees.

HOEFFDING TREES
Designed for extremely large datasets.
Main idea: find the best attribute at a given node by considering only a small subset of the training examples that pass through that node.
Question: how many examples are sufficient?

HOEFFDING BOUND
Definition: the statistical result that decides how many examples n each node needs is called the Hoeffding bound.
Assume: R is the range of the variable r, and we have n independent observations with observed mean r̄.
With probability 1 - δ, the true mean of r is at least r̄ - ε, where

ε = sqrt( R² ln(1/δ) / (2n) )

HOEFFDING BOUND

ε = sqrt( R² ln(1/δ) / (2n) )

ε is a decreasing function of n: the larger n is, the smaller ε becomes.
ε bounds the difference between the true mean and the observed mean of r.
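
A minimal sketch of the bound as code (the function name and the sample values of R, δ, and n are illustrative, not from the slides):

import math

def hoeffding_bound(value_range, delta, n):
    # epsilon = sqrt(R^2 * ln(1/delta) / (2n))
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Example: information gain lies in [0, 1] (R = 1), delta = 1e-7.
for n in (200, 1000, 5000):
    print(n, hoeffding_bound(1.0, 1e-7, n))

As n grows, the printed ε shrinks, which is exactly why waiting for more examples eventually makes the split decision safe.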

HOEFFDING TREE ALGORITHM

HOEFFDING TREE ALGORITHM
Inputs:
S -> a sequence of examples,
X -> a set of discrete attributes,
G(.) -> a split evaluation function,
δ -> one minus the desired probability of choosing the correct attribute at any given node.
Outputs:
HT -> a decision tree.

HOEFFDING TREE ALGORITHM
Goal:
Ensure that, with high probability, the attribute chosen using n examples is the same as the one that would be chosen using infinitely many examples.
Let Xa be the attribute with the highest observed G and Xb the attribute with the second-highest G, after seeing n examples.
Let ΔG = G(Xa) - G(Xb).
Split when ΔG > ε.
Thus a node needs to accumulate examples from the stream until ε becomes smaller than ΔG.
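
This split test can be written compactly as a sketch; it reuses hoeffding_bound from the earlier example, and the dictionary of observed gains is an illustrative assumption:

def should_split(gain_by_attribute, delta, n, value_range=1.0):
    # Compare the two best attributes' observed gains against the Hoeffding bound.
    ranked = sorted(gain_by_attribute.items(), key=lambda kv: kv[1], reverse=True)
    (best_attr, g_a), (_, g_b) = ranked[0], ranked[1]
    epsilon = hoeffding_bound(value_range, delta, n)
    return best_attr if (g_a - g_b) > epsilon else None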

HOEFFDING TREE ALGORITHM
The algorithm constructs the tree using the same procedure as ID3: it calculates the information gain for the attributes and determines the best attribute.
At each node it checks the condition ΔG > ε.
If the condition is satisfied, it creates child nodes based on the test at the node.
If not, it streams in more training examples and carries out the calculations until the condition is satisfied.
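
Putting the pieces together, the per-example processing might look roughly like this sketch (the Leaf class, its methods, and should_split from the earlier sketch are assumptions for illustration, not the paper's code):

def hoeffding_tree_update(root, example, label, delta):
    leaf = root.sort_to_leaf(example)       # follow the current tests down to a leaf
    leaf.update_counts(example, label)      # store only sufficient statistics, not the example
    gains = leaf.information_gains()        # G(.) for each attribute still available at this leaf
    attr = should_split(gains, delta, leaf.n_seen)
    if attr is not None:
        leaf.split_on(attr)                 # replace the leaf by a test node with fresh leaves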

HOEFFDING TREE ALGORITHM
Memory cost:
d: number of attributes
c: number of classes
v: number of values per attribute
l: number of leaves in the tree
The memory cost for each leaf is O(dvc).
The memory cost for the whole tree is O(ldvc).
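As a rough illustration with made-up numbers: for d = 100 attributes, v = 10 values per attribute, c = 2 classes, and l = 1,000 active leaves, each leaf stores on the order of 100 x 10 x 2 = 2,000 counts, and the whole tree on the order of 2 million counts; only these sufficient statistics, not the examples themselves, need to stay in memory.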

ADVANTAGES OF HOEFFDING TREES
1. Can deal with extremely large datasets.
2. Each example is read at most once, in a small constant time. This makes it possible to mine online data sources.
3. Builds very complex trees with acceptable computational cost.

VFDT: VERY FAST DECISION TREE
Breaking ties
  Reduces wasted examples when two attributes have nearly identical G and the bound cannot separate them.
  Split anyway once the difference is below a user-specified tie threshold (ΔG < ε < τ).
Recomputing G only periodically (every n_min examples)
  The choice of split is unlikely to change with a single new example.
  Significantly reduces the time spent on recomputation.
Memory cleanup
  Measure how promising each leaf is.
  Deactivate the least promising leaves when memory is full.
  Option to reactivate them later.
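
A hedged sketch of the tie-breaking and periodic recomputation refinements (tau, n_min, and the leaf bookkeeping fields are illustrative assumptions, not the paper's code; hoeffding_bound is from the earlier sketch):

def try_split(leaf, delta, tau=0.05, n_min=200, value_range=1.0):
    if leaf.n_seen - leaf.n_at_last_check < n_min:
        return None                               # recompute G only every n_min examples
    leaf.n_at_last_check = leaf.n_seen
    ranked = sorted(leaf.information_gains().items(), key=lambda kv: kv[1], reverse=True)
    (best, g_a), (_, g_b) = ranked[0], ranked[1]
    epsilon = hoeffding_bound(value_range, delta, leaf.n_seen)
    if (g_a - g_b) > epsilon or epsilon < tau:    # clear winner, or a tie broken early
        return best
    return None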

VFDT: VERY FAST DECISION TREE
Filtering out poor attributes
  Drop clearly poor attributes early.
  Reduces memory consumption.
Initialization
  Can be initialized with an existing tree built by another learner.
  Gives the online learner a head start.
Rescans
  Previously seen examples can be rescanned when data arrives slowly enough or the dataset is small.

TESTS: CONFIGURATION
14 concepts
  Generated by random decision trees
  Number of leaves: 2.2k to 61k
  Noise level: 0 to 30%
50k examples for testing
Available memory: 40MB
Legacy processors

TESTS: SYNTHETIC DATA

(result plots on the synthetic data; figures not reproduced in this text version)

TESTS: SYNTHETIC DATA
Time consumption
  20m examples: VFDT takes 5752s to read and 625s to process.
  100k examples: C4.5 takes 36s; VFDT takes 47s.

TESTS: PARAMETERS

With vs. without over-pruning

TESTS: PARAMETERS
With ties vs. without ties
  65 nodes vs. 8k nodes for VFDT
  805 nodes vs. 8k nodes for VFDT-boot
  72.9% vs. 86.9% accuracy for VFDT
  83.3% vs. 88.5% accuracy for VFDT-boot

Recomputing G after every example vs. only every n_min examples
  VFDT: +1.1% accuracy, +3.8x time
  VFDT-boot: -0.9% accuracy, +3.7x time
  5% more nodes

TESTS: PARAMETERS
40MB vs. 80MB memory
  7.8k more nodes
  VFDT: +3.0% accuracy
  VFDT-boot: +3.2% accuracy
vs.
  30% fewer nodes
  VFDT: +2.3% accuracy
  VFDT-boot: +1.0% accuracy

TESTS: WEB DATA
For predicting web page accesses
  1.89m examples
  61.1% accuracy achievable by always predicting the most common class
  276,230 examples for testing

TESTS: WEB DATA
Decision stump
  64.2% accuracy
  1277s to learn
C4.5 with 40MB memory
  74.5k examples
  2975s to learn
  73.3% accuracy
VFDT bootstrapped with C4.5
  1.61m examples
  1450s to learn after initialization (983s to read)

TESTS: WEB DATA

MINING TIME-CHANGING DATA STREAMS

WHY IS VFDT NOT ENOUGH?
VFDT assumes the training data is a sample drawn from a stationary distribution.
Most large databases or data streams violate this assumption.
Concept drift: the data is generated by a time-changing concept function, e.g.
  Seasonal effects
  Economic cycles
Goal:
  Mine continuously changing data streams
  Scale well

WHY IS VFDT NOT ENOUGH?
Common approach: when a new example arrives, reapply a traditional learner to a sliding window of the w most recent examples.
Sensitive to window size:
  If w is small relative to the concept shift rate, the model is assured to reflect the current concept.
  But too small a w may leave too few examples to learn the concept.
If examples arrive at a rapid rate, or the concept changes quickly, the computational cost of reapplying a learner may be prohibitively high (see the sketch below).
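
A minimal sketch of that naive sliding-window baseline (the window size, the batch learner, and the deque-based bookkeeping are illustrative assumptions); it makes the cost visible: every arrival retrains from scratch over w examples:

from collections import deque

def sliding_window_learner(stream, train_batch_model, w=10000):
    # Naive baseline: retrain a batch learner on the last w examples at every arrival.
    window = deque(maxlen=w)
    for x, y in stream:
        window.append((x, y))
        model = train_batch_model(list(window))   # O(w) or worse work per example
        yield model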

CVFDT
CVFDT (Concept-adapting Very Fast Decision Tree learner)
Extends VFDT:
  Maintains VFDT's speed and accuracy
  Detects and responds to changes in the example-generating process

CVFDT (CONTD.)
With a time-changing concept, the current splitting attribute of some nodes may no longer be the best.
An outdated subtree may still be better than the best single leaf, particularly if it is near the root.
Grow an alternative subtree, with the new best attribute at its root, when the old attribute starts to look out-of-date.
Periodically use a batch of examples to evaluate the quality of the trees.
Replace the old subtree when the alternate one becomes more accurate.
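
The bookkeeping might look roughly like this sketch (the Node fields, accuracy helper, and new_leaf are illustrative assumptions, not CVFDT's actual code; should_split is from the earlier sketch):

def check_node_for_drift(node, delta):
    # If the split chosen at this node no longer looks best, start an alternate subtree.
    best_now = should_split(node.information_gains(), delta, node.n_seen)
    if best_now is not None and best_now != node.split_attribute and node.alternate is None:
        node.alternate = new_leaf()      # grows in parallel on the same incoming examples

def maybe_swap(node, validation_batch):
    # Periodically compare the old subtree with its alternate on recent examples
    # and keep whichever is more accurate.
    if node.alternate is None:
        return node
    if accuracy(node.alternate, validation_batch) > accuracy(node, validation_batch):
        return node.alternate
    return node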

HOW CVFDT WORKS

EXAMPLE

SAMPLE EXPERIMENT RESULT

CONCLUSION AND FUTURE WORK
CVFDT is able to keep a decision tree up-to-date with a window of examples, using only a small constant amount of time for each new example that arrives.
Empirical studies show that CVFDT can effectively keep its model up-to-date with a massive data stream, even in the face of large and frequent concept shifts.
Future work: CVFDT currently discards subtrees that are out-of-date, but some concepts change periodically and these subtrees may become useful again. Identifying these situations and taking advantage of them is another area for further study.

THANK YOU
