Presented by:
Yumou Wang
Dongyun Zhang
Hao Zhou
INTRODUCTION
The
By
Problem: when new examples arrive faster than they can be mined, the amount of unused data grows without bound as time progresses.
Today, dealing with these huge amounts of data in a responsible way is very important.
Mining these continuous data streams brings unique opportunities, but also new challenges.
BACKGROUND
Design
It
Usually,
CLASSIFICATION METHOD
Input:
Examples of the form (x, y), where y is the class label and x is the vector of attributes.
Output:
A model y = f(x) that predicts the classes y of future examples x with high accuracy.
DECISION TREE
One
PROBLEMS
Traditional decision tree learners have some problems.
Some of them assume that all training examples can be stored simultaneously in main memory.
Disadvantage: this limits the number of examples they can learn from.
Disk-based decision tree learners keep the examples on disk and read them repeatedly.
Disadvantage: this is expensive when learning complex trees.
HOEFFDING TREES
Designed
HOEFFDING BOUND
Definition: the Hoeffding bound is the statistical result used to decide how many examples n each node needs.
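Assume:
R: the range of the variable r
n: the number of independent observations of r
\bar{r}: the mean of those observations
Then, with probability 1 - \delta, the true mean of r is at least \bar{r} - \epsilon, where
\[
\epsilon = \sqrt{\frac{R^{2} \ln(1/\delta)}{2n}}
\]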
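A minimal Python sketch of the bound, assuming its standard form epsilon = sqrt(R^2 ln(1/delta) / (2n)), where delta is the allowed probability of error. The function names `hoeffding_bound` and `examples_needed` are hypothetical and chosen only for illustration; the second function simply solves the same formula for n.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon for a variable with range R after n observations.

    With probability 1 - delta, the true mean is at least the
    observed mean minus the returned epsilon.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Solve the bound for n: the smallest n whose epsilon is small enough."""
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    n = value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2)
    return math.ceil(n)


if __name__ == "__main__":
    # Illustrative values only: R = 1, delta = 1e-7.
    print(hoeffding_bound(1.0, 1e-7, 1000))
    print(examples_needed(1.0, 1e-7, 0.1))
```

Because epsilon shrinks with the square root of n, quadrupling the number of examples seen at a node halves the bound.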
MEMORY COST
d: number of attributes
c: number of classes
v: number of values per attribute
l: number of leaves in the tree
The memory cost for each leaf is O(dvc).
The memory cost for the whole tree is O(ldvc).
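To illustrate where the O(dvc) figure comes from, here is a hedged Python sketch. It assumes each leaf keeps one counter per (attribute, value, class) combination, and that attribute values and class labels are already encoded as small integers. The class `LeafCounts` and the helper `tree_counters` are hypothetical names, not part of any library.

```python
class LeafCounts:
    """Per-leaf counters: one per (attribute, value, class) triple (assumed layout)."""

    def __init__(self, d: int, v: int, c: int) -> None:
        self.d, self.v, self.c = d, v, c
        # counts[i][j][k]: examples seen with attribute i == value j and class k
        self.counts = [[[0] * c for _ in range(v)] for _ in range(d)]

    def update(self, x: list, y: int) -> None:
        """Record one example (x, y); x holds d integer-encoded attribute values."""
        for i, value in enumerate(x):
            self.counts[i][value][y] += 1

    def num_counters(self) -> int:
        """Number of counters held by this leaf: d * v * c."""
        return self.d * self.v * self.c


def tree_counters(l: int, d: int, v: int, c: int) -> int:
    """Counters held by a tree with l leaves: l * d * v * c."""
    return l * d * v * c


if __name__ == "__main__":
    leaf = LeafCounts(d=3, v=2, c=2)
    leaf.update([0, 1, 1], y=1)
    print(leaf.num_counters(), tree_counters(10, 3, 2, 2))
```

The memory used depends on the number of leaves and counters, not on how many examples have been seen, which is what makes this layout suitable for streams.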
ADVANTAGES OF HOEFFDING TREE
1.
Breaking ties
Reduce
waste
Useful under condition where
Use of
Split
Memory cleanup
Measurement
of
Clearance of least promising leaves
Option of enabling reactivation
early
Reduces memory consumption
Initialization
Can
Rescans
TESTS: CONFIGURATION
14 Concepts
Generated
TESTS: SYNTHETIC DATA
Time consumption
20m examples
100k examples
TESTS: PARAMETERS
W/ ties vs. w/o ties
65
vs.
VFDT:
40MB vs. 80MB memory
7.8k more nodes
VFDT: +3.0% accuracy
VFDT-boot: +3.2% accuracy
vs.
30% fewer nodes
VFDT: +2.3% accuracy
VFDT-boot: +1.0% accuracy
TESTS: WEB DATA
1.89m examples
Decision stump: 64.2% accuracy, 1277s to learn
C4.5: 74.5k examples, 2975s to learn, 73.3% accuracy
Concept
Goal:
Mining
If
CVFDT
VFDT
Maintain VFDT's speed and accuracy
Detect and respond to changes in the example-generating process
CVFDT (CONTD.)
With a time-changing concept, the current splitting attribute of some nodes may no longer be the best.
An outdated subtree may still be better than the best single leaf, particularly if it is near the root.
EXAMPLE
THANK YOU