Vous êtes sur la page 1sur 4

A New Fast Algorithm for Constructing FP_tree

*


Zhenzhou Wang Jiaomin Liu Sheng Guo Lijuan Yang
College of Electrical
Engineering
College of Electrical
Engineering
.College of Information
Science and Engineering
Computer Science and
Engineering Department
North China Electric Power
University(Baoding)
North China Electric Power
University(Baoding)
HebeiUniversity of Science
and Technology
North China Institute of
Aerospace Engineering
Baoding, Hebei Province,
071003,China
Baoding, Hebei Province,
071003,China
Shijiazhuang, Hebei
Province,050054,China
Langfang,Hebei
Province,065000,China
chinawzz@126.com ljm6667@126.com gush_98@163.com ylj23@163.com


*
This ol `: :u|:`d`zd o_t o |`|` Dutmut o Commuu`ut`ou: b0b0!+.
Abstract- FP-tree is an efficient algorithm for generating
frequent item-sets and the present algorithms are all based on
FP_tree generally. But the FP_trees generative process needs
much time and it needs to scan database twice. In order to
improve the efficiency of constructing the FP_tree, a new fast
algorithm called Level FP_tree (abbreviate L-FP_tree) was
proposed. The algorithm contains two main parts. Firstly, it
scans database and generates equivalence classes of each item.
Secondly, it deletes the non-frequent items and rewrites the
equivalence classes and then constructs the L-FP_tree.



Key words: FP_tree; L-FP_tree; frequent pattern; equivalence
class

I. INTRODUCTION
Frequent pattern mining is the most important part of data
mining, and its also an essential part of some algorithms such
as association rules and sequential pattern mining. The most
classical Apriori algorithm
[1]
proposed by R.Srikant uses
generate-and-test method to generate and detect candidate
itemsets so it wastes a lot of time. The subsequent
algorithms
[2]
based on Apriori such as GSP
[3]
and PSP
[4]

make some improvements but they still can not solve the
problem of generating too much candidate itemsets. Later Han
jia-wei proposed the FP_growth algorithm. This algorithm
stores frequent itemsets by a condensed tree called FP_tree so
it improved speed greatly and doesnt generate candidate
itemsets. FreeSpan
[5]
, PrefixSpan
[6]
and FP_growth
[7]
are all
belong to it

. If the database was too large and the support
degree was too small, it would be wasteful to find out all the
frequent patterns. Consequently algorithms of mining the
maximal frequent pattern
[8]
and closed frequent pattern were
put forward in order to solve this problem. These algorithms
[9]

waste a lot of time, because they all based on the FP_tree and
they have to scan the database twice to generate an FP_tree.
Algorithm called FP_spilt
[10]
,

which scans the database only
one time, was proposed by Chin Feng Lee and Tsung-Hsien
Shen in order to solve this problem.
II. FP_TREE AND ALGORITHM BASED ON FP_TREE
A condensed data structure called FP_tree improves the
data mining speed greatly because it does not generate too
much candidate itemsets. Compared with Apriori algorithm
[11]
,
the performance of FP_tree has been improved greatly
because generative process of FP_tree only needs to scan the
database twice. The method to construct FP_tree is shown as
follows:

A. Algorithm 1 (FP_tree construction)
Input: A transaction database DB and min_sup
Output: FP_tree
Methord:
1) Scan DB once, collect the set of request items F and
their support degree. Arrange F in support degrees
descending order as L.
2) Create the root of an FP_tree T and label it as null,
for each transaction in DB do the following. Select and sort
the frequent items in transaction according to the order of L.
Let the sorted frequent item list in List, call insert_tree
([p|P],T)
3) Function insert_tree([p|P],T){
If T has a child N that N item_name =P.item_name
Then increment Ns count by 1
else do {
create a new node N;
Ns count=1;
Ns parent link be linked to T;
Ns node_link be linked to the nodes with the same
item_name via the node_link;
}
If p is nonempty
Call insert_tree(p,N);
}
B Algorithm based on FP_tree
Because of its high efficiency, many frequent pattern
mining algorithms are based on FP_tree such as close+
algorithm, TFP algorithm. For more specific steps, see the
reference.
III. CONSTRUCTING L-FP_TREE
According to the above analysis, the main point of
algorithms that based on FP_tree is constructing the FP_tree.
There are two steps to construct the FP_tree: the first is
scanning database to generate frequent itemsets, the second is
scanning the database for the second time to construct
978-1-4244-2114-5/08/$25.00 2008 IEEE. 6153
Proceedings of the 7th
World Congress on Intelligent Control and Automation
June 25 - 27, 2008, Chongqing, China
FP_tree. A method called L-FP_tree was introduced in this
text. It can generate FP_tree by scanning the database once.
L-FP_tree algorithm contains four parts:
1) scan database to generate equivalence classes of each
item;
2) calculate support degree of each item and get rid of the
infrequent items;
3) adjust the equivalence classes which have removed
infrequent items then readjust the equivalence classes in
support degree s descending order;
4) use the adjusted equivalence classes to construct
FP_tree .
Specific measures and implementation procedures are
shown as follows:
1) scan database to generate equivalence classes of each
item
Definition 1:

Each items equivalence classes was defined
as Ei= {(T
i
d, j ) |T
i
d} which is sequence flag, i is one item of
the sequence, j is the following item of i alphabetically},
suppose that the first item of each list is root.
Example 1: A database is shown in Fig.1, then
{(1, )(2, )(3, )(4, )(5, )}
root
E a c a a d = (1)
{(1, )(3, )(4, )(5, )}
a
E b b d d = (2)
{(1, )(3, )}
b
E e e = (3)
{(2, )}
c
E e = (4)
{(4, )(5, )}
d
E f f = (5)
{(1, )(2, )(3, )}
e
E f f f = (6)
{(1, )(2, )(3, )(4, )(5, )}
f
E null null null null null = (7)

Fig.1 Transaction Database
2) calculate support degree of each item and remove the
infrequent items
Definition 2| E
i
| is defined as the support degree of i.
Example 2: According to Definition2, the support degree
of a, b, c, d, e, f are | E
a
|=4, | E
b
|=2, | E
c
|=1, | E
d
|=2, | E
e
|=3, |
E
f
|=5 respectively. Suppose that the minimal support degree is
3 then the support degree of b, c and d cant satisfy the
requirement and should be removed. The remained items be
arranged in support degrees descending order and the result is
f: 5, a: 4, e: 3.
3) adjust the equivalence classes which have been
removed infrequent items then rearrange the equivalence
classes in support degree s descending order.
After removing the infrequent items, the rest items
behind frequent item in equivalence classes should be
rearranged.
Example 3: b is included in E
a
, if b was deleted it will be
replaced by e which is the item follows b.
{(1, )(3, )(4, )(5, )}
a
E e e d d = (8)
When c was deleted
{(1, )(2, )(3, )(4, )(5, )}
root
E a e a a d = (9)
When d was deleted:
{(1, )(3, )(4, )(5, )}
a
E e e f f = (10)
Then we should readjust each E
i
in descending order of
support degree. Start from root, its shown that the following
items of item that the sequential flag of all equivalence classes
is 1 are a, null, f and e. And the support degree of f is
maximal, so the following item in the first sequence of root is
f. For the same reason, all the results are:
{(1, )(2, )(3, )(4, )(5, )}
root
E a e a a d = (11)
{(1, )(3, )(4, )(5, )}
a
E e e f f = (12)
{(1, )(2, )(3, )}
e
E f f f = (13)
{(1, )(2, )(3, )(4, )(5, )}
f
E null null null null null = (14)
4) use the adjusted equivalence classes to construct
FP_tree
Definition 3:| E
ij
| is defined as the support degree of j in
item i.
Example 4: According to above definition:
| | 5
rootf
E = | | 4
fa
E = | | 1
fe
E = | | 2
ae
E =
| | 5
rootf
E = , so the following item which is the child
note of root is f. The support degree of f is 5, then the
constructed tree was shown in Fig. 2.
f
a
e
5
4
3
Root
f : 5

Fig.2 Constructed Tree
f
a
e
5
4
3
Root
f : 5
a: 4 e: 1
e: 2

Fig.3 Entire L-FP_tree
6154
And for the same reason, the L-FP_tree was constructed
and it was shown in Fig.3. And the head table and link was
also created.
Algorithm 2:(construct L-FP_tree)
Input: A transaction database DB and min_sup
Output: L-FP_tree
Method:
1) scan DB once, construct the equivalence class of
each item
2) count the support degree of each item
3) re-construct the equivalence class of each item
after removing the non-frequent item
4) re-construct the equivalence class of each item by
the order of descending support degree
5) function insert-tree (E
ij
) {
create i and then create j as is child;
the sustaining of i equal to (E
ij
);
call insert-tree (E
ij
);
finally create head table and the link of each
node be linked to the nodes with the same
item-name via the node link;
}
IV. TEST RESULTS
The algorithm that introduced in this text was realized in
c++ language and runs in Windows XP OS. The data set was
generated by Assoc.gen
[11]
. This new algorithms runtime in
different parameters are shown in Fig 4-6.
As shown in Fig 4, the L-FP_tree algorithm is more
effective than FP_tree as the support degree descending. If
support degree was smaller the algorithm was more effective.
The runtime of the algorithm introduced in the paper will not
ascend when the support degree descends because this
algorithm scans the database only once and it doesnt need to
search the head table repeatedly.
As shown in Fig 5, suppose that the minimal support
degree is 7%, the runtime of L-FP_tree is smaller than
FP_tree. One reason is that if the database is larger, the
runtime of FP_tree is longer. Another reason is that smaller
time is needed for L-FP_tree to search the head table to keep
the relation of each item.
As shown in Fig 6, suppose that the minimal support
degree is also 7%, the result is similar to Fig.5. The main
reason is that the L-FP_tree scans database only once and the
I/O cost is smaller than FP_tree.
0
50
100
150
200
2 4 6 8
1
0
Domain Value of Support Degree
E
x
e
c
u
t
i
o
n

T
i
m
e
(
s
)
LFPtree
FP_tree

Fig.4 Support degree
0
100
200
300
400
10 15 20 25
Average Size of
Transaction

Support
Degree is 7%

E
x
e
c
u
t
i
o
n

T
i
m
e
(
s
)
LFPtree
FP_tree

Fig. 5 Average Transaction
0
?00
+00
b00
800
!000
!00 ?00 300 +00 00
Domain Value of Support Degree
E
x
e
c
u
t
i
o
n

T
i
m
e
(
s
)
LFPtree
FP_tree

Fig. 6 Quantity of TransactionsK
.CONCLUSION
In this paper, a new method named L-FP_tree was
proposed to construct FP_tree. According to the test result, it
was proved to be more effective than FP_tree algorithm. The
main reason is that this algorithm scans database only once,
the second reason is that it doesnt need to filter and
recompose transaction record. Whats more, it doesnt need to
check head table and link repeatedly when a new node was
inserted.
REFERENCE
6155
[1] R.Agrawal, R.Srikant. Fast Algorithms for Mining Association Rules.
B.Bocca, M.Jarke, C.Zaniolo. Proc. of the 20th Int. Conf. on Very Large
Data Bases, Santiago, Chile, 1994. San Francisco, CA, Morgan
Kaufmann Publishers:487-499
[2] R.Agrawal, R.Srikant. Mining Sequential Patterns. P.Yu, A.Chen. Proc. of
the 11th Int. Conf. on Data Engineering, Taipei, Taiwan, 1995. Los
Alamitos, CA, IEEE Computer Society Press:3-14
[3] R.Srikant, R.Agrawal. Mining Sequential Patterns: Generalizations and
Performance Improvements. P.Apers, M.Bouzeghoub, G.Gardarin. Proc.
of the 5th Int. Conf. on Extending Database Technology, Avignon,France,
1996.London, SpringerVerlag:3-17
[4] F.Masseglia, F.Cathala, P.Poncelet. The PSP Approach for Mining
Sequential Patterns. J.Zytkow, M.Quafafou. Proc. of the 2nd European
Symposium on Principles of Data Mining and Knowledge Discovery,
Nantes, France, 1998:176-184
[5] J.Han, J.Pei, B.Mortazavi-Asl, et al. FreeSpan: Frequent Pattern-projected
Sequential Pattern Mining. R.Ramakrishnan, S.Stolfo, R.Bayardo, et al.
Proc. of the 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and
Data Mining, Boston, 2000. New York, ACM Press, 2000:355-359





















































[6] J.Pei, J.Han, H.Pinto, et al. PrefixSpan: Mining Sequential Patterns
Efficiently by Prefix-projected Pattern Growth. A.Buchmann,
D.Georgakopoulos. Proc. of the 17th Int. Conf. on Data Engineering,
Heidelberg, Germany, 2001. Los Alamitos, CA, IEEE Computer Society
Press, 2001:215-224
[7] J.Han, J. Pen, and Y.Yin. Mining Frequent Patterns without Candidate
Generation. Proc. 2000 ACM-SIMOD Int. May 2000.135-143
[8] Grahne G, Zhu JF. High performance mining of maximal frequent
itemsets. In : Pro. Of the 6
th
SIAM Intl Workshop on High Performance
Data Mining. 2003.
[9] N.Pasquier, Y.Bastide, R.Taouil, L. Lakhal. Discovering Frequent Closed
Itemsets for Association Rules. Proc. Intl Conf. Database Theory, pp.
398-416, Jan.1999.
[10] Chin-Feng Lee and Tsung-Hsien Shen. An FP_split method for fast
association rules mining.In: IEEE pp. 459-463. 2005.
[11]IBM Almaden Research Center, Quest synthetic data
generation,http://www.almaden.ibm.com/software/quest/Resources/datase
ts/syndata.htnl, 2005.

6156

Vous aimerez peut-être aussi