Jizhou Luo, Jianzhong Li, Hongzhi Wang, Yanqiu Zhang, and Kai Zhao
1 Introduction
Massive data arises in many applications. For example, the data generated by high-energy-density
physics experiments at the Lawrence Laboratory can reach 300 terabytes per year, and the business
data of a large telecommunication company grows by terabytes per month. Over the
last three decades, the data accumulated in demography, geology, meteorology and
the nuclear industry has even reached one petabyte.
Efficiency is the key issue in processing massive data, and many results on
the storage, management and query processing of massive data have been obtained.
Much work has focused on finding efficient algorithms and techniques for
massive data; other work adapts compression techniques to the storage,
management and query processing of massive data in databases, leading
to database compression techniques [1-9].
In general, massive relations in databases can be divided into two kinds. The first con-
sists of massive online relations (MONLRs) that are currently in frequent production
use; the goal of compressing MONLRs is to reduce the cost of
operations on these relations and to guarantee the performance of the database [1-9].
The second consists of massive offline relations (MOFLRs) that are no longer used fre-
quently but must be kept for a long time for some purpose. Because the
storage space of an online database is finite, MOFLRs have to be stored offline.
However, the types of operations on MOFLRs are few, their frequency is low, and
the offline data keeps accumulating over time. Two issues must therefore
be considered when compressing MOFLRs: reducing the
storage requirement as much as possible, and ensuring the performance of this limited
set of queries. To our knowledge, there is no literature on the compression of MOFLRs. In
industry, the usual method is to export the data from the database into text files,
compress them with tools such as WinZip, compress, etc., and store the compressed files
on tertiary storage. Before a query can be executed, these files must be decom-
pressed and imported back into the database, so queries on MOFLRs are cost-inten-
sive operations whose performance is unacceptable to most users. In this paper,
we propose a compression strategy for MOFLRs that overcomes these shortcomings.
* Supported by the National Natural Science Foundation of China under Grant No. 60273082.
Q. Li, G. Wang, and L. Feng (Eds.): WAIM 2004, LNCS 3129, pp. 634–639, 2004.
© Springer-Verlag Berlin Heidelberg 2004
The Compression of Massive Offline Relations
In contrast to MONLRs, the observed features of MOFLRs are as follows. First, the
sizes of MOFLRs are far larger and will grow rapidly. Second, only a few
MOFLRs grow rapidly over time. Third, the queries on MOFLRs are quite
different: there are far fewer query types and their frequencies are
also much lower. Finally, most attributes of MOFLRs do not appear in query condi-
tions.
To improve query performance, we classify the records of each
MOFLR into many subsets, where all records in a subset satisfy some common condi-
tions. Each subset can be compressed into files with any compression algorithm
and stored on tertiary storage. When a query is executed, we first find the
subsets involved and decompress them. The query can then be completed with one of two strate-
gies: either import the decompressed files back into the database and process
the query as usual, or filter the records in the decompressed files directly
against the conditions appearing in the query. The server of the management system
chooses the lower-cost strategy to complete the query.
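The partition-and-decompress strategy above can be sketched as follows; the record format, the predicate lambdas, and the use of zlib are illustrative assumptions (as the text notes, any compression algorithm can be plugged in), not the paper's actual implementation:

```python
# Sketch: partition records by the definite predicates they satisfy,
# compress each partition separately, and answer a query by decompressing
# only the partitions whose predicate signature can match its conditions.
import zlib

def partition_and_compress(records, predicates):
    """Group records by the tuple of definite predicates they satisfy,
    then compress each group independently (here with zlib)."""
    groups = {}
    for rec in records:
        signature = tuple(p(rec) for p in predicates)
        groups.setdefault(signature, []).append(rec)
    return {sig: zlib.compress("\n".join(recs).encode())
            for sig, recs in groups.items()}

def query(compressed, predicates, wanted):
    """Decompress only the groups whose signature satisfies every
    predicate index listed in `wanted`; return their records."""
    out = []
    for sig, blob in compressed.items():
        if all(sig[i] for i in wanted):
            out.extend(zlib.decompress(blob).decode().split("\n"))
    return out

# Hypothetical call records "name,duration_minutes".
records = ["alice,12", "bob,3", "carol,7", "dave,20"]
preds = [lambda r: int(r.split(",")[1]) >= 5,    # "lasts at least 5 minutes"
         lambda r: int(r.split(",")[1]) >= 10]   # "lasts at least 10 minutes"
store = partition_and_compress(records, preds)
print(sorted(query(store, preds, wanted=[1])))   # -> ['alice,12', 'dave,20']
```

Only the subsets whose signatures can satisfy the query's conditions are ever decompressed, which is the source of the cost savings claimed later.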
Additionally, the classification information, together with the small relations and the
relations of stable size, is stored in the database on the server of the system. This
arrangement further saves I/O costs between secondary storage and tertiary stor-
age. The architecture of the management system has three advan-
tages: data access is random to some extent, the I/O cost of each
query may be small, and the compression algorithm can be chosen flexibly.
Now we consider the classification of records. Since most types of queries on
MOFLRs are known before compression, the atomic predicates appearing in the WHERE
clauses of these queries can be extracted, and so can the predicates associated with a
specific relation R. These predicates fall into three kinds: predi-
cates used as join conditions; indefinite predicates, such as "user-name = ???",
whose values are not known until the query is executed; and definite predi-
cates, such as "a phone call lasts at least 5 minutes", which are known definitely
from the applications on the MOFLRs. It is hard to classify records according to the first
two kinds of predicates, so we use the third kind to classify the records
of R.
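The three-way split can be illustrated as follows; the dictionary representation of a predicate is a made-up stand-in for whatever the query parser would produce, not anything from the paper:

```python
# Illustrative sketch: split the atomic WHERE-clause predicates of the
# workload into the three kinds named above. Only the definite predicates
# are usable for classifying records.
def classify_predicates(predicates):
    joins, indefinite, definite = [], [], []
    for p in predicates:
        if p.get("join"):            # e.g. R.a = S.b between two relations
            joins.append(p)
        elif p.get("parameter"):     # value unknown until the query runs
            indefinite.append(p)
        else:                        # value fixed by the application
            definite.append(p)
    return joins, indefinite, definite

workload = [
    {"text": "R.caller = S.user_id", "join": True},
    {"text": "user_name = ???", "parameter": True},
    {"text": "duration >= 5"},       # definite: usable for classification
]
joins, indefinite, definite = classify_predicates(workload)
print([p["text"] for p in definite])  # -> ['duration >= 5']
```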
Suppose the set of definite predicates associated with R is PRED_R = {pred1,
pred2, …, predm}. These predicates are not independent, however. For example, if a record satis-
fies the predicate "a phone call lasts at least 10 minutes", then it also satisfies the predicate
"a phone call lasts at least 5 minutes". If every record satisfying predicate predi also
satisfies predicate predj, we denote this relation as predi ≤ predj. To reduce the
MAPPING. Repeat these operations until all records of R are processed. The algo-
rithm scans R in one pass, needs one I/O between disk and the tertiary storage and two
I/Os between memory and disk, which makes it slower than the
traditional method. However, since the compression is executed only once, this cost is worthwhile if
the performance of query processing is improved. Additionally, since m is very small,
there is no danger of running out of memory. Moreover, the compression ratio of the algo-
rithm depends on the compression algorithm chosen in step 2.2.2.1.
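Since the algorithm's pseudocode is not reproduced here, the following is only a sketch under assumptions of the one-pass buffered loop described above (the names SIZE and compress_one_pass, the byte-oriented record format, and zlib are all illustrative): each record is routed to the buffer of its subset, and a buffer that fills to SIZE bytes is compressed and appended to that subset's file, so R is scanned exactly once.

```python
# Sketch of the one-pass compression loop: signature -> buffer -> file.
import io
import zlib

SIZE = 1 << 20  # hypothetical 1 MB buffer per subset

def compress_one_pass(scan, predicates, subset_file):
    buffers = {}                                   # signature -> buffer
    for rec in scan:                               # single pass over R
        sig = tuple(p(rec) for p in predicates)
        buf = buffers.setdefault(sig, bytearray())
        buf.extend(rec.encode() + b"\n")
        if len(buf) >= SIZE:                       # flush a full buffer
            subset_file(sig).write(zlib.compress(bytes(buf)))
            buf.clear()
    for sig, buf in buffers.items():               # flush the remainders
        if buf:
            subset_file(sig).write(zlib.compress(bytes(buf)))

# Toy run with in-memory "tertiary storage" files, keyed by signature.
files = {}
long_call = lambda r: int(r.split(",")[1]) >= 5    # a definite predicate
compress_one_pass(["a,2", "b,9", "c,12"], [long_call],
                  lambda sig: files.setdefault(sig, io.BytesIO()))
print(zlib.decompress(files[(True,)].getvalue()).decode())
```

Because each subset file only ever receives appends, the single scan of R suffices, which matches the I/O accounting given in the text.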
Theorem. Tc / To ≤ 1 / 2^(m−d).
If the definite predicates in a query Q are only a small part of all the predicates used in compres-
sion, the performance of Q can be improved notably. However, if all definite predi-
cates in PRED_R appear in Q, then all data files have to be decompressed and the per-
formance is the same as that of the traditional method.
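Reading the theorem with m as the number of definite predicates used during compression and d as the number of them appearing in the query (which is how the surrounding text reads; this is an interpretation, since the theorem's proof is not reproduced here), the bound can be tabulated for the m = 6 predicates used in the experiments of Section 4:

```python
# Tabulate the bound Tc/To <= 1/2^(m-d) for m = 6 definite predicates:
# the fewer compression predicates a query mentions (small d), the smaller
# the fraction of compressed data it must touch.
m = 6
for d in range(m + 1):
    print(f"d = {d}: bound = 1/{2 ** (m - d)}")
```

Note that at d = m the bound degenerates to 1, consistent with the remark that a query mentioning every definite predicate performs no better than the traditional method.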
If there are join operations between pairs of relations in a query Q, we process Q in
four steps. First, find all compressed massive relations in Q. Second, extract the
definite predicates associated with each compressed massive relation from Q. Third,
use the method above to obtain intermediate results. Finally, load these intermediate results
into the database on secondary storage and execute Q.
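The four steps can be sketched in memory as follows; the data layout (relation name mapped to {signature: compressed blob}) and the execute_join callback are illustrative assumptions, not interfaces from the paper:

```python
# Sketch of the four join-processing steps for a query over compressed
# massive relations.
import zlib

def process_join(compressed_db, query_preds, execute_join):
    # Step 1: find the compressed massive relations referenced by the query.
    relations = [r for r in compressed_db if r in query_preds]
    intermediates = {}
    for rel in relations:
        # Step 2: the definite predicates associated with this relation
        # (here, indices into the signature tuple that must be true).
        preds = query_preds[rel]
        # Step 3: decompress only the matching subsets to get the
        # intermediate results.
        rows = []
        for sig, blob in compressed_db[rel].items():
            if all(sig[i] for i in preds):
                rows.extend(zlib.decompress(blob).decode().splitlines())
        intermediates[rel] = rows
    # Step 4: load the intermediate results and execute the join itself.
    return execute_join(intermediates)

db = {"calls": {(True,): zlib.compress(b"alice,12"),
                (False,): zlib.compress(b"bob,3")}}
print(process_join(db, {"calls": [0]}, lambda inter: inter))
```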
4 Experiments
We implemented our system; the experimental results are given here. All experiments were run
on a 2.0 GHz Intel Pentium processor with 256 MB of main memory running Linux 6.2.
The experiments were conducted on real data sets from a telecommunication provider;
the data set used is a 10 GB subset of the massive relation. According to the applica-
tion, we use six definite predicates. We examine the effect of the compression algorithm
on the compression ratio, the comparison of the costs of the traditional method and
our method, and the comparison of the performance of the two methods.
In step 2.2.2.1 of compress_M_R, we use adaptive Huffman coding, run-length
coding, LZW, and arithmetic coding to compress the data set, respectively. The com-
pression ratios are listed in Fig. 1; the horizontal axis denotes the value of the parameter
SIZE in the algorithm.
[Fig. 1. Compression ratios (vertical axis: ratio, 0.4–0.8) of the four algorithms for different values of SIZE]
We allocate a 2 MB buffer for each subset and run the compression algorithm. The
costs are listed in Table 1. We note that the runtime of the algorithm with SIZE = 1 MB or
2 MB is very close to that of the traditional method, because no
additional I/O between memory and disk is needed. Conversely, when additional I/O is
necessary, the runtime of our algorithm is longer. To save seek time when
accessing data files, the value of SIZE should be as large as possible, so we set
SIZE = 16 MB, the maximum value in our system.
After compressing the data set, we ran four queries on the compressed data. The run-
times of these queries are listed in Table 2. The third and last columns are the run-
times of our method with the two different strategies mentioned in Section 3.
5 Conclusions
We have noted that the compression of MOFLRs is quite different from the compres-
sion of massive online relations. Based on the characteristics of MOFLRs, we
have proposed a management system and designed a compression algorithm for
MOFLRs. The theoretical analysis and the experimental results indicate that our
method can improve the performance of query processing effectively, although the
cost of compression may increase.
References
1. Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. "Squeezing the most out of rela-
tional database systems". In Proc. of ICDE, page 81, 2000.
2. Wee K. Ng and Chinya V. Ravishankar. "Block-Oriented Compression Techniques for
Large Statistical Databases". IEEE Transactions on Knowledge and Data Engineering,
Vol. 8, March–April 1997.
3. M. A. Roth and S. J. Van Horn. "Database compression". SIGMOD Record, Vol. 22, No. 3,
September 1993.
4. T. Westmann, D. Kossmann, et al. "The Implementation and Performance of Compressed
Databases". SIGMOD Record, Vol. 29, No. 3, September 2000.
5. S. Babu, M. Garofalakis, and R. Rastogi. "SPARTAN: A Model-Based Semantic Compression
System for Massive Data Tables". In Proc. of ACM SIGMOD, May 2001.
6. Jianzhong Li, Doron Rotem, and Jaideep Srivastava. "Aggregation algorithms for very large
compressed data warehouses". In Proc. of VLDB, pages 651–662, 1999.
7. G. Ray, J. R. Haritsa, and S. Seshadri. "Database Compression: A Performance Enhancement
Tool". In Proc. of COMAD, Pune, India, December 1995.
8. S. J. O'Connell and N. Winterbottom. "Performing Joins without Decompression in a Com-
pressed Database System". SIGMOD Record, Vol. 32, No. 1, March 2003.
9. Meikel Poess and Dmitry Potapov. "Data Compression in Oracle". In Proc. of the 29th VLDB
Conference, 2003.