Jizhou Luo, Jianzhong Li, Hongzhi Wang, Yanqiu Zhang, and Kai Zhao
1 Introduction
Massive data arises in many applications. For example, the data generated by high-energy-density
physics experiments at the Lawrence Laboratory can reach 300 terabytes per year, and the business
data of a large telecommunication company grows by terabytes per month. Over the
last three decades, the data accumulated in demography, geology, meteorology and
the nuclear industry has even reached one petabyte.
Efficiency is the key issue in processing massive data, and many results on
the storage, management and query processing of massive data have been obtained.
Much work has focused on finding efficient algorithms and techniques for
massive data; other work adapts compression techniques to the storage,
management and query processing of massive data in databases, leading
to database compression techniques [1-9].
In general, massive relations in databases can be divided into two kinds. The first con-
sists of massive online relations (MONLRs) that are currently in frequent production
use; the goal of compressing MONLRs is to reduce the cost of
operations on these relations and to guarantee the performance of the database [1-9].
The second consists of massive offline relations (MOFLRs) that are no longer used fre-
quently but must be kept for a long time for some purpose. Because the
storage space of an online database is finite, MOFLRs have to be stored offline.
However, the types of operations on MOFLRs are few, their frequency is low, and
the offline data keeps accumulating over time. Two issues must therefore
be considered when compressing MOFLRs: reducing the
storage requirement as much as possible, and ensuring the performance of this limited
set of queries. To our knowledge, there is no literature on the compression of MOFLRs. In
industry, the usual method is to export the data from the database into text files,
compress them with tools such as WinZip, compress, etc., and store the compressed files
on tertiary storage. Before a query can be executed, these files must be decom-
pressed and imported back into the database, so queries on MOFLRs are cost-inten-
sive operations whose performance is unacceptable to most users. In this paper,
we propose a compression strategy for MOFLRs that overcomes these shortcomings.
* Supported by the National Natural Science Foundation of China under Grant No. 60273082.
Q. Li, G. Wang, and L. Feng (Eds.): WAIM 2004, LNCS 3129, pp. 634–639, 2004.
© Springer-Verlag Berlin Heidelberg 2004
The Compression of Massive Offline Relations
In contrast to MONLRs, the observed features of MOFLRs are as follows. First, the
sizes of MOFLRs are far larger and will grow rapidly. Second, only a few
MOFLRs grow rapidly over time. Third, the queries on MOFLRs are quite
different: there are far fewer query types and their frequencies are
also much lower. Finally, most attributes of MOFLRs do not appear in query condi-
tions.
To improve query performance, we classify the records of each
MOFLR into many subsets, where all records in a subset satisfy some common condi-
tions. Each subset can be compressed into files with any compression algorithm
and stored on tertiary storage. When a query is executed, we first find the
subsets involved and decompress them. The query can then be completed with one of two strate-
gies: either import the decompressed files back into the database and process
the query as usual, or filter the records in the decompressed files directly
against the conditions appearing in the query. The server of the management system
chooses the lower-cost strategy to complete the query.
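The partition-and-decompress strategy above can be sketched as follows; the record format, the predicate lambdas, and the use of zlib are illustrative assumptions (as the text notes, any compression algorithm can be plugged in), not the paper's actual implementation:

```python
# Sketch: partition records by the definite predicates they satisfy,
# compress each partition separately, and answer a query by decompressing
# only the partitions whose predicate signature can match its conditions.
import zlib

def partition_and_compress(records, predicates):
    """Group records by the tuple of definite predicates they satisfy,
    then compress each group independently (here with zlib)."""
    groups = {}
    for rec in records:
        signature = tuple(p(rec) for p in predicates)
        groups.setdefault(signature, []).append(rec)
    return {sig: zlib.compress("\n".join(recs).encode())
            for sig, recs in groups.items()}

def query(compressed, predicates, wanted):
    """Decompress only the groups whose signature satisfies every
    predicate index listed in `wanted`; return their records."""
    out = []
    for sig, blob in compressed.items():
        if all(sig[i] for i in wanted):
            out.extend(zlib.decompress(blob).decode().split("\n"))
    return out

# Hypothetical call records "name,duration_minutes".
records = ["alice,12", "bob,3", "carol,7", "dave,20"]
preds = [lambda r: int(r.split(",")[1]) >= 5,    # "lasts at least 5 minutes"
         lambda r: int(r.split(",")[1]) >= 10]   # "lasts at least 10 minutes"
store = partition_and_compress(records, preds)
print(sorted(query(store, preds, wanted=[1])))   # -> ['alice,12', 'dave,20']
```

Only the subsets whose signatures can satisfy the query's conditions are ever decompressed, which is the source of the cost savings claimed later.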
Additionally, the classification information, together with the small relations and the
relations of stable size, is stored in the database on the server of the system. This
arrangement further saves I/O costs between secondary storage and tertiary stor-
age. The architecture of the management system has three advan-
tages: data access is random to some extent, the I/O cost of each
query may be small, and the compression algorithm can be chosen flexibly.
Now we consider the classification of records. Since most types of queries on
MOFLRs are known before compression, the atomic predicates appearing in the WHERE
clauses of these queries can be extracted, and so can the predicates associated with a
specific relation R. These predicates fall into three kinds: predi-
cates used as join conditions; indefinite predicates, such as "user-name = ???",
whose values are not known until the query is executed; and definite predi-
cates, such as "a phone call lasts at least 5 minutes", which are known definitely
from the applications on the MOFLRs. It is hard to classify records according to the first
two kinds of predicates, so we use the third kind to classify the records
of R.
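The three-way split can be illustrated as follows; the dictionary representation of a predicate is a made-up stand-in for whatever the query parser would produce, not anything from the paper:

```python
# Illustrative sketch: split the atomic WHERE-clause predicates of the
# workload into the three kinds named above. Only the definite predicates
# are usable for classifying records.
def classify_predicates(predicates):
    joins, indefinite, definite = [], [], []
    for p in predicates:
        if p.get("join"):            # e.g. R.a = S.b between two relations
            joins.append(p)
        elif p.get("parameter"):     # value unknown until the query runs
            indefinite.append(p)
        else:                        # value fixed by the application
            definite.append(p)
    return joins, indefinite, definite

workload = [
    {"text": "R.caller = S.user_id", "join": True},
    {"text": "user_name = ???", "parameter": True},
    {"text": "duration >= 5"},       # definite: usable for classification
]
joins, indefinite, definite = classify_predicates(workload)
print([p["text"] for p in definite])  # -> ['duration >= 5']
```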
Suppose the set of definite predicates associated with R is PRED_R = {pred1,
pred2, …, predm}. These predicates are not independent, however. For example, if a record satis-
fies the predicate "a phone call lasts at least 10 minutes", then it also satisfies the predicate
"a phone call lasts at least 5 minutes". If every record satisfying predicate predi also
satisfies predicate predj, we denote this relation as predi ≤ predj. To reduce the
MAPPING. Repeat these operations until all records of R are processed. The algo-
rithm scans R in one pass, needs one I/O between disk and the tertiary storage and two
I/Os between memory and disk, which makes it slower than the
traditional method. However, since the compression is executed only once, this cost is worthwhile if
the performance of query processing is improved. Additionally, since m is very small,
there is no danger of running out of memory. Moreover, the compression ratio of the algo-
rithm depends on the compression algorithm chosen in step 2.2.2.1.
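Since the algorithm's pseudocode is not reproduced here, the following is only a sketch under assumptions of the one-pass buffered loop described above (the names SIZE and compress_one_pass, the byte-oriented record format, and zlib are all illustrative): each record is routed to the buffer of its subset, and a buffer that fills to SIZE bytes is compressed and appended to that subset's file, so R is scanned exactly once.

```python
# Sketch of the one-pass compression loop: signature -> buffer -> file.
import io
import zlib

SIZE = 1 << 20  # hypothetical 1 MB buffer per subset

def compress_one_pass(scan, predicates, subset_file):
    buffers = {}                                   # signature -> buffer
    for rec in scan:                               # single pass over R
        sig = tuple(p(rec) for p in predicates)
        buf = buffers.setdefault(sig, bytearray())
        buf.extend(rec.encode() + b"\n")
        if len(buf) >= SIZE:                       # flush a full buffer
            subset_file(sig).write(zlib.compress(bytes(buf)))
            buf.clear()
    for sig, buf in buffers.items():               # flush the remainders
        if buf:
            subset_file(sig).write(zlib.compress(bytes(buf)))

# Toy run with in-memory "tertiary storage" files, keyed by signature.
files = {}
long_call = lambda r: int(r.split(",")[1]) >= 5    # a definite predicate
compress_one_pass(["a,2", "b,9", "c,12"], [long_call],
                  lambda sig: files.setdefault(sig, io.BytesIO()))
print(zlib.decompress(files[(True,)].getvalue()).decode())
```

Because each subset file only ever receives appends, the single scan of R suffices, which matches the I/O accounting given in the text.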
Theorem. Tc / To ≤ 1 / 2^(m−d).
If the definite predicates in a query Q are only a small part of all the predicates used in compres-
sion, the performance of Q can be improved notably. However, if all definite predi-
cates in PRED_R appear in Q, then all data files have to be decompressed and the per-
formance is the same as that of the traditional method.
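Reading the theorem with m as the number of definite predicates used during compression and d as the number of them appearing in the query (which is how the surrounding text reads; this is an interpretation, since the theorem's proof is not reproduced here), the bound can be tabulated for the m = 6 predicates used in the experiments of Section 4:

```python
# Tabulate the bound Tc/To <= 1/2^(m-d) for m = 6 definite predicates:
# the fewer compression predicates a query mentions (small d), the smaller
# the fraction of compressed data it must touch.
m = 6
for d in range(m + 1):
    print(f"d = {d}: bound = 1/{2 ** (m - d)}")
```

Note that at d = m the bound degenerates to 1, consistent with the remark that a query mentioning every definite predicate performs no better than the traditional method.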
If there are join operations between pairs of relations in a query Q, we process Q in
four steps. First, find all compressed massive relations in Q. Second, extract the
definite predicates associated with each compressed massive relation from Q. Third,
use the method above to obtain intermediate results. Finally, load these intermediate results
into the database on secondary storage and execute Q.
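The four steps can be sketched in memory as follows; the data layout (relation name mapped to {signature: compressed blob}) and the execute_join callback are illustrative assumptions, not interfaces from the paper:

```python
# Sketch of the four join-processing steps for a query over compressed
# massive relations.
import zlib

def process_join(compressed_db, query_preds, execute_join):
    # Step 1: find the compressed massive relations referenced by the query.
    relations = [r for r in compressed_db if r in query_preds]
    intermediates = {}
    for rel in relations:
        # Step 2: the definite predicates associated with this relation
        # (here, indices into the signature tuple that must be true).
        preds = query_preds[rel]
        # Step 3: decompress only the matching subsets to get the
        # intermediate results.
        rows = []
        for sig, blob in compressed_db[rel].items():
            if all(sig[i] for i in preds):
                rows.extend(zlib.decompress(blob).decode().splitlines())
        intermediates[rel] = rows
    # Step 4: load the intermediate results and execute the join itself.
    return execute_join(intermediates)

db = {"calls": {(True,): zlib.compress(b"alice,12"),
                (False,): zlib.compress(b"bob,3")}}
print(process_join(db, {"calls": [0]}, lambda inter: inter))
```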
4 Experiments
We implemented our system; the experimental results are given here. All experiments were run
on a 2.0 GHz Intel Pentium processor with 256 MB of main memory running Linux 6.2.
The experiments were conducted on real data sets from a telecommunication provider;
the data set used is a 10 GB subset of the massive relation. According to the applica-
tion, we use six definite predicates. We examine the effect of the compression algorithm
on the compression ratio, the comparison of the costs of the traditional method and
our method, and the comparison of the performance of the two methods.
In step 2.2.2.1 of compress_M_R, we use adaptive Huffman coding, run-length
coding, LZW, and arithmetic coding to compress the data set, respectively. The com-
pression ratios are listed in Fig. 1; the horizontal axis denotes the value of the parameter
SIZE in the algorithm.
[Fig. 1. Compression ratios (vertical axis: ratio, 0.4–0.8) of the four algorithms for different values of SIZE]
We allocate a 2 MB buffer for each subset and run the compression algorithm. The
costs are listed in Table 1. We note that the runtime of the algorithm with SIZE = 1 MB or
2 MB is very close to that of the traditional method, because no
additional I/O between memory and disk is needed. Conversely, when additional I/O is
necessary, the runtime of our algorithm is longer. To save seek time when
accessing data files, the value of SIZE should be as large as possible, so we set
SIZE = 16 MB, the maximum value in our system.
After compressing the data set, we ran four queries on the compressed data. The run-
times of these queries are listed in Table 2. The third and last columns are the run-
times of our method with the two different strategies mentioned in Section 3.
5 Conclusions
We have noted that the compression of MOFLRs is quite different from the compres-
sion of massive online relations. Based on the characteristics of MOFLRs, we
have proposed a management system and designed a compression algorithm for
MOFLRs. The theoretical analysis and the experimental results indicate that our
method can improve the performance of query processing effectively, although the
cost of compression may increase.
References
1. Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. "Squeezing the most out of rela-
tional database systems". In Proc. of ICDE, page 81, 2000.
2. Wee K. Ng and Chinya V. Ravishankar. "Block-Oriented Compression Techniques for
Large Statistical Databases". IEEE Transactions on Knowledge and Data Engineering,
Vol. 8, March–April 1997.
3. M. A. Roth and S. J. Van Horn. "Database compression". SIGMOD Record, Vol. 22, No. 3,
September 1993.
4. T. Westmann, D. Kossmann, et al. "The Implementation and Performance of Compressed
Databases". SIGMOD Record, Vol. 29, No. 3, September 2000.
5. S. Babu, M. Garofalakis, and R. Rastogi. "SPARTAN: A Model-Based Semantic Compression
System for Massive Data Tables". In Proc. of ACM SIGMOD, May 2001.
6. Jianzhong Li, Doron Rotem, and Jaideep Srivastava. "Aggregation algorithms for very large
compressed data warehouses". In Proc. of VLDB, pages 651–662, 1999.
7. G. Ray, J. R. Haritsa, and S. Seshadri. "Database Compression: A Performance Enhancement
Tool". In Proc. of COMAD, Pune, India, December 1995.
8. S. J. O'Connell and N. Winterbottom. "Performing Joins without Decompression in a Com-
pressed Database System". SIGMOD Record, Vol. 32, No. 1, March 2003.
9. Meikel Poess and Dmitry Potapov. "Data Compression in Oracle". In Proc. of the 29th VLDB
Conference, 2003.