JOURNAL OF COMPUTING, VOLUME 3, ISSUE 5, MAY 2011, ISSN 2151-9617
WWW.JOURNALOFCOMPUTING.ORG
Lossless Compression and Aggregation Analysis for Numerical and Categorical Data in Data Cubes

K. Bhaskar Naik, M. Supriya, Ch. Prathima, B. Ramakantha Reddy

————————————————
 K. Bhaskar Naik, M.Tech., Assistant Professor, Dept. of CSE, Sree Vidyanikethan Engg. College, Tirupati.
 M. Supriya, M.Tech., Teaching Assistant, Dept. of CSE, Sree Vidyanikethan Engg. College, Tirupati.
 Ch. Prathima, M.Tech., Assistant Professor, Dept. of CSE, Sree Vidyanikethan Engg. College, Tirupati.
 B. Ramakantha Reddy, M.Tech., Teaching Assistant, Dept. of CSE, Sree Vidyanikethan Engg. College, Tirupati.
————————————————

Abstract— Logistic regression is an important technique for analyzing and predicting data with categorical attributes, and linear regression models are important techniques for analyzing and predicting data with numerical attributes. We propose a novel scheme to compress the data in such a way that we can construct regression models to answer any OLAP query without accessing the raw data. Through these regression models, we develop new compressible measures for compressing and aggregating both numerical and categorical data effectively. Based on a first-order approximation to the maximum likelihood estimating equations and on ordinary least squares, we develop a compression scheme that compresses each base cell into a small compressed data block carrying the essential information to support the aggregation models. Aggregation formulas for deriving high-level regression models from lower-level component cells are given. We prove that the compression is lossless in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches zero as the data size increases. The results show that the proposed compression and aggregation scheme makes OLAP of regression feasible in a data cube. Further, it supports real-time analysis of stream data, which can only be scanned once and cannot be permanently retained.

Index Terms— Data cubes, Aggregation, Compression, OLAP, Linear Regression, Logistic Regression.

——————————  ——————————

1 INTRODUCTION

The fast development of OLAP technology has led to a high demand for more sophisticated data analysis capabilities, such as prediction, trend monitoring, and exception detection of multidimensional data. Oftentimes, existing simple measures such as sum() and average() become insufficient, and more sophisticated statistical models, such as regression analysis, are desired to be supported in OLAP. Moreover, there are many applications with dynamically changing stream data generated continuously in a dynamic environment, with huge volume, infinite flow, and fast-changing behavior. When collected, such data is almost always at a rather low level, consisting of various kinds of detailed temporal and other features. To find interesting or unusual patterns, it is essential to perform regression analysis at certain meaningful abstraction levels, discover critical changes in the data, and drill down to more detailed levels for in-depth analysis when needed.

In this paper, we propose two regression analysis models, logistic regression and non-linear regression, for analysis in data cubes, to handle categorical and numerical data, respectively. Logistic regression is an important statistical method for modeling and predicting categorical data. When we conduct logistic regression analysis in real-world data mining applications, we often encounter the difficulty of not having the complete data set in advance. It is often demanded to recover the logistic regression model of a large data set with access not to the raw data but only to sketchy information about divided chunks of the data set. Non-linear regression analysis is likewise an important statistical method for effectively modeling and predicting numerical data, which may be discrete or continuous.

Example 1: Suppose a nationwide bank wants to study the likelihood that customers apply for a new credit card. Suppose that for each day and for each regional branch of the bank, there is a data set containing (y1, x11, x12), (y2, x21, x22), ..., (yn, xn1, xn2), where n is the number of customers, xn1 represents the age of the nth customer, xn2 represents the account balance of the nth customer, and yn is a binary indicator of whether the customer applied for the new credit card (0 for no and 1 for yes). To model the relationship between credit card applications and user information, the bank manager can assume that the probability p of a customer applying for the new credit card depends on the customer's age x1 and account balance x2 as follows:

logit(p) = log(p / (1 − p)) = β0 + β1x1 + β2x2    (1)

The above model (1) is called a logistic regression model. After the logit transformation, logit(p) ranges over the entire real line, which makes it reasonable to model it as a linear function of x = (x1, x2)T. The regression coefficients β = (β0, β1, β2)T are often estimated using maximum likelihood.
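For illustration, the following Python sketch simulates one branch's data for Example 1 and fits model (1) by maximum likelihood. It is a minimal sketch, not part of the proposed scheme: the sample size, the coefficient values, and the use of the statsmodels package are our own illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000                                      # hypothetical number of customers

# Simulated branch data: x1 = age, x2 = account balance (in thousands)
age = rng.uniform(20, 70, n)
balance = rng.uniform(0, 50, n)
X = sm.add_constant(np.column_stack([age, balance]))   # prepend intercept column

# Assumed "true" coefficients (beta0, beta1, beta2), for the simulation only
beta_true = np.array([-4.0, 0.04, 0.05])
p = 1.0 / (1.0 + np.exp(-(X @ beta_true)))    # invert the logit in (1)
y = rng.binomial(1, p)                        # 1 = applied for the card, 0 = did not

fit = sm.Logit(y, X).fit(disp=0)              # maximum likelihood estimation
print(fit.params)                             # estimates close to beta_true
```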


A key issue for the aggregation operation is: can we generate the high-level models without accessing the raw data? Since computing a regression model requires solving a nonlinear numerical optimization problem, solving such a problem from scratch over a large aggregated data set for each roll-up operation is computationally very expensive. It is far more desirable to derive high-level regression models from low-level model parameters without accessing the raw data. In this paper, we propose a compression scheme and its associated theory to support high-quality aggregation of regression models in a multidimensional data space. In the proposed approach, we compress each data segment by retaining only the model parameters and a small number of auxiliary measures. We then develop an aggregation formula that allows us to reconstruct the regression models from partitioned segments with a small approximation error. The error is theoretically bounded and asymptotically convergent to zero as the sample size grows.

We propose the concept of regression cubes and two data cell compression techniques, LCRN (Lossless Compression Representation for Numerical data) and LCRC (Lossless Compression Representation for Categorical data), to support efficient OLAP operations in regression cubes. LCRC is a compression technique designed especially for categorical data: it uses logistic regression analysis to compress and aggregate data cubes containing categorical data. LCRN is a compression technique for numerical data: it uses non-linear regression analysis to compress and aggregate data cubes containing numerical data, since non-linear regression models are effective for modeling and predicting numerical data.

2 LOSSLESS COMPRESSION AND AGGREGATION

We propose lossless compression techniques to support efficient computation of the OLS estimates for linear regression and the MLEs for logistic regression models in data cubes. We elaborate the notion of lossless as follows:

Definition 1. In data cube analysis, a cell function g is a function that takes the data records of a cell of arbitrary size as input and maps them to a fixed-length vector as output. That is,

g(c) = v, for any data cell c    (2)

where the output vector v has a fixed size. Suppose that we have a regression model y = f(x, θ), where y and x are attributes and θ is the coefficient vector. Suppose ca is a cell aggregated from the component cells c1, ..., ck. We define a cell function g2 to obtain mi = g2(ci), i = 1, ..., k, and use an aggregation function g1 to obtain an estimate of the regression coefficients for ca by

θ̃ = g1(m1, ..., mk)    (3)

The dimension of mi is independent of the number of tuples in ci. We call mi a lossless compression representation (LCR) of the cell ci, i = 1, ..., k. We show that the difference between the estimates obtained by aggregating the linearized equations of the component cells and the OLS and MLE estimates in the aggregated cell approaches zero when the number of tuples in the component cells is sufficiently large. Further, the space complexity of an LCR is independent of the number of tuples.

3 LCRN: LCR FOR NUMERICAL DATA

3.1 Theory of GMLR

We now briefly review the theory of GMLR. Suppose we have n tuples in a cell: (xi, yi), for i = 1..n, where xiT = (xi1, xi2, ..., xip) are the p regression dimensions of the ith tuple, and each scalar yi is the measure of the ith tuple. To apply multiple linear regression, from each xi we compute a vector of k terms ui:

ui = (u0, u1(xi), ..., uk−1(xi))T = (1, ui,1, ..., ui,k−1)T    (4)

The first element of ui is u0 = 1, for fitting an intercept, and the remaining k − 1 terms uj(xi), j = 1..k − 1, are derived from the regression attributes xi and are often written as ui,j for simplicity. uj(xi) can be any kind of function of xi: it can be as simple as a constant, or it can be a complex nonlinear function of xi.

The nonlinear regression function is defined as follows:

E(yi | ui) = η0 + η1ui,1 + ... + ηk−1ui,k−1 = ηTui    (5)

where η = (η0, η1, η2, ..., ηk−1)T is a k × 1 vector of regression parameters. Writing y = (y1, y2, ..., yn)T, U = (u1, ..., un)T, and ui,j = uj(xi), we can now write the regression function as E(y | U) = Uη.

Definition 2. The OLS estimate of η is the argument that minimizes the residual sum of squares function RSS(η) = (y − Uη)T(y − Uη). If the inverse of (UTU) exists, the OLS estimates of the regression parameters are unique and are given by:

η̂ = (UTU)−1UTy    (6)

We only consider the case where the inverse of (UTU) exists. If the inverse of (UTU) does not exist, then the matrix (UTU) is of less than full rank and we can always use a subset of the u terms in fitting the model, so that the reduced model matrix has full rank.

The memory size of U is nk, and the size of UTU is k2, where n is the number of tuples of a cell and k is the number of regression terms, which is usually a very small constant independent of the number of tuples. For example, k is two for linear regression and three for quadratic regression. Therefore, the overall space complexity is O(n), linear in n.

3.2 LCR for Numerical Data of a Data Cell

We propose a compressed representation of data cells to support multidimensional GMLR analysis. The compressed information for the materialized cells will be sufficient for deriving the regression models of all other cells.

Definition 3. For multidimensional online regression analysis in data cubes, the Lossless Compression Representation for Numerical data (LCRN) of a data cell c is defined as the following set:

LCRN(c) = {η̂i | i = 0, ..., k − 1} ∪ {θij | i, j = 0, ..., k − 1, i ≤ j}    (7)

where

θij = Σh=1..n uh,i uh,j    (8)

It is useful to write an LCRN in the form of matrices. In fact, the elements of an LCRN can be arranged into two matrices, η̂ and Θ, where η̂ = (η̂0, η̂1, η̂2, ..., η̂k−1)T and

Θ = | θ0,0     θ0,1     ...  θ0,k−1   |
    | θ1,0     θ1,1     ...  θ1,k−1   |
    | ...      ...      ...  ...      |
    | θk−1,0   θk−1,1   ...  θk−1,k−1 |

that is, Θ = UTU. We can then write an LCRN in matrix form as LCRN = (η̂, Θ).

For aggregation of numerical data, suppose LCRN(c1) = (η̂1, Θ1), LCRN(c2) = (η̂2, Θ2), ..., LCRN(cm) = (η̂m, Θm) are the LCRNs of the m component cells, respectively, and suppose LCRN(ca) = (η̂a, Θa) is the LCRN of the aggregated cell. Then LCRN(ca) can be derived from those of the component cells using the following equations:

a. η̂a = (Σi=1..m Θi)−1 Σi=1..m Θi η̂i, and
b. Θa = Σi=1..m Θi

Note that since θij = θji and, thus, ΘT = Θ, we only need to store the upper triangle of Θ in an LCRN. Therefore, the size of an LCRN is S(k) = (k2 + 3k)/2. The following property of LCRNs indicates that this representation is economical in space and scalable for large data cubes: the size S(k) of the LCRN of a data cell is quadratic in k, the number of regression terms, and is independent of n, the number of tuples in the data cell.

4 LCRC: LCR FOR CATEGORICAL DATA

In this section, we review the theory of logistic regression and propose our lossless compression technique to support the construction of regression cubes over categorical data.

4.1 Logistic Regression Model

Suppose we have n independent observations (y1, x1), ..., (yn, xn), where yi is a binary variable assumed to have a Bernoulli distribution with parameter pi = P(yi = 1), and xi ∈ IRd are explanatory variables. An intercept term can easily be included by setting the first element of xi to 1. Logistic regression models are widely used to model binary responses using the following formulation:

log(pi / (1 − pi)) = βTxi    (9)

where β ∈ IRd are unknown regression coefficients, often estimated using maximum likelihood. The maximum likelihood estimates (MLEs) are chosen so as to maximize the following likelihood function:

L(β) = Πi=1..n pi^yi (1 − pi)^(1−yi)    (10)

According to (9), we have

pi = exp(βTxi) / (1 + exp(βTxi))    (11)

Writing π(t) = exp(t) / (1 + exp(t)), the likelihood function can thus be written as

L(β) = Πi=1..n π(βTxi)^yi (1 − π(βTxi))^(1−yi)    (12)

Maximizing L(β) is equivalent to maximizing the log-likelihood function

l(β) = Σi=1..n [yi βTxi − log(1 + exp(βTxi))]    (13)

The parameter that maximizes l(β) is the solution to the following likelihood equation:

l′(β) = ∂l(β)/∂β = Σi=1..n [yi − π(βTxi)] xi = 0    (14)

The MLE, typically computed with the Newton-Raphson method, is strongly consistent under certain regularity conditions; that is, it converges to the true parameter with probability one as n goes to infinity.
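Before turning to the formal scheme, the following numpy sketch shows the computation that Section 4.2 below builds on: each component cell solves the likelihood equation (14) by Newton-Raphson and keeps only the pair (β̂k, Ak), where Ak = Σj π′(β̂kTxkj) xkj xkjT; the aggregated estimate is then a weighted combination of the cell-level MLEs, anticipating equation (17). The data generation and the split into four cells are illustrative assumptions, not the original implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])  # first column = intercept
beta_true = np.array([-0.5, 1.0, -1.5])                         # assumed true coefficients
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))

def cell_mle(Xk, yk, iters=25):
    # Newton-Raphson solution of the likelihood equation (14)
    b = np.zeros(Xk.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(Xk @ b)))            # pi as in (11)
        score = Xk.T @ (yk - p)                        # l'(beta), eq. (14)
        info = (Xk * (p * (1.0 - p))[:, None]).T @ Xk  # sum of pi(1-pi) x x^T
        b += np.linalg.solve(info, score)
    p = 1.0 / (1.0 + np.exp(-(Xk @ b)))
    A = (Xk * (p * (1.0 - p))[:, None]).T @ Xk         # A_k evaluated at the cell MLE
    return b, A

# Compress each of four component cells into (beta_hat_k, A_k)
cells = np.split(np.arange(n), 4)
lcrc = [cell_mle(X[i], y[i]) for i in cells]

# Aggregate per (17): beta_a = (sum A_k)^(-1) sum A_k beta_hat_k
A_sum = sum(A for _, A in lcrc)
beta_agg = np.linalg.solve(A_sum, sum(A @ b for b, A in lcrc))

beta_full, _ = cell_mle(X, y)                          # MLE on the pooled raw data
print(np.abs(beta_agg - beta_full).max())              # small; shrinks as cells grow
```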

4.2 Compression and Aggregation Scheme

Denote the observations in the kth component cell ck by {(yk1, xk1), ..., (ykn, xkn)}, where each ykj is a binary categorical attribute and xkj is a d-dimensional vector of explanatory variables. For the logistic regression model in (9), we denote by β̂k the MLE of β based on the data cell ck. Therefore, β̂k is the solution to the likelihood equation

l′k(β) = ∂lk(β)/∂β = Σj=1..n [ykj − π(βTxkj)] xkj = 0

Its Taylor expansion at β̂k is given by

l′k(β) = Σj=1..n [−π′(β̂kTxkj)((β − β̂k)Txkj)] xkj + Σj=1..n [−(1/2) π″(ξTxkj)((β − β̂k)Txkj)2] xkj    (15)

where we have used the fact that l′k(β̂k) = 0, and ξ is some vector between β and β̂k given by Taylor's theorem. Let Fk(β) be the first-order approximation to the likelihood equation; it follows from (15) that

Fk(β) = Σj=1..n [−π′(β̂kTxkj)((β − β̂k)Txkj)] xkj = −Akβ + Akβ̂k

where

Ak = Σj=1..n π′(β̂kTxkj) xkj xkjT = Σj=1..n [exp(β̂kTxkj) / (1 + exp(β̂kTxkj))2] xkj xkjT

It is clear that Fk(β) depends only on Ak and β̂k, whose dimensions are independent of the number of data records in ck. By saving (β̂k, Ak) in cell ck, we can solve the linearized equation Σk=1..K Fk(β) = 0 in the aggregated cell instead of Σk=1..K l′k(β) = 0. Using this compression scheme, we can approximate the MLE of ca by the solution to

Fa(β) = Σk=1..K Fk(β) = Σk=1..K (−Akβ + Akβ̂k) = 0    (16)

which leads to

β̃a = (Σk=1..K Ak)−1 Σk=1..K Ak β̂k    (17)

In addition, in the aggregated cell ca, we can also obtain Aa from the Ak of the component cells by observing that

Aa = Σk=1..K Σj=1..n π′(β̂kTxkj) xkj xkjT = Σk=1..K Ak

To summarize, our asymptotically lossless compression technique can be described as follows. Compression: in each component cell ck, store

LCRC = (β̂k, Ak)    (18)

Aggregation: compute the aggregated LCRC (β̃a, Aa) from the component LCRCs using (16) and (17). Such a process can be used to aggregate base cells at the lowest level as well as cells at intermediate levels; for any non-base cell, β̃ is used in place of β̂k in its LCRC.

5 CONCLUSION

In this paper, we have developed a compression scheme that compresses a data cell into a compressed representation whose size is independent of the size of the cell. We have developed LCRN and LCRC, lossless compression techniques for the aggregation of linear and nonlinear regression parameters and of logistic regression parameters in data cubes, respectively, so that only a small number of data values (numerical as well as categorical), instead of the complete raw data, need to be registered for multidimensional regression analysis. Lossless aggregation formulae are derived based on the compressed LCRN and LCRC representations. The aggregation is efficient in terms of both time and space complexity. The proposed technique allows us to quickly perform OLAP operations and generate regression models at any level of a data cube without retrieving or storing the raw data. We are currently extending the technique to more general situations, such as quasi-likelihood estimation for generalized statistical models.

REFERENCES

[1] A. Agresti, Categorical Data Analysis, second ed. John Wiley & Sons, 2002.
[2] B. Chen, L. Chen, Y. Lin, and R. Ramakrishnan, "Prediction Cubes," Proc. 31st Int'l Conf. Very Large Data Bases (VLDB '05), pp. 982-993, 2005.
[3] K. Chen, I. Hu, and Z. Ying, "Strong Consistency of Maximum Quasi-Likelihood Estimators in Generalized Linear Models with Fixed and Adaptive Designs," The Annals of Statistics, vol. 27, pp. 1155-1163, 1999.
[4] Y. Chen, G. Dong, J. Han, J. Pei, B. Wah, and J. Wang, "Regression Cubes with Lossless Compression and Aggregation," IEEE Trans. Knowledge and Data Eng., vol. 18, pp. 1585-1599, 2006.
[5] R. Xi, N. Lin, and Y. Chen, "Compression and Aggregation for Logistic Regression Analysis in Data Cubes," IEEE Trans. Knowledge and Data Eng., vol. 21, no. 4, pp. 479-492, 2009.

K. Bhaskar Naik, M.Tech., is an Assistant Professor in the Department of Computer Science and Engineering at Sree Vidyanikethan Engineering College, Tirupathi, A.P., INDIA. He received B.Tech and M.Tech degrees in Computer Science at JNTU Hyderabad and JNTU Anantapur and is pursuing a Ph.D. His research interests are Data Mining, Knowledge Engineering, Image Processing, and Pattern Recognition.

M. Supriya, M.Tech., is a Teaching Assistant in the Department of Computer Science and Engineering at Sree Vidyanikethan Engineering College, Tirupathi, A.P., INDIA. She received an M.Tech degree in Computer Science. Her academic interests are Data Mining and Computer Networks.

Ch. Prathima, M.Tech., is an Assistant Professor in the Department of Computer Science and Engineering at Sree Vidyanikethan Engineering College, Tirupathi, A.P., INDIA. She received an M.Tech degree in Computer Science. Her academic interests are Computer Networks and Cloud Computing.

B. Ramakantha Reddy, M.Tech., is a Teaching Assistant in the Department of Computer Science and Engineering at Sree Vidyanikethan Engineering College, Tirupathi, A.P., INDIA. He received an M.Tech degree in Computer Science. His academic interests are Data Mining and Computer Networks.
