HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG 167
Abstract— Logistic regression is an important technique for analyzing and predicting data with categorical attributes, and linear regression models are important techniques for analyzing and predicting data with numerical attributes. We propose a novel scheme that compresses the data in such a way that regression models can be constructed to answer any OLAP query without accessing the raw data. Through these regression models we develop new compressible measures for compressing and aggregating both numerical and categorical data effectively. Based on a first-order approximation to the maximum likelihood estimating equations, and on ordinary least squares, we develop a compression scheme that compresses each base cell into a small compressed data block carrying the essential information to support the aggregation models. Aggregation formulas for deriving high-level regression models from lower-level component cells are given. We prove that the compression is lossless in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches zero as the data size increases. The results show that the proposed compression and aggregation scheme makes OLAP of regression in a data cube feasible. Further, it supports real-time analysis of stream data, which can only be scanned once and cannot be permanently retained.
Index Terms— Data cubes, Aggregation, Compression, OLAP, Linear Regression, Logistic Regression.
—————————— ——————————
1 INTRODUCTION
modeled as a linear function of x = (x1, x2)^T. The regression coefficients β = (β0, β1, β2)^T are often estimated using maximum likelihood.

A key issue for the aggregation operation is: can we generate the high-level models without accessing the raw data? Since computing a regression model requires solving a nonlinear numerical optimization problem, solving such a problem from scratch over a large aggregated data set for each roll-up operation is computationally very expensive. It is far more desirable to derive high-level regression models from low-level model parameters without accessing the raw data. In this paper, we propose a compression scheme and its associated theory to support high-quality aggregation of regression models in a multidimensional data space. In the proposed approach, we compress each data segment by retaining only the model parameters and a small amount of auxiliary measures. We then develop an aggregation formula that allows us to reconstruct the regression models from partitioned segments with a small approximation error. The error is theoretically bounded and asymptotically convergent to zero as the sample size grows.

We propose the concept of regression cubes and two data-cell compression techniques, LCRN (Lossless Compression Representation for Numerical data) and LCRC (Lossless Compression Representation for Categorical data), to support efficient OLAP operations in regression cubes. LCRC is a compression technique for categorical data: it uses logistic regression analysis for compressing and aggregating data cubes that contain categorical data. LCRN is a compression technique for numerical data: it uses nonlinear regression analysis for compressing and aggregating data cubes that contain numerical data. Nonlinear regression models are effective for modeling and predicting numerical data.

2 LOSSLESS COMPRESSION AND AGGREGATION

We propose lossless compression techniques to support efficient computation of the OLS estimates for linear regression and the MLEs for logistic regression models in data cubes. We elaborate the notion of lossless as follows:

Definition 1. In data cube analysis, a cell function g is a function that takes the data records of any cell of arbitrary size as input and maps them into a fixed-length vector as output. That is,

g(c) = v, for any data cell c    (2)

where the output vector v has a fixed size.

Suppose that we have a regression model y = f(x, θ), where y and x are attributes and θ is the coefficient vector. Suppose ca is a cell aggregated from the component cells c1, ..., ck. We define a cell function g2 to obtain mi = g2(ci), i = 1, ..., k, and use an aggregation function g1 to obtain an estimate of the regression coefficients for ca by

θ̃ = g1(m1, ..., mk)    (3)

The dimension of mi is independent of the number of tuples in ci. We call mi a lossless compression representation (LCR) of the cell ci, i = 1, ..., k. We show that the difference between the estimates obtained from aggregating the linearized equations in the component cells and the OLS and MLE estimates in the aggregated cell approaches zero when the number of tuples in the component cells is sufficiently large. Further, the space complexity of an LCR is independent of the number of tuples.

3 LCRN: LCR FOR NUMERICAL DATA

3.1 Theory of GMLR

We now briefly review the theory of GMLR. Suppose we have n tuples in a cell: (xi, yi), for i = 1, ..., n, where xi^T = (xi1, xi2, ..., xip) contains the p regression dimensions of the ith tuple, and each scalar yi is the measure of the ith tuple. To apply multiple linear regression, from each xi we compute a vector of k terms ui:

ui = ( u0, u1(xi), ..., uk−1(xi) )^T = ( 1, ui,1, ..., ui,k−1 )^T    (4)

The first element of ui is u0 = 1, for fitting an intercept, and the remaining k − 1 terms uj(xi), j = 1, ..., k − 1, are derived from the regression attributes xi and are often written as ui,j for simplicity. uj(xi) can be any kind of function of xi: it could be as simple as a constant, or it could be a complex nonlinear function of xi.

The nonlinear regression function is defined as follows:

E(yi | ui) = η0 + η1 ui,1 + ... + ηk−1 ui,k−1 = η^T ui    (5)

where η = (η0, η1, η2, ..., ηk−1)^T is a k×1 vector of regression parameters. Writing y = (y1, y2, ..., yn)^T, U = (u1, ..., un)^T, and ui,j = uj(xi), we can now write the regression function as E(y | U) = Uη.

Definition 2. The OLS estimate of η is the argument that minimizes the residual sum of squares function RSS(η) = (y − Uη)^T (y − Uη). If the inverse of (U^T U) exists, the OLS estimates of the regression parameters are unique and are given by

η̂ = (U^T U)^{-1} U^T y    (6)

We only consider the case where the inverse of (U^T U) exists. If the inverse of (U^T U) does not exist, then the matrix (U^T U) is of less than full rank and we can always use a subset of the u terms in fitting the model, so that the reduced model matrix has full rank.

The memory size of U is nk, and the size of U^T U is k², where n is the number of tuples of a cell and k is the number of regression terms, which is usually a very small constant.
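To make Definition 2 concrete, eq. (6) can be evaluated directly from U and y. The following NumPy sketch fits a quadratic model (k = 3); the data, sizes, and coefficients are synthetic, invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cell: n tuples, k = 3 regression terms (intercept, x, x^2).
n = 1000
x = rng.uniform(-2, 2, size=n)
U = np.column_stack([np.ones(n), x, x ** 2])   # u0 = 1 fits the intercept
eta_true = np.array([1.0, -2.0, 0.5])
y = U @ eta_true + rng.normal(scale=0.1, size=n)

# OLS estimate, eq. (6): eta_hat = (U^T U)^{-1} U^T y.
# We only proceed when U^T U has full rank, as assumed in the text.
UtU = U.T @ U
assert np.linalg.matrix_rank(UtU) == UtU.shape[0]
eta_hat = np.linalg.solve(UtU, U.T @ y)

print(eta_hat)   # close to eta_true
```

In practice a QR- or SVD-based solver such as `np.linalg.lstsq` is numerically preferable; the normal-equations form is used here only because it mirrors eq. (6) term by term.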
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 5, MAY 2011, ISSN 2151-9617
For example, k is two for linear regression and three for quadratic regression. Therefore, the overall space complexity is O(n), linear in n.

3.2 LCR for Numerical Data of a Data Cell

We propose a compressed representation of data cells to support multidimensional GMLR analysis. The compressed information for the materialized cells will be sufficient for deriving the regression models of all other cells.

Definition 3. For multidimensional online regression analysis in data cubes, the Lossless Compression Representation for Numerical data (LCRN) of a data cell c is defined as the following set:

LCRN(c) = { η̂i | i = 0, ..., k−1 } ∪ { θij | i, j = 0, ..., k−1, i ≤ j }    (7)

where the η̂i are the OLS estimates of eq. (6) and the θij are the entries of the k×k matrix Θ = U^T U:

Θ = [ θ0,0    θ0,1    ...  θ0,k−1
      θ1,0    θ1,1    ...  θ1,k−1
      ...
      θk−1,0  θk−1,1  ...  θk−1,k−1 ]    (8)

Since Θ is symmetric, only the entries with i ≤ j need to be stored. We can write an LCRN in matrix form as LCRN = (η̂, Θ).

For the aggregation of numerical data, suppose LCRN(c1) = (η̂1, Θ1), LCRN(c2) = (η̂2, Θ2), ..., LCRN(cm) = (η̂m, Θm) are the LCRNs of the m component cells, respectively, and suppose LCRN(ca) = (η̂a, Θa) is the LCRN of the aggregated cell. Then LCRN(ca) can be derived from those of the component cells using the following equations:

a. η̂a = ( Σ_{i=1}^m Θi )^{-1} Σ_{i=1}^m Θi η̂i, and

b. Θa = Σ_{i=1}^m Θi.

4 LCRC: LCR FOR CATEGORICAL DATA

4.1 Logistic Regression Model

Suppose we have n independent observations (y1, x1), ..., (yn, xn), where yi is a binary variable assumed to have a Bernoulli distribution with parameter pi = P(yi = 1), and xi ∈ IR^d are some explanatory variables. An intercept term can easily be included by setting the first element of xi to be 1. Logistic regression models are widely used to model binary responses using the following formulation:

log( pi / (1 − pi) ) = β^T xi    (9)

where β ∈ IR^d are some unknown regression coefficients, often estimated using maximum likelihood. The maximum likelihood estimates (MLEs) are chosen so as to maximize the following likelihood function:

L(β) = Π_{i=1}^n pi^{yi} (1 − pi)^{1−yi}    (10)

Writing π(t) = e^t / (1 + e^t) for the logistic function, so that pi = π(β^T xi), the likelihood becomes

L(β) = Π_{i=1}^n π(β^T xi)^{yi} (1 − π(β^T xi))^{1−yi}    (12)

Maximizing L(β) is equivalent to maximizing the log-likelihood function

l(β) = Σ_{i=1}^n [ yi β^T xi + log(1 − π(β^T xi)) ]    (13)

The parameter that maximizes l(β) is the solution to the following likelihood equation:

l′(β) = Σ_{i=1}^n [ yi − π(β^T xi) ] xi = 0    (14)

The MLE, usually computed by the Newton-Raphson method, is strongly consistent under certain regularity conditions [3]; that is, it converges to the true parameter with probability one as n goes to infinity.
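For numerical data, the LCRN aggregation rules of Section 3.2 are exactly lossless: since Θi η̂i = Ui^T Ui (Ui^T Ui)^{-1} Ui^T yi = Ui^T yi, the aggregated estimate (Σ Θi)^{-1} Σ Θi η̂i coincides with the OLS estimate computed on the pooled raw data. A small NumPy check (synthetic data; the cell sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def lcrn(U, y):
    """Compress a cell into its LCRN (eta_hat, Theta) with Theta = U^T U."""
    Theta = U.T @ U
    eta_hat = np.linalg.solve(Theta, U.T @ y)
    return eta_hat, Theta

# Three component cells of different sizes, k = 2 terms (intercept + slope).
cells = []
for n in (50, 200, 1000):
    x = rng.normal(size=n)
    U = np.column_stack([np.ones(n), x])
    y = 3.0 + 2.0 * x + rng.normal(size=n)
    cells.append((U, y))

# Aggregate from the LCRNs only -- no access to the raw data.
lcrns = [lcrn(U, y) for U, y in cells]
Theta_a = sum(Theta for _, Theta in lcrns)                       # rule b.
eta_a = np.linalg.solve(Theta_a,
                        sum(Theta @ eta for eta, Theta in lcrns))  # rule a.

# Reference: OLS on the pooled raw data.
U_all = np.vstack([U for U, _ in cells])
y_all = np.concatenate([y for _, y in cells])
eta_full = np.linalg.solve(U_all.T @ U_all, U_all.T @ y_all)

print(np.allclose(eta_a, eta_full))  # True: the aggregation is exactly lossless
```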
The score function l′k(β) of a component cell ck can be approximated by its first-order Taylor expansion around the MLE β̂k of ck:

Fk(β) = −Ak (β − β̂k) = −Ak β + Ak β̂k    (15)

where

Ak = Σ_{j=1}^{nk} [ e^{β̂k^T xkj} / (1 + e^{β̂k^T xkj})² ] xkj xkj^T = Σ_{j=1}^{nk} π′(β̂k^T xkj) xkj xkj^T.

It is clear that Fk(β) depends only on Ak and β̂k, whose dimensions are independent of the number of data records in ck. We therefore solve the linearized equation Σ_{k=1}^K Fk(β) = 0 in the aggregated cell, instead of Σ_{k=1}^K l′k(β) = 0, by saving (β̂k, Ak) in each cell ck. Using this compression scheme, we can approximate the MLE of ca by the solution to

Fa(β) = Σ_{k=1}^K Fk(β) = Σ_{k=1}^K ( −Ak β + Ak β̂k ) = 0    (16)

which leads to

β̃a = ( Σ_{k=1}^K Ak )^{-1} Σ_{k=1}^K Ak β̂k    (17)

In addition, in the aggregated cell ca, we can also obtain its Aa from the Ak of the component cells by observing that

Aa = Σ_{k=1}^K Σ_{j=1}^{nk} π′(β̂k^T xkj) xkj xkj^T = Σ_{k=1}^K Ak.

To summarize, our asymptotically lossless compression technique can be described as follows:

Compression of LCRC: in each component cell ck, save

LCRC = (β̂k, Ak)    (18)

Aggregation of LCRC: calculate the aggregated LCRC (β̃a, Aa) using (16) and (17). Such a process can be used to aggregate base cells at the lowest level as well as cells at intermediate levels; for any non-base cell, β̃ is used in place of β̂k in its LCRC.

REFERENCES

[1] A. Agresti, Categorical Data Analysis, second ed. John Wiley & Sons, 2002.
[2] B. Chen, L. Chen, Y. Lin, and R. Ramakrishnan, "Prediction Cubes," Proc. 31st Int'l Conf. Very Large Data Bases (VLDB '05), pp. 982-993, 2005.
[3] K. Chen, I. Hu, and Z. Ying, "Strong Consistency of Maximum Quasi-Likelihood Estimators in Generalized Linear Models with Fixed and Adaptive Designs," The Annals of Statistics, vol. 27, pp. 1155-1163, 1999.
[4] Y. Chen, G. Dong, J. Han, J. Pei, B. Wah, and J. Wang, "Regression Cubes with Lossless Compression and Aggregation," IEEE Trans. Knowledge and Data Eng., vol. 18, pp. 1585-1599, 2006.
[5] R. Xi, N. Lin, and Y. Chen, "Compression and Aggregation for Logistic Regression Analysis in Data Cubes," IEEE Trans. Knowledge and Data Eng., vol. 21, no. 4, pp. 479-492, 2009.

K. Bhaskar Naik, M.Tech., is an Assistant Professor in the Department of Computer Science and Engineering at Sree Vidyanikethan Engineering College, Tirupathi, A.P., INDIA. He received B.Tech. and M.Tech. degrees in Computer Science from JNTU Hyderabad and JNTU Anantapur and is pursuing a Ph.D. His research interests are Data Mining, Knowledge Engineering, Image Processing, and Pattern Recognition.

M. Supriya, M.Tech., is a Teaching Assistant in the Department of Computer Science and Engineering at Sree Vidyanikethan Engineering College, Tirupathi, A.P., INDIA. She received an M.Tech. degree in Computer Science. Her academic interests are Data Mining and Computer Networks.

Ch. Prathima, M.Tech., is an Assistant Professor in the Department of Computer Science and Engineering at Sree Vidyanikethan Engineering College, Tirupathi, A.P., INDIA. She received an M.Tech. degree in Computer Science. Her academic interests are Computer Networks and Cloud Computing.
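As a concluding illustration, the LCRC compress-then-aggregate cycle of eqs. (16)-(18) can be sketched end to end. The data and cell partition below are synthetic, and `logistic_mle` is a plain Newton-Raphson solver standing in for any per-cell MLE routine:

```python
import numpy as np

rng = np.random.default_rng(3)

def pi(t):
    """Logistic function pi(t) = e^t / (1 + e^t)."""
    return 1.0 / (1.0 + np.exp(-t))

def logistic_mle(X, y, tol=1e-8, max_iter=50):
    """Per-cell MLE via Newton-Raphson on the score, eq. (14)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = pi(X @ beta)
        score = X.T @ (y - p)                      # l'(beta)
        hessian = -(X * (p * (1 - p))[:, None]).T @ X
        step = np.linalg.solve(hessian, score)
        beta -= step                               # Newton update
        if np.linalg.norm(step) < tol:
            break
    return beta

def lcrc(X, y):
    """Compress a component cell into LCRC = (beta_hat_k, A_k), eq. (18)."""
    beta_hat = logistic_mle(X, y)
    w = pi(X @ beta_hat)
    A = (X * (w * (1 - w))[:, None]).T @ X   # A_k = sum_j pi'(beta_hat^T x) x x^T
    return beta_hat, A

# Synthetic binary data split into four component cells.
beta_true = np.array([0.5, -1.0, 1.5])
cells = []
for n in (800, 1200, 1000, 1500):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = (rng.uniform(size=n) < pi(X @ beta_true)).astype(float)
    cells.append((X, y))

# Aggregate from the compressed cells only, eq. (17).
lcrcs = [lcrc(X, y) for X, y in cells]
A_a = sum(A for _, A in lcrcs)
beta_agg = np.linalg.solve(A_a, sum(A @ b for b, A in lcrcs))

# Reference: exact MLE on the pooled raw data.
beta_full = logistic_mle(np.vstack([X for X, _ in cells]),
                         np.concatenate([y for _, y in cells]))

print(np.abs(beta_agg - beta_full).max())  # small; shrinks as the cells grow
```

Unlike the numerical case, the logistic aggregation is only asymptotically lossless: the gap between `beta_agg` and `beta_full` is the bounded, vanishing error that Section 4 analyzes.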