Abstract—Missing data creates various problems in analysing and processing data in databases. In this paper we introduce a new method aimed at approximating missing data in a database using a combination of genetic algorithms and neural networks. The proposed method uses a genetic algorithm to minimise an error function derived from an auto-associative neural network. Multi-Layer Perceptron (MLP) and Radial Basis Function (RBF) networks are employed to train the neural networks. We also investigate how accurately the proposed method predicts missing data as the number of missing cases within a single record increases. It is observed that there is no significant reduction in the accuracy of results as the number of missing cases in a single record increases. It is also found that results obtained using RBF are superior to those obtained using MLP.

TABLE I
A TABLE WITH MISSING VALUES

  x1     x2     x3     x4     x5
   ?    6.9    5.6     ?     0.5
  9.7    ?      ?    3000     ?

I. INTRODUCTION

Inferences made from available data for a certain application depend on the completeness and quality of the data being used in the analysis. Thus, inferences made from complete data are likely to be more accurate than those made from incomplete data. Moreover, there are time-critical applications which require us to estimate or approximate the values of some missing variables in relation to the values of other corresponding variables. Such situations may arise in a system which uses a number of instruments, where in some cases one or more of the sensors used in the system fail. In such a situation the value of the sensor has to be estimated within a short time and with great precision, taking into account the values of the other sensors in the system. Approximation of the missing values in such situations requires us to estimate the missing value taking into account the interrelationships that exist between the values of the other corresponding variables.

Missing data in a database may arise for various reasons: data entry errors, respondents' non-response to some items in the data collection process, failure of instruments, and so on. In Table I we have a database consisting of five variables, namely x1, x2, x3, x4 and x5, where the values for some variables are missing. Assume we have a database consisting of various records of the five variables, but some of the observations for some variables in various records are not available. How do we know the values for the missing entries? Are there ways to approximate the missing data depending on the interrelationships that exist between the variables in the database? Thus, the aim of this paper is to use neural networks and genetic algorithms to approximate the missing data in such situations.

II. BACKGROUND

A. Missing Data

Missing data creates various problems in many applications which depend on good access to accurate data. Hence, methods to handle missing data have been an area of research in statistics, mathematics and various other disciplines [1][2][3]. The reasonable way to handle missing data depends upon how data points become missing. According to [4] there are three types of missing data mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR) and non-ignorable. The MCAR situation arises if the probability of a missing value for variable X is unrelated to the value of X itself or to any other variable in the data set; that is, the absence of data does not depend on the variable of interest or on any other variable in the data set [3]. MAR arises if the probability of missing data on a particular variable X depends on other variables, but not on X itself. The non-ignorable case arises if the probability of missing data on X is related to the value of X itself even if we control for the other variables in the analysis [2]. Depending on the mechanism of missing data, various methods are currently used to treat missing data. For a detailed discussion of the various methods used to handle missing data refer to [3][2][4] and [5]. The method proposed in this paper is applicable to situations where the missing data mechanism is either MCAR, MAR or non-ignorable.
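The three mechanisms can be made concrete with a small simulation. The variables, missingness rates and thresholds below are illustrative assumptions for the sketch, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)   # variable of interest X
z = rng.normal(size=n)   # another, fully observed variable Z

# MCAR: P(X is missing) is a constant, unrelated to X or Z.
mcar = rng.random(n) < 0.2

# MAR: P(X is missing) depends on the observed Z, but not on X itself.
mar = rng.random(n) < np.where(z > 0, 0.35, 0.05)

# Non-ignorable (MNAR): P(X is missing) depends on the value of X itself.
mnar = rng.random(n) < np.where(x > 0, 0.35, 0.05)

# Under MAR, records with Z > 0 lose X far more often than the rest.
print(mar[z > 0].mean(), mar[z <= 0].mean())
```

Under MCAR, simple deletion of incomplete records is unbiased; under MAR and especially the non-ignorable case, the observed records are no longer representative, which is why model-based approximation is attractive.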
Radial Basis Function networks are feed-forward networks trained using a supervised training algorithm [7]. They are typically configured with a single hidden layer.

Fig. 2. RBF architecture
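A single-hidden-layer RBF forward pass can be sketched as follows. The Gaussian basis function, the common width, and the random centres and weights are illustrative assumptions; the layer sizes (14 inputs, 10 hidden units, 14 outputs) anticipate the architecture used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_forward(x, centres, width, W, b):
    """Single hidden layer of Gaussian basis functions,
    followed by a linear output layer."""
    # Squared distance from each input vector to each centre.
    d2 = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    h = np.exp(-d2 / (2 * width ** 2))   # hidden-layer activations
    return h @ W + b                     # linear output layer

n_in, n_hidden, n_out = 14, 10, 14
centres = rng.normal(size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_out))
b = np.zeros(n_out)

x = rng.normal(size=(3, n_in))           # a batch of 3 input vectors
y = rbf_forward(x, centres, 1.0, W, b)
print(y.shape)  # (3, 14)
```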
Genetic algorithms view learning as a competition among a population of evolving candidate problem solutions. A fitness function evaluates each solution to decide whether it will contribute to the next generation of solutions. Through operations analogous to gene transfer in sexual reproduction, the algorithm creates a new population of candidate solutions [14].

The three most important aspects of using genetic algorithms are [11][15]:
- definition of the objective function,
- definition and implementation of the genetic representation, and
- definition and implementation of the genetic operators.

GAs have been proved successful in optimization problems such as wire routing, scheduling, adaptive control, game playing, cognitive modeling, transportation problems, traveling salesman problems, optimal control problems, and database query optimization [11].

The following pseudo-code from [11] illustrates the high-level description of the genetic algorithm employed in the experiment. P(t) represents the population at generation t.

procedure genetic algorithm
begin
    t := 0
    initialise P(t)
    evaluate P(t)
    while (not termination condition) do
    begin
        t := t + 1
        select P(t) from P(t-1)
        alter P(t)
        evaluate P(t)
    end
end

Algorithm 1: Structure of genetic algorithm [11]

The MATLAB implementation of the genetic algorithm described in [15] has been used to implement the genetic algorithm. After executing the program with different genetic operators, the operators that gave the best results were selected for conducting the experiment.

III. METHOD

The neural network was trained to recall its own input (predict its input vector). Mathematically the neural network can be written as

    Y = f(X, W)                                  (1)

where Y is the output vector, X the input vector and W the vector of weights. Since the network is trained to predict its own input vector, the input vector X will be approximately equal to the output vector Y (X ≈ Y).

In reality the input vector X and output vector Y will not always be perfectly the same; hence, we will have an error function expressed as the difference between the input and output vectors. Thus, the error can be formulated as

    e = X - Y                                    (2)

Substituting the value of Y from (1) into (2) we get

    e = X - f(X, W)                              (3)

We want the error to be minimum and non-negative. Hence, the error function can be rewritten as the square of equation (3):

    e = (X - f(X, W))^2                          (4)

In the case of missing data, some of the values of the input vector X are not available. Hence, we can categorize the elements of the input vector X into known elements (X_k) and unknown elements (X_u). Rewriting equation (4) in terms of X_k and X_u we have

    e = ((X_k, X_u) - f((X_k, X_u), W))^2        (5)

To approximate the missing input values, equation (5) is minimized using a genetic algorithm. A genetic algorithm was chosen because it finds the global optimum solution. Since a genetic algorithm always finds the maximum value, the negative of equation (5) was supplied to the GA as a fitness function. Thus, the final error function minimized using the genetic algorithm is

    e = -((X_k, X_u) - f((X_k, X_u), W))^2       (6)

Fig. 3 depicts the graphical representation of the proposed model. The error function is derived from the input and output vectors obtained from the trained neural network. The error function is then minimized using the genetic algorithm to approximate the missing variables in the error function.

Fig. 3. Schematic representation of proposed model
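The whole procedure can be sketched end-to-end. This is a minimal illustration under two stated assumptions: the trained auto-associative network f(X, W) is replaced by a projection that encodes a made-up relationship between three hypothetical variables (x3 = x1 + x2), and the MATLAB GA of [15] is replaced by a toy real-valued GA. Neither is the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

# The projector P onto the subspace { x : x3 = x1 + x2 } stands in for a
# trained auto-associative network: it reproduces vectors consistent with
# the relationship and alters vectors that are not.
V = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]]).T              # basis of the valid subspace
P = V @ np.linalg.inv(V.T @ V) @ V.T

def f(x):
    """Stand-in for the trained network Y = f(X, W)."""
    return P @ x

x_full = np.array([1.0, 2.0, 3.0])             # true record; x3 = 3.0
unknown_idx = [2]                              # X_u: the missing entry

def error(xu):
    """Equation (5): squared reconstruction error of the assembled input."""
    x = x_full.copy()
    x[unknown_idx] = xu
    return float(np.sum((x - f(x)) ** 2))

# Minimal real-valued GA maximizing the fitness -error, as in equation (6).
pop = rng.uniform(-5.0, 5.0, size=(60, 1))     # candidate values for X_u
for generation in range(80):
    fitness = np.array([-error(ind) for ind in pop])
    parents = pop[np.argsort(fitness)[::-1][:30]]           # selection
    mates = parents[rng.permutation(30)]
    children = (parents + mates) / 2.0                      # crossover
    children += rng.normal(scale=0.1, size=children.shape)  # mutation
    pop = np.vstack([parents, children])                    # elitist update

best = pop[np.argmax([-error(ind) for ind in pop])]
print(best[0])   # close to 3.0, the held-out value
```

Because the reconstruction error vanishes exactly when the assembled vector satisfies the encoded relationship, maximizing the negated error drives the GA's estimate of the missing entry toward the value consistent with the observed entries.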
IV. RESULTS AND DISCUSSION

An MLP and an RBF network, each with 10 hidden neurons, 14 inputs and 14 outputs, were trained on data obtained from South African Breweries (SAB). A total of 198 training inputs were provided for each network architecture. Each element of the database was removed and approximated using the model. Cases of 1, 2, 3, 4 and 5 missing values in a single record were examined to investigate the accuracy of the approximated values as the number of missing cases within a single record increases. To assess the accuracy of the values approximated using the model, the standard error and correlation coefficient were calculated for each missing case.

We have used the following terms to measure the modeling quality: (i) standard error (Se) and (ii) correlation coefficient (r). For given data x_1, x_2, ..., x_n and corresponding approximated values x̂_1, x̂_2, ..., x̂_n, the standard error (Se) is computed as

    Se = sqrt( (1/n) * sum_{i=1}^{n} (x_i - x̂_i)^2 )

and the correlation coefficient (r) is computed as the Pearson correlation between the actual and approximated values,

    r = sum_{i=1}^{n} (x_i - x̄)(x̂_i - x̂̄) / sqrt( sum_{i=1}^{n} (x_i - x̄)^2 * sum_{i=1}^{n} (x̂_i - x̂̄)^2 )

where x̄ and x̂̄ denote the means of the actual and approximated values respectively.

TABLE II
CORRELATION COEFFICIENT

           Number of missing values
         1       2       3       4       5
MLP    0.94    0.939   0.939   0.933   0.938
RBF    0.968   0.969   0.970   0.970   0.968

TABLE III
STANDARD ERROR

           Number of missing values
         1       2       3       4       5
MLP    16.62   16.77   16.8    16.31   16.4
RBF    11.89   11.92   11.80   11.92   12.02
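Both quality measures can be computed directly from a set of actual and approximated values. The short vectors below are made-up illustrative numbers, not the SAB data.

```python
import numpy as np

x = np.array([10.0, 12.0, 9.0, 14.0, 11.0])      # actual values
x_hat = np.array([10.5, 11.5, 9.2, 13.4, 11.3])  # approximated values
n = len(x)

# Standard error: root mean squared difference between actual
# and approximated values.
se = np.sqrt(np.sum((x - x_hat) ** 2) / n)

# Correlation coefficient: Pearson correlation between actual
# and approximated values.
r = np.corrcoef(x, x_hat)[0, 1]

print(se, r)
```

A small Se together with an r close to 1 indicates that the approximations track the actual values closely, which is how Tables II and III should be read.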
TABLE IV
ACTUAL AND APPROXIMATED VALUES USING MLP

TABLE V
ACTUAL AND APPROXIMATED VALUES USING RBF
V. CONCLUSION
[7] S. Haykin. Neural Networks. New Jersey: Prentice-Hall, second ed., 1999.
[8] M. H. Hassoun. Fundamentals of Artificial Neural Networks. Cambridge, Massachusetts: MIT Press, 1995.
[9] I. T. Nabney. Netlab: Algorithms for Pattern Recognition. United Kingdom: Springer-Verlag, 2001.
[10] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford: Oxford University Press, 1995.
[11] Z. Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs. Berlin Heidelberg, NY: Springer-Verlag, third ed., 1996.
[12] S. Forrest. "Genetic algorithms." ACM Comput. Surv., vol. 28, no. 1, pp. 77-80, 1996.
[13] P. C. Pendharkar and J. A. Rodger. "An empirical study of non-binary genetic algorithm-based neural approaches for classification." In Proceedings of the 20th International Conference on Information Systems, pp. 155-165. Association for Information Systems, 1999.
[14] W. Banzhaf, P. Nordin, R. Keller, and F. Francone. Genetic Programming - An Introduction: On the Automatic Evolution of Computer Programs and Its Applications. California: Morgan Kaufmann Publishers, fifth ed., 1998.
[15] C. R. Houck, J. A. Joines, and M. G. Kay. "A genetic algorithm for function optimisation: a MATLAB implementation." North Carolina State University, 1995. http://www.ie.ncsu.edu/mirage/GAToolBox/gaot