Abstract—In this paper, a wavelet-based neural network is described. The structure of this network is similar to that of the radial basis function (RBF) network, except that here the radial basis functions are replaced by orthonormal scaling functions that are not necessarily radial-symmetric. The efficacy of this type of network in function learning and estimation is demonstrated through theoretical analysis and experimental results. In particular, it is shown that the wavelet network has universal and $L^2$ approximation properties and is a consistent function estimator. Convergence rates associated with these properties are obtained for certain function classes where the rates avoid the "curse of dimensionality." In the experiments, the wavelet network performed well and compared favorably to the MLP and RBF networks.

I. INTRODUCTION

DEVELOPING MODELS from observed data, or function learning, is a fundamental problem in many fields, such as statistical data analysis, signal processing, control, forecasting, and artificial intelligence. This problem is also frequently referred to as function estimation, function approximation, system identification, and regression analysis. There are two general approaches to function learning, namely, the parametric and the non-parametric approach. When the observed data is contaminated (such that it does not follow a pre-selected parametric family of functions or distributions closely) or when there are no suitable parametric families, the non-parametric approach provides more robust results and, hence, is more appropriate.

Recently, neural networks have become a popular tool in non-parametric function learning due to their ability to learn rather complicated functions. The multi-layer perceptron (MLP) [1], along with the back-propagation (BP) training algorithm, is probably the most frequently used type of neural network in practical applications. However, due to its multi-layered structure and the greedy nature of the BP algorithm, the training processes often settle in undesirable local minima of the error surface or converge too slowly. The radial basis function (RBF) network [2], as an alternative to the MLP, has a simpler structure (one hidden layer). With some preprocessing on the training data, such as clustering, the training of RBF networks can be much easier than that of MLP networks. Hence, there has been considerable interest in the implementation of RBF networks [3]–[5] (also see the references in [8]) and the theoretical analysis of their properties, such as approximation ability and convergence rates [6]–[8].

From the point of view of function representation, an RBF network is a scheme that represents a function of interest by using members of a family of compactly (or locally) supported basis functions. The locality of the basis functions makes the RBF network more suitable for learning functions with local variations and discontinuities. Furthermore, the RBF networks can represent any function that is in the space spanned by the family of basis functions. However, the basis functions in the family are generally not orthogonal and are redundant. This means that for a given function, its RBF network representation is not unique and is probably not the most efficient. In this work, we replace the family of basis functions for the RBF network by an orthonormal basis, namely, the scaling functions in the theory of wavelets. The resulting network, called the wavelet neural network,¹ provides a unique and efficient representation of the given function. At the same time, it preserves most of the advantages of the RBF network. The use of orthogonal wavelets in the network also facilitates the theoretical analysis of its asymptotic properties, such as universal approximation and consistency. The wavelet-based network, however, is generally not an RBF network since multi-dimensional scaling functions can be non-radial-symmetric.

The idea of using wavelets in neural networks has also been proposed recently by Zhang and Benveniste [9] and Pati and Krishnaprasad [10]. However, the wavelets they used are non-orthogonal and form frames rather than orthonormal bases.² In fact, the wavelet networks they studied can be considered as special cases of RBF networks. Nevertheless, the use of wavelet frames provides a good framework for determining the structure of the RBF network. Finally, while revising this paper, we became aware of Boubez and Peskin's work [11], [12] on orthogonal wavelet-based networks. While our network structure has some similarity to that of Boubez and Peskin's, our work differs from theirs in several aspects. First, our work is concerned more with the estimation of arbitrary $L^2$ functions (energy-finite and continuous or discontinuous) rather than those for pattern classification as in [11], [12]. Secondly, we looked into different ways of determining the network weights. Thirdly, we performed theoretical analysis of the asymptotic properties of the network.

Manuscript received October 9, 1993; revised October 13, 1994. J. Zhang was supported in part by the NSF under Grant DIRIS-9010601, and Y. Miao was supported by a grant from Modine Manufacturing Company. The associate editor coordinating the review of this paper and approving it for publication was Prof. J. N. Hwang.
J. Zhang is with the Department of Electrical Engineering and Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI 53201 USA.
G. G. Walter is with the Department of Mathematics, University of Wisconsin-Milwaukee, Milwaukee, WI 53201 USA.
Y. Miao is with the Marshfield Clinic, Marshfield, WI 54449 USA.
W. N. W. Lee is with the Modine Manufacturing Company, Racine, WI 53403 USA.
IEEE Log Number 9411190.

¹Although scaling functions are the ones actually used, the network is still called the wavelet network to emphasize its connection with the theory of orthonormal wavelets.
²Zhang and Benveniste mentioned orthogonal wavelets in their paper [9].
The rest of this paper is organized as follows: we briefly review the theory of orthogonal wavelets in Section II; the wavelet networks are described in Section III; their theoretical properties are analyzed in Section IV; and experimental results are given in Section V.
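To make the orthonormal-representation idea above concrete before the formal development, the sketch below projects a 1-D function onto an orthonormal family of dilated and translated scaling functions. It is a minimal illustration under stated assumptions: the Haar scaling function is used only because it has a simple closed form (the paper itself uses Lemarié-Meyer scaling functions), and the grid sizes are arbitrary.

```python
# Minimal sketch of an orthonormal scaling-function expansion (the idea behind
# the wavelet network's unique representation). Assumptions: the Haar scaling
# function stands in for the Lemarie-Meyer scaling function of the paper, and
# all grid sizes are arbitrary choices.
import numpy as np

def haar_phi(t):
    """Haar scaling function: 1 on [0, 1), 0 elsewhere."""
    return np.where((t >= 0.0) & (t < 1.0), 1.0, 0.0)

def phi_Mk(t, M, k):
    """Orthonormal dilation/translation: 2^(M/2) * phi(2^M * t - k)."""
    return 2.0 ** (M / 2.0) * haar_phi(2.0 ** M * t - k)

def expand(f, M, n_quad=4096):
    """Coefficients c_k = <f, phi_{M,k}> on [0, 1) by the midpoint rule."""
    t = (np.arange(n_quad) + 0.5) / n_quad
    return {k: np.mean(f(t) * phi_Mk(t, M, k)) for k in range(2 ** M)}

f = lambda t: np.sin(2 * np.pi * t) + (t > 0.5)   # smooth part plus a jump
M = 6
c = expand(f, M)

# Orthonormality makes the coefficients, and hence the representation, unique;
# there is no redundant family and no least-squares solve.
t = (np.arange(1000) + 0.5) / 1000
f_hat = sum(ck * phi_Mk(t, M, k) for k, ck in c.items())
print("max abs error:", np.max(np.abs(f(t) - f_hat)))
```

Because the family is orthonormal, each coefficient is an inner product with a single basis function; this is exactly the uniqueness and efficiency advantage over the redundant RBF representation discussed above.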
III. WAVELET NETWORKS

In this section, we first describe the structure and training of the wavelet network.

Fig. 2. The 3-layer wavelet network.

A. Wavelet Network

For the sake of simplicity, we first consider the one-dimensional (one-input) case. Let $f(t) \in L^2(\mathbb{R})$ be an arbitrary function. The problem of function learning can be stated as: given a set of training data
$$T_N = \{(t_i, f(t_i))\}_{i=1}^{N},$$
find an estimate of $f(t)$.
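The 3-layer structure of Fig. 2 can be sketched directly: a hidden layer whose $k$-th unit computes $\varphi(2^M t - k)$ and a linear output node. In the sketch below, a Gaussian bump stands in for the scaling function and the output weights are fit by ordinary least squares; both are assumptions for illustration, since the paper's own weight-estimation schemes, (14a) and (16), are not reproduced in this excerpt.

```python
# Sketch of the 3-layer wavelet network of Fig. 2: input t, a hidden layer of
# units phi(2^M * t - k), and a linear output node. Assumptions: a Gaussian
# bump stands in for the Lemarie-Meyer scaling function, and the output
# weights are fit by least squares rather than the paper's (14a)/(16).
import numpy as np

class WaveletNet1D:
    def __init__(self, M, k_min, k_max, phi):
        self.M, self.ks, self.phi = M, np.arange(k_min, k_max + 1), phi
        self.w = np.zeros(len(self.ks))              # output-layer weights

    def hidden(self, t):
        """Hidden-layer activations phi(2^M * t - k), one column per shift k."""
        t = np.atleast_1d(t)
        return self.phi(2.0 ** self.M * t[:, None] - self.ks[None, :])

    def fit(self, t_train, f_train):
        H = self.hidden(t_train)                     # N x (number of hidden nodes)
        self.w, *_ = np.linalg.lstsq(H, f_train, rcond=None)

    def __call__(self, t):
        return self.hidden(t) @ self.w               # linear output node

# usage: learn a function on [-1, 1] from i.i.d. uniform samples
rng = np.random.default_rng(0)
f = lambda t: np.sign(t) * np.sqrt(np.abs(t))        # sharp transition at 0
t_train = rng.uniform(-1.0, 1.0, 400)
net = WaveletNet1D(M=4, k_min=-20, k_max=20, phi=lambda u: np.exp(-u * u))
net.fit(t_train, f(t_train))
print("train RMSE:", np.sqrt(np.mean((net(t_train) - f(t_train)) ** 2)))
```

The hidden layer here happens to have 41 nodes (shifts $-20$ to $20$), the same network size used in the experiments reported later.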
Fig. 4. A composite network.

… "concentrated" in one or a small number of regions in $[0,1]^d$ whose dimensions are much less than $d$. Here, we assumed that the function in question is defined on $[0,1]^d$. To be more precise, in such cases the (high) $d$-dimensional function $f(x)$ can be approximated by
$$f(x) \approx \sum_{k=1}^{K} f_k(a_k^T x)$$
where each $f_k(\cdot)$ is a one-dimensional function, as in projection pursuit regression (PPR). This is the approach taken in an interesting paper by Hwang et al. [25], in which PPR was used for function learning in the context of neural networks. In their approach, $f_k(\cdot)$ was expanded in Hermite polynomials, which form an orthonormal basis but do not have local supports. Hwang et al. obtained good experimental results for two-dimensional continuous functions. Hence, we are optimistic about the wavelet-based PPR since wavelets are known to be better function estimators than polynomials (better local accuracy and faster convergence; see, e.g., [26], [27]). Finally, we notice that Hwang et al. stressed the need to test their PPR approach for high-dimensional functions.
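A small sketch of the ridge-function superposition just described may help: it evaluates $\sum_k f_k(a_k^T x)$ for a 10-dimensional input. The particular directions and profiles below are hypothetical; in the wavelet-based PPR, each $f_k$ would itself be learned by a one-dimensional wavelet network.

```python
# Sketch of the ridge-function superposition f(x) ~ sum_k f_k(a_k . x) used in
# PPR. The directions a_k and profiles f_k are hypothetical examples; in the
# wavelet-based PPR each f_k would itself be a 1-D wavelet network.
import numpy as np

def ppr_eval(x, directions, ridge_funcs):
    """Evaluate sum_k f_k(a_k . x) for points x of shape (N, d)."""
    out = np.zeros(x.shape[0])
    for a_k, f_k in zip(directions, ridge_funcs):
        out += f_k(x @ a_k)          # each term varies along a single direction
    return out

d = 10
directions = [np.eye(d)[0], np.ones(d) / np.sqrt(d)]  # two unit projection directions
ridge_funcs = [np.tanh, lambda u: np.exp(-u * u)]     # two 1-D profiles
x = np.random.rand(5, d)                              # points in [0, 1]^d
print(ppr_eval(x, directions, ridge_funcs))
```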
IV. PROPERTIES OF THE WAVELET NETWORK

In this section, we describe some theoretical results concerning the learning capability of the wavelet networks, i.e., how good they are as function estimators. These results include approximation ability and consistency, as well as the associated convergence rates.

The approximation ability of a neural network is concerned with the following question: what kind of functions can the neural network learn? To address this question, it is helpful to think of the neural network as a function approximation scheme (i.e., it approximates the function to be learned). A function approximation scheme, according to the definition given by Xu et al. [8], is a set of functions $\mathcal{F}$ defined on $\mathbb{R}^d$.
Theorem 2: For the wavelet network, the rates of convergence in the universal and $L^2$ approximation properties can be made arbitrarily rapid in the following sense: for any $\alpha > 0$, there is a Sobolev space $H^\beta$ such that for any $f \in H^\beta$, there exists a sequence of wavelet networks $f_n$, where $n = 2^M$, such that
$$\|f - f_n\|_u = O(n^{-\alpha})$$
and
$$\|f - f_n\|_{L^2} = O(n^{-\alpha}).$$
Here, $\|\cdot\|_u$ and $\|\cdot\|_{L^2}$ are the sup and $L^2$ norms, respectively.

The proof of Theorem 2 is given in Appendix A. Notice that in this theorem, we require that $f$ be smooth (i.e., in a Sobolev space). The degree of smoothness, $\beta$, generally depends on $\alpha$ and $d$. The larger $\alpha$ and $d$ are, the larger $\beta$ is required to be. This smoothness requirement allows Theorem 2 to avoid the "curse of dimensionality" in that, as long as the function is smooth enough, the convergence rates above are independent of $d$.

A function estimator, such as the wavelet network, is said to be consistent if it converges to the function to be estimated as the number of training samples approaches infinity. Let the function to be estimated be denoted by $f(x) \in L^2(\mathbb{R}^d)$. Assume, for the moment, that $f(x) \in V_M$ for some sufficiently large $M$. Then,
$$f(x) = \sum_k (f, \varphi_{M,k})\, \varphi_{M,k}(x). \quad (20a)$$
Let $\hat{f}_{M,N}(x)$ be a wavelet network estimate of $f(x)$ from training data. Then,
$$\hat{f}_{M,N}(x) = \sum_k \hat{c}_{N,k}\, \varphi_{M,k}(x) \quad (20b)$$
where the coefficients are obtained through the $d$-dimensional extension of either (14a) or (16). For a given sequence of training sets $T_N$, the wavelet network $\hat{f}_{M,N}$ is said to be $L^2$ consistent if $\|f - \hat{f}_{M,N}\| \to 0$ as $N \to \infty$, where the norm is the $L^2$ norm. Generally, since the training samples are obtained through random sampling or measurement, $T_N$ is a set of random variables. Hence, the $L^2$ convergence has to be in some well-defined probabilistic sense, e.g., almost surely, mean square, in probability, in distribution, etc. In this work, we consider the mean square sense. That is, $\hat{f}_{M,N}$ is consistent if
$$E[\|f - \hat{f}_{M,N}\|^2] \to 0 \quad \text{as } N \to \infty. \quad (21)$$
But
$$E[\|f - \hat{f}_{M,N}\|^2] = \sum_k E[((f, \varphi_{M,k}) - \hat{c}_{N,k})^2] = \sum_k E[(c_k^{(f)} - \hat{c}_{N,k})^2] \quad (22)$$
where $c_k^{(f)} = (f, \varphi_{M,k})$. Hence, if for all $k$, $\hat{c}_{N,k} \to c_k^{(f)}$ as $N \to \infty$ in the mean square sense, the wavelet network is consistent. Indeed, we have the following:

Theorem 3: Assume that in the training data set $T_N = \{(x_i, f(x_i))\}_{i=1}^{N}$, the $x_i$, $i = 1, \ldots, N$, are i.i.d. and uniformly distributed. Assume that the weights of the wavelet network of (20b), denoted by $\hat{c}_{N,k}$, are obtained by either (14a) or (16). Then, for any $k$ in the wavelet network of (20b),
$$\hat{c}_{N,k} \to c_k^{(f)} \quad (23)$$
in the mean square sense and, hence, the wavelet network of (20b) is $L^2$ consistent in the mean square sense. Furthermore, the rate of convergence in (23) is $O(1/N)$.

The proof of Theorem 3 is given in Appendix B. When $f \in L^2(\mathbb{R}^d)$ but $f \notin V_M$ for any $M$, we can let $N > n = 2^M$. Then, by using Theorems 1 and 3, the consistency result follows when we let $n \to \infty$. Specifically, according to Theorem 1, there exists a sequence $f_n \in V_M$ with $f_n \to f$ as $n \to \infty$. Similarly, according to Theorem 3, there exists a sequence $\hat{f}_{M,N}$ with $\hat{f}_{M,N} \to f_n$ as $N \to \infty$. Thus, for sufficiently large $n$ and $N$, we have
$$E[\|f - \hat{f}_{M,N}\|^2]^{1/2} \le \|f - f_n\| + E[\|f_n - \hat{f}_{M,N}\|^2]^{1/2} \le 2\epsilon. \quad (24)$$
That is, $\hat{f}_{M,N} \to f$ in the mean square sense.
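The $O(1/N)$ mean-square rate of Theorem 3 can be illustrated numerically. Since (14a) and (16) are not reproduced in this excerpt, the sketch below assumes a plausible sample-mean estimator $\hat{c} = \frac{1}{N}\sum_i f(t_i)\varphi(t_i)$ for $c = (f, \varphi)$ under i.i.d. uniform sampling on $[-1/2, +1/2]$; such an estimator is unbiased with variance decaying like $1/N$, matching the rate in (23). The basis function used is a generic unit-norm stand-in, not an actual scaling function.

```python
# Numerical illustration of the O(1/N) rate in Theorem 3. Assumptions: the
# sample-mean weight estimator below is a plausible stand-in for (16), which
# is not reproduced in this excerpt, and phi is a generic unit-norm basis
# function rather than an actual scaling function.
import numpy as np

rng = np.random.default_rng(1)
f = lambda t: np.cos(2 * np.pi * t)
phi = lambda t: np.sqrt(2.0) * np.cos(np.pi * t)      # unit L2 norm on [-1/2, 1/2]

tq = (np.arange(200000) + 0.5) / 200000 - 0.5         # fine midpoint grid
c_true = np.mean(f(tq) * phi(tq))                     # c = <f, phi>

for N in [100, 1000, 10000]:
    errs = []
    for _ in range(200):                              # Monte-Carlo trials
        t = rng.uniform(-0.5, 0.5, N)                 # i.i.d. uniform samples
        c_hat = np.mean(f(t) * phi(t))                # unbiased sample-mean estimate
        errs.append((c_hat - c_true) ** 2)
    print(N, np.mean(errs))                           # mean-square error ~ C/N
```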
D. Relation to RBF and Wavelet Frame Networks

The function represented by a $d$-input and one-output RBF network can be written as
$$g_n(x) = \sum_{k=1}^{n} c_k\, \phi(|x - r_k|_{\Sigma_k^{-1}}) \quad (25a)$$
where
$$|x - r_k|_{\Sigma_k^{-1}} = [(x - r_k)^T \Sigma_k^{-1} (x - r_k)]^{1/2}. \quad (25b)$$
Here, $\phi(\cdot)$ is a localized basis function, e.g., a Gaussian function, and $r_k$ and $\Sigma_k$ are the center and covariance matrix (spread).

For a given set of training data, the parameters of an RBF network, namely, $c_k$, $r_k$, and $\Sigma_k^{-1}$, can be found by minimizing the training error $e^2(f, g_n)$. However, due to the non-linear nature of $\phi(\cdot)$, the minimization often gets stuck in an undesirable local minimum, which results in useless parameter estimates. One practical solution to this problem is to estimate $r_k$ and $\Sigma_k^{-1}$ first by performing some sort of clustering procedure on the training data [3], [4]. Then, the $c_k$'s are found by minimizing the training error. This leads to a set of linear equations, similar to (14a) for the wavelet network.
However, unlike the case for the wavelet network, this set of linear equations usually does not have a unique solution, and therefore a least-squares procedure may have to be used. As a result, the $c_k$'s found in this way do not necessarily minimize the training error (although sometimes they may be good enough for approximating a given function).

When a wavelet frame (non-orthogonal wavelets) replaces the radial basis functions in an RBF network, $r_k$ and $\Sigma_k^{-1}$ are replaced by the shifts and scales of the wavelets. For a wavelet frame network, it is possible to determine the shifts and scales by using the theory of wavelets rather than by the more heuristic clustering procedures. On the other hand, due to the redundancy in the wavelet frame, the set of linear equations for the $c_k$ may still require a least-squares solution, as is the case for the RBF networks. The orthogonal wavelet network not only has the advantage of the wavelet frame network, but the equation for solving its weights (15a) also has a unique solution. Furthermore, it provides an efficient representation or approximation of a function due to the orthogonality of the scaling functions.

It is interesting to compare the theoretical results for the wavelet network with those for the RBF network. Both types of networks have universal and $L^2$ approximation properties.⁴ As for convergence rates, we can compare the results in Theorem 2 with those obtained for the RBF network by Corradi and White [6], Girosi and Anzellotti [7], and Xu et al. [8]. The convergence rates obtained by Girosi and Anzellotti for the uniform and $L^2$ convergence are both $O(n^{-1/2})$. These rates are independent of the dimension $d$ provided that some smoothness conditions similar to those of Theorem 2 are satisfied. Corradi and White and Xu et al. obtained similar rates for $L^2$ convergence, which are $O(n^{-1/(1+d/(2q))})$, where $q$ is the number of derivatives that $f$ has. These rates become approximately independent of $d$ for very large $q$. Compared to these results, the rates in Theorem 2 are somewhat better in that the exponent $\alpha$ can be made arbitrarily large. Notice here that the $n$ in the RBF network and the $n$ in the wavelet network are closely related. In the RBF network, $n$ is the number of hidden-layer nodes, while in the wavelet network, $n$ is the scale of the scaling functions. However, when the scheme of Section III-B for determining the number of hidden nodes is used, the number of nodes equals $2^M + p = n + p$.

As for statistical consistency, Xu et al. [8] have shown that the RBF network is $L^2$ and pointwise consistent in the sense of almost surely, in probability, and completely. We have shown in this work (Theorem 3 and the remarks that follow) that the wavelet network is $L^2$ consistent in the mean square sense, which implies consistency in probability.
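The uniqueness discussion above can be checked numerically: the Gram matrix of a redundant Gaussian RBF family is nearly singular, so the linear equations for the output weights call for a least-squares solution, whereas the Gram matrix of an orthonormal family is, up to quadrature error, the identity. In the sketch below, a Fourier-type orthonormal family is a stand-in for scaling functions, and all grid and width values are illustrative.

```python
# Numerical check of the uniqueness discussion in Section IV-D: a redundant
# Gaussian RBF family has a nearly singular Gram matrix (weights need least
# squares), while an orthonormal family has an essentially identity Gram
# matrix (unique weights). The sine family is a stand-in for scaling
# functions; all grid and width values are illustrative.
import numpy as np

t = np.linspace(-0.5, 0.5, 2001)
dt = t[1] - t[0]

# redundant Gaussian atoms with heavily overlapping supports
centers = np.linspace(-0.4, 0.4, 17)
G_rbf = np.exp(-((t[:, None] - centers[None, :]) / 0.2) ** 2)
gram_rbf = G_rbf.T @ G_rbf * dt

# orthonormal family: unit-norm sine atoms on [-1/2, 1/2]
ks = np.arange(1, 18)
G_orth = np.sqrt(2.0) * np.sin(np.pi * ks[None, :] * (t[:, None] + 0.5))
gram_orth = G_orth.T @ G_orth * dt

print("cond(Gram), Gaussian RBF :", np.linalg.cond(gram_rbf))   # astronomically large
print("cond(Gram), orthonormal  :", np.linalg.cond(gram_orth))  # close to 1
```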
V. EXPERIMENTAL RESULTS

The wavelet neural network described in Section III was tested in function learning simulations, where it was also compared with the MLP and RBF networks of similar complexity. Experiments have been performed for learning both low- and high-dimensional functions, as described below.

A. Learning One- and Two-dimensional Functions

Typical results for 1-D and 2-D functions are shown in Figs. 5–7. In Fig. 5, the wavelet network was used to learn two 1-D functions from i.i.d. uniformly distributed training data. The two functions are defined on $[-1, +1]$. The number of hidden nodes was determined to be 41 by using the procedure described in Section III-B. Here, the large number of hidden nodes, or fine scale, is due to the jumps and transitions (local high-frequency components) in the functions. This is consistent with the theory of wavelets. The output layer weights were determined by using the iterative solution of (15b) rather than by inverting a 41 × 41 matrix. The amount of computation is further reduced by noticing that the scaling functions used (namely, the Lemarié-Meyer scaling functions) have, for all practical purposes, compact support, as described in Section III-A. For both functions, the training process converged within 10 iterations.³ As shown in Fig. 5, the wavelet network captured the shapes of the functions accurately, except with some small ripples. These ripples can be attributed to the Gibbs phenomenon that exists in wavelet representations (similar to that in Fourier series representations). Indeed, it can be shown that the Gibbs phenomenon exists for the Lemarié-Meyer and the Daubechies wavelets, but not for the Haar wavelets [19]. However, the Lemarié-Meyer and Daubechies wavelets have many desirable properties that the Haar wavelets do not have (e.g., convergence properties and compact support in the frequency domain) and are more widely used. Hence, the Haar wavelets (or scaling functions) are not used in our experiments.

For comparison, we have also shown in Fig. 5 the results obtained by a 3-layer MLP (one hidden layer) with 41 hidden-layer nodes. This particular MLP was chosen so that it has comparable complexity to the wavelet network. After 300 iterations of training, the MLP captured the global shapes of the functions, but not the transitions in the functions. More training iterations (up to 10000) did not lead to significant improvement. It may be possible to improve the results by using more hidden layers; however, this will make the network more complex and longer to train.

Similar comparative experiments were also performed for the RBF network. The Gaussian function, which is a popular choice for RBF networks, was used as the basis function. Since the training data has a uniform distribution, clustering procedures would be ineffective in deciding the number of hidden-layer nodes. To keep the complexity of the Gaussian RBF network comparable to the wavelet network, the number of hidden-layer nodes was selected to be 41 (we also tried smaller numbers of hidden-layer nodes, e.g., 21 and 31, but 41 seemed to provide the best results). To illustrate how the centers and variances of the Gaussian basis functions affect the learning results, we performed the following training experiments:

1) Minimizing the training error with respect to the centers and variances of the Gaussian functions as well as the output layer weights,

³The number of iterations could be further reduced if the conjugate gradient method [23] is used since the training error surface is quadratic.
⁴For the universal and $L^2$ approximation properties of the RBF networks, see, e.g., Xu et al. [8].
Fig. 7. Two-dimensional function learning by the wavelet network: (a) 2-D function; (b) contour map of (a); (c) wavelet network approximation; (d) contour map of (c).

Fig. 9. Learning a 10-D continuous function: (a) training data; (b) wavelet net; (c) Gaussian RBF net; (d) MLP.

$$\left| \int_{\|z - t\| \le \delta} q_M(z, t)\, dz - 1 \right| < \epsilon. \quad (A6)$$
Thus
$$\int_{\|z - t\| \le \delta} q_M(z, t)\, dz \to 1, \quad \text{as } M \to \infty$$
uniformly on a bounded subset of $\mathbb{R}^d$;
(iii) for each $\gamma > 0$,
$$\sup_{\|z - t\| \ge \gamma} |q_M(z, t)| \to 0, \quad \text{as } M \to \infty.$$
The same argument used in [21] for the justification of the 1-D version of these conditions carries over to $\mathbb{R}^d$. From this, we can prove the uniform convergence for $f$ continuous on $\mathbb{R}^d$ with compact support. Indeed, we have
$$f_M(t) - f(t) = f(t)\left[\int_{\|z - t\| < \gamma} q_M(z, t)\, dz - 1\right] + \int_{\|z - t\| \le \gamma} q_M(z, t)[f(z) - f(t)]\, dz + \int_{\|z - t\| \ge \gamma} q_M(z, t) f(z)\, dz.$$

… where $\hat{f}(\omega)$ is the Fourier transform of $f(t)$. (For $d = 2$, this takes a particularly simple form.) We need the following inequality in order to reduce the $d$-dimensional case to one dimension:
$$\omega_1^2 + \omega_2^2 + \cdots + \omega_d^2 + d \ge [(\omega_1^2 + 1)(\omega_2^2 + 1) \cdots (\omega_d^2 + 1)]^{1/d}. \quad (A10)$$
This is obvious by induction, since
$$(\omega_1^2 + \omega_2^2 + \cdots + \omega_d^2 + d)^2 \ge (\omega_1^2 + \cdots + \omega_{d-1}^2 + d - 1)(\omega_d^2 + 1) \ge [(\omega_1^2 + 1) \cdots (\omega_{d-1}^2 + 1)]^{1/(d-1)}(\omega_d^2 + 1).$$
Thus, there is a constant $C_d > 0$ such that
$$\|\omega\|^2 + 1 \ge C_d \prod_{i=1}^{d} (\omega_i^2 + 1)^{1/d}.$$
The $L^2$ convergence on a bounded subset $U$ of $\mathbb{R}^d$ follows from the uniform convergence, since
$$\|f - f_M\|_{L^2(U)}^2 \le |U| \sup_{t \in U} |f(t) - f_M(t)|^2 \to 0.$$
APPENDIX B

In this Appendix, we prove Theorem 3, i.e., the consistency of the wavelet network. Here, the function to be approximated is assumed to be in $V_M$ for some $M$. As described in Section IV-C, all we need to show is that $\hat{c}_{N,k} \to c_k^{(f)}$ in the mean square sense as $N \to \infty$. For the sake of simplicity, we show this for the 1-D case. The extension to multiple dimensions is straightforward.

First, we assume that the $\hat{c}_{N,k}$'s are obtained by using (16) and, to emphasize this, they are denoted as $\hat{c}^{(2)}_{N,k}$, where the superscript 2 denotes "the second scheme for estimating the weights." Since the $t_i$, $i = 1, 2, \ldots, N$, are i.i.d. and uniformly distributed in $[-1/2, +1/2]$, (16) gives an explicit expression for $\hat{c}^{(2)}_{N,k}$; let the resulting network estimate
be denoted by $\hat{f}^{(2)}_{M,N}$. Then, the consistency of $\hat{f}^{(2)}_{M,N}$ is easy to show if we have (B4). To see (B4), notice that (14a), (15a), and (16) yield a relation between the two network estimates, where $\hat{f}^{(2)}_{M,N}$ denotes the wavelet network whose hidden-layer weights are obtained by (16); taking the expectation on both sides, we have (B6). However, notice that
$$E[\|f - \hat{f}^{(1)}_{M,N}\|^2] = E\left[\sum_{k=-K}^{K} (c_k^{(f)} - \hat{c}^{(1)}_{N,k})^2\right] \quad (B7)$$
and, similarly, that
$$E[\|f - \hat{f}^{(2)}_{M,N}\|^2] = E\left[\sum_{k=-K}^{K} (c_k^{(f)} - \hat{c}^{(2)}_{N,k})^2\right]. \quad (B8)$$
(B4) then follows.
If the training data is corrupted by additive white noise, it will be given by
$$T_N = \{(t_i, y_i)\}_{i=1}^{N}$$
where $y_i = f(t_i) + w_i$ and $w_i$ is the white noise. In this case, using the same procedure as above, it can be shown that the conclusion of Theorem 3 still holds. The only difference is that the variance of the approximation error in (B3) increases. Nevertheless, the increase is inside the braces (by a term corresponding to the noise variance); hence, the right-hand side of (B3) still converges to zero with rate $O(1/N)$.
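The remark above can be checked with the same assumed sample-mean estimator used in the earlier sketch: adding white observation noise $w_i$ leaves the estimator consistent and only inflates the constant of the $O(1/N)$ mean-square error.

```python
# Illustration of the noisy-data remark: with y_i = f(t_i) + w_i, the same
# assumed sample-mean estimator used earlier remains consistent; the white
# noise only inflates the constant of the O(1/N) mean-square error.
import numpy as np

rng = np.random.default_rng(2)
f = lambda t: np.cos(2 * np.pi * t)
phi = lambda t: np.sqrt(2.0) * np.cos(np.pi * t)
tq = (np.arange(200000) + 0.5) / 200000 - 0.5
c_true = np.mean(f(tq) * phi(tq))

for sigma in [0.0, 0.5]:                              # noise-free vs. noisy data
    for N in [1000, 10000]:
        errs = []
        for _ in range(200):
            t = rng.uniform(-0.5, 0.5, N)
            y = f(t) + sigma * rng.standard_normal(N) # white observation noise
            errs.append((np.mean(y * phi(t)) - c_true) ** 2)
        print(f"sigma={sigma}, N={N}, MSE={np.mean(errs):.2e}")   # still ~ C/N
```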
REFERENCES

[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing, J. L. McClelland and D. E. Rumelhart, Eds. Cambridge, MA: MIT Press, 1986.
[2] T. Poggio and F. Girosi, "Networks for approximation and learning," Proc. IEEE, vol. 78, pp. 1481-1497, Sept. 1990.
[3] J. Moody and C. J. Darken, "Fast learning in networks of locally-tuned processing units," Neural Computation, vol. 1, pp. 281-294, 1989.
[4] E. J. Hartman, J. D. Keeler, and J. M. Kowalski, "Layered neural networks with Gaussian hidden units as universal approximations," Neural Computation, vol. 2, pp. 210-215, 1990.
[5] B. Müller and J. Reinhardt, Neural Networks. Berlin, Germany: Springer-Verlag, 1991.
[6] V. Corradi and H. White, "Regularized neural networks: Some convergence rate results," preprint, 1993.
[7] F. Girosi and G. Anzellotti, "Rates of convergence for radial basis functions and neural networks," preprint, 1993.
[8] L. Xu, A. Krzyzak, and A. Yuille, "On radial basis function nets and kernel regression: Statistical consistency, convergence rates and receptive field size," preprint, 1993.
[9] Q. Zhang and A. Benveniste, "Wavelet networks," IEEE Trans. Neural Networks, vol. 3, Nov. 1992.
[10] Y. C. Pati and P. S. Krishnaprasad, "Analysis and synthesis of feedforward neural networks using discrete affine wavelet transformations," IEEE Trans. Neural Networks, vol. 4, Jan. 1993.
[11] T. Boubez and R. L. Peskin, "Wavelet neural networks and receptive field partitioning," preprint, 1993.
[12] T. Boubez, "Wavelet neural networks and receptive field partitioning," preprint, 1993.
[13] Y. Meyer, Wavelets: Algorithms and Applications, translated by R. D. Ryan. Philadelphia, PA: SIAM Press, 1993.
[14] I. Daubechies, Ten Lectures on Wavelets. Philadelphia, PA: SIAM Press, 1992.
[15] S. Mallat, "A theory for multiresolution signal decomposition: The wavelet transform," IEEE Trans. Pattern Anal. Machine Intell., vol. 11, pp. 674-693, July 1989.
[16] C. K. Chui, An Introduction to Wavelets. New York: Academic Press, 1992.
[17] J. Kovacevic and M. Vetterli, "Nonseparable multidimensional perfect reconstruction filter banks and wavelet bases for R^n," IEEE Trans. Inform. Theory, vol. 38, pp. 533-555, Mar. 1992.
[18] G. G. Walter, "Approximation of the delta function by wavelets," J. Approx. Theory, vol. 71, pp. 329-343, Dec. 1992.
[19] H. T. Shim, "Gibbs' phenomenon in wavelets and how to avoid it," preprint, 1993.
[20] R. Coifman, Y. Meyer, and M. V. Wickerhauser, "Wavelet analysis and signal processing," in Wavelets and Their Applications, M. B. Ruskai et al., Eds. Boston, MA: Jones & Bartlett, 1992.
[21] G. G. Walter, "Pointwise convergence of wavelet expansions," J. Approx. Theory, 1993.
[22] J. T. Tou and R. C. Gonzalez, Pattern Recognition Principles. Reading, MA: Addison-Wesley, 1979.
[23] D. G. Luenberger, Introduction to Linear and Nonlinear Programming. Reading, MA: Addison-Wesley, 1973.
[24] D. W. Scott, Multivariate Density Estimation. New York: Wiley, 1992.
[25] J.-N. Hwang, S.-R. Lay, M. Maechler, R. D. Martin, and J. Schimert, "Regression modeling in back-propagation and projection pursuit learning," IEEE Trans. Neural Networks, vol. 5, pp. 342-353, May 1994.
[26] G. G. Walter, Wavelets and Other Orthogonal Systems. Boca Raton, FL: CRC Press, 1994.
[27] S. E. Kelly, M. A. Kon, and L. A. Raphael, "Pointwise convergence of wavelet expansions," Bull. Amer. Math. Soc., vol. 30, pp. 87-94, 1994.

Jun Zhang (M'88) received the B.S. degree in electrical and computer engineering from Harbin Shipbuilding Engineering Institute, Harbin, Heilongjiang, China, in 1982, and was admitted to the graduate program of the Radio Electronic Department of Tsinghua University. After a brief stay there, he traveled to the U.S. on a scholarship from the Li Foundation, Glen Cove, NY. He received the M.S. and Ph.D. degrees in electrical engineering from Rensselaer Polytechnic Institute in 1985 and 1988, respectively.
He joined the faculty of the Department of Electrical Engineering and Computer Science of the University of Wisconsin, Milwaukee, and is currently an Associate Professor. His research interests include image processing, computer vision, signal processing, and digital communications.

Gilbert G. Walter received the B.S. degree in electrical engineering from New Mexico State University and the Ph.D. degree in mathematics from the University of Wisconsin, Madison, in 1962.
Since then, he has been a faculty member at the University of Wisconsin, Milwaukee, and is currently a Professor and Chair of the Department of Mathematical Sciences. His current research interests include special functions, generalized functions, sampling theorems, and wavelets.
Yubo Miao received the B.S. and M.S. degrees in computer science from the University of Minnesota in 1991 and the University of Wisconsin, Milwaukee, in 1993, respectively.
He was a research assistant in the development of neural networks for manufacturing quality control while obtaining the M.S. degree. Since 1993, he has been employed as a programmer/systems analyst at Marshfield Clinic and has worked in object-oriented design and programming. His interests include digital image and signal processing and visualization in scientific computing.

Wan Ngai Wayne Lee (M'93) received the B.S. degree in electrical engineering from the University of Wisconsin, Milwaukee, in 1991. He is currently working towards the M.S. degree in electrical engineering at the same institution.
His interests are in computer architecture, languages, operating systems, artificial intelligence, and signal processing.