Continuous neural networks

Nicolas Le Roux
joint work with Yoshua Bengio

April 5th, 2006
Snowbird 2006

Outline
Introduction
Going to infinity
Conclusion
Questions

Introduction

Usual Neural Networks

$\hat f(x_t) = \sum_j w_j \, g(v_j \cdot x_t) + b$

[Figure: a one-hidden-layer network; the inputs $x_{t,1}, \ldots, x_{t,d}$ feed hidden units $g(v_1 \cdot x_t), \ldots, g(v_n \cdot x_t)$, which are combined into the output.]

g is the transfer function: tanh, sigmoid, sign, ...

Neural networks are universal approximators

Hornik et al. (1989): multilayer feedforward networks with one hidden layer using arbitrary squashing functions are capable of approximating any function to any desired degree of accuracy, provided sufficiently many hidden units are available.

Neural networks with an infinite number of hidden units should thus be able to approximate every function arbitrarily well.

Going to infinity

A nice picture before the Maths

[Figure: a network with a continuum of hidden units $h_v$ indexed by the input weight v ($h_{v_1}, h_{v_2}, \ldots, h_{v_k}, \ldots$), connecting the inputs $x_{t,1}, \ldots, x_{t,d}$ to the outputs $O_1, \ldots, O_p$.]

Going to infinity

$\hat f(x) = \sum_j w_j \, g(v_j \cdot x) + b \qquad \longrightarrow \qquad \hat f(x) = \int_j w(j)\, g(v(j) \cdot x)\, dj + b$

Hornik's theorem tells us that any function f from $\mathbb{R}^d$ to $\mathbb{R}$ can be approximated arbitrarily well by:
1. a function from $\mathbb{R}$ to $\mathbb{R}$: w, the output weights function
2. a function from $\mathbb{R}$ to $\mathbb{R}^{d+1}$: v, the input weights function
3. a scalar: b, the output bias.

But a function from $\mathbb{R}$ to $\mathbb{R}^{d+1}$ is d + 1 functions from $\mathbb{R}$ to $\mathbb{R}$.

This is very similar to Kolmogorov's representation theorem.
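
To make the passage to the continuum concrete, here is a small numerical sketch (not part of the talk): it evaluates the usual finite sum and a Riemann-sum approximation of the integral formulation, with tanh units and j taken on a uniform grid. The function names and the choice of [0, 1] as the index interval are illustrative assumptions.

```python
import numpy as np

def f_discrete(x, V, w, b, g=np.tanh):
    """Usual network: f(x) = sum_j w_j * g(v_j . x) + b, with V of shape (n, d)."""
    return w @ g(V @ x) + b

def f_continuous(x, v_fn, w_fn, b, n_grid=1000, g=np.tanh):
    """Riemann-sum approximation of f(x) = int_j w(j) g(v(j) . x) dj + b,
    with j on a uniform grid over [0, 1] (the interval is an illustrative choice)."""
    js = np.linspace(0.0, 1.0, n_grid, endpoint=False) + 0.5 / n_grid
    total = sum(w_fn(j) * g(np.dot(v_fn(j), x)) for j in js)
    return total / n_grid + b

# Tiny usage example with smooth weight functions
x = np.array([0.2, -0.5, 1.0])
v_fn = lambda j: np.array([np.sin(2 * np.pi * j), np.cos(2 * np.pi * j), j])
w_fn = lambda j: 1.0 - j
print(f_continuous(x, v_fn, w_fn, b=0.1))
print(f_discrete(x, V=np.stack([v_fn(j) for j in (0.1, 0.5, 0.9)]),
                 w=np.array([0.3, 0.3, 0.3]), b=0.1))
```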

v for usual neural networks

[Figure: v(j) plotted as a step function of j; on the j-th step it takes the value $v_j$ (the input weight) over an interval of length $w_j$ (the output weight).]

$\hat f(x) = \int g(v(u) \cdot x)\, du + b = \sum_j w_j \, g(v_j \cdot x) + b$

Neural networks approximate v with a piecewise constant function.

V is a trajectory in $\mathbb{R}^d$ indexed by t

The function V is a trajectory in the space of all possible input weights.
Each point corresponds to an input weight associated to an infinitesimal output weight.
A piecewise constant trajectory only crosses a finite number of points in the space of input weights.
We could imagine trajectories that fill the space a bit more.

Piecewise affine approximations

[Figure: v(j) approximated by a continuous piecewise affine function; the breakpoints take the values $v_j$ and the pieces have lengths $w_j$.]

For g = tanh, each affine piece can be integrated in closed form:
$\hat f(x) = \sum_j \frac{w_j}{(v_j - v_{j-1}) \cdot x}\, \ln\!\left(\frac{\cosh(v_j \cdot x)}{\cosh(v_{j-1} \cdot x)}\right) + b$

Seeing v as a function, we could introduce smoothness with respect to j using constraints on successive values of v.
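
As a rough illustration of the piecewise-affine formula above, the following sketch evaluates the contribution of one affine piece for g = tanh using the log-cosh expression. The helper names, the degenerate-piece fallback and the array layout are my own assumptions, not code from the talk.

```python
import numpy as np

def log_cosh(z):
    """Numerically stable log(cosh(z))."""
    return np.logaddexp(z, -z) - np.log(2.0)

def affine_piece(x, v_prev, v_next, w, eps=1e-10):
    """Contribution of one affine piece of v, for g = tanh:
    w / ((v_next - v_prev) . x) * ln( cosh(v_next . x) / cosh(v_prev . x) ).
    When the denominator vanishes, the limit is w * tanh(v_prev . x)."""
    a, c = float(v_prev @ x), float(v_next @ x)
    if abs(c - a) < eps:
        return w * np.tanh(a)
    return w * (log_cosh(c) - log_cosh(a)) / (c - a)

def affine_network(x, V, w, b):
    """Piecewise-affine network: V has shape (n + 1, d) (breakpoints), w has shape (n,)."""
    return sum(affine_piece(x, V[j], V[j + 1], w[j]) for j in range(len(w))) + b

# Usage example
x = np.array([0.2, -0.5, 1.0])
V = np.array([[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 1.0, 0.5]])
print(affine_network(x, V, w=np.array([0.7, -0.3]), b=0.0))
```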

Rate of convergence

$\left| f(x) - \hat f(x) \right| \;\le\; \frac{1}{2a} \int \left| \big(v(t) - \hat v(t)\big) \cdot x \right| \, dt$

A good approximation of v yields a good approximation of f.
The trapezoid rule (continuous piecewise affine functions) has a faster convergence rate than the rectangle rule (piecewise constant functions).

Theorem: rate of convergence of affine neural networks
Affine neural networks converge in $O(n^{-2})$ whereas usual neural networks converge in $O(n^{-1})$ (when n grows to infinity).
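
The convergence claim mirrors the classical rectangle-versus-trapezoid comparison in numerical integration. The sketch below is a generic illustration, not code from the talk: it compares the left-endpoint rectangle rule and the trapezoid rule on a smooth one-dimensional integrand, whose errors shrink roughly like 1/n and 1/n^2, matching the rates quoted in the theorem.

```python
import numpy as np

def rectangle_rule(f, a, b, n):
    """Left-endpoint rectangle rule (a piecewise constant approximation); error ~ 1/n."""
    xs = np.linspace(a, b, n, endpoint=False)
    return f(xs).sum() * (b - a) / n

def trapezoid_rule(f, a, b, n):
    """Trapezoid rule (a continuous piecewise affine approximation); error ~ 1/n^2."""
    xs = np.linspace(a, b, n + 1)
    ys = f(xs)
    return (b - a) / n * (0.5 * ys[0] + ys[1:-1].sum() + 0.5 * ys[-1])

exact = np.sin(1.0)   # integral of cos over [0, 1]
for n in (10, 100, 1000):
    print(n,
          abs(rectangle_rule(np.cos, 0.0, 1.0, n) - exact),
          abs(trapezoid_rule(np.cos, 0.0, 1.0, n) - exact))
```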

Piecewise affine approximations (again)

[Figure: the piecewise affine approximation of v(j), shown again.]

A few remarks on the optimization of the input weights

Having a complex function V requires lots of pieces.
Without constraints, having many pieces will lead us nowhere.
Maybe we could use other parametrizations inducing constraints on the pieces.
Instead of optimizing each input weight v(j) independently, we could parametrize them as the output of a neural network.

Input weights function as the output of a neural network

[Figure: an example of a smooth input weights function v(j) produced by a small network.]

$v(j) = \sum_k w_{v,k} \, g(v_{v,k}\, j + b_{v,k})$

Setting a prior on the parameters of that network induces a prior on v.
Such priors include the commonly used Gaussian prior.
The prior over $v_{v,k}$ and $b_{v,k}$ determines the level of dependence between the j's.
The prior over $w_{v,k}$ determines the amplitude of the v(j)'s.
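
A minimal sketch of this parametrization (illustrative only; the hidden size, the tanh transfer function and the Gaussian scales are assumptions): sampling the parameters of the generating network from Gaussian priors induces a random smooth function v(j), and rescaling the prior on the output weights $w_{v,k}$ rescales the amplitude of v.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 3, 16          # dimension of v(j) and hidden size of the generating network

def sample_v(w_scale=1.0, v_scale=1.0, b_scale=1.0):
    """Sample the generating network's parameters from Gaussian priors and return
    the induced input weights function v(j) = sum_k w_{v,k} tanh(v_{v,k} j + b_{v,k})."""
    w_v = rng.normal(scale=w_scale, size=(d, k))   # controls the amplitude of v(j)
    v_v = rng.normal(scale=v_scale, size=k)        # with b_v, controls dependence across j
    b_v = rng.normal(scale=b_scale, size=k)
    return lambda j: w_v @ np.tanh(v_v * j + b_v)

v_small, v_large = sample_v(w_scale=0.1), sample_v(w_scale=10.0)
for j in (0.0, 0.5, 1.0):
    print(j, np.linalg.norm(v_small(j)), np.linalg.norm(v_large(j)))
```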

A bit of recursion

$v(j) = \sum_k w_{v,k} \, g(v_{v,k}\, j + b_{v,k})$

What about the $v_{v,k}$ and the $b_{v,k}$?
We could define them as the output of a neural network.
You should be lost by now.
Let's stop a bit to rest.

Summary

Input weights can be seen as a function.
There are parametrizations of that function that yield theoretically more powerful networks than the usual ones.
Moreover, such parametrizations allow us to set different constraints than the common ones.
Example: handling of sequential data.

Having all possible input neurons at once

1. Instead of optimizing the input weights, we could use all of them:
   $f(x) = \int_E w(v)\, g(v \cdot x)\, dv$
   and only optimize the output weights: this is convex.
2. The optimal solution is of the form $f(x) = \sum_i a_i \int_E g(x \cdot v)\, g(x_i \cdot v)\, dv$.
3. With a sign transfer function, this integral can be computed analytically and yields a kernel machine.

Setting a prior on the output weights, this becomes a GP.

$K_{\mathrm{sign}}(x, y) = A - B\,\|x - y\|$

This kernel has no hyperparameter.
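
A hedged sketch of the resulting kernel machine: kernel ridge regression with K(x, y) = A - B ||x - y||. The constants A and B below are placeholders (the talk only states the form of the kernel), and kernel ridge regression stands in generically for the convex optimization of the output weights.

```python
import numpy as np

def k_sign(X, Y, A=1.0, B=0.1):
    """Kernel induced by the sign transfer function, K(x, y) = A - B * ||x - y||.
    A and B are placeholder constants; their exact values come from the integration
    and are not reproduced here."""
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return A - B * dists

def fit_kernel_ridge(X, y, lam=1e-3, A=1.0, B=0.1):
    """Kernel ridge regression with the sign kernel (a generic stand-in for the
    convex output-weight optimization described in the talk)."""
    K = k_sign(X, X, A, B)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return lambda X_new: k_sign(X_new, X, A, B) @ alpha

# Tiny usage example on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])
predict = fit_kernel_ridge(X, y)
print(predict(X[:5]))
```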

Results on USPS with 6000 training samples

Classification error (%); wd = weight decay.

Algorithm        wd = 10^-3       wd = 10^-6       wd = 10^-12      Test
K_sign           2.27 ± 0.13      1.80 ± 0.08      1.80 ± 0.08      4.07
Gaussian σ = 1   58.27 ± 0.50     58.54 ± 0.27     58.54 ± 0.27     58.29
Gaussian σ = 2   7.71 ± 0.10      7.78 ± 0.21      7.78 ± 0.21      12.31
Gaussian σ = 4   1.72 ± 0.11      2.09 ± 0.09      2.10 ± 0.09      4.07
Gaussian σ = 6   1.67 ± 0.10      2.78 ± 0.25      3.33 ± 0.35      3.58
Gaussian σ = 7   1.72 ± 0.10      3.04 ± 0.26      4.39 ± 0.49      3.77

Results on MNIST with 6000 training samples

Classification error (%); wd = weight decay.

Algorithm        wd = 10^-3       wd = 10^-6, 10^-9, 10^-12, 0     Test
K_sign           5.51 ± 0.22      4.54 ± 0.50                      4.09
Gaussian σ = 1   77.55 ± 0.40     77.55 ± 0.40                     80.03
Gaussian σ = 2   10.51 ± 0.46     10.51 ± 0.45                     12.44
Gaussian σ = 3   3.64 ± 0.10      3.64 ± 0.10                      4.1
Gaussian σ = 5   3.01 ± 0.12      3.01 ± 0.12                      3.33
Gaussian σ = 7   3.15 ± 0.09      3.18 ± 0.10                      3.48

Results on LETTERS with 6000 training samples

Classification error (%); wd = weight decay.

Algorithm        wd = 10^-3       wd = 10^-6       wd = 10^-9       Test
K_sign           5.36 ± 0.10      5.22 ± 0.09      5.22 ± 0.09      5.5
Gaussian σ = 2   5.47 ± 0.14      5.93 ± 0.15      5.92 ± 0.14      5.8
Gaussian σ = 4   4.97 ± 0.10      11.06 ± 0.29     12.50 ± 0.35     5.3
Gaussian σ = 6   6.27 ± 0.17      8.47 ± 0.20      17.61 ± 0.40     6.63
Gaussian σ = 8   8.45 ± 0.19      6.11 ± 0.15      18.69 ± 0.34     9.25

Conclusion

Summary

We showed that training a neural network can be seen as learning an input weight function.
We introduced a piecewise-affine parametrization of that function which corresponds to a continuous number of hidden units.
In the extreme case where all the input weights are present, we showed it is a kernel machine whose kernel can be computed analytically and possesses no hyperparameter.

Future work

Learning the transfer function using a neural network.
Finding other (and better) parametrizations of the input weight function.
Recursively defining the input weight function as the output of a neural network.

Questions

Now is the time for ...

Questions?

Computing $\int \mathrm{sign}(v \cdot x + b)\,\mathrm{sign}(v \cdot y + b)\, dv\, db$

1. The sign function is invariant with respect to the norm of its argument, so v can be restricted to the unit hypersphere.
2. $\mathrm{sign}(v \cdot x + b)\,\mathrm{sign}(v \cdot y + b) = \mathrm{sign}\!\big[(v \cdot x + b)(v \cdot y + b)\big]$.
3. When b ranges from -M to +M, for M large enough, $(v \cdot x + b)(v \cdot y + b)$ is negative on an interval of size $|v \cdot (x - y)|$.
4. $\int_{b=-M}^{+M} \mathrm{sign}\!\big[(v \cdot x + b)(v \cdot y + b)\big]\, db = 2M - 2\,|v \cdot (x - y)|$.
5. Integrating this term over the unit hypersphere yields a kernel of the form $K(x, y) = A - B\,\|x - y\|$.
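
The derivation can be checked numerically: sampling v uniformly on the unit hypersphere and b uniformly on [-M, M], the Monte Carlo average of sign(v.x + b) sign(v.y + b) should decrease affinely with ||x - y||. The sketch below does this check; the value of M, the sample size and the normalization are illustrative assumptions, not the talk's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_sign_kernel(x, y, M=10.0, n=200_000):
    """Monte Carlo estimate of E[sign(v.x + b) sign(v.y + b)] with v uniform on the
    unit hypersphere and b uniform on [-M, M]."""
    d = len(x)
    v = rng.normal(size=(n, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # uniform on the unit hypersphere
    b = rng.uniform(-M, M, size=n)
    return np.mean(np.sign(v @ x + b) * np.sign(v @ y + b))

# The estimate should decrease affinely with ||x - y||, i.e. K(x, y) ~ A - B ||x - y||.
x = np.zeros(3)
for r in (0.5, 1.0, 2.0):
    y = np.array([r, 0.0, 0.0])
    print(r, mc_sign_kernel(x, y))
```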

Additive and multiplicative invariance of the covariance matrix

1. In SVM and kernel regression, the elements of the weight vector $\alpha$ sum to 0.
2. The final solution involves $K\alpha$.
3. Thus, adding a constant to every element of the covariance matrix does not change the solution: $(K + \lambda\, e e^\top)\,\alpha = K\alpha + \lambda\, e\,(e^\top \alpha) = K\alpha$.

The cost is also invariant under a joint rescaling of the kernel, the weights and the regularization:
$C(K, \alpha, b, \lambda) = L(K\alpha + b, Y) + \lambda\, \alpha^\top K \alpha$
$C\!\left(\tfrac{K}{c},\, c\alpha,\, b,\, \tfrac{\lambda}{c}\right) = L\!\left(\tfrac{K}{c}\, c\alpha + b,\, Y\right) + \tfrac{\lambda}{c}\,(c\alpha)^\top \tfrac{K}{c}\,(c\alpha) = C(K, \alpha, b, \lambda)$

Hence neither the additive constant A nor the overall scale of $K_{\mathrm{sign}}$ needs to be tuned, which is why the kernel has no hyperparameter.
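
A quick numeric check of the additive invariance (a generic sketch, not from the talk): with a weight vector whose elements sum to zero, adding a constant to every entry of K leaves K alpha unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
K = rng.normal(size=(n, n)); K = K @ K.T           # some symmetric kernel matrix
alpha = rng.normal(size=n); alpha -= alpha.mean()  # weight vector summing to 0
e, lam = np.ones(n), 3.7

# (K + lam * e e^T) alpha = K alpha + lam * e * (e^T alpha) = K alpha, since e^T alpha = 0
print(np.allclose((K + lam * np.outer(e, e)) @ alpha, K @ alpha))
```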
