



Microprocessing and Microprogramming 41 (1996) 757-769

Mixing floating- and fixed-point formats for neural network

learning on neuroprocessors
Davide Anguita a,*, Benedict A. Gomes b,1
a Univ. of Genova, D.I.B.E., via Opera Pia 11a, 16145 Genova, Italy
b Int. Comp. Science Inst., 1947 Center St., Berkeley, CA, USA

Received 1 March 1995; revised 11 October 1995; accepted 18 January 1996

We examine the efficient implementation of back-propagation (BP) type algorithms on T0 [3], a vector processor with
a fixed-point engine, designed for neural network simulation. Using Matrix Back Propagation (MBP) [2] we achieve an
asymptotically optimal performance on T0 (about 0.8 GOPS) for both forward and backward phases, which is not possible
with the standard on-line BP algorithm. We use a mixture of fixed- and floating-point operations in order to guarantee both
high efficiency and fast convergence. Though the most expensive computations are implemented in fixed-point, we achieve
a rate of convergence that is comparable to the floating-point version. The time taken for conversion between fixed- and
floating-point is also shown to be reasonably low.

Keywords: Neural networks; Neuroprocessors; Fixed-point format

1. Introduction

Among the large number of dedicated VLSI architectures for neural networks developed in recent years,
several of the most successful proposals have regarded digital implementations. Most of these dedicated
processors are oriented toward the efficient execution of various learning algorithms, with a strong accent on
back-propagation (BP). Some well-known examples in this field are CNAPS [13], Lneuro [18], MA-16
[20], and SPERT [28]: they are the building blocks for larger systems that exploit massive parallelism to
achieve performances orders of magnitude greater than conventional workstations [21,4]. The common
characteristic of these processors is the use of a fixed-point engine, typically 16 bits wide or less, for fast
computation.

The drawback for the final user who wants to implement an algorithm for neural network learning on

* Corresponding author. Email: anguita@dibe.unige.it.
1 Email: gomes@icsi.berkeley.edu

0165-6074/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved


Table 1
The MBP algorithm

Pseudo-code                                          # of operations        Point

/* Feed-forward */
for l := 1 to L
  S_l := S_{l-1} W_l                     (1.1)       2 N_P N_l N_{l-1}      fixed
  S_l := S_l + 1 b_l^T                   (1.2)       N_P N_l                fixed
  S_l := f{S_l}                          (1.3)       N_P N_l k_1            fixed

/* Error back-prop */
Δ_L := T - S_L                           (2.1)       N_P N_L                floating
Δ_L := Δ_L × f'{S_L}                     (2.2)       N_P N_L (1 + k_2)      floating
for l := L-1 to 1
  Δ_l := Δ_{l+1} W^T_{l+1}               (2.3)       2 N_P N_{l+1} N_l      fixed
  Δ_l := Δ_l × f'{S_l}                   (2.4)       N_P N_l (1 + k_2)      fixed

/* Weight variation */
for l := 1 to L
  ΔW_l^new := S^T_{l-1} Δ_l              (3.1)       2 N_P N_l N_{l-1}      fixed
  Δb_l^new := Δ_l^T 1                    (3.2)       N_P N_l                fixed
  ΔW_l^new := η ΔW_l^new + α ΔW_l^old    (3.3)       3 N_l N_{l-1}          floating
  Δb_l^new := η Δb_l^new + α Δb_l^old    (3.4)       3 N_l                  floating

/* Weight update */
for l := 1 to L
  W_l := W_l + ΔW_l^new                  (4.1)       N_l N_{l-1}            floating
  b_l := b_l + Δb_l^new                  (4.2)       N_l                    floating
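Read as a recipe, the steps of Table 1 are ordinary matrix operations. The sketch below renders them in plain Python as an illustration of the structure; the helper names are ours, everything is kept in floating-point (the fixed-point aspects come later), and the momentum terms of steps (3.3)-(3.4) are folded into a simple learning-rate update for brevity. It is not the paper's T0 code.

```python
# Illustrative sketch of one MBP epoch (steps 1.1-4.2 of Table 1),
# using nested lists as matrices. Float-only, momentum omitted.
import math

def matmul(A, B):
    # (n x m) * (m x p) -> (n x p)
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mbp_epoch(S0, T, W, b, eta=0.5):
    # Feed-forward: S_l = f(S_{l-1} W_l + b_l)              (1.1)-(1.3)
    S = [S0]
    for Wl, bl in zip(W, b):
        Z = matmul(S[-1], Wl)
        S.append([[sigmoid(z + bl[j]) for j, z in enumerate(row)] for row in Z])
    # Output-layer error times sigmoid derivative:          (2.1)-(2.2)
    D = [[(T[i][j] - S[-1][i][j]) * S[-1][i][j] * (1.0 - S[-1][i][j])
          for j in range(len(T[0]))] for i in range(len(T))]
    deltas = [D]
    # Back-propagation: D_l = (D_{l+1} W_{l+1}^T) * f'(S_l) (2.3)-(2.4)
    for l in range(len(W) - 1, 0, -1):
        WT = [list(col) for col in zip(*W[l])]
        D = matmul(deltas[0], WT)
        D = [[D[i][j] * S[l][i][j] * (1.0 - S[l][i][j])
              for j in range(len(D[0]))] for i in range(len(D))]
        deltas.insert(0, D)
    # Weight/bias variation and update:                     (3.1)-(4.2)
    for l, Dl in enumerate(deltas):
        ST = [list(col) for col in zip(*S[l])]
        dW = matmul(ST, Dl)
        for i in range(len(W[l])):
            for j in range(len(W[l][0])):
                W[l][i][j] += eta * dW[i][j]
        for j in range(len(b[l])):
            b[l][j] += eta * sum(row[j] for row in Dl)
    return S[-1]
```

Because whole layers are processed as matrices, each step maps directly onto the matrix products analyzed in the rest of the paper.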

this kind of processors is the fixed-point format, which requires greater attention during the implementation,
compared with a conventional floating-point format. This is not a new problem; in fact both analog and digital
implementations of neural networks suffer from some constraint due to physical limitations. For this
reason, the effect of discretization on feed-forward networks and back-propagation learning received some
attention shortly after the introduction of the algorithm [10,5,15]. Most of the results indicate that a
representation of 16 bits for the fixed-point format is reliable enough to obtain reasonable results with on-line
back-propagation. On the other hand, despite this general agreement, there has been some effort to reduce
the precision needed during the computation [14,22], mainly because the effect of the discretization during
learning is not completely understood and it seems to be both problem and algorithm dependent. In fact,
there are many variations of the BP algorithm and each of them can show different sensitivity to the approximations
caused by the fixed-point arithmetic, leading to different convergence problems. Some theoretical
results on the precision issue have been found [1,23], but often they rely on difficult-to-predict parameters
(e.g. the number of iterations to convergence).

One solution to overcome these limitations is to mix conventional floating-point operations with fixed-point
operations when required. An example of this approach is [12], where the feed-forward and the backward
phase of the algorithm are computed in fixed- and floating-point format respectively. However, this solution
does not address the efficiency issue because the most computationally expensive part of the algorithm
(the backward phase) is still performed in floating-point format, losing all the advantages of a fast fixed-point
engine.

We show here a mixed floating/fixed-point implementation of Matrix Back Propagation (MBP) [2]
that isolates the most computationally expensive steps of the algorithm and implements them efficiently in
fixed-point format. Other parts of the algorithm, with less demand in terms of computational power but
with more critical needs in terms of accuracy, are implemented in conventional floating-point format. The
target architecture is the neuroprocessor T0, but the method is of general validity.

Despite the need for conversions between the two formats and the simulation of the floating-point operations
in software, good performances are obtainable with reasonably large networks, showing a high efficiency
in exploiting the T0 hardware.

The following section describes the learning algorithm implemented. Section 3 summarizes the main
characteristics of T0, Section 4 describes the mixed floating/fixed-point approach, Section 5 shows the
implementation details and performance evaluation, and Section 6 compares the effect of the mixed approach
with the standard algorithm.

2. Matrix back propagation

In Table 1 the MBP algorithm is summarized. It can be used to represent several BP learning algorithms
with adaptive step and momentum [26,27]. The second column of the table contains the number of operations
needed by each step. The third column indicates if the computation for each step is performed in fixed-
or floating-point format (this choice will be explained in Section 4). Bold letters indicate vectors
or matrices.

We assume that our feed-forward network is composed of L layers of N_l neurons each, with 0 ≤ l ≤ L. The
weights for each layer are stored in matrices W_l of size N_{l-1} × N_l and the biases in vectors b_l of size N_l.
The learning set consists of N_P patterns. Input patterns are stored in matrix S_0 in row order and target
patterns similarly in matrix T. The order of storing is particularly important for the efficiency of the
implementation: if the patterns are stored in row order, the elements of each pattern lie in consecutive memory
locations and can be accessed with no performance penalty on the vast majority of current processor
architectures, including T0. Matrices S_1, ..., S_L contain the output of the corresponding layer when S_0 is
applied to the input of the network. The size of S_l is N_P × N_l and the size of T is N_P × N_L.

The back-propagated error is stored in matrices Δ_l of size N_P × N_l, and the variations of the weights and
biases computed at each step are stored respectively in matrices ΔW_l of size N_{l-1} × N_l and vectors Δb_l
of size N_l. For simplicity, connections between non-consecutive layers are not considered.

The total number of operations of MBP is

    n^op = 2 N_P ( 3 Σ_{l=1}^{L} N_l N_{l-1} - N_1 N_0 )                                (5)
         + (3 + k_1 + k_2) N_P Σ_{l=1}^{L} N_l + 4 Σ_{l=1}^{L} N_l N_{l-1} - N_P N_L    (6)
         + 4 Σ_{l=1}^{L} N_l,                                                           (7)

where k_1 and k_2 are respectively the numbers of operations needed for the computation of the activation
function of the neurons and of its derivative. If the activation function is the usual sigmoid, then k_2 = 2.

On a conventional RISC, if each operation is completed in a single cycle, the total computational time
is T ∝ n_cycles = n^op. On vector or multi-ALU processors like T0, the expected time is T ∝ n_cycles = n^op / P,
where P is the number of ALUs. Obviously the implicit assumptions are: (a) there is no additional cost
to load or store the data in memory, (b) one instruction can be issued every cycle, and (c) the order in
which the operations are issued allows a complete exploitation of the ALUs. It has already been shown [2]
that with a relatively small effort these constraints can

be satisfied reasonably well on some RISCs. In Section 5 we will address this problem for T0.

3. The neuroprocessor T0

T0 belongs to the family of neuroprocessors with fast fixed-point capabilities and it will be the first
implementation of the Torrent architecture [3]. It is tailored for neural-network calculations and inherits
some of the features of a previous neuroprocessor [28]. The next implementation (T1) will be the building
block for a massively parallel neuro-computer [4].

In particular, T0 is composed of a standard MIPS-II RISC engine [17] with no floating-point unit but
with a fixed-point vector unit that can execute up to two operations per cycle on 8-word vectors or, in
other words, compute 16 results in a single cycle. This translates to a peak performance of 0.8 GOPS (Giga
Operations per Second) if the processor is clocked at 50 MHz, or approximately 0.2 GCUPS (Giga Connection
Updates per Second) for one-hidden-layer networks, a result comparable to supercomputer implementations.
Fig. 1 summarizes the architecture of the vector unit. The two 8-word ALUs are VP0 and VP1,
connected to the 32-bit vector register bank. Each vector register contains 32 elements, therefore each
ALU can execute an operation on a complete vector in 4 cycles. The data path to/from the memory is 128
bits wide, allowing the loading/storing of eight 16-bit words in a single cycle.

Fig. 1. Simplified architecture of the Vector Unit of T0.

4. The mixed format algorithm

We will explain here in detail the choice of the format for each step of the algorithm. The main idea
is to perform the most computationally expensive part of the algorithm in fixed-point and resort to floating-point
only where the computation must be particularly accurate.

Using Table 1 we can observe that the most expensive steps are (1.1), (2.3) and (3.1). They require
O(n^3) operations (where n is in general the size of the problem), therefore they will be performed in fixed-point.
Note that matrix S_0, which contains the input patterns, is likely to be already in fixed-point format in
real-world applications, deriving, for example, from an A/D conversion. Step (1.2) can easily be computed
in the same way.

Step (1.3) requires a function computation. With the use of the fixed-point format, this can be substituted
with an indexed load from a table where the values of the function are pre-stored.

Before starting the error back-propagation, we can translate the output of the network to floating-point
in order to have an accurate computation of the error (2.1) and its derivative (2.2). The interesting side-effect
of performing these operations in floating-point is that we know (after step (2.2)) the numeric range
of the error, therefore it is possible to choose a good fixed-point representation for the subsequent steps.

The next conversion is performed before steps (3.3) and (3.4) in order to compute with great accuracy the
variation of the weights and biases of the network. Note that both η (the learning step) and α (the momentum
term) are in general floating-point variables.

To summarize the algorithm: the conversion from fixed- to floating-point format must be performed at
the end of the forward phase on matrix S_L and at the end of the backward phase on ΔW_l and Δb_l. The
conversion from floating- to fixed-point format must be performed at the beginning of the forward phase on
each W_l and b_l and at the beginning of the backward phase on Δ_L.

5. Optimal implementation on the T0 neuroprocessor

If the implementation of an algorithm on T0 is optimal, in the sense that it can completely exploit its
hardware, we can expect to have n_cycles = n^op / 16. For this reason, we will refer to an algorithm as
asymptotically optimal for T0 (or simply optimal) if the efficiency E of its implementation goes to 1 as the size
of the problem (N_P, N_l) grows. In other words: E = n^op / (16 n_cycles) → 1.

Our purpose is to show that MBP can be implemented optimally in this sense, even though some of
the computations are done in floating-point and must be simulated in software.

As mentioned before, the computational load be-

Table 2
Scalar and vectorized matrix products

S_l = S_{l-1} · W_l                              (1.1)

  Scalar:
    for i := 0 to N_P - 1
      for j := 0 to N_l - 1
        for k := 0 to N_{l-1} - 1
          s^l_{i,j} += s^{l-1}_{i,k} * w^l_{k,j}

  Vectorized:
    for j := 0 to N_l - 1 step V_L
      for i := 0 to N_P - 1 step U
        for k := 0 to N_{l-1} - 1
          s^l_{i,[j,j+V_L)}     += s^{l-1}_{i,k}     * w^l_{k,[j,j+V_L)}
          ...
          s^l_{i+U-1,[j,j+V_L)} += s^{l-1}_{i+U-1,k} * w^l_{k,[j,j+V_L)}

Δ_l = Δ_{l+1} · W^T_{l+1}                        (2.3)

  Scalar:
    for i := 0 to N_P - 1
      for j := 0 to N_l - 1
        for k := 0 to N_{l+1} - 1
          δ^l_{i,j} += δ^{l+1}_{i,k} * w^{l+1}_{j,k}

  Vectorized:
    for i := 0 to N_P - 1 step V
      for j := 0 to N_l - 1 step V
        for k := 0 to N_{l+1} - 1 step V_L
          δ^l_{i,j}         += δ^{l+1}_{i,[k,k+V_L)}     * w^{l+1}_{j,[k,k+V_L)}
          ...
          δ^l_{i+V-1,j+V-1} += δ^{l+1}_{i+V-1,[k,k+V_L)} * w^{l+1}_{j+V-1,[k,k+V_L)}

ΔW_l = S^T_{l-1} · Δ_l                           (3.1)

  Scalar:
    for i := 0 to N_{l-1} - 1
      for j := 0 to N_l - 1
        for k := 0 to N_P - 1
          Δw^l_{i,j} += s^{l-1}_{k,i} * δ^l_{k,j}

  Vectorized:
    for j := 0 to N_l - 1 step V_L
      for i := 0 to N_{l-1} - 1 step U
        for k := 0 to N_P - 1
          Δw^l_{i,[j,j+V_L)}     += s^{l-1}_{k,i}     * δ^l_{k,[j,j+V_L)}
          ...
          Δw^l_{i+U-1,[j,j+V_L)} += s^{l-1}_{k,i+U-1} * δ^l_{k,[j,j+V_L)}

longs to steps (1.1), (2.3) and (3.1). To compute these steps, three matrix multiplications must be performed:
(1.1) is a conventional matrix product, (2.3) is a matrix product with the second matrix transposed,
and (3.1) is a matrix product with the first matrix transposed.

The three operations are shown in pseudo-code in the second column of Table 2. The third column shows
the vectorized versions. V_L is the vector register length (32 in the current implementation of T0) and U, V
are the unrolling depths needed to fill the processor pipelines.

The increase of the unrolling depth shifts the balance of the loop from memory-bound to CPU-bound,
therefore extra cycles are available for the memory port to load (store) the operands while the processor
is computing the arithmetic operations. The unrolling depth is limited by the number of registers available
for storing intermediate results: in our case U = 8 and V = 2.

As can easily be noted, the vectorized version performs its vector references to each matrix in row order
to exploit the memory bandwidth of T0. In fact, the use of stride-1 access to the memory allows the processor
to load an entire 8-word vector (of 16-bit elements) in a single cycle, while a generic stride-n access to memory
(n > 1) requires one cycle per element.

We will assume in the following text that all the matrix dimensions are a multiple of V_L. If this is not the
case, there is some overhead due to an underutilization of the vector unit, but it does not affect the
asymptotical behavior of the implementation. For an

exact computation of the number of cycles in the general case, the reader can refer to Table 3.

Table 3
Number of cycles for MBP on T0 in the general case

Step    n_cycles
(1.1)   (4 N_{l-1} U + 4U) ⌈N_P/U⌉ ⌈N_l/V_L⌉
(1.2)   (8 ⌈N_l/V_L⌉ + 1) N_P
(1.3)   1.5 N_P ⌈N_l/V_L⌉ V_L
(2.1)   k_f N_P ⌈N_L/V_L⌉ V_L
(2.2)   3 k_f N_P ⌈N_L/V_L⌉ V_L
(2.3)   (4 ⌈N_{l+1}/V_L⌉ V^2 + 20 + V^2) ⌈N_P/V⌉ ⌈N_l/V⌉
(2.4)   12 ⌈N_l/V_L⌉ N_P
(3.1)   (4 N_P U + 4U) ⌈N_l/V_L⌉ ⌈N_{l-1}/U⌉
(3.2)   (2 N_P + 4) ⌈N_l/V_L⌉
(3.3)   3 k_f N_l ⌈N_{l-1}/V_L⌉ V_L
(3.4)   3 k_f ⌈N_l/V_L⌉ V_L
(4.1)   k_f N_l ⌈N_{l-1}/V_L⌉ V_L
(4.2)   k_f ⌈N_l/V_L⌉ V_L

Table 4 shows the number of cycles needed by T0 to perform the optimized matrix multiplications.

Table 4
Number of cycles for optimized matrix multiplications

Step    n_cycles
(1.1)   (1/8) N_P N_l N_{l-1} + (1/8) N_P N_l
(2.3)   (1/8) N_P N_l N_{l+1} + 6 N_P N_l
(3.1)   (1/8) N_P N_l N_{l-1} + (1/8) N_l N_{l-1}

Step (1.1) requires four cycles in the inner loop to compute a single vector multiplication/addition and
four cycles to store each result back in memory at the end of the loop. The load of element s_{i,k} can be
overlapped with the computation thanks to the unrolling of the external loop. It is easy to prove the optimality
of (1.1):

    E(1.1) = 2 N_P N_l N_{l-1} / [ 16 ( (1/8) N_P N_l N_{l-1} + (1/8) N_P N_l ) ]
           = 1 / (1 + 1/N_{l-1}) → 1.                                              (8)

The second product (2.3) can be seen as a sequence of dot-products. This operation is not directly implemented
on T0 and needs ~20 cycles for a vector of V_L = 32 words. This problem is known and could be
eventually solved in future releases of the processor [13]. In any case, the overhead due to the absence of
the dot-product is not particularly annoying when dealing with matrix products: in fact, partial dot-products
of length V_L can be kept in vector registers and the final result can be computed at the end of the inner
loop. Note that matrix-vector products (used in the standard BP algorithm) would suffer from a bigger
overhead; in our case the absence of an implemented dot-product appears only in the second-order term and
becomes negligible for large problems.

The third product (3.1) is similar to (1.1), but with a different order of the loops.

Other steps performed in fixed-point format are: the computation of the output of each neuron through
its activation function (1.3), the computation of its derivative in the internal layers (2.4), the bias addition
in the feed-forward phase (1.2) and the bias computation in the backward phase (3.2).

The computation of the activation function is quite expensive if it is done using a floating-point math
library [11], and it would cause a large penalty on T0 due to the absence of a floating-point unit. Yet, if (1.3)
is performed in fixed-point format, the activation function can be easily computed using a look-up table of
size 2^B, where B is the number of bits of the fixed-point format [6]. The vector unit of T0 is provided
with a vector instruction to perform indexed loads, so the number of cycles needed to compute the value
using the table is only ~1.5/element.

The pseudo-code for steps (1.2), (2.4) and (3.2) is shown in Table 5. The three loops are memory-bound,
therefore the number of cycles is easy to compute (assuming sufficient unrolling). All the other
steps are done in floating-point format.

Let us consider now the overhead due to the conversion of the matrices from floating- to fixed-point
format and vice versa. The scalar conversion takes ~46 cycles/element on T0, but it is possible to lower this

Table 5
Other vectorized operations

Step   Pseudo-code

(1.2)  for i := 0 to N_P - 1
         for j := 0 to N_l - 1 step V_L
           s^l_{i,[j,j+V_L)} += b^l_{[j,j+V_L)}

(2.4)  for i := 0 to N_P - 1
         for j := 0 to N_l - 1 step V_L
           δ^l_{i,[j,j+V_L)} = δ^l_{i,[j,j+V_L)} * s^l_{i,[j,j+V_L)} * (1 - s^l_{i,[j,j+V_L)})

(3.2)  for j := 0 to N_l - 1 step V_L
         for i := 0 to N_P - 1
           Δb^l_{[j,j+V_L)} += δ^l_{i,[j,j+V_L)}
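In the same spirit, the table-driven activation of step (1.3) can be sketched as the indexed load described above. The word width B = 16 matches the paper; the range exponent E = 3 is an illustrative choice of ours, as are the helper names:

```python
# Step (1.3) as an indexed table load: the sigmoid is precomputed for every
# representable 16-bit fixed-point input, so evaluation is one lookup per
# element (~1.5 cycles/element with T0's indexed-load vector instruction).
import math

B, E = 16, 3                        # word width; illustrative range exponent
SCALE = 2.0 ** (E - (B - 1))        # value of one least-significant bit

def _decode(q):
    # Interpret a B-bit code as a two's-complement integer.
    return q - 2 ** B if q >= 2 ** (B - 1) else q

# One table entry per representable input; the index is the raw bit pattern.
TABLE = [1.0 / (1.0 + math.exp(-_decode(q) * SCALE)) for q in range(2 ** B)]

def sigmoid_fixed(q):
    # q: B-bit fixed-point code of the input; the value is returned as float.
    return TABLE[q & (2 ** B - 1)]
```

Since the table is indexed by the raw bit pattern, no conversion of the fixed-point operand is needed before the lookup.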

number using the vector unit. For vectors between 100 and 1000 elements, the translation from floating-point
to fixed-point format requires only k_fx = 2.6 to 1.8 cycles/element, and k_xf = 3.6 to 2.5 cycles/element
for the inverse conversion (these figures have been measured experimentally).

The total number of cycles needed for the conversions is

    n^conv_cycles = (k_xf + k_fx) [ Σ_{l=1}^{L} N_l (N_{l-1} + 1) + N_P N_L ].      (9)

T0 does not implement the floating-point unit of the MIPS architecture, so the floating-point operations
must be simulated in software. Currently the RISC core is used to perform the simulation, but an IEEE-compatible
floating-point library that uses the vector unit is under development and the expected performance
will be in the range of 10 to 50 cycles/element. Then the number of cycles for the floating-point steps
of the algorithm will be n^fl_cycles = k_f n^fl_op, with k_f ∈ [10, 50].

We now have all the elements to compute the number of cycles needed by T0 to execute MBP:

    n_cycles = (N_P/8) ( 3 Σ_{l=1}^{L} N_l N_{l-1} - N_1 N_0 )                      (10)
             + [second-order terms in N_P N_L and N_l N_{l-1}, among them
                the conversion overhead (9)]                                        (11), (12)
             + L N_P + (4 k_f + k_fx + k_xf) Σ_{l=1}^{L} N_l + ...                  (13)

If we compare the O(n^3) term (10) with the corresponding term for n^op, we can easily deduce the
optimality of this implementation of MBP.

Obviously, the asymptotical behavior of MBP on T0 is not of primary importance when dealing with
real-world applications. It is interesting therefore to analyze the second-order (11), (12) and first-order (13)
terms of the above expression.

First of all, we note that the overhead due to the conversions from fixed- to floating-point and vice versa
depends mainly on the size of the network and only marginally on the dimension of the training set, as can
be seen from the second term of (11) and the first term of (12). The dependence on the size of the training
set is controlled by the number of neurons of the output layer (N_L), so we expect better performance
when dealing with networks with a small number of outputs (e.g. classification problems, as opposed to
encoding problems [8]). If this is not the case, some techniques to reduce the number of output neurons in



Fig. 2. Efficiency and performance (MCUPS) of MBP on T0.
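The claim that the conversion overhead depends mainly on the network size, and only marginally on the training-set size, can be checked with a back-of-the-envelope model for an N-N-N network. This is a sketch under our own simplifications (ideal n^op/16 compute cycles, and the worst-case conversion constants k_fx = 4, k_xf = 3 used in the text), not the paper's exact expression:

```python
# Rough model: fraction of total cycles spent on float<->fixed conversion
# for a one-hidden-layer N-N-N network trained on N_P patterns.
def conversion_fraction(NP, N, k_fx=4, k_xf=3):
    layers = [(N, N), (N, N)]                  # (N_l, N_{l-1}) per layer
    # Conversion cycles: weights/biases each epoch plus the N_P x N_L output.
    conv = (k_xf + k_fx) * (sum(nl * (nlm1 + 1) for nl, nlm1 in layers)
                            + NP * N)
    # Fixed-point compute: the O(n^3) work 6*N_P*sum(N_l*N_{l-1}) at 16 ops/cycle.
    compute = 6 * NP * sum(nl * nlm1 for nl, nlm1 in layers) / 16
    return conv / (conv + compute)
```

Growing N_P shrinks the conversion share, since only the N_P x N_L output term scales with the training set while the cubic compute term grows much faster.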

classification problems can be applied [19].

There is also an explicit dependence in the first-order term (13) on the number of layers of the network
(L). This term is of small importance, being of first order, but we can expect an increase of overhead
in networks with a very large number of layers. However, this is not a common case, as a large number
of layers is not theoretically justified [9] and practical applications seldom require more than four layers
(see, for example, [16] for a real problem that requires such an architecture).

To sketch the behavior of MBP on T0, we can simplify both the expressions for n^op and n_cycles assuming
N_l ≈ N and plot the efficiency and the performance

Table 6
Some real-world applications

Name               Network size               Description
                   N0    N1     N2    N3
NETtalk [24]       203   80     26    -       Pronunciation of text
Neurogammon [25]   459   24     24    1       Backgammon player
Speech [7]         234   1000   69    -       Speech recognition

in MCUPS (Fig. 2) as functions of the size of the training set (N_P) and of the network (N).

We assume k_1 = 6 to compute n^op (as suggested in [11]) and the worst case for the floating-point and
conversion routines on T0 (k_f = 50, k_fx = 4, k_xf = 3) to compute n_cycles. The asymptotic performance is 160
MCUPS; obviously, the asymptotic performance of a generic RISC processor with the same clock and only
one FPU would be 10 MCUPS.

Fig. 2 allows us to easily understand the behavior of the implementation, but it is of little practical use
due to the peculiar network architecture. For this reason we show here the performance of MBP on T0
with networks that have been used in some real-world applications (Table 6).

Fig. 3 summarizes the performance for the applications mentioned above. It is interesting to note that, for
all problems, the number of patterns for which half of the peak performance is attained (n_{1/2}) is reasonably
small (N_P ≈ 500).

6. Learning with the mixed format algorithm

To test the effectiveness of the mixed format algorithm we chose the speech recognition problem described
in the previous section.

Fig. 4 shows the learning on a subset of the speech database with different ranges of the fixed-point variables.
In particular, E is the exponent of the most significant digit of the fixed-point format. With 16-bit
words we can represent values in the range [-2^E, 2^E - 2^{E-15}].

It is clear that the error back-propagation is quite sensitive to the range of the fixed-point format. If the
fixed-point representation is too coarse (e.g. E = 2), the algorithm tends to get stuck due to the underflow
of the back-propagated error. However, thanks to the use of the mixed format, it is possible to choose a
good range for the fixed-point variables before starting the error back-propagation, because the error computation
in the last layer is done in floating-point format. The choice of the correct range can easily be done
by looking at the largest floating-point value. In this case, the learning with the mixed format is comparable
to the learning in floating-point format in terms of the number of learning steps but, of course, far more
efficient from a computational point of view.

7. Conclusions

We have detailed here an efficient implementation of a back-propagation algorithm on T0. The use of
the mixed fixed/floating-point mode in the implementation shows good performance with real-world networks,
both in terms of the efficiency of computation and in terms of the convergence rate. The limited precision
supported by the hardware is not a problem provided the range is appropriately chosen. The mixed
model computes the output layer's error using floating-point, and uses the floating-point values to determine
an appropriate range for the following fixed-point computations.

This work shows that digital neuroprocessors, and particularly T0, can be efficient test beds for various
BP-type algorithms, even when limited by fixed-point formats.

Fig. 3. Performance for some real-world applications (in MCUPS).

Fig. 4. Learning behavior for different fixed-point ranges.
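The 16-bit format of Section 6 and the range choice it enables can be sketched as follows (the helper names are ours, not T0 library calls; rounding and saturation details are illustrative assumptions):

```python
# Sketch of the 16-bit fixed-point format with range exponent E: one LSB is
# worth 2**(E-15), the representable range is [-2**E, 2**E - 2**(E-15)].
import math

def to_fixed(x, E):
    # Quantize a float to a 16-bit two's-complement code (k_fx direction);
    # out-of-range values saturate instead of wrapping.
    q = round(x / 2 ** (E - 15))
    return max(-(1 << 15), min((1 << 15) - 1, q))

def to_float(q, E):
    # Inverse conversion (the k_xf direction, in the paper's notation).
    return q * 2 ** (E - 15)

def choose_exponent(errors):
    # Smallest E whose range covers max|error|, mirroring the range choice
    # made after the floating-point error computation of step (2.2).
    m = max(abs(e) for e in errors)
    return math.floor(math.log2(m)) + 1 if m > 0 else 0
```

Choosing E from the largest floating-point error keeps the back-propagated values away from both saturation (E too small) and the underflow that stalls learning when E is too coarse.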

Acknowledgements

Thanks to David Johnson for providing the emulation routines for fixed- and floating-point math and
for several interesting discussions on T0, Naghmeh Nikki Mirghafori for providing the speech database,
and Professor Nelson Morgan for suggestions on the learning algorithm. We would also like to thank two
anonymous reviewers for their suggestions on how to improve this paper.

This work was developed while D. Anguita was a visiting researcher at ICSI, Berkeley, USA, under a
grant of "CNR - Consiglio Nazionale Ricerche", Italy.

References

[1] C. Alippi and M.E. Negri, Hardware requirements for digital VLSI implementations of neural networks, Int. Joint Conf. on Neural Networks, Singapore (1991) pp. 1873-1878.
[2] D. Anguita, G. Parodi and R. Zunino, An efficient implementation of BP on RISC-based workstations, Neurocomputing 6 (1994) 57-65.
[3] K. Asanović, J. Beck, B. Irissou, D. Kingsbury, N. Morgan and J. Wawrzynek, The T0 vector microprocessor, Hot Chips VII Symposium, Stanford Univ. (13-15 Aug. 1995).
[4] K. Asanović, J. Beck, J. Feldman, N. Morgan and J. Wawrzynek, Designing a connectionist network supercomputer, Int. J. Neural Systems 4(4) (Dec. 1993) 317-326.
[5] K. Asanović and N. Morgan, Experimental determination of precision requirements for back-propagation training of artificial neural networks, in Proc. of 2nd Int. Conf. on Microelectronics for Neural Networks, Munich, Germany (16-18 Oct. 1991) pp. 9-15.
[6] V. Bochev, Distributed arithmetic implementation of artificial neural networks, IEEE Trans. on Signal Processing 41(5) (May 1993).
[7] H. Bourlard and N. Morgan, Continuous speech recognition by connectionist statistical methods, IEEE Trans. on Neural Networks 4(6) (Nov. 1993) 893-909.
[8] S. Carrato, A. Premoli and G.L. Sicuranza, Linear and nonlinear neural networks for image compression, in Digital Signal Processing, V. Cappellini and A.G. Constantinides, eds. (Elsevier, Amsterdam, 1991) pp. 526-531.
[9] G. Cybenko, Approximation by superposition of a sigmoidal function, Math. of Control, Signals, and Systems 2 (1989) 303-314.
[10] D.D. Caviglia, M. Valle and G.M. Bisio, Effect of weight discretization on the back propagation learning method: Algorithm design and hardware realization, Proc. of IJCNN '90, San Diego, USA (17-21 June 1990) pp. 631-637.
[11] A. Corana, C. Rolando and S. Ridella, A highly efficient implementation of back-propagation algorithm on SIMD computers, in High Performance Computing, Proc. of the Int. Symp., Montpellier, France (22-24 March 1989), J.-L. Delhaye and E. Gelenbe, eds. (Elsevier, Amsterdam, 1989) pp. 181-190.
[12] E. Fiesler, A. Choudry and H.J. Caulfield, A universal weight discretization method for multi-layer neural networks, IEEE Trans. on SMC, to appear.
[13] D. Hammerstrom, A VLSI architecture for high-performance, low-cost, on-chip learning, Proc. of the IJCNN '90, San Diego, USA (17-21 June 1990) pp. 537-544.
[14] M. Hoehfeld and S.E. Fahlman, Learning with numerical precision using the cascade-correlation algorithm, IEEE Trans. on Neural Networks 3(4) (July 1992) 602-611.
[15] P.W. Hollis, J.S. Harper and J.J. Paulos, The effect of precision constraints in a backpropagation learning network, Neural Computation 2(3) (1990).
[16] M.A. Kramer, Nonlinear principal component analysis using autoassociative neural networks, AIChE J. 37(2) (Feb. 1991) 233-243.
[17] G. Kane and J. Heinrich, MIPS RISC Architecture (Prentice Hall, Englewood Cliffs, NJ, 1992).
[18] N. Mauduit, M. Duranton, J. Gobert and J.A. Sirat, Lneuro 1.0: a piece of hardware LEGO for building neural network systems, IEEE Trans. on Neural Networks 3(3) (May 1992) 414-421.
[19] N. Morgan and H. Bourlard, Factoring networks by a statistical method, Neural Computation 4(6) (Nov. 1992).
[20] U. Ramacher et al., eds., VLSI Design of Neural Networks (Kluwer Academic, Dordrecht, 1991).
[21] U. Ramacher et al., SYNAPSE-X: a general-purpose neurocomputer, Proc. of the 2nd Int. Conf. on Microelectronics for Neural Networks, Munich, Germany (Oct. 1991) pp. 401-409.
[22] S. Sakaue, T. Kohda, H. Yamamoto, S. Maruno and Y. Shimeki, Reduction of required precision bits for back-propagation applied to pattern recognition, IEEE Trans. on Neural Networks 4(2) (March 1993) 270-275.
[23] J.A. Sirat, S. Makram-Ebeid, J.L. Zorer and J.P. Nadal, Unlimited accuracy in layered networks, IEE Int. Conf. on Artificial Neural Networks, London (1989) pp. 181-185.
[24] T.J. Sejnowski and C.R. Rosenberg, Parallel networks that learn to pronounce English text, Complex Systems 1 (1987).
[25] G. Tesauro and T.J. Sejnowski, A neural network that learns to play backgammon, in Neural Information Processing Systems, D.Z. Anderson, ed. (1987) pp. 442-456.
[26] T. Tollenaere, SuperSAB: fast adaptive back propagation with good scaling properties, Neural Networks 3(5) (1990) 561-573.
[27] T.P. Vogl, J.K. Mangis, A.K. Rigler, W.T. Zink and D.L. Alkon, Accelerating the convergence of the back-propagation method, Biological Cybernetics 59 (1989) 257-263.

[28] J. Wawrzynek, K. Asanović and N. Morgan, The design of a neuro-microprocessor, IEEE Trans. on Neural Networks 4(3) (May 1993) 394-399.

Davide Anguita obtained the "laurea" degree in Electronic Engineering from Genoa University in 1989.
He worked at Bailey-Esacontrol in the field of wide-area distributed control systems, then he joined the
Department of Biophysical and Electronic Engineering (DIBE) of Genoa University, where he received
the Doctorate in Computer Science and Electronic Engineering. After a one-year visit to the International
Computer Science Institute, Berkeley, CA, he is currently a postdoc research assistant at DIBE. His research
activities cover neurocomputing and parallel architectures, including applications and implementation of
artificial neural networks and the design of parallel and distributed systems.

Benedict Gomes received the B.S. degree in Computer Engineering from Case Western Reserve University,
Cleveland, OH, and his M.A. in Computer Science from U.C. Berkeley. He is currently working on his PhD
at UC Berkeley. His research centers around mapping structured connectionist networks onto general-purpose
parallel machines.