
Microprocessing and Microprogramming

ELSEVIER   Microprocessing and Microprogramming 41 (1996) 757-769

Mixing floating- and fixed-point formats for neural network learning on neuroprocessors

Davide Anguita a,*, Benedict A. Gomes b,1

a Univ. of Genova, D.I.B.E., via Opera Pia 11a, 16145 Genova, Italy
b Int. Comp. Science Inst., 1947 Center St., Berkeley, CA, USA

Abstract

We examine the efficient implementation of back-propagation (BP) type algorithms on T0 [3], a vector processor with a fixed-point engine, designed for neural network simulation. Using Matrix Back Propagation (MBP) [2] we achieve an asymptotically optimal performance on T0 (about 0.8 GOPS) for both forward and backward phases, which is not possible with the standard on-line BP algorithm. We use a mixture of fixed- and floating-point operations in order to guarantee both high efficiency and fast convergence. Though the most expensive computations are implemented in fixed-point, we achieve a rate of convergence that is comparable to the floating-point version. The time taken for conversion between fixed- and floating-point is also shown to be reasonably low.

1. Introduction

Among the large number of dedicated VLSI architectures for neural networks developed in recent years, several of the most successful proposals have regarded digital implementations. Most of these dedicated processors are oriented toward the efficient execution of various learning algorithms with a strong accent on [...]; well-known examples in this field are CNAPS [13], Lneuro [18], MA-16 [20] and SPERT [28]: they are the building blocks for larger systems that exploit massive parallelism to achieve performances orders of magnitude greater than conventional workstations [21,4]. The common characteristic of these processors is the use of a fixed-point engine, typically 16 bits wide or less, for fast computation.

The drawback for the final user who wants to implement an algorithm for neural network learning on this kind of processor is the fixed-point format, which requires greater attention during the implementation compared with a conventional floating-point format.

* Corresponding author. Email: anguita@dibe.unige.it.
1 Email: gomes@icsi.berkeley.edu

PII S0165-6074(96)00012-9


Table 1
The MBP algorithm

Pseudo-code                                           # of operations        Point
/* Feed-forward */
for l := 1 to L
  S_l := S_{l-1} W_l                       (1.1)      2 N_P N_l N_{l-1}      fixed
  S_l := S_l + 1 b_l^T                     (1.2)      N_P N_l                fixed
  S_l := f{S_l}                            (1.3)      N_P N_l k_1            fixed
/* Error back-prop */
Δ_L := T - S_L                             (2.1)      N_P N_L                floating
Δ_L := Δ_L × f'{S_L}                       (2.2)      N_P N_L (1 + k_2)      floating
for l := L-1 to 1
  Δ_l := Δ_{l+1} W_{l+1}^T                 (2.3)      2 N_P N_{l+1} N_l      fixed
  Δ_l := Δ_l × f'{S_l}                     (2.4)      N_P N_l (1 + k_2)      fixed
/* Weight variation */
for l := 1 to L
  ΔW_l^new := S_{l-1}^T Δ_l                (3.1)      2 N_P N_l N_{l-1}      fixed
  Δb_l^new := Δ_l^T 1                      (3.2)      N_P N_l                fixed
  ΔW_l^new := η ΔW_l^new + α ΔW_l^old      (3.3)      3 N_l N_{l-1}          floating
  Δb_l^new := η Δb_l^new + α Δb_l^old      (3.4)      3 N_l                  floating
/* Weight update */
for l := 1 to L
  W_l := W_l + ΔW_l^new                    (4.1)      N_l N_{l-1}            floating
  b_l := b_l + Δb_l^new                    (4.2)      N_l                    floating
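For readers more comfortable with matrix libraries than with the tabular pseudo-code, the following NumPy sketch performs one MBP batch step following Table 1. It is our own illustration, not the paper's T0 code: the helper names are hypothetical, everything is kept in floating point, and the usual sigmoid is assumed as activation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mbp_step(W, b, dW_old, db_old, S0, T, eta, alpha):
    """One batch step of Matrix Back Propagation following Table 1.

    W[l-1], b[l-1] hold the paper's W_l, b_l; S0 is N_P x N_0; T is N_P x N_L.
    Everything is computed in floating point here for clarity.
    """
    L = len(W)
    S = [S0]
    # Feed-forward: steps (1.1)-(1.3)
    for l in range(L):
        S.append(sigmoid(S[l] @ W[l] + b[l]))          # S_l := f{S_{l-1} W_l + 1 b_l^T}
    # Error back-propagation: steps (2.1)-(2.4), with f'(s) = s (1 - s) for the sigmoid
    D = [None] * (L + 1)
    D[L] = (T - S[L]) * S[L] * (1.0 - S[L])
    for l in range(L - 1, 0, -1):
        D[l] = (D[l + 1] @ W[l].T) * S[l] * (1.0 - S[l])
    # Weight variation with step eta and momentum alpha: steps (3.1)-(3.4)
    dW_new, db_new = [], []
    for l in range(1, L + 1):
        dW_new.append(eta * (S[l - 1].T @ D[l]) + alpha * dW_old[l - 1])
        db_new.append(eta * D[l].sum(axis=0) + alpha * db_old[l - 1])
    # Weight update: steps (4.1)-(4.2)
    for l in range(L):
        W[l] += dW_new[l]
        b[l] += db_new[l]
    return dW_new, db_new

The returned variations become dW_old and db_old for the following step, which is how the momentum term of (3.3)-(3.4) is carried from one iteration to the next.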

This is not a new problem; in fact, both analog and digital implementations of neural networks suffer from some constraint due to physical limitations. For this reason, the effect of discretization on feed-forward networks and back-propagation learning received some attention shortly after the introduction of the algorithm [10,5,15]. Most of the results indicate that a representation of 16 bits for the fixed-point format is reliable enough to obtain reasonable results with on-line back-propagation. On the other hand, despite this general agreement, there has been some effort to reduce the precision needed during the computation [14,22], mainly because the effect of the discretization during learning is not completely understood and it seems to be both problem and algorithm dependent. In fact, there are many variations of the BP algorithm and each of them can show different sensitivity to the approximations caused by the fixed-point arithmetic, leading to different convergence problems. Some theoretical results on the precision issue have been found [1,23], but often they rely on difficult-to-predict parameters (e.g. the number of iterations to convergence).

One solution to overcome these limitations is to mix conventional floating-point operations with fixed-point operations when required. An example of this approach is [12], where the feed-forward and the backward phase of the algorithm are computed in fixed- and floating-point format respectively. However, this solution does not address the efficiency issue because the most computationally expensive part of the algorithm (the backward phase) is still performed in floating-point format, losing all the advantages of a fast fixed-point engine.


We show here a mixed floating/fixed-point implementation of Matrix Back Propagation (MBP) [2] that isolates the most computationally expensive steps of the algorithm and implements them efficiently in fixed-point format. Other parts of the algorithm, with less demand in terms of computational power but with more critical needs in terms of accuracy, are implemented in conventional floating-point format. The target architecture is the neuroprocessor T0, but the method is of general validity.

Despite the need for conversions between the two formats and the simulation of the floating-point operations in software, good performances are obtainable with reasonably large networks, showing a high efficiency in exploiting the T0 hardware.

The following section describes the learning algorithm implemented. Section 3 describes the mixed floating/fixed-point approach. Section 4 summarizes the main characteristics of T0. Section 5 shows the implementation details and performance evaluation, and Section 6 compares the effect of the mixed approach with floating-point learning.

2. Matrix back propagation

In Table 1 the MBP algorithm is summarized. It can be used to represent several BP learning algorithms with adaptive step and momentum [26,27]. The second column of the table contains the number of operations needed by each step. The third column indicates if the computation for each step is performed in fixed- or floating-point format (this choice will be explained in the following section). Bold letters indicate vectors or matrices.

We assume that our feed-forward network is composed of L layers of N_l neurons, with 0 < l <= L. The weights for each layer are stored in matrices W_l of size N_{l-1} × N_l and the biases in vectors b_l of size N_l. The learning set consists of N_P patterns. Input patterns are stored in matrix S_0 in row order and target patterns similarly in matrix T. The order of storing is particularly important for the efficiency of the implementation: if the patterns are stored in row order, the elements of each pattern lie in consecutive memory locations and can be accessed with no performance penalty on the vast majority of current processor architectures, including T0. Matrices S_1, ..., S_L contain the output of the corresponding layer when S_0 is applied to the input of the network. The size of S_l is N_P × N_l and the size of T is N_P × N_L. The back-propagated error is stored in matrices Δ_l of size N_P × N_l, and the variations of weights and biases computed at each step are stored respectively in matrices ΔW_l of size N_{l-1} × N_l and vectors Δb_l of size N_l. For simplicity, connections between non-consecutive layers are not considered.

The total number of operations of MBP is

n_op = 2 N_P ( 3 Σ_{l=1}^{L} N_l N_{l-1} - N_1 N_0 )                                   (5)
     + N_P [ (3 + k_1 + k_2) Σ_{l=1}^{L} N_l + N_L ] + 4 Σ_{l=1}^{L} N_l N_{l-1}       (6)
     + 4 Σ_{l=1}^{L} N_l,                                                              (7)

where k_1 and k_2 are respectively the number of operations needed for the computation of the activation function of the neurons and of its derivative. If the activation function is the usual sigmoid, then k_2 = 2.

On a conventional RISC, if each operation is completed in a single cycle, the total computational time is T ∝ n_cycles = n_op. On vector or multi-ALU processors like T0, the expected time is T ∝ n_cycles = n_op / P, where P is the number of ALUs. Obviously the implicit assumptions are: (a) there is no additional cost to load or store the data in memory, (b) one instruction can be issued every cycle, and (c) the order in which the operations are issued allows a complete exploitation of the ALUs. It has already been shown [2] that with a relatively small effort these constraints can be satisfied reasonably well on some RISCs. In Section 4 we will address this problem for T0.
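The operation count above is just the sum of the per-step costs of Table 1. As a sanity check, the short helper below (our own illustration, with hypothetical names) accumulates those per-step counts directly for an arbitrary list of layer sizes:

def mbp_operation_count(n, k1=6, k2=2):
    """Total MBP operations per batch, summing the '# of operations' column of Table 1.

    n = [N_P, N_0, N_1, ..., N_L]: number of patterns followed by the layer sizes.
    k1, k2 = operations per activation-function evaluation and per derivative.
    """
    NP, sizes = n[0], n[1:]
    L = len(sizes) - 1
    ops = 0
    for l in range(1, L + 1):
        Nl, Nprev = sizes[l], sizes[l - 1]
        ops += 2 * NP * Nl * Nprev + NP * Nl + NP * Nl * k1      # (1.1)-(1.3)
        ops += 2 * NP * Nl * Nprev + NP * Nl                     # (3.1)-(3.2)
        ops += 3 * Nl * Nprev + 3 * Nl                           # (3.3)-(3.4)
        ops += Nl * Nprev + Nl                                   # (4.1)-(4.2)
    NL = sizes[L]
    ops += NP * NL + NP * NL * (1 + k2)                          # (2.1)-(2.2)
    for l in range(1, L):
        ops += 2 * NP * sizes[l + 1] * sizes[l] + NP * sizes[l] * (1 + k2)   # (2.3)-(2.4)
    return ops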


3. The neuroprocessor T0

T0 belongs to the family of neuroprocessors with fast fixed-point capabilities and it will be the first implementation of the Torrent architecture [3]. It is tailored for neural-network calculations and inherits some of the features of a previous neuroprocessor [28]. The next implementation (T1) will be the building block for a massively parallel neuro-computer [4].

In particular, T0 is composed of a standard MIPS-II RISC engine [17] with no floating-point unit but with a fixed-point vector unit that can execute up to two operations per cycle on 8-word vectors, or, in other words, compute 16 results in a single cycle. This translates to a peak performance of 0.8 GOPS (Giga Operations per Second) if the processor is clocked at 50 MHz, or approximately 0.2 GCUPS (Giga Connection Updates per Second) for one-hidden-layer networks, a result comparable to supercomputer implementations. Fig. 1 summarizes the architecture of the vector unit. The two 8-word ALUs are VP0 and VP1, connected to the 32-bit vector register bank. Each vector register contains 32 elements, therefore each ALU can execute an operation on a complete vector in 4 cycles. The data path to/from the memory is 128 bits wide, allowing the loading/storing of eight 16-bit words in a single cycle.

4. The mixed format algorithm

We will explain here in detail the choice of the format for each step of the algorithm. The main idea is to perform the most computationally expensive part of the algorithm in fixed-point and resort to floating-point only where the computation must be particularly accurate.

Using Table 1 we can observe that the most expensive steps are (1.1), (2.3) and (3.1). They require O(n^3) operations (where n is in general the size of the problem), therefore they will be performed in fixed-point. Note that matrix S_0, which contains the input patterns, is likely to be already in fixed-point format in real-world applications, deriving, for example, from an A/D conversion. Step (1.2) can be easily computed in the same way.

Step (1.3) requires a function computation. With the use of the fixed-point format, this can be substituted with an indexed load from a table where the values of the function are pre-stored.

Before starting the error back-propagation, we can translate the output of the network to floating-point in order to have an accurate computation of the error (2.1) and its derivative (2.2). The interesting side-effect of performing these operations in floating-point is that we know (after step (2.2)) the numeric range of the error, therefore it is possible to choose a good fixed-point representation for the subsequent steps.

The next conversion is performed before steps (3.3) and (3.4) in order to compute with great accuracy the variation of the weights and biases of the network. Note that both η (the learning step) and α (the momentum term) are in general floating-point variables.

To summarize the algorithm: the conversion from fixed- to floating-point format must be performed at the end of the forward phase on matrix S_L and at the end of the backward phase on ΔW_l and Δb_l. The conversion from floating- to fixed-point format must be performed at the beginning of the forward phase on each W_l and b_l, and at the beginning of the backward phase on Δ_L.

5. Optimal implementation on the T0 neuroprocessor

If the implementation of an algorithm on T0 is optimal, in the sense that it can completely exploit its hardware, we can expect to have n_cycles = n_op / 16.
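To make the placement of the conversions concrete, here is a compact sketch of one mixed-format forward/backward pass. It is our own illustration under stated assumptions, not the T0 code: fixed-point arithmetic is only simulated by rounding (quantize), the helper names are hypothetical, and the range exponent E of the back-propagated error is chosen from its largest floating-point value, as described above.

import numpy as np

def quantize(x, E, bits=16):
    """Simulate the 16-bit fixed-point range [-2^E, 2^E - 2^(E-bits+1)] by rounding."""
    step = 2.0 ** (E - bits + 1)
    return np.clip(np.round(x / step) * step, -2.0 ** E, 2.0 ** E - step)

def range_exponent(x):
    """Smallest E such that max|x| fits in [-2^E, 2^E): the range choice of Section 4."""
    return int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-30)))

def mixed_forward_backward(W, b, S0, T, E_w=1, E_s=1):
    """One forward/backward pass with the conversions placed as in the text above."""
    L = len(W)
    # float -> fixed on each W_l and b_l at the beginning of the forward phase
    Wq = [quantize(Wl, E_w) for Wl in W]
    bq = [quantize(bl, E_w) for bl in b]
    S = [quantize(S0, E_s)]
    for l in range(L):                        # steps (1.1)-(1.3), the "fixed-point" part
        S.append(quantize(1.0 / (1.0 + np.exp(-(S[l] @ Wq[l] + bq[l]))), E_s))
    out = S[L]                                # fixed -> float on S_L (a no-op in this float simulation)
    D = [None] * (L + 1)
    D[L] = (T - out) * out * (1.0 - out)      # steps (2.1)-(2.2) in floating point
    E_d = range_exponent(D[L])                # pick the fixed-point range from the float error
    D[L] = quantize(D[L], E_d)                # float -> fixed on Delta_L before the backward phase
    for l in range(L - 1, 0, -1):             # steps (2.3)-(2.4), the "fixed-point" part
        D[l] = quantize((D[l + 1] @ Wq[l].T) * S[l] * (1.0 - S[l]), E_d)
    # steps (3.1)-(3.2) in fixed point; the results go back to floating point for (3.3)-(4.2)
    dW = [S[l - 1].T @ D[l] for l in range(1, L + 1)]
    db = [D[l].sum(axis=0) for l in range(1, L + 1)]
    return dW, db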


[Fig. 1. Architecture of the T0 vector unit: eight lanes with conditional-move, load-align and store-drive stages, connected to a 128-bit memory bus (membus<127:0>).]

For this reason, we will refer to an algorithm as asymptotically optimal for T0 (or simply optimal) if the efficiency E of its implementation goes to 1 as the size of the problem (N_P, N_l) grows. In other words: E = n_op / (16 n_cycles) → 1. Our purpose is to show that MBP can be implemented optimally in this sense, even though some of the computations are done in floating-point and must be simulated in software.

As mentioned before, the computational load belongs to steps (1.1), (2.3) and (3.1).


Table 2
Scalar and vectorized matrix products

(1.1)  S_l = S_{l-1} · W_l

  Scalar:
    for i := 0 to N_P - 1
      for j := 0 to N_l - 1
        for k := 0 to N_{l-1} - 1
          s^l[i,j] += s^{l-1}[i,k] * w^l[k,j]

  Vectorized:
    for j := 0 to N_l - 1 step V_L
      for i := 0 to N_P - 1 step U
        for k := 0 to N_{l-1} - 1
          s^l[i,     j..j+V_L] += s^{l-1}[i,     k] * w^l[k, j..j+V_L]
          ...
          s^l[i+U-1, j..j+V_L] += s^{l-1}[i+U-1, k] * w^l[k, j..j+V_L]

(2.3)  Δ_l = Δ_{l+1} · W_{l+1}^T

  Scalar:
    for i := 0 to N_P - 1
      for j := 0 to N_l - 1
        for k := 0 to N_{l+1} - 1
          d^l[i,j] += d^{l+1}[i,k] * w^{l+1}[j,k]

  Vectorized:
    for i := 0 to N_P - 1
      for j := 0 to N_l - 1 step V
        for k := 0 to N_{l+1} - 1 step V_L
          d^l[i, j]     += d^{l+1}[i, k..k+V_L] · w^{l+1}[j,     k..k+V_L]
          ...
          d^l[i, j+V-1] += d^{l+1}[i, k..k+V_L] · w^{l+1}[j+V-1, k..k+V_L]

(3.1)  ΔW_l = S_{l-1}^T · Δ_l

  Scalar:
    for i := 0 to N_{l-1} - 1
      for j := 0 to N_l - 1
        for k := 0 to N_P - 1
          dw^l[i,j] += s^{l-1}[k,i] * d^l[k,j]

  Vectorized:
    for j := 0 to N_l - 1 step V_L
      for i := 0 to N_{l-1} - 1 step U
        for k := 0 to N_P - 1
          dw^l[i,     j..j+V_L] += s^{l-1}[k, i]     * d^l[k, j..j+V_L]
          ...
          dw^l[i+U-1, j..j+V_L] += s^{l-1}[k, i+U-1] * d^l[k, j..j+V_L]

To compute these steps, three matrix multiplications must be performed: (1.1) is a conventional matrix product, (2.3) is a matrix product with the second matrix transposed, and (3.1) is a matrix product with the first matrix transposed.

The scalar forms of the three operations are shown in pseudo-code in Table 2, together with the corresponding vectorized versions. V_L is the vector register length (32 in the current implementation of T0) and U, V are the unrolling depths needed to fill the processor pipelines. The increase of the unrolling depth shifts the balance of the loop from memory-bound to CPU-bound, therefore extra cycles are available for the memory port to load (store) the operands while the processor is computing the arithmetic operations. The unrolling depth is limited by the number of registers available for storing intermediate results: in our case U = 8 and V = 2.

As can be easily noted, the vectorized version performs its vector references to each matrix in row order to exploit the memory bandwidth of T0. In fact, the use of stride-1 access to memory allows the processor to load an entire 8-word vector (of 16 bits) in a single cycle, while a generic stride-n access to memory (n > 1) requires one cycle per element.

We will assume in the following text that all the matrix dimensions are a multiple of V_L. If this is not the case, there is some overhead due to an underutilization of the vector unit, but it does not affect the asymptotical behavior of the implementation. For an exact computation of the number of cycles in the general case, the reader can refer to Table 3.
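As a concrete, if slow, model of the vectorized (1.1) kernel, the Python sketch below reproduces the register-blocking structure of Table 2: U output rows are kept "in registers" while one V_L-wide, stride-1 slice of a row of W_l is loaded and reused U times. This is our own illustration; on T0 each inner statement is a single vector multiply-add instruction.

import numpy as np

def blocked_matmul(S_prev, W, U=8, V_L=32):
    """Compute S = S_prev @ W with the U x V_L register blocking of Table 2, step (1.1).

    Assumes, as in the text, that N_P is a multiple of U and N_l a multiple of V_L.
    """
    N_P, N_prev = S_prev.shape
    N_l = W.shape[1]
    S = np.zeros((N_P, N_l), dtype=S_prev.dtype)
    for j in range(0, N_l, V_L):              # one V_L-wide strip of the output
        for i in range(0, N_P, U):            # U output rows kept in vector registers
            acc = np.zeros((U, V_L), dtype=S_prev.dtype)
            for k in range(N_prev):           # inner loop: stride-1 load of w^l[k, j:j+V_L]
                w_row = W[k, j:j + V_L]
                for u in range(U):            # the U unrolled vector multiply-adds
                    acc[u] += S_prev[i + u, k] * w_row
            S[i:i + U, j:j + V_L] = acc       # store the block back at the end of the loop
    return S

For conforming shapes, np.allclose(blocked_matmul(A, B), A @ B) holds, so the blocking changes only the access pattern, not the result.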


Table 3
Number of cycles for MBP on T0 in the general case

Step    n_cycles
(1.1)   (4 U N_{l-1} + 4 U) ⌈N_P/U⌉ ⌈N_l/V_L⌉
(1.2)   (8 ⌈N_l/V_L⌉ + 1) N_P
(1.3)   1.5 N_P ⌈N_l/V_L⌉ V_L
(2.1)   k_f N_P ⌈N_L/V_L⌉ V_L
(2.2)   3 k_f N_P ⌈N_L/V_L⌉ V_L
(2.3)   (4 V ⌈N_{l+1}/V_L⌉ + 20 V) N_P ⌈N_l/V⌉
(2.4)   12 ⌈N_l/V_L⌉ N_P
(3.1)   (4 N_P U + 4 U) ⌈N_l/V_L⌉ ⌈N_{l-1}/U⌉
(3.2)   (2 N_P + 4) ⌈N_l/V_L⌉
(3.3)   3 k_f N_l ⌈N_{l-1}/V_L⌉ V_L
(3.4)   3 k_f ⌈N_l/V_L⌉ V_L
(4.1)   k_f N_l ⌈N_{l-1}/V_L⌉ V_L
(4.2)   k_f ⌈N_l/V_L⌉ V_L

Table 4
Number of cycles for optimized matrix multiplications

Step    n_cycles
(1.1)   N_P N_l N_{l-1}/8 + N_P N_l/8
(2.3)   N_P N_l N_{l+1}/8 + 20 N_P N_l
(3.1)   N_P N_l N_{l-1}/8 + N_l N_{l-1}/8

Table 4 shows the number of cycles needed by T0 to perform the optimized matrix multiplications. Step (1.1) requires four cycles in the inner loop to compute a single vector multiplication/addition and four cycles to store each result back in memory at the end of the loop. The load of element S_{i,k} can be overlapped with the computation thanks to the unrolling of the external loop. It is easy to prove the optimality of (1.1):

E^(1.1) = n_op^(1.1) / (16 n_cycles^(1.1)) = 2 N_P N_l N_{l-1} / [16 (N_P N_l N_{l-1}/8 + N_P N_l/8)] = 1 / (1 + 1/N_{l-1}) → 1.      (8)

The second product (2.3) can be seen as a sequence of dot-products. This operation is not directly implemented on T0 and needs about 20 cycles for a vector of V_L = 32 words. This problem is known and could eventually be solved in future releases of the processor [3]. In any case, the overhead due to the absence of the dot-product is not particularly annoying when dealing with matrix products: in fact, partial dot-products of length V_L can be kept in vector registers and the final result can be computed at the end of the inner loop. Note that matrix-vector products (used in the standard BP algorithm) would suffer from a bigger overhead; in our case the absence of an implemented dot-product appears only in the second-order term and becomes negligible for large problems.

The third product (3.1) is similar to (1.1), but with a different order of the loops.

Other steps performed in fixed-point format are: the computation of the output of each neuron through its activation function (1.3), the computation of its derivative in the internal layers (2.4), the bias addition in the feed-forward phase (1.2) and the bias computation in the backward phase (3.2).

The computation of the activation function is quite expensive if it is done using a floating-point math library [11], and it would cause a large penalty on T0 due to the absence of a floating-point unit. Yet, if (1.3) is performed in fixed-point format, the activation function can be easily computed using a look-up table of size 2^B, where B is the number of bits of the fixed-point format [6]. The vector unit of T0 is provided with a vector instruction to perform indexed loads, so the number of cycles needed to compute the value using the table is only about 1.5 per element.

The pseudo-code for steps (1.2), (2.4) and (3.2) is shown in Table 5. The three loops are memory-bounded, therefore the number of cycles is easy to compute (assuming sufficient unrolling). All the other steps are done in floating-point format.

Finally, we must take into account the conversion of the matrices from floating- to fixed-point format and vice versa. The scalar conversion takes about 46 cycles/element on T0, but it is possible to lower this number using the vector unit.
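Going back to step (1.3): with a B-bit fixed-point format the activation function can be tabulated once and then applied with indexed loads. The snippet below is our own small simulation of that idea (the table size, range exponent and helper names are assumptions for illustration, not values from the paper):

import numpy as np

B = 16                                    # bits of the fixed-point format
E = 3                                     # exponent of the most significant digit: range [-2^E, 2^E)
step = 2.0 ** (E - B + 1)
codes = np.arange(-2 ** (B - 1), 2 ** (B - 1), dtype=np.int32)
# Pre-store f(x) for every representable input code, encoded in the same fixed-point format.
SIGMOID_TABLE = np.round(1.0 / (1.0 + np.exp(-codes * step)) / step).astype(np.int16)

def sigmoid_fixed(x_fx):
    """Apply the sigmoid to an array of B-bit fixed-point codes via table look-up."""
    return SIGMOID_TABLE[x_fx.astype(np.int32) + 2 ** (B - 1)]

The whole table has 2^B = 65536 entries, i.e. 128 KB of int16 values here, which is the memory/accuracy trade-off implied by the text.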


Table 5
Other vectorized operations

Step    Pseudo-code
(1.2)   for i := 0 to N_P - 1
          for j := 0 to N_l - 1 step V_L
            s^l[i, j..j+V_L] += b^l[j..j+V_L]

(2.4)   for i := 0 to N_P - 1
          for j := 0 to N_l - 1 step V_L
            d^l[i, j..j+V_L] := d^l[i, j..j+V_L] * (1 - s^l[i, j..j+V_L]) * s^l[i, j..j+V_L]

(3.2)   for i := 0 to N_P - 1
          for j := 0 to N_l - 1 step V_L
            db^l[j..j+V_L] += d^l[i, j..j+V_L]

For vectors between 100 and 1000 elements, the translation from floating-point to fixed-point format requires only k_fx = 2.6 to 1.8 cycles/element and k_xf = 3.6 to 2.5 cycles/element for the inverse conversion (these figures have been measured experimentally). The total number of cycles needed for the conversions is

n_cycles^conv = (k_xf + k_fx) [ Σ_{l=1}^{L} N_l (N_{l-1} + 1) + N_P N_L ].      (9)

T0 does not implement the floating-point unit of the MIPS architecture, so the floating-point operations must be simulated in software. Currently the RISC core is used to perform the simulation, but an IEEE-compatible floating-point library that uses the vector unit is under development and the expected performance will be in the range of 10 to 50 cycles/element. Then the number of cycles for the floating-point steps of the algorithm will be n_cycles^fl = k_f n_op^fl, with k_f ∈ [10, 50].

We now have all the elements to compute the number of cycles needed by T0 to execute MBP. Its leading term is

n_cycles ≈ (N_P/8) ( 3 Σ_{l=1}^{L} N_l N_{l-1} - N_1 N_0 ),      (10)

followed by second-order terms (11) and (12), which collect the conversion and floating-point overheads (weighted by k_f, k_fx and k_xf) together with the remaining per-pattern and per-weight costs of Table 3, and by a first-order term (13) proportional to L N_P and to Σ_{l=1}^{L} N_l.

If we compare the O(n^3) term (10) with the corresponding term of n_op, we can easily deduce the optimality of this implementation of MBP.

Obviously, the asymptotical behavior of MBP on T0 is not of primary importance when dealing with real-world applications. It is therefore interesting to analyze the second-order (11), (12) and first-order (13) terms of the above expression. First of all, we note that the overhead due to the conversions from fixed- to floating-point and vice versa depends mainly on the size of the network and only marginally on the dimension of the training set, as can be seen from the second term of (11) and the first term of (12). The dependence on the size of the training set is controlled by the number of neurons of the output layer (N_L), so we expect better performance when dealing with networks with a small number of outputs (e.g. classification problems, as opposed to encoding problems [8]). If this is not the case, some techniques to reduce the number of output neurons in classification problems can be applied [19].
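To give a feel for the weight of Eq. (9), the helper below (our own illustration, reusing the layer-size convention of the earlier sketches) evaluates the conversion overhead and compares it with the leading term (10):

def conversion_cycles(n, k_fx=2.6, k_xf=3.6):
    """Cycles spent on float<->fixed conversions, Eq. (9)."""
    NP, sizes = n[0], n[1:]
    L = len(sizes) - 1
    net = sum(sizes[l] * (sizes[l - 1] + 1) for l in range(1, L + 1))
    return (k_xf + k_fx) * (net + NP * sizes[L])

def leading_cycles(n):
    """Leading (third-order) term of the cycle count, Eq. (10)."""
    NP, sizes = n[0], n[1:]
    L = len(sizes) - 1
    s = sum(sizes[l] * sizes[l - 1] for l in range(1, L + 1))
    return NP / 8.0 * (3 * s - sizes[1] * sizes[0])

# Example: the speech network of Table 6 with a (hypothetical) N_P = 1000 patterns.
speech = [1000, 234, 1000, 69]
print(conversion_cycles(speech) / leading_cycles(speech))   # prints the ratio: a few percent here

This matches the qualitative claim above: for reasonably large training sets the conversion cost is a small fraction of the total.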

[Fig. 2. Efficiency and performance in MCUPS of MBP on T0 as functions of the training-set size N_P and the network size N (efficiency axis from 0 to 1).]

There is also an explicit dependence in the first-order term (13) on the number of layers of the network (L). This term is of small importance, being of first order, but we can expect an increase of overhead in networks with a very large number of layers. However, this is not a common case, as a large number of layers is not theoretically justified [9] and practical applications seldom require more than four layers (see, for example, [16] for a real problem that requires such an architecture).

To sketch the behavior of MBP on T0, we can simplify both the expressions for n_op and n_cycles assuming N_l ≈ N and plot the efficiency and the performance in MCUPS (Fig. 2) as functions of the size of the training set (N_P) and the network (N).
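Under this simplification the asymptotic figure is easy to regenerate. The sketch below is our own rough estimate using only the leading term (10) and the 50 MHz clock of Section 3, so it ignores the lower-order overheads that shape the curves of Fig. 2 for small problems; it does reproduce the asymptote of about 160 MCUPS quoted below for a one-hidden-layer network:

def mcups(NP, N, clock_hz=50e6):
    """Rough connection-updates-per-second estimate for an N-N-N network on T0.

    Uses only the leading cycle term (10); conversion, activation and floating-point
    overheads are ignored, so this slightly overestimates for small problems.
    """
    connections = 2 * N * N                       # weights updated per pattern
    cycles = NP / 8.0 * (3 * 2 * N * N - N * N)   # Eq. (10) with N_l = N and L = 2
    seconds = cycles / clock_hz
    return NP * connections / seconds / 1e6

print(round(mcups(10000, 500)))   # -> 160, independent of NP and N in this approximation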


Table 6
Some real-world applications

Name               N_0   N_1    N_2   N_3   Description
NETtalk [24]       203   80     26    -     Pronunciation of text
Neurogammon [25]   459   24     24    1     Backgammon player
Speech [7]         234   1000   69    -     Speech recognition

We assume k_1 = 6 to compute n_op (as suggested in [11]) and the worst case for the floating-point and conversion routines on T0 (k_f = 50, k_fx = 4, k_xf = 3) to compute n_cycles. The asymptotic performance is 160 MCUPS; obviously, the asymptotic performance of a generic RISC processor with the same clock and only one FPU would be 10 MCUPS.

Fig. 2 allows us to easily understand the behavior of the implementation, but it is of little practical use due to the peculiar network architecture. For this reason we show here the performance of MBP on T0 with networks that have been used in some real-world applications (Table 6).

Fig. 3 summarizes the performance for the applications mentioned above. It is interesting to note that, for all problems, the number of patterns for which half of the peak performance is attained (n_{1/2}) is reasonably small (N_P ≈ 500).

6. Learning with the mixed format algorithm

To test the effectiveness of the mixed format algorithm we chose the speech recognition problem described in the previous section.

Fig. 4 shows the learning on a subset of the speech database with different ranges of the fixed-point variables. In particular, E is the exponent of the most significant digit of the fixed-point format. With 16-bit words we can represent values in the range [-2^E, 2^E - 2^(E-15)].

It is clear that the error back-propagation is quite sensitive to the range of the fixed-point format. If the fixed-point representation is too coarse (e.g. E = 2), the algorithm tends to get stuck due to the underflow of the back-propagated error. However, thanks to the use of the mixed format, it is possible to choose a good range for the fixed-point variables before starting the error back-propagation, because the error computation in the last layer is done in floating-point format. The choice of the correct range can easily be done by looking at the largest floating-point value. In this case, the learning with mixed format is comparable to the learning in floating-point format in terms of number of learning steps but, of course, far more efficient from a computational point of view.

7. Conclusions

We have detailed here an efficient implementation of a back-propagation algorithm on T0. The use of the mixed fixed/floating-point mode in the implementation shows good performance with real-world networks, both in terms of the efficiency of computation and in terms of the convergence rate. The limited precision supported by the hardware is not a problem provided the range is appropriately chosen. The mixed model computes the output layer's error using floating-point, and uses the floating-point values to determine an appropriate range for the following fixed-point steps.

This work shows that digital neuroprocessors, and particularly T0, can be efficient test beds for various BP-type algorithms, even when limited by fixed-point formats.


[Fig. 3. Performance in MCUPS of MBP on T0 for the networks of Table 6, as a function of the number of patterns N_P.]

[Fig. 4. Learning on a subset of the speech database for different ranges E of the fixed-point variables: error versus iteration (0 to 50).]

Acknowledgments

Thanks to David Johnson for providing the emulation routines for fixed- and floating-point math, Nikki Mirghafori for providing the speech database, and Professor Nelson Morgan for suggestions on the learning algorithm. We would also like to thank two anonymous reviewers for their suggestions on how to improve this paper.

This work was developed while D. Anguita was visiting researcher at ICSI, Berkeley, USA, under a grant of "CNR - Consiglio Nazionale Ricerche", Italy.


References

[1] C. Alippi and M.E. Negri, Hardware requirements for digital VLSI implementations of neural networks, Int. Joint Conf. on Neural Networks, Singapore (1991) pp. 1873-1878.
[2] D. Anguita, G. Parodi and R. Zunino, An efficient implementation of BP on RISC-based workstations, Neurocomputing 6 (1994) 57-65.
[3] K. Asanović, J. Beck, B. Irissou, D. Kingsbury, N. Morgan and J. Wawrzynek, The T0 vector microprocessor, Hot Chips VII Symposium, Stanford Univ. (13-15 Aug. 1995).
[4] K. Asanović, J. Beck, J. Feldman, N. Morgan and J. Wawrzynek, Designing a connectionist network supercomputer, Int. J. Neural Systems 4(4) (Dec. 1993) 317-326.
[5] K. Asanović and N. Morgan, Experimental determination of precision requirements for back-propagation training of artificial neural networks, in Proc. of 2nd Int. Conf. on Microelectronics for Neural Networks, Munich, Germany (16-18 Oct. 1991) pp. 9-15.
[6] V. Bochev, Distributed arithmetic implementation of artificial neural networks, IEEE Trans. on Signal Processing 41(5) (May 1993).
[7] H. Bourlard and N. Morgan, Continuous speech recognition by connectionist statistical methods, IEEE Trans. on Neural Networks 4(6) (Nov. 1993) 893-909.
[8] S. Carrato, A. Premoli and G.L. Sicuranza, Linear and nonlinear neural networks for image compression, in Digital Signal Processing, V. Cappellini and A.G. Constantinides, eds. (Elsevier, Amsterdam, 1991) pp. 526-531.
[9] G. Cybenko, Approximation by superposition of a sigmoidal function, Math. of Control, Signals, and Systems 2 (1989) 303-314.
[10] D.D. Caviglia, M. Valle and G.M. Bisio, Effect of weight discretization on the back propagation learning method: Algorithm design and hardware realization, Proc. of IJCNN '90, San Diego, USA (17-21 June 1990) pp. 631-637.
[11] A. Corana, C. Rolando and S. Ridella, A highly efficient implementation of back-propagation algorithm on SIMD computers, in High Performance Computing, Proc. of the Int. Symp., Montpellier, France (22-24 March 1989), J.-L. Delhaye and E. Gelenbe, eds. (Elsevier, Amsterdam, 1989) pp. 181-190.
[12] E. Fiesler, A. Choudry and H.J. Caulfield, A universal weight discretization method for multi-layer neural networks, IEEE Trans. on SMC, to appear.
[13] D. Hammerstrom, A VLSI architecture for high-performance, low-cost, on-chip learning, Proc. of the IJCNN '90, San Diego, USA (17-21 June 1990) pp. 537-544.
[14] M. Hoehfeld and S.E. Fahlman, Learning with limited numerical precision using the cascade-correlation algorithm, IEEE Trans. on Neural Networks 3(4) (July 1992) 602-611.
[15] P.W. Hollis, J.S. Harper and J.J. Paulos, The effect of precision constraints in a backpropagation learning network, Neural Computation 2(3) (1990).
[16] M.A. Kramer, Nonlinear principal component analysis using autoassociative neural networks, AIChE J. 37(2) (Feb. 1991) 233-243.
[17] G. Kane and J. Heinrich, MIPS RISC Architecture (Prentice Hall, Englewood Cliffs, NJ, 1992).
[18] N. Mauduit, M. Duranton, J. Gobert and J.A. Sirat, Lneuro 1.0: a piece of hardware LEGO for building neural network systems, IEEE Trans. on Neural Networks 3(3) (May 1992) 414-421.
[19] N. Morgan and H. Bourlard, Factoring networks by a statistical method, Neural Computation 4(6) (Nov. 1992) 835-838.
[20] U. Ramacher et al., eds., VLSI Design of Neural Networks (Kluwer Academic, Dordrecht, 1991).
[21] U. Ramacher et al., SYNAPSE-X: a general-purpose neurocomputer, Proc. of the 2nd Int. Conf. on Microelectronics for Neural Networks, Munich, Germany (Oct. 1991) pp. 401-409.
[22] S. Sakaue, T. Kohda, H. Yamamoto, S. Maruno and Y. Shimeki, Reduction of required precision bits for back-propagation applied to pattern recognition, IEEE Trans. on Neural Networks 4(2) (March 1993) 270-275.
[23] J.A. Sirat, S. Makram-Ebeid, J.L. Zorer and J.P. Nadal, Unlimited accuracy in layered networks, Int. Conf. on Artificial Neural Networks, London (1989) pp. 181-185.
[24] T.J. Sejnowski and C.R. Rosenberg, Parallel networks that learn to pronounce English text, Complex Systems 1 (1987) 145-168.
[25] G. Tesauro and T.J. Sejnowski, A neural network that learns to play backgammon, in Neural Information Processing Systems, D.Z. Anderson, ed. (1987) pp. 442-456.
[26] T. Tollenaere, SuperSAB: fast adaptive back propagation with good scaling properties, Neural Networks 3(5) (1990) 561-573.
[27] T.P. Vogl, J.K. Mangis, A.K. Rigler, W.T. Zink and D.L. Alkon, Accelerating the convergence of the back-propagation method, Biological Cybernetics 59 (1989) 257-263.
[28] J. Wawrzynek, K. Asanović and N. Morgan, The design of a neuro-microprocessor, IEEE Trans. on Neural Networks 4(3) (May 1993) 394-399.


Davide Anguita obtained the "laurea" degree in Electronic Engineering from Genoa University in 1989. He worked at Bailey-Esacontrol in the field of wide-area distributed control systems, then he joined the Department of Biophysical and Electronic Engineering (DIBE) of Genoa University, where he received the Doctorate in Computer Science and Electronic Engineering. After a one-year visit to the International Computer Science Institute, Berkeley, CA, he is currently a postdoc research assistant at DIBE. His research activities cover neurocomputing and parallel architectures, including applications and implementation of artificial neural networks and the design of parallel and distributed systems.

Benedict Gomes received the B.S. degree in Computer Engineering from Case Western Reserve University, Cleveland, OH, and his M.A. in Computer Science from U.C. Berkeley. He is currently working on his PhD at UC Berkeley. His research centers around mapping structured connectionist networks onto general purpose parallel machines.
