Reduced Hardware General Purpose Systolic Array Design
A. Adibi, Senior Member IEEE, H. Bonakdar
Electrical Engineering Department, Amirkabir University, Hafez Ave., Tehran, Iran

Abstract- Although there may be many general purpose systolic array designs and implementations that could be configured as particular systolic structures, their hardware complexity and flexibility remain a main question. In this paper we design a new systolic array system that realizes a range of algorithms without increasing the hardware significantly. The proposed concepts are: 1- Using a novel architecture to allow any number of desired time delays along the data paths. 2- Dividing instructions and data into two formats, tagged and untagged. 3- Suitable processing element (PE) architecture design.

I. INTRODUCTION

In recent years many array processors have been produced; they may be categorized as bit level [1-2], integer level [3,4] and floating point level [5,6]. Integer level systolic arrays like SPC [3] are simple but suffer from the low precision of 8-bit data. On the other hand, iWarp [7], a floating point systolic array, is powerful and general purpose, but some of its units, such as its communication unit, register file and control unit, consume a large area and are not fully utilized. In this paper we present a general purpose but simple systolic design concept that covers a wide range of algorithms. The architecture of our design and the description of the processing elements (PEs) are presented.

II. CATEGORIES AND CONCEPTS

Systolic arrays can be divided into categories by their particular features: the shape of the array (usually mesh or triangular), the data path lines (how many and in which direction), the delay of the data paths (how many latches in each data path) and the function of the PEs.
To design a flexible systolic array that covers the above categories, the following aspects have been used in this paper:

1 - Since the systolic array concept is based on sequential information flow, the ability to insert the desired time delays along the data paths enables the designer to change from one special architecture to another. CMOS switches offer significant benefits in this regard, as they are used to switch from a one-delay configuration to a configuration with two or more delays. They also simplify the design of the control unit. Fig. 1 displays the multi-delay systolic array configuration.

[Figure 1 legend: T1, T2, T3: switches; L1, L2, L3: latches; PE: processing element]
Fig. 1. The architecture of latches along data paths

For the one-delay systolic setting, as shown, the array control unit sends the proper signals to the gates of the T2 and T3 switches in order to bypass latches L2 and L3. Leaving switches T1 and T2 off and setting only T3 converts the array to a two-delay systolic system. In some newer designs, as in the case of pipelining, each PE operates on the data generated by the preceding PEs. But the functions of the PEs may differ, so there must be control over the time delays between PEs to synchronize operations along the array. Therefore the output of the first PE must be delayed by the proper time interval to allow the second PE to finish its operation, for example a multiplication. This is done through our proposed latch-delay method.

0-7803-2428-5/95 $4.00 © 1995 IEEE

2 - In designing the processing element architecture, since the PE functions are mostly arithmetic operations that may be implemented combinationally or sequentially (depending on the required speed of the operation), there is a degree of freedom to combine these two approaches according to the design specifications.
For example, for the multiplication operation one may use a ripple-carry full adder to perform a series of add-and-shift operations, or one may use a fully combinational logic network. Speeding up the operations through fully parallel processing obviously increases hardware complexity significantly. On the other hand, a more sequential structure results in a simple hardware architecture but decreases the speed of operation markedly. Therefore, depending on the required simplicity of the architecture or the required speed of the systolic array, one has to combine the two methods appropriately.

Fig. 2 depicts the architecture of the array. Each PE has two horizontal buses and one vertical bus. The array itself is controlled via four FIFOs connected to the four sides of the array. A clock line, a sync line and a serial instruction line are also distributed over the array. Fig. 3 shows the data path of one PE, which possesses three arithmetic units. Multiplication is designed on the basis of the modified Booth algorithm [6] and the Wallace tree [5], while the SRT algorithm [8-9] is used for division.

[Figs. 2 and 3, recoverable block labels: FIFO & interface, interface, memory; register file, add/sub, division, Wallace tree, normalize, latch]
Fig. 2. Architecture of Array
Fig. 3. PE data path

3 - There are also two kinds of instruction registers, tagged and untagged, to inform the PEs whether to operate on the data or to transfer it to the next PE. Instructions enter the PEs via the serial lines or via one of the three data buses. Microinstructions are embedded into the instruction format to deal with delays in the paths, routing and tags. The executable operations in the PEs are selected from general ones, resulting in a wide range of applications without complicating the design.
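The latch-delay idea of the first aspect can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class `DelayPath` and the helper `run` are hypothetical names, and the model abstracts the switches of Fig. 1 into a single parameter, the number of latches left in the path (1 to 3), rather than reproducing the actual T1-T3 wiring.

```python
from collections import deque


class DelayPath:
    """Behavioral model of one data path with 1 to 3 active latches.

    Assumption: bypassed latches contribute no delay, so only the
    number of active latches matters, not which switch bypasses which.
    """

    MAX_LATCHES = 3

    def __init__(self, delays):
        if not 1 <= delays <= self.MAX_LATCHES:
            raise ValueError("delays must be between 1 and 3")
        self.delays = delays
        # Each active latch initially holds None (no valid data yet).
        self.latches = deque([None] * delays, maxlen=delays)

    def clock(self, value):
        """Advance one clock edge: shift `value` in, return the value shifted out."""
        out = self.latches[0]
        self.latches.append(value)  # a full deque drops its oldest entry
        return out


def run(path, stream):
    """Feed a sequence of values through the path, one per clock."""
    return [path.clock(v) for v in stream]


if __name__ == "__main__":
    for d in (1, 2, 3):
        print(d, run(DelayPath(d), [1, 2, 3, 4]))
```

Under this model, switching a path from one delay to two delays shifts every output one clock later, which is how a slower downstream PE could be given time to finish its operation.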
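The modified Booth recoding mentioned for the PE multiplier can be sketched in software as follows. This is an illustrative integer version, not the paper's hardware design: the function names and the default 8-bit width are assumptions, and the partial products are simply added with `sum`, where the hardware would reduce them with a Wallace tree.

```python
def booth_radix4_digits(y, bits):
    """Recode a `bits`-wide two's-complement multiplier into radix-4 digits.

    Each digit lies in {-2, -1, 0, 1, 2}; digit i has weight 4**i.
    """
    if bits % 2:
        raise ValueError("bits must be even")
    if not -(1 << (bits - 1)) <= y < (1 << (bits - 1)):
        raise ValueError("multiplier does not fit in the given width")
    y &= (1 << bits) - 1          # two's-complement bit pattern
    digits = []
    prev = 0                      # implicit bit to the right of bit 0
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x, y, bits=8):
    """Multiply x by y using one partial product per radix-4 Booth digit."""
    partials = [(d * x) << (2 * i)
                for i, d in enumerate(booth_radix4_digits(y, bits))]
    return sum(partials)          # hardware would use a Wallace tree here


if __name__ == "__main__":
    print(booth_radix4_digits(7, 4))
    print(booth_multiply(-23, 57))
```

The recoding halves the number of partial products compared with a plain add-and-shift multiplier of the same width, which is the usual reason to pair it with a tree adder when speed matters more than hardware cost.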
III. CONCLUSION

We have introduced a new architecture for systolic design that generalizes and covers a wide range of special applications. Fig. 1, Fig. 2 and Fig. 3 display our proposed concepts towards a general purpose systolic array design. The design criteria rely on the selection of data paths through a number of time delays from the output of one PE to the input of the next PE, together with a suitable PE architecture that supports tagging of instructions and data.

REFERENCES

1) R. Schreiber, "The Sapy-1M: Architecture and algorithms," Proc. SPIE (highly parallel signal processing architecture), Vol. 614, 1986, pp. 92-95.
2) J.S. Ward, J.V. McCanny, and J.G. McWhirter, "Bit level systolic array implementation of the Winograd Fourier transform algorithms," IEE Proc., Pt. F, Vol. 12, No. 6, 1989, pp. 241-243.
3) J.V. McCanny and J.G. McWhirter, "Implementation of signal processing functions using 1-Bit systolic Array," Electron. Letts., Vol. 18, No. 6, 1986, pp. 241-243.
4) S.Y. Kung, VLSI Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1987.
5) C.S. Wallace, "A Suggestion for Fast Multipliers," IEEE Tran. Electronic Computers, Vol. EC-13, Feb. 1964, pp. 14-17.
6) A.D. Booth, "A Suggestion for Fast Multipliers," IEEE Tran. Electronic Computers, Vol. EC-13, Feb. 1964, pp. 14-17.
7) M. Annaratone, et al., "Warp computer: Architecture, Implementation, and Performance," IEEE Tran. on Comp., Dec. 1987, pp. 1523-1538.
8) K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.
9) C. Peterson, et al., "iWarp, A 100 MOPS LIW Microprocessor for Multicomputers," IEEE Micro, June 1991, pp. 26-29, 81-87.