
PSEUDO MIMD ARRAY PROCESSOR - AAP2

Toshio Kondo, Toshio Tsuchiya, Yoshihiro Kitamura, Yoshi Sugiyama, Takashi Kimura and Takayoshi Nakashima
NTT Electrical Communications Laboratories, 3-1, Morinosato Wakamiya, Atsugi-shi, Kanagawa 243-01, Japan

ABSTRACT
A highly integrated array processor LSI (AAP2-LSI) has been developed. After three years of study on the adaptive array processor 1 (AAP1), challenging improvements on the restraints of SIMD operation have been achieved by using the AAP2-LSI. The AAP2 array system makes it possible to carry out wideband modifiable operation (pseudo MIMD). Furthermore, each PE is capable of supporting a large amount of memory. The AAP2 potential for massively parallel and pipelined processing is discussed in the fields of image processing and CAD applications.

I. INTRODUCTION
Since the cellular array processor was proposed in 1958 (1), many studies have been carried out on both the machine and its applications. Illiac IV (2), DAP (Distributed Array Processor) (3), and MPP (Massively Parallel Processor) (4) are representative array processors. In 1982, we reported on a highly efficient adaptive array processor (AAP1) based on VLSI technology (5) (6). Since then, the AAP1 has been applied to a CAD system (7) (8). During the last 3 years, several features of the cellular array system that require improvement have been pointed out. To achieve greater efficiency and more flexible usage, more functions and a larger memory capacity must be incorporated in each processing element (PE). As is common in array processors, such versatility leads to an increase in the hardware scale, which may result in a reduction in total performance. In the present study, PE versatility in both the implemented functions and the accessible memory capacity is realized on the basis of VLSI technology as well as highly integrated packaging technology. The array processor AAP2 is now being developed, using highly integrated processing modules. The outstanding features of the AAP2 are pseudo MIMD processing and a large memory capacity of over 8 Kbits for each PE. This paper describes the hardware and software architecture, as well as the application potential of the AAP2.

0884-7495/86/0000/0330$01.00 1986 IEEE

II. Architecture
A preliminary study on the AAP1 array processor has been carried out for the past 3 years. The AAP1 system is shown in Fig.1. Each processing element is a one-bit processor composed of a 1 bit RALU and two data transfer units. It executes bit-serial operations in the form of rows and/or columns of the array. The AAP1 was designed as an SIMD (Single Instruction Multiple Data stream) machine. The SIMD operation can be modified by using a 2 bit control register in each PE. The 2 bit control data are stored in each PE's internal register file beforehand and are read out to the control register, where they are used to mask an instruction which is common to all PEs, so that the execution of each individual PE can be modified. This function gives the SIMD machine three basic operation modes, as shown in Fig.2. The AAP1 also supports a ripple-through data transfer operation in the array, which accomplishes a long-distance data transfer along a data path within one machine cycle. By combining the modification and the ripple-through data transfer scheme, the AAP1 array can carry out some interesting operations, such as a high-speed parallel accumulation of distributed multiple data.
In addition to the AAP1 LSI's features and system architecture, the AAP2 LSI is designed to have the following additional unique features: (i) wideband modifiable operations in the array (pseudo MIMD), and (ii) large memory capacity, expandability of up to 1 Mbit logical address space for every PE, and 2 bit-width data paths between adjacent PEs.
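The instruction masking described for the AAP1 can be illustrated with a minimal Python sketch. It assumes a single enable bit per PE; the paper does not give the encoding of the 2 bit control register, so the names `simd_step` and `enable` are hypothetical, and the sequential loop only emulates what the array would do in one machine cycle.

```python
def simd_step(values, operands, enable, op):
    """Apply one common instruction `op` to every PE.

    values, operands: one 1-bit value per PE.
    enable: per-PE mask bit (assumed stand-in for the 2 bit control data).
    A PE whose mask bit is 0 keeps its old value.
    """
    return [op(v, x) if en else v
            for v, x, en in zip(values, operands, enable)]


def xor(a, b):
    return a ^ b


# The same XOR instruction is broadcast to four PEs; PE 1 is masked off.
result = simd_step([1, 0, 1, 1], [1, 1, 0, 1], [1, 0, 1, 1], xor)
print(result)
```

With this data the masked PE keeps its value 0 while the other three PEs execute the XOR.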

[Figure panels: (a) PE unit operation (bit-serial operation); (b) word unit operation; (c) block unit operation. Labels: PE, WORD, BLOCK]

Fig.1 AAP1 System


The basic structure is the same as that of the AAP1. The AAP2-PE block diagram is shown in Fig.3. The AAP2-PE is composed of a versatile data transfer unit, a 1 bit ALU, a register file of 16 records x 9 words x 1 bit, and a specific control register of 16 bits. A micro instruction of 40 bits is simultaneously sent to every PE. The data path structure consists of an 8-directional data path and a 4-directional data path. These two types of data path make possible not only a high rate of data transfer but also the bypassing of PEs into which data are not required to be written. A wideband modifiable operation, or pseudo MIMD operation, is achieved by means of the 16 bit control register and the versatile data transfer unit. The schematic MIMD operation is depicted in Fig.4. Each PE can individually choose an appropriate data path from among 8 directions. Furthermore, the individual bit strings on each 2 bit-wide data path can be switched separately and independently. The logical and arithmetic operations are modified in a way similar to that of the transfer modification in the array.
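The per-PE choice of data path can be sketched in Python as follows. This is only an emulation under stated assumptions: the direction names, the `DIRS` table, the `transfer` function and the zero fill at the array border are all hypothetical, since the paper does not specify how the 16 bit control register encodes a direction.

```python
# (row, column) offsets of the 8 neighbor directions; names are illustrative.
DIRS = {
    "N": (-1, 0), "NE": (-1, 1), "E": (0, 1), "SE": (1, 1),
    "S": (1, 0), "SW": (1, -1), "W": (0, -1), "NW": (-1, -1),
}


def transfer(grid, choice, fill=0):
    """One transfer step: every PE reads from the neighbor it selected.

    grid:   2-D list of PE values.
    choice: 2-D list of direction names, one per PE (its private control data).
    fill:   value read when the selected neighbor lies outside the array.
    """
    rows, cols = len(grid), len(grid[0])
    out = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dr, dc = DIRS[choice[r][c]]
            sr, sc = r + dr, c + dc
            if 0 <= sr < rows and 0 <= sc < cols:
                out[r][c] = grid[sr][sc]
    return out


grid = [[1, 2],
        [3, 4]]
choice = [["E", "W"],
          ["N", "NW"]]
moved = transfer(grid, choice)
print(moved)
```

All four PEs receive the same transfer instruction, but each one reads from a different neighbor, which is the behavior the text calls pseudo MIMD.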

Fig.2 Basic Operating Modes


[Figure labels: EXT RAM; 4 neighbor PEs; 8 neighbor PEs; modifier; micro instruction]

Fig.3 A block diagram of the AAP2-PE


A micro-photograph of the AAP2-LSI is shown in Fig.5, with the LSI features summarized in Table 1. This LSI is mounted on a special package, which is designed to integrate the LSI and 8 external memory chips. The module in Fig.5(b) accommodates 64 PEs, or one AAP2-LSI, and has a 64 KByte memory capacity when 8 x 64 Kbit static RAMs are used. The integration technique, based on both LSI and packaging technologies, makes possible high-speed and highly efficient data processing in the fields of image processing and CAD.
A block diagram of the AAP2 array processor, now under construction, is shown in Fig.6. The AAP2 processor consists of a 256 x 256 PE array unit, an interface unit, and an array control unit, which is equipped with a data buffer memory, an instruction memory and a scalar processor. The array unit is composed of 1024 AAP2-LSIs. The AAP2 system is connected to a host processor via the interface unit. The host processor sends objective data and a set of micro instruction data to control all the PEs as well as the array control unit. When the array processor has completed the processing, the results are returned to the host processor.
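The sizing figures quoted above can be cross-checked with a few lines of Python. The constant names are hypothetical; the values are the ones stated in the text and in Table 1. Reading "over 8 Kbits" per PE as external RAM plus the register file is an interpretation, not something the paper states explicitly.

```python
PES_PER_LSI = 8 * 8            # 64 PEs on one AAP2-LSI
ARRAY_SIDE = 256               # 256 x 256 PE array
RAM_CHIPS_PER_MODULE = 8       # external memory chips per module
BITS_PER_RAM_CHIP = 64 * 1024  # 64 Kbit each

total_pes = ARRAY_SIDE * ARRAY_SIDE                       # PEs in the array
lsi_count = total_pes // PES_PER_LSI                      # LSIs needed
module_bits = RAM_CHIPS_PER_MODULE * BITS_PER_RAM_CHIP    # RAM bits per module
module_kbytes = module_bits // 8 // 1024                  # RAM KBytes per module
external_bits_per_pe = module_bits // PES_PER_LSI         # RAM bits per PE
register_file_bits = 16 * 9 * 1                           # 16 records x 9 words x 1 bit

print(total_pes, lsi_count, module_kbytes,
      external_bits_per_pe, register_file_bits)
```

The results agree with the text: 65,536 PEs need 1024 LSIs, each module carries 64 KBytes, and each PE sees 8 Kbits of external memory in addition to its 144 bit register file.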

Table 1. AAP2-LSI FEATURE

Item                           AAP2
Number of PEs                  8 x 8
Technology                     1.5 µm CMOS
Number of transistors          104,900
Chip size                      8.8 x 8.16 mm2
Minimum cycle time             100 ns
Power dissipation              300 mW
Number of pads                 160
Module                         164 pin LCCP* (64 k DRAM mounted)
Data transfer unit structure   Path 1: 4 neighbor PEs; Path 2: 8 neighbor PEs
Bypass function                Path 1, Path 2
Register file of each PE       16 Record x 9 Word x 1 bit

* Lead Chip Carrier Package

[Figure labels: EOR, OR, AND, NOT, NOP]

Fig.4 A schematic MIMD operation

(a) A micro photograph of the AAP2-LSI
(b) A photograph of the AAP2 module

Fig.5 AAP2 LSI and the module

[Figure: three 40 bit instruction formats. Fields shown: DTU cont., DM adrs., EXT RAM adrs. & cont., RF adrs. & cont., ALU func.]

DTU cont.             : data transfer unit control code
DM adrs.              : data buffer memory address
EXT RAM adrs. & cont. : external memory address and control code
RF adrs. & cont.      : register file address and control code
ALU func.             : ALU function code

Fig.7 Instruction format for AAP2-PE

[Fig.8 listing residue, not recoverable: SIMULATION, DEFAULT, MASK, DATA, RFB[143]]


The instruction format for the AAP2-PE is shown in Fig.7. There are three types of instruction. The first is for data transfer operations between the PEs and the array control unit. The second is for PE operations related to the expanded memory. The third is for operations in the PE array. A field of up to 20 bits is provided in the instruction for the external memory address.

Fig.8 An AAP2 program


An example of an AAP2 program is shown in Fig.8. The AAP2 program is translated into an AAP2 micro instruction sequence. The AAP2 language constructions for logical and arithmetic operations are just like those of APL. The constructions for data transfer operations are extensions of APL. The language constructions are closely related to the AAP2 hardware architecture.

[Figure labels: array control unit; IM; scalar processor; 256 x 256 PE array; interface unit; processor]

Fig.6 The AAP2 array processor

III. Applications
The AAP2 module can be applied to various kinds of image processing as well as to CAD applications. The inherent parallelism is expected to realize high-speed and efficient processing. In particular, routing in VLSI design can be carried out at a speed of 1000 MIPS by using the AAP2 array processor. The potential for other kinds of parallel processing is demonstrated here.

A. Visual pattern preprocessing.
Let us examine an image data processing problem in which, first, a grey tone level histogram is calculated, second, a threshold value is derived from the histogram, and finally, the image data is converted to binary data by using the threshold value. This processing scheme is typical and is used in a wide variety of applications. Special hardware for this process can be designed using the AAP2 modules in a two dimensional orthogonal array. The digitized data, arranged in an array as shown in Fig. 9(a), are sent into the AAP2 array and shifted down through the array. Each PE in the first row of the array detects level 0 data and counts the number of such data while the image data are shifted down through the array. Each PE in the second row counts level 1 data, the third row PEs count level 2 data, and so on.

(a) Pixel data
3 0 1 0
0 3 1 2
1 0 3 1
0 3 3 2

(b) Count held in each PE (one PE row per grey level)
Level 0: 2 2 0 1
Level 1: 1 0 2 1
Level 2: 0 0 0 2
Level 3: 1 2 2 0

(c) After sum-up along each PE row (histogram at the right edge)
Level 0: 2 4 4 5
Level 1: 1 1 3 4
Level 2: 0 0 0 2
Level 3: 1 3 5 5

Fig.9 A histogram calculation

[Figure panels: (a) temporary data 1, made from the distorted original image; (b) temporary data 2, made from the original image shifted by one bit in the upper direction; (c) corrected image, made from temporary data 1 and temporary data 2]

Fig.10 Distorted image correction

After the whole image data has been shifted through the array, each PE contains the number of data it has counted, as shown in Fig. 9(b). The counts in the PEs are transferred right and summed up along each PE row simultaneously by the ripple transfer operation. Then, the final sum of each PE row is obtained at the right edge PE of every row. That is, a histogram is obtained, as shown in Fig. 9(c). The threshold value is calculated easily from the histogram by a conventional method. After the obtained threshold value is broadcast to all PEs, a comparison operation between the threshold value and the original image data is executed in every PE. Finally, the binary image data are obtained.
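The histogram procedure described above can be emulated sequentially in Python. The sketch below uses the 4 x 4 pixel data of Fig. 9(a); the function names are hypothetical, the loops stand in for operations the array performs in parallel, and the direction of the final comparison (pixel greater than or equal to the threshold gives 1) is an assumption, since the paper leaves the thresholding rule to a "conventional method".

```python
def histogram_by_rows(image, levels):
    """Emulate the row-per-level counting and the rightward ripple sum."""
    cols = len(image[0])
    # counts[k][c]: number of level-k pixels seen by the PE in row k, column c.
    counts = [[0] * cols for _ in range(levels)]
    for row in image:                      # image rows shift down one by one
        for c, pixel in enumerate(row):
            for level in range(levels):    # every PE row tests its own level
                if pixel == level:
                    counts[level][c] += 1
    # Ripple transfer: partial sums move right along each PE row.
    sums = []
    for level_counts in counts:
        running, acc = [], 0
        for n in level_counts:
            acc += n
            running.append(acc)
        sums.append(running)
    histogram = [running[-1] for running in sums]  # right edge PE of each row
    return counts, sums, histogram


def binarize(image, threshold):
    """Compare every pixel with the broadcast threshold (assumed rule: >=)."""
    return [[1 if p >= threshold else 0 for p in row] for row in image]


image = [[3, 0, 1, 0],
         [0, 3, 1, 2],
         [1, 0, 3, 1],
         [0, 3, 3, 2]]
counts, sums, histogram = histogram_by_rows(image, levels=4)
print(histogram)
print(binarize(image, threshold=2))
```

The intermediate `counts` and `sums` arrays correspond to Fig. 9(b) and Fig. 9(c), and the histogram for this data is 5, 4, 2, 5 for levels 0 to 3.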

The histogram calculation on image data, for instance a 600 mm x 800 mm size image of 6800 x 4800 pixels, can be carried out within 25 seconds by using two AAP2 modules.

B. Distorted Image Correction.
A distorted image must be properly stretched and/or shrunk. In Fig.10, the pixels of the image data, distributed over the array, are indicated by a, b, c, ..., h. The PE to which each pixel should move in order to correct the distortion is assumed to be already known. The correction procedure is controlled by the AAP2 pseudo MIMD function. Figure 10(a) shows that temporary data 1 is made by an AND operation between the original data and the array data shown in the center of the figure. Figure 10(b) shows that temporary data 2 is made by an AND operation between the original image data, shifted by one bit in the upper direction, and the array data shown in the center. Figure 10(c) shows that the corrected image is given by an OR operation between temporary data 1 and temporary data 2. Special hardware for this processing operation is constructed on a 200 mm x 300 mm size printed circuit board, where only one AAP2 module is used. For 8 bit x 100 x 100 pixel data, the process is expected to run at a speed of 30 images/second.
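The AND/OR correction step can be sketched in Python as below. This is an illustration under assumptions: the mask arrays `keep_mask` and `take_below_mask` stand in for the "array data shown in the center" of Fig.10, whose actual contents are not recoverable here, and a mask entry is taken to be either 0 or all ones (0xFF for 8 bit pixels).

```python
def shift_up(image, fill=0):
    """Move every pixel one PE upward; the bottom row is filled with `fill`."""
    return image[1:] + [[fill] * len(image[0])]


def masked(image, mask):
    """Bitwise AND of each pixel with its PE's mask value."""
    return [[p & m for p, m in zip(p_row, m_row)]
            for p_row, m_row in zip(image, mask)]


def correct(image, keep_mask, take_below_mask):
    """Combine the unshifted and the shifted image under per-PE masks."""
    temp1 = masked(image, keep_mask)                   # temporary data 1
    temp2 = masked(shift_up(image), take_below_mask)   # temporary data 2
    return [[a | b for a, b in zip(a_row, b_row)]      # corrected image
            for a_row, b_row in zip(temp1, temp2)]


image = [[1, 2],
         [3, 4]]
keep_mask = [[0xFF, 0x00],
             [0xFF, 0xFF]]
take_below_mask = [[0x00, 0xFF],
                   [0x00, 0x00]]
corrected = correct(image, keep_mask, take_below_mask)
print(corrected)
```

In this toy case only the top right PE takes the pixel from the PE below it; every other PE keeps its own pixel.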

C. Feature extraction of character image.
Next, let us examine feature extraction processing of binary character images. The character feature is characterized by a 'run-length' and a 'crossing number' for each pixel, as shown in Fig.11. The run-length for each pixel is defined as the distance between the pixel and the nearest edge, where a "0" data area changes to a "1" data area, or vice versa. The crossing number for each pixel is defined as the number of times a line drawn from the pixel toward one of the eight directions crosses the boundaries between "0" data areas and "1" data areas. The set of run-lengths and crossing numbers for every pixel defines the character image feature.

[Figure labels: 8 directions; run-length to the character pattern edge; crossing number; (b) Run-length]

Fig.11 Character features definition

For example, the procedure to get the run-length in the right direction on equipment composed of AAP2 modules is shown in Figs.12(a)-(c). First, the original image data shown in Fig.12(a) are shifted left by one bit, and then an exclusive-OR operation is executed between the shifted data and the original data. Next, the data in the PE array, shown in Fig.12(a), are summed up in the left direction, while the partial sums are transferred left. In this procedure, a PE which has "1" data in Fig.12(b) does not allow the partial sum to enter the PE. Figure 12(c) shows the results, or the run-length array map. Repetition of the procedure gives the run-length maps for the other directions. Thus, AAP2 functions are very effective for parallel execution of character feature extraction.
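The run-length procedure in the right direction (shift left, exclusive-OR, then a blocked leftward sum) can be emulated row by row in Python. The function name is hypothetical, the pixel beyond the right border is assumed to be 0, and the right-to-left loop stands in for the ripple summation that the array would perform in parallel.

```python
def run_length_right(image):
    """Return, for each pixel, the run-length toward the right edge of its run."""
    result = []
    for row in image:
        n = len(row)
        shifted = row[1:] + [0]                    # image shifted left by one bit
        edge = [a ^ b for a, b in zip(row, shifted)]   # edge detection
        sums = [0] * n
        incoming = 0                               # partial sum arriving from the right
        for j in range(n - 1, -1, -1):
            if edge[j]:
                incoming = 0                       # an edge PE blocks the partial sum
            sums[j] = row[j] + incoming
            incoming = sums[j]                     # pass the partial sum to the left
        result.append(sums)
    return result


image = [[0, 1, 1, 1, 1, 0, 1, 1]]
print(run_length_right(image))
```

For the run of four "1" pixels the values fall from 4 at the left end to 1 at the right end, and "0" pixels get 0.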


D. Logic simulation for CAD.
The outstanding feature of the AAP2 pseudo MIMD function is demonstrated by logic simulation. One logic gate is assigned to one PE, and the whole interconnection network of an objective logic circuit is mapped onto the two dimensional PE array. Figure 13 shows a simple example of a logic circuit and its assignment to the PE array. In order to carry out logic simulation efficiently, different types of logical operations, such as AND, OR, NOR, NOT etc., must be executed simultaneously in the array. Multi-directional data paths must also be established. The AAP2 executes the multi-functional logical operations and the multi-directional data transfers in the whole array as follows. A set of 16 bit data, which can modify micro-instructions, is loaded into every PE's control register from the external data memory. The combination of the micro-instruction, which is common to every PE, and the individual control data gives the ALU and the data path switch in every PE their own individual function. This modification of micro-instructions enables the array processor to carry out flexible parallel processing. In particular, the two independent, programmable data paths in each PE make it possible to assign a shortcut path through PEs and to use each PE efficiently for the mapping network.
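The gate-per-PE idea can be illustrated with a small Python sketch. It is a simplification under assumptions: each PE's private control data is modeled as a tuple of an operation name and two source PE indices, which stand in for the routed data paths; the names `OPS`, `step` and `control` are hypothetical and do not reflect the real 16 bit control word.

```python
# Per-PE operations selected by private control data (1-bit values).
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "EOR": lambda a, b: a ^ b,
    "NOR": lambda a, b: 1 - (a | b),
    "NOT": lambda a, b: 1 - a,   # second input is ignored
    "NOP": lambda a, b: a,       # pass the first input through
}


def step(state, control):
    """One common 'evaluate' instruction, modified per PE by `control`.

    control[i] = (op, src_a, src_b): the gate held by PE i and the PEs
    whose outputs feed it. All PEs update simultaneously from `state`.
    """
    return [OPS[op](state[a], state[b]) for op, a, b in control]


# PE 0 and PE 1 hold the circuit inputs; PE 2 is an AND gate; PE 3 inverts it.
control = [("NOP", 0, 0), ("NOP", 1, 1), ("AND", 0, 1), ("NOT", 2, 2)]
state = [1, 1, 0, 0]
state = step(state, control)
state = step(state, control)
print(state)
```

After two steps PE 3 holds the NAND of the two inputs, even though every PE received the same instruction in each step.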

(a) Logic Circuit
(b) Logic Simulation on the AAP-2

Fig.13 Logic Simulation


Owing to the technique described above, the AAP2 array processor is expected to carry out logic simulation of a sufficiently large-scale logic circuit at a speed of about 10,000 MIPS. The AAP2 executes 10 times faster than the AAP1, which is the first generation of this project.

IV. Conclusion
A highly integrated array processor project has been described. The AAP2 is completely modularized in an integrated package consisting of a 64 PE VLSI and 64 KBytes of peripheral memory chips. The sophisticated functions of the AAP2, such as the pseudo MIMD control and the large memory capacity for each PE, will contribute towards the widespread application of array processors.

(a) A given binary image
(b) Edge detection
(c) Result. The number in each PE is the run-length in the right direction.

Fig.12 Parallel calculation on run-length

REFERENCES
(1) S. Unger, "A computer oriented toward spatial problems", Proc. IRE, vol. 46, pp. 1744-1750, 1958.
(2) G.H. Barnes, et al., "The ILLIAC IV computer", IEEE Trans. Computers, vol. C-17, pp. 746-757, 1968.
(3) P.M. Flanders, et al., "Efficient High Speed Computing with the Distributed Array Processor", in High Speed Computer and Algorithm Organization, New York: Academic, pp. 113-128, 1977.
(4) K.E. Batcher, "Design of a massively parallel processor", IEEE Trans. Computers, vol. C-29, pp. 836-840, Sept. 1980.
(5) T. Sudo, et al., "An LSI Adaptive Array Processor", 1982 ISSCC Digest of Technical Papers, pp. 122-123.
(6) T. Kondo, et al., "An LSI Adaptive Array Processor", IEEE J. Solid-State Circuits, vol. SC-18, no. 2, pp. 147-156, April 1983.
(7) T. Kondo, et al., "A Large Scale Cellular Array Processor: AAP-1", Proc. 1985 ACM Computer Science Conf., March 1985, pp. 100-111.
(8) T. Watanabe, et al., "Parallel Adaptable Routing Algorithm and its Implementation on a two dimensional array processor", IEEE Trans. CAD, submitted.
(9) T. Nakayama, "Algorithm for Histogram calculation and Median Filter on SIMD Computer", Tech. Paper of JEE, Pr183-16, 1983 (in Japanese).
