
Proceedings of the 7th Korea-Russia International Symposium, KORUS 2003

DESIGN OF 32-BIT RISC PROCESSOR AND EFFICIENT VERIFICATION
Geun-young Jeong, Ju-sung Park
Dept. of Electronics Engineering, Pusan National University
San-30 Jangjeon-2 dong, Geumjeong-gu, Pusan 609-735, Korea
Tel: +82(51)5 101702, Fax: +82(51)5 155190
E-mail: gyjoung@pusan.ac.kr, juspark@pusan.ac.kr
Abstract
The design and verification of a 32-bit general-purpose microprocessor compatible with the
ARM7 RISC core are described. Architecturally, the processor has a 3-stage pipeline, 6 register
banks, a 32-bit ALU, and a 4-cycle MAC. The core uses a latch-based design for low power and
low complexity. Its functional operation was verified by comparing the results of logic
simulation with those of a commercial simulator. Each instruction and random combinations of
instructions were tested. The core was implemented on an FPGA to check its operation in
various applications, such as ADPCM (G.721 speech coding), SOLA (voice speed variation), and
MP3 decoding. It carried out those algorithms successfully.

Keywords: RISC, microprocessor, ARM7, FPGA

1. Introduction
Hardware systems today must be fast, low-power, and multifunctional to support better communication and information
services. To meet this requirement, the trend toward integrating all the major system functions into a single chip is
accelerating. The key technology in integrating a large amount of hardware and software into a single chip is not chip
fabrication but CPU core design. Hardware systems built on SOCs (Systems on Chip) with an embedded CPU core gain
considerable flexibility in implementing complex algorithms for information technology fields.
From this point of view, the ARM core family is widely used in information and communication equipment and digital
electric appliances. It is often used as a macrocell in application-specific, complex system-level chips because of its
small size and low-power design. Architecturally, the target processor contains a 3-stage pipeline, 6 register banks, a
32-bit ALU, and a 4-cycle multiply-accumulator (MAC). Moreover, 16-bit wide instructions, block data transfer, and
conditional instruction execution also improve processor performance.
In this paper, the design and verification procedure of a 32-bit RISC processor, compatible with the ARM7TDMI at the
instruction level, is described. Section 2 briefly describes the processor architecture, and the detailed design process
is presented in Section 3. The FPGA implementation, verification, and application results for several audio algorithms
are given in Section 4. Finally, Section 5 concludes this work.

2. Architecture
The core consists of six major functional blocks and three internal buses, as shown in Fig. 1. The register bank has 31
general registers and 6 status registers to store the processor state. It has two read ports and one write port, each of
which can access any register. The read ports are connected to the A bus and the B bus, respectively, as shown in Fig. 1.
Two operands read from the registers are transferred to the multiplier or the ALU through these two buses. The value on
the B bus is shifted by the barrel shifter and combined with the value on the A bus at the ALU. Since the multiplier uses
the existing shifter and ALU, the partial sum on the A bus, the partial carry on the B bus, and the partial product of
the multiplier are added in the ALU. Every result of the ALU, which can be an address or data, is transferred to the
register bank or the address register. The address register and incrementer select and hold all memory addresses and
generate sequential addresses, and the data registers hold data passing to and from memory. These operations are
controlled by the instruction decoder and the related logic.
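How a partial sum and a partial carry can be merged by one addition in the ALU may be illustrated with a small Python model. This is only a sketch under stated assumptions: the function names carry_save_add and resolve are hypothetical, and the model works on 32-bit integers rather than on the actual bus signals of the core.

```python
MASK32 = 0xFFFFFFFF


def carry_save_add(a, b, c):
    """3:2 compression: reduce three 32-bit words to a (sum, carry) pair.

    No carry ripples between bit positions; each bit is computed locally.
    """
    partial_sum = (a ^ b ^ c) & MASK32
    # The majority function gives the carry out of each bit position,
    # shifted left by one so it has the weight of the next position.
    partial_carry = (((a & b) | (b & c) | (a & c)) << 1) & MASK32
    return partial_sum, partial_carry


def resolve(partial_sum, partial_carry):
    """One carry-propagate addition, the step the text assigns to the ALU."""
    return (partial_sum + partial_carry) & MASK32
```

Calling carry_save_add on three words and passing the pair to resolve gives the same 32-bit result as adding the three words directly, which is why only the final step needs a full adder.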

3. Analysis and design
The overall design procedure is composed of pipeline stage analysis, instruction execution analysis, RT-level (Register
Transfer Level) functional unit composition, and control signal generation.

3.1 Pipeline Stage


The core employs a 3-stage pipeline consisting of fetch, decode, and execute stages. Although most data processing
instructions are executed in a single cycle, some other instructions take multiple clock cycles. The execute stage takes
from 1 to 17 cycles. The pipeline operations for memory transfer instructions are shown in Fig. 2. In the execute stage,
there are three operations: address computation, memory read, and register write. This multi-cycle execution delays the
next instruction by the additional cycles.
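The delay caused by a multi-cycle execute stage can be sketched with a simple timing model in Python. The function schedule and the instruction sequence are illustrative assumptions, not taken from the design; the model assumes in-order issue with one instruction in the execute stage at a time.

```python
def schedule(program, fill_cycles=2):
    """Return (name, first_execute_cycle, last_execute_cycle) per instruction.

    program: list of (name, execute_cycles) pairs in program order.
    fill_cycles: cycles spent in fetch and decode before the first execute
    (2 for a 3-stage pipeline). Cycles are numbered from 1.
    """
    timeline = []
    next_free = fill_cycles + 1
    for name, execute_cycles in program:
        start = next_free
        end = start + execute_cycles - 1
        timeline.append((name, start, end))
        # The following instruction waits until the execute stage is free.
        next_free = end + 1
    return timeline


# Illustrative sequence: the load needs three execute cycles (address
# computation, memory read, register write); the others need one each.
PROGRAM = [("LDR", 3), ("ADD", 1), ("SUB", 1), ("MOV", 1)]
```

With this sequence, schedule(PROGRAM) places the load in the execute stage for cycles 3 to 5, so the three single-cycle instructions execute in cycles 6, 7, and 8 instead of 4, 5, and 6.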

3.2 Instruction execution


In order to understand the execution of ARM7 instructions, we analyzed the datapath and the movement of operands for
each instruction. From this analysis, we can derive the partial control signals in the execute stage of each
instruction. The flow of a data processing instruction in the execute stage is illustrated in Fig. 3. When the register
and data I/O units are connected to the datapath and the buses are active, operand movement is carried out. The
instructions can be grouped into several types, such as data processing, data transfer, branch, and others. Each type
determines the execution operation of an instruction.

Fig. 1. The target core architecture

Fig. 2. Instruction pipeline operation (a) STR (b) LDR

Fig. 3. Instruction datapath activity


3.3 RT-level functional unit composition


From the analysis of the hardware architecture and instruction operation, the functional units are designed at the
RT-level in Verilog HDL and grouped into fetch, decode, execute, and memory blocks. The function of the fetch block is
to read the data from a targeted memory address. The read data is multiplexed by control signals from the decode block.
A data value of a memory transfer instruction is stored in the data input (DIN) register, and an instruction is stored
in the pipeline register. The decode block generates the control signals that the datapath uses in the next cycle, as
well as the execution sequences. It contains a decoding programmable logic array (PLA), a sequence generator, a THUMB
expansion unit, and an instruction register. 16-bit wide instructions are internally decompressed into 32-bit wide
instructions and then decoded. The block diagram related to the pipeline cycle is shown in Fig. 4. The fetched
instruction is decoded at the PLA, and the PLA output generates control signals for each instruction, as shown in
Fig. 5. Most instructions have a one-cycle execution time, but multiply and memory access instructions take more than
one cycle. Thus additional logic for these instructions is necessary. In order to generate critical control signals with
a short delay, the main decoder has a sub-PLA, which is small and fast.
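The decompression of 16-bit instructions into 32-bit instructions described above can be sketched in Python for two THUMB encodings. This is a minimal illustration, not the expansion unit of the core: the function name expand_thumb is hypothetical, only the "MOV Rd, #imm8" and "ADD Rd, Rs, Rn" formats are covered, and the bit layouts are those of the published ARM7TDMI instruction set as understood here.

```python
ARM_MOVS_IMM = 0xE3B00000  # MOVS Rd, #imm8, condition AL (always)
ARM_ADDS_REG = 0xE0900000  # ADDS Rd, Rn, Rm, condition AL (always)


def expand_thumb(halfword):
    """Expand a 16-bit THUMB instruction into an equivalent 32-bit ARM word.

    Only two encodings are handled in this sketch; anything else raises.
    """
    halfword &= 0xFFFF

    # THUMB "MOV Rd, #imm8": 00100 | Rd (3 bits) | imm8 (8 bits)
    if (halfword >> 11) == 0b00100:
        rd = (halfword >> 8) & 0x7
        imm8 = halfword & 0xFF
        return ARM_MOVS_IMM | (rd << 12) | imm8

    # THUMB "ADD Rd, Rs, Rn": 0001100 | Rn (3) | Rs (3) | Rd (3)
    if (halfword >> 9) == 0b0001100:
        rn = (halfword >> 6) & 0x7
        rs = (halfword >> 3) & 0x7
        rd = halfword & 0x7
        # In the ARM word, Rs becomes the first operand and Rn the second.
        return ARM_ADDS_REG | (rs << 16) | (rd << 12) | rn

    raise NotImplementedError(
        f"THUMB encoding {halfword:#06x} is not covered by this sketch"
    )
```

Because the expansion is a fixed rearrangement of bit fields, it maps naturally onto combinational logic placed in front of the ARM decoder, so the same decode PLA can serve both instruction widths.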
Fig. 4. The THUMB instruction decompressor scheme

Fig. 5. Control logic structure

The memory block consists of the address register, the address incrementer, and the register bank. The register bank has
31 general-purpose registers and 6 status registers, but only 15 general-purpose registers and one status register are
used in most cases. The remaining registers are used only for system-level programming and exception handling. The
address register and incrementer select and hold all memory addresses and generate sequential addresses.
The execute block has a 32-bit integer ALU, a multiply-add unit, and a barrel shifter. The input operands to the ALU are
selectively inverted, then added and combined in the logic unit according to the instruction. Finally, the required
result is selected and loaded onto the bus. The Negative, Zero, Carry, and Overflow flags are generated in this unit. The
barrel shifter allows one of the ALU input operands to be shifted or rotated by any number of bits prior to the ALU
operation. Since the 32-bit addition time has a significant effect on the datapath cycle time, the ALU is built around a
carry-select adder. The overall composition of the ALU is shown in Fig. 6.
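The arithmetic path of such an ALU (selective inversion, carry-select addition, flag generation) can be modelled in Python as follows. This is a sketch under assumptions: the 4-bit block size, the function names carry_select_add and alu_arith, and the dictionary of flags are illustrative choices, not details given for the core. In hardware, both candidate sums of every block are formed in parallel; the loop below only mimics the selection step.

```python
MASK32 = 0xFFFFFFFF


def carry_select_add(a, b, carry_in=0, width=32, block=4):
    """Add two words block by block.

    Each block forms both candidate sums (carry-in 0 and carry-in 1);
    the incoming carry only selects which candidate is used.
    """
    block_mask = (1 << block) - 1
    result, carry = 0, carry_in
    for pos in range(0, width, block):
        xa = (a >> pos) & block_mask
        xb = (b >> pos) & block_mask
        sum_if_0 = xa + xb
        sum_if_1 = xa + xb + 1
        chosen = sum_if_1 if carry else sum_if_0
        result |= (chosen & block_mask) << pos
        carry = chosen >> block  # carry out of this block
    return result, carry


def alu_arith(a, b, invert_a=False, invert_b=False, carry_in=0):
    """Arithmetic operation with optional operand inversion and N, Z, C, V flags.

    Subtraction a - b is obtained with invert_b=True and carry_in=1.
    """
    x = (~a if invert_a else a) & MASK32
    y = (~b if invert_b else b) & MASK32
    result, carry = carry_select_add(x, y, carry_in)
    flags = {
        "N": result >> 31,
        "Z": int(result == 0),
        "C": carry,
        # Overflow: both adder inputs have the same sign and the result differs.
        "V": ((x ^ result) & (y ^ result)) >> 31,
    }
    return result, flags
```

For example, alu_arith(5, 3, invert_b=True, carry_in=1) computes 5 - 3 and sets the carry flag, which in this model means that no borrow occurred.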
The multiplier employs a modified Booth's algorithm and an early termination algorithm, and multiplies in 8-bit stages.
The procedure of the multiply operation is shown in Fig. 7. Each 8-bit multiply stage requires one instruction cycle to
execute. Thus, a multiply instruction can take 2-5 cycles. The multiplier employs a carry-save adder to avoid
carry-propagate delays, as shown in Fig. 8. Intermediate results are held as partial sums and partial carries, and the
true binary result is obtained by adding these two together in the main ALU.
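A behavioural Python sketch of radix-4 (modified) Booth multiplication in 8-bit stages with early termination is given below. It is not the multiplier of the core: the function name booth_multiply is hypothetical, the carry-save accumulation is replaced by ordinary integer addition, and the termination test (stop once the remaining multiplier bits are a pure sign extension of the last bit consumed) is an assumption chosen to keep the model exact, since the text does not state the rule used in the design.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def booth_multiply(multiplicand, multiplier):
    """Return (low 32 bits of the product, number of 8-bit stages used)."""
    x = to_signed32(multiplicand)
    y = to_signed32(multiplier)
    acc = 0
    prev_bit = 0  # the implicit bit to the right of bit 0
    stages = 0
    for stage in range(4):
        chunk = (y >> (8 * stage)) & 0xFF
        for i in range(4):  # four radix-4 Booth digits per 8-bit stage
            b0 = (chunk >> (2 * i)) & 1
            b1 = (chunk >> (2 * i + 1)) & 1
            digit = prev_bit + b0 - 2 * b1  # one of -2, -1, 0, 1, 2
            acc += (digit * x) << (8 * stage + 2 * i)
            prev_bit = b1
        stages += 1
        # Early termination: if every remaining bit equals the last bit
        # consumed, all remaining Booth digits are zero.
        remaining = y >> (8 * (stage + 1))
        if remaining == -prev_bit:
            break
    return acc & MASK32, stages
```

For example, booth_multiply(3, 5) finishes after one stage, while a multiplier with significant bits in its top byte needs all four stages. If one further cycle is assumed for the final addition in the main ALU, one to four stages correspond to the 2-5 cycle range quoted above.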
Exceptions are usually used to handle unexpected events, such as interrupts or memory faults. In the ARM architecture,
software interrupts, undefined instruction traps, and the system reset are also considered exceptions. When an
exception occurs, the core completes the current instruction, then changes the operating mode and saves the old status.
Once it has changed to an exception mode, interrupt inputs are blocked and the program counter is forced to the
predetermined vector address. The control signals for this operation are generated in the exception control block.
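The exception entry sequence described above can be sketched in Python. The vector addresses, mode numbers, and status-register bit positions follow the ARM architecture as generally documented; the function enter_exception and the dictionary-based processor state are hypothetical, and details such as the per-exception return address offset and the banked registers are left out of this sketch.

```python
# name: (vector address, mode bits, whether fast interrupts are also masked)
EXCEPTIONS = {
    "undefined": (0x04, 0b11011, False),
    "swi":       (0x08, 0b10011, False),
    "irq":       (0x18, 0b10010, False),
    "fiq":       (0x1C, 0b10001, True),
}

I_BIT = 1 << 7    # masks normal interrupts
F_BIT = 1 << 6    # masks fast interrupts
T_BIT = 1 << 5    # THUMB state
MODE_MASK = 0x1F  # operating mode field


def enter_exception(state, kind, return_address):
    """Return the processor state after taking an exception.

    state: dict with at least a "cpsr" entry (current status register).
    """
    vector, mode, mask_fiq = EXCEPTIONS[kind]
    # Switch to the exception mode in ARM state and mask interrupts.
    new_cpsr = (state["cpsr"] & ~(MODE_MASK | T_BIT)) | mode | I_BIT
    if mask_fiq:
        new_cpsr |= F_BIT
    return {
        "cpsr": new_cpsr & 0xFFFFFFFF,
        "spsr": state["cpsr"],   # old status is saved
        "lr": return_address,    # where to resume afterwards
        "pc": vector,            # forced to the vector address
    }
```

Starting from user mode with a status value of 0x60000010, an IRQ leaves the condition flags unchanged, sets the mode field to IRQ mode, masks further normal interrupts, and loads the program counter with 0x18.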


Fig. 6. ALU architecture

Fig. 7. Multiplier-accumulator architecture

4. Verification
Each functional logic block is coded in Verilog HDL and its operation is verified by logic simulation. The verified
blocks are combined to make the overall core. For verification of the designed core at the instruction level, we
organized a top-level logic simulation environment, as shown in Fig. 8. It contains two ROMs (one for the processor
setup and the other for the main program) and one RAM. The results of all instructions are compared at every cycle with
those of ARMulator, a software emulator of the ARM processor, as shown in Fig. 11. To find design errors effectively,
instructions were grouped into several types, and test vectors were generated by combining instruction groups. Each time
a test vector is generated, its operands are generated randomly and a test routine is inserted for self-test, as shown in
Fig. 9. With this combination test, many errors that were not detected by the individual instruction tests were found
and fixed. Finally, the operation of all instructions was verified by simulation.
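The two ideas in this procedure, building test vectors from instruction groups with random operands and comparing the simulated results with a reference cycle by cycle, can be sketched in Python. Everything named here is an assumption for illustration: the group contents, the functions generate_test_vector and first_mismatch, and the representation of a trace as a list of per-cycle values. The inserted self-test routine and the actual simulator interfaces are not modelled.

```python
import random

# Illustrative grouping; the text names the groups but not their contents.
INSTRUCTION_GROUPS = {
    "data_processing": ["ADD", "SUB", "AND", "ORR", "MOV"],
    "data_transfer": ["LDR", "STR"],
    "branch": ["B", "BL"],
}


def generate_test_vector(group_order, seed):
    """Pick one instruction per group in the given order, with random operands.

    A fixed seed makes a failing vector reproducible.
    """
    rng = random.Random(seed)
    vector = []
    for group in group_order:
        mnemonic = rng.choice(INSTRUCTION_GROUPS[group])
        operands = tuple(rng.getrandbits(32) for _ in range(2))
        vector.append((group, mnemonic, operands))
    return vector


def first_mismatch(dut_trace, reference_trace):
    """Return the first cycle at which the two traces differ, or None."""
    for cycle, (dut, ref) in enumerate(zip(dut_trace, reference_trace)):
        if dut != ref:
            return cycle
    if len(dut_trace) != len(reference_trace):
        # One trace ended early; report the first missing cycle.
        return min(len(dut_trace), len(reference_trace))
    return None
```

Reporting the first differing cycle, rather than only a pass or fail result, narrows the search to the instruction that was in the execute stage when the two traces diverged.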

Fig. 8. Top-level logic simulation environment

Fig. 9. The generation procedure of test vectors

Various audio processing algorithms were used to debug the design and to find errors arising from combinations of
instructions. The algorithms used are MP3 decoding, ADPCM (G.721 adaptive differential pulse code modulation), and
EDSOLA (synchronized overlap and add using edge detection). These algorithms are widely used in audio signal
processing. The ADPCM and SOLA algorithms are each programmed in assembly language. To test the core under more varied
conditions, we used as many instruction types as possible in the programs. Because the MP3 algorithm is more complex
than the others, it is compiled by the ARM compiler and partially reprogrammed by hand.


Fig. 10. Result comparison logic simulation with commercial software emulator

Fig. 11. FPGA emulation board

Fig. 12. Emulation board organization

The core was implemented on an FPGA to check for errors related to interrupts arising from data input/output processing.
The design occupies 3,126 slices (25%), 2,299 registers (9%), and 4,483 LUTs (18%) of an XCV1000-BG560 FPGA. The total
equivalent gate count is 45,419 and the operating speed is 12.5 MHz. The FPGA emulation board has two XCV chips, as
shown in Fig. 11. One is for the designed core and the other is for various I/O interfaces. Because the board contains
various peripherals, such as three types of memory (DRAM, SRAM, PROM), a UART controller, and an audio CODEC, as shown
in Fig. 12, it is useful for verifying both a design and an application program. The audio algorithm codes used in the
logic simulation step are modified for the board's memory map, and interface routines for the CODEC and UART are
inserted. The ADPCM and SOLA algorithms run in real time on the designed core. The MP3 decoding algorithm requires a
clock speed of 40 MHz for real-time operation. Thus the audio data is decoded at slow speed and the stored results are
then played back. Various exception tests are executed during emulation.
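The gap between the 12.5 MHz operating speed and the 40 MHz needed for real-time MP3 decoding can be turned into a rough decode-time estimate. The short Python sketch below assumes that the number of clock cycles needed per second of audio is fixed, so that decode time scales inversely with clock frequency; the function name decode_time_seconds is hypothetical.

```python
CORE_CLOCK_MHZ = 12.5    # operating speed reported for the FPGA implementation
MP3_REALTIME_MHZ = 40.0  # clock stated as necessary for real-time MP3 decoding


def decode_time_seconds(audio_seconds,
                        core_mhz=CORE_CLOCK_MHZ,
                        required_mhz=MP3_REALTIME_MHZ):
    """Estimate how long the core needs to decode a stretch of audio.

    Assumes a fixed cycle count per second of audio, so the time
    scales with the ratio of required clock to available clock.
    """
    return audio_seconds * required_mhz / core_mhz
```

Under this assumption the core runs 3.2 times slower than real time, so 10 seconds of audio would take about 32 seconds to decode before playback.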

5. Conclusions
The design and verification procedure of a 32-bit RISC processor core compatible with the ARM7TDMI is described. It was
designed in Verilog HDL and verified by simulation at the instruction level. The verified design was implemented on an
FPGA (XCV1000-BG560). The total gate count is 45,419 and the operating speed is 12.5 MHz. The ADPCM and SOLA algorithms
run in real time, and the MP3 decoding algorithm does not run in real time but produces correct results. In the future,
optional logic blocks (a scan block and an emulation block) will be added to the core, and the design will be optimized
for gate complexity and speed.

Acknowledgment
This work was supported by IDEC, SIPAC, KIPO and Pusan National University.


