KORUS 2003
1. Introduction
Hardware systems today must be fast, low-power, and multifunctional to support better communication
and information services. To fulfill this requirement, major system functions are increasingly integrated into a single chip.
The key technology for integrating a large amount of hardware and software into a single chip is
not chip fabrication but CPU core design. Hardware systems built on SOCs (systems on chip) with an
embedded CPU core gain considerable flexibility in implementing complex algorithms for information technology applications.
For this reason, the ARM core family is widely used in information and communication equipment and in digital consumer
appliances. It is often used as a macrocell in application-specific, complex system-level chips because of its small size and low-power
design. Architecturally, the target processor contains a 3-stage pipeline, 6 register banks, a 32-bit ALU, and a
4-cycle multiply-accumulator (MAC). In addition, 16-bit wide instructions, block data transfer, and conditional instruction
execution improve processor performance.
This paper describes the design and verification procedure of a 32-bit RISC processor that is compatible with the ARM7TDMI at the
instruction level. Section 2 briefly describes the processor architecture, and Section 3 presents the detailed design process.
Section 4 covers the FPGA implementation, verification, and application results for several
audio algorithms. Section 5 concludes the paper.
2. Architecture
The core consists of six major functional blocks and three internal buses, as shown in Fig. 1. The register bank has 31
general registers and 6 status registers that store the processor state. It has two read ports and one write port, each of which can
access any register. The read ports are connected to the A bus and the B bus, as shown in Fig. 1. The two operands read
from the registers are transferred to the multiplier or the ALU through these two buses. The value on the B bus is shifted by the barrel
shifter and combined with the value on the A bus in the ALU. Since the multiplier uses the existing shifter and ALU, the
partial sum on the A bus, the partial carry on the B bus, and the partial product of the multiplier are added in the ALU. Every
ALU result, which can be an address or data, is transferred to the register bank or the address register. The address
register and incrementer select and hold all memory addresses and generate sequential addresses, and the data registers
hold data passing to and from memory. These operations are controlled by the instruction decoder and its related logic.
Authorized licensed use limited to: California State University Fresno. Downloaded on February 17,2010 at 17:29:56 EST from IEEE Xplore. Restrictions apply.
Proceedings of the 7th Korea-Russia International Symposium, KORUS 2003
[Fig. 1. Block diagram of the core; surviving labels: A[31:0], control]
3. Analysis and design
The overall design procedure consists of pipeline stage analysis, instruction execution analysis,
RT-level (register transfer level) functional unit composition, and control signal generation.
[Pipeline timing diagram; caption not recovered. Surviving labels: cycles 1 to 10; instructions LDR, ADD, SUB, MOV]
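The pipeline stage analysis can be illustrated with a small occupancy model. The sketch below assumes a plain 3-stage fetch/decode/execute pipeline in which an instruction that needs several execute cycles stalls the instructions behind it. The cycle counts (3 for LDR, 1 for the data-processing instructions) are the nominal ARM7TDMI figures and are an assumption here, not values read from the paper's timing diagram; the function name `pipeline_schedule` is hypothetical.

```python
def pipeline_schedule(program):
    """Schedule (name, execute_cycles) pairs on a 3-stage in-order pipeline.

    Returns a list of (name, fetch, decode, exec_start, exec_end) tuples.
    fetch, decode and exec_start are the 1-based cycles in which the
    instruction enters each stage; exec_end is its last execute cycle.
    """
    rows = []
    for name, exec_cycles in program:
        if not rows:
            fetch, decode, exec_start = 1, 2, 3
        else:
            _, _, prev_decode, prev_exec_start, prev_exec_end = rows[-1]
            fetch = prev_decode              # fetch frees when predecessor decodes
            decode = prev_exec_start         # decode frees when predecessor executes
            exec_start = prev_exec_end + 1   # execute frees when predecessor finishes
        rows.append((name, fetch, decode, exec_start, exec_start + exec_cycles - 1))
    return rows


# Assumed execute-cycle counts (nominal ARM7TDMI values, not from the paper).
PROGRAM = [("LDR", 3), ("ADD", 1), ("SUB", 1), ("MOV", 1)]
SCHEDULE = pipeline_schedule(PROGRAM)

for row in SCHEDULE:
    print("%-4s fetch=%d decode=%d execute=%d..%d" % row)
```

Under these assumptions the multi-cycle LDR holds ADD in the decode stage for two extra cycles, which is the kind of interaction the stage analysis has to capture before control signals can be generated.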
Fig. 4. The THUMB instruction decompressor scheme
Fig. 5. Control logic structure (surviving block labels: instruction decoder, coprocessor, multiply control, load/store multiple)
The memory block consists of the address register, the address incrementer, and the register bank. The register bank has
31 general-purpose registers and 6 status registers, but only 15 general-purpose registers and one status register are used
in most cases. The remaining registers are used only for system-level programming and exception handling. The address
register and incrementer select and hold all memory addresses and generate sequential addresses.
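The register counts can be checked with a short model of register banking. The sketch assumes the standard ARM banking scheme (FIQ mode has private r8 to r14; IRQ, Supervisor, Abort, and Undefined modes each have private r13 and r14 plus a saved status register), which the paper does not spell out. The names `physical_register` and `status_registers` are hypothetical.

```python
MODES = ["usr", "sys", "fiq", "irq", "svc", "abt", "und"]

# Register numbers that are private (banked) in each exception mode.
# Assumption: standard ARM banking, not stated explicitly in the paper.
BANKED = {
    "fiq": range(8, 15),    # r8-r14
    "irq": range(13, 15),   # r13 (stack pointer) and r14 (link register)
    "svc": range(13, 15),
    "abt": range(13, 15),
    "und": range(13, 15),
}


def physical_register(mode, n):
    """Map architectural register n in the given mode to a physical name."""
    if n in BANKED.get(mode, ()):
        return "r%d_%s" % (n, mode)
    return "r%d" % n


def status_registers(mode):
    """Status registers visible in a mode: CPSR, plus an SPSR in exception modes."""
    if mode in BANKED:
        return ["cpsr", "spsr_" + mode]
    return ["cpsr"]


GENERAL = {physical_register(m, n) for m in MODES for n in range(16)}
STATUS = {s for m in MODES for s in status_registers(m)}

print(len(GENERAL), "general registers,", len(STATUS), "status registers")
```

Counting the distinct physical names over all modes reproduces the 31 general and 6 status registers quoted above (r15, the program counter, is included in the general count in this sketch).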
The execute block has a 32-bit integer ALU, a multiply-add unit, and a barrel shifter. The input operands to the ALU are
selectively inverted, then added and combined in the logic unit according to the instruction. Finally, the required result is
selected and loaded onto the bus. The Negative, Zero, Carry, and Overflow flags are generated in this unit. The barrel shifter
allows one of the ALU input operands to be shifted or rotated by any number of bits prior to the ALU operation. Since the
32-bit addition time has a significant effect on the datapath cycle time, the ALU is built around a carry-select adder. The
overall composition of the ALU is shown in Fig. 6.
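A behavioral sketch of this execute path follows. It is not the paper's Verilog: the 8-bit block width of the carry-select adder, the function names, and the restriction to ADD and SUB are all assumptions made for illustration. It shows the three ideas in the paragraph: a barrel shift on one operand, selective inversion to turn subtraction into addition, and flag generation.

```python
MASK32 = 0xFFFFFFFF


def barrel_shift(value, kind, amount):
    """Shift or rotate a 32-bit value by 0..31 bits."""
    amount &= 31
    if kind == "LSL":
        return (value << amount) & MASK32
    if kind == "LSR":
        return value >> amount
    if kind == "ASR":
        signed = value - (1 << 32) if value >> 31 else value
        return (signed >> amount) & MASK32
    if kind == "ROR":
        return ((value >> amount) | (value << (32 - amount))) & MASK32
    raise ValueError(kind)


def carry_select_add(a, b, carry_in, block_bits=8):
    """32-bit add in blocks: each block computes both carry cases, then selects."""
    block_mask = (1 << block_bits) - 1
    result, carry = 0, carry_in
    for shift in range(0, 32, block_bits):
        x = (a >> shift) & block_mask
        y = (b >> shift) & block_mask
        sum0 = x + y          # speculative sum for carry-in 0
        sum1 = x + y + 1      # speculative sum for carry-in 1
        chosen = sum1 if carry else sum0   # multiplexer driven by the real carry
        result |= (chosen & block_mask) << shift
        carry = chosen >> block_bits
    return result, carry


def alu(op, a, b):
    """ADD or SUB on 32-bit operands; returns (result, flags)."""
    carry_in = 0
    if op == "SUB":
        b, carry_in = ~b & MASK32, 1       # a - b == a + ~b + 1
    elif op != "ADD":
        raise ValueError(op)
    result, carry = carry_select_add(a, b, carry_in)
    overflow = (((a ^ result) & (b ^ result)) >> 31) & 1
    flags = {"N": result >> 31, "Z": int(result == 0), "C": carry, "V": overflow}
    return result, flags


print(alu("ADD", 0x7FFFFFFF, barrel_shift(1, "LSL", 0)))
```

In hardware the two speculative sums of every block are computed in parallel, so only the multiplexer chain lies on the carry path; the loop here models the selection, not the timing.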
The multiplier employs a modified Booth's algorithm and an early termination algorithm, and multiplies in
8-bit stages. The procedure of the multiply operation is shown in Fig. 7. Each 8-bit multiply stage requires one instruction
cycle to execute. Thus, a multiply instruction can take 2 to 5 cycles. The multiplier employs a carry-save adder to avoid
carry-propagate delays, as shown in Fig. 8. Intermediate results are held as partial sums and partial carries, and the true
binary result is obtained by adding these two together in the main ALU.
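The staged multiply can be sketched behaviorally as follows. Several simplifications are assumptions of this sketch rather than details from the paper: Booth recoding is abstracted away (each stage's 8-bit partial product is computed with ordinary multiplication), early termination only checks for remaining zero bits, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def carry_save_add(x, y, z):
    """3:2 compressor: x + y + z == sum + carry (mod 2**32), no carry ripple."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s & MASK32, c & MASK32


def multiply(rm, rs):
    """Return (low 32 bits of rm * rs, number of 8-bit stages used)."""
    partial_sum, partial_carry = 0, 0
    remaining = rs & MASK32
    stages, shift = 0, 0
    while True:
        byte = remaining & 0xFF
        term = ((rm * byte) << shift) & MASK32    # this stage's partial product
        partial_sum, partial_carry = carry_save_add(partial_sum, partial_carry, term)
        stages += 1
        remaining >>= 8
        shift += 8
        if remaining == 0 or stages == 4:         # early termination
            break
    # The final carry-propagate addition is done once, in the main ALU.
    product = (partial_sum + partial_carry) & MASK32
    return product, stages


print(multiply(7, 6))
print(multiply(3, 0x100))
```

If one additional cycle is assumed for the final addition in the ALU, the 1 to 4 stages of this sketch correspond to the 2 to 5 cycles stated above; that mapping is an interpretation, not something the paper states.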
Exceptions are usually used to handle unexpected events, such as interrupts or memory faults. In the ARM architecture,
software interrupts, undefined instruction traps, and system reset are also treated as exceptions. When an exception
occurs, the core completes the current instruction, then changes the operating mode and saves the old status. Once it
has changed to an exception mode, interrupt inputs are intercepted and the program counter is forced to the
predetermined vector address. The control signals for this operation are generated in the exception control block.
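The entry sequence described above can be written as a short state update. The vector addresses and mode assignments are the standard ARM ones; the dictionary layout of `state`, the field names, and the `enter_exception` function are hypothetical, and the return-address offset (which differs by exception type) is left to the caller.

```python
# Standard ARM exception vector addresses.
VECTORS = {
    "reset": 0x00, "undefined": 0x04, "swi": 0x08, "prefetch_abort": 0x0C,
    "data_abort": 0x10, "irq": 0x18, "fiq": 0x1C,
}

# Operating mode entered for each exception.
EXCEPTION_MODE = {
    "reset": "svc", "undefined": "und", "swi": "svc", "prefetch_abort": "abt",
    "data_abort": "abt", "irq": "irq", "fiq": "fiq",
}


def enter_exception(state, kind, return_address):
    """Return the processor state after taking an exception (input is not modified)."""
    mode = EXCEPTION_MODE[kind]
    new = dict(state)
    # Save the old status and the return address in the new mode's banked registers.
    new["spsr"] = dict(state["spsr"], **{mode: dict(state["cpsr"])})
    new["lr"] = dict(state["lr"], **{mode: return_address})
    # Switch mode, mask IRQ, and return to ARM (32-bit) instruction state.
    cpsr = dict(state["cpsr"], mode=mode, irq_disabled=True, thumb=False)
    if kind in ("fiq", "reset"):
        cpsr["fiq_disabled"] = True      # FIQ is masked only by FIQ and reset
    new["cpsr"] = cpsr
    new["pc"] = VECTORS[kind]            # force the PC to the vector address
    return new


STATE = {
    "pc": 0x8000,
    "cpsr": {"mode": "usr", "irq_disabled": False, "fiq_disabled": False, "thumb": False},
    "spsr": {},
    "lr": {},
}
AFTER = enter_exception(STATE, "irq", 0x8004)
print(hex(AFTER["pc"]), AFTER["cpsr"]["mode"])
```

Saving the status before changing mode is what lets the handler restore the interrupted program exactly; on reset the saved values are not meaningful, which the sketch does not model.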
Fig. 6. ALU architecture (surviving labels: C in, logic functions, logic/arithmetic, zero detect, result, and flags N, C, V)
Fig. 7. Multiplier-accumulator architecture
4. Verification
Each functional logic block was coded in Verilog HDL and its operation was verified by logic simulation. The verified blocks
were then combined into the overall core. To verify the designed core at the instruction level, we organized a top-level
logic simulation environment, as shown in Fig. 8. It contains two ROMs (one for the processor setup and the other for the
main program) and one RAM. The results of all instructions are compared at every cycle with those of ARMulator, a software
emulator of the ARM processor, as shown in Fig. 11. To find design errors effectively, the instructions were
grouped into several types, and test vectors were generated by combining instruction groups. Each time a
test vector is generated, its operands are randomly generated and a test routine is inserted for self-test, as shown in Fig. 9. This
combination test found many errors that individual instruction tests did not detect, and these were fixed. Finally, the
operation of all instructions was verified by simulation.
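The combination test can be sketched as a random generator plus a comparator. Everything named here is hypothetical: the instruction groups, the tiny `reference` model standing in for ARMulator, and the deliberately faulty `buggy_dut` standing in for a design under test. The paper's actual grouping and test routines are not given.

```python
import random

MASK32 = 0xFFFFFFFF

# Hypothetical instruction groups; the paper does not list its grouping.
GROUPS = {
    "data_processing": ["ADD", "SUB", "AND", "ORR", "EOR"],
    "shift": ["LSL", "LSR"],
}


def reference(op, a, b):
    """Golden model standing in for the software emulator."""
    if op == "ADD":
        return (a + b) & MASK32
    if op == "SUB":
        return (a - b) & MASK32
    if op == "AND":
        return a & b
    if op == "ORR":
        return a | b
    if op == "EOR":
        return a ^ b
    if op == "LSL":
        return (a << (b & 31)) & MASK32
    if op == "LSR":
        return a >> (b & 31)
    raise ValueError(op)


def buggy_dut(op, a, b):
    """Design under test with an injected fault: LSR behaves like LSL."""
    return reference("LSL" if op == "LSR" else op, a, b)


def generate_test_vector(rng, length):
    """Combine instruction groups at random and draw random 32-bit operands."""
    vector = []
    for _ in range(length):
        group = rng.choice(sorted(GROUPS))
        op = rng.choice(GROUPS[group])
        vector.append((op, rng.getrandbits(32), rng.getrandbits(32)))
    return vector


def compare(vector, dut, golden):
    """Return (index, instruction) of the first mismatch, or None if all agree."""
    for index, (op, a, b) in enumerate(vector):
        if dut(op, a, b) != golden(op, a, b):
            return index, (op, a, b)
    return None


VECTOR = generate_test_vector(random.Random(2003), 50)
print(compare(VECTOR, buggy_dut, reference))
```

Fixing the seed makes a failing vector reproducible, which matters when a mismatch has to be traced back into the HDL.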
[Simulation environment figure; caption not recovered. Surviving labels: DIN, DOUT, XDATA[31:0], Memory]
Various audio processing algorithms were used to debug the design and to find errors that appear only in combinations of
instructions. The algorithms used are MP3 decoding, ADPCM (G.721 adaptive differential pulse code modulation), and
EDSOLA (synchronized overlap and add using edge detection). These algorithms are widely used in audio signal
processing. The ADPCM and SOLA algorithms were programmed in assembly language. To test the core under more varied
conditions, we used as many instruction types as possible in the programs. Because the MP3 algorithm is more complex than the
others, it was compiled with the ARM compiler and partially reprogrammed by hand.
Fig. 10. Result comparison: logic simulation with commercial software emulator (surviving labels: Logic Simulation, Comparator)
[FPGA emulation figure; caption not recovered. Surviving labels: ARM7 Core, FPGA, Logic]
The core was implemented on an FPGA to check for errors related to interrupts arising from data input/output processing. The design
occupied 3,126 slices (25%), 2,299 registers (9%), and 4,483 LUTs (18%) of an XCV1000-BG560 FPGA. The total
equivalent gate count is 45,419 and the operating speed is 12.5 MHz. The FPGA emulation board has two XCV chips, as
shown in Fig. 11. One holds the designed core and the other implements various I/O interfaces. Because the board contains
various peripherals, such as three types of memory (DRAM, SRAM, FROM), a UART controller, and an audio CODEC, as
shown in Fig. 12, it is useful for verifying both a design and an application program. The audio algorithm codes used in the
logic simulation step were modified for the board's memory map, and interface routines for the CODEC and UART were inserted. The ADPCM and
SOLA algorithms run in real time on the designed core. The MP3 decoding algorithm requires a clock speed of 40 MHz for
real-time operation, so audio data is decoded at a slower speed and the stored results are then played. Various exception tests
were executed during emulation.
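The gap between the required and the achieved clock rate gives a rough slowdown factor for MP3 decoding. This assumes decoding time scales inversely with clock frequency and ignores memory and I/O effects, which the paper does not quantify:

```latex
\[
\frac{f_{\text{required}}}{f_{\text{FPGA}}}
  = \frac{40\ \text{MHz}}{12.5\ \text{MHz}}
  = 3.2
\]
```

Under that assumption, one second of audio takes roughly 3.2 seconds to decode on the emulation board, which is why the decoded output is stored first and played back afterwards.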
5. Conclusions
This paper described the design and verification procedure of a 32-bit RISC processor core that is compatible with the ARM7TDMI.
It was designed in Verilog HDL and verified by simulation at the instruction level. The verified design was
implemented on an FPGA (XCV1000-BG560). The total gate count is 45,419 and the operating speed is 12.5 MHz.
The ADPCM and SOLA algorithms run in real time; the MP3 decoding algorithm does not run in real time but decodes
correctly. In the future, optional logic blocks (a scan block and an emulation block) will be added to the core, and
the design will be optimized for gate complexity and speed.
Acknowledgment
This work was supported by IDEC, SIPAC, KIPO, and Pusan National University.