TMS320C64x Architecture

TMS320C64x
• TMS320C64x is a family of fixed-point Very Long
Instruction Word (VLIW) DSPs from Texas Instruments
• At clock rates of up to 1 GHz, C64x DSPs can process
information at rates up to 8000 MIPS
• C64x DSPs can do more work each cycle with built-in
extensions.
• They can process all C62x object code unmodified
(but not vice-versa)
Applications for the C64x
TMS320C64x can be used as a CPU in the following devices:
• Wireless local base stations
• Remote access servers (RAS)
• Digital subscriber loop (DSL) systems
• Cable modems
• Multichannel telephony systems
• Pooled modems
New extensions
• Register file enhancements
• Data path extensions
• Packed data processing
• Additional functional unit hardware
• Increased orthogonality
Register file enhancements
• The ’C64x register file has twice as many
general-purpose registers as the ’C62x/’C67x cores.
• There are 32 32-bit registers per data path:
A0-A31 for file A and B0-B31 for file B.
• A0 may also be used as a condition register, bringing
the total to six condition registers.
• In all ’C6000 devices, registers A4-A7 and B4-B7 can
be used for circular addressing.
Packed data processing
• The ’C64x register file supports all the ’C62x data
types and extends this by additionally supporting
packed 8-bit types and 64-bit fixed-point data types.
• Packed data types store either four 8-bit values or
two 16-bit values in a single 32-bit register, or four
16-bit values in a 64-bit register pair.
• Besides being able to perform all the ’C62x
instructions, the ’C64x also contains many 8-bit and
16-bit extensions to the instruction set.
E.g., the MPYU4 instruction performs four 8x8 unsigned
multiplies with a single instruction on a .M unit.
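The quad 8x8 unsigned multiply described above can be sketched in portable C. This is only a behavioural model under stated assumptions: the name mpyu4_model is hypothetical (it is not a TI intrinsic), and a single 64-bit integer stands in for the 64-bit register pair that receives the four 16-bit products.

```c
#include <assert.h>
#include <stdint.h>

/* Behavioural model of a quad 8x8 unsigned multiply (hypothetical helper).
 * a and b each hold four packed unsigned bytes. The four 16-bit products
 * are packed into one 64-bit value, which stands in for a register pair. */
uint64_t mpyu4_model(uint32_t a, uint32_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;   /* byte from a */
        uint32_t y = (b >> (8 * lane)) & 0xFFu;   /* byte from b */
        uint64_t product = (uint64_t)(x * y);     /* at most 255 * 255 = 65025 */
        result |= product << (16 * lane);         /* one 16-bit slot per lane */
    }
    return result;
}
```

With a = 0x01020304 and b = 0x05060708, the lanes multiply pairwise (4*8, 3*7, 2*6, 1*5), so the packed result is 0x0005000C00150020.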
Data path extensions
• On the ’C64x, all eight of the functional units have
access to the register file on the opposite side via a
cross path.
• On the ’C62x/’C67x, only six functional units have
access to the register file on the opposite side via a
cross path; the .D units do not have a data cross
path.
• The ’C64x pipelines data cross path accesses,
allowing multiple units per side to read the same
cross path source simultaneously.
• In ’C62x/’C67x, only one functional unit per data path
per execute packet could get an operand from the
opposite register file.
Additional Functional Unit Hardware
• The .L units can perform byte shifts and the .M units
can perform bi-directional variable shifts, in addition to
the .S units’ ability to do shifts.
• Bit-count and rotate hardware on the .M unit extends
support for bit-level algorithms such as binary
morphology, image metric calculations and encryption
algorithms.
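As an illustration of the bit-level operations mentioned above, the following C sketch models a per-byte bit count (the kind of operation the BITC4 instruction provides) and a 32-bit rotate. The function names are hypothetical and the code is a portable model, not the hardware implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Count the set bits in each byte of a packed word and return the four
 * counts (0..8) in the corresponding byte positions. */
uint32_t bitcount4_model(uint32_t packed)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t byte = (packed >> (8 * lane)) & 0xFFu;
        uint32_t count = 0;
        while (byte != 0) {
            count += byte & 1u;
            byte >>= 1;
        }
        result |= count << (8 * lane);
    }
    return result;
}

/* Rotate a 32-bit word left by n bit positions (n is taken modulo 32). */
uint32_t rotate_left_model(uint32_t value, unsigned n)
{
    n &= 31u;
    if (n == 0) {
        return value;               /* avoid an undefined shift by 32 */
    }
    return (value << n) | (value >> (32u - n));
}
```

In binary morphology, for example, a per-byte bit count gives the number of set pixels in each 8-pixel group of a bitmap word.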
Increased Orthogonality
• The .D unit can now perform 32-bit logical
instructions in addition to the .S and .L units.
• Also, the .D unit now directly supports load and store
instructions for double-word data values.
Block diagram
• CPU core
• L1 program cache: direct-mapped, 16 KB total
• L1 data cache: 2-way set-associative, 16 KB total
• L2 memory: 1024 KB SRAM
• Enhanced DMA controller (64-channel)
• EMIF A and EMIF B to external SDRAM, SBSRAM, ZBT RAM, FIFO and I/O devices
C64X CPU
Architecture Overview
• 2 (almost) identical fixed-point data paths that
each contain
– 1 ALU (The .L Unit)
– 1 Shifter (The .S Unit)
– 1 Multiplier (The .M Unit)
– 1 Adder/Subtractor used for address
generation (The .D Unit)
– 1 register file containing thirty-two 32-bit
registers
• The 8 execution units in the 2 data paths are
capable of executing up to 8 instructions in
parallel.
• Can operate on 8-, 16-, 32-, and 40-bit data
• Can perform double-word (64-bit) loads and
stores by using two registers for a single operation.
General-Purpose Register Files
• The C64x register file contains 32 32-bit registers per
data path (A0-A31 for file A and B0-B31 for file B).
• Registers can be used for data, pointers or conditions.
• Values larger than 32 bits (40-bit long and 64-bit
quantities) are stored in register pairs.
• Packed data types are: four 8-bit values or two 16-bit
values in a single 32-bit register, or four 16-bit values in a
64-bit register pair.
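[Figure: storage of a 40-bit value in a register pair. The even register holds bits 31-0; the odd register holds bits 39-32, and its remaining bits are zero filled.]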
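The register-pair layout for 40-bit values can be modelled in C as follows. This is a sketch under assumptions: the reg_pair struct and both function names are hypothetical, and a 64-bit integer is used to carry the 40-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of an odd/even register pair. */
typedef struct {
    uint32_t odd;    /* bits 39..32 of the value; remaining bits zero filled */
    uint32_t even;   /* bits 31..0 of the value */
} reg_pair;

/* Split a 40-bit value across the pair; bits above bit 39 are discarded. */
reg_pair store_long40(uint64_t value)
{
    reg_pair p;
    value &= 0xFFFFFFFFFFULL;                  /* keep 40 bits */
    p.even = (uint32_t)(value & 0xFFFFFFFFu);
    p.odd  = (uint32_t)(value >> 32);          /* only the low 8 bits are used */
    return p;
}

/* Reassemble the 40-bit value from the pair. */
uint64_t load_long40(reg_pair p)
{
    return ((uint64_t)(p.odd & 0xFFu) << 32) | p.even;
}
```

Storing 0x123456789A, for instance, places 0x12 in the odd register and 0x3456789A in the even register.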
Delay Slots
• Delay slots are the number of CPU cycles between
an instruction and the point at which its result can be
used by another instruction.
• Single-cycle instructions: 0 delay slots
• 16x16 single multiply and .M unit non-multiply
instructions: 1 delay slot
• Store: 0 delay slots
– If a load occurs before a store (either in parallel or not),
then the old data is loaded from memory before the new
data is stored.
– If a load occurs after a store (either in parallel or not), then
the new data is stored before the data is loaded.
• C64x Multiply Extensions: 3 delay slots
• Load: 4 delay slots
• Branch: 5 delay slots
– The branch target is in the PG slot when the branch
condition is determined in E1. The target then has 5 slots
to pass between PG and E1 before it begins executing
useful code.
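The delay-slot counts listed above can be turned into a small scheduling helper. The sketch below assumes the slide's definition (a result is usable in the cycle after the delay slots have elapsed); the enum and function names are hypothetical.

```c
#include <assert.h>

/* Instruction classes from the list above (names are hypothetical). */
enum insn_class {
    INSN_SINGLE_CYCLE,   /* 0 delay slots */
    INSN_MPY16,          /* 16x16 multiply and .M non-multiply: 1 */
    INSN_STORE,          /* 0 */
    INSN_MPY_EXT,        /* C64x multiply extensions: 3 */
    INSN_LOAD,           /* 4 */
    INSN_BRANCH          /* 5 */
};

/* Delay slots for each class, as given in the list above. */
unsigned delay_slots(enum insn_class c)
{
    switch (c) {
    case INSN_SINGLE_CYCLE: return 0;
    case INSN_MPY16:        return 1;
    case INSN_STORE:        return 0;
    case INSN_MPY_EXT:      return 3;
    case INSN_LOAD:         return 4;
    case INSN_BRANCH:       return 5;
    default:                return 0;
    }
}

/* First cycle in which a dependent instruction may use the result
 * (for a branch: the first cycle in which the target executes). */
unsigned first_use_cycle(unsigned issue_cycle, enum insn_class c)
{
    return issue_cycle + 1 + delay_slots(c);
}
```

A load issued in cycle 0 therefore delivers data that can first be consumed in cycle 5, which is why loads are usually scheduled several execute packets ahead of the instructions that use them.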
Memory
• The C64x has different spaces for program and data memory.
• It uses a two-level cache memory scheme.

Internal Memory
The C64x has a 32-bit byte-addressable memory with the
following features:
• Separate data and program address spaces
• Large on-chip RAM, up to 7 MB
• 2-level cache
• Single internal program memory port with an
instruction-fetch bandwidth of 256 bits
• Two 64-bit internal data memory ports

Memory Map (Internal and External Memory)
• Level 1 Program Cache is 128 Kbit (16 KB), direct-mapped
• Level 1 Data Cache is 128 Kbit (16 KB), 2-way
set-associative
• Shared Level 2 Program/Data
Memory/Cache of 4 Mbit
– Can be configured as mapped memory
– Cache (up to 256 Kbytes)
– Combination of the two
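To make the cache sizes above concrete, the following C sketch converts Kbit to bytes and computes which cache set an address maps to. The line sizes used in the usage note (32 bytes for the program cache, 64 bytes for the data cache) are assumptions that do not appear in these slides, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a size in Kbit (1 Kbit = 1024 bits) to bytes. */
uint32_t kbit_to_bytes(uint32_t kbit)
{
    return kbit * 1024u / 8u;
}

/* Set index of an address in a cache of cache_bytes total capacity,
 * organised as `ways` ways with lines of line_bytes each.
 * ways == 1 models a direct-mapped cache. */
uint32_t cache_set_index(uint32_t addr, uint32_t cache_bytes,
                         uint32_t ways, uint32_t line_bytes)
{
    assert(ways > 0 && line_bytes > 0);
    uint32_t sets = cache_bytes / (ways * line_bytes);
    assert(sets > 0);
    return (addr / line_bytes) % sets;
}
```

For example, kbit_to_bytes(128) gives 16384 bytes (16 KB). With an assumed 32-byte line, a direct-mapped 16 KB cache has 512 sets, so addresses 0 and 16384 fall in the same set and evict each other. With an assumed 64-byte line, a 2-way 16 KB cache has 128 sets, and two such conflicting addresses can reside in the cache at the same time.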
Memory Buses
• Instruction fetch using 32-bit address bus
and 256-bit data bus
• Two 64-bit load buses (LD1 and LD2)
• Two 64-bit store buses (ST1 and ST2)
Interrupts
• 16 prioritized interrupts: INT_00 to INT_15
• INT_00 has the highest priority and is dedicated
to RESET. This halts the CPU and returns it to
a known state
• The first four interrupts (INT_00 – INT_03) are
fixed and non-maskable
• INT_01 – INT_03 are generally used to alert the
CPU of an impending hardware problem, such
as an imminent power failure
• The remaining interrupts are maskable and can
be programmed
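The priority scheme above can be sketched as a small selection function in C. This is a simplified model under assumptions: priority is taken to decrease as the interrupt number increases, bit n of each mask stands for INT_n, and the function and macro names are hypothetical. It does not model any device registers.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_INTERRUPTS   16
#define NONMASKABLE_MASK 0x000Fu   /* INT_00..INT_03, per the list above */

/* Return the number of the interrupt to service next, or -1 if none.
 * pending: bit n set means INT_n is requesting service.
 * enabled: bit n set means maskable INT_n is enabled (ignored for the
 *          non-maskable interrupts, which are always eligible). */
int next_interrupt(uint16_t pending, uint16_t enabled)
{
    uint16_t eligible = (uint16_t)(pending & (enabled | NONMASKABLE_MASK));
    for (int n = 0; n < NUM_INTERRUPTS; n++) {
        if (eligible & (1u << n)) {
            return n;              /* lowest number = highest priority */
        }
    }
    return -1;
}
```

If INT_04 and INT_08 are both pending and enabled, the function returns 4; if only INT_08 is enabled, it returns 8.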
Interrupt Performance Considerations
• Overhead for all CPU interrupts is 7 cycles
• Interrupt latency is 11 cycles
• Interrupts can be recognized every 2
cycles
• 2 occurrences of a specific interrupt can
be recognized in 2 cycles
Peripheral Set
• 2 multichannel audio serial ports (McASPs)
• 2 inter-integrated circuit bus modules (I2Cs)
• 3 multichannel buffered serial ports (McBSPs)
• 3 32-bit general-purpose timers
• 1 user-configurable 16-bit or 32-bit host-port interface
(HPI16/HPI32)
• 1 16-pin general-purpose input/output port (GP0) with
programmable interrupt/event generation modes
• 1 32-bit glueless external memory interface (EMIFA),
capable of interfacing to synchronous and asynchronous
memories and peripherals.
ZBT RAM
• Zero Bus Turnaround (ZBT) is a synchronous SRAM
architecture optimized for networking and
telecommunications applications.
• It can increase the internal bandwidth of a switch
fabric when compared to standard SyncBurst SRAM.
• The ZBT architecture is optimized for switching and
other applications with highly random READs and
WRITEs.
• ZBT SRAMs eliminate all idle cycles when turning the
data bus around from a WRITE operation to a READ
operation
Packaging
[Figures: package top view and bottom view]
Sum of products example
C code:
int DotP(short* m, short* n, int count) {
    int i, product, sum = 0;
    for(i = 0; i < count; i++)
    {
        product = m[i] * n[i];
        sum += product;
    }
    return(sum);
}

TI TMS C64x code:
LOOP:
   [A0]  SUB  .L1   A0, 1, A0
|| [!A0] ADD  .S1   A6, A5, A5
||       MPY  .M1X  B4, A4, A6
|| [B0]  BDEC .S2   LOOP, B0
         LDH  .D1T1 *A3++, A4
         LDH  .D2T2 *B5++, B4
Another code example
MIPS:
loop: LW   R1, 0(R11)
      MUL  R2, R1, R10
      SW   R2, 0(R12)
      ADDI R12, R12, #-4
      ADDI R11, R11, #-4
      BGTZ R12, loop
TI TMS C64x:
      ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MVK .S2 #-4,B1
      ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MUL .M1 A1,A10,A2 || ADDK .S2 #-12,B12
loop: ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MUL .M1 A1,A10,A2 || STW .D2x A2,0(B12) ||
      ADD .L2 B12,B1,B12 || BGTZ .S2 B12, loop
      ADD .L2 B12,B1,B12 || MUL .M1 A1,A10,A2 || STW .D2x A2,0(B12)
      ADD .L2 B12,B1,B12 || STW .D2x A2,0(B12)
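Read as C, the MIPS loop above loads a word, multiplies it by a constant, stores the product, and steps both pointers down by one word. The sketch below is one such reading under assumptions: the function name is hypothetical, and an element count replaces the address test used by BGTZ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply each 32-bit word of src by factor and store it in dst,
 * walking both arrays from the last element down to the first.
 * The product keeps the low 32 bits, as a 32-bit MUL would. */
void scale_backward(const int32_t *src, int32_t *dst,
                    size_t count, int32_t factor)
{
    for (size_t i = count; i > 0; i--) {
        uint32_t product = (uint32_t)src[i - 1] * (uint32_t)factor;
        dst[i - 1] = (int32_t)product;
    }
}
```

Each iteration is independent of the others, which is what lets the C64x version overlap the load, multiply and store of different iterations in the same execute packet.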
Special purpose instructions
Instruction      Description                          Example Application
BITC4            Bit counter                          Machine vision
GMPY4            Galois Field MPY                     Reed Solomon support
SHFL             Bit interleaving                     Convolution encoder
DEAL             Bit de-interleaving                  Cable modem
SWAP4            Byte swap                            Endian swap
XPNDx            Bit expansion                        Graphics
MPYHIx, MPYLIx   Extended precision 16x32 MPYs        Audio
AVGx             Quad 8-bit, Dual 16-bit average      Motion compensation
SUBABS4          Quad 8-bit Absolute of differences   Motion estimation
SSHVL, SSHVR     Signed variable shift                GSM
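As an example of how one of these instructions serves its listed application, the C sketch below models the quad 8-bit absolute difference (SUBABS4 in the table) and uses it to form a sum of absolute differences, the usual cost metric in motion estimation. The function names are hypothetical, and the code is a portable model rather than the hardware behaviour in every detail.

```c
#include <assert.h>
#include <stdint.h>

/* For each of the four packed unsigned bytes, compute |a - b| and pack
 * the four results into the corresponding byte positions. */
uint32_t subabs4_model(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        uint32_t diff = (x > y) ? (x - y) : (y - x);
        result |= diff << (8 * lane);
    }
    return result;
}

/* Add up the four packed bytes of a word. */
uint32_t sum_bytes(uint32_t packed)
{
    return (packed & 0xFFu) + ((packed >> 8) & 0xFFu) +
           ((packed >> 16) & 0xFFu) + (packed >> 24);
}

/* Sum of absolute differences over four pixels packed in each word. */
uint32_t sad4(uint32_t a, uint32_t b)
{
    return sum_bytes(subabs4_model(a, b));
}
```

Comparing the pixel words 0x10203040 and 0x40302010 gives per-byte differences 0x30, 0x10, 0x10 and 0x30, so sad4 returns 128.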
THE END