
TMS320C64x

• TMS320C64x is a family of 32-bit fixed-point Very Long
Instruction Word (VLIW) DSPs from Texas Instruments
• At clock rates of up to 1 GHz, C64x DSPs can process
information at rates up to 8000 MIPS
• C64x DSPs can do more work each cycle with built-in
extensions.
• They can process all C62x object code unmodified
(but not vice-versa)
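
The 8000 MIPS figure follows from simple arithmetic: a 1 GHz clock multiplied by the eight instructions that the eight functional units (described in the architecture overview) can issue per cycle. A minimal sketch of that calculation, where the function name `peak_mips` is hypothetical and chosen only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Peak instruction rate in MIPS, assuming every functional unit
 * issues one instruction on every clock cycle (a best case that
 * real code rarely sustains). */
uint32_t peak_mips(uint32_t clock_mhz, uint32_t units_per_cycle)
{
    assert(units_per_cycle > 0);
    return clock_mhz * units_per_cycle;
}
```

With a 1000 MHz clock and 8 units, `peak_mips(1000, 8)` evaluates to 8000.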
Applications for the C64x

TMS320C64x can be used as a CPU in the following devices:

• Wireless local base stations;
• Remote access servers (RAS);
• Digital subscriber loop (DSL) systems;
• Cable modems;
• Multichannel telephony systems;
• Pooled modems.
New extensions

• Register file enhancements
• Data path extensions
• Packed data processing
• Additional functional unit hardware
• Increased orthogonality
Register file enhancements

• The ’C64x register file has twice as many
general-purpose registers as the ’C62x/’C67x cores
• There are 32 32-bit registers per data path:
A0-A31 for file A and B0-B31 for file B
• A0 may also be used as a condition register, bringing
the total to six condition registers.
• In all ’C6000 devices, registers A4-A7 and B4-B7 can
be used for circular addressing.
Packed data processing
• The ’C64x register file supports all the ’C62x data
types and extends this by additionally supporting
packed 8-bit types and 64-bit fixed-point data types.
• Packed data types store either four 8-bit values or
two 16-bit values in a single 32-bit register, or four
16-bit values in a 64-bit register pair.
• Besides being able to perform all the ’C62x
instructions, the ’C64x also contains many 8-bit and
16-bit extensions to the instruction set.
E.g., the MPYU4 instruction performs four 8x8 unsigned
multiplies with a single instruction on a .M unit.
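
To make the packed multiply concrete, the following is a portable C emulation of what a four-way 8x8 unsigned multiply computes. It is only a sketch: the function name `mpyu4_emulated` is hypothetical, and the lane layout (product of byte lane i placed in 16-bit lane i of the 64-bit result) is an assumption made for illustration, not a statement of the hardware encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates four independent 8x8 unsigned multiplies on packed data.
 * Each 32-bit input holds four unsigned bytes; the 64-bit result
 * holds four 16-bit products (assumed lane order: low byte first). */
uint64_t mpyu4_emulated(uint32_t a, uint32_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        uint64_t product = (uint64_t)x * y;
        assert(product <= 0xFFFFu);      /* 255 * 255 fits in 16 bits */
        result |= product << (16 * lane);
    }
    return result;
}
```

On the DSP the four products come from one instruction; the loop here only spells out the per-lane arithmetic.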
Data path extensions
• On the ’C64x, all eight of the functional units have
access to the register file on the opposite side via a
cross path.
• On the ’C62x/’C67x, only six functional units have
access to the register file on the opposite side via a
cross path; the .D units do not have a data cross
path.
• The ’C64x pipelines data cross path accesses,
allowing multiple units per side to read the same
cross path source simultaneously.
• On the ’C62x/’C67x, only one functional unit per data path
per execute packet can get an operand from the
opposite register file.
Additional Functional Unit Hardware
• The .L units can perform byte shifts and the .M units
can perform bi-directional variable shifts in addition to
the .S unit’s ability to do shifts.
• Bit-count and rotate hardware on the .M unit extends
support for bit-level algorithms such as binary
morphology, image metric calculations and encryption
algorithms.
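
As an illustration of the bit-level operations mentioned above, here is a portable C sketch of a per-byte bit count (the kind of operation the BITC4 instruction in the special purpose instruction list provides) and a 32-bit left rotate. The function names are hypothetical, and the per-byte result layout is an assumption for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Counts the set bits in each of the four bytes of x and returns
 * the four counts packed into the corresponding byte lanes. */
uint32_t bitc4_emulated(uint32_t x)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t byte = (x >> (8 * lane)) & 0xFFu;
        uint32_t count = 0;
        while (byte != 0) {
            count += byte & 1u;
            byte >>= 1;
        }
        assert(count <= 8);
        result |= count << (8 * lane);
    }
    return result;
}

/* Rotates x left by n bit positions (n is taken modulo 32). */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (n == 0) ? x : (x << n) | (x >> (32u - n));
}
```

In binary morphology, for example, a per-byte bit count gives the number of set pixels in each 8-pixel group of a bitmap row.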
Increased Orthogonality
• The .D unit can now perform 32-bit logical
instructions in addition to the .S and .L units.
• Also, the .D unit now directly supports load and store
instructions for double-word data values
Block diagram
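[Figure: C64x block diagram, showing the following components]
• C64x CPU core
• L1 program cache: direct-mapped, 16 Kbytes total
• L1 data cache: 2-way set-associative, 16 Kbytes total
• L2 memory: 1024 Kbytes
• Enhanced DMA controller (64-channel)
• EMIF A and EMIF B, connecting to external SDRAM, SBSRAM,
ZBT RAM, FIFO, SRAM and I/O devices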
C64X CPU
Architecture Overview
• 2 (almost) identical fixed-point data paths that
each contain
– 1 ALU (The .L Unit)
– 1 Shifter (The .S Unit)
– 1 Multiplier (The .M Unit)
– 1 Adder/Subtractor used for address
generation (The .D Unit)
– 1 register file containing thirty-two 32-bit
registers
• The 8 execution units in the 2 data paths are
capable of executing up to 8 instructions in
parallel.
• Can operate on 8-, 16-, 32-, and 40-bit data
• Can perform double-word (64-bit) loads and
stores by using 2 registers for the one operation.
General-Purpose Register Files
• Each C64x register file contains 32 32-bit registers
(A0-A31 for file A and B0-B31 for file B);
• Registers can be used for data, pointers or conditions;
• Values larger than 32 bits (40-bit long and 64-bit float
quantities) are stored in register pairs;
• Packed data types are: four 8-bit values or two 16-bit
values in a single 32-bit register, four 16-bit values in a
64-bit register pair.

[Figure: layout of a 40-bit value in a register pair. Bits 39-32
are held in the odd register, whose remaining bits are zero
filled; bits 31-0 are held in the even register.]
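
The register-pair layout for a 40-bit value can be sketched in C as follows. This is an illustration under stated assumptions: the type `reg_pair_t` and both function names are hypothetical, and the value is treated as unsigned for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/* Models an odd/even register pair holding one 40-bit value. */
typedef struct {
    uint32_t odd;   /* bits 39..32 of the value, upper bits zero filled */
    uint32_t even;  /* bits 31..0 of the value */
} reg_pair_t;

/* Splits a 40-bit unsigned value across the two registers. */
reg_pair_t split_long40(uint64_t value)
{
    assert(value < (1ULL << 40));
    reg_pair_t pair;
    pair.even = (uint32_t)(value & 0xFFFFFFFFu);
    pair.odd = (uint32_t)(value >> 32) & 0xFFu;
    return pair;
}

/* Reassembles the 40-bit value from the register pair. */
uint64_t join_long40(reg_pair_t pair)
{
    return ((uint64_t)(pair.odd & 0xFFu) << 32) | pair.even;
}
```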
Delay Slots
• The delay slots of an instruction are the number of
CPU cycles between that instruction and the point at
which its result can be used by another instruction
• Single Cycle Instructions: 0 delay slots
• 16x16 Single Multiply and .M Unit non-multiply
Instructions: 1 delay slot
• Store: 0 delay slots
– If a load occurs before a store (either in parallel or not),
then the old data is loaded from memory before the new
data is stored.
– If a load occurs after a store, (either in parallel or not), then
the new data is stored before the data is loaded.
• C64x Multiply Extensions: 3 delay slots
• Load: 4 delay slots
• Branch: 5 delay slots
– The branch target is in the PG phase when the branch
condition is determined in E1. There are 5 delay slots
between that PG phase and the target's own E1 phase,
when the branch target begins executing useful code again.
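
The delay-slot counts listed above can be turned into a small scheduling helper. The sketch below assumes that an instruction with N delay slots issued in cycle C makes its result usable by an instruction issued in cycle C + N + 1; the enum, table and function names are hypothetical.

```c
#include <assert.h>

/* Instruction classes from the delay-slot list above. */
typedef enum {
    OP_SINGLE_CYCLE,
    OP_MPY16,        /* 16x16 multiply and .M unit non-multiply */
    OP_STORE,
    OP_MPY_EXT,      /* C64x multiply extensions */
    OP_LOAD,
    OP_BRANCH,
    OP_COUNT
} op_class_t;

static const int DELAY_SLOTS[OP_COUNT] = { 0, 1, 0, 3, 4, 5 };

/* Returns the number of delay slots for an instruction class. */
int delay_slots(op_class_t op)
{
    assert((int)op >= 0 && (int)op < OP_COUNT);
    return DELAY_SLOTS[op];
}

/* First cycle in which a dependent instruction can use the result. */
int first_use_cycle(int issue_cycle, op_class_t op)
{
    return issue_cycle + delay_slots(op) + 1;
}
```

For a load issued in cycle 0, `first_use_cycle(0, OP_LOAD)` gives 5, which is why hand-written loops place other useful work between a load and the instruction that consumes it.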
Memory
• The C64x has different spaces for program and data memory;
• Uses a two-level cache memory scheme.

Internal Memory
The C64x has a 32-bit byte-addressable memory with the
following features:

• Separate data and program address spaces;
• Large on-chip RAM, up to 7MB;
• 2-level cache;
• Single internal program memory port with an
instruction-fetch bandwidth of 256 bits;
• Two 64-bit internal data memory ports.


Memory Map (Internal and External Memory)
• Level 1 Program Cache is 128 Kbit, direct-mapped
• Level 1 Data Cache is 128 Kbit, 2-way set-associative
• Shared Level 2 Program/Data Memory/Cache of 4 Mbit,
which can be configured as
– Mapped memory
– Cache (up to 256 Kbytes)
– A combination of the two
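
Because these sizes are quoted in bits while the cache limit is quoted in bytes, a quick conversion helps when comparing them. A minimal sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Converts a memory size in Kbit to Kbytes (8 bits per byte). */
uint32_t kbit_to_kbytes(uint32_t kbit)
{
    assert(kbit % 8 == 0);
    return kbit / 8;
}
```

Each 128 Kbit L1 cache is therefore 16 Kbytes, and the 4 Mbit (4096 Kbit) Level 2 memory is 512 Kbytes, of which up to 256 Kbytes can be used as cache.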
Memory Buses
• Instruction fetch uses a 32-bit address bus
and a 256-bit data bus
• Two 64-bit load buses (LD1 and LD2)
• Two 64-bit store buses (ST1 and ST2)
Interrupts
• 16 prioritized interrupts: INT_00 to INT_15
• INT_00 has the highest priority and is dedicated
to RESET. This halts the CPU and returns it to
a known state
• The first four interrupts (INT_00 – INT_03) are
fixed and non-maskable
• INT_01 – INT_03 are generally used to alert the
CPU of an impending hardware problem, such
as an imminent power failure
• The remaining interrupts are maskable and can
be programmed
Interrupt Performance Considerations
• Overhead for all CPU interrupts is 7 cycles
• Interrupt latency is 11 cycles
• Interrupts can be recognized every 2
cycles
• 2 occurrences of a specific interrupt can
be recognized in 2 cycles
Peripheral Set
• 2 multichannel buffered audio serial ports
• 2 inter-integrated circuit bus modules (I2Cs)
• 3 multichannel buffered serial ports (McBSPs)
• 3 32-bit general-purpose timers
• 1 user-configurable 16-bit or 32-bit host-port interface
(HPI16/HPI32)
• 1 16-pin general-purpose input/output port (GP0) with
programmable interrupt/event generation modes
• 1 32-bit glueless external memory interface (EMIFA),
capable of interfacing to synchronous and asynchronous
memories and peripherals.
ZBT RAM
• Zero Bus Turnaround (ZBT) is a synchronous SRAM
architecture optimized for networking and
telecommunications applications.
• It can increase the internal bandwidth of a switch
fabric when compared to standard SyncBurst SRAM.
• The ZBT architecture is optimized for switching and
other applications with highly random READs and
WRITEs.
• ZBT SRAMs eliminate all idle cycles when turning the
data bus around from a WRITE operation to a READ
operation
Packaging – Top View
Packaging - Bottom View
Sum of products example

C code:

int DotP(short* m, short* n, int count) {
    int i, product, sum = 0;
    for (i = 0; i < count; i++)
    {
        product = m[i] * n[i];
        sum += product;
    }
    return(sum);
}

TI TMS C64x code:

LOOP:
   [A0]  SUB  .L1    A0, 1, A0
|| [!A0] ADD  .S1    A6, A5, A5
||       MPY  .M1X   B4, A4, A6
|| [B0]  BDEC .S2    LOOP, B0
         LDH  .D1T1  *A3++, A4
         LDH  .D2T2  *B5++, B4
Another code example
MIPS:

loop: LW R1, 0(R11)
      MUL R2, R1, R10
      SW R2, 0(R12)
      ADDI R12, R12, #-4
      ADDI R11, R11, #-4
      BGTZ R12, loop

TI TMS C64x:

      ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MVK .S2 #-4,B1
      ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MUL .M1 A1,A10,A2 || ADDK .S2 #-12,B12
loop: ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MUL .M1 A1,A10,A2 || STW .D2x A2,0(B12) ||
      ADD .L2 B12,B1,B12 || BGTZ .S2 B12, loop
      ADD .L2 B12, B1, B12 || MUL .M1 A1,A10,A2 || STW .D2x A2,0(B12)
      ADD .L2 B12, B1, B12 || STW .D2x A2,0(B12)
Special purpose instructions
Instruction      Description                          Example Application
BITC4            Bit counter                          Machine vision
GMPY4            Galois Field MPY                     Reed Solomon support
SHFL             Bit interleaving                     Convolution encoder
DEAL             Bit de-interleaving                  Cable modem
SWAP4            Byte swap                            Endian swap
XPNDx            Bit expansion                        Graphics
MPYHIx, MPYLIx   Extended precision 16x32 MPYs        Audio
AVGx             Quad 8-bit, Dual 16-bit average      Motion compensation
SUBABS4          Quad 8-bit Absolute of differences   Motion estimation
SSHVL, SSHVR     Signed variable shift                GSM
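
To show why a quad 8-bit absolute difference is useful for motion estimation, the sketch below emulates it in portable C and uses it to compute a sum of absolute differences (SAD) over four pixels. The function names are hypothetical, and treating the packed bytes as unsigned pixel values is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Per-byte absolute difference of two packed unsigned 8-bit vectors. */
uint32_t subabs4_emulated(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        uint32_t diff = (x > y) ? (x - y) : (y - x);
        assert(diff <= 0xFFu);
        result |= diff << (8 * lane);
    }
    return result;
}

/* Sum of absolute differences over four packed pixels. */
uint32_t sad4(uint32_t a, uint32_t b)
{
    uint32_t d = subabs4_emulated(a, b);
    return (d & 0xFFu) + ((d >> 8) & 0xFFu) + ((d >> 16) & 0xFFu) + (d >> 24);
}
```

A block-matching search would accumulate `sad4` over every word of a candidate block and keep the candidate with the smallest total.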
THE END
