
Pipelining for Multi-Core Architectures

Multi-Core Technology
[Figure: timeline of multi-core technology
  2004: single core (one core + cache)
  2005: dual core (2 or more cores, each with its own cache)
  2007: multi-core (4 or more cores)]
Why multi-core?
- Difficult to make single-core clock frequencies even higher
- Deeply pipelined circuits: heat problems, clock problems, efficiency (stall) problems
- Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, is extremely difficult: the core would need to issue 3 or 4 data memory accesses per cycle, rename and access more than 20 registers per cycle, and fetch 12 to 24 instructions per cycle
- Many new applications are multithreaded
- General trend in computer architecture (shift towards more parallelism)

Instruction-level parallelism
- Parallelism at the machine-instruction level (illustrated in the fragment below)
- The processor can reorder instructions, pipeline them, split them into micro-instructions, do aggressive branch prediction, etc.
- Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
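A small C fragment makes the distinction concrete (our own illustrative example, not from the slides): a superscalar core can issue the three independent additions together, but must serialize the dependent chain.

/* Independent work: no result feeds the next statement, so an
   out-of-order superscalar core can issue all three at once. */
int independent(int x, int y, int u, int v, int p, int q) {
    int a = x + y;
    int b = u + v;
    int c = p + q;
    return a ^ b ^ c;
}

/* Dependent chain: each addition consumes the previous result,
   so no amount of issue width can overlap them. */
int dependent(int x, int y, int u, int p) {
    int d = x + y;
    int e = d + u;
    int f = e + p;
    return f;
}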

Thread-level parallelism (TLP)


- This is parallelism on a coarser scale
- A server can serve each client in a separate thread (Web server, database server)
- A computer game can do AI, graphics, and sound in three separate threads (sketched in code below)
- Single-core superscalar processors cannot fully exploit TLP
- Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
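As a concrete sketch of the game example (subsystem names and bodies are placeholders, not from the slides), each coarse-grained task gets its own POSIX thread:

#include <pthread.h>

/* Placeholder subsystem loops; each is one coarse-grained task. */
void *ai_loop(void *arg)       { (void)arg; /* ... run game AI ... */   return NULL; }
void *graphics_loop(void *arg) { (void)arg; /* ... render frames ... */ return NULL; }
void *sound_loop(void *arg)    { (void)arg; /* ... mix audio ... */     return NULL; }

int main(void) {
    pthread_t ai, gfx, snd;
    /* Three independent threads: a multi-core CPU can schedule each
       onto its own core, exploiting thread-level parallelism. */
    pthread_create(&ai,  NULL, ai_loop,       NULL);
    pthread_create(&gfx, NULL, graphics_loop, NULL);
    pthread_create(&snd, NULL, sound_loop,    NULL);
    pthread_join(ai, NULL);
    pthread_join(gfx, NULL);
    pthread_join(snd, NULL);
    return 0;
}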

What applications benefit from multi-core?


- Database servers
- Web servers (Web commerce)
- Multimedia applications
- Scientific applications, CAD/CAM
- In general, applications with thread-level parallelism (as opposed to instruction-level parallelism): each thread can run on its own core

More examples
- Editing a photo while recording a TV show through a digital video recorder
- Downloading software while running an anti-virus program
- Anything that can be threaded today will map efficiently to multi-core
- BUT: some applications are difficult to parallelize

Core 2 Duo Microarchitecture

Without SMT, only a single thread can run at any given time.

[Figure: one core's pipeline (bus, BTB and I-TLB, decoder, trace cache, uCode ROM, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, L2 cache and control) occupied entirely by Thread 1: floating point]

[Figure: the same pipeline occupied entirely by Thread 2: integer operation]

SMT processor: both threads can run concurrently


[Figure: the same pipeline with Thread 1 (floating point) and Thread 2 (integer operation) active at the same time in one core]

But: can't simultaneously use the same functional unit

[Figure: both threads trying to issue to the core's single integer unit at once, marked IMPOSSIBLE]

This scenario is impossible with SMT on a single core (assuming a single integer unit).

Multi-core: threads can run on separate cores


[Figure: a dual-core chip; each core has its own complete pipeline (decoder, schedulers, execution units, L1 D-cache/D-TLB, L2 cache and control), with Thread 1 running on one core and Thread 2 on the other]

[Figure: the same dual-core arrangement running Thread 3 and Thread 4]

Combining Multi-core and SMT


- Cores can be SMT-enabled (or not)
- The different combinations:
  - Single-core, non-SMT: standard uniprocessor
  - Single-core, with SMT
  - Multi-core, non-SMT
  - Multi-core, with SMT: our fish machines
- The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads (queried in the sketch below)
- Intel calls them hyper-threads
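On Linux, for example, the operating system exposes the product of cores and SMT threads per core as "logical processors"; a minimal query (Linux/POSIX-specific):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Logical processors = cores x SMT threads per core; e.g. a
       dual-core with 2-way SMT (hyper-threading) reports 4. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("%ld logical processors online\n", n);
    return 0;
}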

SMT Dual-core: all four threads can run concurrently


[Figure: SMT dual-core; Thread 1 and Thread 2 share one core's pipeline while Thread 3 and Thread 4 share the other's]

Multi-core and cache coherence

[Figure: two dual-core cache designs: (left) each core with private L1 and L2 caches in front of memory; (right) each core with private L1 and L2 caches plus an L3 cache in front of memory]

- Both L1 and L2 are private. Examples: AMD Opteron, AMD Athlon, Intel Pentium D
- A design that adds an L3 cache. Example: Intel Itanium 2

The cache coherence problem


- Since we have private caches: how do we keep the data consistent across caches?
- Each core should perceive the memory as a monolithic array, shared by all the cores

The cache coherence problem


Suppose variable x initially contains 15213.

[Figure: four cores on a multi-core chip, each with one or more levels of private cache (all empty); main memory holds x=15213]

The cache coherence problem


Core 1 reads x.

[Figure: Core 1's cache now holds x=15213; main memory holds x=15213]

The cache coherence problem


Core 2 reads x.

[Figure: Core 1's and Core 2's caches each hold x=15213; main memory holds x=15213]

The cache coherence problem


Core 1 writes to x, setting it to 21660.

[Figure: Core 1's cache holds x=21660; Core 2's cache still holds x=15213; main memory holds x=21660, assuming write-through caches]

The cache coherence problem


Core 2 attempts to read x and gets a stale copy.

[Figure: Core 2 reads x=15213 from its own cache even though main memory now holds x=21660]
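The same four-step scenario, expressed as a C sketch. This is illustrative only: the thread and variable names are ours, the unsynchronized access is deliberately a data race, and on real hardware the coherence protocol is exactly what prevents the stale read in step 4.

#include <pthread.h>
#include <stdio.h>

int x = 15213;                  /* shared variable, initially 15213 */

void *core1_work(void *arg) {
    (void)arg;
    int r = x;                  /* step 1: Core 1 reads x (caches 15213) */
    x = 21660;                  /* step 3: Core 1 writes x               */
    (void)r;
    return NULL;
}

void *core2_work(void *arg) {
    (void)arg;
    int r1 = x;                 /* step 2: Core 2 reads x (caches 15213) */
    int r2 = x;                 /* step 4: without coherence, this read
                                   could return the stale 15213 still
                                   sitting in Core 2's own cache         */
    printf("r1=%d r2=%d\n", r1, r2);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, core1_work, NULL);
    pthread_create(&t2, NULL, core2_work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}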

The Memory Wall Problem


Memory Wall
[Figure: performance (log scale) vs. year, 1980-2000. Processor performance grows at ~60%/yr (Moore's Law, 2x every 1.5 years) while DRAM performance grows at ~9%/yr (2x every 10 years); the processor-memory performance gap grows about 50% per year]
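The ~50%/year gap is just the ratio of the two growth rates; a one-line sanity check in C:

#include <stdio.h>

int main(void) {
    /* CPU improves ~60%/yr, DRAM ~9%/yr; the gap between them grows
       by 1.60 / 1.09 - 1 = ~0.47, i.e. roughly 50% per year. */
    printf("gap growth: %.0f%% per year\n", (1.60 / 1.09 - 1.0) * 100.0);
    return 0;
}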

Latency in a Single PC
[Figure: 1997-2009. CPU clock period (ns) keeps shrinking while memory access time (ns) stays nearly flat, so the ratio of memory access time to CPU cycle time climbs steadily: THE WALL]

Memory to CPU Ratio

Pentium 4 cache hierarchy (load latencies):

  Level                 Latency (cycles)
  L1 I (12Ki)           -
  L1 D (8 KiB)          2
  L2 cache (512 KiB)    19
  L3 cache (2 MiB)      43
  Memory                206
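Latencies like these can be observed with a pointer-chasing loop: every load depends on the previous one, so the average time per load approximates the latency of whichever level the working set fits in. A minimal sketch (buffer size, iteration count, and the use of clock() are our choices, not from the slides):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t n = 1u << 22;                 /* 4M pointers = 32 MiB, beyond L3 */
    void **buf = malloc(n * sizeof *buf);
    size_t *idx = malloc(n * sizeof *idx);
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) { /* Fisher-Yates shuffle to defeat prefetching */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)       /* link cells into one random cycle */
        buf[idx[i]] = &buf[idx[(i + 1) % n]];

    void **p = &buf[idx[0]];
    clock_t t0 = clock();
    for (size_t i = 0; i < n; i++)       /* each load depends on the previous */
        p = (void **)*p;
    double ns = (double)(clock() - t0) / CLOCKS_PER_SEC * 1e9 / (double)n;
    printf("%.1f ns per dependent load (final p=%p)\n", ns, (void *)p);
    free(buf); free(idx);
    return 0;
}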

Technology Trends

Capacity and speed (latency) trends:

  Technology   Capacity        Speed (latency)
  Logic        2x in 3 years   2x in 3 years
  DRAM         4x in 3 years   2x in 10 years
  Disk         4x in 3 years   2x in 10 years

DRAM Generations:

  Year   Size      Cycle Time
  1980   64 Kb     250 ns
  1983   256 Kb    220 ns
  1986   1 Mb      190 ns
  1989   4 Mb      165 ns
  1992   16 Mb     120 ns
  1996   64 Mb     110 ns
  1998   128 Mb    100 ns
  2000   256 Mb    90 ns
  2002   512 Mb    80 ns
  2006   1024 Mb   60 ns

Across these generations: 16000:1 improvement in capacity, but only about 4:1 in latency.

Processor-DRAM Performance Gap Impact: Example


To illustrate the performance impact, assume a single-issue pipelined CPU with CPI = 1 using non-ideal memory. The minimum cost of a full memory access in terms of number of wasted CPU cycles:

  Year   CPU speed   CPU cycle   Memory access   Minimum CPU cycles
         (MHz)       (ns)        (ns)            (or instructions) wasted
  1986   8           125         190             190/125  - 1 = 0.5
  1989   33          30          165             165/30   - 1 = 4.5
  1992   60          16.6        120             120/16.6 - 1 = 6.2
  1996   200         5           110             110/5    - 1 = 21
  1998   300         3.33        100             100/3.33 - 1 = 29
  2000   1000        1           90              90/1     - 1 = 89
  2003   2000        0.5         80              80/0.5   - 1 = 159
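The last column follows a single formula: wasted cycles = memory access time / CPU cycle time - 1. As a sketch:

#include <stdio.h>

/* Wasted CPU cycles per full memory access, assuming CPI = 1:
   access_time / cycle_time - 1 (the -1 credits the cycle in which
   the access instruction itself completes). */
double wasted_cycles(double access_ns, double cycle_ns) {
    return access_ns / cycle_ns - 1.0;
}

int main(void) {
    /* Reproduce the 2003 row: 2000 MHz -> 0.5 ns cycle, 80 ns memory. */
    printf("%.0f cycles wasted\n", wasted_cycles(80.0, 0.5));  /* 159 */
    return 0;
}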

Main Memory

- Main memory generally uses Dynamic RAM (DRAM), which stores a bit with a single transistor but requires a periodic data refresh (~every 8 ms)
- Cache uses SRAM (Static Random Access Memory): no refresh needed, at a cost of 6 transistors/bit vs. 1 transistor/bit for DRAM
- Size: DRAM/SRAM is about 4-8x; cost and cycle time: SRAM/DRAM is about 8-16x
- Main memory performance:
  - Memory latency:
    - Access time: the time between a memory access request and the moment the requested information is available to the cache/CPU
    - Cycle time: the minimum time between requests to memory (greater than access time in DRAM, to allow the address lines to stabilize)
  - Memory bandwidth: the maximum sustained data transfer rate between main memory and the cache/CPU
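For instance, peak bandwidth is just bus width times transfer rate; a toy calculation (the bus parameters here are hypothetical):

#include <stdio.h>

int main(void) {
    /* Hypothetical memory bus: 8 bytes wide, one transfer per cycle
       at 100 MHz. Peak bandwidth = width x rate. */
    double width_bytes = 8.0;
    double transfers_per_sec = 100e6;
    printf("peak bandwidth: %.0f MB/s\n",
           width_bytes * transfers_per_sec / 1e6);   /* 800 MB/s */
    return 0;
}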

Architects Use Transistors to Tolerate Slow Memory


- Cache: small, fast memory holding information expected to be used soon; mostly successful
- Apply recursively: level-one cache(s), level-two cache
- Most of a microprocessor's die area is cache!
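The "apply recursively" idea is captured by the standard average memory access time (AMAT) expansion. A sketch, reusing the Pentium 4 latencies from the table above; the miss rates are purely hypothetical:

#include <stdio.h>

int main(void) {
    /* Two-level AMAT: L1 hit time + L1 miss rate x (L2 hit time +
       L2 miss rate x memory penalty). Latencies follow the Pentium 4
       table above; the miss rates are made-up illustrative values. */
    double l1_hit = 2.0,  l1_miss_rate = 0.05;
    double l2_hit = 19.0, l2_miss_rate = 0.10;
    double mem_penalty = 206.0;
    double amat = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_penalty);
    printf("AMAT = %.2f cycles\n", amat);   /* 2 + 0.05 * 39.6 = 3.98 */
    return 0;
}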


