Multi-Core Technology
[Figure: 2004, single core with cache; 2007, multi-core with 4 or more cores, each core with cache]
Why multi-core?
Difficult to make single-core clock frequencies even higher
Deeply pipelined circuits:
  heat problems
  clock problems
  efficiency (stall) problems
Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, is extremely difficult. The processor would have to:
  issue 3 or 4 data memory accesses per cycle,
  rename and access more than 20 registers per cycle, and
  fetch 12 to 24 instructions per cycle.
Many new applications are multithreaded
General trend in computer architecture (shift towards more parallelism)
Instruction-level parallelism
Parallelism at the machine-instruction level
The processor can re-order, pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.
Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
More examples
Editing a photo while recording a TV show through a digital video recorder
Downloading software while running an anti-virus program
Anything that can be threaded today will map efficiently to multi-core
BUT: some applications are difficult to parallelize
Without SMT, only a single thread can run at any given time
[Figure: core block diagram showing L1 D-cache and D-TLB, integer and floating-point units, schedulers, and bus]
[Figure: single core with integer and floating-point units, schedulers, and bus]
This scenario is impossible with SMT on a single core (assuming a single integer unit)
[Figure: two cores, each with its own integer and floating-point units, decoder, and bus interface]
The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
Intel calls them hyper-threads
[Figure: cache hierarchy diagrams showing L1, L2, and L3 caches in front of memory]
Both L1 and L2 are private
  Examples: AMD Opteron, AMD Athlon, Intel Pentium D
A design with L3 caches
  Example: Intel Itanium 2
[Figure sequence: multi-core chip and main memory. Main memory first holds x=15213, then x=21660 (assuming write-through caches).]
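The values above (x=15213, then x=21660 under write-through caches) can be turned into a small simulation. This is a minimal sketch under assumptions not stated in the surviving slide text: each core has a private cache, both cores read x before one of them writes it, and no coherence protocol is in place. The class `WriteThroughCache` and the variable names are hypothetical, chosen only for illustration.

```python
class WriteThroughCache:
    """Private per-core cache; every write is also sent to main memory."""

    def __init__(self, memory):
        self.memory = memory   # shared main memory (a dict)
        self.lines = {}        # this core's private copies

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]              # hit: the copy may be stale

    def write(self, addr, value):
        self.lines[addr] = value             # update the private copy
        self.memory[addr] = value            # write-through to main memory


main_memory = {"x": 15213}
core1 = WriteThroughCache(main_memory)
core2 = WriteThroughCache(main_memory)

first_read_core1 = core1.read("x")   # core 1 caches x=15213
first_read_core2 = core2.read("x")   # core 2 caches x=15213
core1.write("x", 21660)              # main memory now holds 21660
stale_read_core2 = core2.read("x")   # core 2 still sees its old copy

print("main memory:", main_memory["x"])
print("core 2 reads:", stale_read_core2)
```

In this sketch main memory is updated to 21660, but core 2 keeps returning 15213 because nothing invalidates or updates its private copy. That mismatch is the kind of inconsistency a cache coherence mechanism is meant to prevent.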
Memory Wall
[Figure: performance (log scale, 1 to 1000) versus year, 1980 to 2000]
CPU (Moore's Law): 60%/yr (2X/1.5 yr)
DRAM: 9%/yr (2X/10 yrs)
Processor-memory performance gap: grows 50%/year
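The growth rates quoted for this figure can be checked with a few lines of arithmetic. The sketch below assumes simple annual compounding; the variable and function names are illustrative and not part of the original slides.

```python
import math

cpu_growth = 0.60    # processor performance improvement per year
dram_growth = 0.09   # DRAM performance improvement per year

# The gap is the ratio of the two curves, so its yearly growth is
# the ratio of the yearly growth factors.
gap_growth = (1 + cpu_growth) / (1 + dram_growth) - 1


def doubling_time(rate):
    """Years needed to double at a given compound annual rate."""
    return math.log(2) / math.log(1 + rate)


print(f"gap grows about {gap_growth:.0%} per year")
print(f"CPU doubles every {doubling_time(cpu_growth):.1f} years")
print(f"DRAM doubles every {doubling_time(dram_growth):.1f} years")
```

Under these assumptions the gap grows by a little under 50% per year, consistent with the rounded figure on the slide. A 9% annual rate compounds to a doubling in roughly 8 years, so the quoted 2X/10 yrs appears to be a rounded value.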
Latency in a Single PC
[Figure: years 1997 to 2009; series: CPU clock period (ns), memory system access time, and the ratio of memory access time to CPU time; annotation: THE WALL]
Technology Trends
Capacity and speed (latency) trends for logic, DRAM, and disk

DRAM:
Year    Size
1980    64 Kb
1983    256 Kb
1986    1 Mb
1989    4 Mb
1992    16 Mb
1996    64 Mb
1998    128 Mb
2000    256 Mb
2002    512 Mb
2006    1024 Mb

Capacity: 16000:1
Latency: 4:1
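The 16000:1 capacity figure follows from the first and last sizes in the table. A short sketch of that arithmetic, assuming 1 Kb = 1024 bits and 1 Mb = 1024 Kb (the slides do not state which convention they use):

```python
import math

KB = 1024          # bits per Kb (assumed binary convention)
MB = 1024 * KB     # bits per Mb

first_year, first_bits = 1980, 64 * KB
last_year, last_bits = 2006, 1024 * MB

capacity_ratio = last_bits // first_bits
doublings = math.log2(capacity_ratio)
years_per_doubling = (last_year - first_year) / doublings

print(f"capacity ratio: {capacity_ratio}:1")
print(f"doublings: {doublings:.0f}")
print(f"one doubling every {years_per_doubling:.2f} years")
```

The exact ratio is 16384:1, which the slide rounds to 16000:1. Over the same period latency improved only 4:1.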
CPU speed (MHz), CPU cycle (ns), and memory access time (ns) by year
Memory access (ns) / CPU cycle (ns) - 1:
190/125 - 1 = 0.5
165/30 - 1 = 4.5
120/16.6 - 1 = 6.2
110/5 - 1 = 21
100/3.33 - 1 = 29
90/1 - 1 = 89
80/0.5 - 1 = 159
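The ratios above can be recomputed directly from the memory access times and CPU cycle times. In the sketch below, reading the result as extra CPU cycles spent waiting on each memory access is an interpretation; the slides give only the formula. The function name `stall_cycles` is hypothetical.

```python
# (memory access time in ns, CPU cycle time in ns), taken from the formulas above
rows = [
    (190, 125),
    (165, 30),
    (120, 16.6),
    (110, 5),
    (100, 3.33),
    (90, 1),
    (80, 0.5),
]


def stall_cycles(mem_ns, cycle_ns):
    """Memory access time expressed in CPU cycles, minus one."""
    return mem_ns / cycle_ns - 1


stalls = [stall_cycles(mem_ns, cycle_ns) for mem_ns, cycle_ns in rows]

for (mem_ns, cycle_ns), cycles in zip(rows, stalls):
    print(f"{mem_ns} ns / {cycle_ns} ns - 1 = {cycles:.1f}")
```

Memory access time falls by a little more than half across the rows (190 ns to 80 ns), while the cycle time falls by a factor of 250 (125 ns to 0.5 ns), which is why the ratio climbs from about 0.5 to 159.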
Main memory generally uses Dynamic RAM (DRAM), which uses a single transistor to store a bit but requires a periodic data refresh (~every 8 msec).
Cache uses SRAM (Static Random Access Memory): no refresh (6 transistors/bit vs. 1 transistor/bit for DRAM).
Main Memory
Size ratio, DRAM/SRAM: 4-8
Cost and cycle time ratio, SRAM/DRAM: 8-16
Main memory performance:
  Memory latency:
    Access time: the time between a memory access request and the time the requested information is available to cache/CPU.
    Cycle time: the minimum time between requests to memory (greater than access time in DRAM, to allow address lines to be stable).
Apply Recursively
Level-one cache(s) Level-two cache