
Multicore: Commercial Processors

© Sudhakar Yalamanchili, Georgia Institute of Technology

Some Examples

• Desktop and Server/Enterprise Space


– Intel

– AMD

– SUN Microsystems

• The Embedded Space: Freescale Semiconductor

ECE 4100/6100 (2)

Focus

• The Chip Level Architecture


– What do we have on chip?

• The Core Architecture


– Note the presence/absence/configuration of concepts
studied earlier in class
– Rationalize the design decisions that led to the preceding
– What can/should we expect next?

• Building systems using multicore chips

ECE 4100/6100 (3)

The Intel Core Duo Processor Series


Intel Core Duo

• Homogeneous cores
• Bus-based on-chip interconnect
• Shared Memory
• Traditional I/O

Core: classic OOO (reservation stations, issue ports, schedulers, etc.)
Cache: large, shared, set associative, prefetch, etc.

Source: Intel Corp.

ECE 4100/6100 (5)

Intel Core Duo: Vital Stats


• 151 million transistors; Shared 2 MB L2 cache
• Each core has a 12-stage pipeline (Yonah)
• Low-power (less than 25 watts) dual-core microprocessor
• Supports Intel’s Vanderpool virtualization technology
• EM64T (Intel x86-64 extensions) is not supported
– Desktop market: not a severe limitation, given the lack of 64-bit OS and software
– Sossaman processor for servers, which is based on Yonah, also lacks
EM64T support → severe disadvantage
• Communication between the L2 cache and both execution cores is
handled by an arbitration bus unit
– Eliminates cache coherency traffic over the FSB
– Raises the core-to-L2 latency
– The increase in clock frequency offsets the impact
• Core processors communicate with the system chipset over a 667
MT/s front side bus (FSB), up from 533 MT/s used by the fastest
Pentium M.
• Intel Core Solo uses the same two-core die as the Core Duo, but
features only one active core
– Chips failing quality control can be sold
– Core 2 Duo processors will also include the ability to disable one core to
conserve power
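
As a rough worked example of the FSB figures above, peak bandwidth is the transfer rate times the bus width. The sketch below assumes a 64-bit (8-byte) FSB data path, which the slide does not state; the function name fsb_peak_gb_per_s is hypothetical.

```python
# Peak front-side-bus bandwidth = transfers per second * bytes per transfer.
# Assumption (not from the slide): the FSB data path is 64 bits (8 bytes) wide.
FSB_WIDTH_BYTES = 8


def fsb_peak_gb_per_s(mega_transfers_per_s: float) -> float:
    """Return peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return mega_transfers_per_s * 1e6 * FSB_WIDTH_BYTES / 1e9


core_duo = fsb_peak_gb_per_s(667)    # Core Duo FSB
pentium_m = fsb_peak_gb_per_s(533)   # fastest Pentium M FSB
print(f"667 MT/s: {core_duo:.2f} GB/s, 533 MT/s: {pentium_m:.2f} GB/s")
```

Under that assumption the faster bus raises peak bandwidth by roughly a quarter; sustained bandwidth would be lower.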

ECE 4100/6100 (6)

The Core™ micro-architecture

Source: Ars Technica

ECE 4100/6100 (7)

The Core Execution core

Source: Ars Technica

ECE 4100/6100 (8)

Intel Core Duo
• High memory latency due to the lack of an on-die
memory controller (further aggravated by the system
chipset's use of DDR-II RAM)
• Main-memory transactions have to pass through
the Northbridge of the chipset
– Higher latency compared to AMD's Turion platform
– Weakness shared by the entire line of Pentium processors
– L2 cache is quite effective at hiding main-memory latency
• Execution units
– Three 64-bit integer execution units: one CIU (complex) + two SIU (simple)
– Two FPUs
– Poor floating-point unit (FPU) throughput
• Little to no improvement in "performance per watt" for
single-threaded applications compared to its predecessor
ECE 4100/6100 (9)

Core 2 Duo and Core Duo

Source: Intel Corp.

• Very similar architectures
• Bump in the processor speed
• Increase in Level 2 cache (2 MB to 4 MB)
• Both chips are built on a 65 nm process and support a
667 MHz front-side bus (FSB)
• 14-stage pipeline in Core 2 Duo (12 stages in Core Duo)

ECE 4100/6100 (10)

Intel® Core™2 Duo Processor

Process Technology: 65 nm
Number of Processor Cores: 2
L2 Cache Size (shared between 2 processor cores): up to 4 MB
Transistor Gate Height / Gate Oxide Thickness (65 nm): 1.2 nm
Transistor Gate Length (for 65 nm Process Technology): 35 nm
Line Width: 65 nm
Number of Transistors: 291 million
Processor Die Size: 143 mm²
Average Power: <1.1 Watt

ECE 4100/6100 (11)

Intel Core 2 Duo

Source: Hard Core Hardware

ECE 4100/6100 (12)

Wide Dynamic Execution

Source: Bit Tech


ECE 4100/6100 (13)

Wide Dynamic Execution

Source: Bit Tech


ECE 4100/6100 (14)

Wide Dynamic Execution

• 4-wide pipeline: each core can fetch, decode, issue, and retire up to
4 instructions per cycle (Pentium M and Pentium 4 NetBurst are 3-wide)
• Delivery of more instructions per clock cycle
• Pipeline depth of 14 stages vs. 31 in the Pentium 4 (Prescott)
– Compromise between efficient execution of short
instructions and long instructions
• Ops fusion
– Less work for the processor pipeline to run
– Micro-ops fusion
   – fuses micro-ops that come from the same x86 instruction so they
     move through the pipeline as one
– Macro-ops fusion
   – works on the x86 instructions themselves, not just their micro-op
     derivatives: a common pair such as a compare followed by a
     conditional jump becomes a single micro-op
– Instruction loads and micro-ops can be reduced by
approximately 15% and 10%, respectively
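
The effect of macro-op fusion on micro-op count can be illustrated with a toy decoder. This is only a sketch under stated assumptions: every instruction is taken to decode to one micro-op, the only fusible pair is a compare/test followed by a conditional jump, and the function name count_uops and the sample instruction stream are made up for illustration.

```python
# Toy decoder: counts micro-ops with and without macro-op fusion.
# Assumptions (illustrative only): every instruction decodes to one micro-op,
# and the only fusible pair is a compare/test followed by a conditional jump.
COMPARES = {"cmp", "test"}
COND_JUMPS = {"je", "jne", "jl", "jle", "jg", "jge"}


def count_uops(instructions, macro_fusion):
    uops = 0
    i = 0
    while i < len(instructions):
        op = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if macro_fusion and op in COMPARES and nxt in COND_JUMPS:
            i += 2  # the pair travels down the pipeline as one micro-op
        else:
            i += 1
        uops += 1
    return uops


loop_body = ["mov", "add", "cmp", "jne", "mov", "test", "je"]
print(count_uops(loop_body, macro_fusion=False))  # 7
print(count_uops(loop_body, macro_fusion=True))   # 5
```

The saving depends entirely on how often such pairs occur in real code, which is why the slide quotes an average rather than a fixed ratio.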

ECE 4100/6100 (15)

Intelligent Power Capability

Source: Bit Tech


ECE 4100/6100 (16)

Intelligent Power Capability

• SpeedStep technology
– Dynamic clock speed reduction
– Intel mobile processors include this already
– Enhanced SpeedStep used in Core 2 Duo

• Controller that turns on sections of the processor only as
needed. One core can be shut down for single-threaded applications

• Power consumption decreased by enhancements to
Intel's 65 nm process node
– use low-k dielectrics and strained silicon
– use low-leakage and "sleep" transistors

ECE 4100/6100 (17)

Advanced Smart Cache

Source: Bit Tech


ECE 4100/6100 (18)

Advanced Smart Cache

Source: Bit Tech


• Both cores share data stored in the L2 cache via an arbitration
bus unit embedded in the cache.
– Dynamically allocates cache space between the two cores,
minimising bus traffic by allowing both cores to access one copy
of data
• Does a larger L2 cache matter?
– Studies indicate that going from 2 MB to 4 MB improves execution
time only slightly (2-4%) for most applications

ECE 4100/6100 (19)

Smart Memory Access

Source: Bit Tech


ECE 4100/6100 (20)

Smart Memory Access

• Improved prefetch units
• Memory disambiguation
– Allows re-ordering instructions more efficiently

[Figures: memory aliasing; execution with and without memory disambiguation]
Example from http://arstechnica.com/articles/paedia/cpu/core.ars/8

Source: Ars Technica
ECE 4100/6100 (21)
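
The aliasing check that memory disambiguation relies on can be sketched as follows. This is a toy model, not the hardware mechanism: it assumes every address is already known, whereas a real core has to predict whether a load aliases an earlier store and recover if the prediction is wrong. The name can_hoist_load and the sample addresses are hypothetical.

```python
# Toy model of the check behind memory disambiguation.
# Assumption: all addresses are already known; real hardware has to predict
# whether a load aliases an earlier store and recover when the guess is wrong.
def can_hoist_load(ops, load_index):
    """True if the load at load_index aliases no earlier store in ops.

    ops is a program-order list of (kind, address) tuples,
    where kind is "load" or "store".
    """
    kind, load_addr = ops[load_index]
    if kind != "load":
        raise ValueError("ops[load_index] must be a load")
    return all(
        not (k == "store" and addr == load_addr)
        for k, addr in ops[:load_index]
    )


program = [
    ("store", 0x100),
    ("store", 0x200),
    ("load", 0x300),   # touches a different address than both stores
    ("load", 0x100),   # aliases the first store
]
print(can_hoist_load(program, 2))  # True: safe to run ahead of the stores
print(can_hoist_load(program, 3))  # False: must wait for the store to 0x100
```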

Advanced Digital Media Boost

Source: Bit Tech ECE 4100/6100 (22)

Advanced Digital Media Boost

• Streaming SIMD Extension (SSE) instructions
– SSE instructions are an extension of the standard x86
instruction set.
– Used in multimedia encoding, decoding, image
manipulation and encryption
• SSE execution units are 128 bits wide
– Up from 64 bits, so a 128-bit SSE instruction executes in one pass
instead of two
– Doubles SSE performance over the previous generation

ECE 4100/6100 (23)

Comparison of SSE to prior processors

Source: Ars Technica


ECE 4100/6100 (24)

Intel Conroe Vs Presler
Conroe Presler

Source: Bit Tech


• What is the major difference?
– Shared L2 (Conroe) versus separate per-core L2 caches (Presler)

ECE 4100/6100 (25)

Intel’s Roadmap for Multicore

[Figure: roadmap of cores and cache per chip, 2006 to 2008, in three panels
(enterprise, desktop, and mobile processors). Entries range from single-core
(SC) parts with 512 KB to 2 MB of cache, through dual-core (DC) parts with
2 MB to 16 MB, to quad-core (QC) and eight-core (8C) 45 nm parts with up to
16 MB of shared cache.]

Source: Adapted from Tom’s Hardware

• Drivers are
– Market segments
– More cache
– More cores
• 80 core processor prototype has been designed!
ECE 4100/6100 (26)

Intel Chipset Example

Source: Extreme Tech ECE 4100/6100 (27)

References and Links

• http://www.intel.com/products/processor/coreduo/
• http://en.wikipedia.org/wiki/Intel_Core
• http://www.hothardware.com/viewarticle.aspx?articleid=845&cid=1
• http://www.bit-tech.net/hardware/2006/03/10/intel_core_microarchitecture/
• http://www.bit-tech.net/hardware/2006/05/19/intel_core_duo_t2600_on_the_desktop
• http://www.bit-tech.net/hardware/2006/07/14/intel_core_2_duo_processors/
• http://www.hardcoreware.net/reviews/review-347-1.htm
• http://www.trustedreviews.com/cpu-memory/review/2006/08/28/Intel-Core-2-Duo-Merom-Notebooks/p1
• http://www.trustedreviews.com/cpu-memory/review/2006/07/14/Intel-Core-2-Duo-Conroe-E6400-E6600-E6700-X6800/p1
• http://techreport.com/reviews/2006q2/core-duo/index.x?pg=1
• http://arstechnica.com/articles/paedia/cpu/core.ars/1
• http://www.anandtech.com/mobile/showdoc.aspx?i=2663&p=4
• http://www.extremetech.com/article2/0,1697,1988794,00.asp
• http://www.coreduoinfo.com/blog/about-intel-core-duo/
• http://67.91.114.164/intel_c2d_info.htm
• http://www.pcper.com/article.php?aid=272&type=expert

ECE 4100/6100 (28)

AMD MultiCore Processors


Dual Core AMD Opteron

Source: AMD

ECE 4100/6100 (30)

AMD Multicore (Dual-Core) Opteron

• Two AMD Opteron CPU cores on a single die
– Each has 1 MB L2 cache
• 90 nm, ~205 million transistors
– Approximately same die size as 130 nm single-core AMD
Opteron processor
• 95 watt power envelope
– fits into 90 nm power infrastructure
• Introduced with “K8” Revision E core in April 2005

[Figure: die layout with Core 0 and Core 1, each with a 1-MB L2, and the Northbridge]

Source: AMD ECE 4100/6100 (31)

Opteron Core Pipeline

Source: Chip
Architect ECE 4100/6100 (32)

AMD Opteron Processor Core Architecture

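[Figure: block diagram of the Opteron core. Front end: fetch and branch
prediction, 64 KB L1 Icache, scan/align/decode through the fastpath or the
microcode engine into µops. Instruction Control Unit (72 entries). Integer
side: decode & rename, three reservation stations (Res), three ALUs, three
AGUs, and a multiplier (MULT). Floating-point side: decode & rename, 36-entry
FP scheduler, FADD, FMUL, and FMISC units. Memory side: 64 KB L1 Dcache and a
44-entry load/store queue.]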

Source: The 3D shop ECE 4100/6100 (33)

Dual Core AMD Opteron

• AMD64 technology
– Runs 32-bit applications and is 64-bit capable
– Compatible with the x86 software infrastructure
– Enables a single architecture across 32- and 64-bit environments
• Direct Connect Architecture
– NUMA system
   – Each processor shares its memory with other processors in the
     system
– Integrated memory controller on-die
   – DDR2 DRAM memory controller offers memory bandwidth up to
     10.7 GB/s per processor
– HyperTransport
   – Point-to-point interconnect can be used to build a mesh of
     multiple-processor Opteron systems
   – Scalable bandwidth interconnect between processors, I/O
     subsystems, and other chipsets
   – 24.0 GB/s peak bandwidth per processor
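
One plausible way to arrive at the two per-processor bandwidth figures above is sketched below. The configuration is an assumption, not something the slide states: dual-channel DDR2-667 with 64-bit channels, and three HyperTransport links at 8 GB/s each (8 GB/s is the link label used in the system diagram later in the deck).

```python
# Reconstructing the per-processor bandwidth figures quoted above.
# Assumptions (not stated on the slide): dual-channel DDR2-667 with 64-bit
# channels, and three HyperTransport links at 8 GB/s each.
DDR2_TRANSFERS_PER_S = 667e6   # per channel (assumed DDR2-667)
CHANNEL_BYTES = 8              # 64-bit channel (assumed)
CHANNELS = 2                   # dual channel (assumed)
HT_LINKS = 3                   # links per processor (assumed)
HT_GB_PER_S_PER_LINK = 8.0     # per-link peak (assumed)

memory_bw = DDR2_TRANSFERS_PER_S * CHANNEL_BYTES * CHANNELS / 1e9
ht_bw = HT_LINKS * HT_GB_PER_S_PER_LINK
print(f"memory: {memory_bw:.1f} GB/s, HyperTransport: {ht_bw:.1f} GB/s")
```

Both values are peaks; sustained bandwidth in a real system would be lower.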

ECE 4100/6100 (34)

Dual Core AMD Opteron

• Not a simple aggregation of K8 cores


– Integrated the cores for efficiency
• Dual-core Opteron acts very much like an SMP system
• Compatible with existing single-threaded and multi-threaded
(hyperthreaded) software
• MOESI coherency protocol (O = “Owned”)
– Updates through the system request interface
• SSE3 support with 10 new instructions.
• Quad-core upgradeability
• Hardware assisted AMD Virtualization
• Optimized Power Management
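
The MOESI states mentioned above can be illustrated with a minimal state machine for one cache line. This is a simplified sketch, not AMD's implementation: snooping is treated as instantaneous, evictions are ignored, and only line states are tracked, not data. The class name Line is hypothetical.

```python
# Minimal MOESI sketch for a single cache line shared by several caches.
# Simplifying assumptions: snooping is instantaneous, there are no evictions,
# and the data values themselves are not modeled, only the line states.
M, O, E, S, I = "M", "O", "E", "S", "I"


class Line:
    def __init__(self, n_caches):
        self.state = [I] * n_caches

    def read(self, cpu):
        if self.state[cpu] != I:
            return  # read hit: state unchanged
        others = [c for c in range(len(self.state)) if c != cpu]
        if all(self.state[c] == I for c in others):
            self.state[cpu] = E  # sole copy, fetched from memory
            return
        for c in others:
            if self.state[c] == M:
                self.state[c] = O  # dirty holder supplies data, keeps ownership
            elif self.state[c] == E:
                self.state[c] = S
        self.state[cpu] = S

    def write(self, cpu):
        for c in range(len(self.state)):
            if c != cpu:
                self.state[c] = I  # invalidate every other copy
        self.state[cpu] = M


line = Line(2)
line.read(0)
print(line.state)   # ['E', 'I']
line.write(0)
print(line.state)   # ['M', 'I']
line.read(1)
print(line.state)   # ['O', 'S']
line.write(1)
print(line.state)   # ['I', 'M']
```

The third step shows what the Owned state buys: the dirty line is shared with the second core without first being written back to memory.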

ECE 4100/6100 (35)

Dual Core AMD Opteron

Source: Elec Design ECE 4100/6100 (36)

AMD Opteron (SOI)

Source: Chip Architect ECE 4100/6100 (37)

AMD 64 bit Core

• 1MB L2 Cache
• Detailed discussion of the 64-bit core architecture at:
– http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html

ECE 4100/6100 (38)

Multiprocessor Systems using AMD Opteron

[Figure: block diagrams of the two architectures. Labels include CPU, SRQ,
Crossbar, Mem. Ctrlr, HT, Memory Controller Hub, PCI-E Bridge, I/O Hub, USB,
PCI, and 8 GB/s links.]

Legacy x86 Architecture
• CPUs, memory, and I/O all share a bus
• Major bottleneck to performance
• Faster CPUs or more cores for performance
• Symmetric multiprocessing

AMD64 Direct Connect Architecture
• Eliminates FSB bottleneck
• HyperTransport™ Technology interconnect for high bandwidth and low latency
• Each CPU has its own memory
• Each CPU can access the main memory of another processor, transparent to
the programmer → different from SMP

Source: AMD ECE 4100/6100 (39)

Multiprocessor Systems using AMD Opteron

Source: XBitlabs
ECE 4100/6100 (40)

Cache coherency

Source: Chip Architect ECE 4100/6100 (41)

AMD Athlon 64 X2

Source: AMD ECE 4100/6100 (42)

References and Links

• http://techreport.com/reviews/2005q2/opteron-x75/index.x?pg=1
• http://www.tomshardware.com/2005/06/03/dual_core_stress_test/index.html
• http://www.a1-electronics.net/AMD_Section/CPUs/2005/AMD_Athlon64x2_Apr.shtml
• http://en.wikipedia.org/wiki/Opteron
• http://en.wikipedia.org/wiki/Athlon_64_X2
• http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_8796_14309,00.html
• http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
• http://firingsquad.com/hardware/amd_dual-core_opteron_875/page2.asp
• http://www.xbitlabs.com/articles/cpu/display/opteron-ws_4.html
• http://www.extremetech.com/article2/0,1697,1675784,00.asp
• http://www.elecdesign.com/Articles/Index.cfm?AD=1&ArticleID=11991
• http://www.the3dshop.com/userimages/amd_systems/opteron_dualcore.htm
• http://www.nextcomputing.com/advantages/thruadv.shtml
• http://arstechnica.com/news.ars/post/20060817-7535.html
• http://www.bit-tech.net/hardware/2005/05/09/amd_a64x2_4800/1.html

ECE 4100/6100 (43)

SUN – UltraSPARC Multicore


SUN – UltraSPARC T1

• Eight cores, each 4-way threaded
• 1.2 GHz
• Cache
– 16 KB, 4-way, 32 B lines L1-I
– 8 KB, 4-way, 16 B lines L1-D
– 3 MB on-chip L2 cache partitioned into four banks, with four
memory controllers
– Data moved between the L2 and the cores using an integrated
crossbar switch to provide high throughput
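
A few derived quantities follow directly from the T1 numbers above. The sketch reads each L1 entry as size, associativity, and line size, which is an interpretation of the slide's shorthand; the helper name cache_sets is hypothetical.

```python
# Derived quantities for the UltraSPARC T1 figures above.
# Assumption: each L1 entry lists total size, associativity, and line size.
def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


CORES, THREADS_PER_CORE = 8, 4
hardware_threads = CORES * THREADS_PER_CORE
l1i_sets = cache_sets(16 * 1024, 4, 32)
l1d_sets = cache_sets(8 * 1024, 4, 16)
print("hardware threads:", hardware_threads)  # 32
print("L1-I sets:", l1i_sets)                 # 128
print("L1-D sets:", l1d_sets)                 # 128
```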

Source: Sun ECE 4100/6100 (45)

SUN – UltraSPARC T1

Source: Sun ECE 4100/6100 (46)

SUN – UltraSPARC T1 Pipeline

• T1's integer pipeline
– Fetch, Thread Selection, Decode, Execute, Memory Access,
Writeback
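
The thread selection stage is what distinguishes this pipeline from a conventional single-threaded one. A minimal sketch of such a stage follows, assuming a plain round-robin choice among ready threads; the actual selection policy is not given on the slide, and select_thread is a hypothetical name.

```python
# Toy thread-select stage: each cycle, pick the next ready thread in
# round-robin order, skipping threads stalled on long-latency events.
# Assumption: plain round-robin; the real selection policy is not given here.
def select_thread(ready, last):
    """Return the next ready thread after `last`, or None if all are stalled."""
    n = len(ready)
    for step in range(1, n + 1):
        candidate = (last + step) % n
        if ready[candidate]:
            return candidate
    return None


ready = [True, False, True, True]  # thread 1 is waiting on a cache miss
last = 3
schedule = []
for _ in range(6):
    pick = select_thread(ready, last)
    schedule.append(pick)
    if pick is not None:
        last = pick
print(schedule)  # [0, 2, 3, 0, 2, 3]
```

As long as at least one thread is ready, the pipeline issues an instruction every cycle, which is how memory latency is hidden without out-of-order execution.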

Source: Sun ECE 4100/6100 (47)

SUN UltraSPARC T2 – Niagara 2

Source: Sun ECE 4100/6100 (48)

SUN UltraSPARC T2

• UltraSPARC T2 has 8 SPARC cores with 8 threads per core
• 8-stage integer pipeline (as opposed to 6 for T1)
• Twice the performance of T1 on a transactional workload
(under the same power envelope)
• Clock frequency increased to 1.4 GHz from 1.2 GHz
• One PCI Express port (x8, 1.0)
• Two 10 Gigabit Ethernet ports with packet classification and
filtering
• L2 cache size increased to 4 MB shared (8 banks, 16-way
associative)
• 1 floating-point unit per core
• Eight encryption engines
• Four dual-channel FBDIMM memory controllers
• 711 signal I/O pins, 1831 pins total

ECE 4100/6100 (49)

UltraSparc T2 Core Microarchitecture

Source: Realworld Tech ECE 4100/6100 (50)

UltraSparc T2 Memory System

Source: Sun ECE 4100/6100 (51)

UltraSparc T2 Core Block Diagram

• IFU – Instruction Fetch Unit
– 16 KB I$, 32 B lines, 8-way SA
– 64-entry fully-associative ITLB
• EXU0/1 – Integer Execution Units
– 4 threads share each unit
– Executes one integer instruction/cycle
• LSU – Load/Store Unit
– 8 KB D$, 16 B lines, 4-way SA
– 128-entry fully-associative DTLB
• FGU – Floating/Graphics Unit
• SPU – Stream Processing Unit
– Cryptographic acceleration
• TLU – Trap Logic Unit
– Updates machine state, handles
exceptions and interrupts
• MMU – Memory Management Unit
– Hardware tablewalk (HWTW)
– 8 KB, 64 KB, 4 MB, 256 MB pages

Source: Sun ECE 4100/6100 (52)

UltraSparc T2 Core Pipeline

• 8 stages for integer operations:
– Fetch, Cache, Pick, Decode, Execute, Memory, Bypass,
Writeback
– 3-cycle load-use latency
   – Memory (translation, tag/data access)
   – Bypass (late select, formatting)
• 12 stages for floating-point:
– Fetch, Cache, Pick, Decode, Execute, FX1, FX2, FX3,
FX4, FX5, FB, FW
– 6-cycle latency for dependent FP ops
– Longer pipeline for divide/sqrt

ECE 4100/6100 (53)

References and Links

• http://realworldtech.com/page.cfm?ArticleID=RWT090406012516&p=4
• http://www.opensparc.net/cgi-bin/goto.php?w=/pubs/preszo/06/HotChips06_09_ppt_master.pdf
• http://www.freescale.com/files/netcomm/doc/fact_sheet/MPC8572FS.pdf

ECE 4100/6100 (54)

The Embedded Multicores


Freescale MPC8572 PowerQUICC III Processor

Source: Freescale ECE 4100/6100 (56)

Freescale MPC8572 PowerQUICC III Processor

• Dual embedded e500 cores with 36-bit physical addressing
• Double-precision floating-point
• Integrated L1/L2 cache
– L1 cache: 32 KB data and 32 KB instruction
– Shared L2 cache: 1 MB with ECC
– L2 configurable as SRAM or cache; I/O transactions can
be stashed into L2 cache regions
• Integrated DDR memory controller with full ECC support
• Integrated security engine, Pattern Matching
Engine, Packet Deflate Engine
• Four on-chip triple-speed Ethernet controllers

ECE 4100/6100 (57)

References and Links

• http://www.freescale.com/files/netcomm/doc/fact_sheet/MPC8572FS.pdf

ECE 4100/6100 (58)

Summary

• Multicore technology spans the product spectrum
– The downward migration of leading-edge technology
continues

• Architectural principles are key to
– Developers: extracting performance
– Designers: improving performance
– Marketing: understanding new markets for performance

• Research spans the spectrum of software, security,
reliability, parallelism, virtualization, and much
more!

ECE 4100/6100 (59)
