
ECE 752: Advanced Computer Architecture I 1

ECE/CS 757: Advanced Computer Architecture II
Instructor: Mikko H Lipasti
Spring 2013
University of Wisconsin-Madison
Lecture notes based on slides created by John Shen, Mark Hill, David Wood, Guri Sohi, Jim Smith, Natalie Enright Jerger, Michel Dubois, Murali Annavaram, Per Stenström and probably others
Computer Architecture
Instruction Set Architecture (IBM 360)
"... the attributes of a [computing] system as seen by the programmer. I.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." (Amdahl, Blaauw, & Brooks, 1964)
Machine Organization (microarchitecture)
ALUs, Buses, Caches, Memories, etc.
Machine Implementation (realization)
Gates, cells, transistors, wires
757 In Context
Prior courses
352: gates up to multiplexors and adders
354: high-level language down to machine language interface or instruction set architecture (ISA)
552: implement logic that provides ISA interface
CS 537: provides OS background (co-req. OK)
This course, 757, covers parallel machines
Multiprocessor systems
Data parallel systems
Memory systems that exploit MLP (memory-level parallelism)
Etc.
Additional courses
ECE 752 covers advanced uniprocessor design (not a prereq)
Will review key topics in next lecture
ECE 755 covers VLSI design
ME/ECE 759 covers parallel programming
CS 758 covers special topics (recently parallel programming)
Why Take 757?
To become a computer designer
Alumni of this class helped design your computer
To learn what is "under the hood" of a computer
Innate curiosity
To better understand when things break
To write better code/applications
To write better systems software (O/S, compiler, etc.)
Because it is intellectually fascinating!
Because multicore/parallel systems are ubiquitous
Computer Architecture
Exercise in engineering tradeoff analysis
Find the fastest/cheapest/power-efficient/etc. solution
Optimization problem with 100s of variables
All the variables are changing
At non-uniform rates
With inflection points
Only one guarantee: Today's right answer will be wrong tomorrow
Two high-level effects:
Technology push
Application pull
Trends
Moore's Law for device integration
Chip power consumption
Single-thread performance trend
[source: Intel] Mikko Lipasti - University of Wisconsin
Dynamic Power
Static CMOS: current flows when active
Combinational logic evaluates new inputs
Flip-flop, latch captures new value (clock edge)
Terms
C: capacitance of circuit
wire length, number and size of transistors
V: supply voltage
A: activity factor
f: frequency
Future: Fundamentally power-constrained
$P_{dyn} = k \sum_{i \in \text{units}} C_i V_i^2 A_i f_i$
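The dynamic power relation sums k * C * V^2 * A * f over the units of the chip. A minimal Python sketch of that sum follows; the function name `dynamic_power`, the dictionary layout, and every numeric value are assumptions chosen for illustration, not measured data. It shows why supply voltage is the strongest lever: V enters squared.

```python
def dynamic_power(units, k=1.0):
    """Sum k * C * V^2 * A * f over all units (C, V, A, f as on the slide)."""
    return k * sum(u["C"] * u["V"] ** 2 * u["A"] * u["f"] for u in units)

# Hypothetical single-unit chip: illustrative values only.
base = [{"C": 1e-9, "V": 1.0, "A": 0.2, "f": 2e9}]
# Same unit with the supply voltage lowered from 1.0 to 0.8.
scaled = [{"C": 1e-9, "V": 0.8, "A": 0.2, "f": 2e9}]

ratio = dynamic_power(scaled) / dynamic_power(base)
print(f"power ratio after voltage scaling: {ratio:.2f}")
```

With everything else held constant, a 20% lower supply voltage leaves 0.8^2 = 0.64 of the original dynamic power.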
Multicore Mania
First, servers
IBM Power4, 2001
Then desktops
AMD Athlon X2, 2005
Then laptops
Intel Core Duo, 2006
Now, cellphone & tablet
Qualcomm, Nvidia Tegra, Apple A6, etc.
Why Multicore
                   Single Core   Dual Core   Quad Core
Core area          A             ~A/2        ~A/4
Core power         W             ~W/2        ~W/4
Chip power         W + O         W + O       W + O
Core performance   P             0.9P        0.8P
Chip performance   P             1.8P        3.2P
[Figure: chips with one, two, and four cores]
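Amdahl's Law
f: fraction that can run in parallel
1-f: fraction that must run serially
[Figure: # CPUs (1 and n) versus time, with a serial phase labeled 1-f and a parallel phase labeled f]
$\text{Speedup} = \frac{1}{(1-f) + \frac{f}{n}}$
$\lim_{n \to \infty} \frac{1}{(1-f) + \frac{f}{n}} = \frac{1}{1-f}$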
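Amdahl's Law can be evaluated directly. The sketch below is a plain transcription of the speedup formula and its limit; the function names are illustrative, not from the slides.

```python
def amdahl_speedup(f, n):
    """Speedup on n cores when a fraction f of the work runs in parallel."""
    return 1.0 / ((1.0 - f) + f / n)

def amdahl_limit(f):
    """Upper bound on speedup as n grows without bound."""
    return 1.0 / (1.0 - f)

for n in (2, 8, 64):
    print(n, round(amdahl_speedup(0.9, n), 2))
print("limit:", round(amdahl_limit(0.9), 2))
```

With f = 0.9 the speedup can never exceed 10, no matter how many cores are added, because the serial 10% remains.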
Fixed Chip Power Budget
Amdahl's Law
Ignores (power) cost of n cores
Revised Amdahl's Law
More cores means each core is slower
Parallel speedup < n
Serial portion (1-f) takes longer
Also, interconnect and scaling overhead
[Figure: # CPUs (1 and n) versus time, with a serial phase labeled 1-f and a parallel phase labeled f]
Fixed Power Scaling
Fixed power budget forces slow cores
Serial code quickly dominates
[Figure: chip performance (1 to 128) versus # of cores/chip (1 to 128), with curves for 99.9%, 99%, 90%, and 80% parallel workloads]
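A rough way to see why serial code dominates under a fixed power budget is to slow every core down as the core count grows. The sketch below assumes each of n cores receives 1/n of the power budget and that per-core speed scales as (1/n)^alpha. This power-to-performance model and the default alpha = 0.5 are assumptions made for illustration; the slides do not state which model produced the plotted curves.

```python
def amdahl_speedup(f, n):
    """Classic Amdahl speedup: core speed does not depend on n."""
    return 1.0 / ((1.0 - f) + f / n)

def fixed_power_speedup(f, n, alpha=0.5):
    """Speedup over one full-power core when n cores share its power budget.

    Assumed model (hypothetical): per-core speed = n ** (-alpha).
    """
    core_speed = n ** (-alpha)
    serial_time = (1.0 - f) / core_speed      # serial phase runs on one slow core
    parallel_time = f / (n * core_speed)      # parallel phase uses all n slow cores
    return 1.0 / (serial_time + parallel_time)

for f in (0.999, 0.99, 0.9, 0.8):
    row = [round(fixed_power_speedup(f, n), 2) for n in (1, 4, 16, 64)]
    print(f, row)
```

Under this assumed model an 80%-parallel workload on 128 cores runs slower than on a single full-power core, because the serial 20% executes on a much slower core.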
Challenges
Parallel scaling limits many-core
>4 cores only for well-behaved programs
Optimistic about new applications
Interconnect overhead
Single-thread performance
Will degrade unless we innovate
Parallel programming
Express/extract parallelism in new ways
Retrain programming workforce
Finding Parallelism
1. Functional parallelism
Car: {engine, brakes, entertain, nav, ...}
Game: {physics, logic, UI, render, ...}
2. Automatic extraction
Decompose serial programs
3. Data parallelism
Vector, matrix, db table, pixels, ...
4. Request parallelism
Web, shared database, telephony, ...
Balancing Work
Amdahl's parallel phase f: all cores busy
If not perfectly balanced
(1-f) term grows (f not fully parallel)
Performance scaling suffers
Manageable for data & request parallel apps
Very difficult problem for other two:
Functional parallelism
Automatically extracted
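The cost of imbalance can be made concrete: a parallel phase ends only when its most heavily loaded core finishes. The sketch below uses made-up work units per core, and the helper names are illustrative.

```python
def parallel_phase_time(work_per_core):
    """The phase ends when the most heavily loaded core finishes."""
    return max(work_per_core)

def parallel_efficiency(work_per_core):
    """Fraction of available core-time spent on useful work."""
    total = sum(work_per_core)
    return total / (len(work_per_core) * max(work_per_core))

balanced = [25, 25, 25, 25]   # hypothetical work units per core
skewed = [40, 20, 20, 20]     # same total work, unevenly split

print(parallel_phase_time(balanced), parallel_efficiency(balanced))
print(parallel_phase_time(skewed), parallel_efficiency(skewed))
```

Both splits carry 100 units of work, but the skewed one takes 40 time units instead of 25, so three cores sit idle for part of the phase.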
Coordinating Work
Synchronization
Some data somewhere is shared
Coordinate/order updates and reads
Otherwise, chaos
Traditionally: locks and mutual exclusion
Hard to get right, even harder to tune for perf.
Research to reality: Transactional Memory
Programmer: Declare potential conflict
Hardware and/or software: speculate & check
Commit or roll back and retry
IBM and Intel announced support (soon)
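The traditional lock-based approach can be sketched with Python's standard `threading` module. The shared counter, thread count, and function name below are illustrative only. Every update to the shared value happens inside the lock, so the increments are ordered and none are lost.

```python
import threading

def run_counter(num_threads=4, increments=10_000):
    """Increment one shared counter from several threads under a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:          # mutual exclusion around the read-modify-write
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

total = run_counter()
print(total)
```

The lock also serializes the threads, which is the tuning problem noted above: correctness comes at the price of lost parallelism inside the critical section.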
Single-thread Performance
Still most attractive source of performance
Speeds up parallel and serial phases
Can use it to buy back power
Must focus on power consumption
Performance benefit vs. power cost
Focus of 752; brief review coming up
Focus of this Course
How to minimize these overheads
Interconnect
Synchronization
Cache Coherence
Memory systems
Also
How to write parallel programs (a little)
Non-cache coherent systems (clusters, MPP)
Data-parallel systems
Expected Background
ECE/CS 552 or equivalent
Design simple uniprocessor
Simple instruction sets
Organization
Datapath design
Hardwired/microprogrammed control
Simple pipelining
Basic caches
Some 752 content (optional review)
High-level programming experience
C/UNIX skills: modify simulators
About This Course
Readings
Posted on website later this week
Make sure you keep up with these! Often discussed in depth in lecture, with required participation
Subset of papers must be reviewed in writing, submitted through learn@uw
Lecture
Attendance required, pop quizzes
Homeworks
Not collected, for your benefit only
Develop deeper understanding, prepare for midterms
About This Course
Exams
Midterm 1: Friday 3/1 in class
Midterm 2: Monday 4/8 in class
Keep up with reading list!
Textbook
Dubois, Annavaram, Stenström, Parallel Computer Organization and Design, Cambridge Univ. Press, 2012.
For reference: 4 beta chapters from Jim Smith
Posted on course website
About This Course
Course Project
Research project
Replicate results from a paper
Or attempt something novel
Parallelize/characterize new application
Proposal due 3/22, status report 4/22
Final project includes a written report and an oral presentation
Written reports due 5/14
Presentations during class time 5/6, 5/8, 5/10
About This Course
Grading
Quizzes and Paper Reviews   20%
Midterm 1                   25%
Midterm 2                   25%
Project                     30%
Web Page (check regularly)
http://ece757.ece.wisc.edu
About This Course
Office Hours
Prof. Lipasti: EH 4613, M 9-11, or by appt.
Communication channels
Email to instructor, class email list
ece7571s13@lists.wisc.edu
Web page
http://ece757.ece.wisc.edu
Office hours
About This Course
Other Resources
Computer Architecture Colloquium: Tuesday 4-5 PM, 1325 CSS
Computer Engineering Seminar: Friday 12-1 PM, EH 4610
Architecture mailing list:
http://lists.cs.wisc.edu/mailman/listinfo/architecture
WWW Computer Architecture Page
http://www.cs.wisc.edu/~arch/www
About This Course
Lecture schedule:
MWF 1:00-2:15
Cancel 1 of 3 lectures (on average)
Free up several weeks near end for project work
Tentative Schedule
Week 1        Introduction, 752 review
Week 2        752 review, Multithreading & Multicore
Week 3        MP Software, Memory Systems
Week 4        MP Memory Systems
Week 5        Coherence & Consistency
Week 6        Lecture cancelled, midterm 1 on 3/1
Week 7        Simulation methodology, transactional memory
Week 8        Interconnection networks
Week 9        SIMD, MPP
Week 10       Dataflow, Clusters, GPGPUs
Week 11       Midterm 2
Week 12       No lecture
Week 13       No lecture
Week 14       No lecture
Week 15       Project talks, Course Evaluation
Finals Week   Project reports due 5/14
Brief Introduction to Parallel Computing
Thread-level parallelism
Multiprocessor Systems
Cache Coherence
Snoopy
Scalable
Flynn Taxonomy
UMA vs. NUMA
Thread-level Parallelism
Instruction-level parallelism (752 focus)
Reaps performance by finding independent work in a single thread
Thread-level parallelism
Reaps performance by finding independent work across multiple threads
Historically, requires explicitly parallel workloads
Originates from mainframe time-sharing workloads
Even then, CPU speed >> I/O speed
Had to overlap I/O latency with "something else" for the CPU to do
Hence, operating system would schedule other tasks/processes/threads that were "time-sharing" the CPU
Thread-level Parallelism
Reduces effectiveness of temporal and spatial locality
[Figure: timelines of CPU, disk access, and think time phases. Single user: CPU1, disk access, think time. Time-shared: the CPU phases of threads 1, 2, and 3 are interleaved so that each thread's disk access and think time overlap with other threads' CPU phases.]
Increase in number of active threads reduces effectiveness of spatial locality by increasing working set.
Time dilation of each thread reduces effectiveness of temporal locality.
Thread-level Parallelism
Initially motivated by time-sharing of single CPU
OS, applications written to be multithreaded
Quickly led to adoption of multiple CPUs in a single system
Enabled scalable product line from entry-level single-CPU systems to high-end multiple-CPU systems
Same applications, OS, run seamlessly
Adding CPUs increases throughput (performance)
More recently:
Multiple threads per processor core
Coarse-grained multithreading (aka "switch-on-event")
Fine-grained multithreading
Simultaneous multithreading
Multiple processor cores per die
Chip multiprocessors (CMP)
Multiprocessor Systems
Primary focus on shared-memory symmetric multiprocessors
Many other types of parallel processor systems have been proposed and built
Key attributes are:
Shared memory: all physical memory is accessible to all CPUs
Symmetric processors: all CPUs are alike
Other parallel processors may:
Share some memory, share disks, share nothing
Have asymmetric processing units
Shared memory idealisms
Fully shared memory
Unit latency
Lack of contention
Instantaneous propagation of writes
Motivation
So far: one processor in a system
Why not use N processors?
Higher throughput via parallel jobs
Cost-effective
Adding 3 CPUs may get 4x throughput at only 2x cost
Lower latency from multithreaded applications
Software vendor has done the work for you
E.g. database, web server
Lower latency through parallelized applications
Much harder than it sounds
Where to Connect Processors?
At processor?
Single-instruction multiple-data (SIMD)
At I/O system?
Clusters or multicomputers
At memory system?
Shared memory multiprocessors
Focus on Symmetric Multiprocessors (SMP)
2005 Mikko Lipasti
Connect at Processor (SIMD)
[Figure: a control processor with instruction memory drives multiple datapaths, each with its own data memory, registers, and ALU, joined by an interconnection network]
Connect at Processor
SIMD Assessment
Amortizes cost of control unit over many datapaths
Enables efficient, wide datapaths
Programming model has limited flexibility
Regular control flow, data access patterns
SIMD widely employed today
MMX, SSE, 3DNow! vector extensions
Data elements are 8b multimedia operands
Connect at I/O
Connect with standard network (e.g. Ethernet)
Called a cluster
Adequate bandwidth (GB Ethernet, going to 10GB)
Latency very high
Cheap, but "get what you pay for"
Connect with custom network (e.g. IBM SP1, SP2, SP3)
Sometimes called a multicomputer
Higher cost than cluster
Poorer communication than multiprocessor
Internet data centers built this way
Connect at Memory: Multiprocessors
Shared Memory Multiprocessors
All processors can address all physical memory
Demands evolutionary operating systems changes
Higher throughput with no application changes
Low latency, but requires parallelization with proper synchronization
Most successful: Symmetric MP or SMP
2-64 microprocessors on a bus
Still use cache memories
Cache Coherence Problem
[Figure, two frames: processors P0 and P1 with private caches above a shared memory. Both perform Load A and cache A = 0; a Store A <= 1 then changes one copy to 1, followed by a further Load A.]
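The problem can be reproduced with a toy model of two private write-back caches that have no coherence protocol. The `Cache` class is a hypothetical illustration, and the choice of P0 as the writer is an assumption; the extracted figure does not make clear which processor stores.

```python
class Cache:
    """Private write-back cache with no coherence protocol (toy model)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def load(self, addr):
        if addr not in self.lines:              # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]                 # hit: return the cached copy

    def store(self, addr, value):
        self.lines[addr] = value                # write-back: memory not updated yet

memory = {"A": 0}
p0, p1 = Cache(memory), Cache(memory)

p0.load("A")            # P0 caches A = 0
p1.load("A")            # P1 caches A = 0
p0.store("A", 1)        # P0 writes A <= 1 in its own cache only
stale = p1.load("A")    # P1 hits on its old copy
fresh = p0.load("A")

print("P0 sees", fresh, "but P1 sees", stale)
```

P1 keeps reading 0 after P0 has written 1, and memory also still holds 0. An invalidate protocol prevents this by removing P1's copy when P0 writes.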
Sample Invalidate Protocol (MESI)
[Figure: state-transition diagram over the states M, E, S, and I, with edges labeled by the events LR/S, LR/~S, LW, EV, BR, BW, and BU]
Sample Invalidate Protocol (MESI)
Event and local coherence controller responses and actions (s' refers to next state)
Current State s | Local Read (LR) | Local Write (LW) | Local Eviction (EV) | Bus Read (BR) | Bus Write (BW) | Bus Upgrade (BU)
Invalid (I) | Issue bus read; if no sharers then s' = E else s' = S | Issue bus write; s' = M | s' = I | Do nothing | Do nothing | Do nothing
Shared (S) | Do nothing | Issue bus upgrade; s' = M | s' = I | Respond shared | s' = I | s' = I
Exclusive (E) | Do nothing | s' = M | s' = I | Respond shared; s' = S | s' = I | Error
Modified (M) | Do nothing | Do nothing | Write data back; s' = I | Respond dirty; write data back; s' = S | Respond dirty; write data back; s' = I | Error
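The protocol table can be turned into a small table-driven simulator. This is a sketch, not a hardware description: the names `MESI`, `step`, and `local_access` are illustrative, entries that list no s' are assumed to keep the current state, and the "Error" cells are modeled as exceptions.

```python
# (state, event) -> (actions, next state). ("I", "LR") is handled in step()
# because its next state depends on whether another cache holds the line.
MESI = {
    ("I", "LW"): (["issue BW"], "M"), ("I", "EV"): ([], "I"),
    ("I", "BR"): ([], "I"), ("I", "BW"): ([], "I"), ("I", "BU"): ([], "I"),
    ("S", "LR"): ([], "S"), ("S", "LW"): (["issue BU"], "M"),
    ("S", "EV"): ([], "I"), ("S", "BR"): (["respond shared"], "S"),
    ("S", "BW"): ([], "I"), ("S", "BU"): ([], "I"),
    ("E", "LR"): ([], "E"), ("E", "LW"): ([], "M"), ("E", "EV"): ([], "I"),
    ("E", "BR"): (["respond shared"], "S"), ("E", "BW"): ([], "I"),
    ("M", "LR"): ([], "M"), ("M", "LW"): ([], "M"),
    ("M", "EV"): (["write back"], "I"),
    ("M", "BR"): (["respond dirty", "write back"], "S"),
    ("M", "BW"): (["respond dirty", "write back"], "I"),
}

def step(state, event, sharers=False):
    """Return (actions, next_state) for one cache line in one cache."""
    if (state, event) == ("I", "LR"):
        return ["issue BR"], ("S" if sharers else "E")
    if (state, event) not in MESI:      # the "Error" cells: BU seen in E or M
        raise ValueError(f"protocol error: {event} in state {state}")
    return MESI[(state, event)]

def local_access(states, cpu, event):
    """Apply a local event on `cpu`, then let every other cache snoop the bus."""
    sharers = any(s != "I" for i, s in enumerate(states) if i != cpu)
    actions, states[cpu] = step(states[cpu], event, sharers)
    for action in actions:
        if action.startswith("issue "):
            bus_event = action.split()[1]           # "BR", "BW", or "BU"
            for i in range(len(states)):
                if i != cpu:
                    _, states[i] = step(states[i], bus_event)
    return states

states = ["I", "I"]     # line A in the caches of P0 and P1
trace = []
for cpu, event in [(0, "LR"), (1, "LR"), (0, "LW"), (1, "LR")]:
    local_access(states, cpu, event)
    trace.append("".join(states))
print(trace)
```

The trace follows the line through P0's read (E), P1's read (both S), P0's write (P0 in M, P1 invalidated), and P1's re-read (both S after P0 responds dirty and writes back).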
Snoopy Cache Coherence
All requests broadcast on bus
All processors and memory snoop and respond
Cache blocks writeable at one processor or read-only at several
Single-writer protocol
Snoops that hit dirty lines?
Flush modified data out of cache
Either write back to memory, then satisfy remote miss from memory, or
Provide dirty data directly to requestor
Big problem in MP systems
Dirty/coherence/sharing misses
Scalable Cache Coherence
Eschew physical bus but still snoop
Point-to-point tree structure
Root of tree provides ordering point
Or, use level of indirection through directory
Directory at memory remembers:
Which processor is "single writer"
Forwards requests to it
Which processors are "shared readers"
Forwards write permission requests to them
Level of indirection has a price
Dirty misses require 3 hops instead of two
Snoop: Requestor -> Owner -> Requestor
Directory: Requestor -> Directory -> Owner -> Requestor
Flynn Taxonomy
Flynn (1966)           Single Data   Multiple Data
Single Instruction     SISD          SIMD
Multiple Instruction   MISD          MIMD
MISD
Fault tolerance
Pipeline processing/streaming or systolic array
Now extended to SPMD
Single program multiple data
Memory Organization: UMA vs. NUMA
[Figure: Uniform Memory Access (dancehall): processors with caches on one side of an interconnection network and memories on the other, giving uniform memory latency. Nonuniform Memory Access: each processor and cache has a local memory with short local latency; memory on other nodes is reached through the interconnection network with long remote memory latency.]
Memory Taxonomy
For Shared Memory     Uniform Memory   Nonuniform Memory
Cache Coherence       CC-UMA           CC-NUMA
No Cache Coherence    NCC-UMA          NCC-NUMA
NUMA wins out for practical implementation
Cache coherence favors programmer
Common in general-purpose systems
NCC widespread in scalable systems
CC overhead is too high, not always necessary
Example Commercial Systems
CC-UMA (SMP)
Sun E10000: http://doi.ieeecomputersociety.org/10.1109/40.653032
CC-NUMA
SGI Origin 2000: The SGI Origin: A ccNUMA Highly Scalable Server
NCC-NUMA
Cray T3E: http://www.cs.wisc.edu/~markhill/Misc/asplos96_t3e_comm.pdf
Clusters
ASCI: https://www.llnl.gov/str/Seager.html
Weak Scaling and Gustafson's Law
Gustafson redefines speedup
Workloads grow as more cores become available
Assume that larger workload (e.g. bigger dataset) provides more robust utilization of parallel machine
$T_P = s + p$, $T_1 = s + pP$
Let $F = p/(s+p)$. Then $S_P = (s + pP)/(s + p) = 1 - F + FP = 1 + F(P - 1)$
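Gustafson's scaled speedup can be checked numerically in both of its forms. In the sketch below, s and p are taken to be the serial and parallel execution times measured on the P-processor machine. The function names and sample times are illustrative.

```python
def gustafson_speedup(F, P):
    """Scaled speedup S_P = 1 + F * (P - 1), with F = p / (s + p)."""
    return 1.0 + F * (P - 1)

def scaled_speedup_from_times(s, p, P):
    """Same quantity computed as T_1 / T_P = (s + p * P) / (s + p)."""
    return (s + p * P) / (s + p)

s, p, P = 1.0, 9.0, 10      # hypothetical times on a 10-processor machine
F = p / (s + p)
print(gustafson_speedup(F, P), scaled_speedup_from_times(s, p, P))
```

With F = 0.9 on 10 processors the scaled speedup is about 9.1, whereas Amdahl's fixed-size speedup for f = 0.9 and n = 10 is about 5.3. The gap comes from letting the workload grow with the machine.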
Summary
Thread-level parallelism
Multiprocessor Systems
Cache Coherence
Snoopy
Scalable
Flynn Taxonomy
UMA vs. NUMA
Gustafson's Law vs. Amdahl's Law