Intel® 64 and IA-32 Architectures
Optimization Reference Manual
Order Number: 248966-015
May 2007
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
Intel may make changes to specifications and product descriptions at any time, without notice. Developers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Improper use of reserved or undefined features or instructions may cause unpredictable behavior or failure in developer's software code when running on an Intel processor. Intel reserves these features or instructions for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from their unauthorized use.
The Intel® 64 architecture processors may contain design defects or errors known as errata. Current characterized errata are available on request.
Hyper-Threading Technology requires a computer system with an Intel® processor supporting Hyper-Threading Technology and an HT Technology enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. For more information, including details on which processors support HT Technology, see http://www.intel.com/technology/hyperthread/index.htm.
Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and, for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and software configurations. Intel® Virtualization Technology-enabled BIOS and VMM applications are currently in development.
64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Processors will not operate (including 32-bit operation) without an Intel® 64 architecture-enabled BIOS. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.
Intel, Pentium, Intel Centrino, Intel Centrino Duo, Intel Xeon, Intel NetBurst, Intel Core Solo, Intel Core Duo, Intel Core 2 Duo, Intel Core 2 Extreme, Intel Pentium D, Itanium, Intel SpeedStep, MMX, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an ordering number and are referenced in this document, or other Intel literature, may be obtained from:
Intel Corporation
P.O. Box 5937
Denver, CO 80217-9808
or call 1-800-548-4725
or visit Intel's website at http://www.intel.com
Copyright © 1997-2007 Intel Corporation
CONTENTS
PAGE
CHAPTER 1
INTRODUCTION
1.1 TUNING YOUR APPLICATION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
1.2 ABOUT THIS MANUAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
1.3 RELATED INFORMATION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
CHAPTER 2
INTEL 64 AND IA-32 PROCESSOR ARCHITECTURES
2.1 INTEL® CORE™ MICROARCHITECTURE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-2
2.1.1 Intel Core Microarchitecture Pipeline Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-3
2.1.2 Front End. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-5
2.1.2.1 Branch Prediction Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-6
2.1.2.2 Instruction Fetch Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-6
2.1.2.3 Instruction Queue (IQ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-7
2.1.2.4 Instruction Decode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-8
2.1.2.5 Stack Pointer Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-8
2.1.2.6 Micro-fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-9
2.1.3 Execution Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-9
2.1.3.1 Issue Ports and Execution Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-10
2.1.4 Intel® Advanced Memory Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-13
2.1.4.1 Loads and Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-14
2.1.4.2 Data Prefetch to L1 caches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-14
2.1.4.3 Data Prefetch Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-15
2.1.4.4 Store Forwarding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-16
2.1.4.5 Memory Disambiguation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-16
2.1.5 Intel® Advanced Smart Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-17
2.1.5.1 Loads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-18
2.1.5.2 Stores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19
2.2 INTEL NETBURST MICROARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19
2.2.1 Design Goals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-20
2.2.2 Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-20
2.2.2.1 Front End. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-22
2.2.2.2 Out-of-order Core. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-22
2.2.2.3 Retirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-23
2.2.3 Front End Pipeline Detail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-23
2.2.3.1 Prefetching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-23
2.2.3.2 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-24
2.2.3.3 Execution Trace Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-24
2.2.3.4 Branch Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-24
2.2.4 Execution Core Detail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-25
2.2.4.1 Instruction Latency and Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-26
2.2.4.2 Execution Units and Issue Ports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-26
2.2.4.3 Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-28
2.2.4.4 Data Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-29
2.2.4.5 Loads and Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-31
2.2.4.6 Store Forwarding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-32
2.3 INTEL® PENTIUM® M PROCESSOR MICROARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . 2-32
2.3.1 Front End. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-33
2.3.2 Data Prefetching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-34
2.3.3 Out-of-Order Core. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-35
2.3.4 In-Order Retirement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-35
2.4 MICROARCHITECTURE OF INTEL® CORE™ SOLO AND INTEL® CORE™ DUO PROCESSORS . . 2-36
2.4.1 Front End. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-36
2.4.2 Data Prefetching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-37
2.5 INTEL HYPER-THREADING TECHNOLOGY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-37
2.5.1 Processor Resources and HT Technology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-38
2.5.1.1 Replicated Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-39
2.5.1.2 Partitioned Resources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-39
2.5.1.3 Shared Resources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-39
2.5.2 Microarchitecture Pipeline and HT Technology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-40
2.5.3 Front End Pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-40
2.5.4 Execution Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-40
2.5.5 Retirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-41
2.6 MULTICORE PROCESSORS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-41
2.6.1 Microarchitecture Pipeline and MultiCore Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-43
2.6.2 Shared Cache in Intel Core Duo Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-43
2.6.2.1 Load and Store Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-43
2.7 INTEL® 64 ARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-45
2.8 SIMD TECHNOLOGY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-45
2.8.1 Summary of SIMD Technologies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-48
2.8.1.1 MMX Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-48
2.8.1.2 Streaming SIMD Extensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-48
2.8.1.3 Streaming SIMD Extensions 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-48
2.8.1.4 Streaming SIMD Extensions 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-49
2.8.1.5 Supplemental Streaming SIMD Extensions 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-49
CHAPTER 3
GENERAL OPTIMIZATION GUIDELINES
3.1 PERFORMANCE TOOLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
3.1.1 Intel® C++ and Fortran Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
3.1.2 General Compiler Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
3.1.3 VTune™ Performance Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
3.2 PROCESSOR PERSPECTIVES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3
3.2.1 CPUID Dispatch Strategy and Compatible Code Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
3.2.2 Transparent Cache-Parameter Strategy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
3.2.3 Threading Strategy and Hardware Multithreading Support . . . . . . . . . . . . . . . . . . . . . . . . 3-5
3.3 CODING RULES, SUGGESTIONS AND TUNING HINTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
3.4 OPTIMIZING THE FRONT END . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
3.4.1 Branch Prediction Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
3.4.1.1 Eliminating Branches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
3.4.1.2 Spin-Wait and Idle Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-9
3.4.1.3 Static Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-9
3.4.1.4 Inlining, Calls and Returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
3.4.1.5 Code Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-12
3.4.1.6 Branch Type Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-13
3.4.1.7 Loop Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-15
3.4.1.8 Compiler Support for Branch Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-16
3.4.2 Fetch and Decode Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-17
3.4.2.1 Optimizing for Micro-fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-17
3.4.2.2 Optimizing for Macro-fusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-18
3.4.2.3 Length-Changing Prefixes (LCP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-21
3.4.2.4 Optimizing the Loop Stream Detector (LSD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-23
3.4.2.5 Scheduling Rules for the Pentium 4 Processor Decoder. . . . . . . . . . . . . . . . . . . . . . . .3-23
3.4.2.6 Scheduling Rules for the Pentium M Processor Decoder . . . . . . . . . . . . . . . . . . . . . . .3-24
3.4.2.7 Other Decoding Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-24
3.5 OPTIMIZING THE EXECUTION CORE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-24
3.5.1 Instruction Selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-25
3.5.1.1 Use of the INC and DEC Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-26
3.5.1.2 Integer Divide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-26
3.5.1.3 Using LEA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-26
3.5.1.4 Using SHIFT and ROTATE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-27
3.5.1.5 Address Calculations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-27
3.5.1.6 Clearing Registers and Dependency Breaking Idioms . . . . . . . . . . . . . . . . . . . . . . . . . .3-27
3.5.1.7 Compares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-29
3.5.1.8 Using NOPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-30
3.5.1.9 Mixing SIMD Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-31
3.5.1.10 Spill Scheduling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-31
3.5.2 Avoiding Stalls in Execution Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-32
3.5.2.1 ROB Read Port Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-32
3.5.2.2 Bypass between Execution Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-33
3.5.2.3 Partial Register Stalls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-34
3.5.2.4 Partial XMM Register Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-35
3.5.2.5 Partial Flag Register Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-37
3.5.2.6 Floating Point/SIMD Operands in Intel NetBurst microarchitecture . . . . . . . . . . . . .3-37
3.5.3 Vectorization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-38
3.5.4 Optimization of Partially Vectorizable Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-39
3.5.4.1 Alternate Packing Techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-42
3.5.4.2 Simplifying Result Passing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-42
3.5.4.3 Stack Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-44
3.5.4.4 Tuning Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-44
3.6 OPTIMIZING MEMORY ACCESSES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-46
3.6.1 Load and Store Execution Bandwidth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-46
3.6.2 Enhance Speculative Execution and Memory Disambiguation. . . . . . . . . . . . . . . . . . . . . .3-47
3.6.3 Alignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-48
3.6.4 Store Forwarding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-50
3.6.4.1 Store-to-Load-Forwarding Restriction on Size and Alignment . . . . . . . . . . . . . . . . . .3-51
3.6.4.2 Store-forwarding Restriction on Data Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-55
3.6.5 Data Layout Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-56
3.6.6 Stack Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-59
3.6.7 Capacity Limits and Aliasing in Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-60
3.6.7.1 Capacity Limits in Set-Associative Caches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-60
3.6.7.2 Aliasing Cases in Processors Based on Intel NetBurst Microarchitecture . . . . . . . .3-61
3.6.7.3 Aliasing Cases in the Pentium M, Intel Core Solo, Intel Core Duo and
Intel Core 2 Duo Processors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-62
3.6.8 Mixing Code and Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-63
3.6.8.1 Self-modifying Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-64
3.6.9 Write Combining. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-64
3.6.10 Locality Enhancement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-65
3.6.11 Minimizing Bus Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-67
3.6.12 Non-Temporal Store Bus Traffic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-67
3.7 PREFETCHING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-68
3.7.1 Hardware Instruction Fetching and Software Prefetching . . . . . . . . . . . . . . . . . . . . . . . . 3-69
3.7.2 Software and Hardware Prefetching in Prior Microarchitectures. . . . . . . . . . . . . . . . . . 3-69
3.7.3 Hardware Prefetching for First-Level Data Cache. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-70
3.7.4 Hardware Prefetching for Second-Level Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-73
3.7.5 Cacheability Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-74
3.7.6 REP Prefix and Data Movement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-74
3.8 FLOATING-POINT CONSIDERATIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-77
3.8.1 Guidelines for Optimizing Floating-point Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-77
3.8.2 Floating-point Modes and Exceptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-79
3.8.2.1 Floating-point Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-79
3.8.2.2 Dealing with floating-point exceptions in x87 FPU code. . . . . . . . . . . . . . . . . . . . . . . 3-79
3.8.2.3 Floating-point Exceptions in SSE/SSE2/SSE3 Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-80
3.8.3 Floating-point Modes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-81
3.8.3.1 Rounding Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-81
3.8.3.2 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-83
3.8.3.3 Improving Parallelism and the Use of FXCH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-84
3.8.4 x87 vs. Scalar SIMD Floating-point Trade-offs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-84
3.8.4.1 Scalar SSE/SSE2 Performance on Intel Core Solo and Intel Core Duo Processors 3-85
3.8.4.2 x87 Floating-point Operations with Integer Operands. . . . . . . . . . . . . . . . . . . . . . . . . 3-86
3.8.4.3 x87 Floating-point Comparison Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-86
3.8.4.4 Transcendental Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-86
CHAPTER 4
CODING FOR SIMD ARCHITECTURES
4.1 CHECKING FOR PROCESSOR SUPPORT OF SIMD TECHNOLOGIES . . . . . . . . . . . . . . . . . . . . . . 4-1
4.1.1 Checking for MMX Technology Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
4.1.2 Checking for Streaming SIMD Extensions Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
4.1.3 Checking for Streaming SIMD Extensions 2 Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
4.1.4 Checking for Streaming SIMD Extensions 3 Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
4.1.5 Checking for Supplemental Streaming SIMD Extensions 3 Support . . . . . . . . . . . . . . . . . 4-4
4.2 CONSIDERATIONS FOR CODE CONVERSION TO SIMD PROGRAMMING. . . . . . . . . . . . . . . . . . 4-4
4.2.1 Identifying Hot Spots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-6
4.2.2 Determine If Code Benefits by Conversion to SIMD Execution . . . . . . . . . . . . . . . . . . . . . 4-6
4.3 CODING TECHNIQUES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-7
4.3.1 Coding Methodologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
4.3.1.1 Assembly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-9
4.3.1.2 Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-10
4.3.1.3 Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-11
4.3.1.4 Automatic Vectorization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-12
4.4 STACK AND DATA ALIGNMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-13
4.4.1 Alignment and Contiguity of Data Access Patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-13
4.4.1.1 Using Padding to Align Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-14
4.4.1.2 Using Arrays to Make Data Contiguous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-14
4.4.2 Stack Alignment For 128-bit SIMD Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-15
4.4.3 Data Alignment for MMX Technology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-16
4.4.4 Data Alignment for 128-bit data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-16
4.4.4.1 Compiler-Supported Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-16
4.5 IMPROVING MEMORY UTILIZATION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-18
4.5.1 Data Structure Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-18
4.5.2 Strip-Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-22
4.5.3 Loop Blocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-23
4.6 INSTRUCTION SELECTION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-25
4.6.1 SIMD Optimizations and Microarchitectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-26
4.7 TUNING THE FINAL APPLICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-27
CHAPTER 5
OPTIMIZING FOR SIMD INTEGER APPLICATIONS
5.1 GENERAL RULES ON SIMD INTEGER CODE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1
5.2 USING SIMD INTEGER WITH X87 FLOATING-POINT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
5.2.1 Using the EMMS Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
5.2.2 Guidelines for Using EMMS Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
5.3 DATA ALIGNMENT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
5.4 DATA MOVEMENT CODING TECHNIQUES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-6
5.4.1 Unsigned Unpack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-6
5.4.2 Signed Unpack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-7
5.4.3 Interleaved Pack with Saturation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-8
5.4.4 Interleaved Pack without Saturation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-10
5.4.5 Non-Interleaved Unpack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-10
5.4.6 Extract Word. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-12
5.4.7 Insert Word . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-13
5.4.8 Move Byte Mask to Integer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-14
5.4.9 Packed Shuffle Word for 64-bit Registers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-15
5.4.10 Packed Shuffle Word for 128-bit Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-17
5.4.11 Shuffle Bytes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-18
5.4.12 Unpacking/interleaving 64-bit Data in 128-bit Registers . . . . . . . . . . . . . . . . . . . . . . . . . .5-18
5.4.13 Data Movement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-18
5.4.14 Conversion Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-18
5.5 GENERATING CONSTANTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-19
5.6 BUILDING BLOCKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-19
5.6.1 Absolute Difference of Unsigned Numbers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-20
5.6.2 Absolute Difference of Signed Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-20
5.6.3 Absolute Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-21
5.6.4 Pixel Format Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-21
5.6.5 Endian Conversion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-23
5.6.6 Clipping to an Arbitrary Range [High, Low] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-25
5.6.6.1 Highly Efficient Clipping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-25
5.6.6.2 Clipping to an Arbitrary Unsigned Range [High, Low] . . . . . . . . . . . . . . . . . . . . . . . . . .5-27
5.6.7 Packed Max/Min of Signed Word and Unsigned Byte. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-28
5.6.7.1 Signed Word . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-28
5.6.7.2 Unsigned Byte . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-28
5.6.8 Packed Multiply High Unsigned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-28
5.6.9 Packed Sum of Absolute Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-28
5.6.10 Packed Average (Byte/Word). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-29
5.6.11 Complex Multiply by a Constant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-30
5.6.12 Packed 32*32 Multiply. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-30
5.6.13 Packed 64-bit Add/Subtract. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-30
5.6.14 128-bit Shifts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-31
5.7 MEMORY OPTIMIZATIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-31
5.7.1 Partial Memory Accesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-32
5.7.1.1 Supplemental Techniques for Avoiding Cache Line Splits. . . . . . . . . . . . . . . . . . . . . . .5-34
5.7.2 Increasing Bandwidth of Memory Fills and Video Fills . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-35
5.7.2.1 Increasing Memory Bandwidth Using the MOVDQ Instruction . . . . . . . . . . . . . . . . . . 5-35
5.7.2.2 Increasing Memory Bandwidth by Loading and Storing to and from the
Same DRAM Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-36
5.7.2.3 Increasing UC and WC Store Bandwidth by Using Aligned Stores . . . . . . . . . . . . . . . 5-36
5.8 CONVERTING FROM 64-BIT TO 128-BIT SIMD INTEGERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-36
5.8.1 SIMD Optimizations and Microarchitectures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-37
5.8.1.1 Packed SSE2 Integer versus MMX Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-37
CHAPTER 6
OPTIMIZING FOR SIMD FLOATING-POINT APPLICATIONS
6.1 GENERAL RULES FOR SIMD FLOATING-POINT CODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1
6.2 PLANNING CONSIDERATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1
6.3 USING SIMD FLOATING-POINT WITH X87 FLOATING-POINT. . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6.4 SCALAR FLOATING-POINT CODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6.5 DATA ALIGNMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
6.5.1 Data Arrangement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
6.5.1.1 Vertical versus Horizontal Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
6.5.1.2 Data Swizzling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
6.5.1.3 Data Deswizzling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
6.5.1.4 Using MMX Technology Code for Copy or Shuffling Functions. . . . . . . . . . . . . . . . . . 6-12
6.5.1.5 Horizontal ADD Using SSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-13
6.5.2 Use of CVTTPS2PI/CVTTSS2SI Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-16
6.5.3 Flush-to-Zero and Denormals-are-Zero Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-16
6.6 SIMD OPTIMIZATIONS AND MICROARCHITECTURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-16
6.6.1 SIMD Floating-point Programming Using SSE3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-17
6.6.1.1 SSE3 and Complex Arithmetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-18
6.6.1.2 SSE3 and Horizontal Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-21
6.6.1.3 Packed Floating-Point Performance in Intel Core Duo Processor . . . . . . . . . . . . . . . 6-22
CHAPTER 7
MULTICORE AND HYPER-THREADING TECHNOLOGY
7.1 PERFORMANCE AND USAGE MODELS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-1
7.1.1 Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
7.1.2 Multitasking Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
7.2 PROGRAMMING MODELS AND MULTITHREADING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-4
7.2.1 Parallel Programming Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
7.2.1.1 Domain Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
7.2.2 Functional Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
7.2.3 Specialized Programming Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-6
7.2.3.1 Producer-Consumer Threading Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-7
7.2.4 Tools for Creating Multithreaded Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-10
7.2.4.1 Programming with OpenMP Directives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-10
7.2.4.2 Automatic Parallelization of Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-10
7.2.4.3 Supporting Development Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7.2.4.4 Intel® Thread Checker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7.2.4.5 Thread Profiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7.3 OPTIMIZATION GUIDELINES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7.3.1 Key Practices of Thread Synchronization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
7.3.2 Key Practices of System Bus Optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
7.3.3 Key Practices of Memory Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
7.3.4 Key Practices of Front-end Optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-13
7.3.5 Key Practices of Execution Resource Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-13
7.3.6 Generality and Performance Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-13
7.4 THREAD SYNCHRONIZATION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-14
7.4.1 Choice of Synchronization Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-14
7.4.2 Synchronization for Short Periods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-16
7.4.3 Optimization with Spin-Locks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-18
7.4.4 Synchronization for Longer Periods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-18
7.4.4.1 Avoid Coding Pitfalls in Thread Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-19
7.4.5 Prevent Sharing of Modified Data and False-Sharing. . . . . . . . . . . . . . . . . . . . . . . . . . . .7-21
7.4.6 Placement of Shared Synchronization Variable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-21
7.5 SYSTEM BUS OPTIMIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-23
7.5.1 Conserve Bus Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-23
7.5.2 Understand the Bus and Cache Interactions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-24
7.5.3 Avoid Excessive Software Prefetches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-25
7.5.4 Improve Effective Latency of Cache Misses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-25
7.5.5 Use Full Write Transactions to Achieve Higher Data Rate . . . . . . . . . . . . . . . . . . . . . . .7-26
7.6 MEMORY OPTIMIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-26
7.6.1 Cache Blocking Technique. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-27
7.6.2 Shared-Memory Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-27
7.6.2.1 Minimize Sharing of Data between Physical Processors. . . . . . . . . . . . . . . . . . . . . . .7-27
7.6.2.2 Batched Producer-Consumer Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-28
7.6.3 Eliminate 64-KByte Aliased Data Accesses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-30
7.6.4 Preventing Excessive Evictions in First-Level Data Cache . . . . . . . . . . . . . . . . . . . . . . .7-30
7.6.4.1 Per-thread Stack Offset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-31
7.6.4.2 Per-instance Stack Offset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-32
7.7 FRONT-END OPTIMIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-33
7.7.1 Avoid Excessive Loop Unrolling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-34
7.7.2 Optimization for Code Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-34
7.8 USING THREAD AFFINITIES TO MANAGE SHARED PLATFORM RESOURCES. . . . . . . . 7-34
7.9 OPTIMIZATION OF OTHER SHARED RESOURCES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-41
7.9.1 Using Shared Execution Resources in a Processor Core . . . . . . . . . . . . . . . . . . . . . . . . .7-42
CHAPTER 8
OPTIMIZING CACHE USAGE
8.1 GENERAL PREFETCH CODING GUIDELINES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-1
8.2 HARDWARE PREFETCHING OF DATA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-3
8.3 PREFETCH AND CACHEABILITY INSTRUCTIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-4
8.4 PREFETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-4
8.4.1 Software Data Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-4
8.4.2 Prefetch Instructions – Pentium® 4 Processor Implementation . . . . . . . . . . . . . . . . . . . 8-5
8.4.3 Prefetch and Load Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-6
8.5 CACHEABILITY CONTROL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
8.5.1 The Non-temporal Store Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
8.5.1.1 Fencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
8.5.1.2 Streaming Non-temporal Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
8.5.1.3 Memory Type and Non-temporal Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
8.5.1.4 Write-Combining. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
8.5.2 Streaming Store Usage Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
8.5.2.1 Coherent Requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
8.5.2.2 Non-coherent Requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
8.5.3 Streaming Store Instruction Descriptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-10
8.5.4 FENCE Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-10
8.5.4.1 SFENCE Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-10
8.5.4.2 LFENCE Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-11
8.5.4.3 MFENCE Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-11
8.5.5 CLFLUSH Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-12
8.6 MEMORY OPTIMIZATION USING PREFETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-12
8.6.1 Software-Controlled Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-13
8.6.2 Hardware Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-13
8.6.3 Example of Effective Latency Reduction with Hardware Prefetch . . . . . . . . . . . . . . . 8-14
8.6.4 Example of Latency Hiding with S/W Prefetch Instruction . . . . . . . . . . . . . . . . . . . . . . 8-15
8.6.5 Software Prefetching Usage Checklist . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
8.6.6 Software Prefetch Scheduling Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
8.6.7 Software Prefetch Concatenation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-18
8.6.8 Minimize Number of Software Prefetches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-20
8.6.9 Mix Software Prefetch with Computation Instructions . . . . . . . . . . . . . . . . . . . . . . . . . 8-21
8.6.10 Software Prefetch and Cache Blocking Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-22
8.6.11 Hardware Prefetching and Cache Blocking Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 8-26
8.6.12 Single-pass versus Multi-pass Execution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-27
8.7 MEMORY OPTIMIZATION USING NON-TEMPORAL STORES . . . . . . . . . . . . . . . . . . . . . . . 8-30
8.7.1 Non-temporal Stores and Software Write-Combining. . . . . . . . . . . . . . . . . . . . . . . . . . . 8-30
8.7.2 Cache Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-31
8.7.2.1 Video Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-31
8.7.2.2 Video Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-31
8.7.2.3 Conclusions from Video Encoder and Decoder Implementation. . . . . . . . . . . . . . . . 8-32
8.7.2.4 Optimizing Memory Copy Routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-32
8.7.2.5 TLB Priming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-33
8.7.2.6 Using the 8-byte Streaming Stores and Software Prefetch. . . . . . . . . . . . . . . . . . . 8-34
8.7.2.7 Using 16-byte Streaming Stores and Hardware Prefetch. . . . . . . . . . . . . . . . . . . . . 8-34
8.7.2.8 Performance Comparisons of Memory Copy Routines . . . . . . . . . . . . . . . . . . . . . . . . 8-36
8.7.3 Deterministic Cache Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-37
8.7.3.1 Cache Sharing Using Deterministic Cache Parameters . . . . . . . . . . . . . . . . . . . . . . . . 8-39
8.7.3.2 Cache Sharing in Single-Core or Multicore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-39
8.7.3.3 Determine Prefetch Stride . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-39
CHAPTER 9
64-BIT MODE CODING GUIDELINES
9.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-1
9.2 CODING RULES AFFECTING 64-BIT MODE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-1
9.2.1 Use Legacy 32-Bit Instructions When Data Size Is 32 Bits . . . . . . . . . . . . . . . . . . . . . . . . . 9-1
9.2.2 Use Extra Registers to Reduce Register Pressure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
9.2.3 Use 64-Bit by 64-Bit Multiplies To Produce 128-Bit Results Only When Necessary . 9-2
9.2.4 Sign Extension to Full 64-Bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
9.3 ALTERNATE CODING RULES FOR 64-BIT MODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-3
9.3.1 Use 64-Bit Registers Instead of Two 32-Bit Registers for 64-Bit Arithmetic. . . . . . . . 9-3
9.3.2 CVTSI2SS and CVTSI2SD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-4
9.3.3 Using Software Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
CHAPTER 10
POWER OPTIMIZATION FOR MOBILE USAGES
10.1 OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-1
10.2 MOBILE USAGE SCENARIOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-2
10.3 ACPI C-STATES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-3
10.3.1 Processor-Specific C4 and Deep C4 States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-4
10.4 GUIDELINES FOR EXTENDING BATTERY LIFE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-5
10.4.1 Adjust Performance to Meet Quality of Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-5
10.4.2 Reducing Amount of Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-6
10.4.3 Platform-Level Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-7
10.4.4 Handling Sleep State Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-7
10.4.5 Using Enhanced Intel SpeedStep® Technology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-8
10.4.6 Enabling Intel® Enhanced Deeper Sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-10
10.4.7 Multicore Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-10
10.4.7.1 Enhanced Intel SpeedStep® Technology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-11
10.4.7.2 Thread Migration Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-11
10.4.7.3 Multicore Considerations for C-States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-12
APPENDIX A
APPLICATION PERFORMANCE
TOOLS
A.1 COMPILERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-1
A.1.1 Recommended Optimization Settings for Intel 64 and IA-32 Processors . . . . . . . . . . . . A-2
A.1.2 Vectorization and Loop Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
A.1.2.1 Multithreading with OpenMP* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.1.2.2 Automatic Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.1.3 Inline Expansion of Library Functions (/Oi, /Oi-) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.1.4 Floating-point Arithmetic Precision
(/Op, /Op-, /Qprec, /Qprec_div, /Qpc, /Qlong_double) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.1.5 Rounding Control Option (/Qrcr, /Qrcd) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.1.6 Interprocedural and Profile-Guided Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.1.6.1 Interprocedural Optimization (IPO). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
A.1.6.2 Profile-Guided Optimization (PGO) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
A.1.7 Auto-Generation of Vectorized Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
A.2 INTEL VTUNE PERFORMANCE ANALYZER. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-10
A.2.1 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .A-10
A.2.1.1 Time-based Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .A-11
A.2.1.2 Event-based Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .A-11
A.2.1.3 Workload Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .A-11
A.2.2 Call Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .A-11
A.2.3 Counter Monitor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .A-12
A.3 INTEL® PERFORMANCE LIBRARIES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-12
A.3.1 Benefits Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .A-13
A.3.2 Optimizations with the Intel Performance Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . .A-13
A.4 INTEL THREADING ANALYSIS TOOLS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-14
A.4.1 Intel Thread Checker 3.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .A-14
A.4.2 Intel Thread Profiler 3.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .A-14
A.4.3 Intel Threading Building Blocks 1.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .A-15
A.5 INTEL SOFTWARE COLLEGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-16
APPENDIX B
USING PERFORMANCE MONITORING EVENTS
B.1 PENTIUM 4 PROCESSOR PERFORMANCE METRICS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1
B.1.1 Pentium 4 Processor-Specific Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1
B.1.1.1 Bogus, Non-bogus, Retire. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1
B.1.1.2 Bus Ratio. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-2
B.1.1.3 Replay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-2
B.1.1.4 Assist . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-2
B.1.1.5 Tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-2
B.1.2 Counting Clocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-3
B.1.2.1 Non-Halted Clock Ticks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-4
B.1.2.2 Non-Sleep Clock Ticks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-4
B.1.2.3 Time-Stamp Counter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-5
B.2 METRICS DESCRIPTIONS AND CATEGORIES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-5
B.2.1 Trace Cache Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-30
B.2.2 Bus and Memory Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-30
B.2.2.1 Reads due to program loads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-32
B.2.2.2 Reads due to program writes (RFOs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-32
B.2.2.3 Writebacks (dirty evictions). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-32
B.2.3 Usage Notes for Specific Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-33
B.2.4 Usage Notes on Bus Activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-34
B.3 PERFORMANCE METRICS AND TAGGING MECHANISMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-35
B.3.1 Tags for replay_event. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-35
B.3.2 Tags for front_end_event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-37
B.3.3 Tags for execution_event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-37
B.4 USING PERFORMANCE METRICS WITH HYPER-THREADING TECHNOLOGY . . . . . . . . . . . . B-39
B.5 USING PERFORMANCE EVENTS OF INTEL CORE SOLO AND INTEL CORE DUO
PROCESSORS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-42
B.5.1 Understanding the Results in a Performance Counter . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-42
B.5.2 Ratio Interpretation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-43
B.5.3 Notes on Selected Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-44
B.6 DRILL-DOWN TECHNIQUES FOR PERFORMANCE ANALYSIS . . . . . . . . . . . . . . . . . . . . . . . . . . B-45
B.6.1 Cycle Composition at Issue Port. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-47
B.6.2 Cycle Composition of OOO Execution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-48
B.6.3 Drill-Down on Performance Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-49
B.7 EVENT RATIOS FOR INTEL CORE MICROARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-50
B.7.1 Clocks Per Instructions Retired Ratio (CPI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-51
B.7.2 Front-end Ratios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-51
B.7.2.1 Code Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-51
B.7.2.2 Branching and Front-end . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-52
B.7.2.3 Stack Pointer Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-52
B.7.2.4 Macro-fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-52
B.7.2.5 Length Changing Prefix (LCP) Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-52
B.7.2.6 Self Modifying Code Detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-53
B.7.3 Branch Prediction Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-53
B.7.3.1 Branch Mispredictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-53
B.7.3.2 Virtual Tables and Indirect Calls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-53
B.7.3.3 Mispredicted Returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-54
B.7.4 Execution Ratios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-54
B.7.4.1 Resource Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-54
B.7.4.2 ROB Read Port Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-54
B.7.4.3 Partial Register Stalls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-54
B.7.4.4 Partial Flag Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-55
B.7.4.5 Bypass Between Execution Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-55
B.7.4.6 Floating Point Performance Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-55
B.7.5 Memory Sub-System - Access Conflicts Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-56
B.7.5.1 Loads Blocked by the L1 Data Cache. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-56
B.7.5.2 4K Aliasing and Store Forwarding Block Detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-56
B.7.5.3 Load Block by Preceding Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-56
B.7.5.4 Memory Disambiguation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-57
B.7.5.5 Load Operation Address Translation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-57
B.7.6 Memory Sub-System - Cache Misses Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-57
B.7.6.1 Locating Cache Misses in the Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-57
B.7.6.2 L1 Data Cache Misses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-58
B.7.6.3 L2 Cache Misses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-58
B.7.7 Memory Sub-system - Prefetching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-58
B.7.7.1 L1 Data Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-58
B.7.7.2 L2 Hardware Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-58
B.7.7.3 Software Prefetching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-59
B.7.8 Memory Sub-system - TLB Miss Ratios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-59
B.7.9 Memory Sub-system - Core Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-60
B.7.9.1 Modified Data Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-60
B.7.9.2 Fast Synchronization Penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-60
B.7.9.3 Simultaneous Extensive Stores and Load Misses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-60
B.7.10 Memory Sub-system - Bus Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-61
B.7.10.1 Bus Utilization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-61
B.7.10.2 Modified Cache Lines Eviction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-61
APPENDIX C
INSTRUCTION LATENCY AND THROUGHPUT
C.1 OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1
C.2 DEFINITIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-2
C.3 LATENCY AND THROUGHPUT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-3
C.3.1 Latency and Throughput with Register Operands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-3
C.3.2 Table Footnotes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-25
C.3.3 Latency and Throughput with Memory Operands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-26
APPENDIX D
STACK ALIGNMENT
D.1 STACK FRAMES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-1
D.1.1 Aligned ESP-Based Stack Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-3
D.1.2 Aligned EDP-Based Stack Frames. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-4
D.1.3 Stack Frame Optimizations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-6
D.2 INLINED ASSEMBLY AND EBX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-7
APPENDIX E
SUMMARY OF RULES AND SUGGESTIONS
E.1 ASSEMBLY/COMPILER CODING RULES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E-1
E.2 USER/SOURCE CODING RULES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E-7
E.3 TUNING SUGGESTIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E-11
EXAMPLES
Example 3-1. Assembly Code with an Unpredictable Branch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
Example 3-2. Code Optimization to Eliminate Branches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
Example 3-3. Eliminating Branch with CMOV Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-9
Example 3-4. Use of PAUSE Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-9
Example 3-5. Pentium 4 Processor Static Branch Prediction Algorithm . . . . . . . . . . . . . . . . . . . . . . 3-10
Example 3-6. Static Taken Prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
Example 3-7. Static Not-Taken Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
Example 3-8. Indirect Branch With Two Favored Targets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-14
Example 3-9. A Peeling Technique to Reduce Indirect Branch Misprediction . . . . . . . . . . . . . . . . . 3-15
Example 3-10. Loop Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-16
Example 3-11. Macro-fusion, Unsigned Iteration Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-19
Example 3-12. Macro-fusion, If Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-20
Example 3-13. Macro-fusion, Signed Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-21
Example 3-14. Macro-fusion, Signed Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-21
Example 3-15. Avoiding False LCP Delays with 0xF7 Group Instructions . . . . . . . . . . . . . . . . . . . . . . 3-23
Example 3-16. Clearing Register to Break Dependency While Negating Array Elements. . . . . . . . 3-28
Example 3-17. Spill Scheduling Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-32
Example 3-18. Dependencies Caused by Referencing Partial Registers . . . . . . . . . . . . . . . . . . . . . . . 3-35
Example 3-19. Avoiding Partial Register Stalls in Integer Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-35
Example 3-20. Avoiding Partial Register Stalls in SIMD Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-36
Example 3-21. Avoiding Partial Flag Register Stalls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-37
Example 3-22. Reference Code Template for Partially Vectorizable Program . . . . . . . . . . . . . . . . . 3-41
Example 3-23. Three Alternate Packing Methods for Avoiding Store Forwarding Difficulty . . . . 3-42
Example 3-24. Using Four Registers to Reduce Memory Spills and Simplify Result Passing. . . . . 3-43
Example 3-25. Stack Optimization Technique to Simplify Parameter Passing. . . . . . . . . . . . . . . . . . 3-44
Example 3-26. Base Line Code Sequence to Estimate Loop Overhead . . . . . . . . . . . . . . . . . . . . . . . . 3-45
Example 3-27. Loads Blocked by Stores of Unknown Address. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-47
Example 3-28. Code That Causes Cache Line Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-49
Example 3-29. Situations Showing Small Loads After Large Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-52
Example 3-30. Non-forwarding Example of Large Load After Small Store. . . . . . . . . . . . . . . 3-53
Example 3-31. A Non-forwarding Situation in Compiler Generated Code . . . . . . . . . . . . . . . . 3-53
Example 3-32. Two Ways to Avoid Non-forwarding Situation in Example 3-31. . . . . . . . . . 3-53
Example 3-33. Large and Small Load Stalls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-54
Example 3-34. Loop-carried Dependence Chain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-56
Example 3-35. Rearranging a Data Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-57
Example 3-36. Decomposing an Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-57
Example 3-37. Dynamic Stack Alignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-59
Example 3-38. Aliasing Between Loads and Stores Across Loop Iterations. . . . . . . . . . . . . . . . . . . . 3-63
Example 3-39. Using Non-temporal Stores and 64-byte Bus Write Transactions . . . . . . . . . . . . . . 3-68
Example 3-40. Non-temporal Stores and Partial Bus Write Transactions . . . . . . . . . . . . . . . . 3-68
Example 3-41. Using DCU Hardware Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-71
Example 3-42. Avoid Causing DCU Hardware Prefetch to Fetch Un-needed Lines . . . . . . . . . . . . . 3-72
Example 3-43. Technique For Using L1 Hardware Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-73
Example 3-44. REP STOSD with Arbitrary Count Size and 4-Byte-Aligned Destination . . . . . . . . . 3-76
Example 3-45. Algorithm to Avoid Changing Rounding Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-82
Example 4-1. Identification of MMX Technology with CPUID. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
Example 4-2. Identification of SSE with CPUID. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
Example 4-3. Identification of SSE2 with cpuid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Example 4-4. Identification of SSE3 with CPUID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Example 4-5. Identification of SSSE3 with cpuid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
Example 4-6. Simple Four-Iteration Loop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-9
Example 4-7. Streaming SIMD Extensions Using Inlined Assembly Encoding . . . . . . . . . . . . . . . . . .4-10
Example 4-8. Simple Four-Iteration Loop Coded with Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-11
Example 4-9. C++ Code Using the Vector Classes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-12
Example 4-10. Automatic Vectorization for a Simple Loop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-13
Example 4-11. C Algorithm for 64-bit Data Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-16
Example 4-12. AoS Data Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-19
Example 4-13. SoA Data Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-19
Example 4-14. AoS and SoA Code Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-19
Example 4-15. Hybrid SoA Data Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-21
Example 4-16. Pseudo-code Before Strip Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-22
Example 4-17. Strip Mined Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-23
Example 4-18. Loop Blocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-24
Example 4-19. Emulation of Conditional Moves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-26
Example 5-1. Resetting Register Between __m64 and FP Data Types Code. . . . . . . . . . . . . . . . . . . 5-4
Example 5-2. FIR Processing Example in C language Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
Example 5-3. SSE2 and SSSE3 Implementation of FIR Processing Code . . . . . . . . . . . . . . . . . . . . . . . 5-5
Example 5-4. Zero Extend 16-bit Values into 32 Bits Using Unsigned Unpack
Instructions Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-7
Example 5-5. Signed Unpack Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-7
Example 5-6. Interleaved Pack with Saturation Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-9
Example 5-7. Interleaved Pack without Saturation Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-10
Example 5-8. Unpacking Two Packed-word Sources in Non-interleaved Way Code. . . . . . . . . . . .5-12
Example 5-9. PEXTRW Instruction Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-13
Example 5-10. PINSRW Instruction Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-14
Example 5-11. Repeated PINSRW Instruction Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-14
Example 5-12. PMOVMSKB Instruction Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-15
Example 5-13. PSHUF Instruction Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-16
Example 5-14. Broadcast Code, Using 2 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-17
Example 5-15. Swap Code, Using 3 Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-17
Example 5-16. Reverse Code, Using 3 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-18
Example 5-17. Generating Constants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-19
Example 5-18. Absolute Difference of Two Unsigned Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-20
Example 5-19. Absolute Difference of Signed Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-20
Example 5-20. Computing Absolute Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-21
Example 5-21. Basic C Implementation of RGBA to BGRA Conversion . . . . . . . . . . . . . . . . . . . . . . . . .5-21
Example 5-22. Color Pixel Format Conversion Using SSE2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-22
Example 5-23. Color Pixel Format Conversion Using SSSE3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-23
Example 5-24. Big-Endian to Little-Endian Conversion Using BSWAP . . . . . . . . . . . . . . . . . . . . . . . . . .5-24
Example 5-25. Big-Endian to Little-Endian Conversion Using PSHUFB . . . . . . . . . . . . . . . . . . . . . . . . .5-25
Example 5-26. Clipping to a Signed Range of Words [High, Low] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-26
Example 5-27. Clipping to an Arbitrary Signed Range [High, Low] . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-26
Example 5-28. Simplified Clipping to an Arbitrary Signed Range . . . . . . . . . . . . . . . . . . . . . . . . 5-27
Example 5-29. Clipping to an Arbitrary Unsigned Range [High, Low] . . . . . . . . . . . . . . . . . . . . 5-27
Example 5-30. Complex Multiply by a Constant. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-30
Example 5-31. A Large Load after a Series of Small Stores (Penalty) . . . . . . . . . . . . . . . . . . . . . . . . . 5-32
Example 5-32. Accessing Data Without Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-32
Example 5-33. A Series of Small Loads After a Large Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-33
Example 5-34. Eliminating Delay for a Series of Small Loads after a Large Store . . . . . . . . . . . . . . 5-33
Example 5-35. An Example of Video Processing with Cache Line Splits . . . . . . . . . . . . . . . . . . . . . . . 5-34
Example 5-36. Video Processing Using LDDQU to Avoid Cache Line Splits. . . . . . . . . . . . . . . . . . . . . 5-35
Example 6-1. Pseudocode for Horizontal (xyz, AoS) Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-6
Example 6-2. Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation . . . . . . . . . . . . . . . . . . 6-7
Example 6-3. Swizzling Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
Example 6-4. Swizzling Data Using Intrinsics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
Example 6-5. Deswizzling Single-Precision SIMD Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
Example 6-6. Deswizzling Data Using the movlhps and shuffle Instructions. . . . . . . . . . . . . . . . . . 6-11
Example 6-7. Deswizzling Data 128-bit Integer SIMD Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
Example 6-8. Using MMX Technology Code for Copying or Shuffling. . . . . . . . . . . . . . . . . . . . . . . . . 6-13
Example 6-9. Horizontal Add Using MOVHLPS/MOVLHPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-14
Example 6-10. Horizontal Add Using Intrinsics with MOVHLPS/MOVLHPS. . . . . . . . . . . . . . . . . . . . . 6-15
Example 6-11. Multiplication of Two Pairs of Single-precision Complex Numbers . . . . . . . . 6-18
Example 6-12. Division of Two Pairs of Single-precision Complex Numbers. . . . . . . . . . . . . 6-19
Example 6-13. Double-Precision Complex Multiplication of Two Pairs . . . . . . . . . . . . . . . . . . . . . . . . . 6-20
Example 6-14. Double-Precision Complex Multiplication Using Scalar SSE2. . . . . . . . . . . . . . . . . . . . 6-20
Example 6-15. Dot Product of Vector Length 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-21
Example 6-16. Unrolled Implementation of Four Dot Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-21
Example 8-1. Serial Execution of Producer and Consumer Work Items. . . . . . . . . . . . . . . . . . . . . . . . 7-6
Example 8-2. Basic Structure of Implementing Producer Consumer Threads . . . . . . . . . . . . . . . . . . 7-7
Example 8-3. Thread Function for an Interlaced Producer Consumer Model . . . . . . . . . . . . . . . . . . . 7-9
Example 8-4. Spin-wait Loop and PAUSE Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-17
Example 8-5. Coding Pitfall using Spin Wait Loop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20
Example 8-6. Placement of Synchronization and Regular Variables. . . . . . . . . . . . . . . . . . . . . . . . . . 7-22
Example 8-7. Declaring Synchronization Variables without Sharing a Cache Line. . . . . . . . . . . . . 7-22
Example 8-8. Batched Implementation of the Producer Consumer Threads . . . . . . . . . . . . . . . . . . 7-29
Example 8-9. Adding an Offset to the Stack Pointer of Three Threads . . . . . . . . . . . . . . . . . . . . . . 7-31
Example 8-10. Adding a Pseudo-random Offset to the Stack Pointer in the Entry Function . . . . 7-33
Example 8-11. Assembling 3-level IDs, Affinity Masks for Each Logical Processor . . . . . . . . . . . . . 7-35
Example 8-12. Assembling a Look up Table to Manage Affinity Masks and
Schedule Threads to Each Core First . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-37
Example 8-13. Discovering the Affinity Masks for Sibling Logical Processors
Sharing the Same Cache. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-38
Example 9-1. Pseudo-code Using CLFLUSH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-12
Example 9-2. Populating an Array for Circular Pointer Chasing with Constant Stride . . . . . . . . . 8-14
Example 9-3. Prefetch Scheduling Distance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-18
Example 9-4. Using Prefetch Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-19
Example 9-5. Concatenation and Unrolling the Last Iteration of Inner Loop . . . . . . . . . . . . . . . . . . 8-19
Example 9-6. Data Access of a 3D Geometry Engine without Strip-mining . . . . . . . . . . . . . . . . . . . 8-25
Example 9-7. Data Access of a 3D Geometry Engine with Strip-mining . . . . . . . . . . . . . . . . . . . . . . 8-25
Example 9-8. Using HW Prefetch to Improve Read-Once Memory Traffic. . . . . . . . . . . . . . . . . . . . .8-27
Example 9-9. Basic Algorithm of a Simple Memory Copy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-32
Example 9-10. A Memory Copy Routine Using Software Prefetch. . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-33
Example 9-11. Memory Copy Using Hardware Prefetch and Bus Segmentation. . . . . . . . . . . . . . . .8-35
Example 10-1. Storing Absolute Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
Example 10-2. Auto-Generated Code of Storing Absolutes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
Example 10-3. Changes Signs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
Example 10-4. Auto-Generated Code of Sign Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
Example 10-5. Data Conversion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
Example 10-6. Auto-Generated Code of Data Conversion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-9
Example 10-7. Un-aligned Data Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-9
Example 10-8. Auto-Generated Code to Avoid Unaligned Loads. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-10
Example D-1. Aligned esp-Based Stack Frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-3
Example D-2. Aligned ebp-based Stack Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-5
FIGURES
Figure 2-1. Intel Core Microarchitecture Pipeline Functionality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-4
Figure 2-2. Execution Core of Intel Core Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-12
Figure 2-3. Intel® Advanced Smart Cache Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-17
Figure 2-4. The Intel NetBurst Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-21
Figure 2-5. Execution Units and Ports in Out-Of-Order Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-27
Figure 2-6. The Intel Pentium M Processor Microarchitecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-33
Figure 2-7. Hyper-Threading Technology on an SMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-38
Figure 2-8. Pentium D Processor, Pentium Processor Extreme Edition,
Intel Core Duo Processor, Intel Core 2 Duo Processor, and Intel Core 2 Quad
Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-42
Figure 2-9. Typical SIMD Operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-46
Figure 2-10. SIMD Instruction Register Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-47
Figure 3-1. Generic Program Flow of Partially Vectorized Code . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-40
Figure 3-2. Cache Line Split in Accessing Elements in an Array . . . . . . . . . . . . . . . . . . . . . . . 3-49
Figure 3-3. Size and Alignment Restrictions in Store Forwarding. . . . . . . . . . . . . . . . . . . . . . . . . . 3-51
Figure 4-1. Converting to Streaming SIMD Extensions Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-5
Figure 4-2. Hand-Coded Assembly and High-Level Compiler Performance Trade-offs. . . . . . . . 4-8
Figure 4-3. Loop Blocking Access Pattern. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-25
Figure 5-1. PACKSSDW mm, mm/mm64 Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-8
Figure 5-2. Interleaved Pack with Saturation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-9
Figure 5-3. Result of Non-Interleaved Unpack Low in MM0 . . . . . . . . . . . . . . . . . . . . . . . . . . 5-11
Figure 5-4. Result of Non-Interleaved Unpack High in MM1. . . . . . . . . . . . . . . . . . . . . . . . . . 5-11
Figure 5-5. PEXTRW Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-12
Figure 5-6. PINSRW Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-13
Figure 5-7. PMOVMSKB Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-15
Figure 5-8. PSHUF Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-16
Figure 5-9. PSADBW Instruction Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-29
Figure 6-1. Homogeneous Operation on Parallel Data Elements. . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-4
Figure 6-2. Horizontal Computation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-4
Figure 6-3. Dot Product Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-6
Figure 6-4. Horizontal Add Using MOVHLPS/MOVLHPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-14
Figure 6-5. Asymmetric Arithmetic Operation of the SSE3 Instruction. . . . . . . . . . . . . . . . . . . . . 6-17
Figure 6-6. Horizontal Arithmetic Operation of the SSE3 Instruction HADDPD . . . . . . . . . . . . . 6-18
Figure 8-1. Amdahl's Law and MP Speed-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
Figure 8-2. Single-threaded Execution of Producer-consumer Threading Model . . . . . . . . . . . . . 7-6
Figure 8-3. Execution of Producer-consumer Threading Model
on a Multicore Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-7
Figure 8-4. Interlaced Variation of the Producer Consumer Model . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
Figure 8-5. Batched Approach of Producer Consumer Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-28
Figure 9-1. Effective Latency Reduction as a Function of Access Stride . . . . . . . . . . . . . . . . . . . 8-15
Figure 9-2. Memory Access Latency and Execution Without Prefetch . . . . . . . . . . . . . . . . . . . . . 8-16
Figure 9-3. Memory Access Latency and Execution With Prefetch . . . . . . . . . . . . . . . . . . . . . . . . 8-16
Figure 9-4. Prefetch and Loop Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-20
Figure 9-5. Memory Access Latency and Execution With Prefetch . . . . . . . . . . . . . . . . . . . . . . . . 8-21
Figure 9-6. Spread Prefetch Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-22
Figure 9-7. Cache Blocking Temporally Adjacent and Non-adjacent Passes . . . . . . . . . . . . . . .8-23
Figure 9-8. Examples of Prefetch and Strip-mining for Temporally Adjacent and
Non-Adjacent Passes Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-24
Figure 9-9. Single-Pass Vs. Multi-Pass 3D Geometry Engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-29
Figure 10-1. Performance History and State Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-2
Figure 10-2. Active Time Versus Halted Time of a Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-3
Figure 10-3. Application of C-states to Idle Time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-4
Figure 10-4. Profiles of Coarse Task Scheduling and Power Consumption . . . . . . . . . . . . . . . . . . .10-9
Figure 10-5. Thread Migration in a Multicore Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-12
Figure 10-6. Progression to Deeper Sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-13
Figure A-1. Intel Thread Profiler Showing Critical Paths of Threaded Execution Timelines. . .A-15
Figure B-1. Relationships Between Cache Hierarchy, IOQ, BSQ and FSB . . . . . . . . . . . . . . . . . . . .B-31
Figure B-2. Performance Events Drill-Down and Software Tuning Feedback Loop . . . . . . . . . .B-46
Figure D-1. Stack Frames Based on Alignment Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-2
TABLES
Table 2-1. Components of the Front End . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-5
Table 2-2. Issue Ports of Intel Core Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-11
Table 2-3. Cache Parameters of Processors based on Intel Core Microarchitecture . . . . . . . . 2-18
Table 2-4. Characteristics of Load and Store Operations in Intel Core Microarchitecture . . . 2-18
Table 2-5. Pentium 4 and Intel Xeon Processor Cache Parameters . . . . . . . . . . . . . . . . . . . . . . . 2-28
Table 2-6. Trigger Threshold and CPUID Signatures for Processor Families . . . . . . . . . . . 2-35
Table 2-7. Cache Parameters of Pentium M, Intel Core Solo, and Intel Core Duo Processors 2-35
Table 2-8. Family And Model Designations of Microarchitectures. . . . . . . . . . . . . . . . . . . . . . . . . 2-43
Table 2-9. Characteristics of Load and Store Operations in Intel Core Duo Processors . . . . . 2-44
Table 3-1. Store Forwarding Restrictions of Processors Based on Intel Core
Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-54
Table 5-1. PSHUF Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-16
Table 6-1. SoA Form of Representing Vertices Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
Table 8-1. Properties of Synchronization Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-15
Table 9-1. Software Prefetching Considerations into Strip-mining Code . . . . . . . . . . . . . . . . . . 8-26
Table 9-2. Relative Performance of Memory Copy Routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-36
Table 9-3. Deterministic Cache Parameters Leaf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-38
Table A-1. Recommended IA-32 Processor Optimization Options . . . . . . . . . . . . . . . . . . . . . . . . . . A-2
Table A-2. Recommended Processor Optimization Options for 64-bit Code . . . . . . . . . . . . . . . . A-3
Table A-3. Vectorization Control Switch Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
Table B-1. Performance Metrics - General. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-6
Table B-2. Performance Metrics - Branching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-8
Table B-3. Performance Metrics - Trace Cache and Front End . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-9
Table B-4. Performance Metrics - Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-12
Table B-5. Performance Metrics - Bus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-17
Table B-6. Performance Metrics - Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-27
Table B-7. Performance Metrics - Machine Clear. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-29
Table B-8. Metrics That Utilize Replay Tagging Mechanism. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-36
Table B-9. Metrics That Utilize the Front-end Tagging Mechanism. . . . . . . . . . . . . . . . . . . . . . . . B-37
Table B-10. Metrics That Utilize the Execution Tagging Mechanism. . . . . . . . . . . . . . . . . . . . . . . . B-37
Table B-11. New Metrics for Pentium 4 Processor (Family 15, Model 3). . . . . . . . . . . . . . . . . . . . B-38
Table B-12. Metrics Supporting Qualification by Logical Processor and Parallel Counting . . . . B-40
Table B-13. Metrics Independent of Logical Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-42
Table C-1. Supplemental Streaming SIMD Extension 3 SIMD Instructions . . . . . . . . . . . . . . . . . . .C-4
Table C-2. Streaming SIMD Extension 3 SIMD Floating-point Instructions . . . . . . . . . . . . . . . . . . .C-5
Table C-3. Streaming SIMD Extension 2 128-bit Integer Instructions . . . . . . . . . . . . . . . . . . . . . . .C-5
Table C-4. Streaming SIMD Extension 2 Double-precision Floating-point Instructions . . . . . . C-10
Table C-5. Streaming SIMD Extension Single-precision Floating-point Instructions. . . . . . . . . C-13
Table C-6. Streaming SIMD Extension 64-bit Integer Instructions . . . . . . . . . . . . . . . . . . . . . . . . C-17
Table C-7. MMX Technology 64-bit Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-18
Table C-8. MMX Technology 64-bit Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-19
Table C-9. x87 Floating-point Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-21
Table C-10. General Purpose Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-23
CHAPTER 1
INTRODUCTION
The Intel® 64 and IA-32 Architectures Optimization Reference Manual describes how
to optimize software to take advantage of the performance characteristics of IA-32
and Intel 64 architecture processors. Optimizations described in this manual apply to
processors based on the Intel® Core™ microarchitecture, the Intel NetBurst®
microarchitecture, and the Intel® Core™ Duo, Intel Core Solo, and Pentium® M
processor families.
The target audience for this manual includes software programmers and compiler
writers. This manual assumes that the reader is familiar with the basics of the IA-32
architecture and has access to the Intel® 64 and IA-32 Architectures Software
Developer's Manual (five volumes). A detailed understanding of Intel 64 and IA-32
processors is often required; in many cases, knowledge of the underlying
microarchitectures is required.
The design guidelines that are discussed in this manual for developing high-
performance software generally apply to current as well as to future IA-32 and
Intel 64 processors. The coding rules and code optimization techniques listed target
the Intel Core microarchitecture, the Intel NetBurst microarchitecture and the
Pentium M processor microarchitecture. In most cases, coding rules apply to software
running in 64-bit mode of Intel 64 architecture, compatibility mode of Intel 64
architecture, and IA-32 modes (IA-32 modes are supported in IA-32 and Intel 64
architectures). Coding rules specific to 64-bit modes are noted separately.
1.1 TUNING YOUR APPLICATION
Tuning an application for high performance on any Intel 64 or IA-32 processor
requires understanding and basic skills in:
• Intel 64 and IA-32 architecture
• C and Assembly language
• hot-spot regions in the application that have impact on performance
• optimization capabilities of the compiler
• techniques used to evaluate application performance
The Intel® VTune™ Performance Analyzer can help you analyze and locate hot-spot
regions in your applications. On the Intel® Core™ 2 Duo, Intel Core Duo, Intel Core
Solo, Pentium 4, Intel® Xeon® and Pentium M processors, this tool can monitor an
application through a selection of performance monitoring events and analyze the
performance event data that is gathered during code execution.
This manual also describes information that can be gathered using the performance
counters through the Pentium 4 processor's performance monitoring events.
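As a minimal illustration of one technique for evaluating application performance,
the sketch below times a candidate hot-spot region with the processor's time-stamp
counter, read through the __rdtsc() compiler intrinsic (provided by <x86intrin.h> on
GCC/Clang toolchains; the function work() is a hypothetical stand-in for the region
under study):

    #include <stdio.h>
    #include <x86intrin.h>              /* __rdtsc() on GCC/Clang */

    /* Hypothetical hot-spot candidate; substitute the region being measured. */
    static void work(void)
    {
        volatile double acc = 0.0;
        for (int i = 0; i < 1000000; i++)
            acc += (double)i * 0.5;
    }

    int main(void)
    {
        unsigned long long start = __rdtsc();   /* read time-stamp counter */
        work();
        unsigned long long stop = __rdtsc();
        printf("elapsed: %llu time-stamp counter ticks\n", stop - start);
        return 0;
    }

Because the processor executes instructions out of order and the time-stamp counter
may tick at a rate different from the core clock, such measurements are approximate;
Appendix B discusses counting clocks and the time-stamp counter in more detail.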
1.2 ABOUT THIS MANUAL
In this document, references to the Pentium 4 processor refer to processors based on
the Intel NetBurst® microarchitecture. This includes the Intel Pentium 4 processor and
many Intel Xeon processors based on Intel NetBurst microarchitecture. Where
appropriate, differences are noted (for example, some Intel Xeon processors have
third level cache).
The Intel Xeon processor 5300, 5100, 3000, and 3200 series and Intel Core 2 Duo,
Intel Core 2 Extreme, and Intel Core 2 Quad processors are based on the Intel Core
microarchitecture. In most cases, references to the Intel Core 2 Duo processor also
apply to the Intel Xeon processor 3000 and 5100 series. The Dual-Core Intel® Xeon®
processor LV is based on the same architecture as the Intel Core Duo processor.
The following bullets summarize the chapters in this manual:
• Chapter 1: Introduction. Defines the purpose and outlines the contents of this
manual.
• Chapter 2: Intel® 64 and IA-32 Processor Architectures. Describes the
microarchitecture of recent IA-32 and Intel 64 processor families, and other
features relevant to software optimization.
• Chapter 3: General Optimization Guidelines. Describes general code
development and optimization techniques that apply to all applications designed
to take advantage of the common features of the Intel NetBurst microarchitecture
and Pentium M processor microarchitecture.
• Chapter 4: Coding for SIMD Architectures. Describes techniques and
concepts for using the SIMD integer and SIMD floating-point instructions
provided by the MMX™ technology, Streaming SIMD Extensions, Streaming
SIMD Extensions 2, and Streaming SIMD Extensions 3.
• Chapter 5: Optimizing for SIMD Integer Applications. Provides optimization
suggestions and common building blocks for applications that use the
64-bit and 128-bit SIMD integer instructions.
• Chapter 6: Optimizing for SIMD Floating-point Applications. Provides
optimization suggestions and common building blocks for applications that use
the single-precision and double-precision SIMD floating-point instructions.
• Chapter 7: Optimizing Cache Usage. Describes how to use the PREFETCH
instruction and cache control management instructions to optimize cache usage,
and describes the deterministic cache parameters.
• Chapter 8: Multiprocessor and Hyper-Threading Technology. Describes
guidelines and techniques for optimizing multithreaded applications to achieve
optimal performance scaling. Use these when targeting multicore processors,
processors supporting Hyper-Threading Technology, or multiprocessor (MP)
systems.
• Chapter 9: 64-Bit Mode Coding Guidelines. Describes a set of additional
coding guidelines for application software written to run in 64-bit mode.
• Chapter 10: Power Optimization for Mobile Usages. Provides background
on power saving techniques in mobile processors and makes recommendations
that developers can leverage to provide longer battery life.
• Appendix A: Application Performance Tools. Introduces tools for analyzing
and enhancing application performance without having to write assembly code.
• Appendix B: Using Performance Monitoring Events. Provides information
that can be gathered using the Pentium 4 processor's performance monitoring
events. These performance metrics can help programmers determine how
effectively an application is using the features of the Intel NetBurst
microarchitecture.
• Appendix C: Instruction Latency and Throughput. Provides latency and
throughput data for IA-32 instructions, including instruction timing data specific
to the Pentium 4 and Pentium M processors.
• Appendix D: Stack Alignment. Describes stack alignment conventions and
techniques to optimize performance of accessing stack-based data.
• Appendix E: Summary of Rules and Suggestions. Summarizes the rules and
tuning suggestions referenced in the manual.
1.3 RELATED INFORMATION
For more information on the Intel® architecture, techniques, and the processor architecture terminology, the following are of particular interest:
• Intel® 64 and IA-32 Architectures Software Developer's Manual (in five volumes)
http://developer.intel.com/products/processor/manuals/index.htm
• Intel® Processor Identification with the CPUID Instruction, AP-485
http://www.intel.com/support/processors/sb/cs-009861.htm
• Developing Multi-threaded Applications: A Platform Consistent Approach
http://cache-www.intel.com/cd/00/00/05/15/51534_developing_multithreaded_applications.pdf
• Intel® C++ Compiler documentation and online help
http://www.intel.com/cd/software/products/asmo-na/eng/index.htm
• Intel® Fortran Compiler documentation and online help
http://www.intel.com/cd/software/products/asmo-na/eng/index.htm
• Intel® VTune™ Performance Analyzer documentation and online help
http://www.intel.com/cd/software/products/asmo-na/eng/index.htm
• Using Spin-Loops on Intel Pentium 4 Processor and Intel Xeon Processor MP
http://www3.intel.com/cd/ids/developer/asmo-na/eng/dc/threading/knowledgebase/19083.htm
More relevant links are:
• Software network link:
http://softwarecommunity.intel.com/isn/home/
• Developer centers:
http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/index.htm
• Processor support general link:
http://www.intel.com/support/processors/
• Software products and packages:
http://www.intel.com/cd/software/products/asmo-na/eng/index.htm
• Intel 64 and IA-32 processor manuals (printed or PDF downloads):
http://developer.intel.com/products/processor/manuals/index.htm
• Intel Multi-Core Technology:
http://developer.intel.com/multi-core/index.htm
• Hyper-Threading Technology (HT Technology):
http://developer.intel.com/technology/hyperthread/
CHAPTER 2
INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES
This chapter gives an overview of features relevant to software optimization for current generations of Intel 64 and IA-32 processors (processors based on the Intel Core microarchitecture and Intel NetBurst microarchitecture, including Intel Core Solo, Intel Core Duo, and Intel Pentium M processors). These features are:
• Microarchitectures that enable executing instructions with high throughput at high clock rates, a high-speed cache hierarchy, and a high-speed system bus
• Multicore architecture available in Intel Core 2 Extreme, Intel Core 2 Quad, Intel Core 2 Duo, Intel Core Duo, Intel Pentium D processors, Pentium processor Extreme Edition¹, and Quad-core Intel Xeon and Dual-Core Intel Xeon processors
• Hyper-Threading Technology² (HT Technology) support
• Intel 64 architecture on Intel 64 processors
• SIMD instruction extensions: MMX technology, Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2), Streaming SIMD Extensions 3 (SSE3), and Supplemental Streaming SIMD Extensions 3 (SSSE3)
The Intel Pentium M processor introduced a power-efficient microarchitecture with balanced performance. The Dual-Core Intel Xeon processor LV and the Intel Core Solo and Intel Core Duo processors incorporate an enhanced Pentium M processor microarchitecture. The Intel Core 2, Intel Core 2 Extreme, and Intel Core 2 Quad processor families and the Intel Xeon processor 3000, 3200, 5100, and 5300 series are based on the high-performance and power-efficient Intel Core microarchitecture. The Intel Core 2 Extreme QX6700 processor, Intel Core 2 Quad processors, and Intel Xeon processor 3200 and 5300 series are quad-core processors. Intel Pentium 4 processors, Intel Xeon processors, Pentium D processors, and Pentium processor Extreme Editions are based on the Intel® NetBurst microarchitecture.
1. A quad-core platform requires an Intel Xeon processor 3200 series or 5300 series, an Intel Core 2 Extreme quad-core processor, or an Intel Core 2 Quad processor, with appropriate chipset, BIOS, and operating system. A dual-core platform requires an Intel Xeon processor 3000 series, Intel Xeon processor 5100 series, Intel Core 2 Duo, Intel Core 2 Extreme processor X6800, Dual-Core Intel Xeon processor, Intel Core Duo, Pentium D processor, or Pentium processor Extreme Edition, with appropriate chipset, BIOS, and operating system. Performance varies depending on the hardware and software used.
2. Hyper-Threading Technology requires a computer system with an Intel processor supporting HT Technology and an HT Technology enabled chipset, BIOS, and operating system. Performance varies depending on the hardware and software used.
2.1 INTEL® CORE™ MICROARCHITECTURE
Intel Core microarchitecture introduces the following features that enable high performance and power-efficient performance for single-threaded as well as multi-threaded workloads:
• Intel® Wide Dynamic Execution enables each processor core to fetch, dispatch, and execute with high bandwidths and retire up to four instructions per cycle. Features include:
  - Fourteen-stage efficient pipeline
  - Three arithmetic logical units
  - Four decoders to decode up to five instructions per cycle
  - Macro-fusion and micro-fusion to improve front-end throughput
  - Peak issue rate of dispatching up to six μops per cycle
  - Peak retirement bandwidth of up to four μops per cycle
  - Advanced branch prediction
  - Stack pointer tracker to improve efficiency of executing function/procedure entries and exits
• Intel® Advanced Smart Cache delivers higher bandwidth from the second-level cache to the core, and optimal performance and flexibility for single-threaded and multi-threaded applications. Features include:
  - Optimized for multicore and single-threaded execution environments
  - 256-bit internal data path to improve bandwidth from L2 to the first-level data cache
  - Unified, shared second-level cache of 4 MByte, 16-way (or 2 MByte, 8-way)
• Intel® Smart Memory Access prefetches data from memory in response to data access patterns and reduces cache-miss exposure of out-of-order execution. Features include:
  - Hardware prefetchers to reduce effective latency of second-level cache misses
  - Hardware prefetchers to reduce effective latency of first-level data cache misses
  - Memory disambiguation to improve efficiency of the speculative execution engine
• Intel® Advanced Digital Media Boost improves most 128-bit SIMD instructions with single-cycle throughput and floating-point operations. Features include:
  - Single-cycle throughput of most 128-bit SIMD instructions
  - Up to eight floating-point operations per cycle (illustrated below)
  - Three issue ports available for dispatching SIMD instructions for execution
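As a rough illustration of the peak floating-point rate above, one 128-bit single-precision multiply (four FLOPs) and one 128-bit single-precision add (four FLOPs) can issue in the same cycle on separate ports. The sketch below is illustrative only; the function name is hypothetical.

Example (sketch):

#include <xmmintrin.h>

/* A packed SP multiply and a packed SP add each perform four
   floating-point operations; when they dispatch in the same cycle
   on separate issue ports, the core reaches the peak of eight
   floating-point operations per cycle. */
__m128 mul_add_sketch(__m128 a, __m128 b, __m128 c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);  /* 4 muls + 4 adds */
}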
2.1.1 Intel® Core™ Microarchitecture Pipeline Overview
The pipeline of the Intel Core microarchitecture contains:
• An in-order issue front end that fetches instruction streams from memory, with four instruction decoders to supply decoded instructions (μops) to the out-of-order execution core.
• An out-of-order superscalar execution core that can issue up to six μops per cycle (see Table 2-2) and reorder μops to execute as soon as sources are ready and execution resources are available.
• An in-order retirement unit that ensures the results of execution of μops are processed and architectural states are updated according to the original program order.
Intel Core 2 Extreme processor X6800, Intel Core 2 Duo processors, and Intel Xeon processor 3000 and 5100 series implement two processor cores based on the Intel Core microarchitecture. Intel Core 2 Extreme quad-core processors, Intel Core 2 Quad processors, and Intel Xeon processor 3200 and 5300 series implement four processor cores. Each physical package of these quad-core processors contains two processor dies, each die containing two processor cores. The functionality of the subsystems in each core is depicted in Figure 2-1.
Figure 2-1. Intel Core Microarchitecture Pipeline Functionality
[Figure: block diagram of instruction fetch and predecode, instruction queue, decode (with microcode ROM), rename/alloc, and scheduler issuing to ALU/branch/MMX/SSE/FP-move, ALU/FAdd/MMX/SSE, ALU/FMul/MMX/SSE, load, and store units; retirement unit (re-order buffer); L1D cache and DTLB; shared L2 cache; FSB delivering up to 10.7 GB/s.]
2.1.2 Front End
The front end needs to supply decoded instructions (μops) and sustain the stream to a six-issue wide out-of-order engine. The components of the front end, their functions, and the performance challenges to microarchitectural design are described in Table 2-1.
Table 2-1. Components of the Front End

Branch Prediction Unit (BPU)
  Functions: Helps the instruction fetch unit fetch the most likely instruction to be executed by predicting the various branch types: conditional, indirect, direct, call, and return. Uses dedicated hardware for each type.
  Performance Challenges: Enables speculative execution. Improves speculative execution efficiency by reducing the amount of code in the non-architected path¹ to be fetched into the pipeline.

Instruction Fetch Unit
  Functions: Prefetches instructions that are likely to be executed. Caches frequently-used instructions. Predecodes and buffers instructions, maintaining a constant bandwidth despite irregularities in the instruction stream.
  Performance Challenges: Variable length instruction format causes unevenness (bubbles) in decode bandwidth. Taken branches and misaligned targets cause disruptions in the overall bandwidth delivered by the fetch unit.

Instruction Queue and Decode Unit
  Functions: Decodes up to four instructions, or up to five with macro-fusion. Stack pointer tracker algorithm for efficient procedure entry and exit. Implements the macro-fusion feature, providing higher performance and efficiency. The instruction queue is also used as a loop cache, enabling some loops to be executed with both higher bandwidth and lower power.
  Performance Challenges: Varying amounts of work per instruction requires expansion into variable numbers of μops. Prefixes add a dimension of decoding complexity. Length Changing Prefixes (LCPs) can cause front end bubbles.

NOTES:
1. Code paths that the processor thought it should execute but then found out it should go in another path and therefore reverted from its initial intention.
2.1.2.1 Branch Prediction Unit
Branch prediction enables the processor to begin executing instructions long before the branch outcome is decided. All branches utilize the BPU for prediction. The BPU contains the following features:
• 16-entry Return Stack Buffer (RSB). It enables the BPU to accurately predict RET instructions.
• Front end queuing of BPU lookups. The BPU makes branch predictions for 32 bytes at a time, twice the width of the fetch engine. This enables taken branches to be predicted with no penalty.
Even though this BPU mechanism generally eliminates the penalty for taken branches, software should still regard taken branches as consuming more resources than do not-taken branches.
The BPU makes the following types of predictions:
• Direct Calls and Jumps. Targets are read as a target array, without regarding the taken or not-taken prediction.
• Indirect Calls and Jumps. These may either be predicted as having a monotonic target or as having targets that vary in accordance with recent program behavior.
• Conditional branches. Predicts the branch target and whether or not the branch will be taken.
For information about optimizing software for the BPU, see Section 3.4, "Optimizing the Front End."
2.1.2.2 Instruction Fetch Unit
The instruction fetch unit comprises the instruction translation lookaside buffer (ITLB), an instruction prefetcher, the instruction cache, and the predecode logic of the instruction queue (IQ).
Instruction Cache and ITLB
An instruction fetch is a 16-byte aligned lookup through the ITLB into the instruction cache and instruction prefetch buffers. A hit in the instruction cache causes 16 bytes to be delivered to the instruction predecoder. Typical programs average slightly less than 4 bytes per instruction, depending on the code being executed. Since most instructions can be decoded by all decoders, an entire fetch can often be consumed by the decoders in one cycle.
A misaligned target reduces the number of instruction bytes by the amount of offset into the 16-byte fetch quantity. A taken branch reduces the number of instruction bytes delivered to the decoders, since the bytes after the taken branch are not decoded. Branches are taken approximately every 10 instructions in typical integer code, which translates into a partial instruction fetch every 3 or 4 cycles.
Due to stalls in the rest of the machine, front end starvation does not usually cause performance degradation. For extremely fast code with larger instructions (such as SSE2 integer media kernels), it may be beneficial to use targeted alignment to prevent instruction starvation.
Instruction PreDecode
The predecode unit accepts the sixteen bytes from the instruction cache or prefetch buffers and carries out the following tasks:
• Determine the length of the instructions.
• Decode all prefixes associated with instructions.
• Mark various properties of instructions for the decoders (for example, "is branch").
The predecode unit can write up to six instructions per cycle into the instruction queue. If a fetch contains more than six instructions, the predecoder continues to decode up to six instructions per cycle until all instructions in the fetch are written to the instruction queue. Subsequent fetches can only enter predecoding after the current fetch completes.
For a fetch of seven instructions, the predecoder decodes the first six in one cycle and then only one in the next cycle. This process would support decoding 3.5 instructions per cycle. Even if the instructions-per-cycle (IPC) rate is not fully optimized, it is higher than the performance seen in most applications. In general, software usually does not have to take any extra measures to prevent instruction starvation.
The following instruction prefixes cause problems during length decoding. These prefixes can dynamically change the length of instructions and are known as length changing prefixes (LCPs):
• Operand Size Override (66H) preceding an instruction with word immediate data
• Address Size Override (67H) preceding an instruction with a mod R/M in real, 16-bit protected, or 32-bit protected modes
When the predecoder encounters an LCP in the fetch line, it must use a slower length decoding algorithm. With the slower length decoding algorithm, the predecoder decodes the fetch in 6 cycles, instead of the usual 1 cycle.
Normal queuing within the processor pipeline usually cannot hide LCP penalties.
The REX prefix (4xH) in the Intel 64 architecture instruction set can change the size of two classes of instruction: MOV offset and MOV immediate. Nevertheless, it does not cause an LCP penalty and hence is not considered an LCP.
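The following C fragment is a hedged sketch of how an LCP can arise; actual code generation depends on the compiler.

Example (sketch):

#include <stdint.h>

uint16_t x16;   /* 16-bit global */
uint32_t x32;   /* 32-bit global */

/* Comparing 16-bit data against a constant typically encodes with
   the 66H operand-size override followed by a word immediate, an
   LCP instruction; the 32-bit form uses an imm32 and has no LCP. */
int may_have_lcp(void) { return x16 == 0x1234; }  /* 66H + imm16 */
int no_lcp(void)       { return x32 == 0x1234; }  /* imm32       */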
2.1.2.3 Instruction Queue (IQ)
The instruction queue is 18 instructions deep. It sits between the instruction predecode unit and the instruction decoders. It sends up to five instructions per cycle and
supports one macro-fusion per cycle. It also serves as a loop cache for loops smaller than 18 instructions. The loop cache operates as described below.
A Loop Stream Detector (LSD) resides in the BPU. The LSD attempts to detect loops which are candidates for streaming from the instruction queue (IQ). When such a loop is detected, the instruction bytes are locked down and the loop is allowed to stream from the IQ until a misprediction ends it. When the loop plays back from the IQ, it provides higher bandwidth at reduced power (since much of the rest of the front end pipeline is shut off).
The LSD provides the following benefits:
• No loss of bandwidth due to taken branches
• No loss of bandwidth due to misaligned instructions
• No LCP penalties, as the pre-decode stage has already been passed
• Reduced front end power consumption, because the instruction cache, BPU, and predecode unit can be idle
Software should use the loop cache functionality opportunistically (see the sketch below). Loop unrolling and other code optimizations may make the loop too big to fit into the LSD. For high performance code, loop unrolling is generally preferable for performance even when it overflows the loop cache capability.
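As a hedged sketch, the loop below compiles to only a handful of instructions, so it can fit in the 18-instruction IQ and stream from the loop cache; an aggressively unrolled version of the same loop may exceed that size.

Example (sketch):

/* A tight body (load, add, increment, compare, branch) is an LSD
   candidate; unrolling it several times may push it past the
   18-instruction window and back onto the normal front end path. */
long sum_array(const long *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}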
2.1.2.4 Instruction Decode
The Intel Core microarchitecture contains four instruction decoders. The first, Decoder 0, can decode Intel 64 and IA-32 instructions up to four μops in size. Three other decoders handle single-μop instructions. The microsequencer can provide up to three μops per cycle, and helps decode instructions larger than four μops.
All decoders support the common cases of single-μop flows, including micro-fusion, stack pointer tracking, and macro-fusion. Thus, the three simple decoders are not limited to decoding single-μop instructions. Packing instructions into a 4-1-1-1 template is not necessary and not recommended.
Macro-fusion merges two instructions into a single μop. Intel Core microarchitecture is capable of one macro-fusion per cycle in 32-bit operation (including compatibility sub-mode of the Intel 64 architecture), but not in 64-bit mode, because code that uses longer instructions (length in bytes) more often is less likely to take advantage of hardware support for macro-fusion.
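As a hedged example, compilers commonly emit a CMP or TEST immediately followed by the conditional jump that closes a loop; in 32-bit operation such a pair is a macro-fusion candidate.

Example (sketch):

/* The loop-closing test typically compiles to a CMP/JCC pair; when
   the compare immediately precedes the jump, the decoder can merge
   the pair into a single uop in 32-bit operation. */
int count_below(const int *a, int n, int limit)
{
    int count = 0;
    for (int i = 0; i < n; i++)   /* cmp/jl at the loop bottom */
        if (a[i] < limit)         /* cmp/jge: also a candidate */
            count++;
    return count;
}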
2.1.2.5 Stack Pointer Tracker
The Intel 64 and IA-32 architectures have several commonly used instructions for parameter passing and procedure entry and exit: PUSH, POP, CALL, LEAVE, and RET. These instructions implicitly update the stack pointer register (RSP), maintaining a combined control and parameter stack without software intervention. These instructions are typically implemented by several μops in previous microarchitectures.
The Stack Pointer Tracker moves all these implicit RSP updates to logic contained in the decoders themselves. The feature provides the following benefits:
• Improves decode bandwidth, as PUSH, POP, and RET are single-μop instructions in Intel Core microarchitecture.
• Conserves execution bandwidth, as the RSP updates do not compete for execution resources.
• Improves parallelism in the out-of-order execution engine, as the implicit serial dependencies between μops are removed.
• Improves power efficiency, as the RSP updates are carried out on small, dedicated hardware.
2.1.2.6 Micro-fusion
Micro-fusion fuses multiple μops from the same instruction into a single complex μop. The complex μop is dispatched in the out-of-order execution core. Micro-fusion provides the following performance advantages:
• Improves instruction bandwidth delivered from decode to retirement.
• Reduces power consumption, as the complex μop represents more work in a smaller format (in terms of bit density), reducing overall bit-toggling in the machine for a given amount of work and virtually increasing the amount of storage in the out-of-order execution engine.
Many instructions provide register flavors and memory flavors. The flavor involving a memory operand decodes into a longer flow of μops than the register version. Micro-fusion enables software to use memory-to-register operations to express the actual program behavior without worrying about a loss of decode bandwidth.
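A hedged sketch of the memory flavor: the accumulation below typically compiles to an ADD with a memory operand, whose load and add decode as one micro-fused μop, so the memory form costs no decode bandwidth relative to a separate load followed by an ADD.

Example (sketch):

/* "total += a[i]" usually becomes "add reg, [mem]": the load and
   the add travel as one micro-fused uop from decode to dispatch. */
int accumulate(const int *a, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += a[i];            /* load + add, micro-fused */
    return total;
}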
2.1.3 Execution Core
The execution core of the Intel Core microarchitecture is superscalar and can process instructions out of order. When a dependency chain causes the machine to wait for a resource (such as a second-level data cache line), the execution core executes other instructions. This increases the overall rate of instructions executed per cycle (IPC).
The execution core contains the following three major components:
• Renamer: Moves μops from the front end to the execution core. Architectural registers are renamed to a larger set of microarchitectural registers. Renaming eliminates false dependencies known as read-after-read and write-after-read hazards.
• Reorder buffer (ROB): Holds μops in various stages of completion, buffers completed μops, updates the architectural state in order, and manages ordering of exceptions. The ROB has 96 entries to handle instructions in flight.
• Reservation station (RS): Queues μops until all source operands are ready, then schedules and dispatches ready μops to the available execution units. The RS has 32 entries.
The initial stages of the out-of-order core move the μops from the front end to the ROB and RS. In this process, the out-of-order core carries out the following steps:
• Allocates resources to μops (for example, these resources could be load or store buffers).
• Binds the μop to an appropriate issue port.
• Renames sources and destinations of μops, enabling out-of-order execution.
• Provides data to the μop when the data is either an immediate value or a register value that has already been calculated.
The following list describes various types of common operations and how the core executes them efficiently:
• Micro-ops with single-cycle latency: Most μops with single-cycle latency can be executed by multiple execution units, enabling multiple streams of dependent operations to be executed quickly.
• Frequently-used μops with longer latency: These μops have pipelined execution units so that multiple μops of these types may be executing in different parts of the pipeline simultaneously.
• Operations with data-dependent latencies: Some operations, such as division, have data-dependent latencies. Integer division parses the operands to perform the calculation only on significant portions of the operands, thereby speeding up common cases of dividing by small numbers.
• Floating-point operations with fixed latency for operands that meet certain restrictions: Operands that do not fit these restrictions are considered exceptional cases and are executed with higher latency and reduced throughput. The lower-throughput cases do not affect latency and throughput for more common cases.
• Memory operands with variable latency, even in the case of an L1 cache hit: Loads that are not known to be safe from forwarding may wait until a store address is resolved before executing. The memory order buffer (MOB) accepts and processes all memory operations. See Section 2.1.5 for more information about the MOB.
2.1.3.1 Issue Ports and Execution Units
The scheduler can dispatch up to six μops per cycle through the issue ports depicted in Table 2-2. The table provides latency and throughput data of common integer and floating-point (FP) operations for each issue port in cycles.
Table 2-2. Issue Ports of Intel Core Microarchitecture

Port    Executable Operations         Latency  Throughput  Writeback Port  Comment
Port 0  Integer ALU                   1        1           Writeback 0     Includes 64-bit mode integer MUL.
        Integer SIMD ALU              1        1                           Mixing operations of different
        Single-precision (SP) FP MUL  4        1                           latencies that use the same port
        Double-precision FP MUL       5        2                           can result in writeback bus
        FP MUL (X87)                  5        2                           conflicts; this can reduce
        FP/SIMD/SSE2 Move and Logic   1        1                           overall throughput.
        Shuffle                       1        1                           Excludes QW shuffles.
Port 1  Integer ALU                   1        1           Writeback 1     Excludes 64-bit mode integer MUL.
        Integer SIMD MUL              1        1                           Mixing operations of different
        FP ADD                        3        1                           latencies that use the same port
        FP/SIMD/SSE2 Move and Logic   1        1                           can result in writeback bus
        QW Shuffle                    1        1                           conflicts; this can reduce
                                                                           overall throughput.
Port 2  Integer loads                 3        1           Writeback 2
        FP loads                      4        1
Port 3  Store address                 3        1           None (flags)    Prepares the store forwarding and
                                                                           store retirement logic with the
                                                                           address of the data being stored.
Port 4  Store data                                         None            Prepares the store forwarding and
                                                                           store retirement logic with the
                                                                           data being stored.
Port 5  Integer ALU                   1        1           Writeback 5
        Integer SIMD ALU              1        1
        FP/SIMD/SSE2 Move and Logic   1        1
        QW Shuffle                    1        1
In each cycle, the RS can dispatch up to six μops. Each cycle, up to four results may be written back to the RS and ROB, to be used as early as the next cycle by the RS. This high execution bandwidth enables execution bursts to keep up with the functional expansion of the micro-fused μops that are decoded and retired.
The execution core contains the following three execution stacks:
• SIMD integer
• regular integer
• x87/SIMD floating point
The execution core also contains connections to and from the memory cluster. See Figure 2-2.
Notice that the two dark squares inside the execution block appear in the paths connecting the integer and SIMD integer stacks to the floating-point stack. Moving data between stacks incurs a delay that shows up as an extra cycle, called a bypass delay. Data from the L1 cache has one extra cycle of latency to the floating-point unit. The dark-colored squares in Figure 2-2 represent this extra cycle of latency (a sketch of keeping data within one stack follows the figure).
Figure 2-2. Execution Core of Intel Core Microarchitecture
[Figure: execution core diagram showing the SIMD integer, integer, and floating-point stacks fed by ports 0, 1, and 5; an integer/SIMD MUL unit; load (port 2), store address (port 3), and store data (port 4) paths into the data cache unit, DTLB, and memory ordering/store forwarding logic; dark squares mark the bypass delay between stacks.]
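As a hedged illustration of the bypass delay, the two functions below compute the same 128-bit AND, but one stays in the SIMD integer stack and the other in the floating-point stack; choosing the flavor that matches the surrounding computation avoids crossing stacks and paying the extra cycle.

Example (sketch):

#include <emmintrin.h>

/* _mm_and_si128 executes in the SIMD integer stack, _mm_and_pd in
   the floating-point stack; both compute the same bits. Crossing
   from one stack to the other incurs the one-cycle bypass delay. */
__m128i and_int_domain(__m128i x, __m128i m) { return _mm_and_si128(x, m); }
__m128d and_fp_domain (__m128d x, __m128d m) { return _mm_and_pd(x, m);   }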
2.1.4 Intel® Advanced Memory Access
The Intel Core microarchitecture contains an instruction cache and a first-level data cache in each core. The two cores share a 2- or 4-MByte L2 cache. All caches are writeback and non-inclusive. Each core contains:
• L1 data cache, known as the data cache unit (DCU): The DCU can handle multiple outstanding cache misses and continue to service incoming stores and loads. It supports maintaining cache coherency. The DCU has the following specifications:
  - 32-KByte size
  - 8-way set associative
  - 64-byte line size
• Data translation lookaside buffer (DTLB): The DTLB in Intel Core microarchitecture implements two levels of hierarchy. Each level of the DTLB has multiple entries and can support either 4-KByte pages or large pages. The entries of the inner level (DTLB0) are used for loads. The entries in the outer level (DTLB1) support store operations and loads that missed DTLB0. All entries are 4-way associative. Here is a list of entries in each DTLB:
  - DTLB1 for large pages: 32 entries
  - DTLB1 for 4-KByte pages: 256 entries
  - DTLB0 for large pages: 16 entries
  - DTLB0 for 4-KByte pages: 16 entries
  A DTLB0 miss and DTLB1 hit causes a penalty of 2 cycles. Software only pays this penalty if the DTLB0 is used in some dispatch cases. The delays associated with a miss to the DTLB1 and PMH are largely non-blocking due to the design of Intel Smart Memory Access.
• Page miss handler (PMH)
• A memory ordering buffer (MOB), which:
  - enables loads and stores to issue speculatively and out of order
  - ensures retired loads and stores have the correct data upon retirement
  - ensures loads and stores follow the memory ordering rules of the Intel 64 and IA-32 architectures
The memory cluster of the Intel Core microarchitecture uses the following to speed up memory operations:
• 128-bit load and store operations
• data prefetching to L1 caches
• data prefetch logic for prefetching to the L2 cache
• store forwarding
• memory disambiguation
• 8 fill buffer entries
• 20 store buffer entries
• out-of-order execution of memory operations
• pipelined read-for-ownership operation (RFO)
For information on optimizing software for the memory cluster, see Section 3.6, "Optimizing Memory Accesses."
2.1.4.1 Loads and Stores
The Intel Core microarchitecture can execute up to one 128-bit load and up to one 128-bit store per cycle, each to different memory locations. The microarchitecture enables execution of memory operations out of order with respect to other instructions and with respect to other memory operations.
Loads can:
• issue before preceding stores when the load address and store address are known not to conflict
• be carried out speculatively, before preceding branches are resolved
• take cache misses out of order and in an overlapped manner
• issue before preceding stores, speculating that the store is not going to be to a conflicting address
Loads cannot:
• speculatively take any sort of fault or trap
• speculatively access the uncacheable memory type
Faulting or uncacheable loads are detected and wait until retirement, when they update the programmer-visible state. x87 and floating-point SIMD loads add 1 additional clock of latency.
Stores to memory are executed in two phases:
• Execution phase: Prepares the store buffers with address and data for store forwarding. Consumes dispatch ports, which are ports 3 and 4.
• Completion phase: The store is retired to programmer-visible memory. It may compete for cache banks with executing loads. Store retirement is maintained as a background task by the memory order buffer, moving the data from the store buffers to the L1 cache.
2.1.4.2 Data Prefetch to L1 Caches
Intel Core microarchitecture provides two hardware prefetchers to speed up data accessed by a program by prefetching to the L1 data cache:
• Data cache unit (DCU) prefetcher: This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded
data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
• Instruction pointer (IP)-based strided prefetcher: This prefetcher keeps track of individual load instructions. If a load instruction is detected to have a regular stride, then a prefetch is sent to the next address, which is the sum of the current address and the stride. This prefetcher can prefetch forward or backward and can detect strides of up to half of a 4-KByte page, or 2 KBytes.
Data prefetching works on loads only when the following conditions are met:
• Load is from writeback memory type.
• Prefetch request is within the page boundary of 4 KBytes.
• No fence or lock is in progress in the pipeline.
• Not many other load misses are in progress.
• The bus is not very busy.
• There is not a continuous stream of stores.
DCU prefetching has the following effects:
• Improves performance if data in large structures is arranged sequentially in the order used in the program.
• May cause slight performance degradation due to bandwidth issues if access patterns are sparse instead of local.
• On rare occasions, if the algorithm's working set is tuned to occupy most of the cache and unneeded prefetches evict lines required by the program, the hardware prefetcher may cause severe performance degradation due to the limited capacity of the L1 cache.
In contrast to hardware prefetchers, which rely on hardware to anticipate data traffic, software prefetch instructions rely on the programmer to anticipate cache-miss traffic. Software prefetches act as hints to bring a cache line of data into the desired levels of the cache hierarchy. The software-controlled prefetch is intended for prefetching data, but not for prefetching code.
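A hedged sketch of software prefetching using the _mm_prefetch intrinsic follows; the prefetch distance is an illustrative placeholder, not a recommendation, and would normally be tuned (see Appendix E for the PSD model).

Example (sketch):

#include <xmmintrin.h>

#define PREFETCH_AHEAD 256   /* illustrative: 4 cache lines of 64 bytes */

/* Hint each upcoming cache line into the L1 data cache some distance
   ahead of its use; the right distance depends on the loop body and
   the memory latency being hidden. */
float sum_with_prefetch(const float *a, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        _mm_prefetch((const char *)&a[i] + PREFETCH_AHEAD, _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}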
2.1.4.3 Data Prefetch Logic
Data prefetch logic (DPL) prefetches data to the second-level (L2) cache based on past request patterns of the DCU from the L2. The DPL maintains two independent arrays to store addresses from the DCU: one for upstreams (12 entries) and one for downstreams (4 entries). The DPL tracks accesses to one 4-KByte page in each entry. If an accessed page is not in any of these arrays, then an array entry is allocated.
The DPL monitors DCU reads for incremental sequences of requests, known as streams. Once the DPL detects the second access of a stream, it prefetches the next cache line. For example, when the DCU requests the cache lines A and A+1, the DPL assumes the DCU will need cache line A+2 in the near future. If the DCU then reads
A+2, the DPL prefetches cache line A+3. The DPL works similarly for downward loops.
The Intel Pentium M processor introduced DPL. The Intel Core microarchitecture added the following features to DPL:
• The DPL can detect more complicated streams, such as when the stream skips cache lines. DPL may issue 2 prefetch requests on every L2 lookup. The DPL in the Intel Core microarchitecture can run up to 8 lines ahead of the load request.
• DPL in the Intel Core microarchitecture adjusts dynamically to bus bandwidth and the number of requests. DPL prefetches far ahead if the bus is not busy, and less far ahead if the bus is busy.
• DPL adjusts to various applications and system configurations.
• Entries for the two cores are handled separately.
2.1.4.4 Store Forwarding
If a load follows a store and reloads the data that the store writes to memory, the Intel Core microarchitecture can forward the data directly from the store to the load. This process, called store-to-load forwarding, saves cycles by enabling the load to obtain the data directly from the store operation instead of through memory.
The following rules must be met for store-to-load forwarding to occur:
• The store must be the last store to that address prior to the load.
• The store must be equal or greater in size than the size of data being loaded.
• The load cannot cross a cache line boundary.
• The load cannot cross an 8-byte boundary. 16-byte loads are an exception to this rule.
• The load must be aligned to the start of the store address, except for the following exceptions:
  - An aligned 64-bit store may forward either of its 32-bit halves.
  - An aligned 128-bit store may forward any of its 32-bit quarters.
  - An aligned 128-bit store may forward either of its 64-bit halves.
Software can use the exceptions to the last rule to move complex structures without losing the ability to forward the subfields (see the example below).
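A hedged sketch of the size rule: in the first function the 4-byte store forwards to the 4-byte reload; in the second, an 8-byte load consumes two separate 4-byte stores, so forwarding is blocked and the load waits for the stores to retire. The union layout is illustrative only.

Example (sketch):

#include <stdint.h>

/* Forwarding-friendly: store and reload have the same address and size. */
uint32_t forward_ok(uint32_t *p, uint32_t v)
{
    *p = v;           /* 4-byte store           */
    return *p;        /* 4-byte load: forwarded */
}

typedef union { uint32_t half[2]; uint64_t whole; } pair_t;

/* Forwarding-blocked: the 8-byte load is wider than either preceding
   4-byte store, violating the size rule above. */
uint64_t forward_blocked(pair_t *p, uint32_t lo, uint32_t hi)
{
    p->half[0] = lo;  /* 4-byte store                        */
    p->half[1] = hi;  /* 4-byte store                        */
    return p->whole;  /* 8-byte load: cannot forward, stalls */
}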
2.1.4.5 Memory Disambiguation
A load instruction μop may depend on a preceding store. Many microarchitectures block loads until all preceding store addresses are known.
The memory disambiguator predicts which loads will not depend on any previous stores. When the disambiguator predicts that a load does not have such a dependency, the load takes its data from the L1 data cache.
Eventually, the prediction is verified. If an actual conflict is detected, the load and all succeeding instructions are re-executed.
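As a hedged illustration, the loop below stores through one pointer and loads through another; when the regions do not overlap, the disambiguator can let each load dispatch without waiting for the older store addresses to resolve.

Example (sketch):

/* If dst and src do not alias, memory disambiguation lets the load
   of src[i] proceed speculatively ahead of older stores to dst; a
   detected conflict forces re-execution, so the benefit depends on
   the prediction holding. */
void scale(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];   /* load may pass older stores */
}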
2.1.5 Intel® Advanced Smart Cache
The Intel Core microarchitecture optimized a number of features for two processor cores on a single die. The two cores share a second-level cache and a bus interface unit, collectively known as Intel Advanced Smart Cache. This section describes the components of Intel Advanced Smart Cache. Figure 2-3 illustrates the architecture of the Intel Advanced Smart Cache.
Table 2-3 details the parameters of caches in the Intel Core microarchitecture. For information on enumerating the cache hierarchy identification using the deterministic cache parameter leaf of the CPUID instruction, see the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2A.
Figure 2-3. Intel® Advanced Smart Cache Architecture
[Figure: two cores (each with fetch/decode, execution, retirement, branch prediction, L1 data cache, and L1 instruction cache) sharing an L2 cache and a bus interface unit connected to the system bus.]
2.1.5.1 Loads
When an instruction reads data from a memory location that has write-back (WB) type, the processor looks for the cache line that contains this data in the caches and memory in the following order:
1. DCU of the initiating core
2. DCU of the other core and second-level cache
3. System memory
The cache line is taken from the DCU of the other core only if it is modified, ignoring the cache line availability or state in the L2 cache.
Table 2-4 shows the characteristics of fetching the first four bytes of different localities from the memory cluster. The latency column provides an estimate of access latency. However, the actual latency can vary depending on the load of cache, memory components, and their parameters.
Table 2-3. Cache Parameters of Processors based on Intel Core Microarchitecture

Level          Capacity  Associativity  Line Size  Access Latency  Access Throughput  Write Update
                         (ways)         (bytes)    (clocks)        (clocks)           Policy
First Level    32 KB     8              64         3               1                  Writeback
Instruction    32 KB     8              N/A        N/A             N/A                N/A
Second Level   2, 4 MB   8 or 16        64         14¹             2                  Writeback
(Shared L2)

NOTES:
1. Software-visible latency will vary depending on access patterns and other factors.

Table 2-4. Characteristics of Load and Store Operations in Intel Core Microarchitecture

                       Load                               Store
Data Locality          Latency          Throughput        Latency          Throughput
DCU                    3                1                 2                1
DCU of the other       14 + 5.5 bus     14 + 5.5 bus      14 + 5.5 bus
core in modified       cycles           cycles            cycles
state
2nd-level cache        14               3                 14               3
Memory                 14 + 5.5 bus     Depends on bus    14 + 5.5 bus     Depends on bus
                       cycles + memory  read protocol     cycles + memory  write protocol
Sometimes a modified cache line has to be evicted to make space for a new cache line. The modified cache line is evicted in parallel to bringing in the new data and does not require additional latency. However, when data is written back to memory, the eviction uses cache bandwidth and possibly bus bandwidth as well. Therefore, when multiple cache misses require the eviction of modified lines within a short time, there is an overall degradation in cache response time.
2.1.5.2 Stores
When an instruction writes data to a memory location that has WB memory type, the processor first ensures that the line is in Exclusive or Modified state in its own DCU. The processor looks for the cache line in the following locations, in the specified order:
1. DCU of the initiating core
2. DCU of the other core and L2 cache
3. System memory
The cache line is taken from the DCU of the other core only if it is modified, ignoring the cache line availability or state in the L2 cache. After reading for ownership is completed, the data is written to the first-level data cache and the line is marked as modified.
Reading for ownership and storing the data happen after instruction retirement and follow the order of retirement. Therefore, the store latency does not affect the store instruction itself. However, several sequential stores may have cumulative latency that can affect performance. Table 2-4 presents store latencies depending on the initial cache line location.
2.2 INTEL NETBURST® MICROARCHITECTURE
The Pentium 4 processor, Pentium 4 processor Extreme Edition supporting Hyper-Threading Technology, Pentium D processor, and Pentium processor Extreme Edition implement the Intel NetBurst microarchitecture. Intel Xeon processors that implement Intel NetBurst microarchitecture can be identified using CPUID (family encoding 0FH).
This section describes the features of the Intel NetBurst microarchitecture and its operation common to the above processors. It provides the technical background required to understand optimization recommendations and the coding rules discussed in the rest of this manual. For implementation details, including instruction latencies, see Appendix C, "Instruction Latency and Throughput."
Intel NetBurst microarchitecture is designed to achieve high performance for integer and floating-point computations at high clock rates. It supports the following features:
• hyper-pipelined technology that enables high clock rates
• high-performance, quad-pumped bus interface to the Intel NetBurst microarchitecture system bus
• rapid execution engine to reduce the latency of basic integer instructions
• out-of-order speculative execution to enable parallelism
• superscalar issue to enable parallelism
• hardware register renaming to avoid register name space limitations
• cache line sizes of 64 bytes
• hardware prefetch
2.2.1 Design Goals
The design goals of Intel NetBurst microarchitecture are:
• To execute legacy IA-32 applications and applications based on single-instruction, multiple-data (SIMD) technology at high throughput
• To operate at high clock rates and to scale to higher performance and clock rates in the future
Design advances of the Intel NetBurst microarchitecture include:
• A deeply pipelined design that allows for high clock rates (with different parts of the chip running at different clock rates).
• A pipeline that optimizes for the common case of frequently executed instructions; the most frequently executed instructions in common circumstances (such as a cache hit) are decoded efficiently and executed with short latencies.
• Employment of techniques to hide stall penalties; among these are parallel execution, buffering, and speculation. The microarchitecture executes instructions dynamically and out of order, so the time it takes to execute each individual instruction is not always deterministic.
Chapter 3, "General Optimization Guidelines," lists optimizations to use and situations to avoid. The chapter also gives a sense of relative priority. Because most optimizations are implementation dependent, the chapter does not quantify expected benefits and penalties.
The following sections provide more information about key features of the Intel NetBurst microarchitecture.
2.2.2 Pipeline
The pipeline of the Intel NetBurst microarchitecture contains:
• an in-order issue front end
• an out-of-order superscalar execution core
• an in-order retirement unit
The front end supplies instructions in program order to the out-of-order core. It fetches and decodes instructions. The decoded instructions are translated into μops. The front end's primary job is to feed a continuous stream of μops to the execution core in original program order.
The out-of-order core aggressively reorders μops so that μops whose inputs are ready (and have execution resources available) can execute as soon as possible. The core can issue multiple μops per cycle.
The retirement section ensures that the results of execution are processed according to original program order and that the proper architectural states are updated.
Figure 2-4 illustrates a diagram of the major functional blocks associated with the Intel NetBurst microarchitecture pipeline. The following subsections provide an overview for each.
Figure 2-4. The Intel NetBurst Microarchitecture
[Figure: front end (fetch/decode, execution trace cache, microcode ROM, BTBs/branch prediction) feeding an out-of-order execution core and retirement stage; 1st-level cache (4-way), 2nd-level cache (8-way), optional 3rd-level cache, and bus unit connected to the system bus; frequently used and less frequently used paths and the branch history update are indicated.]
2.2.2.1 Front End
The front end of the Intel NetBurst microarchitecture consists of two parts:
• fetch/decode unit
• execution trace cache
It performs the following functions:
• prefetches instructions that are likely to be executed
• fetches required instructions that have not been prefetched
• decodes instructions into μops
• generates microcode for complex instructions and special-purpose code
• delivers decoded instructions from the execution trace cache
• predicts branches using advanced algorithms
The front end is designed to address two problems that are sources of delay:
• time required to decode instructions fetched from the target
• wasted decode bandwidth due to branches or a branch target in the middle of a cache line
Instructions are fetched and decoded by a translation engine. The translation engine then builds decoded instructions into μop sequences called traces, which are stored in the execution trace cache.
The execution trace cache stores μops in the path of program execution flow, where the results of branches in the code are integrated into the same cache line. This increases the instruction flow from the cache and makes better use of the overall cache storage space, since the cache no longer stores instructions that are branched over and never executed.
The trace cache can deliver up to 3 μops per clock to the core.
The execution trace cache and the translation engine have cooperating branch prediction hardware. Branch targets are predicted based on their linear address using branch prediction logic and fetched as soon as possible. Branch targets are fetched from the execution trace cache if they are cached; otherwise, they are fetched from the memory hierarchy. The translation engine's branch prediction information is used to form traces along the most likely paths.
2.2.2.2 Out-of-order Core
The core's ability to execute instructions out of order is a key factor in enabling parallelism. This feature enables the processor to reorder instructions so that if one μop is delayed while waiting for data or a contended resource, other μops that appear later in the program order may proceed. This implies that when one portion of the pipeline experiences a delay, the delay may be covered by other operations executing in parallel or by the execution of μops queued up in a buffer.
The core is designed to facilitate parallel execution. It can dispatch up to six μops per cycle through the issue ports (Figure 2-5). Note that six μops per cycle exceeds the trace cache and retirement μop bandwidth. The higher bandwidth in the core allows for peak bursts of greater than three μops and to achieve higher issue rates by allowing greater flexibility in issuing μops to different execution ports.
Most core execution units can start executing a new μop every cycle, so several instructions can be in flight at one time in each pipeline. A number of arithmetic logical unit (ALU) instructions can start at two per cycle; many floating-point instructions start one every two cycles. Finally, μops can begin execution out of program order, as soon as their data inputs are ready and resources are available.
2.2.2.3 Retirement
The retirement section receives the results of the executed μops from the execution core and processes the results so that the architectural state is updated according to the original program order. For semantically correct execution, the results of Intel 64 and IA-32 instructions must be committed in original program order before they are retired. Exceptions may be raised as instructions are retired. For this reason, exceptions cannot occur speculatively.
When a μop completes and writes its result to the destination, it is retired. Up to three μops may be retired per cycle. The reorder buffer (ROB) is the unit in the processor which buffers completed μops, updates the architectural state, and manages the ordering of exceptions.
The retirement section also keeps track of branches and sends updated branch target information to the branch target buffer (BTB). This updates branch history.
Figure 2-9 illustrates the paths that are most frequently executing inside the Intel NetBurst microarchitecture: an execution loop that interacts with the multilevel cache hierarchy and the system bus.
The following sections describe in more detail the operation of the front end and the execution core. This information provides the background for using the optimization techniques and instruction latency data documented in this manual.
2.2.3 Front End Pipeline Detail
The following information about the front end operation may be useful for tuning software with respect to prefetching, branch prediction, and execution trace cache operations.
2.2.3.1 Prefetching
The Intel NetBurst microarchitecture supports three prefetching mechanisms:
• a hardware instruction fetcher that automatically prefetches instructions
• a hardware mechanism that automatically fetches data and instructions into the unified second-level cache
• a mechanism that fetches data only and includes two distinct components: (1) a hardware mechanism to fetch the adjacent cache line within a 128-byte sector that contains the data needed due to a cache line miss (this is also referred to as adjacent cache line prefetch); (2) a software-controlled mechanism that fetches data into the caches using the prefetch instructions
The hardware instruction fetcher reads instructions along the path predicted by the branch target buffer (BTB) into instruction streaming buffers. Data is read in 32-byte chunks starting at the target address. The second and third mechanisms are described later.
2.2.3.2 Decoder
The front end of the Intel NetBurst microarchitecture has a single decoder that decodes instructions at the maximum rate of one instruction per clock. Some complex instructions must enlist the help of the microcode ROM. The decoder operation is connected to the execution trace cache.
2.2.3.3 Execution Trace Cache
The execution trace cache (TC) is the primary instruction cache in the Intel NetBurst microarchitecture. The TC stores decoded instructions (μops).
In the Pentium 4 processor implementation, the TC can hold up to 12K μops and can deliver up to three μops per cycle. The TC does not hold all of the μops that need to be executed in the execution core. In some situations, the execution core may need to execute a microcode flow instead of the μop traces that are stored in the trace cache.
The Pentium 4 processor is optimized so that most frequently executed instructions come from the trace cache while only a few instructions involve the microcode ROM.
2.2.3.4 Branch Prediction
Branch prediction is important to the performance of a deeply pipelined processor. It enables the processor to begin executing instructions long before the branch outcome is certain. Branch delay is the penalty that is incurred in the absence of correct prediction. For Pentium 4 and Intel Xeon processors, the branch delay for a correctly predicted instruction can be as few as zero clock cycles. The branch delay for a mispredicted branch can be many cycles, usually equivalent to the pipeline depth.
Branch prediction in the Intel NetBurst microarchitecture predicts near branches (conditional branches, unconditional calls, returns, and indirect branches). It does not predict far transfers (far calls, irets, and software interrupts).
Mechanisms have been implemented to aid in predicting branches accurately and to reduce the cost of taken branches. These include:
• ability to dynamically predict the direction and target of branches based on an instruction's linear address, using the branch target buffer (BTB)
• if no dynamic prediction is available or if it is invalid, the ability to statically predict the outcome based on the offset of the target: a backward branch is predicted to be taken, a forward branch is predicted to be not taken
• ability to predict return addresses using the 16-entry return address stack
• ability to build a trace of instructions across predicted taken branches to avoid branch penalties
The Static Predictor. Once a branch instruction is decoded, the direction of the branch (forward or backward) is known. If there was no valid entry in the BTB for the branch, the static predictor makes a prediction based on the direction of the branch. The static prediction mechanism predicts backward conditional branches (those with negative displacement, such as loop-closing branches) as taken. Forward branches are predicted not taken.
To take advantage of the forward-not-taken and backward-taken static predictions, code should be arranged so that the likely target of the branch immediately follows forward branches (see also Section 3.4.1, "Branch Prediction Optimization," and the sketch below).
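A hedged sketch of such an arrangement: keep the common case on the fall-through path of a forward branch (statically predicted not taken) and let the loop-closing backward branch be taken. The error-handling helper below is hypothetical.

Example (sketch):

/* The rare error test is a forward branch with the common case on
   the fall-through path, matching the forward-not-taken static
   prediction; the backward loop branch is statically predicted
   taken. handle_error() is a hypothetical helper. */
extern void handle_error(int index);

long checksum(const unsigned char *buf, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {    /* backward branch: taken   */
        if (buf[i] == 0xFF) {        /* rare: forward, not taken */
            handle_error(i);
            return -1;
        }
        sum += buf[i];
    }
    return sum;
}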
Branch Target Buffer. Once branch history is available, the Pentium 4 processor can predict the branch outcome even before the branch instruction is decoded. The processor uses a branch history table and a branch target buffer (collectively called the BTB) to predict the direction and target of branches based on an instruction's linear address. Once the branch is retired, the BTB is updated with the target address.
Return Stack. Returns are always taken; but since a procedure may be invoked from several call sites, a single predicted target does not suffice. The Pentium 4 processor has a Return Stack that can predict return addresses for a series of procedure calls. This increases the benefit of unrolling loops containing function calls. It also mitigates the need to put certain procedures inline, since the return penalty portion of the procedure call overhead is reduced.
Even if the direction and target address of the branch are correctly predicted, a taken branch may reduce available parallelism in a typical processor (since the decode bandwidth is wasted for instructions which immediately follow the branch and precede the target, if the branch does not end the line and the target does not begin the line). The branch predictor allows a branch and its target to coexist in a single trace cache line, maximizing instruction delivery from the front end.
2.2.4 Execution Core Detail
The execution core is designed to optimize overall performance by handling common cases most efficiently. The hardware is designed to execute frequent operations in a
common context as fast as possible, at the expense of infrequent operations using rare contexts.
Some parts of the core may speculate that a common condition holds to allow faster execution. If it does not, the machine may stall. An example of this pertains to store-to-load forwarding (see "Store Forwarding" in this chapter). If a load is predicted to be dependent on a store, it gets its data from that store and tentatively proceeds. If the load turns out not to depend on the store, the load is delayed until the real data has been loaded from memory, then it proceeds.
2.2.4.1 Instruction Latency and Throughput
The superscalar out-of-order core contains hardware resources that can execute
multiple µops in parallel. The core's ability to make use of available parallelism of
execution units can be enhanced by software's ability to:
• Select instructions that can be decoded in less than 4 µops and/or have short
latencies
• Order instructions to preserve available parallelism by minimizing long
dependence chains and covering long instruction latencies (see the sketch after
this list)
• Order instructions so that their operands are ready and their corresponding issue
ports and execution units are free when they reach the scheduler
This subsection describes port restrictions, result latencies, and issue latencies (also
referred to as throughput). These concepts form the basis to assist software in
ordering instructions to increase parallelism. The order in which µops are presented
to the core of the processor is further affected by the machine's scheduling
resources. It is the execution core that reacts to an ever-changing machine state,
reordering µops for faster execution or delaying them because of dependence and
resource constraints. The ordering of instructions in software is more of a suggestion
to the hardware.
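As an illustration of minimizing long dependence chains, the following C sketch
(illustrative only, not from this manual) splits a serial summation into two
independent accumulators so that consecutive additions no longer wait on one
another; note that reassociating floating-point additions this way can change
rounding:

    double sum_array(const double *a, int n)
    {
        double s0 = 0.0, s1 = 0.0;     /* two independent dependence chains */
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            s0 += a[i];                /* chain 0 */
            s1 += a[i + 1];            /* chain 1: does not depend on chain 0 */
        }
        if (i < n)
            s0 += a[i];                /* odd remainder */
        return s0 + s1;                /* the chains join only once, at the end */
    }

With a single accumulator, each addition must wait for the previous one; with two
chains, the scheduler can keep an adder busy on alternate iterations and cover part
of the addition latency.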
Appendix C, "Instruction Latency and Throughput," lists some of the more
commonly used Intel 64 and IA-32 instructions with their latency, their issue
throughput, and associated execution units (where relevant). Some execution units
are not pipelined (meaning that µops cannot be dispatched in consecutive cycles and
the throughput is less than one per cycle). The number of µops associated with each
instruction provides a basis for selecting instructions to generate. All µops executed
out of the microcode ROM involve extra overhead.
2.2.4.2 Execution Units and Issue Ports
At each cycle, the core may dispatch µops to one or more of four issue ports. At the
microarchitecture level, store operations are further divided into two parts: store
data and store address operations. The four ports through which µops are dispatched
to execution units and to load and store operations are shown in Figure 2-5. Some
ports can dispatch two µops per clock. Those execution units are marked "Double
Speed."
2-27
INTEL 64 AND IA-32 PROCESSOR ARCHITECTURES
Port 0. In the first half of the cycle, port 0 can dispatch either one floating-point
move µop (a floating-point stack move, floating-point exchange or floating-point
store data) or one arithmetic logical unit (ALU) µop (arithmetic, logic, branch or
store data). In the second half of the cycle, it can dispatch one similar ALU µop.
Port 1. In the first half of the cycle, port 1 can dispatch either one floating-point
execution (all floating-point operations except moves, all SIMD operations) µop or
one normal-speed integer (multiply, shift and rotate) µop or one ALU (arithmetic)
µop. In the second half of the cycle, it can dispatch one similar ALU µop.
Port 2. This port supports the dispatch of one load operation per cycle.
Port 3. This port supports the dispatch of one store address operation per cycle.
The total issue bandwidth can range from zero to six µops per cycle. Each pipeline
contains several execution units. The µops are dispatched to the pipeline that
corresponds to the correct type of operation. For example, an integer arithmetic
logic unit and the floating-point execution units (adder, multiplier, and divider) can
share a pipeline.
Figure 2-5. Execution Units and Ports in Out-Of-Order Core
[Figure: block diagram of the four issue ports. Port 0: double-speed ALU 0
(ADD/SUB, logic, store data, branches) and FP move (FP move, FP store data,
FXCH). Port 1: double-speed ALU 1 (ADD/SUB), normal-speed integer operation
(shift/rotate), and FP execute (FP_ADD, FP_MUL, FP_DIV, FP_MISC, MMX_SHFT,
MMX_ALU, MMX_MISC). Port 2: memory load (all loads, prefetch). Port 3: memory
store (store address).]
Note:
FP_ADD refers to x87 FP, and SIMD FP add and subtract operations
FP_MUL refers to x87 FP, and SIMD FP multiply operations
FP_DIV refers to x87 FP, and SIMD FP divide and square root operations
MMX_ALU refers to SIMD integer arithmetic and logic operations
MMX_SHFT handles Shift, Rotate, Shuffle, Pack and Unpack operations
MMX_MISC handles SIMD reciprocal and some integer operations
2.2.4.3 Caches
The Intel NetBurst microarchitecture supports up to three levels of on-chip cache. At
least two levels of on-chip cache are implemented in processors based on the Intel
NetBurst microarchitecture. The Intel Xeon processor MP and selected Pentium 4
and Intel Xeon processors may also contain a third-level cache.
The first level cache (nearest to the execution core) contains separate caches for
instructions and data. These include the first-level data cache and the trace cache
(an advanced first-level instruction cache). All other caches are shared between
instructions and data.
Levels in the cache hierarchy are not inclusive. The fact that a line is in level i does
not imply that it is also in level i+1. All caches use a pseudo-LRU (least recently
used) replacement algorithm.
Table 2-5 provides parameters for all cache levels for Pentium 4 and Intel Xeon
processors with CPUID model encoding equal to 0, 1, 2 or 3.
Table 2-5. Pentium 4 and Intel Xeon Processor Cache Parameters

Level (Model)            Capacity                 Associativity  Line Size  Access Latency,         Write Update
                                                  (ways)         (bytes)    Integer/Floating-point  Policy
                                                                            (clocks)
First (Model 0, 1, 2)    8 KB                     4              64         2/9                     write through
First (Model 3)          16 KB                    8              64         4/12                    write through
TC (All models)          12K µops                 8              N/A        N/A                     N/A
Second (Model 0, 1, 2)   256 KB or 512 KB (1)     8              64 (2)     7/7                     write back
Second (Model 3, 4)      1 MB                     8              64 (2)     18/18                   write back
Second (Model 3, 4, 6)   2 MB                     8              64 (2)     20/20                   write back
Third (Model 0, 1, 2)    0, 512 KB, 1 MB or 2 MB  8              64 (2)     14/14                   write back

NOTES:
1. Pentium 4 and Intel Xeon processors with CPUID model encoding value of 2 have a second-level cache of 512 KB.
2. Each read due to a cache miss fetches a sector, consisting of two adjacent cache lines; a write operation is 64 bytes.
On processors without a third level cache, the second-level cache miss initiates a
transaction across the system bus interface to the memory sub-system. On
processors with a third level cache, the third-level cache miss initiates a transaction
across the system bus. A bus write transaction writes 64 bytes to cacheable
memory, or separate 8-byte chunks if the destination is not cacheable. A bus read
transaction from cacheable memory fetches two cache lines of data.
The system bus interface supports using a scalable bus clock and achieves an
effective speed that quadruples the speed of the scalable bus clock. It takes on the
order of 12 processor cycles to get to the bus and back within the processor, and
6-12 bus cycles to access memory if there is no bus congestion. Each bus cycle
equals several processor cycles. The ratio of processor clock speed to the scalable
bus clock speed is referred to as bus ratio. For example, one bus cycle for a 100 MHz
bus is equal to 15 processor cycles on a 1.50 GHz processor. Since the speed of the
bus is implementation-dependent, consult the specifications of a given system for
further details.
2.2.4.4 Data Prefetch
The Pentium 4 processor and other processors based on the NetBurst microarchitec-
ture have two types of mechanisms for prefetching data: software prefetch instruc-
tions and hardware-based prefetch mechanisms.
Software controlled prefetch is enabled using the four prefetch instructions
(PREFETCHh) introduced with SSE. The software prefetch is not intended for
prefetching code. Using it can incur significant penalties on a multiprocessor system
if code is shared.
Software prefetch can provide benefits in selected situations (see the sketch after
this list). These situations include when:
• the pattern of memory access operations in software allows the programmer to
hide memory latency
• a reasonable choice can be made about how many cache lines to fetch ahead of
the line being executed
• a choice can be made about the type of prefetch to use
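A minimal sketch of software-controlled prefetch using the PREFETCHh intrinsic is
shown below; the fetch-ahead distance of 128 elements (eight 64-byte lines of
floats) is an arbitrary placeholder that must be tuned per workload and processor:

    #include <xmmintrin.h>             /* _mm_prefetch, introduced with SSE */

    void scale(float *dst, const float *src, int n)
    {
        for (int i = 0; i < n; i++) {
            if ((i & 15) == 0)         /* one prefetch per 64-byte line (16 floats) */
                _mm_prefetch((const char *)&src[i + 128], _MM_HINT_T0);
            dst[i] = src[i] * 2.0f;
        }
    }

Prefetching a little past the end of the array is tolerable here because the prefetch
instructions are hints and do not fault.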
SSE prefetch instructions have different behaviors, depending on cache levels
updated and the processor implementation. For instance, a processor may imple-
ment the non-temporal prefetch by returning data to the cache level closest to the
processor core. This approach has the following effects:
• minimizes disturbance of temporal data in other cache levels
• avoids the need to access off-chip caches, which can increase the realized
bandwidth compared to a normal load-miss, which returns data to all cache levels
Situations that are less likely to benefit from software prefetch are:
• For cases that are already bandwidth bound, prefetching tends to increase
bandwidth demands.
• Prefetching far ahead can cause eviction of cached data from the caches prior to
the data being used in execution.
• Not prefetching far enough can reduce the ability to overlap memory and
execution latencies.
Software prefetches are treated by the processor as a hint to initiate a request to
fetch data from the memory system, and they consume resources in the processor;
the use of too many prefetches can limit their effectiveness. Examples of this include
prefetching data in a loop for a reference outside the loop, and prefetching in a basic
block that is frequently executed but which seldom precedes the reference for which
the prefetch is targeted.
See Chapter 9, "Optimizing Cache Usage."
Automatic hardware prefetch is a feature in the Pentium 4 processor. It brings
cache lines into the unified second-level cache based on prior reference patterns.
Software prefetching has the following characteristics:
• handles irregular access patterns, which do not trigger the hardware prefetcher
• handles prefetching of short arrays and avoids hardware prefetching start-up
delay before initiating the fetches
• must be added to new code; so it does not benefit existing applications
Hardware prefetching for the Pentium 4 processor has the following characteristics:
• works with existing applications
• does not require extensive study of prefetch instructions
• requires regular access patterns
• avoids instruction and issue port bandwidth overhead
• has a start-up penalty before the hardware prefetcher triggers and begins
initiating fetches
The hardware prefetcher can handle multiple streams in either the forward or
backward direction. The start-up delay and fetch-ahead has a larger effect for short
arrays when hardware prefetching generates a request for data beyond the end of
an array (not actually utilized). The hardware penalty diminishes if it is amortized
over longer arrays.
Hardware prefetching is triggered after two successive cache misses in the last level
cache and requires these cache misses to satisfy a condition that the linear address
distance between these cache misses is within a threshold value. The threshold
value depends on the processor implementation (see Table 2-6). However, hardware
prefetching will not cross 4-KByte page boundaries. As a result, hardware
prefetching can be very effective when dealing with cache miss patterns that have
small strides, significantly less than half the threshold distance to trigger hardware
prefetching. On the other hand, hardware prefetching will not benefit cache miss
patterns that have frequent DTLB misses or have access strides that cause
successive cache misses that are spatially apart by more than the trigger threshold
distance.
Software can proactively control data access patterns to favor smaller access strides
(e.g., a stride that is less than half of the trigger threshold distance) over larger
access strides (a stride that is greater than the trigger threshold distance); this can
achieve the additional benefit of improved temporal locality and significantly reduced
cache misses in the last level cache, as the sketch below illustrates.
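The C sketch below (illustrative only) contrasts the two access patterns: traversing
a two-dimensional array row by row produces an 8-byte stride that the hardware
prefetcher tracks, while traversing it column by column produces an 8-KByte stride
that exceeds the trigger threshold and crosses a 4-KByte page boundary on every
other access:

    #define N 1024
    static double a[N][N];

    double sum_rows(void)              /* stride = 8 bytes: prefetch-friendly */
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    double sum_cols(void)              /* stride = N * 8 = 8 KBytes: defeats the prefetcher */
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }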
Software optimization of a data access pattern should emphasize tuning for
hardware prefetch first, to favor greater proportions of smaller-stride data accesses
in the workload, before attempting to provide hints to the processor by employing
software prefetch instructions.
2.2.4.5 Loads and Stores
The Pentium 4 processor employs the following techniques to speed up the
execution of memory operations:
• speculative execution of loads
• reordering of loads with respect to loads and stores
• multiple outstanding misses
• buffering of writes
• forwarding of data from stores to dependent loads
Performance may be enhanced by not exceeding the memory issue bandwidth and
buffer resources provided by the processor. Up to one load and one store may be
issued for each cycle from a memory port reservation station. In order to be
dispatched to a reservation station, there must be a buffer entry available for each
memory operation. There are 48 load buffers and 24 store buffers³. These buffers
hold the µop and address information until the operation is completed, retired, and
deallocated.
The Pentium 4 processor is designed to enable the execution of memory operations
out of order with respect to other instructions and with respect to each other. Loads
can be carried out speculatively, that is, before all preceding branches are resolved.
However, speculative loads cannot cause page faults.
Reordering loads with respect to each other can prevent a load miss from stalling
later loads. Reordering loads with respect to other loads and stores to different
addresses can enable more parallelism, allowing the machine to execute operations
as soon as their inputs are ready. Writes to memory are always carried out in
program order to maintain program correctness.
A cache miss for a load does not prevent other loads from issuing and completing.
The Pentium 4 processor supports up to four (or eight for Pentium 4 processors with
CPUID signature corresponding to family 15, model 3) outstanding load misses that
can be serviced either by on-chip caches or by memory.
3. Pentium 4 processors with CPUID model encoding equal to 3 have more than 24 store buffers.
Store buffers improve performance by allowing the processor to continue executing
instructions without having to wait until a write to memory and/or cache is
complete. Writes are generally not on the critical path for dependence chains, so it
is often beneficial to delay writes for more efficient use of memory-access bus
cycles.
2.2.4.6 Store Forwarding
Loads can be moved before stores that occurred earlier in the program if they are
not predicted to load from the same linear address. If they do read from the same
linear address, they have to wait for the store data to become available. However,
with store forwarding, they do not have to wait for the store to write to the memory
hierarchy and retire. The data from the store can be forwarded directly to the load,
as long as the following conditions are met:
• Sequence: Data to be forwarded to the load has been generated by a program-
matically-earlier store which has already executed.
• Size: Bytes loaded must be a subset of (including a proper subset, that is, the
same) bytes stored; see the sketch after this list.
• Alignment: The store cannot wrap around a cache line boundary, and the
linear address of the load must be the same as that of the store.
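The following C sketch (illustrative only) shows the size condition in both directions:
a narrow load that is a subset of one wider store can be forwarded, while a wide
load assembled from two narrower stores cannot be, and must wait for the stores
to complete:

    unsigned int low_half(unsigned long long x)
    {
        union { unsigned long long u64; unsigned int u32[2]; } t;
        t.u64 = x;                     /* one 8-byte store ...                     */
        return t.u32[0];               /* ... 4-byte load is a subset: forwarded   */
    }

    unsigned long long pack(unsigned int lo, unsigned int hi)
    {
        union { unsigned int u32[2]; unsigned long long u64; } t;
        t.u32[0] = lo;                 /* two 4-byte stores ...                    */
        t.u32[1] = hi;
        return t.u64;                  /* ... one 8-byte load spans both: blocked  */
    }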
2.3 INTEL® PENTIUM® M PROCESSOR MICROARCHITECTURE
Like the Intel NetBurst microarchitecture, the pipeline of the Intel Pentium M
processor microarchitecture contains three sections:
• in-order issue front end
• out-of-order superscalar execution core
• in-order retirement unit
The Intel Pentium M processor microarchitecture supports a high-speed system bus
(up to 533 MHz) with 64-byte line size. Most coding recommendations that apply to
the Intel NetBurst microarchitecture also apply to the Intel Pentium M processor.
The Intel Pentium M processor microarchitecture is designed for lower power
consumption. There are other specific areas of the Pentium M processor microarchi-
tecture that differ from the Intel NetBurst microarchitecture. They are described
next. A block diagram of the Intel Pentium M processor is shown in Figure 2-6.
2.3.1 Front End
The Intel Pentium M processor uses a pipeline depth that enables high performance
and low power consumption. It is shorter than that of the Intel NetBurst microarchi-
tecture.
The Intel Pentium M processor front end consists of two parts:
• fetch/decode unit
• instruction cache
The fetch and decode unit includes a hardware instruction prefetcher and three
decoders that enable parallelism. It also provides a 32-KByte instruction cache that
stores un-decoded binary instructions.
The instruction prefetcher fetches instructions in a linear fashion from memory if the
target instructions are not already in the instruction cache. The prefetcher is
designed to fetch efficiently from an aligned 16-byte block. If the modulo 16
remainder of a branch target address is 14, only two useful instruction bytes are
fetched in the first cycle. The rest of the instruction bytes are fetched in subsequent
cycles.
The three decoders decode instructions and break them down into µops. In each
clock cycle, the first decoder is capable of decoding an instruction with four or fewer
µops. The remaining two decoders each decode a one-µop instruction in each clock
cycle.

Figure 2-6. The Intel Pentium M Processor Microarchitecture
[Figure: block diagram showing the front end (BTBs/branch prediction, fetch/decode
unit, first-level instruction cache), the out-of-order execution core, the retirement
unit, the first-level data cache, the second-level cache, the bus unit, and the system
bus; frequently used and less frequently used paths are distinguished]
The front end can issue multiple µops per cycle, in original program order, to the
out-of-order core.
The Intel Pentium M processor incorporates sophisticated branch prediction
hardware to support the out-of-order core. The branch prediction hardware includes
dynamic prediction and branch target buffers.
The Intel Pentium M processor has enhanced dynamic branch prediction hardware.
Branch target buffers (BTB) predict the direction and target of branches based on
an instruction's address.
The Pentium M processor includes two techniques to reduce the execution time of
certain operations:
• ESP folding: This eliminates the ESP manipulation µops in stack-related
instructions such as PUSH, POP, CALL and RET. It increases decode, rename and
retirement throughput. ESP folding also increases execution bandwidth by
eliminating µops which would have required execution resources.
• Micro-op (µop) fusion: Some of the most frequent pairs of µops derived
from the same instruction can be fused into a single µop. The following
categories of fused µops have been implemented in the Pentium M processor:
- Store address and store data µops are fused into a single "Store" µop.
This holds for all types of store operations, including integer, floating-point,
MMX technology, and Streaming SIMD Extensions (SSE and SSE2)
operations.
- A load µop in most cases can be fused with a successive execution µop.
This holds for integer, floating-point and MMX technology loads and for most
kinds of successive execution operations. Note that SSE loads can not be
fused.
2.3.2 Data Prefetching
The Intel Pentium M processor supports three prefetching mechanisms:
• The first mechanism is a hardware instruction fetcher and is described in the
previous section.
• The second mechanism automatically fetches data into the second-level cache.
The implementation of automatic hardware prefetching in the Pentium M
processor family is basically similar to that described for the NetBurst microar-
chitecture. The trigger threshold distance for each relevant processor model is
shown in Table 2-6.
• The third mechanism is a software mechanism that fetches data into the caches
using the prefetch instructions.
Data is fetched 64 bytes at a time; the instruction and data translation lookaside
buffers support 128 entries. See Table 2-7 for processor cache parameters.
2.3.3 Out-of-Order Core
The processor core dynamically executes µops independent of program order. The
core is designed to facilitate parallel execution by employing many buffers, issue
ports, and parallel execution units.
The out-of-order core buffers µops in a Reservation Station (RS) until their operands
are ready and resources are available. Each cycle, the core may dispatch up to five
µops through the issue ports.
2.3.4 In-Order Retirement
The retirement unit in the Pentium M processor buffers completed µops in the
reorder buffer (ROB). The ROB updates the architectural state in order. Up to three
µops may be retired per cycle.
Table 2-6. Trigger Threshold and CPUID Signatures for Processor Families

Trigger Threshold    Extended    Extended
Distance (Bytes)     Model ID    Family ID    Family ID    Model ID
512                  0           0            15           3, 4, 6
256                  0           0            15           0, 1, 2
256                  0           0            6            9, 13, 14
Table 2-7. Cache Parameters of Pentium M, Intel Core Solo,
and Intel Core Duo Processors

Level                Capacity    Associativity    Line Size    Access Latency    Write Update
                                 (ways)           (bytes)      (clocks)          Policy
First                32 KByte    8                64           3                 Writeback
Instruction          32 KByte    8                N/A          N/A               N/A
Second (model 9)     1 MByte     8                64           9                 Writeback
Second (model 13)    2 MByte     8                64           10                Writeback
Second (model 14)    2 MByte     8                64           14                Writeback
2.4 MICROARCHITECTURE OF INTEL® CORE™ SOLO AND INTEL® CORE™ DUO PROCESSORS
Intel Core Solo and Intel Core Duo processors incorporate a microarchitecture that
is similar to the Pentium M processor microarchitecture, but provides additional
enhancements for performance and power efficiency. Enhancements include:
• Intel Smart Cache: This second level cache is shared between two cores in an
Intel Core Duo processor to minimize bus traffic between two cores accessing a
single copy of cached data. It allows an Intel Core Solo processor (or when one
of the two cores in an Intel Core Duo processor is idle) to access its full capacity.
• Streaming SIMD Extensions 3: These extensions are supported in Intel Core
Solo and Intel Core Duo processors.
• Decoder improvement: Improvement in decoder and µop fusion allows the
front end to see most instructions as single-µop instructions. This increases the
throughput of the three decoders in the front end.
• Improved execution core: Throughput of SIMD instructions is improved and
the out-of-order engine is more robust in handling sequences of frequently-used
instructions. Enhanced internal buffering and prefetch mechanisms also improve
data bandwidth for execution.
• Power-optimized bus: The system bus is optimized for power efficiency;
increased bus speed supports 667 MHz.
• Data prefetch: Intel Core Solo and Intel Core Duo processors implement
improved hardware prefetch mechanisms: one mechanism can look ahead and
prefetch data into L1 from L2. These processors also provide enhanced hardware
prefetchers similar to those of the Pentium M processor (see Table 2-6).
2.4.1 Front End
Execution of SIMD instructions on Intel Core Solo and Intel Core Duo processors is
improved over Pentium M processors by the following enhancements:
• Micro-op fusion: Scalar SIMD operations on register and memory have
single-µop flows comparable to x87 flows. Many packed instructions are fused to
reduce their µop flow from four to two µops.
• Eliminating decoder restrictions: Intel Core Solo and Intel Core Duo
processors improve decoder throughput with micro-fusion and macro-fusion, so
that many more SSE and SSE2 instructions can be decoded without restriction.
On Pentium M processors, many single-µop SSE and SSE2 instructions must be
decoded by the main decoder.
• Improved packed SIMD instruction decoding: On Intel Core Solo and Intel
Core Duo processors, decoding of most packed SSE instructions is done by all
three decoders. As a result the front end can process up to three packed SSE
instructions every cycle. There are some exceptions to the above; some
shuffle/unpack/shift operations are not fused and require the main decoder.
2.4.2 Data Prefetching
Intel Core Solo and Intel Core Duo processors provide hardware mechanisms to
prefetch data from memory to the second-level cache. There are two techniques:
1. One mechanism activates after the data access pattern experiences two cache-
reference misses within a trigger-distance threshold (see Table 2-6). This
mechanism is similar to that of the Pentium M processor, but can track 16
forward data streams and 4 backward streams.
2. The second mechanism fetches an adjacent cache line of data after experiencing
a cache miss. This effectively simulates the prefetching capabilities of 128-byte
sectors (similar to the sectoring of two adjacent 64-byte cache lines available in
Pentium 4 processors).
Hardware prefetch requests are queued up in the bus system at lower priority than
normal cache-miss requests. If the bus queue is in high demand, hardware prefetch
requests may be ignored or cancelled to service bus traffic required by demand
cache-misses and other bus transactions. Hardware prefetch mechanisms are
enhanced over those of the Pentium M processor in two ways:
• Data stores that are not in the second-level cache generate read-for-ownership
requests. These requests are treated as loads and can trigger a prefetch stream.
• Software prefetch instructions are treated as loads; they can also trigger a
prefetch stream.
2.5 INTEL® HYPER-THREADING TECHNOLOGY
Intel® Hyper-Threading Technology (HT Technology) is supported by specific
members of the Intel Pentium 4 and Xeon processor families. The technology
enables software to take advantage of task-level, or thread-level parallelism by
providing multiple logical processors within a physical processor package. In its first
implementation in the Intel Xeon processor, Hyper-Threading Technology makes a
single physical processor appear as two logical processors.
The two logical processors each have a complete set of architectural registers while
sharing one single physical processor's resources. By maintaining the architecture
state of two processors, an HT Technology capable processor looks like two
processors to software, including operating system and application code.
By sharing resources needed for peak demands between two logical processors, HT
Technology is well suited for multiprocessor systems to provide an additional perfor-
mance boost in throughput when compared to traditional MP systems.
Figure 2-7 shows a typical bus-based symmetric multiprocessor (SMP) based on
processors supporting HT Technology. Each logical processor can execute a software
thread, allowing a maximum of two software threads to execute simultaneously on
one physical processor. The two software threads execute simultaneously, meaning
that in the same clock cycle an "add" operation from logical processor 0 and another
"add" operation and load from logical processor 1 can be executed simultaneously
by the execution engine.
In the first implementation of HT Technology, the physical execution resources are
shared and the architecture state is duplicated for each logical processor. This mini-
mizes the die area cost of implementing HT Technology while still achieving perfor-
mance gains for multithreaded applications or multitasking workloads.
The performance potential due to HT Technology is due to:
• The fact that operating systems and user programs can schedule processes or
threads to execute simultaneously on the logical processors in each physical
processor
• The ability to use on-chip execution resources at a higher level than when only a
single thread is consuming the execution resources; a higher level of resource
utilization can lead to higher system throughput
2.5.1 Processor Resources and HT Technology
The majority of microarchitecture resources in a physical processor are shared
between the logical processors. Only a few small data structures were replicated for
each logical processor. This section describes how resources are shared, partitioned
or replicated.
Figure 2-7. Hyper-Threading Technology on an SMP
[Figure: two physical processors on a shared system bus; each physical processor
contains one execution engine, two architectural states, two local APICs, and a bus
interface]
2.5.1.1 Replicated Resources
The architectural state is replicated for each logical processor. The architecture state
consists of registers that are used by the operating system and application code to
control program behavior and store data for computations. This state includes the
eight general-purpose registers, the control registers, machine state registers,
debug registers, and others. There are a few exceptions, most notably the memory
type range registers (MTRRs) and the performance monitoring resources. For a
complete list of the architecture state and exceptions, see the Intel® 64 and IA-32
Architectures Software Developer's Manual, Volumes 3A & 3B.
Other resources such as instruction pointers and register renaming tables were
replicated to simultaneously track execution and state changes of the two logical
processors. The return stack predictor is replicated to improve branch prediction of
return instructions.
In addition, a few buffers (for example, the 2-entry instruction streaming buffers)
were replicated to reduce complexity.
2.5.1.2 Partitioned Resources
Several buffers are shared by limiting the use of each logical processor to half the
entries. These are referred to as partitioned resources. Reasons for this partitioning
include:
• Operational fairness
• Permitting the ability to allow operations from one logical processor to bypass
operations of the other logical processor that may have stalled
For example: a cache miss, a branch misprediction, or instruction dependencies may
prevent a logical processor from making forward progress for some number of
cycles. The partitioning prevents the stalled logical processor from blocking forward
progress.
In general, the buffers for staging instructions between major pipe stages are parti-
tioned. These buffers include µop queues after the execution trace cache, the
queues after the register rename stage, the reorder buffer which stages instructions
for retirement, and the load and store buffers.
In the case of load and store buffers, partitioning also provided an easier implemen-
tation to maintain memory ordering for each logical processor and detect memory
ordering violations.
2.5.1.3 Shared Resources
Most resources in a physical processor are fully shared to improve the dynamic utili-
zation of the resource, including caches and all the execution units. Some shared
resources which are linearly addressed, like the DTLB, include a logical processor ID
bit to distinguish whether the entry belongs to one logical processor or the other.
The first level cache can operate in two modes depending on a context-ID bit:
• Shared mode: The L1 data cache is fully shared by two logical processors.
• Adaptive mode: In adaptive mode, memory accesses using the page directory
are mapped identically across logical processors sharing the L1 data cache.
The other resources are fully shared.
2.5.2 Microarchitecture Pipeline and HT Technology
This section describes the HT Technology microarchitecture and how instructions
from the two logical processors are handled between the front end and the back end
of the pipeline.
Although instructions originating from two programs or two threads execute simul-
taneously and not necessarily in program order in the execution core and memory
hierarchy, the front end and back end contain several selection points to select
between instructions from the two logical processors. All selection points alternate
between the two logical processors unless one logical processor cannot make use of
a pipeline stage. In this case, the other logical processor has full use of every cycle
of the pipeline stage. Reasons why a logical processor may not use a pipeline stage
include cache misses, branch mispredictions, and instruction dependencies.
2.5.3 Front End Pipeline
The execution trace cache is shared between two logical processors. Execution trace
cache access is arbitrated by the two logical processors every clock. If a cache line
is fetched for one logical processor in one clock cycle, the next clock cycle a line
would be fetched for the other logical processor provided that both logical
processors are requesting access to the trace cache.
If one logical processor is stalled or is unable to use the execution trace cache, the
other logical processor can use the full bandwidth of the trace cache until the initial
logical processor's instruction fetches return from the L2 cache.
After fetching the instructions and building traces of µops, the µops are placed in a
queue. This queue decouples the execution trace cache from the register rename
pipeline stage. As described earlier, if both logical processors are active, the queue
is partitioned so that both logical processors can make independent forward
progress.
2.5.4 Execution Core
The core can dispatch up to six µops per cycle, provided the µops are ready to
execute. Once the µops are placed in the queues waiting for execution, there is no
distinction between instructions from the two logical processors. The execution core
and memory hierarchy are also oblivious to which instructions belong to which
logical processor.
After execution, instructions are placed in the re-order buffer. The re-order buffer
decouples the execution stage from the retirement stage. The re-order buffer is
partitioned such that each logical processor uses half the entries.
2.5.5 Retirement
The retirement logic tracks when instructions from the two logical processors are
ready to be retired. It retires instructions in program order for each logical
processor by alternating between the two logical processors. If one logical processor
is not ready to retire any instructions, then all retirement bandwidth is dedicated to
the other logical processor.
Once stores have retired, the processor needs to write the store data into the level-
one data cache. Selection logic alternates between the two logical processors to
commit store data to the cache.
2.6 MULTICORE PROCESSORS
The Intel Pentium D processor and the Pentium Processor Extreme Edition introduce
multicore features. These processors enhance hardware support for multithreading
by providing two processor cores in each physical processor package. The Dual-Core
Intel Xeon and Intel Core Duo processors also provide two processor cores in a
physical package. The multicore topology of Intel Core 2 Duo processors is similar
to that of the Intel Core Duo processor.
The Intel Pentium D processor provides two logical processors in a physical
package; each logical processor has a separate execution core and a cache
hierarchy. The Dual-Core Intel Xeon processor and the Intel Pentium Processor
Extreme Edition provide four logical processors in a physical package that has two
execution cores. Each core provides two logical processors sharing an execution
core and a cache hierarchy.
The Intel Core Duo processor provides two logical processors in a physical package.
Each logical processor has a separate execution core (including first-level cache)
and a smart second-level cache. The second-level cache is shared between two
logical processors and optimized to reduce bus traffic when the same copy of cached
data is used by two logical processors. The full capacity of the second-level cache
can be used by one logical processor if the other logical processor is inactive.
The functional blocks of the dual-core processors are shown in Figure 2-8. The
Quad-Core Intel Xeon processors, Intel Core 2 Quad processor and Intel Core 2
Extreme quad-core processor consist of two replicas of the dual-core modules. The
functional blocks of the quad-core processors are also shown in Figure 2-8.
Figure 2-8. Pentium D Processor, Pentium Processor Extreme Edition,
Intel Core Duo Processor, Intel Core 2 Duo Processor, and Intel Core 2 Quad Processor
[Figure: functional block diagrams. Pentium D processor: two cores, each with its
own architectural state, execution engine, local APIC and bus interface, on a shared
system bus. Pentium Processor Extreme Edition: two cores, each execution engine
supporting two architectural states and two local APICs. Intel Core Duo processor /
Intel Core 2 Duo processor: two cores sharing a second-level cache and a single bus
interface. Intel Core 2 Quad processor and Intel Xeon processor 3200/5300 series:
two such dual-core modules, each pair of cores sharing a second-level cache and
bus interface, on a shared system bus.]
2.6.1 Microarchitecture Pipeline and MultiCore Processors
In general, each core in a multicore processor resembles a single-core processor
implementation of the underlying microarchitecture. The implementation of the
cache hierarchy in a dual-core or multicore processor may be the same as or
different from the cache hierarchy implementation in a single-core processor.
CPUID should be used to determine cache-sharing topology information in a
processor implementation and the underlying microarchitecture. The former is
obtained by querying the deterministic cache parameter leaf (see Chapter 9, "Opti-
mizing Cache Usage"); the latter by using the encoded values for extended family,
family, extended model, and model fields. See Table 2-8.

Table 2-8. Family And Model Designations of Microarchitectures

Dual-Core Processor           Micro-architecture   Extended   Family   Extended   Model
                                                   Family              Model
Pentium D processor           NetBurst             0          15       0          3, 4, 6
Pentium processor             NetBurst             0          15       0          3, 4, 6
Extreme Edition
Intel Core Duo processor      Improved Pentium M   0          6        0          14
Intel Core 2 Duo processor/   Intel Core           0          6        0          15
Intel Xeon processor 5100     Microarchitecture
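A hedged C sketch of querying the deterministic cache parameter leaf (CPUID leaf 4)
follows; it assumes a compiler that provides GCC's <cpuid.h> helper
__get_cpuid_count (other compilers expose CPUID differently), and the bit-field
layouts are as documented for this leaf in the instruction set reference:

    #include <stdio.h>
    #include <cpuid.h>                 /* GCC helper; assumption: not available everywhere */

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        for (unsigned int sub = 0; ; sub++) {
            if (!__get_cpuid_count(4, sub, &eax, &ebx, &ecx, &edx))
                break;                 /* CPUID leaf 4 not supported */
            unsigned int type = eax & 0x1f;
            if (type == 0)
                break;                 /* cache type 0 marks the end of the list */
            unsigned int level      = (eax >> 5) & 0x7;
            unsigned int line_size  = (ebx & 0xfff) + 1;
            unsigned int partitions = ((ebx >> 12) & 0x3ff) + 1;
            unsigned int ways       = ((ebx >> 22) & 0x3ff) + 1;
            unsigned int sets       = ecx + 1;
            printf("L%u cache: %u KBytes, %u-way, %u-byte lines\n",
                   level, ways * partitions * line_size * sets / 1024,
                   ways, line_size);
        }
        return 0;
    }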
2.6.2 Shared Cache in Intel® Core™ Duo Processors
The Intel Core Duo processor has two symmetric cores that share the second-level
cache and a single bus interface (see Figure 2-8). Two threads executing on two
cores in an Intel Core Duo processor can take advantage of the shared second-level
cache, accessing a single copy of cached data without generating bus traffic.
2.6.2.1 Load and Store Operations
When an instruction needs to read data from a memory address, the processor
looks for it in caches and memory. When an instruction writes data to a memory
location (write back), the processor first makes sure that the cache line that
contains the memory location is owned by the first-level data cache of the initiating
core (that is, the line is in exclusive or modified state). Then the processor looks for
the cache line in the cache and memory sub-systems. The look-ups for the locality
of a load or store operation are in the following order:
1. DCU of the initiating core
2. DCU of the other core and second-level cache
3. System memory
The cache line is taken from the DCU of the other core only if it is modified, ignoring
the cache line availability or state in the L2 cache. Table 2-9 lists the performance
characteristics of generic load and store operations in an Intel Core Duo processor.
Numeric values in Table 2-9 are in terms of processor core cycles.
Throughput is expressed as the number of cycles to wait before the same operation
can start again. The latency of a bus transaction is exposed in some of these opera-
tions, as indicated by entries containing "+ bus transaction". On Intel Core Duo
processors, a typical bus transaction may take 5.5 bus cycles. For a 667 MHz bus
and a core frequency of 2.167 GHz, the total of 14 + 5.5 * 2167/(667/4) is approxi-
mately 86 core cycles (the quad-pumped 667 MHz bus corresponds to a scalable bus
clock of 667/4 = 166.75 MHz, so 5.5 bus cycles span about 71.5 core cycles).
Sometimes a modified cache line has to be evicted to make room for a new cache
line. The modified cache line is evicted in parallel to bringing in new data and does
not require additional latency. However, when data is written back to memory, the
eviction consumes cache bandwidth and bus bandwidth. For multiple cache misses
that require the eviction of modified lines within a short time, there is an overall
degradation in response time of these cache misses.
For a store operation, reading for ownership must be completed before the data is
written to the first-level data cache and the line is marked as modified. Reading for
ownership and storing the data happen after instruction retirement and follow the
order of retirement. The bus store latency does not affect the store instruction itself.
However, several sequential stores may have cumulative latency that can affect
performance.
Table 2-9. Characteristics of Load and Store Operations
in Intel Core Duo Processors

                               Load                                Store
Data Locality                  Latency         Throughput          Latency         Throughput
DCU                            3               1                   2               1
DCU of the other core in       14 + bus        14 + bus            14 + bus        ~10
Modified state                 transaction     transaction         transaction
2nd-level cache                14              <6                  14              <6
Memory                         14 + bus        Bus read            14 + bus        Bus write
                               transaction     protocol            transaction     protocol
2.7 INTEL® 64 ARCHITECTURE
Intel 64 architecture supports almost all features in the IA-32 Intel architecture and
extends support to run 64-bit OS and 64-bit applications in 64-bit linear address
space. Intel 64 architecture provides a new operating mode, referred to as IA-32e
mode, and increases the linear address space for software to 64 bits and supports
physical address space up to 40 bits.
IA-32e mode consists of two sub-modes: (1) compatibility mode enables a 64-bit
operating system to run most legacy 32-bit software unmodified, (2) 64-bit mode
enables a 64-bit operating system to run applications written to access 64-bit linear
address space.
In the 64-bit mode of Intel 64 architecture, software may access:
• 64-bit flat linear addressing
• 8 additional general-purpose registers (GPRs)
• 8 additional registers for streaming SIMD extensions (SSE, SSE2, SSE3 and
SSSE3)
• 64-bit-wide GPRs and instruction pointers
• uniform byte-register addressing
• fast interrupt-prioritization mechanism
• a new instruction-pointer relative-addressing mode
For optimizing 64-bit applications, the features that impact software optimizations
include:
• using a set of prefixes to access new registers or 64-bit register operands
• pointer size increases from 32 bits to 64 bits
• instruction-specific usages
2.8 SIMD TECHNOLOGY
SIMD computations (see Figure 2-9) were introduced to the architecture with MMX
technology. MMX technology allows SIMD computations to be performed on packed
byte, word, and doubleword integers. The integers are contained in a set of eight
64-bit registers called MMX registers (see Figure 2-10).
The Pentium III processor extended the SIMD computation model with the introduc-
tion of the Streaming SIMD Extensions (SSE). SSE allows SIMD computations to be
performed on operands that contain four packed single-precision floating-point data
elements. The operands can be in memory or in a set of eight 128-bit XMM registers
(see Figure 2-10). SSE also extended SIMD computational capability by adding
additional 64-bit MMX instructions.
Figure 2-9 shows a typical SIMD computation. Two sets of four packed data
elements (X1, X2, X3, and X4, and Y1, Y2, Y3, and Y4) are operated on in parallel,
with the same operation being performed on each corresponding pair of data
elements (X1 and Y1, X2 and Y2, X3 and Y3, and X4 and Y4). The results of the four
parallel computations are sorted as a set of four packed data elements.

Figure 2-9. Typical SIMD Operations
[Figure: the packed elements X4 X3 X2 X1 and Y4 Y3 Y2 Y1 are combined by four
parallel OPs into X4 op Y4, X3 op Y3, X2 op Y2, X1 op Y1]
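The operation of Figure 2-9 maps directly onto a single SSE instruction; the C sketch
below (illustrative only) adds four packed single-precision pairs with one ADDPS via
compiler intrinsics:

    #include <xmmintrin.h>             /* SSE intrinsics */

    void add4(const float *x, const float *y, float *result)
    {
        __m128 vx = _mm_loadu_ps(x);                /* X4 X3 X2 X1 */
        __m128 vy = _mm_loadu_ps(y);                /* Y4 Y3 Y2 Y1 */
        _mm_storeu_ps(result, _mm_add_ps(vx, vy));  /* Xi op Yi for all four pairs at once */
    }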
The Pentium 4 processor further extended the SIMD computation model with the
introduction of Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Exten-
sions 3 (SSE3), and the Intel Xeon processor 5100 series introduced Supplemental
Streaming SIMD Extensions 3 (SSSE3).
SSE2 works with operands in either memory or the XMM registers. The technology
extends SIMD computations to process packed double-precision floating-point data
elements and 128-bit packed integers. There are 144 instructions in SSE2 that
operate on two packed double-precision floating-point data elements or on 16
packed byte, 8 packed word, 4 doubleword, and 2 quadword integers.
SSE3 enhances x87, SSE and SSE2 by providing 13 instructions that can accelerate
application performance in specific areas. These include video processing, complex
arithmetic, and thread synchronization. SSE3 complements SSE and SSE2 with
instructions that process SIMD data asymmetrically, facilitate horizontal computa-
tion, and help avoid loading cache line splits. See Figure 2-10.
SSSE3 provides additional enhancement for SIMD computation with 32 instructions
for digital video and signal processing.
The SIMD extensions operate the same way in Intel 64 architecture as in IA-32
architecture, with the following enhancements:
• 128-bit SIMD instructions referencing XMM registers can access 16 XMM
registers in 64-bit mode.
• Instructions that reference 32-bit general purpose registers can access 16
general purpose registers in 64-bit mode.
SIMD improves the performance of 3D graphics, speech recognition, image
processing, scientific applications and applications that have the following charac-
teristics:
• inherently parallel
• recurring memory access patterns
• localized recurring operations performed on the data
• data-independent control flow
SIMD floating-point instructions fully support the IEEE Standard 754 for Binary
Floating-Point Arithmetic. They are accessible from all IA-32 execution modes:
protected mode, real address mode, and Virtual 8086 mode.
SSE, SSE2, and MMX technologies are architectural extensions. Existing software
will continue to run correctly, without modification, on Intel microprocessors that
incorporate these technologies. Existing software will also run correctly in the
presence of applications that incorporate SIMD technologies.
SSE and SSE2 instructions also introduced cacheability and memory ordering
instructions that can improve cache usage and application performance.
Figure 2-10. SIMD Instruction Register Usage
[Figure: the eight 64-bit MMX registers (MM0 through MM7) and the eight 128-bit
XMM registers (XMM0 through XMM7)]
For more on SSE, SSE2, SSE3 and MMX technologies, see the following chapters in
the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1:
• Chapter 9, "Programming with Intel MMX Technology"
• Chapter 10, "Programming with Streaming SIMD Extensions (SSE)"
• Chapter 11, "Programming with Streaming SIMD Extensions 2 (SSE2)"
• Chapter 12, "Programming with SSE3 and Supplemental SSE3"
2.8.1 Summary of SIMD Technologies
2.8.1.1 MMX Technology
MMX Technology introduced:
• 64-bit MMX registers
• support for SIMD operations on packed byte, word, and doubleword integers
MMX instructions are useful for multimedia and communications software.
2.8.1.2 Streaming SIMD Extensions
Streaming SIMD extensions introduced:
• 128-bit XMM registers
• 128-bit data type with four packed single-precision floating-point operands
• data prefetch instructions
• non-temporal store instructions and other cacheability and memory ordering
instructions (see the sketch after this list)
• extra 64-bit SIMD integer support
SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and
video encoding and decoding.
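As a brief sketch of the non-temporal stores mentioned above (illustrative only; the
destination is assumed to be 16-byte aligned and the count a multiple of four),
MOVNTPS via the _mm_stream_ps intrinsic writes around the caches to avoid
polluting them with data that will not be read again soon, and _mm_sfence orders
the streaming stores:

    #include <xmmintrin.h>

    void fill(float *dst, float value, int n)   /* assumes 16-byte aligned dst, n % 4 == 0 */
    {
        __m128 v = _mm_set1_ps(value);
        for (int i = 0; i < n; i += 4)
            _mm_stream_ps(&dst[i], v);          /* non-temporal store: bypasses the caches */
        _mm_sfence();                           /* make the streaming stores globally visible */
    }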
2.8.1.3 Streaming SIMD Extensions 2
Streaming SIMD extensions 2 add the following:
• 128-bit data type with two packed double-precision floating-point operands
• 128-bit data types for SIMD integer operations on 16-byte, 8-word,
4-doubleword, or 2-quadword integers
• support for SIMD arithmetic on 64-bit integer operands
• instructions for converting between new and existing data types
• extended support for data shuffling
• extended support for cacheability and memory ordering operations
SSE2 instructions are useful for 3D graphics, video decoding/encoding, and
encryption.
2.8.1.4 Streaming SIMD Extensions 3
Streaming SIMD extensions 3 add the following:
• SIMD floating-point instructions for asymmetric and horizontal computation
• a special-purpose 128-bit load instruction to avoid cache line splits
• an x87 FPU instruction to convert to integer independent of the floating-point
control word (FCW)
• instructions to support thread synchronization
SSE3 instructions are useful for scientific, video and multi-threaded applications.
2.8.1.5 Supplemental Streaming SIMD Extensions 3
The Supplemental Streaming SIMD Extensions 3 introduce 32 new instructions to
accelerate eight types of computations on packed integers. These include:
• 12 instructions that perform horizontal addition or subtraction operations
• 6 instructions that evaluate absolute values
• 2 instructions that perform multiply and add operations and speed up the
evaluation of dot products
• 2 instructions that accelerate packed-integer multiply operations and produce
integer values with scaling
• 2 instructions that perform a byte-wise, in-place shuffle according to the second
shuffle control operand (see the sketch after this list)
• 6 instructions that negate packed integers in the destination operand if the sign
of the corresponding element in the source operand is less than zero
• 2 instructions that align data from the composite of two operands
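A small sketch of the byte-wise shuffle (PSHUFB, exposed as the _mm_shuffle_epi8
intrinsic) is shown below; each byte of the control operand selects which source
byte lands in that position, here reversing the byte order of a 128-bit value:

    #include <tmmintrin.h>             /* SSSE3 intrinsics */

    __m128i reverse_bytes(__m128i x)
    {
        /* control byte i holds 15 - i, so result byte i = source byte 15 - i */
        const __m128i control = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                             8, 9, 10, 11, 12, 13, 14, 15);
        return _mm_shuffle_epi8(x, control);
    }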
CHAPTER 3
GENERAL OPTIMIZATION GUIDELINES
This chapter discusses general optimization techniques that can improve the perfor-
mance of applications running on processors based on Intel Core microarchitecture,
Intel NetBurst microarchitecture, Intel Core Duo, Intel Core Solo, and Pentium M
processors. These techniques take advantage of microarchitectural features
described in Chapter 2, "Intel 64 and IA-32 Processor Architectures." Optimization
guidelines focusing on Intel dual-core processors, Hyper-Threading Technology and
64-bit mode applications are discussed in Chapter 8, "Multi-Core and Hyper-
Threading Technology," and Chapter 9, "64-bit Mode Coding Guidelines."
Practices that optimize performance focus on three areas:
• tools and techniques for code generation
• analysis of the performance characteristics of the workload and its interaction
with microarchitectural sub-systems
• tuning code to the target microarchitecture (or families of microarchitectures)
to improve performance
Some hints on using tools are summarized first to simplify the first two tasks. The
rest of the chapter will focus on recommendations for code generation or code
tuning to the target microarchitectures.
This chapter explains optimization techniques for the Intel C++ Compiler, the Intel
Fortran Compiler, and other compilers.
3.1 PERFORMANCE TOOLS
Intel offers several tools to help optimize application performance, including
compilers, a performance analyzer and multithreading tools.
3.1.1 Intel® C++ and Fortran Compilers
Intel compilers support multiple operating systems (Windows*, Linux*, Mac OS*
and embedded). The Intel compilers optimize performance and give application
developers access to advanced features:
• Flexibility to target 32-bit or 64-bit Intel processors for optimization
• Compatibility with many integrated development environments or third-party
compilers
• Automatic optimization features to take advantage of the target processor's
architecture
• Automatic compiler optimization that reduces the need to write different code
for different processors
• Common compiler features that are supported across Windows, Linux and Mac
OS, including:
- general optimization settings
- cache-management features
- interprocedural optimization (IPO) methods
- profile-guided optimization (PGO) methods
- multithreading support
- floating-point arithmetic precision and consistency support
- compiler optimization and vectorization reports
3.1.2 General Compiler Recommendations
Generally speaking, a compiler that has been tuned for the target microarchitecture
can be expected to match or outperform hand-coding. However, if performance
problems are noted with the compiled code, some compilers (like Intel C++ and
Fortran Compilers) allow the coder to insert intrinsics or inline assembly in order to
exert control over what code is generated. If inline assembly is used, the user must
verify that the code generated is of good quality and yields good performance.
Default compiler switches are targeted for common cases. An optimization may be
made to the compiler default if it is beneficial for most programs. If the root cause
of a performance problem is a poor choice on the part of the compiler, using
different switches or compiling the targeted module with a different compiler may
be the solution.
3.1.3 VTune Performance Analyzer
VTune uses performance monitoring hardware to collect statistics and coding infor-
mation of your application and its interaction with the microarchitecture. This allows
software engineers to measure performance characteristics of the workload for a
given microarchitecture. VTune supports Intel Core microarchitecture, Intel
NetBurst microarchitecture, Intel Core Duo, Intel Core Solo, and Pentium M
processor families.
The VTune Performance Analyzer provides two kinds of feedback:
• indication of a performance improvement gained by using a specific coding
recommendation or microarchitectural feature
• information on whether a change in the program has improved or degraded
performance with respect to a particular metric
The VTune Performance Analyzer also provides measures for a number of workload
characteristics, including:
• retirement throughput of instruction execution as an indication of the degree of
extractable instruction-level parallelism in the workload
• data traffic locality as an indication of the stress point of the cache and memory
hierarchy
• data traffic parallelism as an indication of the degree of effectiveness of amorti-
zation of data access latency
NOTE
Improving performance in one part of the machine does not
necessarily bring significant gains to overall performance. It is
possible to degrade overall performance by improving performance
for some particular metric.
Where appropriate, coding recommendations in this chapter include descriptions of
the VTune Performance Analyzer events that provide measurable data on the
performance gain achieved by following the recommendations. For more on using
the VTune analyzer, refer to the application's online help.
3.2 PROCESSOR PERSPECTIVES
Many coding recommendations for Intel Core microarchitecture work well across Pentium M, Intel Core Solo, Intel Core Duo processors and processors based on Intel NetBurst microarchitecture. However, there are situations where a recommendation may benefit one microarchitecture more than another. Some of these are:
• Instruction decode throughput is important for processors based on Intel Core microarchitecture (Pentium M, Intel Core Solo, and Intel Core Duo processors) but less important for processors based on Intel NetBurst microarchitecture. Generating code with a 4-1-1 template (instruction with four µops followed by two instructions with one µop each) helps the Pentium M processor.
• Intel Core Solo and Intel Core Duo processors have an enhanced front end that is less sensitive to the 4-1-1 template. Processors based on Intel Core microarchitecture have 4 decoders and employ micro-fusion and macro-fusion so that each of the three simple decoders is not restricted to handling simple instructions consisting of one µop.
• Taking advantage of micro-fusion will increase decoder throughput across Intel Core Solo, Intel Core Duo and Intel Core 2 Duo processors. Taking advantage of macro-fusion can improve decoder throughput further on the Intel Core 2 Duo processor family.
• On processors based on Intel NetBurst microarchitecture, the code size limit of interest is imposed by the trace cache. On Pentium M processors, the code size limit is governed by the instruction cache.
• Dependencies for partial register writes incur large penalties when using the Pentium M processor (this applies to processors with CPUID signature family 6, model 9). On Pentium 4, Intel Xeon processors, and the Pentium M processor (with CPUID signature family 6, model 13), such penalties are relieved by artificial dependencies between each partial register write. Intel Core Solo, Intel Core Duo processors and processors based on Intel Core microarchitecture can experience minor delays due to partial register stalls. To avoid false dependences from partial register updates, use full register updates and extended moves.
• Use appropriate instructions that support dependence-breaking (PXOR, SUB, XOR instructions). Dependence-breaking support for XORPS is available in Intel Core Solo, Intel Core Duo processors and processors based on Intel Core microarchitecture.
• Floating point register stack exchange instructions are slightly more expensive due to issue restrictions in processors based on Intel NetBurst microarchitecture.
• Hardware prefetching can reduce the effective memory latency for data and instruction accesses in general. But different microarchitectures may require some custom modifications to adapt to the specific hardware prefetch implementation of each microarchitecture.
• On processors based on Intel NetBurst microarchitecture, latencies of some instructions are relatively significant (including shifts, rotates, integer multiplies, and moves from memory with sign extension). Use care when using the LEA instruction. See Section 3.5.1.3, "Using LEA."
• On processors based on Intel NetBurst microarchitecture, there may be a penalty when instructions with immediates requiring more than 16-bit signed representation are placed next to other instructions that use immediates.
3.2.1 CPUID Dispatch Strategy and Compatible Code Strategy
When optimum performance on all processor generations is desired, applications can take advantage of the CPUID instruction to identify the processor generation and integrate processor-specific instructions into the source code. The Intel C++ Compiler supports the integration of different versions of the code for different target processors. The selection of which code to execute at runtime is made based on the CPU identifiers. Binary code targeted for different processor generations can be generated under the control of the programmer or by the compiler.
For applications that target multiple generations of microarchitectures, and where minimum binary code size and a single code path are important, a compatible code strategy is the best. Optimizing applications using techniques developed for the Intel Core microarchitecture, combined with some for Intel NetBurst microarchitecture, is likely to improve code efficiency and scalability when running on processors based on current and future generations of Intel 64 and IA-32 processors. This
compatible approach to optimization is also likely to deliver high performance on Pentium M, Intel Core Solo and Intel Core Duo processors.
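As a minimal sketch of such runtime dispatch (the register usage and label names below are illustrative, and the extended family/model fields are omitted for brevity), the processor signature can be read with CPUID function 1:

    mov  eax, 1          ; CPUID function 1 returns the signature in EAX
    cpuid
    mov  ecx, eax
    shr  ecx, 8
    and  ecx, 0FH        ; ECX = family ID
    shr  eax, 4
    and  eax, 0FH        ; EAX = model ID
    ; Compare family/model against known values and branch to the
    ; code path generated for that processor generation.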
3.2.2 Transparent Cache-Parameter Strategy
If the CPUID instruction supports function leaf 4, also known as the deterministic cache parameter leaf, the leaf reports cache parameters for each level of the cache hierarchy in a deterministic and forward-compatible manner across Intel 64 and IA-32 processor families.
For coding techniques that rely on specific parameters of a cache level, using the deterministic cache parameter leaf allows software to implement techniques in a way that is forward-compatible with future generations of Intel 64 and IA-32 processors, and cross-compatible with processors equipped with different cache sizes.
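The following sketch (an illustration; the bit-field decoding is that of the deterministic cache parameter leaf, and the label name is hypothetical) shows how software might retrieve the line size of the first cache level reported:

    mov  eax, 4          ; CPUID function 4: deterministic cache parameters
    xor  ecx, ecx        ; sub-leaf 0: first cache reported
    cpuid
    and  eax, 1FH        ; EAX[4:0] = cache type; 0 means no more caches
    jz   Done
    and  ebx, 0FFFH      ; EBX[11:0] = system coherency line size - 1
    inc  ebx             ; EBX now holds the cache line size in bytes
Done: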
3.2.3 Threading Strategy and Hardware Multithreading Support
Intel 64 and IA-32 processor families offer hardware multithreading support in two forms: dual-core technology and HT Technology.
To fully harness the performance potential of hardware multithreading in current and future generations of Intel 64 and IA-32 processors, software must embrace a threaded approach in application design. At the same time, to address the widest range of installed machines, multi-threaded software should be able to run without failure on a single processor without hardware multithreading support and should achieve performance on a single logical processor that is comparable to an unthreaded implementation (if such comparison can be made). This generally requires architecting a multi-threaded application to minimize the overhead of thread synchronization. Additional guidelines on multithreading are discussed in Chapter 8, "Multicore and Hyper-Threading Technology."
3.3 CODING RULES, SUGGESTIONS AND TUNING HINTS
This section includes rules, suggestions and hints. They are targeted for engineers who are:
• modifying source code to enhance performance (user/source rules)
• writing assemblers or compilers (assembly/compiler rules)
• doing detailed performance tuning (tuning suggestions)
Coding recommendations are ranked in importance using two measures:
• Local impact (high, medium, or low) refers to a recommendation's effect on the performance of a given instance of code.
• Generality (high, medium, or low) measures how often such instances occur across all application domains. Generality may also be thought of as "frequency".
These recommendations are approximate. They can vary depending on coding style, application domain, and other factors.
The purpose of the high, medium, and low (H, M, and L) priorities is to suggest the relative level of performance gain one can expect if a recommendation is implemented.
Because it is not possible to predict the frequency of a particular code instance in applications, priority hints cannot be directly correlated to application-level performance gain. In cases in which application-level performance gain has been observed, we have provided a quantitative characterization of the gain (for information only). In cases in which the impact has been deemed inapplicable, no priority is assigned.
3.4 OPTIMIZING THE FRONT END
Optimizing the front end covers two aspects:
• Maintaining a steady supply of µops to the execution engine. Mispredicted branches can disrupt streams of µops, or cause the execution engine to waste execution resources on executing streams of µops in the non-architected code path. Much of the tuning in this respect focuses on working with the Branch Prediction Unit. Common techniques are covered in Section 3.4.1, "Branch Prediction Optimization."
• Supplying streams of µops to utilize the execution bandwidth and retirement bandwidth as much as possible. For Intel Core microarchitecture and the Intel Core Duo processor family, this aspect focuses on maintaining high decode throughput. In Intel NetBurst microarchitecture, this aspect focuses on keeping the Trace Cache operating in stream mode. Techniques to maximize decode throughput for Intel Core microarchitecture are covered in Section 3.4.2, "Fetch and Decode Optimization."
3.4.1 Branch Prediction Optimization
Branch optimizations have a significant impact on performance. By understanding the flow of branches and improving their predictability, you can increase the speed of code significantly.
Optimizations that help branch prediction are:
• Keep code and data on separate pages. This is very important; see Section 3.6, "Optimizing Memory Accesses," for more information.
• Eliminate branches whenever possible.
• Arrange code to be consistent with the static branch prediction algorithm.
• Use the PAUSE instruction in spin-wait loops.
• Inline functions and pair up calls and returns.
• Unroll as necessary so that repeatedly-executed loops have sixteen or fewer iterations (unless this causes an excessive code size increase).
• Separate branches so that they occur no more frequently than every three µops where possible.
3.4.1.1 Eliminating Branches
Eliminating branches improves performance because:
• It reduces the possibility of mispredictions.
• It reduces the number of required branch target buffer (BTB) entries. Conditional branches, which are never taken, do not consume BTB resources.
There are four principal ways of eliminating branches:
• Arrange code to make basic blocks contiguous.
• Unroll loops, as discussed in Section 3.4.1.7, "Loop Unrolling."
• Use the CMOV instruction.
• Use the SETCC instruction.
The following rules apply to branch elimination:
Assembly/Compiler Coding Rule 1. (MH impact, M generality) Arrange code to make basic blocks contiguous and eliminate unnecessary branches.
For the Pentium M processor, every branch counts. Even correctly predicted branches have a negative effect on the amount of useful code delivered to the processor. Also, taken branches consume space in the branch prediction structures and extra branches create pressure on the capacity of the structures.
Assembly/Compiler Coding Rule 2. (M impact, ML generality) Use the SETCC and CMOV instructions to eliminate unpredictable conditional branches where possible. Do not do this for predictable branches. Do not use these instructions to eliminate all unpredictable conditional branches (because using these instructions will incur execution overhead due to the requirement for executing both paths of a conditional branch). In addition, converting a conditional branch to SETCC or CMOV trades off control flow dependence for data dependence and restricts the capability of the out-of-order engine. When tuning, note that all Intel 64 and IA-32 processors usually have very high branch prediction rates. Consistently mispredicted branches are generally rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch.
Consider a line of C code that has a condition dependent upon one of the constants:
X = (A < B) ? CONST1 : CONST2;
This code conditionally compares two values, A and B. If the condition is true, X is set to CONST1; otherwise it is set to CONST2. An assembly code sequence equivalent to the above C code can contain branches that are not predictable if there is no correlation between the two values.
Example 3-1 shows the assembly code with unpredictable branches. The unpredictable branches can be removed with the use of the SETCC instruction. Example 3-2 shows optimized code that has no branches.
The optimized code in Example 3-2 sets EBX to zero, then compares A and B. If A is greater than or equal to B, EBX is set to one. Then EBX is decreased and ANDed with the difference of the constant values. This sets EBX to either zero or the difference of the values. By adding CONST2 back to EBX, the correct value is written to EBX. When CONST2 is equal to zero, the last instruction can be deleted.
Another way to remove branches on Pentium II and subsequent processors is to use the CMOV and FCMOV instructions. Example 3-3 shows how to change a TEST and branch instruction sequence using CMOV to eliminate a branch. If the TEST sets the equal flag, the value in EBX will be moved to EAX. This branch is data-dependent, and is representative of an unpredictable branch.
Example 3-1. Assembly Code with an Unpredictable Branch
    cmp a, b             ; Condition
    jbe L30              ; Conditional branch
    mov ebx, const1      ; ebx holds X
    jmp L31              ; Unconditional branch
L30:
    mov ebx, const2
L31:
Example 3-2. Code Optimization to Eliminate Branches
    xor ebx, ebx         ; Clear ebx (X in the C code)
    cmp A, B
    setge bl             ; When ebx = 0 or 1
                         ; OR the complement condition
    sub ebx, 1           ; ebx=11...11 or 00...00
    and ebx, CONST3      ; CONST3 = CONST1-CONST2
    add ebx, CONST2      ; ebx=CONST1 or CONST2
The CMOV and FCMOV instructions are available on the Pentium II and subsequent processors, but not on Pentium processors and earlier IA-32 processors. Be sure to check whether a processor supports these instructions with the CPUID instruction.
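For example (a sketch; the label name is illustrative), the CMOV feature flag is bit 15 of the EDX feature flags returned by CPUID function 1:

    mov  eax, 1          ; CPUID function 1: feature flags returned in EDX
    cpuid
    test edx, 8000H      ; EDX[15] = 1 if CMOV (and FCMOV, given an FPU) is supported
    jz   NoCmovPath      ; otherwise fall back to a branch-based sequence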
3.4.1.2 Spin-Wait and Idle Loops
The Pentium 4 processor introduces a new PAUSE instruction; the instruction is architecturally a NOP on Intel 64 and IA-32 processor implementations.
To the Pentium 4 and later processors, this instruction acts as a hint that the code sequence is a spin-wait loop. Without a PAUSE instruction in such loops, the Pentium 4 processor may suffer a severe penalty when exiting the loop because the processor may detect a possible memory order violation. Inserting the PAUSE instruction significantly reduces the likelihood of a memory order violation and as a result improves performance.
In Example 3-4, the code spins until memory location A matches the value stored in the register EAX. Such code sequences are common when protecting a critical section, in producer-consumer sequences, for barriers, or other synchronization.
3.4.1.3 Static Prediction
Branches that do not have a history in the BTB (see Section 3.4.1, "Branch Prediction Optimization") are predicted using a static prediction algorithm.
Example 3-3. Eliminating Branch with CMOV Instruction
    test ecx, ecx
    jne 1H
    mov eax, ebx
1H:
; To optimize code, combine jne and mov into one cmovcc instruction that checks the equal flag
    test ecx, ecx        ; Test the flags
    cmoveq eax, ebx      ; If the equal flag is set, move
                         ; ebx to eax; the 1H: tag is no longer needed
Example 3-4. Use of PAUSE Instruction
lock:   cmp eax, a
        jne loop
        ; Code in critical section:
loop:   pause
        cmp eax, a
        jne loop
        jmp lock
Pentium 4, Pentium M, Intel Core Solo and Intel Core Duo processors have similar static prediction algorithms that:
• predict unconditional branches to be taken
• predict indirect branches to be NOT taken
In addition, conditional branches in processors based on the Intel NetBurst microarchitecture are predicted using the following static prediction algorithm:
• predict backward conditional branches to be taken; this rule is suitable for loops
• predict forward conditional branches to be NOT taken
Pentium M, Intel Core Solo and Intel Core Duo processors do not statically predict conditional branches according to the jump direction. All conditional branches are dynamically predicted, even at first appearance.
The following rule applies to static elimination.
Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with the static branch prediction algorithm: make the fall-through code following a conditional branch be the likely target for a branch with a forward target, and make the fall-through code following a conditional branch be the unlikely target for a branch with a backward target.
Example 3-5 illustrates the static branch prediction algorithm. The body of an IF-THEN conditional is predicted.
Example 3-6 and Example 3-7 provide basic rules for a static prediction algorithm.
Example 3-5. Pentium 4 Processor Static Branch Prediction Algorithm
//Forward condition branches not taken (fall through)
IF<condition> { ...
}
IF<condition> { ...
}
//Backward conditional branches are taken
LOOP { ...
}<condition>
//Unconditional branches taken
JMP
In Example 3-6, the backward branch (JC BEGIN) is not in the BTB the first time through; therefore, the BTB does not issue a prediction. The static predictor, however, will predict the branch to be taken, so a misprediction will not occur.
The first branch instruction (JC BEGIN) in Example 3-7 is a conditional forward branch. It is not in the BTB the first time through, but the static predictor will predict the branch to fall through. The static prediction algorithm correctly predicts that the CALL CONVERT instruction will be taken, even before the branch has any branch history in the BTB.
The Intel Core microarchitecture does not use the static prediction heuristic. However, to maintain consistency across Intel 64 and IA-32 processors, software should maintain the static prediction heuristic as the default.
3.4.1.4 Inlining, Calls and Returns
The return address stack mechanism augments the static and dynamic predictors to optimize specifically for calls and returns. It holds 16 entries, which is large enough to cover the call depth of most programs. If there is a chain of more than 16 nested calls and more than 16 returns in rapid succession, performance may degrade.
The trace cache in Intel NetBurst microarchitecture maintains branch prediction information for calls and returns. As long as the trace with the call or return remains in the trace cache and the call and return targets remain unchanged, the depth limit of the return address stack described above will not impede performance.
To enable the use of the return stack mechanism, calls and returns must be matched in pairs. If this is done, the likelihood of exceeding the stack depth in a manner that will impact performance is very low.
Example 3-6. Static Taken Prediction
Begin:  mov eax, mem32
        and eax, ebx
        imul eax, edx
        shl eax, 7
        jc Begin
Example 3-7. Static Not-Taken Prediction
        mov eax, mem32
        and eax, ebx
        imul eax, edx
        shl eax, 7
        jc Begin
        mov eax, 0
Begin:  call Convert
The following rules apply to inlining, calls, and returns.
Assembly/Compiler Coding Rule 4. (MH impact, MH generality) Near calls must be matched with near returns, and far calls must be matched with far returns. Pushing the return address on the stack and jumping to the routine to be called is not recommended since it creates a mismatch in calls and returns.
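The anti-pattern and its fix can be sketched as follows (label names are illustrative):

    ; Mismatched: the RET in Routine has no corresponding CALL,
    ; so the return stack buffer mispredicts the return target.
    push OFFSET ReturnPoint
    jmp  Routine
ReturnPoint:

    ; Matched: the CALL/RET pair keeps the return stack accurate.
    call Routine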
Calls and returns are expensive; use inlining for the following reasons:
• Parameter passing overhead can be eliminated.
• In a compiler, inlining a function exposes more opportunity for optimization.
• If the inlined routine contains branches, the additional context of the caller may improve branch prediction within the routine.
• A mispredicted branch can lead to performance penalties inside a small function that are larger than those that would occur if that function is inlined.
Assembly/Compiler Coding Rule 5. (MH impact, MH generality) Selectively inline a function if doing so decreases code size or if the function is small and the call site is frequently executed.
Assembly/Compiler Coding Rule 6. (H impact, H generality) Do not inline a function if doing so increases the working set size beyond what will fit in the trace cache.
Assembly/Compiler Coding Rule 7. (ML impact, ML generality) If there are more than 16 nested calls and returns in rapid succession, consider transforming the program with inlining to reduce the call depth.
Assembly/Compiler Coding Rule 8. (ML impact, ML generality) Favor inlining small functions that contain branches with poor prediction rates. If a branch misprediction results in a RETURN being prematurely predicted as taken, a performance penalty may be incurred.
Assembly/Compiler Coding Rule 9. (L impact, L generality) If the last statement in a function is a call to another function, consider converting the call to a jump. This will save the call/return overhead as well as an entry in the return stack buffer.
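A minimal sketch of the transformation in Rule 9 (label names are illustrative):

    ; Before: a call immediately followed by a return.
    call Helper
    ret

    ; After: the jump reuses the current return address, saving the
    ; call/return overhead and an entry in the return stack buffer.
    jmp  Helper          ; Helper's RET returns directly to our caller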
Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four branches in a 16-byte chunk.
Assembly/Compiler Coding Rule 11. (M impact, L generality) Do not put more than two end loop branches in a 16-byte chunk.
3.4.1.5 Code Alignment
Careful arrangement of code can enhance cache and memory locality. Likely sequences of basic blocks should be laid out contiguously in memory. This may involve removing unlikely code, such as code to handle error conditions, from the sequence. See Section 3.7, "Prefetching," on optimizing the instruction prefetcher.
Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned.
Assembly/Compiler Coding Rule 13. (M impact, H generality) If the body of a conditional is not likely to be executed, it should be placed in another part of the program. If it is highly unlikely to be executed and code locality is an issue, it should be placed on a different code page.
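For example (a sketch of Rule 12; ALIGN is the MASM-style directive and its availability in the assembler used is an assumption), a frequently executed branch target can be padded to a 16-byte boundary:

    ALIGN 16             ; pad so the branch target starts a new 16-byte chunk
HotLoop:
    ...                  ; loop body
    dec  ecx
    jnz  HotLoop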
3.4.1.6 Branch Type Selection
The default predicted target for indirect branches and calls is the fall-through path. Fall-through prediction is overridden if and when a hardware prediction is available for that branch. The predicted branch target from branch prediction hardware for an indirect branch is the previously executed branch target.
The default prediction to the fall-through path is only a significant issue if no branch prediction is available, due to poor code locality or pathological branch conflict problems. For indirect calls, predicting the fall-through path is usually not an issue, since execution will likely return to the instruction after the associated return.
Placing data immediately following an indirect branch can cause a performance problem. If the data consists of all zeros, it looks like a long stream of ADDs to memory destinations and this can cause resource conflicts and slow down branch recovery. Also, data immediately following indirect branches may appear as branches to the branch prediction hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.
Assembly/Compiler Coding Rule 14. (M impact, L generality) When indirect branches are present, try to put the most likely target of an indirect branch immediately following the indirect branch. Alternatively, if indirect branches are common but they cannot be predicted by branch prediction hardware, then follow the indirect branch with a UD2 instruction, which will stop the processor from decoding down the fall-through path.
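Rule 14 can be sketched as follows (register and label names are illustrative):

    jmp  [edx]           ; indirect branch; fall-through is the default prediction
MostLikelyTarget:        ; place the hottest target immediately after the branch
    ...

    ; Alternatively, when no hardware prediction can be expected:
    jmp  [edx]
    ud2                  ; stops the processor from decoding the fall-through path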
Indirect branches resulting from code constructs (such as switch statements, computed GOTOs or calls through pointers) can jump to an arbitrary number of locations. If the code sequence is such that the target destination of a branch goes to the same address most of the time, then the BTB will predict accurately most of the time. Since only one taken (non-fall-through) target can be stored in the BTB, indirect branches with multiple taken targets may have lower prediction rates.
The effective number of targets stored may be increased by introducing additional conditional branches. Adding a conditional branch to a target is fruitful if:
• The branch direction is correlated with the branch history leading up to that branch; that is, not just the last target, but how it got to this branch.
• The source/target pair is common enough to warrant using the extra branch prediction capacity. This may increase the number of overall branch mispredictions, while improving the misprediction of indirect branches. The profitability is lower if the number of mispredicting branches is very large.
User/Source Coding Rule 1. (M impact, L generality) If an indirect branch has two or more common taken targets and at least one of those targets is correlated with branch history leading up to the branch, then convert the indirect branch to a tree where one or more indirect branches are preceded by conditional branches to those targets. Apply this "peeling" procedure to the common target of an indirect branch that correlates to branch history.
The purpose of this rule is to reduce the total number of mispredictions by enhancing the predictability of branches (even at the expense of adding more branches). The added branches must be predictable for this to be worthwhile. One reason for such predictability is a strong correlation with preceding branch history. That is, the directions taken on preceding branches are a good indicator of the direction of the branch under consideration.
Example 3-8 shows a simple example of the correlation between a target of a preceding conditional branch and a target of an indirect branch.
Correlation can be difficult to determine analytically, for a compiler and for an assembly language programmer. It may be fruitful to evaluate performance with and without peeling to get the best performance from a coding effort.
An example of peeling out the most favored target of an indirect branch with correlated branch history is shown in Example 3-9.
Example 3-8. Indirect Branch With Two Favored Targets
function ()
{
    int n = rand();               // random integer 0 to RAND_MAX
    if ( ! (n & 0x01) ) {         // n will be 0 half the times
        n = 0;                    // updates branch history to predict taken
    }
    // indirect branches with multiple taken targets
    // may have lower prediction rates
    switch (n) {
        case 0: handle_0(); break;    // common target, correlated with
                                      // branch history that is forward taken
        case 1: handle_1(); break;    // uncommon
        case 3: handle_3(); break;    // uncommon
        default: handle_other();      // common target
    }
}
3.4.1.7 Loop Unrolling
Benefits of unrolling loops are:
• Unrolling amortizes the branch overhead, since it eliminates branches and some of the code to manage induction variables.
• Unrolling allows one to aggressively schedule (or pipeline) the loop to hide latencies. This is useful if you have enough free registers to keep variables live as you stretch out the dependence chain to expose the critical path.
• Unrolling exposes the code to various other optimizations, such as removal of redundant loads, common subexpression elimination, and so on.
• The Pentium 4 processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations (if that number of iterations is predictable and there are no conditional branches in the loop). So, if the loop body size is not excessive and the probable number of iterations is known, unroll inner loops until they have a maximum of 16 iterations. With the Pentium M processor, do not unroll loops having more than 64 iterations.
The potential costs of unrolling loops are:
• Excessive unrolling or unrolling of very large loops can lead to increased code size. This can be harmful if the unrolled loop no longer fits in the trace cache (TC).
• Unrolling loops whose bodies contain branches increases demand on BTB capacity. If the number of iterations of the unrolled loop is 16 or fewer, the branch predictor should be able to correctly predict branches in the loop body that alternate direction.
Example 3-9. A Peeling Technique to Reduce Indirect Branch Misprediction
function ()
{
    int n = rand();               // Random integer 0 to RAND_MAX
    if ( ! (n & 0x01) )
        n = 0;                    // n will be 0 half the times
    if (!n) {
        handle_0();               // Peel out the most common target
                                  // with correlated branch history
    }
    else {
        switch (n) {
            case 1: handle_1(); break;    // Uncommon
            case 3: handle_3(); break;    // Uncommon
            default: handle_other();      // Make the favored target in
                                          // the fall-through path
        }
    }
}
Assembly/Compiler Coding Rule 15. (H impact, M generality) Unroll small loops until the overhead of the branch and induction variable accounts (generally) for less than 10% of the execution time of the loop.
Assembly/Compiler Coding Rule 16. (H impact, M generality) Avoid unrolling loops excessively; this may thrash the trace cache or instruction cache.
Assembly/Compiler Coding Rule 17. (M impact, M generality) Unroll loops that are frequently executed and have a predictable number of iterations to reduce the number of iterations to 16 or fewer. Do this unless it increases code size so that the working set no longer fits in the trace or instruction cache. If the loop body contains more than one conditional branch, then unroll so that the number of iterations is 16/(# conditional branches).
Example 3-10 shows how unrolling enables other optimizations.
In this example, the loop that executes 100 times assigns X to every even-numbered element and Y to every odd-numbered element. By unrolling the loop you can make assignments more efficiently, removing one branch in the loop body.
3.4.1.8 Compiler Support for Branch Prediction
Compilers generate code that improves the efficiency of branch prediction in the Pentium 4, Pentium M, Intel Core Duo processors and processors based on Intel Core microarchitecture. The Intel C++ Compiler accomplishes this by:
• keeping code and data on separate pages
• using conditional move instructions to eliminate branches
• generating code consistent with the static branch prediction algorithm
• inlining where appropriate
• unrolling if the number of iterations is predictable
Example 3-10. Loop Unrolling
Before unrolling:
    do i = 1, 100
        if ( i mod 2 == 0 ) then a( i ) = x
        else a( i ) = y
    enddo
After unrolling:
    do i = 1, 100, 2
        a( i ) = y
        a( i+1 ) = x
    enddo
With profile-guided optimization, the compiler can lay out basic blocks to eliminate branches for the most frequently executed paths of a function or at least improve their predictability. Branch prediction need not be a concern at the source level. For more information, see Intel C++ Compiler documentation.
3.4.2 Fetch and Decode Optimization
Intel Core microarchitecture provides several mechanisms to increase front end throughput. Techniques to take advantage of some of these features are discussed below.
3.4.2.1 Optimizing for Micro-fusion
An instruction that operates on a register and a memory operand decodes into more µops than its corresponding register-register version. Replacing the equivalent work of the former instruction using the register-register version usually requires a sequence of two instructions. The latter sequence is likely to result in reduced fetch bandwidth.
Assembly/Compiler Coding Rule 18. (ML impact, M generality) For improving fetch/decode throughput, give preference to the memory flavor of an instruction over the register-only flavor of the same instruction, if such an instruction can benefit from micro-fusion.
The following examples are some of the types of micro-fusions that can be handled by all decoders:
• All stores to memory, including store immediate. Stores execute internally as two separate µops: store-address and store-data.
• All "read-modify" (load+op) instructions between register and memory, for example:
    ADDPS XMM9, OWORD PTR [RSP+40]
    FADD DOUBLE PTR [RDI+RSI*8]
    XOR RAX, QWORD PTR [RBP+32]
• All instructions of the form "load and jump," for example:
    JMP [RDI+200]
    RET
• CMP and TEST with immediate operand and memory
An Intel 64 instruction with RIP relative addressing is not micro-fused in the following cases:
• When an additional immediate is needed, for example:
    CMP [RIP+400], 27
    MOV [RIP+3000], 142
• When an RIP is needed for control flow purposes, for example:
    JMP [RIP+5000000]
In these cases, Intel Core microarchitecture provides a 2-µop flow from decoder 0, resulting in a slight loss of decode bandwidth, since the 2-µop flow must be steered to decoder 0 from the decoder with which it was aligned.
RIP addressing may be common in accessing global data. Since it will not benefit from micro-fusion, the compiler may consider accessing global data with other means of memory addressing.
3.4.2.2 Optimizing for Macro-fusion
Macro-fusion merges two instructions to a single µop. Intel Core microarchitecture performs this hardware optimization under limited circumstances.
The first instruction of the macro-fused pair must be a CMP or TEST instruction. This instruction can be REG-REG, REG-IMM, or a micro-fused REG-MEM comparison. The second instruction (adjacent in the instruction stream) should be a conditional branch.
Since these pairs are common ingredients in basic iterative programming sequences, macro-fusion improves performance even on un-recompiled binaries. All of the decoders can decode one macro-fused pair per cycle, with up to three other instructions, resulting in a peak decode bandwidth of 5 instructions per cycle.
Each macro-fused instruction executes with a single dispatch. This process reduces latency, which in this case shows up as a cycle removed from the branch misprediction penalty. Software also gains all other fusion benefits: increased rename and retire bandwidth, more storage for instructions in-flight, and power savings from representing more work in fewer bits.
The following list details when you can use macro-fusion:
• CMP or TEST can be fused when comparing:
    REG-REG. For example: CMP EAX, ECX; JZ label
    REG-IMM. For example: CMP EAX, 0x80; JZ label
    REG-MEM. For example: CMP EAX, [ECX]; JZ label
    MEM-REG. For example: CMP [EAX], ECX; JZ label
• TEST can be fused with all conditional jumps.
• CMP can be fused with only the following conditional jumps. These conditional jumps check the carry flag (CF) or the zero flag (ZF). The list of macro-fusion-capable conditional jumps is:
    JA or JNBE
    JAE or JNB or JNC
    JE or JZ
    JNA or JBE
    JNAE or JC or JB
    JNE or JNZ
CMP and TEST cannot be fused when comparing MEM-IMM (e.g. CMP [EAX], 0x80; JZ label). Macro-fusion is not supported in 64-bit mode.
Assembly/Compiler Coding Rule 19. (M impact, ML generality) Employ macro-fusion where possible using instruction pairs that support macro-fusion. Prefer TEST over CMP if possible. Use unsigned variables and unsigned jumps when possible. Try to logically verify that a variable is non-negative at the time of comparison. Avoid CMP or TEST of the MEM-IMM flavor when possible. However, do not add other instructions to avoid using the MEM-IMM flavor.
Example 3-11. Macro-fusion, Unsigned Iteration Count

Without macro-fusion. C code (a signed iteration count inhibits macro-fusion):
    for (int i = 0; i < 1000; i++)
        a++;
Disassembly:
        mov dword ptr [ i ], 0
        jmp First
Loop:
        mov eax, dword ptr [ i ]
        add eax, 1
        mov dword ptr [ i ], eax
First:
        cmp dword ptr [ i ], 3E8H    ; CMP MEM-IMM, JGE inhibit macro-fusion
        jge End
        mov eax, dword ptr [ a ]     ; a++
        add eax, 1
        mov dword ptr [ a ], eax
        jmp Loop
End:

With macro-fusion. C code (an unsigned iteration count is compatible with macro-fusion):
    for ( unsigned int i = 0; i < 1000; i++)
        a++;
Disassembly:
        mov dword ptr [ i ], 0
        jmp First
Loop:
        mov eax, dword ptr [ i ]
        add eax, 1
        mov dword ptr [ i ], eax
First:
        cmp eax, 3E8H                ; CMP REG-IMM, JAE permits macro-fusion
        jae End
        mov eax, dword ptr [ a ]     ; a++
        add eax, 1
        mov dword ptr [ a ], eax
        jmp Loop
End:
Example 3-12. Macro-fusion, If Statement

Without macro-fusion. C code (a signed variable inhibits macro-fusion):
    int a = 7;
    if ( a < 77 )
        a++;
    else
        a--;
Disassembly:
        mov dword ptr [ a ], 7
        cmp dword ptr [ a ], 4DH     ; CMP MEM-IMM, JGE inhibit macro-fusion
        jge Dec
        mov eax, dword ptr [ a ]     ; a++
        add eax, 1
        mov dword ptr [ a ], eax
        jmp End
Dec:
        mov eax, dword ptr [ a ]     ; a--
        sub eax, 1
        mov dword ptr [ a ], eax
End:

With macro-fusion. C code (an unsigned variable is compatible with macro-fusion):
    unsigned int a = 7;
    if ( a < 77 )
        a++;
    else
        a--;
Disassembly:
        mov dword ptr [ a ], 7
        mov eax, dword ptr [ a ]
        cmp eax, 4DH                 ; CMP REG-IMM, JAE permits macro-fusion
        jae Dec
        add eax, 1                   ; a++
        mov dword ptr [ a ], eax
        jmp End
Dec:
        sub eax, 1                   ; a--
        mov dword ptr [ a ], eax
End:
Assembly/Compiler Coding Rule 20. (M impact, ML generality) Software can enable macro fusion when it can be logically determined that a variable is non-negative at the time of comparison; use TEST appropriately to enable macro-fusion when comparing a variable with 0.
For either signed or unsigned variable "a", "CMP a, 0" and "TEST a, a" produce the same result as far as the flags are concerned. Since TEST can be macro-fused more often, software can use "TEST a, a" to replace "CMP a, 0" for the purpose of enabling macro-fusion.
3.4.2.3 Length-Changing Prefixes (LCP)
The length of an instruction can be up to 15 bytes. Some prefixes can dynamically change the length of an instruction that the decoder must recognize. Typically, the pre-decode unit will estimate the length of an instruction in the byte stream assuming the absence of LCP. When the predecoder encounters an LCP in the fetch line, it must use a slower length decoding algorithm. With the slower length decoding algorithm, the predecoder decodes the fetch in 6 cycles, instead of the usual 1 cycle. Normal queuing throughput of the machine pipeline generally cannot hide LCP penalties.
The prefixes that can dynamically change the length of an instruction include:
• operand size prefix (0x66)
• address size prefix (0x67)
Example 3-13. Macro-fusion, Signed Variable
Without macro-fusion:
    test ecx, ecx
    jle OutSideTheIF
    cmp ecx, 64H
    jge OutSideTheIF
    <IF BLOCK CODE>
OutSideTheIF:
With macro-fusion:
    test ecx, ecx
    jle OutSideTheIF
    cmp ecx, 64H
    jae OutSideTheIF
    <IF BLOCK CODE>
OutSideTheIF:

Example 3-14. Macro-fusion, Signed Comparison
C code: if (a == 0)
Without macro-fusion:
    cmp a, 0
    jne lbl
    ...
lbl:
With macro-fusion:
    test a, a
    jne lbl
    ...
lbl:
C code: if (a >= 0)
Without macro-fusion:
    cmp a, 0
    jl lbl
    ...
lbl:
With macro-fusion:
    test a, a
    jl lbl
    ...
lbl:
The instruction MOV DX, 01234h is subject to LCP stalls in processors based on Intel Core microarchitecture, and in Intel Core Duo and Intel Core Solo processors. Instructions that contain imm16 as part of their fixed encoding but do not require LCP to change the immediate size are not subject to LCP stalls. The REX prefix (4xh) in 64-bit mode can change the size of two classes of instruction, but does not cause an LCP penalty.
If the LCP stall happens in a tight loop, it can cause significant performance degradation. When decoding is not a bottleneck, as in floating-point heavy code, isolated LCP stalls usually do not cause performance degradation.
Assembly/Compiler Coding Rule 21. (MH impact, MH generality) Favor generating code using imm8 or imm32 values instead of imm16 values.
If imm16 is needed, load the equivalent imm32 into a register and use the word value in the register instead.
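For instance (a sketch of the workaround in Rule 21):

    ; LCP stall: the 66H prefix changes the immediate from imm32 to imm16
    mov  dx, 1234H
    ; Preferred: materialize the constant with an imm32 encoding,
    ; then use DX as the word value in subsequent code
    mov  edx, 1234H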
Double LCP Stalls
Instructions that are subject to LCP stalls and cross a 16-byte fetch line boundary can cause the LCP stall to trigger twice. The following alignment situations can cause LCP stalls to trigger twice:
• An instruction is encoded with a MODR/M and SIB byte, and the fetch line boundary crossing is between the MODR/M and the SIB bytes.
• An instruction starts at offset 13 of a fetch line and references a memory location using register and immediate byte offset addressing mode.
The first stall is for the 1st fetch line, and the 2nd stall is for the 2nd fetch line. A double LCP stall causes a decode penalty of 11 cycles.
The following examples cause an LCP stall once, regardless of the fetch-line location of the first byte of the instruction:
    ADD DX, 01234H
    ADD word ptr [EDX], 01234H
    ADD word ptr 012345678H[EDX], 01234H
    ADD word ptr [012345678H], 01234H
The following instructions cause a double LCP stall when starting at offset 13 of a fetch line:
    ADD word ptr [EDX+ESI], 01234H
    ADD word ptr 012H[EDX], 01234H
    ADD word ptr 012345678H[EDX+ESI], 01234H
To avoid double LCP stalls, do not use instructions subject to LCP stalls that use SIB byte encoding or addressing mode with byte displacement.
False LCP Stalls
False LCP stalls have the same characteristics as LCP stalls, but occur on instructions that do not have any imm16 value.
False LCP stalls occur when (a) instructions with LCP are encoded using the F7 opcodes, and (b) are located at offset 14 of a fetch line. These instructions are NOT, NEG, DIV, IDIV, MUL, and IMUL. A false LCP stall delays decoding because the instruction length decoder cannot determine the length of the instruction before the next fetch line, which holds the exact opcode of the instruction in its MODR/M byte.
The following techniques can help avoid false LCP stalls:
• Upcast all short operations from the F7 group of instructions to long, using the full 32-bit version.
• Ensure that the F7 opcode never starts at offset 14 of a fetch line.
Assembly/Compiler Coding Rule 22. (M impact, ML generality) Ensure instructions using the 0xF7 opcode byte do not start at offset 14 of a fetch line; and avoid using these instructions to operate on 16-bit data, upcast short data to 32 bits.
3.4.2.4 Optimizing the Loop Stream Detector (LSD)
Loops that fit the following criteria are detected by the LSD and replayed from the instruction queue:
• Must be less than or equal to four 16-byte fetches.
• Must be less than or equal to 18 instructions.
• Can contain no more than four taken branches and none of them can be a RET.
• Should usually have more than 64 iterations.
Many calculation-intensive loops, searches and software string moves match these characteristics. These loops exceed the BPU prediction capacity and always terminate in a branch misprediction.
Assembly/Compiler Coding Rule 23. (MH impact, MH generality) Break up a loop containing a long sequence of instructions into loops of shorter instruction blocks of no more than 18 instructions.
Assembly/Compiler Coding Rule 24. (MH impact, M generality) Avoid unrolling loops containing LCP stalls, if the unrolled block exceeds 18 instructions.
3.4.2.5 Scheduling Rules for the Pentium 4 Processor Decoder
Processors based on Intel NetBurst microarchitecture have a single decoder that can decode instructions at the maximum rate of one instruction per clock. Complex instructions must enlist the help of the microcode ROM.
Example 3-15. Avoiding False LCP Delays with 0xF7 Group Instructions
A sequence causing delay in the decoder:
    neg word ptr a
Alternate sequence to avoid delay:
    movsx eax, word ptr a
    neg eax
    mov word ptr a, ax
Because µops are delivered from the trace cache in the common cases, decoding rules and code alignment are not required.
3.4.2.6 Scheduling Rules for the Pentium M Processor Decoder
The Pentium M processor has three decoders, but the decoding rules to supply µops at high bandwidth are less stringent than those of the Pentium III processor. This provides an opportunity to build a front-end tracker in the compiler and try to schedule instructions correctly. The decoder limitations are:
• The first decoder is capable of decoding one macroinstruction made up of four or fewer µops in each clock cycle. It can handle any number of bytes up to the maximum of 15. Multiple-prefix instructions require additional cycles.
• The two additional decoders can each decode one macroinstruction per clock cycle (assuming the instruction is one µop up to seven bytes in length).
• Instructions composed of more than four µops take multiple cycles to decode.
Assembly/Compiler Coding Rule 25. (M impact, M generality) Avoid putting explicit references to ESP in a sequence of stack operations (POP, PUSH, CALL, RET).
3.4.2.7 Other Decoding Guidelines
Assembly/Compiler Coding Rule 26. (ML impact, L generality) Use simple instructions that are less than eight bytes in length.
Assembly/Compiler Coding Rule 27. (M impact, MH generality) Avoid using prefixes to change the size of immediate and displacement.
Long instructions (more than seven bytes) limit the number of decoded instructions per cycle on the Pentium M processor. Each prefix adds one byte to the length of an instruction, possibly limiting the decoder's throughput. In addition, multiple prefixes can only be decoded by the first decoder. These prefixes also incur a delay when decoded. If multiple prefixes or a prefix that changes the size of an immediate or displacement cannot be avoided, schedule them behind instructions that stall the pipe for some other reason.
3.5 OPTIMIZING THE EXECUTION CORE
The superscalar, out-of-order execution core(s) in recent generations of microarchitectures contain multiple execution hardware resources that can execute multiple µops in parallel. These resources generally ensure that µops execute efficiently and
proceed with fixed latencies. General guidelines to make use of the available parallelism are:
• Follow the rules (see Section 3.4) to maximize useful decode bandwidth and front end throughput. These rules include favoring single-µop instructions and taking advantage of micro-fusion, the stack pointer tracker and macro-fusion.
• Maximize rename bandwidth. Guidelines are discussed in this section and include properly dealing with partial registers, ROB read ports and instructions which cause side-effects on flags.
• Schedule instruction sequences so that multiple dependency chains are alive in the reservation station (RS) simultaneously, thus ensuring that your code utilizes maximum parallelism.
• Avoid hazards and minimize delays that may occur in the execution core, allowing the dispatched µops to make progress and be ready for retirement quickly.
3.5.1 Instruction Selection
Some execution units are not pipelined; this means that µops cannot be dispatched in consecutive cycles and the throughput is less than one per cycle.
It is generally a good starting point to select instructions by considering the number of µops associated with each instruction, favoring, in order: single-µop instructions, simple instructions with fewer than 4 µops, and last, instructions requiring the microsequencer ROM (µops which are executed out of the microsequencer involve extra overhead).
Assembly/Compiler Coding Rule 28. (M impact, H generality) Favor single-micro-operation instructions. Also favor instructions with shorter latencies.
A compiler may already be doing a good job on instruction selection. If so, user intervention usually is not necessary.
Assembly/Compiler Coding Rule 29. (M impact, L generality) Avoid prefixes, especially multiple non-0F-prefixed opcodes.
Assembly/Compiler Coding Rule 30. (M impact, L generality) Do not use many segment registers.
On the Pentium M processor, there is only one level of renaming of segment registers.
Assembly/Compiler Coding Rule 31. (ML impact, M generality) Avoid using complex instructions (for example, enter, leave, or loop) that have more than four µops and require multiple cycles to decode. Use sequences of simple instructions instead.
Complex instructions may save architectural registers, but incur a penalty of 4 µops to set up parameters for the microsequencer ROM in Intel NetBurst microarchitecture.
Theoretically, arranging an instruction sequence to match the 4-1-1-1 template applies to processors based on Intel Core microarchitecture. However, with macro-fusion and micro-fusion capabilities in the front end, attempts to schedule instruction sequences using the 4-1-1-1 template will likely provide diminishing returns.
Instead, software should follow these additional decoder guidelines:
• If you need to use multiple-µop, non-microsequenced instructions, try to separate them by a few single-µop instructions. The following instructions are examples of multiple-µop instructions not requiring the micro-sequencer:
    ADC/SBB
    CMOVcc
    Read-modify-write instructions
• If a series of multiple-µop instructions cannot be separated, try breaking the series into a different equivalent instruction sequence. For example, a series of read-modify-write instructions may go faster if sequenced as a series of read-modify + store instructions. This strategy could improve performance even if the new code sequence is larger than the original one.
3.5.1.1 Use of the INC and DEC Instructions
The INC and DEC instructions modify only a subset of the bits in the flag register. This creates a dependence on all previous writes of the flag register. This is especially problematic when these instructions are on the critical path because they are used to change an address for a load on which many other instructions depend.
Assembly/Compiler Coding Rule 32. (M impact, H generality) INC and DEC instructions should be replaced with ADD or SUB instructions, because ADD and SUB overwrite all flags, whereas INC and DEC do not, therefore creating false dependencies on earlier instructions that set the flags.
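A minimal illustration of Rule 32:

    inc  edx             ; partial flag write: must wait on earlier flag producers
    ; preferred:
    add  edx, 1          ; writes all flags, avoiding the false flag dependence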
3.5.1.2 Integer Divide
Typically, an integer divide is preceded by a CWD or CDQ instruction. Depending on the operand size, divide instructions use DX:AX or EDX:EAX for the dividend. The CWD or CDQ instructions sign-extend AX or EAX into DX or EDX, respectively. These instructions have denser encoding than a shift and move would be, but they generate the same number of µops. If AX or EAX is known to be positive, replace these instructions with:
    xor dx, dx
or
    xor edx, edx
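Putting the two forms side by side (a sketch; EBX as the divisor is an illustrative assumption):

    cdq                  ; sign-extends EAX into EDX before the divide
    idiv ebx

    xor  edx, edx        ; when EAX is known non-negative: same result,
    idiv ebx             ; and zeroing EDX is a dependency-breaking idiom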
3.5.1.3 Using LEA
In some cases with processors based on Intel NetBurst microarchitecture, the LEA instruction or a sequence of LEA, ADD, SUB and SHIFT instructions can replace constant multiply instructions. The LEA instruction can also be used as a multiple operand addition instruction, for example:
    LEA ECX, [EAX + EBX + 4 + A]
Using LEA in this way may avoid register usage by not tying up registers for operands of arithmetic instructions. This use may also save code space.
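For instance (a sketch of replacing a constant multiply):

    ; one LEA in place of a multiply by 5:
    lea  eax, [eax + eax*4]      ; EAX = EAX * 5, without writing flags
    ; adding a doubling step extends this to a multiply by 10:
    add  eax, eax                ; EAX = EAX * 10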
If the LEA instruction uses a shift by a constant amount, then the latency of the sequence of µops is shorter if adds are used instead of a shift, and the LEA instruction may be replaced with an appropriate sequence of µops. This, however, increases the total number of µops, leading to a trade-off.
Assembly/Compiler Coding Rule 33. (ML impact, L generality) If an LEA instruction using the scaled index is on the critical path, a sequence with ADDs may be better. If code density and bandwidth out of the trace cache are the critical factors, then use the LEA instruction.
3.5.1.4 Using SHIFT and ROTATE
The SHIFT and ROTATE instructions have a longer latency on processors with a CPUID signature corresponding to family 15 and model encoding of 0, 1, or 2. The latency of a sequence of adds will be shorter for left shifts of three or less. Fixed and variable SHIFTs have the same latency.
The rotate by immediate and rotate by register instructions are more expensive than a shift. The rotate by 1 instruction has the same latency as a shift.
Assembly/Compiler Coding Rule 34. (ML impact, L generality) Avoid ROTATE by register or ROTATE by immediate instructions. If possible, replace with a ROTATE by 1 instruction.
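For example (a sketch of the add-based alternative for a small left shift):

    ; shift form:
    shl  eax, 2          ; longer latency on the affected family 15 models
    ; add form, same result:
    add  eax, eax        ; EAX * 2
    add  eax, eax        ; EAX * 4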
3.5.1.5 Address Calculations
For computing addresses, use the addressing modes rather than general-purpose computations. Internally, memory reference instructions can have four operands:
• Relocatable load-time constant
• Immediate constant
• Base register
• Scaled index register
In the segmented model, a segment register may constitute an additional operand in the linear address calculation. In many cases, several integer instructions can be eliminated by fully using the operands of memory references.
3.5.1.6 Clearing Registers and Dependency Breaking Idioms
Code sequences that modify partial registers can experience delays in their dependency chains; these delays can be avoided by using dependency-breaking idioms.
In processors based on Intel Core microarchitecture, a number of instructions can help clear execution dependency when software uses these instructions to clear register content to zero. The instructions include:
XOR REG, REG
SUB REG, REG
XORPS/PD XMMREG, XMMREG
PXOR XMMREG, XMMREG
SUBPS/PD XMMREG, XMMREG
PSUBB/W/D/Q XMMREG, XMMREG
In Intel Core Solo and Intel Core Duo processors, the XOR, SUB, XORPS, or PXOR instructions can be used to clear execution dependencies on the zero evaluation of the destination register.
The Pentium 4 processor provides special support for XOR, SUB, and PXOR operations when executed within the same register. This recognizes that clearing a register does not depend on the old value of the register. The XORPS and XORPD instructions do not have this special support. They cannot be used to break dependence chains.
Assembly/Compiler Coding Rule 35. (M impact, ML generality) Use dependency-breaking-idiom instructions to set a register to 0, or to break a false dependence chain resulting from re-use of registers. In contexts where the condition codes must be preserved, move 0 into the register instead. This requires more code space than using XOR and SUB, but avoids setting the condition codes.
Example 3-16 shows the use of PXOR to break a dependency on an XMM register when performing negation on the elements of an array:

int a[4096], b[4096], c[4096];
for ( int i = 0; i < 4096; i++ )
    c[i] = - ( a[i] + b[i] );

Example 3-16. Clearing Register to Break Dependency While Negating Array Elements

Negation (-x = (x XOR (-1)) - (-1)) without breaking dependency:

    lea eax, a
    lea ecx, b
    lea edi, c
    xor edx, edx
    movdqa xmm7, allone
lp:
    movdqa xmm0, [eax + edx]
    paddd xmm0, [ecx + edx]
    pxor xmm0, xmm7
    psubd xmm0, xmm7
    movdqa [edi + edx], xmm0
    add edx, 16
    cmp edx, 4096
    jl lp

Negation (-x = 0 - x) using PXOR reg, reg to break dependency:

    lea eax, a
    lea ecx, b
    lea edi, c
    xor edx, edx
lp:
    movdqa xmm0, [eax + edx]
    paddd xmm0, [ecx + edx]
    pxor xmm7, xmm7
    psubd xmm7, xmm0
    movdqa [edi + edx], xmm7
    add edx, 16
    cmp edx, 4096
    jl lp
Assembly/Compiler Coding Rule 36. (M impact, MH generality) Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
On Pentium M processors, the MOVSX and MOVZX instructions both take a single µop, whether they move from a register or memory. On Pentium 4 processors, the MOVSX takes an additional µop. This is likely to cause less delay than the partial register update problem mentioned above, but the performance gain may vary. If the additional µop is a critical problem, MOVSX can sometimes be used as an alternative.
Sometimes sign-extended semantics can be maintained by zero-extending operands. For example, the C code in the following statements does not need sign extension, nor does it need prefixes for operand size overrides:

static short int a, b;
if (a == b) {
    . . .
}

Code for comparing these 16-bit operands might be:

MOVZX EAX, WORD PTR [a]
MOVZX EBX, WORD PTR [b]
CMP EAX, EBX

These circumstances tend to be common. However, the technique will not work if the compare is for greater than, less than, greater than or equal, and so on, or if the values in EAX or EBX are to be used in another operation where sign extension is required.
Assembly/Compiler Coding Rule 37. (M impact, M generality) Try to use zero extension or operate on 32-bit operands instead of using moves with sign extension.
The trace cache can be packed more tightly when instructions with operands that can only be represented as 32 bits are not adjacent.
Assembly/Compiler Coding Rule 38. (ML impact, L generality) Avoid placing instructions that use 32-bit immediates which cannot be encoded as sign-extended 16-bit immediates near each other. Try to schedule µops that have no immediate immediately before or after µops with 32-bit immediates.
3.5.1.7 Compares
Use TEST when comparing a value in a register with zero. TEST essentially ANDs operands together without writing to a destination register. TEST is preferred over AND because AND produces an extra result register. TEST is better than CMP ..., 0 because the instruction size is smaller.
Use TEST when comparing the result of a logical AND with an immediate constant for equality or inequality if the register is EAX, for cases such as:

IF (AVAR & 8) { }

The TEST instruction can also be used to detect rollover of modulo of a power of 2. For example, the C code:

IF ( (AVAR % 16) == 0 ) { }

can be implemented using:

TEST EAX, 0x0F
JNZ AfterIf

Using the TEST instruction between the instruction that may modify part of the flag register and the instruction that uses the flag register can also help prevent a partial flag register stall.
Assembly/Compiler Coding Rule 39. (ML impact, M generality) Use the TEST instruction instead of AND when the result of the logical AND is not used. This saves µops in execution. Use a TEST of a register with itself instead of a CMP of the register to zero; this saves the need to encode the zero and saves encoding space. Avoid comparing a constant to a memory operand. It is preferable to load the memory operand and compare the constant to a register.
Often a produced value must be compared with zero, and then used in a branch. Because most Intel architecture instructions set the condition codes as part of their execution, the compare instruction may be eliminated. Thus the operation can be tested directly by a JCC instruction. The notable exceptions are MOV and LEA. In these cases, use TEST.
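A hedged sketch of eliminating the compare (label names and the memory operand "value" are hypothetical):

; The ADD that produces the value already sets the flags,
; so no explicit compare with zero is needed:
add  eax, ebx
jz   sum_is_zero     ; branch directly on the flags

; MOV does not set flags; use TEST before the branch:
mov  eax, [value]
test eax, eax
jz   value_is_zero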
Assembly/Compiler Coding Rule 40. (ML impact, M generality) Eliminate unnecessary compare with zero instructions by using the appropriate conditional jump instruction when the flags are already set by a preceding arithmetic instruction. If necessary, use a TEST instruction instead of a compare. Be certain that any code transformations made do not introduce problems with overflow.
3.5.1.8 Using NOPs
Code generators generate a no-operation (NOP) to align instructions. Examples of NOPs of different lengths in 32-bit mode are shown below:

1-byte: XCHG EAX, EAX
2-byte: MOV REG, REG
3-byte: LEA REG, 0 (REG) (8-bit displacement)
4-byte: NOP DWORD PTR [EAX + 0] (8-bit displacement)
5-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement)
6-byte: LEA REG, 0 (REG) (32-bit displacement)
7-byte: NOP DWORD PTR [EAX + 0] (32-bit displacement)
8-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
9-byte: NOP WORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
These are all true NOPs, having no effect on the state of the machine except to advance the EIP. Because NOPs require hardware resources to decode and execute, use the fewest number to achieve the desired padding.
The one-byte NOP: [XCHG EAX, EAX] has special hardware support. Although it still consumes a µop and its accompanying resources, the dependence upon the old value of EAX is removed. This µop can be executed at the earliest possible opportunity, reducing the number of outstanding instructions, and is the lowest cost NOP.
The other NOPs have no special hardware support. Their input and output registers are interpreted by the hardware. Therefore, a code generator should arrange to use the register containing the oldest value as input, so that the NOP will dispatch and release RS resources at the earliest possible opportunity.
Try to observe the following NOP generation priority:
• Select the smallest number of NOPs and pseudo-NOPs to provide the desired padding.
• Select NOPs that are least likely to execute on slower execution unit clusters.
• Select the register arguments of NOPs to reduce dependencies.
3.5.1.9 Mixing SIMD Data Types
Previous microarchitectures (before Intel Core microarchitecture) do not have explicit restrictions on mixing integer and floating-point (FP) operations on XMM registers. For Intel Core microarchitecture, mixing integer and floating-point operations on the content of an XMM register can degrade performance. Software should avoid mixed use of integer/FP operations on XMM registers. Specifically:
• Use SIMD integer operations to feed SIMD integer operations. Use PXOR for idiom (see the sketch after this list).
• Use SIMD floating-point operations to feed SIMD floating-point operations. Use XORPS for idiom.
• When floating-point operations are bitwise equivalent, use PS data type instead of PD data type. MOVAPS and MOVAPD do the same thing, but MOVAPS takes one less byte to encode the instruction.
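A minimal sketch of keeping each chain in one domain (register choices are arbitrary):

; Integer SIMD chain: clear with the integer idiom PXOR.
pxor  xmm0, xmm0
paddd xmm0, xmm1

; Floating-point SIMD chain: clear with the FP idiom XORPS.
xorps xmm2, xmm2
addps xmm2, xmm3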
3.5.1.10 Spill Scheduling
The spill scheduling algorithm used by a code generator will be impacted by the memory subsystem. A spill scheduling algorithm is an algorithm that selects what values to spill to memory when there are too many live values to fit in registers. Consider the code in Example 3-17, where it is necessary to spill either A, B, or C.

Example 3-17. Spill Scheduling Code

LOOP
    C := ...
    B := ...
    A := A + ...

For modern microarchitectures, using dependence depth information in spill scheduling is even more important than in previous processors. The loop-carried dependence in A makes it especially important that A not be spilled. Not only would a store/load be placed in the dependence chain, but there would also be a data-not-ready stall of the load, costing further cycles.
Assembly/Compiler Coding Rule 41. (H impact, MH generality) For small loops, placing loop invariants in memory is better than spilling loop-carried dependencies.
A possibly counter-intuitive result is that in such a situation it is better to put loop invariants in memory than in registers, since loop invariants never have a load blocked by store data that is not ready.
3.5.2 Avoiding Stalls in Execution Core
Although the design of the execution core is optimized to make common cases execute quickly, a µop may encounter various hazards, delays, or stalls while making forward progress from the front end to the ROB and RS. The significant cases are:
• ROB Read Port Stalls
• Partial Register Reference Stalls
• Partial Updates to XMM Register Stalls
• Partial Flag Register Reference Stalls
3.5.2.1 ROB Read Port Stalls
As a µop is renamed, it determines whether its source operands have executed and been written to the reorder buffer (ROB), or whether they will be captured "in flight" in the RS or in the bypass network. Typically, the great majority of source operands are found to be "in flight" during renaming. Those that have been written back to the ROB are read through a set of read ports.
Since the Intel Core microarchitecture is optimized for the common case where the operands are "in flight", it does not provide a full set of read ports to enable all renamed µops to read all sources from the ROB in the same cycle.
When not all sources can be read, a µop can stall in the rename stage until it can get access to enough ROB read ports to complete renaming the µop. This stall is usually short-lived. Typically, a µop will complete renaming in the next cycle, but it appears to the application as a loss of rename bandwidth.
Some of the software-visible situations that can cause ROB read port stalls include:
• Registers that have become cold and require a ROB read port because execution units are doing other independent calculations
• Constants inside registers
• Pointer and index registers
In rare cases, ROB read port stalls may lead to more significant performance degradations. There are a couple of heuristics that can help prevent over-subscribing the ROB read ports:
• Keep common register usage clustered together. Multiple references to the same written-back register can be "folded" inside the out of order execution core.
• Keep dependency chains intact. This practice ensures that the registers will not have been written back when the new micro-ops are written to the RS.
These two scheduling heuristics may conflict with other more common scheduling heuristics. To reduce demand on the ROB read port, use these two heuristics only if both of the following situations are met:
• short latency operations
• indications of actual ROB read port stalls can be confirmed by measurements of the performance event (the relevant event is RAT_STALLS.ROB_READ_PORT, see Appendix A of the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B)
If the code has a long dependency chain, these two heuristics should not be used because they can cause the RS to fill, causing damage that outweighs the positive effects of reducing demands on the ROB read port.
3.5.2.2 Bypass between Execution Domains
Floating-point (FP) loads have an extra cycle of latency. Moves between FP and SIMD stacks have another additional cycle of latency.
Example:

ADDPS XMM0, XMM1
PAND XMM0, XMM3
ADDPS XMM2, XMM0

The overall latency for the above calculation is 9 cycles:
• 3 cycles for each ADDPS instruction
• 1 cycle for the PAND instruction
• 1 cycle to bypass between the ADDPS floating-point domain to the PAND integer domain
• 1 cycle to move the data from the PAND integer to the second floating-point ADDPS domain
To avoid this penalty, you should organize code to minimize domain changes. Sometimes you cannot avoid bypasses.
Account for bypass cycles when counting the overall latency of your code. If your calculation is latency-bound, you can execute more instructions in parallel or break dependency chains to reduce total latency.
Code that has many bypass domains and is completely latency-bound may run slower on the Intel Core microarchitecture than it did on previous microarchitectures.
3.5.2.3 Partial Register Stalls
General purpose registers can be accessed in granularities of bytes, words, and doublewords; 64-bit mode also supports quadword granularity. Referencing a portion of a register is referred to as a partial register reference.
A partial register stall happens when an instruction refers to a register, portions of which were previously modified by other instructions. For example, partial register stalls occur with a read to AX while previous instructions stored AL and AH, or a read to EAX while a previous instruction modified AX.
The delay of a partial register stall is small in processors based on Intel Core and NetBurst microarchitectures, and in Pentium M processors (with CPUID signature family 6, model 13), Intel Core Solo, and Intel Core Duo processors. Pentium M processors (CPUID signature with family 6, model 9) and the P6 family incur a large penalty.
Note that in Intel 64 architecture, an update to the lower 32 bits of a 64-bit integer register is architecturally defined to zero-extend the upper 32 bits. While this action may be logically viewed as a 32-bit update, it is really a 64-bit update (and therefore does not cause a partial stall).
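For example (a hedged sketch in 64-bit mode):

; Writing EAX zero-extends into all of RAX, so this is a full
; 64-bit update and incurs no partial-register stall:
mov eax, 1        ; RAX = 0000000000000001H
add rax, rbx      ; no dependence on the old upper 32 bits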
Referencing partial registers frequently produces code sequences with either false or real dependencies. Example 3-18 demonstrates a series of false and real dependencies caused by referencing partial registers.

Example 3-18. Dependencies Caused by Referencing Partial Registers

1: add ah, bh
2: add al, 3    ; Instruction 2 has a false dependency on 1
3: mov bl, al   ; depends on 2, but the dependence is real
4: mov ah, ch   ; Instruction 4 has a false dependency on 2
5: sar eax, 16  ; this wipes out the al/ah/ax part, so the
                ; result really doesn't depend on them programmatically,
                ; but the processor must deal with the real dependency on
                ; al/ah/ax
6: mov al, bl   ; Instruction 6 has a real dependency on 5
7: add ah, 13   ; Instruction 7 has a false dependency on 6
8: imul dl      ; Instruction 8 has a false dependency on 7
                ; because al is implicitly used
9: mov al, 17   ; Instruction 9 has a false dependency on 7
                ; and a real dependency on 8
10: imul cx     ; implicitly uses ax and writes to dx, hence
                ; a real dependency

If instructions 4 and 6 (in Example 3-18) are changed to use a MOVZX instruction instead of a MOV, then the dependences of instruction 4 on 2 (and transitively 1 before it), and instruction 6 on 5 are broken. This creates two independent chains of computation instead of one serial one.
Example 3-19 illustrates the use of MOVZX to avoid a partial register stall when packing three byte values into a register.

Example 3-19. Avoiding Partial Register Stalls in Integer Code

A Sequence Causing Partial Register Stall:

mov al, byte ptr a[2]
shl eax, 16
mov ax, word ptr a
movd mm0, eax
ret

Alternate Sequence Using MOVZX to Avoid Delay:

movzx eax, byte ptr a[2]
shl eax, 16
movzx ecx, word ptr a
or eax, ecx
movd mm0, eax
ret

3.5.2.4 Partial XMM Register Stalls
Partial register stalls can also apply to XMM registers. The following SSE and SSE2 instructions update only part of the destination register:

MOVL/HPD XMM, MEM64
MOVL/HPS XMM, MEM32
MOVSS/SD between registers
Using these instructions creates a dependency chain between the unmodified part of the register and the modified part of the register. This dependency chain can cause performance loss.
Example 3-20 illustrates the use of MOVSD for memory transactions and MOVAPD for register-to-register copies to avoid a partial XMM register stall.
Follow these recommendations to avoid stalls from partial updates to XMM registers:
• Avoid using instructions which update only part of the XMM register.
• If a 64-bit load is needed, use the MOVSD or MOVQ instruction.
• If two 64-bit loads are required to the same register from non-contiguous locations, use MOVSD/MOVHPD instead of MOVLPD/MOVHPD.
• When copying the XMM register, use the following instructions for full register copy, even if you only want to copy some of the source register data:
MOVAPS
MOVAPD
MOVDQA
Example 3-20. Avoiding Partial Register Stalls in SIMD Code

Using MOVLPD for memory transactions and MOVSD between register copies (causing partial register stall):

mov edx, x
mov ecx, count
movlpd xmm3, _1_
movlpd xmm2, _1pt5_
align 16
lp:
movlpd xmm0, [edx]
addsd xmm0, xmm3
movsd xmm1, xmm2
subsd xmm1, [edx]
mulsd xmm0, xmm1
movsd [edx], xmm0
add edx, 8
dec ecx
jnz lp

Using MOVSD for memory and MOVAPD between register copies (avoiding delay):

mov edx, x
mov ecx, count
movsd xmm3, _1_
movsd xmm2, _1pt5_
align 16
lp:
movsd xmm0, [edx]
addsd xmm0, xmm3
movapd xmm1, xmm2
subsd xmm1, [edx]
mulsd xmm0, xmm1
movsd [edx], xmm0
add edx, 8
dec ecx
jnz lp
3.5.2.5 Partial Flag Register Stalls
A partial flag register stall occurs when an instruction modifies a part of the flag register and the following instruction is dependent on the outcome of the flags. This happens most often with shift instructions (SAR, SAL, SHR, SHL). The flags are not modified in the case of a zero shift count, but the shift count is usually known only at execution time. The front end stalls until the instruction is retired.
Other instructions that can modify some part of the flag register include CMPXCHG8B, various rotate instructions, STC, and STD. An example of assembly with a partial flag register stall and alternative code without the stall is shown in Example 3-21.
In processors based on Intel Core microarchitecture, shift immediate by 1 is handled by special hardware such that it does not experience a partial flag stall.
3.5.2.6 Floating Point/SIMD Operands in Intel NetBurst microarchitecture
In processors based on Intel NetBurst microarchitecture, the latency of MMX or SIMD floating-point register-to-register moves is significant. This can have implications for register allocation.
Moves that write a portion of a register can introduce unwanted dependences. The MOVSD REG, REG instruction writes only the bottom 64 bits of a register, not all 128 bits. This introduces a dependence on the preceding instruction that produces the upper 64 bits (even if those bits are no longer wanted). The dependence inhibits register renaming, and thereby reduces parallelism.
Use MOVAPD as an alternative; it writes all 128 bits. Even though this instruction has a longer latency, the µops for MOVAPD use a different execution port and this port is more likely to be free. The change can impact performance. There may be exceptional cases where the latency matters more than the dependence or the execution port.
Example 3-21. Avoiding Partial Flag Register Stalls

A Sequence with Partial Flag Register Stall:

xor eax, eax
mov ecx, a
sar ecx, 2
setz al
; No partial register stall,
; but flag stall as SAR may
; change the flags

Alternate Sequence without Partial Flag Register Stall:

xor eax, eax
mov ecx, a
sar ecx, 2
test ecx, ecx
setz al
; No partial reg or flag stall,
; TEST always updates
; all the flags
Assembly/Compiler Coding Rule 42. (M impact, ML generality) Avoid introducing dependences with partial floating-point register writes, e.g. from the MOVSD XMMREG1, XMMREG2 instruction. Use the MOVAPD XMMREG1, XMMREG2 instruction instead.
The MOVSD XMMREG, MEM instruction writes all 128 bits and breaks a dependence.
The MOVUPD from memory instruction performs two 64-bit loads, but requires additional µops to adjust the address and combine the loads into a single register. This same functionality can be obtained using MOVSD XMMREG1, MEM; MOVSD XMMREG2, MEM+8; UNPCKLPD XMMREG1, XMMREG2, which uses fewer µops and can be packed into the trace cache more effectively. The latter alternative has been found to provide a several percent performance improvement in some cases. Its encoding requires more instruction bytes, but this is seldom an issue for the Pentium 4 processor. The store version of MOVUPD is complex and slow, so much so that the sequence with two MOVSD and a UNPCKHPD should always be used.
Assembly/Compiler Coding Rule 43. (ML impact, L generality) Instead of using MOVUPD XMMREG1, MEM for an unaligned 128-bit load, use MOVSD XMMREG1, MEM; MOVSD XMMREG2, MEM+8; UNPCKLPD XMMREG1, XMMREG2. If the additional register is not available, then use MOVSD XMMREG1, MEM; MOVHPD XMMREG1, MEM+8.
Assembly/Compiler Coding Rule 44. (M impact, ML generality) Instead of using MOVUPD MEM, XMMREG1 for a store, use MOVSD MEM, XMMREG1; UNPCKHPD XMMREG1, XMMREG1; MOVSD MEM+8, XMMREG1 instead.
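Rules 43 and 44 rendered as a hedged sketch ([mem] is a hypothetical unaligned 128-bit location):

; Unaligned 128-bit load without MOVUPD (Rule 43):
movsd    xmm1, qword ptr [mem]     ; low 64 bits
movsd    xmm2, qword ptr [mem+8]   ; high 64 bits
unpcklpd xmm1, xmm2                ; combine halves in XMM1

; Unaligned 128-bit store without MOVUPD (Rule 44):
movsd    qword ptr [mem], xmm1     ; store low 64 bits
unpckhpd xmm1, xmm1                ; copy high half into low half
movsd    qword ptr [mem+8], xmm1   ; store high 64 bits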
3.5.3 Vectorization
This section provides a brief summary of optimization issues related to vectorization. There is more detail in the chapters that follow.
Vectorization is a program transformation that allows special hardware to perform the same operation on multiple data elements at the same time. Successive processor generations have provided vector support through the MMX technology, Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2), Streaming SIMD Extensions 3 (SSE3) and Supplemental Streaming SIMD Extensions 3 (SSSE3).
Vectorization is a special case of SIMD, a term defined in Flynn's architecture taxonomy to denote a single instruction stream capable of operating on multiple data elements in parallel. The number of elements which can be operated on in parallel ranges from four single-precision floating-point data elements in Streaming SIMD Extensions and two double-precision floating-point data elements in Streaming SIMD Extensions 2 to sixteen byte operations in a 128-bit register in Streaming SIMD Extensions 2. Thus, vector length ranges from 2 to 16, depending on the instruction extensions used and on the data type.
The Intel C++ Compiler supports vectorization in three ways:
• The compiler may be able to generate SIMD code without intervention from the user.
• The user can insert pragmas to help the compiler realize that it can vectorize the code.
• The user can write SIMD code explicitly using intrinsics and C++ classes.
To help enable the compiler to generate SIMD code, avoid global pointers and global variables. These issues may be less troublesome if all modules are compiled simultaneously, and whole-program optimization is used.
User/Source Coding Rule 2. (H impact, M generality) Use the smallest possible floating-point or SIMD data type, to enable more parallelism with the use of a (longer) SIMD vector. For example, use single precision instead of double precision where possible.
User/Source Coding Rule 3. (M impact, ML generality) Arrange the nesting of loops so that the innermost nesting level is free of inter-iteration dependencies. Especially avoid the case where the store of data in an earlier iteration happens lexically after the load of that data in a future iteration, something which is called a lexically backward dependence.
The integer part of the SIMD instruction set extensions covers 8-bit, 16-bit and 32-bit operands. Not all SIMD operations are supported for 32 bits, meaning that some source code will not be able to be vectorized at all unless smaller operands are used.
User/Source Coding Rule 4. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches.
User/Source Coding Rule 5. (M impact, ML generality) Keep induction (loop) variable expressions simple.
3.5.4 Optimization of Partially Vectorizable Code
Frequently, a program contains a mixture of vectorizable code and some routines that are non-vectorizable. A common situation of partially vectorizable code involves a loop structure which includes mixtures of vectorized code and unvectorizable code. This situation is depicted in Figure 3-1.
It generally consists of five stages within the loop:
• Prolog
• Unpacking vectorized data structure into individual elements
• Calling a non-vectorizable routine to process each element serially
• Packing individual results into a vectorized data structure
• Epilog
This section discusses techniques that can reduce the cost and bottleneck associated with the packing/unpacking stages in such partially vectorizable code.
Example 3-22 shows a reference code template that is representative of partially vectorizable coding situations that also experience performance issues. The unvectorizable portion of code is represented generically by a sequence of calling a serial function named "foo" multiple times. This generic example is referred to as "shuffle with store forwarding", because the problem generally involves an unpacking stage that shuffles data elements between register and memory, followed by a packing stage that can experience store-forwarding issues.
There is more than one useful technique that can reduce the store-forwarding bottleneck between the serialized portion and the packing stage. The following sub-sections present alternate techniques to deal with the packing, unpacking, and parameter passing to serialized function calls.
Figure 3-1. Generic Program Flow of Partially Vectorized Code
(Packed SIMD instructions feed an unpacking stage, a serial routine of unvectorizable code, and a packing stage that feeds packed SIMD instructions.)
Example 3-22. Reference Code Template for Partially Vectorizable Program
// Prolog ///////////////////////////////
push ebp
mov ebp, esp
// Unpacking ////////////////////////////
sub ebp, 32
and ebp, 0xfffffff0
movaps [ebp], xmm0
// Serial operations on components ///////
sub ebp, 4
mov eax, [ebp+4]
mov [ebp], eax
call foo
mov [ebp+16+4], eax
mov eax, [ebp+8]
mov [ebp], eax
call foo
mov [ebp+16+4+4], eax
mov eax, [ebp+12]
mov [ebp], eax
call foo
mov [ebp+16+8+4], eax
mov eax, [ebp+12+4]
mov [ebp], eax
call foo
mov [ebp+16+12+4], eax
// Packing ///////////////////////////////
movaps xmm0, [ebp+16+4]
// Epilog ////////////////////////////////
pop ebp
ret
3.5.4.1 Alternate Packing Techniques
The packing method implemented in the reference code of Example 3-22 will experience delay as it assembles 4 doubleword results from memory into an XMM register due to store-forwarding restrictions.
Three alternate techniques for packing, using different SIMD instructions to assemble contents in XMM registers, are shown in Example 3-23. All three techniques avoid store-forwarding delay by satisfying the restrictions on data sizes between a preceding store and subsequent load operations.
3.5.4.2 Simplifying Result Passing
In Example 3-22, individual results were passed to the packing stage by storing to contiguous memory locations. Instead of using memory spills to pass four results, result passing may be accomplished by using one or more registers. Using registers to simplify result passing and reduce memory spills can improve performance by varying degrees depending on the register pressure at runtime.
Example 3-24 shows the coding sequence that uses four extra XMM registers to reduce all memory spills of passing results back to the parent routine. However, software must observe the following conditions when using this technique:
• There is no register shortage.
• If the loop does not have many stores or loads but has many computations, this technique does not help performance. This technique adds work to the computational units, while the store and load ports are idle.
Example 3-23. Three Alternate Packing Methods for Avoiding Store Forwarding Difficulty

Packing Method 1:

movd xmm0, [ebp+16+4]
movd xmm1, [ebp+16+8]
movd xmm2, [ebp+16+12]
movd xmm3, [ebp+12+16+4]
punpckldq xmm0, xmm1
punpckldq xmm2, xmm3
punpckldq xmm0, xmm2

Packing Method 2:

movd xmm0, [ebp+16+4]
movd xmm1, [ebp+16+8]
movd xmm2, [ebp+16+12]
movd xmm3, [ebp+12+16+4]
psllq xmm3, 32
orps xmm2, xmm3
psllq xmm1, 32
orps xmm0, xmm1
movlhps xmm0, xmm2

Packing Method 3:

movd xmm0, [ebp+16+4]
movd xmm1, [ebp+16+8]
movd xmm2, [ebp+16+12]
movd xmm3, [ebp+12+16+4]
movlhps xmm1, xmm3
psllq xmm1, 32
movlhps xmm0, xmm2
orps xmm0, xmm1
Example 3-24. Using Four Registers to Reduce Memory Spills and Simplify Result Passing
mov eax, [ebp+4]
mov [ebp], eax
call foo
movd xmm0, eax
mov eax, [ebp+8]
mov [ebp], eax
call foo
movd xmm1, eax
mov eax, [ebp+12]
mov [ebp], eax
call foo
movd xmm2, eax
mov eax, [ebp+12+4]
mov [ebp], eax
call foo
movd xmm3, eax
3.5.4.3 Stack Optimization
In Example 3-22, an input parameter was copied in turn onto the stack and passed to the non-vectorizable routine for processing. The parameter passing from consecutive memory locations can be simplified by the technique shown in Example 3-25.
Stack optimization can only be used when:
• The serial operations are function calls. The function "foo" is declared as: INT FOO(INT A). The parameter is passed on the stack.
• The order of operation on the components is from last to first.
Note the call to FOO and the advance of EBP when passing the vector elements to FOO one by one from last to first.

Example 3-25. Stack Optimization Technique to Simplify Parameter Passing

call foo
mov [ebp+16], eax

add ebp, 4
call foo
mov [ebp+16], eax

add ebp, 4
call foo
mov [ebp+16], eax

add ebp, 4
call foo
3.5.4.4 Tuning Considerations
Tuning considerations for situations represented by looping of Example 3-22 include:
• Applying one or more of the following combinations:
— choose an alternate packing technique
— consider a technique to simplify result passing
— consider the stack optimization technique to simplify parameter passing
• Minimizing the average number of cycles to execute one iteration of the loop
• Minimizing the per-iteration cost of the unpacking and packing operations
The speed improvement from using the techniques discussed in this section will vary, depending on the choice of combinations implemented and the characteristics of the non-vectorizable routine. For example, if the routine "foo" is short (representative of tight, short loops), the per-iteration cost of unpacking/packing tends to be smaller than in situations where the non-vectorizable code contains longer operations or many dependencies. This is because many iterations of a short, tight loop can be in flight in the execution core, so the per-iteration cost of packing and unpacking is only partially exposed and appears to cause very little performance degradation.
Evaluation of the per-iteration cost of packing/unpacking should be carried out in a methodical manner over a selected number of test cases, where each case may implement some combination of the techniques discussed in this section. The per-iteration cost can be estimated by:
• evaluating the average cycles to execute one iteration of the test case
• evaluating the average cycles to execute one iteration of a baseline loop sequence of non-vectorizable code
Example 3-26 shows the baseline code sequence that can be used to estimate the average cost of a loop that executes non-vectorizable routines.
The average per-iteration cost of packing/unpacking can be derived from measuring the execution times of a large number of iterations by:

((Cycles to run TestCase) - (Cycles to run equivalent baseline sequence)) / (Iteration count)
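For illustration only (with made-up numbers): if a test case executes 10,000 iterations in 85,000 cycles and the equivalent baseline sequence executes the same 10,000 iterations in 68,000 cycles, the estimated per-iteration cost of packing/unpacking is (85,000 - 68,000) / 10,000 = 1.7 cycles.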
Example 3-26. Base Line Code Sequence to Estimate Loop Overhead
push ebp
mov ebp, esp
sub ebp, 4
mov [ebp], edi
call foo
mov [ebp], edi
call foo
mov [ebp], edi
call foo
mov [ebp], edi
call foo
add ebp, 4
pop ebp
ret
For example, using a simple function that returns an input parameter (representative of tight, short loops), the per-iteration cost of packing/unpacking may range from slightly more than 7 cycles (the "shuffle with store forwarding" case, Example 3-22) to ~0.9 cycles (accomplished by several test cases). Across 27 test cases (consisting of one of the alternate packing methods, no result-simplification/simplification of either 1 or 4 results, no stack optimization or with stack optimization), the average per-iteration cost of packing/unpacking is about 1.7 cycles.
Generally speaking, packing methods 2 and 3 (see Example 3-23) tend to be more robust than packing method 1; the optimal choice of simplifying 1 or 4 results will be affected by register pressure at runtime and other relevant microarchitectural conditions.
Note that the numeric discussion of per-iteration cost of packing/unpacking is illustrative only. It will vary with test cases using a different baseline code sequence and will generally increase if the non-vectorizable routine requires a longer time to execute, because the number of loop iterations that can reside in flight in the execution core decreases.
3.6 OPTIMIZING MEMORY ACCESSES
This section discusses guidelines for optimizing code and data memory accesses. The most important recommendations are:
• Execute load and store operations within available execution bandwidth.
• Enable forward progress of speculative execution.
• Enable store forwarding to proceed.
• Align data, paying attention to data layout and stack alignment.
• Place code and data on separate pages.
• Enhance data locality.
• Use prefetching and cacheability control instructions.
• Enhance code locality and align branch targets.
• Take advantage of write combining.
Alignment and forwarding problems are among the most common sources of large delays on processors based on Intel NetBurst microarchitecture.
3.6.1 Load and Store Execution Bandwidth
Typically, loads and stores are the most frequent operations in a workload; it is not uncommon for up to 40% of the instructions in a workload to carry load or store intent. Each generation of microarchitecture provides multiple buffers to support executing load and store operations while there are instructions in flight.
Software can maximize memory performance by not exceeding the issue or buffering limitations of the machine. In the Intel Core microarchitecture, only 20 stores and 32 loads may be in flight at once. Since only one load can issue per cycle, algorithms which operate on two arrays are constrained to one operation every other cycle unless you use programming tricks to reduce the amount of memory usage.
Intel NetBurst microarchitecture has the same number of store buffers, slightly more load buffers and similar throughput of issuing load operations. Intel Core Duo and Intel Core Solo processors have fewer buffers. Nevertheless the general heuristic applies to all of them.
3.6.2 Enhance Speculative Execution and Memory Disambiguation
Prior to Intel Core microarchitecture, when code contains both stores and loads, the loads cannot be issued before the address of the store is resolved. This rule ensures correct handling of load dependencies on preceding stores.
The Intel Core microarchitecture contains a mechanism that allows some loads to be issued early speculatively. The processor later checks if the load address overlaps with a store. If the addresses do overlap, then the processor re-executes the instructions.
Example 3-27 illustrates a situation in which the compiler cannot be sure that "Ptr->Array" does not change during the loop. Therefore, the compiler cannot keep "Ptr->Array" in a register as an invariant and must read it again in every iteration. Although this situation can be fixed in software by rewriting the code to require that the address of the pointer is invariant, memory disambiguation provides this performance gain without rewriting the code.
Example 3-27. Loads Blocked by Stores of Unknown Address

C code:

struct AA {
    AA ** Array;
};
void nullify_array ( AA *Ptr, DWORD Index, AA *ThisPtr )
{
    while ( Ptr->Array[--Index] != ThisPtr )
    {
        Ptr->Array[Index] = NULL ;
    } ;
} ;

Assembly sequence:

nullify_loop:
mov dword ptr [eax], 0
mov edx, dword ptr [edi]
sub ecx, 4
cmp dword ptr [ecx+edx], esi
lea eax, [ecx+edx]
jne nullify_loop
3.6.3 Alignment
Alignment of data concerns all kinds of variables:
• Dynamically allocated variables
• Members of a data structure
• Global or local variables
• Parameters passed on the stack
Misaligned data access can incur significant performance penalties. This is particularly true for cache line splits. The size of a cache line is 64 bytes in the Pentium 4 and other recent Intel processors, including processors based on Intel Core microarchitecture.
An access to data unaligned on a 64-byte boundary leads to two memory accesses and requires several µops to be executed (instead of one). Accesses that span 64-byte boundaries are likely to incur a large performance penalty, and the cost of each stall is generally greater on machines with longer pipelines.
Double-precision floating-point operands that are eight-byte aligned have better performance than operands that are not eight-byte aligned, since they are less likely to incur penalties for cache and MOB splits. Floating-point operations on memory operands require that the operand be loaded from memory. This incurs an additional µop, which can have a minor negative impact on front end bandwidth. Additionally, memory operands may cause a data cache miss, causing a penalty.
Assembly/Compiler Coding Rule 45. (H impact, H generality) Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries.
For best performance, align data as follows:
• Align 8-bit data at any address.
• Align 16-bit data to be contained within an aligned 4-byte word.
• Align 32-bit data so that its base address is a multiple of four.
• Align 64-bit data so that its base address is a multiple of eight.
• Align 80-bit data so that its base address is a multiple of sixteen.
• Align 128-bit data so that its base address is a multiple of sixteen.
A 64-byte or greater data structure or array should be aligned so that its base address is a multiple of 64. Sorting data in decreasing size order is one heuristic for assisting with natural alignment. As long as 16-byte boundaries (and cache lines) are never crossed, natural alignment is not strictly necessary (though it is an easy way to enforce this).
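A minimal sketch, in MASM-style assembly, of forcing the 16-byte alignment needed for vector loads (the data name is hypothetical):

align 16                 ; place the following data on a 16-byte boundary
vec_data dd 1, 2, 3, 4   ; 16 bytes, safe for MOVAPS/MOVDQA accesses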
Example 3-28 shows the type of code that can cause a cache line split. The code loads the addresses of two DWORD arrays. 029E70FEH is not a 4-byte-aligned address, so a 4-byte access at this address will get 2 bytes from the cache line this address is contained in, and 2 bytes from the cache line that starts at 029E7100H. On processors with 64-byte cache lines, a similar cache line split will occur every 8 iterations.
Figure 3-2 illustrates the situation of accessing a data element that spans cache line boundaries.
Alignment of code is less important for processors based on Intel NetBurst microarchitecture. Alignment of branch targets to maximize bandwidth of fetching cached instructions is an issue only when not executing out of the trace cache.
Alignment of code can be an issue for the Pentium M, Intel Core Duo and Intel Core 2 Duo processors. Alignment of branch targets will improve decoder throughput.
Example 3-28. Code That Causes Cache Line Split
mov esi, 029e70feh
mov edi, 05be5260h
Blockmove:
mov eax, DWORD PTR [esi]
mov ebx, DWORD PTR [esi+4]
mov DWORD PTR [edi], eax
mov DWORD PTR [edi+4], ebx
add esi, 8
add edi, 8
sub edx, 1
jnz Blockmove
Figure 3-2. Cache Line Split in Accessing Elements in an Array
(The figure shows a 4-byte element at address 029E70FEH spanning the cache lines at 029E70C0H and 029E7100H.)
3.6.4 Store Forwarding
The processor's memory system only sends stores to memory (including cache) after store retirement. However, store data can be forwarded from a store to a subsequent load from the same address to give a much shorter store-load latency.
There are two kinds of requirements for store forwarding. If these requirements are violated, store forwarding cannot occur and the load must get its data from the cache (so the store must write its data back to the cache first). This incurs a penalty that is largely related to the pipeline depth of the underlying micro-architecture.
The first requirement pertains to the size and alignment of the store-forwarding data. This restriction is likely to have high impact on overall application performance. Typically, a performance penalty due to violating this restriction can be prevented. The store-to-load forwarding restrictions vary from one microarchitecture to another. Several examples of coding pitfalls that cause store-forwarding stalls and solutions to these pitfalls are discussed in detail in Section 3.6.4.1, "Store-to-Load-Forwarding Restriction on Size and Alignment". The second requirement is the availability of data, discussed in Section 3.6.4.2, "Store-forwarding Restriction on Data Availability". A good practice is to eliminate redundant load operations.
It may be possible to keep a temporary scalar variable in a register and never write it to memory. Generally, such a variable must not be accessible using indirect pointers. Moving a variable to a register eliminates all loads and stores of that variable and eliminates potential problems associated with store forwarding. However, it also increases register pressure.
Load instructions tend to start chains of computation. Since the out-of-order engine is based on data dependence, load instructions play a significant role in the engine's ability to execute at a high rate. Eliminating loads should be given a high priority.
If a variable does not change between the time when it is stored and the time when it is used again, the register that was stored can be copied or used directly. If register pressure is too high, or an unseen function is called before the store and the second load, it may not be possible to eliminate the second load.
Assembly/Compiler Coding Rule 46. (H impact, M generality) Pass parameters in registers instead of on the stack where possible. Passing arguments on the stack requires a store followed by a reload. While this sequence is optimized in hardware by providing the value to the load directly from the memory order buffer without the need to access the data cache if permitted by store-forwarding restrictions, floating-point values incur a significant latency in forwarding. Passing floating-point arguments in (preferably XMM) registers should save this long latency operation.
Parameter passing conventions may limit the choice of which parameters are passed in registers and which are passed on the stack. However, these limitations may be overcome if the compiler has control of the compilation of the whole binary (using whole-program optimization).
3.6.4.1 Store-to-Load-Forwarding Restriction on Size and Alignment
Data size and alignment restrictions for store-forwarding apply to processors based on Intel NetBurst microarchitecture, Intel Core microarchitecture, Intel Core 2 Duo, Intel Core Solo and Pentium M processors. The performance penalty for violating store-forwarding restrictions is less for shorter-pipelined machines than for Intel NetBurst microarchitecture.
Store-forwarding restrictions vary with each microarchitecture. Intel NetBurst microarchitecture places more constraints than Intel Core microarchitecture on code generation to enable store-forwarding to make progress instead of experiencing stalls. Fixing store-forwarding problems for Intel NetBurst microarchitecture generally also avoids problems on Pentium M, Intel Core Duo and Intel Core 2 Duo processors. The size and alignment restrictions for store-forwarding in processors based on Intel NetBurst microarchitecture are illustrated in Figure 3-3.
Figure 3-3. Size and Alignment Restrictions in Store Forwarding
(The figure illustrates four cases: (a) a small load after a large store forwards only when the load is aligned with the store, otherwise it incurs a non-forwarding penalty; (b) a load whose size is greater than or equal to the store incurs a penalty; (c) a load whose size is greater than or equal to multiple smaller stores incurs a penalty; (d) a 128-bit forward must be 16-byte aligned, otherwise it incurs a penalty.)
The following rules help satisfy size and alignment restrictions for store forwarding:
Assembly/Compiler Coding Rule 47. (H impact, M generality) A load that forwards from a store must have the same address start point and therefore the same alignment as the store data.
Assembly/Compiler Coding Rule 48. (H impact, M generality) The data of a load which is forwarded from a store must be completely contained within the store data.
A load that forwards from a store must wait for the store's data to be written to the store buffer before proceeding, but other, unrelated loads need not wait.
Assembly/Compiler Coding Rule 49. (H impact, ML generality) If it is necessary to extract a non-aligned portion of stored data, read out the smallest aligned portion that completely contains the data and shift/mask the data as necessary. This is better than incurring the penalties of a failed store-forward.
Assembly/Compiler Coding Rule 50. (MH impact, ML generality) Avoid several small loads after large stores to the same area of memory by using a single large read and register copies as needed.
Example 3-29 depicts several store-forwarding situations in which small loads follow large stores. The first three load operations illustrate the situations described in Rule 50. However, the last load operation gets data from store-forwarding without problem.
Example 3-30 illustrates a store-forwarding situation in which a large load follows several small stores. The data needed by the load operation cannot be forwarded because all of the data that needs to be forwarded is not contained in the store buffer. Avoid large loads after small stores to the same area of memory.
Example 3-29. Situations Showing Small Loads After Large Store

mov [EBP], 'abcd'
mov AL, [EBP]     ; Not blocked - same alignment
mov BL, [EBP + 1] ; Blocked
mov CL, [EBP + 2] ; Blocked
mov DL, [EBP + 3] ; Blocked
mov AL, [EBP]     ; Not blocked - same alignment
                  ; n.b. passes older blocked loads
Example 3-31 illustrates a stalled store-forwarding situation that may appear in compiler generated code. Sometimes a compiler generates code similar to that shown in Example 3-31 to handle a spilled byte to the stack and convert the byte to an integer value.
Example 3-32 offers two alternatives to avoid the non-forwarding situation shown in Example 3-31.
When moving data that is smaller than 64 bits between memory locations, 64-bit or 128-bit SIMD register moves are more efficient (if aligned) and can be used to avoid unaligned loads. Although floating-point registers allow the movement of 64 bits at a time, floating-point instructions should not be used for this purpose, as data may be inadvertently modified.
Example 3-30. Non-forwarding Example of Large Load After Small Store

mov [EBP], 'a'
mov [EBP + 1], 'b'
mov [EBP + 2], 'c'
mov [EBP + 3], 'd'
mov EAX, [EBP] ; Blocked
; The first 4 small stores can be consolidated into
; a single DWORD store to prevent this non-forwarding
; situation.
Example 3-31. A Non-forwarding Situation in Compiler Generated Code

mov DWORD PTR [esp+10h], 00000000h
mov BYTE PTR [esp+10h], bl
mov eax, DWORD PTR [esp+10h] ; Stall
and eax, 0xff                ; Converting back to byte value

Example 3-32. Two Ways to Avoid Non-forwarding Situation in Example 3-31

; A. Use MOVZX instruction to avoid large load after small
;    store, when spills are ignored.
movzx eax, bl                 ; Replaces the last three instructions
; B. Use MOVZX instruction and handle spills to the stack
mov DWORD PTR [esp+10h], 00000000h
mov BYTE PTR [esp+10h], bl
movzx eax, BYTE PTR [esp+10h] ; Not blocked
As an additional example, consider the cases in Example 3-33.
In the first case (A), there is a large load after a series of small stores to the same area of memory (beginning at memory address MEM). The large load will stall.
The FLD must wait for the stores to write to memory before it can access all the data it requires. This stall can also occur with other data types (for example, when bytes or words are stored and then words or doublewords are read from the same area of memory).
In the second case (B), there is a series of small loads after a large store to the same area of memory (beginning at memory address MEM). The small loads will stall.
The word loads must wait for the quadword store to write to memory before they can access the data they require. This stall can also occur with other data types (for example, when doublewords or words are stored and then words or bytes are read from the same area of memory). This can be avoided by moving the store as far from the loads as possible.
Store forwarding restrictions for processors based on Intel Core microarchitecture are listed in Table 3-1.
Example 3-33. Large and Small Load Stalls

; A. Large load stall
mov mem, eax     ; Store dword to address "MEM"
mov mem + 4, ebx ; Store dword to address "MEM + 4"
fld mem          ; Load qword at address "MEM", stalls
; B. Small load stall
fstp mem         ; Store qword to address "MEM"
mov bx, mem+2    ; Load word at address "MEM + 2", stalls
mov cx, mem+4    ; Load word at address "MEM + 4", stalls
Table 3-1. Store Forwarding Restrictions of Processors Based on Intel Core Microarchitecture

Store Alignment         | Width of Store (bits) | Load Alignment (byte) | Width of Load (bits) | Store Forwarding Restriction
To Natural size         | 16  | word aligned        | 8, 16      | not stalled
To Natural size         | 16  | not word aligned    | 8          | stalled
To Natural size         | 32  | dword aligned       | 8, 32      | not stalled
To Natural size         | 32  | not dword aligned   | 8          | stalled
To Natural size         | 32  | word aligned        | 16         | not stalled
To Natural size         | 32  | not word aligned    | 16         | stalled
To Natural size         | 64  | qword aligned       | 8, 16, 64  | not stalled
To Natural size         | 64  | not qword aligned   | 8, 16      | stalled
To Natural size         | 64  | dword aligned       | 32         | not stalled
To Natural size         | 64  | not dword aligned   | 32         | stalled
To Natural size         | 128 | dqword aligned      | 8, 16, 128 | not stalled
To Natural size         | 128 | not dqword aligned  | 8, 16      | stalled
To Natural size         | 128 | dword aligned       | 32         | not stalled
To Natural size         | 128 | not dword aligned   | 32         | stalled
To Natural size         | 128 | qword aligned       | 64         | not stalled
To Natural size         | 128 | not qword aligned   | 64         | stalled
Unaligned, start byte 1 | 32  | byte 0 of store     | 8, 16, 32  | not stalled
Unaligned, start byte 1 | 32  | not byte 0 of store | 8, 16      | stalled
Unaligned, start byte 1 | 64  | byte 0 of store     | 8, 16, 32  | not stalled
Unaligned, start byte 1 | 64  | not byte 0 of store | 8, 16, 32  | stalled
Unaligned, start byte 1 | 64  | byte 0 of store     | 64         | stalled
Unaligned, start byte 7 | 32  | byte 0 of store     | 8          | not stalled
Unaligned, start byte 7 | 32  | not byte 0 of store | 8          | not stalled
Unaligned, start byte 7 | 32  | don't care          | 16, 32     | stalled
Unaligned, start byte 7 | 64  | don't care          | 16, 32, 64 | stalled
3.6.4.2 Store-forwarding Restriction on Data Availability
The value to be stored must be available before the load operation can be completed. If this restriction is violated, the execution of the load will be delayed until the data is available. This delay causes some execution resources to be used unnecessarily, and that can lead to sizable but non-deterministic delays. However, the overall impact of this problem is much smaller than that from violating size and alignment requirements.
In processors based on Intel NetBurst microarchitecture, hardware predicts when loads are dependent on and get their data forwarded from preceding stores. These predictions can significantly improve performance. However, if a load is scheduled too soon after the store it depends on, or if the generation of the data to be stored is delayed, there can be a significant penalty.
There are several cases in which data is passed through memory, and the store may need to be separated from the load:
• Spills, save and restore registers in a stack frame
• Parameter passing
• Global and volatile variables
• Type conversion between integer and floating point
• When compilers do not analyze code that is inlined, forcing variables that are involved in the interface with inlined code to be in memory, creating more memory variables and preventing the elimination of redundant loads
Assembly/Compiler Coding Rule 51. (H impact, MH generality) Where it is possible to do so without incurring other penalties, prioritize the allocation of variables to registers, as in register allocation and for parameter passing, to minimize the likelihood and impact of store-forwarding problems. Try not to store-forward data generated from a long latency instruction - for example, MUL or DIV. Avoid store-forwarding data for variables with the shortest store-load distance. Avoid store-forwarding data for variables with many and/or long dependence chains, and especially avoid including a store forward on a loop-carried dependence chain.
Example 3-34 shows an example of a loop-carried dependence chain.

Example 3-34. Loop-carried Dependence Chain

for ( i = 0; i < MAX; i++ ) {
    a[i] = b[i] * foo;
    foo = a[i] / 3;
} // foo is a loop-carried dependence.

Assembly/Compiler Coding Rule 52. (M impact, MH generality) Calculate store addresses as early as possible to avoid having stores block loads.
3.6.5 Data Layout Optimizations
User/Source Coding Rule 6. (H impact, M generality) Pad data structures
defined in the source code so that every data element is aligned to a natural
operand size address boundary.
If the operands are packed in a SIMD instruction, align to the packed element size
(64-bit or 128-bit).
Align data by providing padding inside structures and arrays. Programmers can reor-
ganize structures and arrays to minimize the amount of memory wasted by padding.
However, compilers might not have this freedom. The C programming language, for
example, specifies the order in which structure elements are allocated in memory. For
more information, see Section 4.4, "Stack and Data Alignment," and Appendix D,
"Stack Alignment."
Example 3-34. Loop-carried Dependence Chain
for ( i = 0; i < MAX; i++ ) {
a[i] = b[i] * foo;
foo = a[i] / 3;
} // foo is a loop-carried dependence.
Example 3-35 shows how a data structure could be rearranged to reduce its size.
The 64-byte cache line size can impact streaming applications (for example, multi-
media). Such applications reference and use data only once before discarding it. Data
accesses which sparsely utilize the data within a cache line can result in less efficient
utilization of system memory bandwidth. For example, arrays of structures can be
decomposed into several arrays to achieve better packing, as shown in Example 3-36.
Example 3-35. Rearranging a Data Structure
struct unpacked { /* Fits in 20 bytes due to padding */
int a;
char b;
int c;
char d;
int e;
};
struct packed { /* Fits in 16 bytes */
int a;
int c;
int e;
char b;
char d;
};
Example 3-36. Decomposing an Array
struct { /* 1600 bytes */
   int a, c, e;
   char b, d;
} array_of_struct [100];

struct { /* 1400 bytes */
   int a[100], c[100], e[100];
   char b[100], d[100];
} struct_of_array;

struct { /* 1200 bytes */
   int a, c, e;
} hybrid_struct_of_array_ace[100];

struct { /* 200 bytes */
   char b, d;
} hybrid_struct_of_array_bd[100];
The efficiency of such optimizations depends on usage patterns. If the elements of
the structure are all accessed together but the access pattern of the array is random,
then ARRAY_OF_STRUCT avoids unnecessary prefetch even though it wastes
memory.
However, if the access pattern of the array exhibits locality (for example, if the array
index is being swept through) then processors with hardware prefetchers will
prefetch data from STRUCT_OF_ARRAY, even if the elements of the structure are
accessed together.
When the elements of the structure are not accessed with equal frequency, such as
when element A is accessed ten times more often than the other entries, then
STRUCT_OF_ARRAY not only saves memory, but it also prevents fetching unneces-
sary data items B, C, D, and E.
Using STRUCT_OF_ARRAY also enables the use of the SIMD data types by the
programmer and the compiler.
Note that STRUCT_OF_ARRAY can have the disadvantage of requiring more indepen-
dent memory stream references. This can require the use of more prefetches and
additional address generation calculations. It can also have an impact on DRAM page
access efficiency. An alternative, HYBRID_STRUCT_OF_ARRAY, blends the two
approaches. In this case, only 2 separate address streams are generated and refer-
enced: 1 for HYBRID_STRUCT_OF_ARRAY_ACE and 1 for
HYBRID_STRUCT_OF_ARRAY_BD. The second alternative also prevents fetching
unnecessary data, assuming that (1) the variables A, C and E are always used
together, and (2) the variables B and D are always used together, but not at the same
time as A, C and E.
The hybrid approach ensures:
• Simpler/fewer address generations than STRUCT_OF_ARRAY
• Fewer streams, which reduces DRAM page misses
• Fewer prefetches due to fewer streams
• Efficient cache line packing of data elements that are used concurrently
Assembly/Compiler Coding Rule 53. (H impact, M generality) Try to arrange
data structures such that they permit sequential access.
If the data is arranged into a set of streams, the automatic hardware prefetcher can
prefetch data that will be needed by the application, reducing the effective memory
latency. If the data is accessed in a non-sequential manner, the automatic hardware
prefetcher cannot prefetch the data. The prefetcher can recognize up to eight
concurrent streams. See Chapter 9, "Optimizing Cache Usage," for more information
on the hardware prefetcher.
On Intel Core 2 Duo, Intel Core Duo, Intel Core Solo, Pentium 4, Intel Xeon and
Pentium M processors, memory coherence is maintained on 64-byte cache lines
(rather than 32-byte cache lines, as in earlier processors). This can increase the
opportunity for false sharing.
User/Source Coding Rule 7. (M impact, L generality) Beware of false sharing
within a cache line (64 bytes) and within a sector of 128 bytes on processors based
on Intel NetBurst microarchitecture.
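A common remedy is to pad each thread-private data element out to a full cache
line, so two threads never write to the same line. The following is a minimal sketch;
the structure name and the per-thread counter usage are illustrative assumptions,
not taken from this manual.

struct padded_counter {           /* one counter per thread */
    volatile long count;
    char pad[64 - sizeof(long)];  /* pad to the 64-byte cache line size */
};

struct padded_counter counters[2]; /* thread 0 and thread 1 now update
                                      disjoint cache lines, avoiding
                                      false sharing */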
3.6.6 Stack Alignment
The easiest way to avoid stack alignment problems is to keep the stack aligned at all
times. For example, a language that supports 8-bit, 16-bit, 32-bit, and 64-bit data
quantities but never uses 80-bit data quantities can require the stack to always be
aligned on a 64-bit boundary.
Assembly/Compiler Coding Rule 54. (H impact, M generality) If 64-bit data is
ever passed as a parameter or allocated on the stack, make sure that the stack is
aligned to an 8-byte boundary.
Doing this will require using a general purpose register (such as EBP) as a frame
pointer. The trade-off is between causing unaligned 64-bit references (if the stack is
not aligned) and causing extra general purpose register spills (if the stack is aligned).
Note that a performance penalty is caused only when an unaligned access splits a
cache line. This means that one out of eight spatially consecutive unaligned accesses
is always penalized.
A routine that makes frequent use of 64-bit data can avoid stack misalignment by
placing the code described in Example 3-37 in the function prologue and epilogue.
Example 3-37. Dynamic Stack Alignment
prologue:
subl esp, 8            ; make room for frame ptr and old stack ptr
movl [esp+4], ebp      ; save frame ptr
movl ebp, esp
andl ebp, 0xFFFFFFF8   ; new frame pointer, aligned to 8 bytes (64 bits)
movl [ebp], esp        ; save old stack ptr
subl esp, FRAMESIZE    ; allocate space
; ... callee saves, etc.

epilogue:
; ... callee restores, etc.
movl esp, [ebp]        ; restore old stack ptr
movl ebp, [esp+4]      ; restore frame ptr
addl esp, 8
ret
If for some reason it is not possible to align the stack for 64-bit data, the routine
should access the parameter and save it into a register or known aligned storage,
thus incurring the penalty only once.
3.6.7 Capacity Limits and Aliasing in Caches
There are cases in which addresses with a given stride will compete for some
resource in the memory hierarchy.
Typically, caches are implemented to have multiple ways of set associativity, with
each way consisting of multiple sets of cache lines (or sectors in some cases).
Multiple memory references that compete for the same set of each way in a cache
can cause a capacity issue. There are aliasing conditions that apply to specific
microarchitectures. Note that first-level cache lines are 64 bytes. Thus, the least
significant 6 bits are not considered in alias comparisons. For processors based on
Intel NetBurst microarchitecture, data is loaded into the second level cache in a
sector of 128 bytes, so the least significant 7 bits are not considered in alias compar-
isons.
3.6.7.1 Capacity Limits in Set-Associative Caches
Capacity limits may be reached if the number of outstanding memory references that
are mapped to the same set in each way of a given cache exceeds the number of
ways of that cache. The conditions that apply to the first-level data cache and second
level cache are listed below:
• L1 Set Conflicts — Multiple references map to the same first-level cache set.
  The conflicting condition is a stride determined by the size of the cache in bytes,
  divided by the number of ways. These competing memory references can cause
  excessive cache misses only if the number of outstanding memory references
  exceeds the number of ways in the working set:
  — On Pentium 4 and Intel Xeon processors with a CPUID signature of family
    encoding 15, model encoding of 0, 1, or 2; there will be an excess of first-
    level cache misses for more than 4 simultaneous competing memory
    references to addresses with 2-KByte modulus.
  — On Pentium 4 and Intel Xeon processors with a CPUID signature of family
    encoding 15, model encoding 3; there will be an excess of first-level cache
    misses for more than 8 simultaneous competing references to addresses that
    are apart by 2-KByte modulus.
  — On Intel Core 2 Duo, Intel Core Duo, Intel Core Solo, and Pentium M
    processors, there will be an excess of first-level cache misses for more than 8
    simultaneous references to addresses that are apart by 4-KByte modulus.
• L2 Set Conflicts — Multiple references map to the same second-level cache set.
  The conflicting condition is also determined by the size of the cache or the
  number of ways:
  — On Pentium 4 and Intel Xeon processors, there will be an excess of second-
    level cache misses for more than 8 simultaneous competing references. The
    stride sizes that can cause capacity issues are 32 KBytes, 64 KBytes, or
    128 KBytes, depending on the size of the second level cache.
  — On Pentium M processors, the stride sizes that can cause capacity issues are
    128 KBytes or 256 KBytes, depending on the size of the second level cache.
  — On Intel Core 2 Duo, Intel Core Duo, and Intel Core Solo processors, a stride
    size of 256 KBytes can cause a capacity issue if the number of simultaneous
    accesses exceeds the way associativity of the L2 cache.
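The critical stride follows directly from the cache geometry: addresses that differ by
(cache size / number of ways) map to the same set. The following minimal sketch
computes it for hypothetical parameters chosen to match the 4-KByte modulus cited
above; the numbers are assumptions for illustration, not a specification.

#include <stdio.h>

int main(void)
{
    unsigned cache_size = 32 * 1024; /* e.g., a 32-KByte first-level data cache */
    unsigned num_ways   = 8;         /* 8-way set associative */

    /* Addresses apart by this stride compete for the same set; more than
       num_ways simultaneous references at this stride cause conflicts. */
    unsigned stride = cache_size / num_ways;   /* 4096 bytes here */
    printf("conflicting stride: %u bytes\n", stride);
    return 0;
}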
3.6.7.2 Aliasing Cases in Processors Based on Intel NetBurst
Microarchitecture
Aliasing conditions that are specific to processors based on Intel NetBurst microar-
chitecture are:
• 16 KBytes for code — There can only be one of these in the trace cache at a
  time. If two traces whose starting addresses are 16 KBytes apart are in the same
  working set, the symptom will be a high trace cache miss rate. Solve this by
  offsetting one of the addresses by one or more bytes.
• Data conflict — There can only be one instance of the data in the first-level
  cache at a time. If a reference (load or store) occurs and its linear address
  matches a data conflict condition with another reference (load or store) that is
  under way, then the second reference cannot begin until the first one is kicked
  out of the cache.
  — On Pentium 4 and Intel Xeon processors with a CPUID signature of family
    encoding 15, model encoding of 0, 1, or 2; the data conflict condition applies
    to addresses having identical values in bits 15:6 (this is also referred to as a
    64-KByte aliasing conflict). If you avoid this kind of aliasing, you can speed
    up programs by a factor of three if they load frequently from preceding stores
    with aliased addresses and little other instruction-level parallelism is
    available. The gain is smaller when loads alias with other loads, which causes
    thrashing in the first-level cache.
  — On Pentium 4 and Intel Xeon processors with a CPUID signature of family
    encoding 15, model encoding 3; the data conflict condition applies to
    addresses having identical values in bits 21:6.
3.6.7.3 Aliasing Cases in the Pentium M, Intel Core Solo, Intel Core Duo
and Intel Core 2 Duo Processors
Pentium M, Intel Core Solo, Intel Core Duo and Intel Core 2 Duo processors have the
following aliasing case:
• Store forwarding — If a store to an address is followed by a load from the same
  address, the load will not proceed until the store data is available. If a store is
  followed by a load and their addresses differ by a multiple of 4 KBytes, the load
  stalls until the store operation completes.
Assembly/Compiler Coding Rule 55. (H impact, M generality) Avoid having a
store followed by a non-dependent load with addresses that differ by a multiple of
4 KBytes. Also, lay out data or order computation to avoid having cache lines that
have linear addresses that are a multiple of 64 KBytes apart in the same working
set. Avoid having more than 4 cache lines that are some multiple of 2 KBytes apart
in the same first-level cache working set, and avoid having more than 8 cache lines
that are some multiple of 4 KBytes apart in the same first-level cache working set.
When declaring multiple arrays that are referenced with the same index and are each
a multiple of 64 KBytes (as can happen with STRUCT_OF_ARRAY data layouts), pad
them to avoid declaring them contiguously. Padding can be accomplished either by
intervening declarations of other variables or by artificially increasing the dimension.
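The following minimal sketch shows both padding styles for three 64-KByte arrays
referenced with the same index; the array names and sizes are illustrative
assumptions.

#define N (16 * 1024)   /* 16 K floats = 64 KBytes per array */

float a[N];
char  pad1[128];        /* intervening declaration: shifts b relative to a */
float b[N + 32];        /* artificially increased dimension: shifts c
                           relative to b by 128 bytes */
float c[N];

With these offsets in place, a[i], b[i], and c[i] for the same index i no longer differ
by an exact multiple of 64 KBytes.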
User/Source Coding Rule 8. (H impact, ML generality) Consider using a
special memory allocation library with address offset capability to avoid aliasing.
One way to implement a memory allocator to avoid aliasing is to allocate more than
enough space and pad. For example, allocate structures that are 68 KBytes instead of
64 KBytes to avoid the 64-KByte aliasing, or have the allocator pad and return
random offsets that are a multiple of 128 bytes (the size of a cache line).
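A minimal sketch of such an allocator follows; the helper name, the 16-step offset
range, and the use of rand() are illustrative assumptions, not a library API. A real
allocator would also have to record the base pointer so the block can be freed.

#include <stdlib.h>

#define PAD_QUANTUM 128   /* offset granularity: one second-level line */

void *alloc_with_offset(size_t size)
{
    /* Over-allocate, then return a pointer advanced by a random
       multiple of 128 bytes so separate allocations do not alias. */
    size_t offset = ((size_t)rand() % 16) * PAD_QUANTUM;
    char *base = malloc(size + 16 * PAD_QUANTUM);
    return base ? base + offset : NULL;
}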
User/Source Coding Rule 9. (M impact, M generality) When padding variable
declarations to avoid aliasing, the greatest benefit comes from avoiding aliasing on
second-level cache lines, suggesting an offset of 128 bytes or more.
4-KByte memory aliasing occurs when the code accesses two different memory loca-
tions with a 4-KByte offset between them. The 4-KByte aliasing situation can mani-
fest in a memory copy routine where the addresses of the source buffer and
destination buffer maintain a constant offset and the constant offset happens to be a
multiple of the byte increment from one iteration to the next.
Example 3-38 shows a routine that copies 16 bytes of memory in each iteration of a
loop. If the offsets (modulo 4096) between source buffer (EAX) and destination
buffer (EDX) differ by 16, 32, 48, 64, or 80, loads have to wait until stores have been
retired before they can continue. For example, at offset 16 the load of the next itera-
tion is 4-KByte aliased with the current iteration's store, so the loop must wait until
the store operation completes, serializing the entire loop. The amount of time
needed to wait decreases with larger offsets until an offset of 96 resolves the issue
(as there are no pending stores by the time of the load with the same address).
The Intel Core microarchitecture provides a performance monitoring event (see
LOAD_BLOCK.OVERLAP_STORE in Intel 64 and IA-32 Architectures Software
Developer's Manual, Volume 3B) that allows a software tuning effort to detect the
occurrence of aliasing conditions.
3.6.8 Mixing Code and Data
The aggressive prefetching and pre-decoding of instructions by Intel processors have
two related effects:
• Self-modifying code works correctly, according to the Intel architecture processor
  requirements, but incurs a significant performance penalty. Avoid self-modifying
  code if possible.
• Placing writable data in the code segment might be impossible to distinguish
  from self-modifying code. Writable data in the code segment might suffer the
  same performance penalty as self-modifying code.
Assembly/Compiler Coding Rule 56. (M impact, L generality) If (hopefully
read-only) data must occur on the same page as code, avoid placing it immediately
after an indirect jump. For example, follow an indirect jump with its most likely
target, and place the data after an unconditional branch.
Tuning Suggestion 1. In rare cases, a performance problem may be caused by
executing data on a code page as instructions. This is very likely to happen when
execution is following an indirect branch that is not resident in the trace cache. If
this is clearly causing a performance problem, try moving the data elsewhere, or
inserting an illegal opcode or a PAUSE instruction immediately after the indirect
branch. Note that the latter two alternatives may degrade performance in some
circumstances.
Example 3-38. Aliasing Between Loads and Stores Across Loop Iterations
lp:
movaps xmm0, [eax+ecx]
movaps [edx+ecx], xmm0
add ecx, 16
jnz lp
Assembly/Compiler Coding Rule 57. (H impact, L generality) Always put
code and data on separate pages. Avoid self-modifying code wherever possible. If
code is to be modified, try to do it all at once and make sure the code that performs
the modifications and the code being modified are on separate 4-KByte pages or on
separate aligned 1-KByte subpages.
3.6.8.1 Self-modifying Code
Self-modifying code (SMC) that ran correctly on Pentium III processors and prior
implementations will run correctly on subsequent implementations. SMC and cross-
modifying code (when multiple processors in a multiprocessor system are writing to
a code page) should be avoided when high performance is desired.
Software should avoid writing to a code page in the same 1-KByte subpage that is
being executed, or fetching code in the same 2-KByte subpage that is being
written. In addition, sharing a page containing directly or speculatively executed
code with another processor as a data page can trigger an SMC condition that causes
the entire pipeline of the machine and the trace cache to be cleared. This is due to the
self-modifying code condition.
Dynamic code need not cause the SMC condition if the code written fills up a data
page before that page is accessed as code. Dynamically-modified code (for example,
from target fix-ups) is likely to suffer from the SMC condition and should be avoided
where possible. Avoid the condition by introducing indirect branches and using data
tables on data pages (not code pages) using register-indirect calls.
3.6.9 Write Combining
Write combining (WC) improves performance in two ways:
• On a write miss to the first-level cache, it allows multiple stores to the same
  cache line to occur before that cache line is read for ownership (RFO) from further
  out in the cache/memory hierarchy. Then the rest of the line is read, and the bytes
  that have not been written are combined with the unmodified bytes in the
  returned line.
• Write combining allows multiple writes to be assembled and written further out in
  the cache hierarchy as a unit. This saves port and bus traffic. Saving traffic is
  particularly important for avoiding partial writes to uncached memory.
There are six write-combining buffers (on Pentium 4 and Intel Xeon processors with
a CPUID signature of family encoding 15, model encoding 3, there are 8 write-
combining buffers). Two of these buffers may be written out to higher cache levels
and freed up for use on other write misses. Only four write-combining buffers are
guaranteed to be available for simultaneous use. Write combining applies to memory
type WC; it does not apply to memory type UC.
There are six write-combining buffers in each processor core in Intel Core Duo and
Intel Core Solo processors. Processors based on Intel Core microarchitecture have
eight write-combining buffers in each core.
Assembly/Compiler Coding Rule 58. (H impact, L generality) If an inner loop
writes to more than four arrays (four distinct cache lines), apply loop fission to
break up the body of the loop such that only four arrays are being written to in each
iteration of each of the resulting loops.
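The following minimal sketch illustrates the loop fission this rule describes; the eight
array arguments and the trivial loop body are illustrative assumptions.

void fission(int *a, int *b, int *c, int *d,
             int *e, int *f, int *g, int *h, int n)
{
    int i;
    /* The original form would write all eight arrays per iteration,
       exceeding the four write-combining buffers guaranteed to be
       available. Split into two loops of four write streams each. */
    for (i = 0; i < n; i++) {
        a[i] = i; b[i] = i; c[i] = i; d[i] = i;
    }
    for (i = 0; i < n; i++) {
        e[i] = i; f[i] = i; g[i] = i; h[i] = i;
    }
}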
Write combining buffers are used for stores of all memory types. They are particu-
larly important for writes to uncached memory: writes to different parts of the same
cache line can be grouped into a single, full-cache-line bus transaction instead of
going across the bus (since they are not cached) as several partial writes. Avoiding
partial writes can have a significant impact on bus bandwidth-bound graphics appli-
cations, where graphics buffers are in uncached memory. Separating writes to
uncached memory and writes to writeback memory into separate phases can assure
that the write combining buffers can fill before getting evicted by other write traffic.
Eliminating partial write transactions has been found to have a performance impact on
the order of 20% for some applications. Because the cache lines are 64 bytes, a write
to the bus for 63 bytes will result in 8 partial bus transactions.
When coding functions that execute simultaneously on two threads, reducing the
number of writes that are allowed in an inner loop will help take full advantage of
write-combining store buffers. For write-combining buffer recommendations for
Hyper-Threading Technology, see Chapter 8, "Multicore and Hyper-Threading Tech-
nology."
Store ordering and visibility are also important issues for write combining. When a
write to a write-combining buffer for a previously-unwritten cache line occurs, there
will be a read-for-ownership (RFO). If a subsequent write happens to another write-
combining buffer, a separate RFO may be caused for that cache line. Subsequent
writes to the first cache line and write-combining buffer will be delayed until the
second RFO has been serviced to guarantee properly ordered visibility of the writes.
If the memory type for the writes is write-combining, there will be no RFO since the
line is not cached, and there is no such delay. For details on write-combining, see
Chapter 10, "Memory Cache Control," of Intel 64 and IA-32 Architectures Software
Developer's Manual, Volume 3A.
3.6.10 Locality Enhancement
Locality enhancement can reduce data traffic originating from an outer-level sub-
system in the cache/memory hierarchy. This is to address the fact that the access
cost in terms of cycle-count from an outer level will be more expensive than from an
inner level. Typically, the cycle-cost of accessing a given cache level (or memory
system) varies across different microarchitectures, processor implementations, and
platform components. It may be sufficient to recognize the relative data access cost
trend by locality rather than to follow a large table of numeric values of cycle-costs,
listed per locality, per processor/platform implementation, etc. The general trend is
typically that access cost from an outer sub-system may be approximately 3-10X
more expensive than accessing data from the immediate inner level in the
cache/memory hierarchy, assuming similar degrees of data access parallelism.
Thus locality enhancement should start with characterizing the dominant data traffic
locality. Appendix A, "Application Performance Tools," describes some techniques that
can be used to determine the dominant data traffic locality for any workload.
Even if cache miss rates of the last level cache may be low relative to the number of
cache references, processors typically spend a sizable portion of their execution time
waiting for cache misses to be serviced. Reducing cache misses by enhancing a
program's locality is a key optimization. This can take several forms:
• Blocking to iterate over a portion of an array that will fit in the cache (with the
  purpose that subsequent references to the data-block [or tile] will be cache hit
  references); see the sketch after this list
• Loop interchange to avoid crossing cache lines or page boundaries
• Loop skewing to make accesses contiguous
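The following minimal sketch shows blocking (tiling) applied to a 2-D traversal; the
matrix size N and tile edge B are illustrative assumptions, with B chosen so that one
tile fits comfortably in the target cache level.

#define N 1024
#define B 64    /* tile edge; tune so a tile fits in (half) the cache */

void transpose_blocked(double dst[N][N], const double src[N][N])
{
    int i, j, ii, jj;
    for (ii = 0; ii < N; ii += B)
        for (jj = 0; jj < N; jj += B)
            /* Finish one B x B tile before moving on, so the data
               stays resident in cache while it is being reused. */
            for (i = ii; i < ii + B; i++)
                for (j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}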
Locality enhancement to the last level cache can be accomplished with sequencing
the data access pattern to take advantage of hardware prefetching. This can also
take several forms:
• Transformation of a sparsely populated multi-dimensional array into a one-
  dimension array such that memory references occur in a sequential, small-stride
  pattern that is friendly to the hardware prefetch (see Section 2.2.4.4, "Data
  Prefetch")
• Optimal tile size and shape selection can further improve temporal data locality
  by increasing hit rates into the last level cache and reduce memory traffic
  resulting from the actions of hardware prefetching (see Section 9.6.11,
  "Hardware Prefetching and Cache Blocking Techniques")
It is important to avoid operations that work against locality-enhancing techniques.
Using the lock prefix heavily can incur large delays when accessing memory, regard-
less of whether the data is in the cache or in system memory.
User/Source Coding Rule 10. (H impact, H generality) Optimization
techniques such as blocking, loop interchange, loop skewing, and packing are best
done by the compiler. Optimize data structures either to fit in one-half of the first-
level cache or in the second-level cache; turn on loop optimizations in the compiler
to enhance locality for nested loops.
Optimizing for one-half of the first-level cache will bring the greatest performance
benefit in terms of cycle-cost per data access. If one-half of the first-level cache is
too small to be practical, optimize for the second-level cache. Optimizing for a point
in between (for example, for the entire first-level cache) will likely not bring a
substantial improvement over optimizing for the second-level cache.
3.6.11 Minimizing Bus Latency
Each bus transaction includes the overhead of making requests and arbitrations. The
average latency of bus read and bus write transactions will be longer if reads and
writes alternate. Segmenting reads and writes into phases can reduce the average
latency of bus transactions, because the number of incidences of successive
transactions involving a read following a write, or a write following a read, is
reduced.
User/Source Coding Rule 11. (M impact, ML generality) If there is a blend of
reads and writes on the bus, changing the code to separate these bus transactions
into read phases and write phases can help performance.
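The following minimal sketch shows one way to group transactions of the same type;
the chunked copy through a temporary buffer and the CHUNK size are illustrative
assumptions.

#define CHUNK 256

void copy_phased(int *dst, const int *src, int n)
{
    int tmp[CHUNK];
    int base, i, len;
    for (base = 0; base < n; base += CHUNK) {
        len = (n - base < CHUNK) ? (n - base) : CHUNK;
        for (i = 0; i < len; i++)      /* read phase */
            tmp[i] = src[base + i];
        for (i = 0; i < len; i++)      /* write phase */
            dst[base + i] = tmp[i];
    }
}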
Note, however, that the order of read and write operations on the bus is not the same
as it appears in the program.
Bus latency for fetching a cache line of data can vary as a function of the access
stride of data references. In general, bus latency will increase in response to
increasing values of the stride of successive cache misses. Independently, bus
latency will also increase as a function of increasing bus queue depths (the number
of outstanding bus requests of a given transaction type). The combination of these
two trends can be highly non-linear: in large-stride, bandwidth-sensitive situations,
the effective throughput of the bus system for data-parallel accesses can be
significantly less than in small-stride, bandwidth-sensitive situations.
To minimize the per-access cost of memory traffic or amortize raw memory latency
effectively, software should control its cache miss pattern to favor a higher concentra-
tion of smaller-stride cache misses.
User/Source Coding Rule 12. (H impact, H generality) To achieve effective
amortization of bus latency, software should favor data access patterns that result
in higher concentrations of cache miss patterns, with cache miss strides that are
significantly smaller than half the hardware prefetch trigger threshold.
3.6.12 Non-Temporal Store Bus Traffic
Peak system bus bandwidth is shared by several types of bus activities, including
reads (from memory), reads for ownership (of a cache line), and writes. The data
transfer rate for bus write transactions is higher if 64 bytes are written out to the bus
at a time.
Typically, bus writes to Writeback (WB) memory must share the system bus band-
width with read-for-ownership (RFO) traffic. Non-temporal stores do not require RFO
traffic; they do require care in managing the access patterns in order to ensure 64
bytes are evicted at once (rather than evicting several 8-byte chunks).
Although the data bandwidth of full 64-byte bus writes due to non-temporal stores is
twice that of bus writes to WB memory, transferring 8-byte chunks wastes bus
request bandwidth and delivers significantly lower data bandwidth. This difference is
depicted in Examples 3-39 and 3-40.
3.7 PREFETCHING
Recent Intel processor families employ several prefetching mechanisms to accelerate
the movement of data or code and improve performance:
• Hardware instruction prefetcher
• Software prefetch for data
• Hardware prefetch for cache lines of data or instructions
Example 3-39. Using Non-temporal Stores and 64-byte Bus Write Transactions
#define STRIDESIZE 256
lea ecx, p64byte_Aligned
mov edx, ARRAY_LEN
xor eax, eax
slloop:
movntps XMMWORD ptr [ecx + eax], xmm0
movntps XMMWORD ptr [ecx + eax+16], xmm0
movntps XMMWORD ptr [ecx + eax+32], xmm0
movntps XMMWORD ptr [ecx + eax+48], xmm0
; 64 bytes is written in one bus transaction
add eax, STRIDESIZE
cmp eax, edx
jl slloop
Example 3-40. Non-temporal Stores and Partial Bus Write Transactions
#define STRIDESIZE 256
lea ecx, p64byte_Aligned
mov edx, ARRAY_LEN
xor eax, eax
slloop:
movntps XMMWORD ptr [ecx + eax], xmm0
movntps XMMWORD ptr [ecx + eax+16], xmm0
movntps XMMWORD ptr [ecx + eax+32], xmm0
; Storing 48 bytes results in 6 bus partial transactions
add eax, STRIDESIZE
cmp eax, edx
jl slloop
3.7.1 Hardware Instruction Fetching and Software Prefetching
In processors based on Intel NetBurst microarchitecture, the hardware instruction
fetcher reads instructions, 32 bytes at a time, into the 64-byte instruction streaming
buffers. Instruction fetching for Intel Core microarchitecture is discussed in
Section 2.1.2.
Software prefetching requires a programmer to use PREFETCH hint instructions and
anticipate some suitable timing and location of cache misses.
In Intel Core microarchitecture, software PREFETCH instructions can prefetch beyond
page boundaries and can perform one-to-four page walks. Software PREFETCH
instructions issued on fill buffer allocations retire after the page walk completes and
the DCU miss is detected. Software PREFETCH instructions can trigger all hardware
prefetchers in the same manner as do regular loads.
Software PREFETCH operations work the same way as do load from memory opera-
tions, with the following exceptions:
• Software PREFETCH instructions retire after virtual to physical address
  translation is completed.
• If an exception, such as a page fault, is required to prefetch the data, then the
  software prefetch instruction retires without prefetching data.
3.7.2 Software and Hardware Prefetching in Prior
Microarchitectures
Pentium 4 and Intel Xeon processors based on Intel NetBurst microarchitecture intro-
duced hardware prefetching in addition to software prefetching. The hardware
prefetcher operates transparently to fetch data and instruction streams from
memory without requiring programmer intervention. Subsequent microarchitectures
continue to improve and add features to the hardware prefetching mechanisms.
Earlier implementations of hardware prefetching mechanisms focus on prefetching
data and instructions from memory to L2; more recent implementations provide addi-
tional features to prefetch data from L2 to L1.
In Intel NetBurst microarchitecture, the hardware prefetcher can track 8 indepen-
dent streams.
The Pentium M processor also provides a hardware prefetcher for data. It can track
12 separate streams in the forward direction and 4 streams in the backward direc-
tion. The processor's PREFETCHNTA instruction also fetches 64 bytes into the first-
level data cache without polluting the second-level cache.
Intel Core Solo and Intel Core Duo processors provide more advanced hardware
prefetchers for data than Pentium M processors. Key differences are summarized in
Table 2-6.
Although the hardware prefetcher operates transparently (requiring no intervention
by the programmer), it operates most efficiently if the programmer specifically
tailors data access patterns to suit its characteristics (it favors small-stride cache
miss patterns). Optimizing data access patterns to suit the hardware prefetcher is
highly recommended, and should be a higher-priority consideration than using soft-
ware prefetch instructions.
The hardware prefetcher is best for small-stride data access patterns in either direc-
tion with a cache-miss stride not far from 64 bytes. This is true for data accesses to
addresses that are either known or unknown at the time of issuing the load opera-
tions. Software prefetch can complement the hardware prefetcher if used carefully.
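The following minimal sketch shows the common pattern of issuing a software
prefetch a fixed distance ahead of the element currently being processed, using the
_mm_prefetch intrinsic for the PREFETCH hint instructions. The prefetch distance of
eight cache lines and the trivial loop body are illustrative assumptions that must be
tuned for the actual workload.

#include <xmmintrin.h>

void process(float *a, int n)
{
    int i;
    for (i = 0; i < n; i += 16) {   /* 16 floats = one 64-byte line */
        /* Request the line 8 lines (512 bytes) ahead of the current one. */
        _mm_prefetch((const char *)&a[i + 128], _MM_HINT_T0);
        a[i] += 1.0f;               /* ... work on a[i] .. a[i+15] ... */
    }
}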
There is a trade-off to make between hardware and software prefetching. This
pertains to application characteristics such as regularity and stride of accesses. Bus
bandwidth, issue bandwidth (the latency of loads on the critical path) and whether
access patterns are suitable for non-temporal prefetch will also have an impact.
For a detailed description of how to use prefetching, see Chapter 9, "Optimizing
Cache Usage."
Chapter 5, "Optimizing for SIMD Integer Applications," contains an example that
uses software prefetch to implement a memory copy algorithm.
Tuning Suggestion 2. If a load is found to miss frequently, either insert a prefetch
before it or (if issue bandwidth is a concern) move the load up to execute earlier.
3.7.3 Hardware Prefetching for First-Level Data Cache
The hardware prefetching mechanism for L1 in Intel Core microarchitecture is
discussed in Section 2.1.4.2. A similar L1 prefetch mechanism is also available to
processors based on Intel NetBurst microarchitecture with a CPUID signature of family
15 and model 6.
Example 3-41 depicts a technique to trigger hardware prefetch. The code demon-
strates traversing a linked list and performing some computational work on 2
members of each element that reside in 2 different cache lines. Each element is of
size 192 bytes. The total size of all elements is larger than can fit in the L2 cache.
Example 3-41. Using DCU Hardware Prefetch

Original code:
mov ebx, DWORD PTR [First]
xor eax, eax
scan_list:
mov eax, [ebx+4]
mov ecx, 60
do_some_work_1:
add eax, eax
and eax, 6
sub ecx, 1
jnz do_some_work_1
mov eax, [ebx+64]
mov ecx, 30
do_some_work_2:
add eax, eax
and eax, 6
sub ecx, 1
jnz do_some_work_2
mov ebx, [ebx]
test ebx, ebx
jnz scan_list

Modified sequence benefiting from prefetch:
mov ebx, DWORD PTR [First]
xor eax, eax
scan_list:
mov eax, [ebx+4]
mov eax, [ebx+4]
mov eax, [ebx+4]
mov ecx, 60
do_some_work_1:
add eax, eax
and eax, 6
sub ecx, 1
jnz do_some_work_1
mov eax, [ebx+64]
mov ecx, 30
do_some_work_2:
add eax, eax
and eax, 6
sub ecx, 1
jnz do_some_work_2
mov ebx, [ebx]
test ebx, ebx
jnz scan_list

The additional instructions to load data from one member in the modified sequence
can trigger the DCU hardware prefetch mechanisms to prefetch data in the next
cache line, enabling the work on the second member to complete sooner.
Software can gain from the first-level data cache prefetchers in two cases:
• If data is not in the second-level cache, the first-level data cache prefetcher
  enables early trigger of the second-level cache prefetcher.
• If data is in the second-level cache and not in the first-level data cache, then the
  first-level data cache prefetcher triggers earlier data bring-up of the sequential
  cache line to the first-level data cache.
There are situations in which software should pay attention to a potential side effect
of triggering unnecessary DCU hardware prefetches: if a large data structure with
many members spanning many cache lines is accessed in ways that only a few of its
members are actually referenced, but there are multiple pair accesses to the same
cache line, the DCU hardware prefetcher can trigger fetching of cache lines that are
not needed. In Example 3-42, references to the Pts array and AuxPts will trigger DCU
prefetch to fetch additional cache lines that won't be needed. If significant negative
performance impact is detected due to DCU hardware prefetch on a portion of the
code, software can try to reduce the size of that contemporaneous working set to be
less than half of the L2 cache.
To fully benefit from these prefetchers, organize and access the data using one of the
following methods:
Method 1:
• Organize the data so consecutive accesses can usually be found in the same
  4-KByte page.
• Access the data in constant strides forward or backward (IP Prefetcher).
Example 3-42. Avoid Causing DCU Hardware Prefetch to Fetch Un-needed Lines
while ( CurrBond != NULL )
{
MyATOM *a1 = CurrBond->At1 ;
MyATOM *a2 = CurrBond->At2 ;
if ( a1->CurrStep <= a1->LastStep &&
a2->CurrStep <= a2->LastStep
)
{
a1->CurrStep++ ;
a2->CurrStep++ ;
double ux = a1->Pts[0].x - a2->Pts[0].x ;
double uy = a1->Pts[0].y - a2->Pts[0].y ;
double uz = a1->Pts[0].z - a2->Pts[0].z ;
a1->AuxPts[0].x += ux ;
a1->AuxPts[0].y += uy ;
a1->AuxPts[0].z += uz ;
a2->AuxPts[0].x += ux ;
a2->AuxPts[0].y += uy ;
a2->AuxPts[0].z += uz ;
} ;
CurrBond = CurrBond->Next ;
} ;
Method 2:
• Organize the data in consecutive lines.
• Access the data in increasing addresses, in sequential cache lines.
Example 3-43 demonstrates accesses to sequential cache lines that can benefit from
the first-level cache prefetcher.
By elevating the load operations from memory to the beginning of each iteration, it is
likely that a significant part of the latency of the pair cache line transfer from memory
to the second-level cache will be in parallel with the transfer of the first cache line.
The IP prefetcher uses only the lower 8 bits of the address to distinguish a specific
address. If the code size of a loop is bigger than 256 bytes, two loads may appear
similar in the lowest 8 bits and the IP prefetcher will be restricted. Therefore, if you
have a loop bigger than 256 bytes, make sure that no two loads have the same
lowest 8 bits in order to use the IP prefetcher.
3.7.4 Hardware Prefetching for Second-Level Cache
The Intel Core microarchitecture contains two second-level cache prefetchers:
• Streamer — Loads data or instructions from memory to the second-level cache.
  To use the streamer, organize the data or instructions in blocks of 128 bytes,
  aligned on 128 bytes. The first access to one of the two cache lines in this block
  while it is in memory triggers the streamer to prefetch the pair line. To software,
  the L2 streamer's functionality is similar to the adjacent cache line prefetch
  mechanism found in processors based on Intel NetBurst microarchitecture.
• Data prefetch logic (DPL) — DPL and the L2 streamer are triggered only by
  writeback memory type. They prefetch only inside the page boundary (4 KBytes).
  Both L2 prefetchers can be triggered by software prefetch instructions and by
  prefetch requests from DCU prefetchers. DPL can also be triggered by read for
  ownership (RFO) operations. The L2 streamer can also be triggered by DPL
  requests for L2 cache misses.
Software can gain from organizing data both according to the instruction pointer and
according to line strides. For example, for matrix calculations, columns can be
prefetched by IP-based prefetches, and rows can be prefetched by DPL and the L2
streamer.

Example 3-43. Technique For Using L1 Hardware Prefetch
unsigned int *p1, j, a, b;
for (j = 0; j < num; j += 16)
{
   a = p1[j];
   b = p1[j+1];
   // Use these two values
}
3.7.5 Cacheability Instructions
SSE2 provides additional cacheability instructions that extend those provided in SSE.
The new cacheability instructions include:
• new streaming store instructions
• new cache line flush instruction
• new memory fencing instructions
For more information, see Chapter 9, "Optimizing Cache Usage."
3.7.6 REP Prefix and Data Movement
The REP prefix is commonly used with string move instructions for memory related
library functions such as MEMCPY (using REP MOVSD) or MEMSET (using REP STOS).
These STRING/MOV instructions with the REP prefixes are implemented in MS-ROM
and have several implementation variants with different performance levels.
The specific variant of the implementation is chosen at execution time based on data
layout, alignment and the counter (ECX) value. For example, MOVSB/STOSB with the
REP prefix should be used with a counter value less than or equal to three for best
performance.
String MOVE/STORE instructions have multiple data granularities. For efficient data
movement, larger data granularities are preferable. This means better efficiency can
be achieved by decomposing an arbitrary counter value into a number of double-
words plus single byte moves with a count value less than or equal to 3.
Because software can use SIMD data movement instructions to move 16 bytes at a
time, the following paragraphs discuss general guidelines for designing and imple-
menting high-performance library functions such as MEMCPY(), MEMSET(), and
MEMMOVE(). Four factors are to be considered:
• Throughput per iteration — If two pieces of code have approximately identical
  path lengths, efficiency favors choosing the instruction that moves larger pieces
  of data per iteration. Also, smaller code size per iteration will in general reduce
  overhead and improve throughput. Sometimes, this may involve a comparison of
  the relative overhead of an iterative loop structure versus using the REP prefix for
  iteration.
• Address alignment — Data movement instructions with highest throughput
  usually have alignment restrictions, or they operate more efficiently if the
  destination address is aligned to its natural data size. Specifically, 16-byte moves
  need to ensure the destination address is aligned to 16-byte boundaries, and
  8-byte moves perform better if the destination address is aligned to 8-byte
  boundaries. Frequently, moving at doubleword granularity performs better with
  addresses that are 8-byte aligned.
• REP string move vs. SIMD move — Implementing general-purpose memory
  functions using SIMD extensions usually requires adding some prolog code to
  ensure the availability of SIMD instructions, and preamble code to facilitate aligned
  data movement requirements at runtime. Throughput comparison must also take
  into consideration the overhead of the prolog when considering a REP string
  implementation versus a SIMD approach.
• Cache eviction — If the amount of data to be processed by a memory routine
  approaches half the size of the last level on-die cache, temporal locality of the
  cache may suffer. Using streaming store instructions (for example: MOVNTQ,
  MOVNTDQ) can minimize the effect of flushing the cache. The threshold to start
  using a streaming store depends on the size of the last level cache. Determine
  the size using the deterministic cache parameter leaf of CPUID.
  Techniques for using streaming stores for implementing a MEMSET()-type
  library must also consider that the application can benefit from this technique
  only if it has no immediate need to reference the target addresses. This
  assumption is easily upheld when testing a streaming-store implementation on
  a micro-benchmark configuration, but violated in a full-scale application
  situation.
When applying general heuristics to the design of general-purpose, high-perfor-
mance library routines, the following guidelines are useful when optimizing for an
arbitrary counter value N and address alignment. Different techniques may be neces-
sary for optimal performance, depending on the magnitude of N:
• When N is less than some small count (where the small count threshold will vary
  between microarchitectures -- empirically, 8 may be a good value when
  optimizing for Intel NetBurst microarchitecture), each case can be coded directly
  without the overhead of a looping structure. For example, 11 bytes can be
  processed using two MOVSD instructions explicitly and a MOVSB with a REP
  counter equaling 3.
• When N is not small but still less than some threshold value (which may vary for
  different micro-architectures, but can be determined empirically), an SIMD
  implementation using run-time CPUID and alignment prolog will likely deliver
  less throughput due to the overhead of the prolog. A REP string implementation
  should favor using a REP string of doublewords. To improve address alignment, a
  small piece of prolog code using MOVSB/STOSB with a count less than 4 can be
  used to peel off the non-aligned data moves before starting to use
  MOVSD/STOSD.
• When N is less than half the size of the last level cache, throughput consideration
  may favor either:
  — An approach using a REP string with the largest data granularity, because a
    REP string has little overhead for loop iteration, and the branch misprediction
    overhead in the prolog/epilogue code to handle address alignment is
    amortized over many iterations.
  — An iterative approach using the instruction with largest data granularity,
    where the overhead for SIMD feature detection, iteration overhead, and
    prolog/epilogue for alignment control can be minimized. The trade-off
    between these approaches may depend on the microarchitecture.
  An example of MEMSET() implemented using STOSD for an arbitrary counter
  value with the destination address aligned to a doubleword boundary in 32-bit
  mode is shown in Example 3-44.
• When N is larger than half the size of the last level cache, using 16-byte
  granularity streaming stores with prolog/epilog for address alignment will likely
  be more efficient, if the destination addresses will not be referenced immediately
  afterwards.
Memory routines in the runtime library generated by Intel compilers are optimized
across a wide range of address alignments, counter values, and microarchitectures.
In most cases, applications should take advantage of the default memory routines
provided by Intel compilers.
Example 3-44. REP STOSD with Arbitrary Count Size and 4-Byte-Aligned Destination

A C example of an equivalent Memset():
void memset(void *dst, int c, size_t size)
{
   char *d = (char *)dst;
   size_t i;
   for (i = 0; i < size; i++)
      *d++ = (char)c;
}

Equivalent implementation using REP STOSD:
push edi
movzx eax, byte ptr [esp+12]
mov ecx, eax
shl ecx, 8
or eax, ecx          ; replicate byte into low word
mov ecx, eax
shl ecx, 16
or eax, ecx          ; replicate word into all four bytes
mov edi, [esp+8]     ; 4-byte aligned
mov ecx, [esp+16]    ; byte count
shr ecx, 2           ; do dword
cmp ecx, 127
jle _main
test edi, 4
jz _main
stosd                ; peel off one dword
dec ecx
_main:               ; 8-byte aligned
rep stosd
mov ecx, [esp + 16]
and ecx, 3           ; do count <= 3
rep stosb            ; optimal with <= 3
pop edi
ret
In some situations, the byte count of the data is known by the context (as opposed
to being known by a parameter passed from a call), and one can take a simpler
approach than those required for a general-purpose library routine. For example, if
the byte count is small, using REP MOVSB/STOSB with a count less than four can
ensure good address alignment, and loop-unrolling can finish the remaining data;
using MOVSD/STOSD can reduce the overhead associated with iteration.
Using a REP prefix with string move instructions can provide high performance in the
situations described above. However, using a REP prefix with string scan instructions
(SCASB, SCASW, SCASD, SCASQ) or compare instructions (CMPSB, CMPSW,
CMPSD, CMPSQ) is not recommended for high performance. Consider using SIMD
instructions instead.
3.8 FLOATING-POINT CONSIDERATIONS
When programming floating-point applications, it is best to start with a high-level
programming language such as C, C++, or Fortran. Many compilers perform floating-
point scheduling and optimization when it is possible. However, in order to produce
optimal code, the compiler may need some assistance.
3.8.1 Guidelines for Optimizing Floating-point Code
User/Source Coding Rule 13. (M impact, M generality) Enable the compiler's
use of SSE, SSE2 or SSE3 instructions with appropriate switches.
Follow this procedure to investigate the performance of your floating-point applica-
tion:
• Understand how the compiler handles floating-point code.
• Look at the assembly dump and see what transforms are already performed on
  the program.
• Study the loop nests in the application that dominate the execution time.
• Determine why the compiler is not creating the fastest code.
• See if there is a dependence that can be resolved.
• Determine the problem area: bus bandwidth, cache locality, trace cache
  bandwidth, or instruction latency. Focus on optimizing the problem area. For
  example, adding PREFETCH instructions will not help if the bus is already
  saturated. If trace cache bandwidth is the problem, added prefetch µops may
  degrade performance.
Also, in general, follow the general coding recommendations discussed in this
chapter, including:
• Blocking the cache
• Using prefetch
• Enabling vectorization
• Unrolling loops
User/Source Coding Rule 14. (H impact, ML generality) Make sure your
application stays in range to avoid denormal values and underflows.
Out-of-range numbers cause very high overhead.
User/Source Coding Rule 15. (M impact, ML generality) Do not use double
precision unless necessary. Set the precision control (PC) field in the x87 FPU
control word to "Single Precision". This allows single precision (32-bit) computation
to complete faster on some operations (for example, divides due to early out).
However, be careful of introducing more than a total of two values for the floating
point control word, or there will be a large performance penalty. See Section 3.8.3.
User/Source Coding Rule 16. (H impact, ML generality) Use fast float-to-int
routines, FISTTP, or SSE2 instructions. If coding these routines, use the FISTTP
instruction if SSE3 is available, or the CVTTSS2SI and CVTTSD2SI instructions if
coding with Streaming SIMD Extensions 2.
Many libraries generate x87 code that does more work than is necessary. The FISTTP
instruction in SSE3 can convert floating-point values to 16-bit, 32-bit, or 64-bit inte-
gers using truncation without accessing the floating-point control word (FCW). The
instructions CVTTSS2SI and CVTTSD2SI save many µops and some store-forwarding
delays over some compiler implementations. This avoids changing the rounding
mode.
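As a minimal sketch, the same truncating conversions are reachable from C through
the SSE/SSE2 intrinsics that compile to CVTTSS2SI and CVTTSD2SI; the wrapper
function names are illustrative assumptions.

#include <xmmintrin.h>   /* _mm_cvtt_ss2si */
#include <emmintrin.h>   /* _mm_cvttsd_si32 */

int float_to_int(float f)
{
    return _mm_cvtt_ss2si(_mm_set_ss(f));    /* CVTTSS2SI: truncate */
}

int double_to_int(double d)
{
    return _mm_cvttsd_si32(_mm_set_sd(d));   /* CVTTSD2SI: truncate */
}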
User/Source Coding Rule 17. (M impact, ML generality) Removing data
dependence enables the out-of-order engine to extract more ILP from the code.
When summing up the elements of an array, use partial sums instead of a single
accumulator.
For example, to calculate z = a + b + c + d, instead of:
x = a + b;
y = x + c;
z = y + d;
use:
x = a + b;
y = c + d;
z = x + y;
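The same idea applied to an array reduction is sketched below; the function name
and the assumption that n is even are illustrative.

float sum_array(const float *a, int n)   /* n assumed even here */
{
    float sum0 = 0.0f, sum1 = 0.0f;      /* two independent chains */
    int i;
    for (i = 0; i < n; i += 2) {
        sum0 += a[i];
        sum1 += a[i + 1];
    }
    return sum0 + sum1;                  /* combine the partial sums */
}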
User/Source Coding Rule 18. (M impact, ML generality) Usually, math
libraries take advantage of the transcendental instructions (for example, FSIN)
when evaluating elementary functions. If there is no critical need to evaluate the
transcendental functions using the extended precision of 80 bits, applications
should consider an alternate, software-based approach, such as a look-up-table-
based algorithm using interpolation techniques. It is possible to improve
transcendental performance with these techniques by choosing the desired numeric
precision and the size of the look-up table, and by taking advantage of the
parallelism of the SSE and the SSE2 instructions.
3.8.2 Floating-point Modes and Exceptions
When working with floating-point numbers, high-speed microprocessors frequently
must deal with situations that need special handling in hardware or code.
3.8.2.1 Floating-point Exceptions
The most frequent cause of performance degradation is the use of masked floating-
point exception conditions such as:
• arithmetic overflow
• arithmetic underflow
• denormalized operand
Refer to Chapter 4 of Intel 64 and IA-32 Architectures Software Developer's
Manual, Volume 1, for definitions of overflow, underflow and denormal exceptions.
Denormalized floating-point numbers impact performance in two ways:
• directly, when they are used as operands
• indirectly, when they are produced as a result of an underflow situation
If a floating-point application never underflows, the denormals can only come from
floating-point constants.
User/Source Coding Rule 19. (H impact, ML generality) Denormalized
floating-point constants should be avoided as much as possible.
Denormal and arithmetic underflow exceptions can occur during the execution of x87
instructions or SSE/SSE2/SSE3 instructions. Processors based on Intel NetBurst
microarchitecture handle these exceptions more efficiently when executing
SSE/SSE2/SSE3 instructions and when speed is more important than complying with
the IEEE standard. The following paragraphs give recommendations on how to opti-
mize your code to reduce performance degradations related to floating-point excep-
tions.
3.8.2.2 Dealing with floating-point exceptions in x87 FPU code
Every special situation listed in Section 3.8.2.1, "Floating-point Exceptions," is costly in terms of performance. For that reason, x87 FPU code should be written to avoid these situations.
There are basically three ways to reduce the impact of overflow/underflow situations with x87 FPU code:
• Choose floating-point data types that are large enough to accommodate results without generating arithmetic overflow and underflow exceptions.
• Scale the range of operands/results to reduce as much as possible the number of arithmetic overflow/underflow situations.
• Keep intermediate results on the x87 FPU register stack until the final results have been computed and stored in memory. Overflow or underflow is less likely to happen when intermediate results are kept in the x87 FPU stack (this is because data on the stack is stored in double extended-precision format and overflow/underflow conditions are detected accordingly).
Denormalized floating-point constants (which are read-only, and hence never change) should be avoided and replaced, if possible, with zeros of the same sign.
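A minimal illustration (the values and names here are hypothetical): in single precision, any magnitude below FLT_MIN (about 1.18e-38) is denormal, so a constant such as -1e-42f would incur the denormal penalty every time it is used; replacing it with a zero of the same sign avoids this.

const float k_denormal = -1e-42f;   /* denormal constant: avoid         */
const float k_replaced = -0.0f;     /* zero of the same sign: preferred */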
3.8.2.3 Floating-point Exceptions in SSE/SSE2/SSE3 Code
Most special situations that involve masked floating-point exceptions are handled efficiently in hardware. When a masked overflow exception occurs while executing SSE/SSE2/SSE3 code, processor hardware can handle it without a performance penalty.
Underflow exceptions and denormalized source operands are usually treated according to the IEEE 754 specification, but this can incur a significant performance delay. If a programmer is willing to trade pure IEEE 754 compliance for speed, two non-IEEE-754-compliant modes are provided to speed situations where underflows and input denormals are frequent: FTZ mode and DAZ mode.
When the FTZ mode is enabled, an underflow result is automatically converted to a zero with the correct sign. Although this behavior is not compliant with IEEE 754, it is provided for use in applications where performance is more important than IEEE 754 compliance. Since denormal results are not produced when the FTZ mode is enabled, the only denormal floating-point numbers that can be encountered in FTZ mode are the ones specified as constants (read only).
The DAZ mode is provided to handle denormal source operands efficiently when running a SIMD floating-point application. When the DAZ mode is enabled, input denormals are treated as zeros with the same sign. Enabling the DAZ mode is the way to deal with denormal floating-point constants when performance is the objective.
If departing from the IEEE 754 specification is acceptable and performance is critical, run SSE/SSE2/SSE3 applications with FTZ and DAZ modes enabled.
NOTE
The DAZ mode is available with both the SSE and SSE2 extensions, although the speed improvement expected from this mode is fully realized only in SSE code.
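Both modes are controlled through bits in the MXCSR register. A minimal sketch of enabling them with the standard SSE intrinsics (assuming a processor that supports DAZ):

#include <xmmintrin.h>    /* _MM_SET_FLUSH_ZERO_MODE     */
#include <pmmintrin.h>    /* _MM_SET_DENORMALS_ZERO_MODE */

void enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         /* FTZ: flush denormal results to zero */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); /* DAZ: treat denormal inputs as zero  */
}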
3.8.3 Floating-point Modes
On the Pentium III processor, the FLDCW instruction is an expensive operation. On early generations of Pentium 4 processors, FLDCW is improved only for situations where an application alternates between two constant values of the x87 FPU control word (FCW), such as when performing conversions to integers. On Pentium M, Intel Core Solo, Intel Core Duo and Intel Core 2 Duo processors, FLDCW is improved over previous generations.
Specifically, the optimization for FLDCW in the first two generations of Pentium 4 processors allows programmers to alternate between two constant values efficiently. For the FLDCW optimization to be effective, the two constant FCW values are only allowed to differ on the following 5 bits in the FCW:
FCW[8-9] ; Precision control
FCW[10-11] ; Rounding control
FCW[12] ; Infinity control
If programmers need to modify other bits (for example: mask bits) in the FCW, the FLDCW instruction is still an expensive operation.
In situations where an application cycles between three (or more) constant values, the FLDCW optimization does not apply, and the performance degradation occurs for each FLDCW instruction.
One solution to this problem is to choose two constant FCW values, take advantage of the optimization of the FLDCW instruction to alternate between only these two constant FCW values, and devise some means to accomplish the task that requires the 3rd FCW value without actually changing the FCW to a third constant value. An alternative solution is to structure the code so that, for periods of time, the application alternates between only two constant FCW values. When the application later alternates between a pair of different FCW values, the performance degradation occurs only during the transition.
It is expected that SIMD applications are unlikely to alternate between FTZ and DAZ mode values. Consequently, the SIMD control word does not have the short latencies that the floating-point control register does. A read of the MXCSR register has a fairly long latency, and a write to the register is a serializing instruction.
There is no separate control word for single and double precision; both use the same modes. Notably, this applies to both FTZ and DAZ modes.
Assembly/Compiler Coding Rule 59. (H impact, M generality) Minimize changes to bits 8-12 of the floating-point control word. Changes for more than two values (each value being a combination of the following bits: precision, rounding and infinity control, and the rest of the bits in the FCW) lead to delays that are on the order of the pipeline depth.
3.8.3.1 Rounding Mode
Many libraries provide float-to-integer library routines that convert floating-point values to integer. Many of these libraries conform to ANSI C coding standards which state that the rounding mode should be truncation. With the Pentium 4 processor, one can use the CVTTSD2SI and CVTTSS2SI instructions to convert operands with truncation without ever needing to change rounding modes. The cost savings of using these instructions over the methods below is enough to justify using SSE and SSE2 wherever possible when truncation is involved.
For x87 floating point, the FIST instruction uses the rounding mode represented in the floating-point control word (FCW). The rounding mode is generally round to nearest, so many compiler writers implement a change in the rounding mode in the processor in order to conform to the C and FORTRAN standards. This implementation requires changing the control word on the processor using the FLDCW instruction. For a change in the rounding, precision, and infinity bits, use the FSTCW instruction to store the floating-point control word. Then use the FLDCW instruction to change the rounding mode to truncation.
In a typical code sequence that changes the rounding mode in the FCW, an FSTCW instruction is usually followed by a load operation. The load operation from memory should use a 16-bit operand to prevent a store-forwarding problem. If the load operation on the previously-stored FCW word involves either an 8-bit or a 32-bit operand, it will cause a store-forwarding problem due to a mismatch of the size of the data between the store operation and the load operation.
To avoid store-forwarding problems, make sure that the write and read to the FCW are both 16-bit operations.
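A sketch of such a sequence (MSVC-style inline assembly; the variable names are illustrative) with matched 16-bit FCW accesses:

void convert_with_truncation(void)
{
    short saved_cw, trunc_cw;             /* 16-bit control-word images */
    __asm {
        fstcw word ptr saved_cw           ; 16-bit store of the current FCW
        mov   ax, saved_cw                ; 16-bit load: sizes match, store forwards
        or    ax, 0C00h                   ; set RC field (bits 10-11) to 11B: truncate
        mov   trunc_cw, ax
        fldcw word ptr trunc_cw           ; switch the rounding mode to truncation
        ; ... FIST/FISTP conversions here ...
        fldcw word ptr saved_cw           ; restore the original mode
    }
}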
If there is more than one change to the rounding, precision, and infinity bits, and the rounding mode is not important to the result, use the algorithm in Example 3-45 to avoid synchronization issues, the overhead of the FLDCW instruction, and having to change the rounding mode. Note that the example suffers from a store-forwarding problem which will lead to a performance penalty. However, its performance is still better than changing the rounding, precision, and infinity bits among more than two values.
Example 3-45. Algorithm to Avoid Changing Rounding Mode
_fto132proc
lea ecx, [esp-8]
sub esp, 16 ; Allocate frame
and ecx, -8 ; Align pointer on boundary of 8
fld st(0) ; Duplicate FPU stack top
fistp qword ptr[ecx]
fild qword ptr[ecx]
mov edx, [ecx+4] ; High DWORD of integer
mov eax, [ecx] ; Low DWORD of integer
test eax, eax
je integer_QnaN_or_zero
arg_is_not_integer_QnaN:
fsubp st(1), st ; TOS=d-round(d), { st(1) = st(1)-st & pop ST }
test edx, edx ; What's the sign of the integer?
jns positive ; If positive, jump; fall through if negative
fstp dword ptr[ecx] ; Result of subtraction
mov ecx, [ecx] ; DWORD of diff (single-precision)
add esp, 16
xor ecx, 80000000h
add ecx, 7fffffffh ; If diff<0 then decrement integer
adc eax, 0 ; INC EAX (add CARRY flag)
ret
positive:
fstp dword ptr[ecx] ; Result of subtraction
mov ecx, [ecx] ; DWORD of diff (single-precision)
add esp, 16
add ecx, 7fffffffh ; If diff<0 then decrement integer
sbb eax, 0 ; DEC EAX (subtract CARRY flag)
ret
integer_QnaN_or_zero:
test edx, 7fffffffh
jnz arg_is_not_integer_QnaN
add esp, 16
ret
Assembly/Compiler Coding Rule 60. (H impact, L generality) Minimize the number of changes to the rounding mode. Do not use changes in the rounding mode to implement the floor and ceiling functions if this involves a total of more than two values of the set of rounding, precision, and infinity bits.
3.8.3.2 Precision
If single precision is adequate, use it instead of double precision. This is true because:
• Single precision operations allow the use of longer SIMD vectors, since more single precision data elements can fit in a register.
• If the precision control (PC) field in the x87 FPU control word is set to single precision, the floating-point divider can complete a single-precision computation much faster than either a double-precision computation or an extended double-precision computation. If the PC field is set to double precision, this will enable those x87 FPU operations on double-precision data to complete faster than extended double-precision computation. These characteristics affect computations including floating-point divide and square root.
Assembly/Compiler Coding Rule 61. (H impact, L generality) Minimize the number of changes to the precision mode.
3.8.3.3 Improving Parallelism and the Use of FXCH
The x87 instruction set relies on the floating-point stack for one of its operands. If the dependence graph is a tree, which means each intermediate result is used only once and code is scheduled carefully, it is often possible to use only operands that are on the top of the stack or in memory, and to avoid using operands that are buried under the top of the stack. When operands need to be pulled from the middle of the stack, an FXCH instruction can be used to swap the operand on the top of the stack with another entry in the stack.
The FXCH instruction can also be used to enhance parallelism. Dependent chains can be overlapped to expose more independent instructions to the hardware scheduler. An FXCH instruction may be required to effectively increase the register name space so that more operands can be simultaneously live.
In processors based on Intel NetBurst microarchitecture, however, FXCH inhibits issue bandwidth in the trace cache. It does this not only because it consumes a slot, but also because of issue slot restrictions imposed on FXCH. If the application is not bound by issue or retirement bandwidth, FXCH will have no impact.
The effective instruction window size in processors based on Intel NetBurst microarchitecture is large enough to permit instructions that are as far away as the next iteration to be overlapped. This often obviates the need to use FXCH to enhance parallelism.
The FXCH instruction should be used only when it is needed to express an algorithm or to enhance parallelism. If the size of the register name space is a problem, the use of XMM registers is recommended.
Assembly/Compiler Coding Rule 62. (M impact, M generality) Use FXCH only where necessary to increase the effective name space.
This in turn allows instructions to be reordered and made available for execution in parallel. Out-of-order execution precludes the need for using FXCH to move instructions for very short distances.
3.8.4 x87 vs. Scalar SIMD Floating-point Trade-offs
There are a number of differences between x87 floating-point code and scalar floating-point code (using SSE and SSE2). The following differences should drive decisions about which registers and instructions to use:
• When an input operand for a SIMD floating-point instruction contains values that are less than the representable range of the data type, a denormal exception occurs. This causes a significant performance penalty. An SIMD floating-point operation has a flush-to-zero mode in which the results will not underflow. Therefore subsequent computation will not face the performance penalty of handling denormal input operands. For example, in the case of 3D applications with low lighting levels, using flush-to-zero mode can improve performance by as much as 50% for applications with large numbers of underflows.
• Scalar floating-point SIMD instructions have lower latencies than equivalent x87 instructions. The scalar SIMD floating-point multiply instruction may be pipelined, while the x87 multiply instruction is not.
• Only x87 supports transcendental instructions.
• x87 supports 80-bit precision, double extended floating point. SSE supports a maximum of 32-bit precision. SSE2 supports a maximum of 64-bit precision.
• Scalar floating-point registers may be accessed directly, avoiding FXCH and top-of-stack restrictions.
• The cost of converting from floating point to integer with truncation is significantly lower with Streaming SIMD Extensions 2 and Streaming SIMD Extensions in the processors based on Intel NetBurst microarchitecture than with either changes to the rounding mode or the sequence prescribed in Example 3-45.
Assembly/Compiler Coding Rule 63. (M impact, M generality) Use Streaming SIMD Extensions 2 or Streaming SIMD Extensions unless you need an x87 feature. Most SSE2 arithmetic operations have shorter latency than their x87 counterparts and they eliminate the overhead associated with the management of the x87 register stack.
3.8.4.1 Scalar SSE/SSE2 Performance on Intel Core Solo and Intel Core Duo Processors
On Intel Core Solo and Intel Core Duo processors, the combination of improved decoding and µop fusion allows instructions which were formerly two, three, and four µops to go through all decoders. As a result, scalar SSE/SSE2 code can match the performance of x87 code executing through two floating-point units. On Pentium M processors, scalar SSE/SSE2 code can experience approximately 30% performance degradation relative to x87 code executing through two floating-point units.
In code sequences that have conversions from floating-point to integer, divide single-precision instructions, or any precision change, x87 code generation from a compiler typically writes data to memory in single-precision and reads it again in order to reduce precision. Using SSE/SSE2 scalar code instead of x87 code can generate a large performance benefit using Intel NetBurst microarchitecture and a modest benefit on Intel Core Solo and Intel Core Duo processors.
Recommendation: Use the compiler switch to generate SSE2 scalar floating-point code rather than x87 code.
When working with scalar SSE/SSE2 code, pay attention to the need for clearing the content of unused slots in an XMM register and the associated performance impact. For example, loading data from memory with MOVSS or MOVSD causes an extra micro-op for zeroing the upper part of the XMM register.
On Pentium M, Intel Core Solo, and Intel Core Duo processors, this penalty can be avoided by using MOVLPD. However, using MOVLPD causes a performance penalty on Pentium 4 processors.
Another situation occurs when mixing single-precision and double-precision code. On processors based on Intel NetBurst microarchitecture, using CVTSS2SD has a performance penalty relative to the alternative sequence:
XORPS XMM1, XMM1
MOVSS XMM1, XMM2
CVTPS2PD XMM1, XMM1
On Intel Core Solo and Intel Core Duo processors, using CVTSS2SD is more desirable than the alternative sequence.
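Expressed with intrinsics, the alternative sequence above might look like the following sketch (the function name is illustrative):

#include <emmintrin.h>    /* SSE2 intrinsics */

/* Convert the low single-precision element of x to double precision
   after first zeroing the other elements (XORPS + MOVSS + CVTPS2PD). */
__m128d convert_low_ss_to_sd(__m128 x)
{
    __m128 t = _mm_move_ss(_mm_setzero_ps(), x);
    return _mm_cvtps_pd(t);
}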
3.8.4.2 x87 Floating-point Operations with Integer Operands
For processors based on Intel NetBurst microarchitecture, splitting floating-point operations (FIADD, FISUB, FIMUL, and FIDIV) that take 16-bit integer operands into two instructions (FILD and a floating-point operation) is more efficient. However, for floating-point operations with 32-bit integer operands, using FIADD, FISUB, FIMUL, and FIDIV is equally efficient compared with using separate instructions.
Assembly/Compiler Coding Rule 64. (M impact, L generality) Try to use 32-bit operands rather than 16-bit operands for FILD. However, do not do so at the expense of introducing a store-forwarding problem by writing the two halves of the 32-bit memory operand separately.
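A sketch of the split form (MSVC-style inline assembly; the variable is illustrative, and the code assumes a value is already on the x87 stack in ST(1)):

void add_integer_operand(void)
{
    static int i32 = 42;         /* 32-bit integer operand */
    __asm {
        ; instead of the single-instruction form: fiadd word ptr [i16]
        fild  dword ptr i32      ; FILD with a 32-bit operand
        faddp st(1), st          ; separate floating-point add
    }
}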
3.8.4.3 x87 Floating-point Comparison Instructions
The FCOMI and FCMOV instructions should be used when performing x87 floating-point comparisons. Using the FCOM, FCOMP, and FCOMPP instructions typically requires additional instructions, such as FSTSW. The latter alternative causes more µops to be decoded, and should be avoided.
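The contrast, as a sketch (MSVC-style inline assembly; the label is illustrative):

void compare_sketch(void)
{
    __asm {
        ; preferred: FCOMI writes EFLAGS directly
        fcomi st, st(1)
        jbe   done

        ; legacy alternative to avoid (decodes to more µops):
        ;   fcom  st(1)    ; result goes to the FPU status word
        ;   fstsw ax       ; copy the status word to AX
        ;   sahf           ; transfer flags to EFLAGS
        ;   jbe   done
done:
        nop
    }
}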
3.8.4.4 Transcendental Functions
If an application needs to emulate math functions in software for performance or other reasons (see Section 3.8.1, "Guidelines for Optimizing Floating-point Code"), it may be worthwhile to inline math library calls because the CALL and the prologue/epilogue involved with such calls can significantly affect the latency of operations.
Note that transcendental functions are supported only in x87 floating point, not in Streaming SIMD Extensions or Streaming SIMD Extensions 2.
CHAPTER 4
CODING FOR SIMD ARCHITECTURES
Intel Pentium 4, Intel Xeon and Pentium M processors include support for Streaming SIMD Extensions 2 (SSE2), Streaming SIMD Extensions technology (SSE), and MMX technology. In addition, Streaming SIMD Extensions 3 (SSE3) were introduced with the Pentium 4 processor supporting Hyper-Threading Technology at 90 nm technology. Intel Core Solo and Intel Core Duo processors support SSE3/SSE2/SSE, and MMX. Processors based on Intel Core microarchitecture support MMX, SSE, SSE2, SSE3, and SSSE3. Single-instruction, multiple-data (SIMD) technologies enable the development of advanced multimedia, signal processing, and modeling applications.
To take advantage of the performance opportunities presented by these capabilities, do the following:
• Ensure that the processor supports MMX technology, Streaming SIMD Extensions, Streaming SIMD Extensions 2, Streaming SIMD Extensions 3, and Supplemental Streaming SIMD Extensions 3.
• Ensure that the operating system supports MMX technology and SSE (OS support for SSE2, SSE3 and SSSE3 is the same as OS support for SSE).
• Employ the optimization and scheduling strategies described in this book.
• Use stack and data alignment techniques to keep data properly aligned for efficient memory use.
• Utilize the cacheability instructions offered by SSE and SSE2, where appropriate.
4.1 CHECKING FOR PROCESSOR SUPPORT OF SIMD TECHNOLOGIES
This section shows how to check whether a processor supports MMX technology, SSE, SSE2, or SSE3.
SIMD technology can be included in your application in three ways:
1. Check for the SIMD technology during installation. If the desired SIMD technology is available, the appropriate DLLs can be installed.
2. Check for the SIMD technology during program execution and install the proper DLLs at runtime. This is effective for programs that may be executed on different machines.
3. Create a "fat" binary that includes multiple versions of routines; versions that use SIMD technology and versions that do not. Check for SIMD technology during program execution and run the appropriate versions of the routines. This is especially effective for programs that may be executed on different machines. A sketch of this approach follows.
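The following sketch illustrates the third option; the names are illustrative, and sse_supported() is assumed to perform a CPUID feature test such as the one in Example 4-2:

void add_scalar(float *a, float *b, float *c);    /* generic version   */
void add_sse(float *a, float *b, float *c);       /* SSE version       */
int  sse_supported(void);                         /* CPUID bit 25 test */

/* Function pointer through which the application calls the routine. */
void (*add)(float *, float *, float *);

void init_dispatch(void)
{
    add = sse_supported() ? add_sse : add_scalar; /* select once at startup */
}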
4.1.1 Checking for MMX Technology Support
If MMX technology is available, then CPUID.01H:EDX[bit 23] = 1. Use the code segment in Example 4-1 to test for MMX technology.
For more information on CPUID, see Intel® Processor Identification with the CPUID Instruction, order number 241618.
4.1.2 Checking for Streaming SIMD Extensions Support
Checking for processor support of Streaming SIMD Extensions (SSE) on your processor is similar to checking for MMX technology. However, the operating system (OS) must provide support for SSE state save and restore on context switches to ensure consistent application behavior when using SSE instructions.
To check whether your system supports SSE, follow these steps:
1. Check that your processor supports the CPUID instruction.
2. Check the feature bits of CPUID for SSE existence.
Example 4-2 shows how to find the SSE feature bit (bit 25) in the CPUID feature flags.
Example 4-1. Identification of MMX Technology with CPUID
; identify existence of cpuid instruction
; identify signature is genuine Intel
mov eax, 1 ; Request for feature flags
cpuid ; 0FH, 0A2H CPUID instruction
test edx, 00800000h ; Is MMX technology bit (bit 23) in feature flags equal to 1?
jnz Found
Example 4-2. Identification of SSE with CPUID
; identify existence of cpuid instruction
; identify signature is genuine Intel
mov eax, 1 ; Request for feature flags
cpuid ; 0FH, 0A2H CPUID instruction
test edx, 002000000h ; Is bit 25 in feature flags equal to 1?
jnz Found
4.1.3 Checking for Streaming SIMD Extensions 2 Support
Checking for support of SSE2 is like checking for SSE support. The OS requirements for SSE2 support are the same as the OS requirements for SSE.
To check whether your system supports SSE2, follow these steps:
1. Check that your processor has the CPUID instruction.
2. Check the feature bits of CPUID for SSE2 technology existence.
Example 4-3 shows how to find the SSE2 feature bit (bit 26) in the CPUID feature flags.
4.1.4 Checking for Streaming SIMD Extensions 3 Support
SSE3 includes 13 instructions, 11 of which are suited for SIMD or x87 style programming. Checking for support of these SSE3 instructions is similar to checking for SSE support. The OS requirements for SSE3 support are the same as the requirements for SSE.
To check whether your system supports the x87 and SIMD instructions of SSE3, follow these steps:
1. Check that your processor has the CPUID instruction.
2. Check ECX feature bit 0 of CPUID for SSE3 technology existence.
Example 4-4 shows how to find the SSE3 feature bit (bit 0 of ECX) in the CPUID feature flags.
Example 4-3. Identification of SSE2 with CPUID
; identify existence of cpuid instruction
; identify signature is genuine Intel
mov eax, 1 ; Request for feature flags
cpuid ; 0FH, 0A2H CPUID instruction
test edx, 004000000h ; Is bit 26 in feature flags equal to 1?
jnz Found
Example 4-4. Identification of SSE3 with CPUID
; identify existence of cpuid instruction
; identify signature is genuine Intel
mov eax, 1 ; Request for feature flags
cpuid ; 0FH, 0A2H CPUID instruction
test ecx, 000000001h ; Is bit 0 in feature flags equal to 1?
jnz Found
Software must check for support of MONITOR and MWAIT before attempting to use MONITOR and MWAIT. Detecting the availability of MONITOR and MWAIT can be done using a code sequence similar to Example 4-4. The availability of MONITOR and MWAIT is indicated by bit 3 of the returned value in ECX.
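A sketch of that test (MSVC-style inline assembly wrapped in C; the function name is illustrative):

int monitor_mwait_supported(void)
{
    int supported;
    __asm {
        mov  eax, 1            ; request feature flags
        cpuid                  ; feature flags returned in ECX and EDX
        and  ecx, 00000008h    ; isolate bit 3: MONITOR/MWAIT
        mov  supported, ecx    ; nonzero if the feature is present
    }
    return supported != 0;
}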
4.1.5 Checking for Supplemental Streaming SIMD Extensions 3 Support
Checking for support of SSSE3 is similar to checking for SSE support. The OS requirements for SSSE3 support are the same as the requirements for SSE.
To check whether your system supports SSSE3, follow these steps:
1. Check that your processor has the CPUID instruction.
2. Check the feature bits of CPUID for SSSE3 technology existence.
Example 4-5 shows how to find the SSSE3 feature bit in the CPUID feature flags.
4.2 CONSIDERATIONS FOR CODE CONVERSION TO SIMD PROGRAMMING
The VTune Performance Enhancement Environment CD provides tools to aid in the evaluation and tuning. Before implementing them, you need answers to the following questions:
1. Will the current code benefit by using MMX technology, Streaming SIMD Extensions, Streaming SIMD Extensions 2, Streaming SIMD Extensions 3, or Supplemental Streaming SIMD Extensions 3?
2. Is this code integer or floating-point?
3. What integer word size or floating-point precision is needed?
4. What coding techniques should I use?
5. What guidelines do I need to follow?
6. How should I arrange and align the datatypes?
Figure 4-1 provides a flowchart for the process of converting code to MMX technology, SSE, SSE2, SSE3, or SSSE3.
Example 4-5. Identification of SSSE3 with CPUID
; identify existence of cpuid instruction
; identify signature is genuine Intel
mov eax, 1 ; Request for feature flags
cpuid ; 0FH, 0A2H CPUID instruction
test ecx, 000000200h ; Is ECX bit 9 equal to 1?
jnz Found
Figure 4-1. Converting to Streaming SIMD Extensions Chart
[Flowchart summary: identify hot spots in the code and determine whether the code benefits from SIMD; if not, stop. Determine whether the hot spot is integer or floating-point; integer code is changed to use SIMD integer instructions. For floating-point code, determine why floating point is used (range or precision); if the data can be converted to integer, use SIMD integer, and if it can be converted to single precision, use single-precision SIMD. If possible, re-arrange data for SIMD efficiency, align data structures, convert the code to use SIMD technologies, follow the general and SIMD coding guidelines, use memory optimizations and prefetch if appropriate, and schedule instructions to optimize performance.]
To use any of the SIMD technologies optimally, you must evaluate the following situations in your code:
• Fragments that are computationally intensive
• Fragments that are executed often enough to have an impact on performance
• Fragments with little data-dependent control flow
• Fragments that require floating-point computations
• Fragments that can benefit from moving data 16 bytes at a time
• Fragments of computation that can be coded using fewer instructions
• Fragments that require help in using the cache hierarchy efficiently
4.2.1 Identifying Hot Spots
To optimize performance, use the VTune Performance Analyzer to find sections of code that occupy most of the computation time. Such sections are called the hotspots. See Appendix A, "Application Performance Tools."
The VTune analyzer provides a hotspots view of a specific module to help you identify sections in your code that take the most CPU time and that have potential performance problems. You can change the view to show hotspots by memory location, functions, classes, or source files. You can double-click on a hotspot and open the source or assembly view for the hotspot and see more detailed information about the performance of each instruction in the hotspot.
The VTune analyzer offers focused analysis and performance data at all levels of your source code and can also provide advice at the assembly language level. The code coach analyzes and identifies opportunities for better performance of C/C++, Fortran and Java* programs, and suggests specific optimizations. Where appropriate, the coach displays pseudo-code to suggest the use of highly optimized intrinsics and functions in the Intel® Performance Library Suite. Because the VTune analyzer is designed specifically for Intel architecture (IA)-based processors, including the Pentium 4 processor, it can offer detailed approaches to working with IA. See Appendix A.1.1, "Recommended Optimization Settings for Intel 64 and IA-32 Processors," for details.
4.2.2 Determine If Code Benefits by Conversion to SIMD Execution
Identifying code that benefits by using SIMD technologies can be time-consuming and difficult. Likely candidates for conversion are applications that are highly computation intensive, such as the following:
• Speech compression algorithms and filters
• Speech recognition algorithms
• Video display and capture routines
• Rendering routines
• 3D graphics (geometry)
• Image and video processing algorithms
• Spatial (3D) audio
• Physical modeling (graphics, CAD)
• Workstation applications
• Encryption algorithms
• Complex arithmetic
Generally, good candidate code is code that contains small-sized repetitive loops that operate on sequential arrays of integers of 8, 16 or 32 bits, single-precision 32-bit floating-point data, or double-precision 64-bit floating-point data (integer and floating-point data items should be sequential in memory). The repetitiveness of these loops incurs costly application processing time. However, these routines have potential for increased performance when you convert them to use one of the SIMD technologies.
Once you identify your opportunities for using a SIMD technology, you must evaluate what should be done to determine whether the current algorithm or a modified one will ensure the best performance.
4.3 CODING TECHNIQUES
The SIMD features of SSE3, SSE2, SSE, and MMX technology require new methods of coding algorithms. One of them is vectorization. Vectorization is the process of transforming sequentially-executing, or scalar, code into code that can execute in parallel, taking advantage of the SIMD architecture parallelism. This section discusses the coding techniques available for an application to make use of the SIMD architecture.
To vectorize your code and thus take advantage of the SIMD architecture, do the following:
• Determine if the memory accesses have dependencies that would prevent parallel execution.
• "Strip-mine" the inner loop to reduce the iteration count by the length of the SIMD operations (for example, four for single-precision floating-point SIMD, eight for 16-bit integer SIMD on the XMM registers); a sketch appears below.
• Re-code the loop with the SIMD instructions.
Each of these actions is discussed in detail in the subsequent sections of this chapter. These sections also discuss enabling automatic vectorization using the Intel C++ Compiler.
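As a minimal illustration of strip-mining (the names are illustrative, and n is assumed to be a multiple of four):

void scale(float *a, float s, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {   /* iteration count reduced by 4x */
        a[i]     *= s;             /* the four statements in the body */
        a[i + 1] *= s;             /* now form one SIMD-width group,  */
        a[i + 2] *= s;             /* a candidate for a single packed */
        a[i + 3] *= s;             /* SSE multiply                    */
    }
}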
4.3.1 Coding Methodologies
Software developers need to compare the performance improvement that can be obtained from assembly code versus the cost of those improvements. Programming directly in assembly language for a target platform may produce the required performance gain; however, assembly code is not portable between processor architectures and is expensive to write and maintain.
Performance objectives can be met by taking advantage of the different SIMD technologies using high-level languages as well as assembly. The new C/C++ language extensions designed specifically for SSSE3, SSE3, SSE2, SSE, and MMX technology help make this possible.
Figure 4-2 illustrates the trade-offs involved in the performance of hand-coded assembly versus the ease of programming and portability.
The examples that follow illustrate the use of coding adjustments to enable the algorithm to benefit from the SSE. The same techniques may be used for single-precision floating-point, double-precision floating-point, and integer data under SSSE3, SSE3, SSE2, SSE, and MMX technology.
Figure 4-2. Hand-Coded Assembly and High-Level Compiler Performance Trade-offs
[Graph: performance on the vertical axis versus ease of programming/portability on the horizontal axis; performance decreases from assembly, to intrinsics, to C/C++/Fortran with automatic vectorization.]
As a basis for the usage model discussed in this section, consider a simple loop shown in Example 4-6.
Note that the loop runs for only four iterations. This allows a simple replacement of the code with Streaming SIMD Extensions.
For the optimal use of the Streaming SIMD Extensions that need data alignment on the 16-byte boundary, all examples in this chapter assume that the arrays passed to the routine, a, b, c, are aligned to 16-byte boundaries by a calling routine. For the methods to ensure this alignment, please refer to the application notes for the Pentium 4 processor.
The sections that follow provide details on the coding methodologies: inlined assembly, intrinsics, C++ vector classes, and automatic vectorization.
4.3.1.1 Assembly
Key loops can be coded directly in assembly language using an assembler or by using inlined assembly (C-asm) in C/C++ code. The Intel compiler and assembler recognize the new instructions and registers and directly generate the corresponding code. This model offers the opportunity for attaining the greatest performance, but this performance is not portable across the different processor architectures.
Example 4-6. Simple Four-Iteration Loop
void add(float *a, float *b, float *c)
{
int i;
for (i = 0; i < 4; i++) {
c[i] = a[i] + b[i];
}
}
Example 4-7 shows the Streaming SIMD Extensions inlined assembly encoding.
4.3.1.2 Intrinsics
Intrinsics provide access to the ISA functionality using C/C++ style coding instead of assembly language. Intel has defined three sets of intrinsic functions that are implemented in the Intel C++ Compiler to support the MMX technology, Streaming SIMD Extensions and Streaming SIMD Extensions 2. Four new C data types, representing 64-bit and 128-bit objects, are used as the operands of these intrinsic functions. __m64 is used for MMX integer SIMD, __m128 is used for single-precision floating-point SIMD, __m128i is used for Streaming SIMD Extensions 2 integer SIMD, and __m128d is used for double-precision floating-point SIMD. These types enable the programmer to choose the implementation of an algorithm directly, while allowing the compiler to perform register allocation and instruction scheduling where possible. The intrinsics are portable among all Intel architecture-based processors supported by a compiler.
The use of intrinsics allows you to obtain performance close to the levels achievable with assembly. The cost of writing and maintaining programs with intrinsics is considerably less. For a detailed description of the intrinsics and their use, refer to the Intel C++ Compiler documentation.
Example 4-7. Streaming SIMD Extensions Using Inlined Assembly Encoding
void add(float *a, float *b, float *c)
{
__asm {
mov eax, a
mov edx, b
mov ecx, c
movaps xmm0, XMMWORD PTR [eax]
addps xmm0, XMMWORD PTR [edx]
movaps XMMWORD PTR [ecx], xmm0
}
}
Example 4-8 shows the loop from Example 4-6 using intrinsics.
Example 4-8. Simple Four-Iteration Loop Coded with Intrinsics
#include <xmmintrin.h>
void add(float *a, float *b, float *c)
{
__m128 t0, t1;
t0 = _mm_load_ps(a);
t1 = _mm_load_ps(b);
t0 = _mm_add_ps(t0, t1);
_mm_store_ps(c, t0);
}
The intrinsics map one-to-one with actual Streaming SIMD Extensions assembly code. The xmmintrin.h header file in which the prototypes for the intrinsics are defined is part of the Intel C++ Compiler included with the VTune Performance Enhancement Environment CD.
Intrinsics are also defined for the MMX technology ISA. These are based on the __m64 data type to represent the contents of an mm register. You can specify values in bytes, short integers, 32-bit values, or as a 64-bit object.
The intrinsic data types, however, are not a basic ANSI C data type, and therefore you must observe the following usage restrictions:
• Use intrinsic data types only on the left-hand side of an assignment, as a return value, or as a parameter. You cannot use them with other arithmetic expressions (for example, +, >>).
• Use intrinsic data type objects in aggregates, such as unions to access the byte elements and structures; the address of an __m64 object may also be used.
• Use intrinsic data type data only with the MMX technology intrinsics described in this guide.
For complete details of the hardware instructions, see the Intel Architecture MMX Technology Programmer's Reference Manual. For a description of data types, see the Intel 64 and IA-32 Architectures Software Developer's Manual.
4.3.1.3 Classes
A set of C++ classes has been defined and is available in the Intel C++ Compiler to provide both a higher-level abstraction and more flexibility for programming with MMX technology, Streaming SIMD Extensions and Streaming SIMD Extensions 2. These classes provide an easy-to-use and flexible interface to the intrinsic functions, allowing developers to write more natural C++ code without worrying about which intrinsic or assembly language instruction to use for a given operation. Since the intrinsic functions underlie the implementation of these C++ classes, the performance of applications using this methodology can approach that of one using the intrinsics. Further details on the use of these classes can be found in the Intel C++ Class Libraries for SIMD Operations User's Guide, order number 693500.
Example 4-9 shows the C++ code using a vector class library. The example assumes the arrays passed to the routine are already aligned to 16-byte boundaries.
Example 4-9. C++ Code Using the Vector Classes
#include <fvec.h>
void add(float *a, float *b, float *c)
{
F32vec4 *av=(F32vec4 *) a;
F32vec4 *bv=(F32vec4 *) b;
F32vec4 *cv=(F32vec4 *) c;
*cv=*av + *bv;
}
Here, fvec.h is the class definition file and F32vec4 is the class representing an array of four floats. The + and = operators are overloaded so that the actual Streaming SIMD Extensions implementation in the previous example is abstracted out, or hidden, from the developer. Note how much more this resembles the original code, allowing for simpler and faster programming.
4.3.1.4 Automatic Vectorization
The Intel C++ Compiler provides an optimization mechanism by which loops, such as the one in Example 4-6, can be automatically vectorized, or converted into Streaming SIMD Extensions code. The compiler uses similar techniques to those used by a programmer to identify whether a loop is suitable for conversion to SIMD. This involves determining whether the following might prevent vectorization:
• The layout of the loop and the data structures used
• Dependencies amongst the data accesses in each iteration and across iterations
Once the compiler has made such a determination, it can generate vectorized code for the loop, allowing the application to use the SIMD instructions.
The caveat to this is that only certain types of loops can be automatically vectorized, and in most cases user interaction with the compiler is needed to fully enable this.
Example 4-10 shows the code for automatic vectorization for the simple four-iteration loop (from Example 4-6).
Example 4-10. Automatic Vectorization for a Simple Loop
void add (float *restrict a,
float *restrict b,
float *restrict c)
{
int i;
for (i = 0; i < 4; i++) {
c[i] = a[i] + b[i];
}
}
Compile this code using the -Qax and -Qrestrict switches of the Intel C++ Compiler, version 4.0 or later.
The restrict qualifier in the argument list is necessary to let the compiler know that there are no other aliases to the memory to which the pointers point. In other words, the pointer for which it is used provides the only means of accessing the memory in question in the scope in which the pointers live. Without the restrict qualifier, the compiler will still vectorize this loop using runtime data dependence testing, where the generated code dynamically selects between sequential or vector execution of the loop, based on overlap of the parameters (see documentation for the Intel C++ Compiler). The restrict keyword avoids the associated overhead altogether.
See the Intel C++ Compiler documentation for details.
4.4 STACK AND DATA ALIGNMENT
To get the most performance out of code written for SIMD technologies, data should be formatted in memory according to the guidelines described in this section. Assembly code that uses unaligned accesses is significantly slower than code with aligned accesses.
4.4.1 Alignment and Contiguity of Data Access Patterns
The 64-bit packed data types defined by MMX technology, and the 128-bit packed data types for Streaming SIMD Extensions and Streaming SIMD Extensions 2, create more potential for misaligned data accesses. The data access patterns of many algorithms are inherently misaligned when using MMX technology and Streaming SIMD Extensions. Several techniques for improving data access, such as padding and organizing data elements into arrays, are described below. SSE3 provides a special-purpose instruction, LDDQU, that can avoid cache line splits; it is discussed in Section 5.7.1.1, "Supplemental Techniques for Avoiding Cache Line Splits."
4.4.1.1 Using Padding to Align Data
When accessing SIMD data using SIMD operations, access to data can often be improved simply by a change in the declaration. For example, consider a declaration of a structure which represents a point in space plus an attribute.
typedef struct {short x,y,z; char a;} Point;
Point pt[N];
Assume we will be performing a number of computations on x, y, z in three of the four elements of a SIMD word; see Section 4.5.1, "Data Structure Layout," for an example. Even if the first element in array pt is aligned, the second element will start 7 bytes later and not be aligned (3 shorts at two bytes each plus a single byte = 7 bytes).
By adding the padding variable pad, the structure is now 8 bytes, and if the first element is aligned to 8 bytes (64 bits), all following elements will also be aligned. The sample declaration follows:
typedef struct {short x,y,z; char a; char pad;} Point;
Point pt[N];
4.4.1.2 Using Arrays to Make Data Contiguous
In the following code,
for (i=0; i<N; i++) pt[i].y *= scale;
the second dimension y needs to be multiplied by a scaling value. Here, the for loop accesses each y dimension in the array pt, thus disallowing access to contiguous data. This can degrade the performance of the application by increasing cache misses, by poor utilization of each cache line that is fetched, and by increasing the chance for accesses which span multiple cache lines.
The following declaration allows you to vectorize the scaling operation and further improve the alignment of the data access patterns:
short ptx[N], pty[N], ptz[N];
for (i=0; i<N; i++) pty[i] *= scale;
With the SIMD technology, the choice of data organization becomes more important and should be made carefully based on the operations that will be performed on the data. In some applications, traditional data arrangements may not lead to the maximum performance.
A simple example of this is an FIR filter. An FIR filter is effectively a vector dot product in the length of the number of coefficient taps. Consider the following code:
(data [ j ] *coeff [0] + data [j+1]*coeff [1]+...+data [j+num of taps-1]*coeff [num of taps-1]),
If in the code above the filter operation of data element i is the vector dot product that begins at data element j, then the filter operation of data element i+1 begins at data element j+1.
Assuming you have a 64-bit aligned data vector and a 64-bit aligned coefficients vector, the filter operation on the first data element will be fully aligned. For the second data element, however, access to the data vector will be misaligned. For an example of how to avoid the misalignment problem in the FIR filter, refer to Intel application notes on Streaming SIMD Extensions and filters.
Duplication and padding of data structures can be used to avoid the problem of data accesses in algorithms which are inherently misaligned. Section 4.5.1, "Data Structure Layout," discusses trade-offs for organizing data structures.
NOTE
The duplication and padding technique overcomes the misalignment problem, thus avoiding the expensive penalty for misaligned data access, at the cost of increasing the data size. When developing your code, you should consider this tradeoff and use the option which gives the best performance.
4.4.2 Stack Alignment For 128-bit SIMD Technologies
For best performance, the Streaming SIMD Extensions and Streaming SIMD Extensions 2 require their memory operands to be aligned to 16-byte boundaries. Unaligned data can cause significant performance penalties compared to aligned data. However, the existing software conventions for IA-32 (STDCALL, CDECL, FASTCALL) as implemented in most compilers do not provide any mechanism for ensuring that certain local data and certain parameters are 16-byte aligned. Therefore, Intel has defined a new set of IA-32 software conventions for alignment to support the new __m128* datatypes (__m128, __m128d, and __m128i). These meet the following conditions:
• Functions that use Streaming SIMD Extensions or Streaming SIMD Extensions 2 data need to provide a 16-byte aligned stack frame.
• __m128* parameters need to be aligned to 16-byte boundaries, possibly creating "holes" (due to padding) in the argument block.
The new conventions presented in this section as implemented by the Intel C++ Compiler can be used as a guideline for assembly language code as well. In many cases, this section assumes the use of the __m128 data type, as defined by the Intel C++ Compiler, which represents an array of four 32-bit floats.
For more details on the stack alignment for Streaming SIMD Extensions and SSE2, see Appendix D, "Stack Alignment."
4.4.3 Data Alignment for MMX Technology
Many compilers enable alignment of variables using controls. This aligns the variables' bit lengths to the appropriate boundaries. If some of the variables are not appropriately aligned as specified, you can align them using the C algorithm shown in Example 4-11.
Example 4-11. C Algorithm for 64-bit Data Alignment
/* Make newp a pointer to a 64-bit aligned array of NUM_ELEMENTS 64-bit elements. */
double *p, *newp;
p = (double*)malloc (sizeof(double)*(NUM_ELEMENTS+1));
newp = (double*)(((unsigned long)p + 7) & ~0x7);
The algorithm in Example 4-11 aligns an array of 64-bit elements on a 64-bit boundary; the address arithmetic is done on an integer copy of the pointer, since C does not permit bitwise AND directly on a pointer. The constant of 7 is derived from one less than the number of bytes in a 64-bit element, or 8-1. Aligning data in this manner avoids the significant performance penalties that can occur when an access crosses a cache line boundary.
Another way to improve data alignment is to copy the data into locations that are aligned on 64-bit boundaries. When the data is accessed frequently, this can provide a significant performance improvement.
4.4.4 Data Alignment for 128-bit Data
Data must be 16-byte aligned when loading to and storing from the 128-bit XMM registers used by SSE/SSE2/SSE3/SSSE3. This must be done to avoid severe performance penalties and, at worst, execution faults.
There are MOVE instructions (and intrinsics) that allow unaligned data to be copied into and out of XMM registers when not using aligned data, but such operations are much slower than aligned accesses. If data is not 16-byte-aligned and the programmer or the compiler does not detect this and uses the aligned instructions, a fault occurs. So keep data 16-byte-aligned. Such alignment also works for MMX technology code, even though MMX technology only requires 8-byte alignment.
The following describes alignment techniques for the Pentium 4 processor as implemented with the Intel C++ Compiler.
4.4.4.1 Compiler-Supported Alignment
The Intel C++ Compiler provides the following methods to ensure that the data is aligned.
Alignment by F32vec4 or __m128 Data Types
When the compiler detects F32vec4 or __m128 data declarations or parameters, it forces alignment of the object to a 16-byte boundary for both global and local data, as well as parameters. If the declaration is within a function, the compiler also aligns the function's stack frame to ensure that local data and parameters are 16-byte-aligned. For details on the stack frame layout that the compiler generates for both debug and optimized (release-mode) compilations, refer to Intel's compiler documentation.
__declspec(align(16)) Specifications
These can be placed before data declarations to force 16-byte alignment. This is useful for local or global data declarations that are assigned to 128-bit data types. The syntax for it is
__declspec(align(integer-constant))
where the integer-constant is an integral power of two but no greater than 32. For example, the following increases the alignment to 16 bytes:
__declspec(align(16)) float buffer[400];
The variable buffer could then be used as if it contained 100 objects of type __m128 or F32vec4. In the code below, the construction of the F32vec4 object, x, will occur with aligned data.
void foo() {
F32vec4 x = *(__m128 *) buffer;
...
}
Without the declaration of __declspec(align(16)), a fault may occur.
Alignment by Using a UNION Structure
When feasible, a union can be used with 128-bit data types to allow the compiler to align the data structure by default. This is preferred to forcing alignment with __declspec(align(16)) because it exposes the true program intent to the compiler in that __m128 data is being used. For example:
union {
float f[400];
__m128 m[100];
} buffer;
Now, 16-byte alignment is used by default due to the __m128 type in the union; it is not necessary to use __declspec(align(16)) to force the result.
In C++ (but not in C) it is also possible to force the alignment of a class/struct/union type, as in the code that follows:
struct __declspec(align(16)) my_m128
{
float f[4];
};
If the data in such a class is going to be used with the Streaming SIMD Extensions or Streaming SIMD Extensions 2, it is preferable to use a union to make this explicit. In C++, an anonymous union can be used to make this more convenient:
class my_m128 {
union {
__m128 m;
float f[4];
};
};
Because the union is anonymous, the names, m and f, can be used as immediate member names of my_m128. Note that __declspec(align) has no effect when applied to a class, struct, or union member in either C or C++.
Alignment by Using __m64 or DOUBLE Data
In some cases, the compiler aligns routines with __m64 or double data to 16 bytes by default. The command-line switch, -Qsfalign16, limits the compiler so that it only performs this alignment on routines that contain 128-bit data. The default behavior is to use -Qsfalign8, which instructs the compiler to align routines with 8- or 16-byte data types to 16 bytes.
For more, see the Intel C++ Compiler documentation.
4.5 IMPROVING MEMORY UTILIZATION
Memory performance can be improved by rearranging data and algorithms for SSE, SSE2, and MMX technology intrinsics. Methods for improving memory performance involve working with the following:
• Data structure layout
• Strip-mining for vectorization and memory utilization
• Loop-blocking
Using the cacheability instructions, prefetch and streaming store, also greatly enhances memory utilization. See also: Chapter 9, "Optimizing Cache Usage."
4.5.1 Data Structure Layout
For certain algorithms, like 3D transformations and lighting, there are two basic ways to arrange vertex data. The traditional method is the array of structures (AoS) arrangement, with a structure for each vertex (Example 4-12). However, this method does not take full advantage of SIMD technology capabilities.
The best processing method for code using SIMD technology is to arrange the data in an array for each coordinate (Example 4-13). This data arrangement is called structure of arrays (SoA).
There are two options for computing data in AoS format: perform operation on the data as it stands in AoS format, or re-arrange it (swizzle it) into SoA format dynamically. See Example 4-14 for code samples of each option based on a dot-product computation.
Example 4-12. AoS Data Structure
typedef struct{
float x,y,z;
int a,b,c;
. . .
} Vertex;
Vertex Vertices[NumOfVertices];
Example 4-13. SoA Data Structure
typedef struct{
float x[NumOfVertices];
float y[NumOfVertices];
float z[NumOfVertices];
int a[NumOfVertices];
int b[NumOfVertices];
int c[NumOfVertices];
. . .
} VerticesList;
VerticesList Vertices;
Example 4-14. AoS and SoA Code Samples
; The dot product of an array of vectors (Array) and a fixed vector (Fixed) is a
; common operation in 3D lighting operations, where Array = (x0,y0,z0),(x1,y1,z1),...
; and Fixed = (xF,yF,zF)
; A dot product is defined as the scalar quantity d0 = x0*xF + y0*yF + z0*zF.
;
; AoS code
; All values marked DC are don't-care.
; In the AoS model, the vertices are stored in the xyz format
movaps xmm0, Array ; xmm0 = DC, x0, y0, z0
movaps xmm1, Fixed ; xmm1 = DC, xF, yF, zF
mulps xmm0, xmm1 ; xmm0 = DC, x0*xF, y0*yF, z0*zF
movhlps xmm1, xmm0 ; xmm1 = DC, DC, DC, x0*xF
addps xmm1, xmm0 ; xmm1 = DC, DC, DC, x0*xF+z0*zF
movaps xmm2, xmm0
shufps xmm2, xmm2, 55h ; xmm2 = DC, DC, DC, y0*yF
addps xmm2, xmm1 ; xmm2 = DC, DC, DC, x0*xF+y0*yF+z0*zF

; SoA code
; X = x0,x1,x2,x3
; Y = y0,y1,y2,y3
; Z = z0,z1,z2,z3
; A = xF,xF,xF,xF
; B = yF,yF,yF,yF
; C = zF,zF,zF,zF
movaps xmm0, X ; xmm0 = x0,x1,x2,x3
movaps xmm1, Y ; xmm1 = y0,y1,y2,y3
movaps xmm2, Z ; xmm2 = z0,z1,z2,z3
mulps xmm0, A ; xmm0 = x0*xF, x1*xF, x2*xF, x3*xF
mulps xmm1, B ; xmm1 = y0*yF, y1*yF, y2*yF, y3*yF
mulps xmm2, C ; xmm2 = z0*zF, z1*zF, z2*zF, z3*zF
addps xmm0, xmm1
addps xmm0, xmm2 ; xmm0 = (x0*xF+y0*yF+z0*zF), ...
Performing SIMD operations on the original AoS format can require more calculations
and some operations do not take advantage of all SIMD elements available. Therefore,
this option is generally less efficient.
The recommended way for computing data in AoS format is to swizzle each set of
elements to SoA format before processing it using SIMD technologies. Swizzling can
either be done dynamically during program execution or statically when the data
structures are generated. See Chapter 5 and Chapter 6 for examples. Performing the
swizzle dynamically is usually better than using AoS, but can be somewhat inefficient
because there are extra instructions during computation. Performing the swizzle
statically, when data structures are being laid out, is best as there is no runtime
overhead.
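As an illustration of the dynamic approach, the following minimal sketch swizzles four
AoS vertices into SoA registers using the _MM_TRANSPOSE4_PS macro from
xmmintrin.h. The padded Vertex4 layout and the function name are illustrative
assumptions, not taken from this manual's examples.

#include <xmmintrin.h>                    /* SSE intrinsics */

typedef struct { float x, y, z, pad; } Vertex4;  /* hypothetical padded vertex */

/* Swizzle four AoS vertices into three SoA registers. */
void swizzle4(const Vertex4 *v, __m128 *X, __m128 *Y, __m128 *Z)
{
    __m128 r0 = _mm_loadu_ps(&v[0].x);    /* r0 = x0 y0 z0 pad0 */
    __m128 r1 = _mm_loadu_ps(&v[1].x);    /* r1 = x1 y1 z1 pad1 */
    __m128 r2 = _mm_loadu_ps(&v[2].x);
    __m128 r3 = _mm_loadu_ps(&v[3].x);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);    /* 4x4 transpose; rows become columns */
    *X = r0;                              /* X = x0 x1 x2 x3 */
    *Y = r1;                              /* Y = y0 y1 y2 y3 */
    *Z = r2;                              /* Z = z0 z1 z2 z3 */
}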
As mentioned earlier, the SoA arrangement allows more efficient use of the parallelism
of SIMD technologies because the data is ready for computation in a more
optimal vertical manner.
; In the AoS model, the vertices are stored in the xyz format
movaps xmm0, Array      ; xmm0 = DC, x0, y0, z0
movaps xmm1, Fixed      ; xmm1 = DC, xF, yF, zF
mulps xmm0, xmm1        ; xmm0 = DC, x0*xF, y0*yF, z0*zF
movhlps xmm1, xmm0      ; xmm1 = DC, DC, DC, x0*xF
addps xmm1, xmm0        ; xmm1 = DC, DC, DC,
                        ; x0*xF+z0*zF
movaps xmm2, xmm0
shufps xmm2, xmm2, 55h  ; xmm2 = DC, DC, DC, y0*yF
addps xmm2, xmm1        ; xmm2 = DC, DC, DC,
                        ; x0*xF+y0*yF+z0*zF
; SoA code
; X = x0,x1,x2,x3
; Y = y0,y1,y2,y3
; Z = z0,z1,z2,z3
; A = xF,xF,xF,xF
; B = yF,yF,yF,yF
; C = zF,zF,zF,zF
movaps xmm0, X   ; xmm0 = x0,x1,x2,x3
movaps xmm1, Y   ; xmm1 = y0,y1,y2,y3
movaps xmm2, Z   ; xmm2 = z0,z1,z2,z3
mulps xmm0, A    ; xmm0 = x0*xF, x1*xF, x2*xF, x3*xF
mulps xmm1, B    ; xmm1 = y0*yF, y1*yF, y2*yF, y3*yF
mulps xmm2, C    ; xmm2 = z0*zF, z1*zF, z2*zF, z3*zF
addps xmm0, xmm1
addps xmm0, xmm2 ; xmm0 = (x0*xF+y0*yF+z0*zF), ...
Example 4-14. AoS and SoA Code Samples (Contd.)
Multiplying components x0, x1, x2, x3 by xF, xF, xF, xF uses 4 SIMD execution slots
to produce 4 unique results. In contrast, computing directly
on AoS data can lead to horizontal operations that consume SIMD execution slots but
produce only a single scalar result (as shown by the many don't-care (DC) slots in
Example 4-14).
Use of the SoA format for data structures can lead to more efficient use of caches and
bandwidth. When the elements of the structure are not accessed with equal
frequency, such as when elements x, y, z are accessed ten times more often than the
other entries, then SoA saves memory and prevents fetching unnecessary data items
a, b, and c.
Note that SoA can have the disadvantage of requiring more independent memory
stream references. A computation that uses arrays X, Y, and Z (see Example 4-13)
would require three separate data streams. This can require the use of more
prefetches and additional address generation calculations, as well as having a greater
impact on DRAM page access efficiency.
There is an alternative: a hybrid SoA approach blends the two alternatives (see
Example 4-15). In this case, only 2 separate address streams are generated and
referenced: one contains XXXX, YYYY, ZZZZ, XXXX, YYYY, ZZZZ, ... and the other
contains AAAA, BBBB, CCCC, AAAA, BBBB, CCCC, .... This approach prevents fetching
unnecessary data, assuming the variables X, Y, Z are always used together; whereas
the variables A, B, C would also be used together, but not at the same time as X, Y, Z.
The hybrid SoA approach ensures:
• Data is organized to enable more efficient vertical SIMD computation
• Simpler/less address generation than AoS
• Fewer streams, which reduces DRAM page misses
Example 4-15. Hybrid SoA Data Structure
NumOfGroups = NumOfVertices/SIMDwidth
typedef struct{
float x[SIMDwidth];
float y[SIMDwidth];
float z[SIMDwidth];
} VerticesCoordList;
typedef struct{
int a[SIMDwidth];
int b[SIMDwidth];
int c[SIMDwidth];
. . .
} VerticesColorList;
VerticesCoordList VerticesCoord[NumOfGroups];
VerticesColorList VerticesColor[NumOfGroups];
• Use of fewer prefetches, due to fewer streams
• Efficient cache line packing of data elements that are used concurrently
With the advent of the SIMD technologies, the choice of data organization becomes
more important and should be carefully based on the operations to be performed on
the data. This will become increasingly important in the Pentium 4 processor and
future processors. In some applications, traditional data arrangements may not lead
to the maximum performance. Application developers are encouraged to explore
different data arrangements and data segmentation policies for efficient computation.
This may mean using a combination of AoS, SoA, and hybrid SoA in a given
application.
4.5.2 Strip-Mining
Strip-mining, also known as loop sectioning, is a loop transformation technique for
enabling SIMD-encodings of loops, as well as providing a means of improving
memory performance. First introduced for vectorizers, this technique consists of the
generation of code when each vector operation is done for a size less than or equal to
the maximum vector length on a given vector machine. By fragmenting a large loop
into smaller segments or strips, this technique transforms the loop structure by:
• Increasing the temporal and spatial locality in the data cache if the data are
reusable in different passes of an algorithm.
• Reducing the number of iterations of the loop by a factor of the length of each
vector, or number of operations being performed per SIMD operation. In the
case of Streaming SIMD Extensions, this vector or strip-length is reduced by 4
times: four floating-point data items per single Streaming SIMD Extensions
single-precision floating-point SIMD operation are processed. Consider
Example 4-16.
Example 4-16. Pseudo-code Before Strip Mining
typedef struct _VERTEX {
float x, y, z, nx, ny, nz, u, v;
} Vertex_rec;
main()
{
Vertex_rec v[Num];
....
for (i=0; i<Num; i++) {
Transform(v[i]);
}
The main loop consists of two functions: transformation and lighting. For each object,
the main loop calls a transformation routine to update some data, then calls the
lighting routine to further work on the data. If the size of array v[Num] is larger than
the cache, then the coordinates for v[i] that were cached during Transform(v[i])
will be evicted from the cache by the time we do Lighting(v[i]). This means that
v[i] will have to be fetched from main memory a second time, reducing performance.
In Example 4-17, the computation has been strip-mined to a size strip_size. The
value strip_size is chosen such that strip_size elements of array v[Num] fit into
the cache hierarchy. By doing this, a given element v[i] brought into the cache by
Transform(v[i]) will still be in the cache when we perform Lighting(v[i]), and
thus improve performance over the non-strip-mined code.
4.5.3 Loop Blocking
Loop blocking is another useful technique for memory performance optimization. The
main purpose of loop blocking is also to eliminate as many cache misses as possible.
This technique transforms the memory domain of a given problem into smaller
chunks rather than sequentially traversing through the entire memory domain. Each
chunk should be small enough to fit all the data for a given computation into the
cache, thereby maximizing data reuse. In fact, one can treat loop blocking as strip
mining in two or more dimensions.
Example 4-16. Pseudo-code Before Strip Mining (Contd.)
for (i=0; i<Num; i++) {
Lighting(v[i]);
}
....
}
Example 4-17. Strip Mined Code
main()
{
Vertex_rec v[Num];
....
for (i=0; i < Num; i+=strip_size) {
for (j=i; j < min(Num, i+strip_size); j++) {
Transform(v[j]);
}
for (j=i; j < min(Num, i+strip_size); j++) {
Lighting(v[j]);
}
}
}
Consider the code in Example 4-18 and the access pattern in Figure 4-3.
The two-dimensional array A is referenced in the j (column)
direction and then referenced in the i (row) direction (column-major order); whereas
array B is referenced in the opposite manner (row-major order). Assume the
memory layout is in column-major order; therefore, the access strides of array A and
B for the code in Example 4-18 would be 1 and MAX, respectively.
For the first iteration of the inner loop, each access to array B will generate a cache
miss. If the size of one row of array A, that is, A[2, 0:MAX-1], is large enough, by the
time the second iteration starts, each access to array B will always generate a cache
miss. For instance, on the first iteration, the cache line containing B[0, 0:7] will be
brought in when B[0,0] is referenced because the float type variable is four bytes and
each cache line is 32 bytes. Due to the limitation of cache capacity, this line will be
evicted due to conflict misses before the inner loop reaches the end. For the next
iteration of the outer loop, another cache miss will be generated while referencing
B[0,1]. In this manner, a cache miss occurs when each element of array B is
referenced; that is, there is no data reuse in the cache at all for array B.
This situation can be avoided if the loop is blocked with respect to the cache size. In
Figure 4-3, a block_size is selected as the loop blocking factor. Suppose that
block_size is 8; then the blocked chunk of each array will be eight cache lines
(32 bytes each). In the first iteration of the inner loop, A[0, 0:7] and B[0, 0:7] will be
brought into the cache. B[0, 0:7] will be completely consumed by the first iteration of
the outer loop. Consequently, B[0, 0:7] will only experience one cache miss after
applying the loop blocking optimization, in lieu of eight misses for the original algorithm.
Example 4-18. Loop Blocking
A. Original Loop
float A[MAX, MAX], B[MAX, MAX]
for (i=0; i< MAX; i++) {
for (j=0; j< MAX; j++) {
A[i,j] = A[i,j] + B[j, i];
}
}
B. Transformed Loop after Blocking
float A[MAX, MAX], B[MAX, MAX];
for (i=0; i< MAX; i+=block_size) {
for (j=0; j< MAX; j+=block_size) {
for (ii=i; ii<i+block_size; ii++) {
for (jj=j; jj<j+block_size; jj++) {
A[ii,jj] = A[ii,jj] + B[jj, ii];
}
}
}
}
As illustrated in Figure 4-3, arrays A and B are blocked into smaller rectangular
chunks so that the total size of two blocked A and B chunks is smaller than the cache
size. This allows maximum data reuse.
As one can see, all the redundant cache misses can be eliminated by applying this
loop blocking technique. If MAX is huge, loop blocking can also help reduce the
penalty from DTLB (data translation look-aside buffer) misses. In addition to
improving the cache/memory performance, this optimization technique also saves
external bus bandwidth.
Figure 4-3. Loop Blocking Access Pattern
4.6 INSTRUCTION SELECTION
The following section gives some guidelines for choosing instructions to complete a
task.
One barrier to SIMD computation can be the existence of data-dependent branches.
Conditional moves can be used to eliminate data-dependent branches.
Conditional moves can be emulated in SIMD computation by using masked compares
and logicals, as shown in Example 4-19.
Note that this method can be applied to both SIMD integer and SIMD floating-point code.
If there are multiple consumers of an instance of a register, group the consumers
together as closely as possible. However, the consumers should not be scheduled
near the producer.
4.6.1 SIMD Optimizations and Microarchitectures
Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture
than Intel NetBurst microarchitecture. The following sub-section discusses
optimizing SIMD code targeting Intel Core Solo and Intel Core Duo processors.
The register-register variant of the following instructions has improved performance
on Intel Core Solo and Intel Core Duo processors relative to Pentium M processors.
Example 4-19. Emulation of Conditional Moves
High-level code:
short A[MAX_ELEMENT], B[MAX_ELEMENT], C[MAX_ELEMENT], D[MAX_ELEMENT],
E[MAX_ELEMENT];
for (i=0; i<MAX_ELEMENT; i++) {
if (A[i] > B[i]) {
C[i] = D[i];
} else {
C[i] = E[i];
}
}
Assembly code:
xor eax, eax
top_of_loop:
movq mm0, [A + eax]
pcmpgtw mm0, [B + eax] ; Create compare mask
movq mm1, [D + eax]
pand mm1, mm0          ; Drop elements where A <= B
pandn mm0, [E + eax]   ; Drop elements where A > B
por mm0, mm1           ; Create a single result word
movq [C + eax], mm0
add eax, 8
cmp eax, MAX_ELEMENT*2
jl top_of_loop
This is because the instructions consist of two micro-ops instead of three. Relevant
instructions are: unpcklps, unpckhps, packsswb, packuswb, packssdw, pshufd,
shufps and shufpd.
Recommendation: When targeting code generation for Intel Core Solo and Intel
Core Duo processors, favor instructions consisting of two micro-ops over those with
more than two micro-ops.
Intel Core microarchitecture generally executes SIMD instructions more efficiently
than previous microarchitectures in terms of latency and throughput; many of the
restrictions specific to Intel Core Duo and Intel Core Solo processors do not apply.
The same is true of Intel Core microarchitecture relative to Intel NetBurst
microarchitecture.
4.7 TUNING THE FINAL APPLICATION
The best way to tune your application once it is functioning correctly is to use a
profiler that measures the application while it is running on a system. VTune analyzer
can help you determine where to make changes in your application to improve
performance. Using the VTune analyzer can help you with various phases required for
optimized performance. See Appendix A.2, "Intel VTune Performance Analyzer,"
for details. After every effort to optimize, you should check the performance gains to
see where you are making your major optimization gains.
CHAPTER 5
OPTIMIZING FOR SIMD INTEGER APPLICATIONS
SIMD integer instructions provide performance improvements in applications that
are integer-intensive and can take advantage of SIMD architecture.
Guidelines in this chapter for using SIMD integer instructions (in addition to those
described in Chapter 3) may be used to develop fast and efficient code that scales
across processors with MMX technology, processors that use Streaming SIMD
Extensions (SSE) SIMD integer instructions, as well as processors with the SIMD
integer instructions in SSE2, SSE3 and SSSE3.
The collection of 64-bit and 128-bit SIMD integer instructions supported by MMX
technology, SSE, SSE2, SSE3 and SSSE3 are referred to as SIMD integer instructions.
Code sequences in this chapter demonstrate the use of 64-bit SIMD integer
instructions and 128-bit SIMD integer instructions.
Processors based on Intel Core microarchitecture support MMX, SSE, SSE2, SSE3,
and SSSE3. Execution of 128-bit SIMD integer instructions in Intel Core microarchitecture
is substantially more efficient than equivalent implementations on previous
microarchitectures. Conversion from 64-bit SIMD integer code to 128-bit SIMD
integer code is highly recommended.
This chapter contains examples that will help you to get started with coding your
application. The goal is to provide simple, low-level operations that are frequently
used. The examples use a minimum number of instructions necessary to achieve
best performance on the current generation of Intel 64 and IA-32 processors.
Each example includes a short description, sample code, and notes if necessary.
These examples do not address scheduling as it is assumed the examples will be
incorporated in longer code sequences.
For planning considerations of using the new SIMD integer instructions, refer to
Section 4.1.3.
5.1 GENERAL RULES ON SIMD INTEGER CODE
General rules and suggestions are:
• Do not intermix 64-bit SIMD integer instructions with x87 floating-point instructions.
See Section 5.2, "Using SIMD Integer with x87 Floating-point." Note that
all SIMD integer instructions can be intermixed without penalty.
• Favor 128-bit SIMD integer code over 64-bit SIMD integer code. On previous
microarchitectures, most 128-bit SIMD instructions have two-cycle throughput
restrictions due to the underlying 64-bit data path in the execution engine. Intel
Core microarchitecture executes almost all SIMD instructions with one-cycle
throughput and provides three ports to execute multiple SIMD instructions in
parallel.
• When writing SIMD code that works for both integer and floating-point data, use
the subset of SIMD convert instructions or load/store instructions to ensure that
the input operands in XMM registers contain data types that are properly defined
to match the instruction.
• Code sequences containing cross-typed usage produce the same result across
different implementations but incur a significant performance penalty. Using
SSE/SSE2/SSE3/SSSE3 instructions to operate on type-mismatched SIMD data
in the XMM register is strongly discouraged.
• Use the optimization rules and guidelines described in Chapter 3 and Chapter 4.
• Take advantage of the hardware prefetcher where possible. Use the PREFETCH
instruction only when data access patterns are irregular and prefetch distance
can be pre-determined. See Chapter 9, "Optimizing Cache Usage."
• Emulate conditional moves by using masked compares and logicals instead of
using conditional branches.
5.2 USING SIMD INTEGER WITH X87 FLOATING-POINT
All 64-bit SIMD integer instructions use MMX registers, which share register state
with the x87 floating-point stack. Because of this sharing, certain rules and considerations
apply. Instructions using MMX registers cannot be freely intermixed with x87
floating-point registers. Take care when switching between 64-bit SIMD integer
instructions and x87 floating-point instructions. See Section 5.2.1, "Using the EMMS
Instruction."
SIMD floating-point operations and 128-bit SIMD integer operations can be freely
intermixed with either x87 floating-point operations or 64-bit SIMD integer operations.
SIMD floating-point operations and 128-bit SIMD integer operations use registers
that are unrelated to the x87 FP/MMX registers. The EMMS instruction is not
needed to transition to or from SIMD floating-point operations or 128-bit SIMD
operations.
5.2.1 Using the EMMS Instruction
When generating 64-bit SIMD integer code, keep in mind that the eight MMX registers
are aliased to the x87 floating-point registers. Switching from MMX instructions to
x87 floating-point instructions incurs a finite delay, so it is best to minimize
switching between these instruction types. But when switching, the EMMS instruction
provides an efficient means to clear the x87 stack so that subsequent x87 code can
operate properly.
As soon as an instruction makes reference to an MMX register, all valid bits in the x87
floating-point tag word are set, which implies that all x87 registers contain valid
values. In order for software to operate correctly, the x87 floating-point stack should
be emptied when starting a series of x87 floating-point calculations after operating
on the MMX registers.
Using EMMS clears all valid bits, effectively emptying the x87 floating-point stack and
making it ready for new x87 floating-point operations. The EMMS instruction ensures
a clean transition between using operations on the MMX registers and using operations
on the x87 floating-point stack. On the Pentium 4 processor, there is a finite
overhead for using the EMMS instruction.
Failure to use the EMMS instruction (or the _mm_empty() intrinsic) between operations
on the MMX registers and x87 floating-point registers may lead to unexpected
results.
NOTE
Failure to reset the tag word for FP instructions after using an MMX
instruction can result in faulty execution or poor performance.
5.2.2 Guidelines for Using EMMS Instruction
When developing code with both x87 floating-point and 64-bit SIMD integer instructions,
follow these steps:
1. Always call the EMMS instruction at the end of 64-bit SIMD integer code when the
code transitions to x87 floating-point code.
2. Insert the EMMS instruction at the end of all 64-bit SIMD integer code segments
to avoid an x87 floating-point stack overflow exception when an x87 floating-point
instruction is executed.
When writing an application that uses both floating-point and 64-bit SIMD integer
instructions, use the following guidelines to help you determine when to use EMMS:
• If next instruction is x87 FP: Use _mm_empty() after a 64-bit SIMD integer
instruction if the next instruction is an x87 FP instruction; for example, before
doing calculations on floats, doubles or long doubles.
• Don't empty when already empty: If the next instruction uses an MMX
register, _mm_empty() incurs a cost with no benefit.
• Group instructions: Try to partition regions that use x87 FP instructions from
those that use 64-bit SIMD integer instructions. This eliminates the need for an
EMMS instruction within the body of a critical loop.
• Runtime initialization: Use _mm_empty() during runtime initialization of
__m64 and x87 FP data types. This ensures resetting the register between data
type transitions. See Example 5-1 for coding usage.
You must be aware that your code generates an MMX instruction, which uses MMX
registers, with the Intel C++ Compiler in the following situations:
• when using a 64-bit SIMD integer intrinsic from MMX technology or
SSE/SSE2/SSSE3
• when using a 64-bit SIMD integer instruction from MMX technology or
SSE/SSE2/SSSE3 through inline assembly
• when referencing the __m64 data type variable
Additional information on the x87 floating-point programming model can be found in
the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1. For
more on EMMS, visit http://developer.intel.com.
5.3 DATA ALIGNMENT
Make sure that 64-bit SIMD integer data is 8-byte aligned and that 128-bit SIMD
integer data is 16-byte aligned. Referencing unaligned 64-bit SIMD integer data can
incur a performance penalty due to accesses that span 2 cache lines. Referencing
unaligned 128-bit SIMD integer data results in an exception unless the MOVDQU
(move double-quadword unaligned) instruction is used. Using the MOVDQU instruction
on unaligned data can result in lower performance than using 16-byte aligned
references. Refer to Section 4.4, "Stack and Data Alignment," for more information.
Loading 16 bytes of SIMD data efficiently requires data alignment on 16-byte boundaries.
SSSE3 provides the PALIGNR instruction. It reduces overhead in situations that
require software to process data elements from non-aligned addresses. The
PALIGNR instruction is most valuable when loading or storing unaligned data where
the address is shifted by a few bytes. You can replace a set of unaligned loads with
aligned loads followed by PALIGNR instructions and simple register-to-register copies.
Using PALIGNR to replace unaligned loads improves performance by eliminating
cache line splits and other penalties. In routines like memcpy(), PALIGNR can boost
the performance of misaligned cases.
Example 5-1. Resetting Register Between __m64 and FP Data Types Code
Incorrect Usage:
__m64 x = _m_paddd(y, z);
float f = init();
Correct Usage:
__m64 x = _m_paddd(y, z);
float f = (_mm_empty(), init());
Example 5-2 shows a situation that benefits from using PALIGNR.
Example 5-3 compares an optimal SSE2 sequence of the FIR loop with an equivalent
SSSE3 implementation. Both implementations unroll 4 iterations of the FIR inner loop
to enable SIMD coding techniques. The SSE2 code cannot avoid experiencing a cache
line split once every four iterations. PALIGNR allows the SSSE3 code to avoid the
delays associated with cache line splits.
Example 5-2. FIR Processing Example in C language Code
void FIR(float *in, float *out, float *coeff, int count)
{int i,j;
for ( i=0; i<count - TAP; i++ )
{ float sum = 0;
for ( j=0; j<TAP; j++ )
{ sum += in[j]*coeff[j]; }
*out++ = sum;
in++;
}
}
Example 5-3. SSE2 and SSSE3 Implementation of FIR Processing Code
Optimized for SSE2:
pxor xmm0, xmm0
xor ecx, ecx
mov eax, dword ptr[input]
mov ebx, dword ptr[coeff4]
inner_loop:
movaps xmm1, xmmword ptr[eax+ecx]
mulps xmm1, xmmword ptr[ebx+4*ecx]
addps xmm0, xmm1
movups xmm1, xmmword ptr[eax+ecx+4]
mulps xmm1, xmmword ptr[ebx+4*ecx+16]
addps xmm0, xmm1
movups xmm1, xmmword ptr[eax+ecx+8]
mulps xmm1, xmmword ptr[ebx+4*ecx+32]
addps xmm0, xmm1
movups xmm1, xmmword ptr[eax+ecx+12]
mulps xmm1, xmmword ptr[ebx+4*ecx+48]
addps xmm0, xmm1
add ecx, 16
cmp ecx, 4*TAP
jl inner_loop
mov eax, dword ptr[output]
movaps xmmword ptr[eax], xmm0
Optimized for SSSE3:
pxor xmm0, xmm0
xor ecx, ecx
mov eax, dword ptr[input]
mov ebx, dword ptr[coeff4]
inner_loop:
movaps xmm1, xmmword ptr[eax+ecx]
movaps xmm3, xmm1
mulps xmm1, xmmword ptr[ebx+4*ecx]
addps xmm0, xmm1
movaps xmm2, xmmword ptr[eax+ecx+16]
movaps xmm1, xmm2
palignr xmm2, xmm3, 4
mulps xmm2, xmmword ptr[ebx+4*ecx+16]
addps xmm0, xmm2
movaps xmm2, xmm1
palignr xmm2, xmm3, 8
mulps xmm2, xmmword ptr[ebx+4*ecx+32]
addps xmm0, xmm2
movaps xmm2, xmm1
palignr xmm2, xmm3, 12
mulps xmm2, xmmword ptr[ebx+4*ecx+48]
addps xmm0, xmm2
add ecx, 16
cmp ecx, 4*TAP
jl inner_loop
mov eax, dword ptr[output]
movaps xmmword ptr[eax], xmm0
5.4 DATA MOVEMENT CODING TECHNIQUES
In general, better performance can be achieved if data is pre-arranged for SIMD
computation (see Section 4.5, "Improving Memory Utilization"). This may not always
be possible.
This section covers techniques for gathering and arranging data for more efficient
SIMD computation.
5.4.1 Unsigned Unpack
MMX technology provides several instructions that are used to pack and unpack data
in the MMX registers. SSE2 extends these instructions so that they operate on
128-bit sources and destinations.
The unpack instructions can be used to zero-extend an unsigned number.
Example 5-4 assumes the source is a packed-word (16-bit) data type.
5.4.2 Signed Unpack
Signed numbers should be sign-extended when unpacking values. This is similar to
the zero-extend shown above, except that the PSRAD instruction (packed shift right
arithmetic) is used to sign-extend the values.
Example 5-5 assumes the source is a packed-word (16-bit) data type.
Example 5-4. Zero Extend 16-bit Values into 32 Bits Using Unsigned Unpack Instructions Code
; Input:
; XMM0 8 16-bit values in source
; XMM7 0 a local variable can be used
; instead of the register XMM7 if
; desired.
; Output:
; XMM0 four zero-extended 32-bit
; doublewords from four low-end
; words
; XMM1 four zero-extended 32-bit
; doublewords from four high-end
; words
movdqa xmm1, xmm0 ; copy source
punpcklwd xmm0, xmm7 ; unpack the 4 low-end words
; into 4 32-bit doublewords
punpckhwd xmm1, xmm7 ; unpack the 4 high-end words
; into 4 32-bit doublewords
Example 5-5. Signed Unpack Code
; Input:
; XMM0 source value
; Output:
; XMM0 four sign-extended 32-bit doublewords
; from four low-end words
; XMM1 four sign-extended 32-bit doublewords
; from four high-end words
;
5.4.3 Interleaved Pack with Saturation
Pack instructions pack two values into a destination register in a predetermined
order. PACKSSDW saturates two signed doublewords from a source operand and two
signed doublewords from a destination operand into four signed words; and it packs
the four signed words into a destination register. See Figure 5-1.
SSE2 extends PACKSSDW so that it saturates four signed doublewords from a source
operand and four signed doublewords from a destination operand into eight signed
words; the eight signed words are packed into the destination.
Figure 5-2 illustrates where two pairs of values are interleaved in a destination
register; Example 5-6 shows MMX code that accomplishes the operation.
movdqa xmm1, xmm0 ; copy source
punpcklwd xmm0, xmm0 ; unpack four low end words of the source
; into the upper 16 bits of each doubleword
; in the destination
punpckhwd xmm1, xmm1 ; unpack 4 high-end words of the source
; into the upper 16 bits of each doubleword
; in the destination
psrad xmm0, 16 ; sign-extend the 4 low-end words of the source
; into four 32-bit signed doublewords
psrad xmm1, 16 ; sign-extend the 4 high-end words of the
; source into four 32-bit signed doublewords
Example 5-5. Signed Unpack Code (Contd.)
Figure 5-1. PACKSSDW mm, mm/mm64 Instruction
Two signed doublewords are used as source operands and the result is interleaved
signed words. The sequence in Example 5-6 can be extended in SSE2 to interleave
eight signed words using XMM registers.
Pack instructions always assume that source operands are signed numbers. The
result in the destination register is always defined by the pack instruction that
performs the operation. For example, PACKSSDW packs each of two signed 32-bit
values of two sources into four saturated 16-bit signed values in a destination
register. PACKUSWB, on the other hand, packs the four signed 16-bit values of two
sources into eight saturated eight-bit unsigned values in the destination.
Figure 5-2. Interleaved Pack with Saturation
Example 5-6. Interleaved Pack with Saturation Code
; Input:
; MM0 signed source1 value
; MM1 signed source2 value
; Output:
; MM0 the first and third words contain the
; signed-saturated doublewords from MM0,
; the second and fourth words contain
; signed-saturated doublewords from MM1
;
packssdw mm0, mm0 ; pack and sign saturate
packssdw mm1, mm1 ; pack and sign saturate
punpcklwd mm0, mm1 ; interleave the low-end 16-bit
; values of the operands
5.4.4 Interleaved Pack without Saturation
Example 5-7 is similar to Example 5-6 except that the resulting words are not saturated.
In addition, in order to protect against overflow, only the low order 16 bits of
each doubleword are used. Again, Example 5-7 can be extended in SSE2 to accomplish
interleaving eight words without saturation.
5.4.5 Non-Interleaved Unpack
Unpack instructions perform an interleave merge of the data elements of the destination
and source operands into the destination register.
The following example merges the two operands into destination registers without
interleaving. For example, take two adjacent elements of a packed-word data type in
SOURCE1 and place this value in the low 32 bits of the results. Then take two adjacent
elements of a packed-word data type in SOURCE2 and place this value in the
high 32 bits of the results. One of the destination registers will have the combination
illustrated in Figure 5-3.
Example 5-7. Interleaved Pack without Saturation Code
; Input:
; MM0 signed source value
; MM1 signed source value
; Output:
; MM0 the first and third words contain the
; low 16-bits of the doublewords in MM0,
; the second and fourth words contain the
; low 16-bits of the doublewords in MM1
pslld mm1, 16 ; shift the 16 LSB from each of the
; doubleword values to the 16 MSB
; position
pand mm0, {0,ffff,0,ffff}
; mask to zero the 16 MSB
; of each doubleword value
por mm0, mm1 ; merge the two operands
The other destination register will contain the opposite combination, illustrated in
Figure 5-4.
Figure 5-3. Result of Non-Interleaved Unpack Low in MM0
Figure 5-4. Result of Non-Interleaved Unpack High in MM1
Code in Example 5-8 unpacks two packed-word sources in a non-interleaved
way. The goal is to use the instruction which unpacks doublewords to a quadword,
instead of using the instruction which unpacks words to doublewords.
5.4.6 Extract Word
The PEXTRW instruction takes the word in the designated MMX register selected by
the two least significant bits of the immediate value and moves it to the lower half of
a 32-bit integer register. See Figure 5-5 and Example 5-9.
Example 5-8. Unpacking Two Packed-word Sources in Non-interleaved Way Code
; Input:
; MM0 packed-word source value
; MM1 packed-word source value
; Output:
; MM0 contains the two low-end words of the
; original sources, non-interleaved
; MM2 contains the two high-end words of the
; original sources, non-interleaved.
movq mm2, mm0 ; copy source1
punpckldq mm0, mm1 ; replace the two high-end words
; of MM0 with two low-end words of
; MM1; leave the two low-end words
; of MM0 in place
punpckhdq mm2, mm1 ; move two high-end words of MM2
; to the two low-end words of MM2;
; place the two high-end words of
; MM1 in two high-end words of MM2
Figure 5-5. PEXTRW Instruction
5.4.7 Insert Word
The PINSRW instruction loads a word from the lower half of a 32-bit integer register
or from memory and inserts it in an MMX technology destination register at a position
defined by the two least significant bits of the immediate constant. Insertion is done
in such a way that the three other words from the destination register are left
untouched. See Figure 5-6 and Example 5-10.
Example 5-9. PEXTRW Instruction Code
; Input:
; eax source value
; immediate value: 0
; Output:
; edx 32-bit integer register containing the
; extracted word in the low-order bits &
; the high-order bits zero-extended
movq mm0, [eax]
pextrw edx, mm0, 0
Figure 5-6. PINSRW Instruction
If all of the operands in a register are being replaced by a series of PINSRW instructions,
it can be useful to clear the content and break the dependence chain by either
using the PXOR instruction or loading the register. See Example 5-11 and Section
3.5.1.6, "Clearing Registers and Dependency Breaking Idioms."
5.4.8 Move Byte Mask to Integer
The PMOVMSKB instruction returns a bit mask formed from the most significant bits
of each byte of its source operand. When used with 64-bit MMX registers, this
produces an 8-bit mask, zeroing out the upper 24 bits in the destination register.
When used with 128-bit XMM registers, it produces a 16-bit mask, zeroing out the
upper 16 bits in the destination register.
The 64-bit version of this instruction is shown in Figure 5-7 and Example 5-12.
Example 5-10. PINSRW Instruction Code
; Input:
; edx pointer to source value
; Output:
; mm0 register with new 16-bit value inserted
;
mov eax, [edx]
pinsrw mm0, eax, 1
Example 5-11. Repeated PINSRW Instruction Code
; Input:
; edx pointer to structure containing source
; values at offsets +0, +10, +13, and +24
; Output:
; MMX register with new 16-bit values inserted
;
pxor mm0, mm0 ; Breaks dependency on previous value of mm0
mov eax, [edx]
pinsrw mm0, eax, 0
mov eax, [edx+10]
pinsrw mm0, eax, 1
mov eax, [edx+13]
pinsrw mm0, eax, 2
mov eax, [edx+24]
pinsrw mm0, eax, 3
5.4.9 Packed Shuffle Word for 64-bit Registers
The PSHUF instruction uses the immediate (imm8) operand to select between the
four words in either two MMX registers or one MMX register and a 64-bit memory
location.
Bits 1 and 0 of the immediate value encode the source for destination word 0 in the
MMX register ([15-0]), and so on as shown in Table 5-1:
Figure 5-7. PMOVMSKB Instruction
Example 5-12. PMOVMSKB Instruction Code
; Input:
; source value
; Output:
; 32-bit register containing the byte mask in the lower
; eight bits
;
movq mm0, [edi]
pmovmskb eax, mm0
Bits 7 and 6 encode for word 3 in the MMX register ([63-48]). Similarly, the 2-bit
encoding represents which source word is used; for example, a binary encoding of 10
indicates that source word 2 in MM/MEM[47-32] is used.
See Figure 5-8 and Example 5-13.
Table 5-1. PSHUF Encoding
Bits    Word
1-0     0
3-2     1
5-4     2
7-6     3
Figure 5-8. PSHUF Instruction
Example 5-13. PSHUF Instruction Code
; Input:
; edi source value
; Output:
; MM1 MM register containing re-arranged words
movq mm0, [edi]
pshufw mm1, mm0, 0x1b
5.4.10 Packed Shuffle Word for 128-bit Registers
The PSHUFLW/PSHUFHW instruction performs a full shuffle of any source word field
within the low/high 64 bits to any result word field in the low/high 64 bits, using an
8-bit immediate operand; the other high/low 64 bits are passed through from the
source operand.
PSHUFD performs a full shuffle of any doubleword field within the 128-bit source to
any doubleword field in the 128-bit result, using an 8-bit immediate operand.
No more than 3 instructions, using PSHUFLW/PSHUFHW/PSHUFD, are required to
implement many common data shuffling operations. Broadcast, Swap, and Reverse
are illustrated in Example 5-14, Example 5-15, and Example 5-16.
Example 5-14. Broadcast Code, Using 2 Instructions
/* Goal: Broadcast the value from word 5 to all words */
/* Instruction Result */
| 7| 6| 5| 4| 3| 2| 1| 0|
PSHUFHW (3,2,1,1)| 7| 6| 5| 5| 3| 2| 1| 0|
PSHUFD (2,2,2,2) | 5| 5| 5| 5| 5| 5| 5| 5|
Example 5-15. Swap Code, Using 3 Instructions
/* Goal: Swap the values in word 6 and word 1 */
/* Instruction Result */
| 7| 6| 5| 4| 3| 2| 1| 0|
PSHUFD (3,0,1,2) | 7| 6| 1| 0| 3| 2| 5| 4|
PSHUFHW (3,1,2,0)| 7| 1| 6| 0| 3| 2| 5| 4|
PSHUFD (3,0,1,2) | 7| 1| 5| 4| 3| 2| 6| 0|
Example 5-16. Reverse Code, Using 3 Instructions
/* Goal: Reverse the order of the words */
/* Instruction Result */
| 7| 6| 5| 4| 3| 2| 1| 0|
PSHUFLW (0,1,2,3)| 7| 6| 5| 4| 0| 1| 2| 3|
PSHUFHW (0,1,2,3)| 4| 5| 6| 7| 0| 1| 2| 3|
PSHUFD (1,0,3,2) | 0| 1| 2| 3| 4| 5| 6| 7|
5.4.11 Shuffle Bytes
SSSE3 provides PSHUFB; this instruction carries out byte manipulation within a
16-byte range. PSHUFB can replace up to 12 other instructions, including SHIFT, OR,
AND and MOV.
Use PSHUFB if the alternative uses 5 or more instructions.
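As a minimal sketch of the kind of byte manipulation PSHUFB enables, the following C
intrinsics code reverses the 16 bytes of a register with a single shuffle; the function
name is illustrative, not from this manual's examples.

#include <tmmintrin.h>  /* SSSE3 intrinsics */

/* Reverse the 16 bytes of v with one PSHUFB. A control byte with its
   most significant bit set would zero the corresponding result byte. */
__m128i reverse_bytes(__m128i v)
{
    const __m128i ctrl = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);
    return _mm_shuffle_epi8(v, ctrl);  /* PSHUFB */
}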
5.4.12 Unpacking/interleaving 64-bit Data in 128-bit Registers
The PUNPCKLQDQ/PUNPCKHQDQ instructions interleave the low/high-order 64 bits of
the source operand and the low/high-order 64 bits of the destination operand. They
then write the results to the destination register.
The high/low-order 64 bits of the source operands are ignored.
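A minimal intrinsics sketch of these instructions (the helper name and output
parameters are illustrative):

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Interleave the quadword halves of two XMM registers. */
void unpack_qwords(__m128i a, __m128i b, __m128i *lo, __m128i *hi)
{
    *lo = _mm_unpacklo_epi64(a, b);  /* PUNPCKLQDQ: low qwords of a and b */
    *hi = _mm_unpackhi_epi64(a, b);  /* PUNPCKHQDQ: high qwords of a and b */
}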
5.4.13 Data Movement
There are two additional instructions to enable data movement between 64-bit SIMD
integer registers and 128-bit SIMD registers.
The MOVQ2DQ instruction moves the 64-bit integer data from an MMX register
(source) to a 128-bit destination register. The high-order 64 bits of the destination
register are zeroed out.
The MOVDQ2Q instruction moves the low-order 64 bits of integer data from a
128-bit source register to an MMX register (destination).
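A brief sketch of both moves using the corresponding intrinsics; the function names
are illustrative. As with any 64-bit SIMD integer code, _mm_empty() should follow the
last MMX operation.

#include <emmintrin.h>  /* SSE2 intrinsics */

__m128i widen_to_xmm(__m64 m)   /* MOVQ2DQ: upper 64 bits zeroed */
{
    return _mm_movpi64_epi64(m);
}

__m64 narrow_to_mmx(__m128i x)  /* MOVDQ2Q: low-order 64 bits only */
{
    return _mm_movepi64_pi64(x);
}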
5.4.14 Conversion Instructions
SSE provides instructions to support 4-wide conversion of single-precision data
to/from doubleword integer data. Conversions between double-precision data and
doubleword integer data have been added in SSE2.
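A short sketch of the 4-wide float/doubleword conversions, using the SSE2 forms of
these instructions; the function names are illustrative.

#include <emmintrin.h>  /* SSE2 intrinsics */

__m128i float_to_int(__m128 f)       { return _mm_cvtps_epi32(f);  }  /* CVTPS2DQ: rounds per MXCSR */
__m128i float_to_int_trunc(__m128 f) { return _mm_cvttps_epi32(f); }  /* CVTTPS2DQ: truncates toward zero */
__m128  int_to_float(__m128i i)      { return _mm_cvtepi32_ps(i);  }  /* CVTDQ2PS */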
5.5 GENERATING CONSTANTS
SIMD integer instruction sets do not have instructions that will load immediate
constants to the SIMD registers.
The following code segments generate frequently used constants in the SIMD
register. These examples can also be extended in SSE2 by substituting MMX with
XMM registers. See Example 5-17.
NOTE
Because SIMD integer instruction sets do not support shift instructions
for bytes, 2^n-1 and -2^n are relevant only for packed words and
packed doublewords.
5.6 BUILDING BLOCKS
This section describes instructions and algorithms which implement common code
building blocks.
Example 5-17. Generating Constants
pxor mm0, mm0 ; generate a zero register in MM0
pcmpeq mm1, mm1 ; Generate all 1's in register MM1,
; which is -1 in each of the packed
; data type fields
pxor mm0, mm0
pcmpeq mm1, mm1
psubb mm0, mm1 [psubw mm0, mm1] (psubd mm0, mm1)
; three instructions above generate
; the constant 1 in every
; packed-byte [or packed-word]
; (or packed-dword) field
pcmpeq mm1, mm1
psrlw mm1, 16-n (psrld mm1, 32-n)
; two instructions above generate
; the signed constant 2^n-1 in every
; packed-word (or packed-dword) field
pcmpeq mm1, mm1
psllw mm1, n (pslld mm1, n)
; two instructions above generate
; the signed constant -2^n in every
; packed-word (or packed-dword) field
5.6.1 Absolute Difference of Unsigned Numbers
Example 5-18 computes the absolute difference of two unsigned numbers. It
assumes an unsigned packed-byte data type.
Here, we make use of the subtract instruction with unsigned saturation. This instruction
receives unsigned operands and subtracts them with unsigned saturation.
This support exists only for packed bytes and packed words, not for packed
doublewords.
This example will not work if the operands are signed. Note that PSADBW may also
be used in some situations. See Section 5.6.9 for details.
5.6.2 Absolute Difference of Signed Numbers
Example 5-19 computes the absolute difference of two signed numbers using the
SSSE3 instruction PABSW. This sequence is more efficient than using previous
generations of SIMD instruction extensions.
Example 5-18. Absolute Difference of Two Unsigned Numbers
; Input:
; MM0 source operand
; MM1 source operand
; Output:
; MM0 absolute difference of the unsigned
; operands
movq mm2, mm0 ; make a copy of mm0
psubusb mm0, mm1 ; compute difference one way
psubusb mm1, mm2 ; compute difference the other way
por mm0, mm1 ; OR them together
Example 5-19. Absolute Difference of Signed Numbers
; Input:
; XMM0 signed source operand
; XMM1 signed source operand
; Output:
; XMM1 absolute difference of the signed
; operands
psubw xmm0, xmm1 ; subtract words
pabsw xmm1, xmm0 ; results in XMM1
5.6.3 Absolute Value
Example 5-20 shows an MMX code sequence to compute |x|, where x is signed. This
example assumes signed words to be the operands.
With SSSE3, this sequence of three instructions can be replaced by the PABSW
instruction. Additionally, SSSE3 provides a 128-bit version using XMM registers and
supports byte, word and doubleword granularity.
NOTE
The absolute value of the most negative number (that is, 8000H for
16-bit) cannot be represented using positive numbers. This algorithm
will return the original value for the absolute value (8000H).
5.6.4 Pixel Format Conversion
SSSE3 provides the PSHUFB instruction to carry out byte manipulation within a
16-byte range. PSHUFB can replace a set of up to 12 other instructions, including
SHIFT, OR, AND and MOV.
Use PSHUFB if the alternative code uses 5 or more instructions. Example 5-21 shows
the basic form of conversion of color pixel formats.
Example 5-20. Computing Absolute Value
; Input:
; MM0 signed source operand
; Output:
; MM1 ABS(MM0)
pxor mm1, mm1 ; set mm1 to all zeros
psubw mm1, mm0 ; make each mm1 word contain the
; negative of each mm0 word
pmaxsw mm1, mm0 ; mm1 will contain only the positive
; (larger) values - the absolute value
Example 5-21. Basic C Implementation of RGBA to BGRA Conversion
Standard C Code:
struct RGBA{BYTE r,g,b,a;};
struct BGRA{BYTE b,g,r,a;};
void BGRA_RGBA_Convert(BGRA *source, RGBA *dest, int num_pixels)
{
for(int i = 0; i < num_pixels; i++){
dest[i].r = source[i].r;
dest[i].g = source[i].g;
dest[i].b = source[i].b;
dest[i].a = source[i].a;
}
}
Example 5-22 and Example 5-23 show SSE2 code and SSSE3 code for pixel format
conversion. In the SSSE3 example, PSHUFB replaces six SSE2 instructions.
Example 5-22. Color Pixel Format Conversion Using SSE2
; Optimized for SSE2
mov esi, src
mov edi, dest
mov ecx, iterations
movdqa xmm0, ag_mask // {0,ff,0,ff,0,ff,0,ff,0,ff,0,ff,0,ff,0,ff}
movdqa xmm5, rb_mask // {ff,0,ff,0,ff,0,ff,0,ff,0,ff,0,ff,0,ff,0}
mov eax, remainder
convert16Pixs: // 16 pixels, 64 bytes per iteration
movdqa xmm1, [esi] // xmm1 = [r3g3b3a3,r2g2b2a2,r1g1b1a1,r0g0b0a0]
movdqa xmm2, xmm1
movdqa xmm7, xmm1 // xmm7 = abgr
psrld xmm2, 16    // xmm2 = 00ab
pslld xmm1, 16    // xmm1 = gr00
por xmm1, xmm2    // xmm1 = grab
pand xmm7, xmm0   // xmm7 = a0g0
pand xmm1, xmm5   // xmm1 = 0r0b
por xmm1, xmm7    // xmm1 = argb
movdqa [edi], xmm1
// repeats for another 3*16 bytes
add esi, 64
add edi, 64
sub ecx, 1
jnz convert16Pixs
Example 5-23. Color Pixel Format Conversion Using SSSE3
; Optimized for SSSE3
mov esi, src
mov edi, dest
mov ecx, iterations
movdqa xmm0, _shufb
// xmm0 = [15,12,13,14,11,8,9,10,7,4,5,6,3,0,1,2]
mov eax, remainder
convert16Pixs: // 16 pixels, 64 bytes per iteration
movdqa xmm1, [esi]
// xmm1 = [r3g3b3a3,r2g2b2a2,r1g1b1a1,r0g0b0a0]
movdqa xmm2, [esi+16]
pshufb xmm1, xmm0
// xmm1 = [b3g3r3a3,b2g2r2a2,b1g1r1a1,b0g0r0a0]
movdqa [edi], xmm1
// repeats for another 3*16 bytes
add esi, 64
add edi, 64
sub ecx, 1
jnz convert16Pixs
5.6.5 Endian Conversion
The PSHUFB instruction can also be used to reverse byte ordering within a doubleword.
It is more efficient than traditional techniques, such as BSWAP.
Example 5-24 shows the traditional technique using four BSWAP instructions to
reverse the bytes within a DWORD. Each BSWAP requires executing two micro-ops. In
addition, the code requires 4 loads and 4 stores for processing 4 DWORDs of data.
Example 5-25 shows an SSSE3 implementation of endian conversion using PSHUFB.
The reversing of four DWORDs requires one load, one store, and PSHUFB.
On Intel Core microarchitecture, reversing 4 DWORDs using PSHUFB can be approximately
twice as fast as using BSWAP.
Example 5-24. Big-Endian to Little-Endian Conversion Using BSWAP
lea eax, src
lea ecx, dst
mov edx, elCount
start:
mov edi, [eax]
mov esi, [eax+4]
bswap edi
mov ebx, [eax+8]
bswap esi
mov ebp, [eax+12]
mov [ecx], edi
mov [ecx+4], esi
bswap ebx
mov [ecx+8], ebx
bswap ebp
mov [ecx+12], ebp
add eax, 16
add ecx, 16
sub edx, 4
jnz start
5.6.6 Clipping to an Arbitrary Range [High, Low]
This section explains how to clip values to a range [HIGH, LOW]. Specifically, if the
value is less than LOW or greater than HIGH, then clip to LOW or HIGH, respectively.
This technique uses the packed-add and packed-subtract instructions with saturation
(signed or unsigned), which means that this technique can only be used on
packed-byte and packed-word data types.
The examples in this section use the constants PACKED_MAX and PACKED_MIN and
show operations on word values. For simplicity, we use the following constants
(corresponding constants are used in case the operation is done on byte values):
• PACKED_MAX equals 0x7FFF7FFF7FFF7FFF
• PACKED_MIN equals 0x8000800080008000
• PACKED_LOW contains the value LOW in all four words of the packed-words data type
• PACKED_HIGH contains the value HIGH in all four words of the packed-words data type
• PACKED_USMAX has all bits equal to 1
• HIGH_US adds the HIGH value to all data elements (4 words) of PACKED_MIN
• LOW_US adds the LOW value to all data elements (4 words) of PACKED_MIN
5.6.6.1 Highly Efficient Clipping
For clipping signed words to an arbitrary range, the PMAXSW and PMINSW instructions
may be used. For clipping unsigned bytes to an arbitrary range, the PMAXUB
and PMINUB instructions may be used.
Example 5-25. Big-Endian to Little-Endian Conversion Using PSHUFB
__declspec(align(16)) BYTE bswapMASK[16] =
{3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};
lea eax, src
lea ecx, dst
mov edx, elCount
movaps xmm7, bswapMASK
start:
movdqa xmm0, [eax]
pshufb xmm0, xmm7
movdqa [ecx], xmm0
add eax, 16
add ecx, 16
sub edx, 4
jnz start
Example 5-26 shows how to clip signed words to an arbitrary range; the code for
clipping unsigned bytes is similar.
The code in Example 5-27 converts values to unsigned numbers first and then clips
them to an unsigned range. The last instruction converts the data back to signed data
and places the data within the signed range.
Conversion to unsigned data is required for correct results when (High - Low) <
0x8000. If (High - Low) >= 0x8000, simplify the algorithm as in Example 5-28.
Example 5-26. Clipping to a Signed Range of Words [High, Low]
; Input:
; MM0 signed source operands
; Output:
; MM0 signed words clipped to the signed
; range [high, low]
pminsw mm0, packed_high
pmaxsw mm0, packed_low
Example 5-27. Clipping to an Arbitrary Signed Range [High, Low]
; Input:
; MM0 signed source operands
; Output:
; MM0 signed operands clipped to the unsigned
; range [high, low]
paddw mm0, packed_min ; add with no saturation
; 0x8000 to convert to unsigned
paddusw mm0, (packed_usmax - high_us)
; in effect this clips to high
psubusw mm0, (packed_usmax - high_us + low_us)
; in effect this clips to low
paddw mm0, packed_low ; undo the previous two offsets
This algorithm saves a cycle when it is known that (High - Low) >= 0x8000. The
three-instruction algorithm does not work when (High - Low) < 0x8000 because
0xffff minus any number < 0x8000 will yield a number greater in magnitude than
0x8000 (which is a negative number).
When the second instruction, psubssw mm0, (0xffff - High + Low), in the three-step
algorithm (Example 5-28) is executed, a negative number is subtracted. The result
of this subtraction causes the values in MM0 to be increased instead of decreased, as
should be the case, and an incorrect answer is generated.
5.6.6.2 Clipping to an Arbitrary Unsigned Range [High, Low]
Example 5-29 clips an unsigned value to the unsigned range [High, Low]. If the value
is less than low or greater than high, then clip to low or high, respectively. This technique
uses the packed-add and packed-subtract instructions with unsigned saturation;
thus the technique can only be used on packed-byte and packed-word data
types.
Example 5-29 illustrates the operation on word values.
Example 5-28. Simplified Clipping to an Arbitrary Signed Range
; Input: MM0 signed source operands
; Output: MM0 signed operands clipped to the unsigned
; range [high, low]
paddssw mm0, (packed_max - packed_high)
; in effect this clips to high
psubssw mm0, (packed_usmax - packed_high + packed_low)
; clips to low
paddw mm0, packed_low ; undo the previous two offsets
Example 5-29. Clipping to an Arbitrary Unsigned Range [High, Low]
; Input:
; MM0 unsigned source operands
; Output:
; MM0 unsigned operands clipped to the unsigned
; range [HIGH, LOW]
paddusw mm0, 0xffff - high
; in effect this clips to high
psubusw mm0, (0xffff - high + low)
; in effect this clips to low
paddw mm0, low
; undo the previous two offsets
5.6.7 Packed Max/Min of Signed Word and Unsigned Byte
5.6.7.1 Signed Word
The PMAXSW instruction returns the maximum between four signed words in either
of two SIMD registers, or one SIMD register and a memory location.
The PMINSW instruction returns the minimum between the four signed words in
either of two SIMD registers, or one SIMD register and a memory location.
5.6.7.2 Unsigned Byte
The PMAXUB instruction returns the maximum between the eight unsigned bytes in
either of two SIMD registers, or one SIMD register and a memory location.
The PMINUB instruction returns the minimum between the eight unsigned bytes in
either of two SIMD registers, or one SIMD register and a memory location.
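These instructions allow branch-free clamping, as in the following minimal sketch
using the 128-bit intrinsic forms (the function name is illustrative):

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Clamp each signed word of v to the range [lo, hi]. */
__m128i clamp_words(__m128i v, __m128i lo, __m128i hi)
{
    v = _mm_min_epi16(v, hi);  /* PMINSW */
    v = _mm_max_epi16(v, lo);  /* PMAXSW */
    return v;
}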
5.6.8 Packed Multiply High Unsigned
The PMULHUW/PMULHW instruction multiplies the unsigned/signed words in the
destination operand with the unsigned/signed words in the source operand. The
high-order 16 bits of the 32-bit intermediate results are written to the destination
operand.
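This makes PMULHUW convenient for fixed-point scaling, since each element effectively
becomes (a*b) >> 16. A minimal sketch, with an illustrative function name and a Q16
scale factor as an assumption:

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scale eight unsigned 16-bit samples by a Q16 fixed-point factor. */
__m128i scale_q16(__m128i samples, unsigned short factor_q16)
{
    __m128i f = _mm_set1_epi16((short)factor_q16);
    return _mm_mulhi_epu16(samples, f);  /* PMULHUW: high 16 bits of products */
}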
5.6.9 Packed Sum of Absolute Differences
The PSADBW instruction computes the absolute value of the difference of unsigned
bytes for either two SIMD registers, or one SIMD register and a memory location.
The differences are then summed to produce a word result in the lower 16-bit field,
and the upper three words are set to zero. See Figure 5-9.
Figure 5-9. PSADBW Instruction Example
The subtraction operation presented above is an absolute difference. That is,
t = abs(x-y). Byte values are stored in temporary space, all values are summed
together, and the result is written to the lower word of the destination register.
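A minimal sketch of a 16-byte sum of absolute differences with the 128-bit form,
which produces one partial sum per 8-byte half; the function name is illustrative.

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sum of absolute differences of two 16-byte blocks. */
unsigned int sad16(__m128i a, __m128i b)
{
    __m128i sad = _mm_sad_epu8(a, b);  /* PSADBW: sums in bits 15:0 and 79:64 */
    return (unsigned int)_mm_cvtsi128_si32(sad) +
           (unsigned int)_mm_cvtsi128_si32(_mm_srli_si128(sad, 8));
}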
5.6.10 Packed Average (Byte/Word)
The PAVGB and PAVGW instructions add the unsigned data elements of the source
operand to the unsigned data elements of the destination register, along with a
carry-in. The results of the addition are then independently shifted to the right by one
bit position. The high order bits of each element are filled with the carry bits of the
corresponding sum.
The destination operand is a SIMD register. The source operand can either be a
SIMD register or a memory operand.
The PAVGB instruction operates on packed unsigned bytes and the PAVGW instruction
operates on packed unsigned words.
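Each byte (or word) of the result is therefore the rounded average (a + b + 1) >> 1,
which is the core of simple motion-compensation and image-blending kernels. A
minimal sketch with an illustrative name:

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Rounded per-byte average of two pixel rows. */
__m128i average_rows(__m128i row_a, __m128i row_b)
{
    return _mm_avg_epu8(row_a, row_b);  /* PAVGB: (a + b + 1) >> 1 per byte */
}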
5.6.11 Complex Multiply by a Constant
Complex multiplication is an operation which requires four multiplications and two
additions. This is exactly how the PMADDWD instruction operates. In order to use
this instruction, you need to format the data into multiple 16-bit values. The real and
imaginary components should be 16 bits each. Consider Example 5-30, which
assumes that the 64-bit MMX registers are being used:
• Let the input data be Dr and Di, where Dr is the real component of the data and
Di is the imaginary component of the data.
• Format the constant complex coefficients in memory as four 16-bit values
[Cr -Ci Ci Cr]. Remember to load the values into the MMX register using MOVQ.
• The real component of the complex product is Pr = Dr*Cr - Di*Ci and the
imaginary component of the complex product is Pi = Dr*Ci + Di*Cr.
• The output is a packed doubleword. If needed, a pack instruction can be used to
convert the result to 16-bit (thereby matching the format of the input).
5.6.12 Packed 32*32 Multiply

The PMULUDQ instruction performs an unsigned multiply on the lower pair of doubleword operands within 64-bit chunks from the two sources; the full 64-bit result from each multiplication is returned to the destination register.

This instruction is added in both a 64-bit and 128-bit version; the latter performs 2 independent operations, on the low and high halves of a 128-bit register.
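A sketch of the 128-bit version through its SSE2 intrinsic (the helper name is illustrative):

#include <emmintrin.h>

/* _mm_mul_epu32 (PMULUDQ): result qword0 = a.dword0 * b.dword0 and
   result qword1 = a.dword2 * b.dword2, each a full 64-bit product. */
__m128i mul_low_dwords(__m128i a, __m128i b)
{
    return _mm_mul_epu32(a, b);
}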
5.6.13 Packed 64-bit Add/Subtract

The PADDQ/PSUBQ instructions add/subtract quadword operands within each 64-bit chunk from the two sources; the 64-bit result from each computation is written to the destination register. Like the integer ADD/SUB instructions, PADDQ/PSUBQ can operate on either unsigned or signed (two's complement notation) integer operands.
Example 5-30. Complex Multiply by a Constant
; Input:
; MM0 complex value, Dr, Di
; MM1 constant complex coefficient in the form
; [Cr -Ci Ci Cr]
; Output:
; MM0 two 32-bit dwords containing [Pr Pi]
;
punpckldq mm0, mm0 ; makes [dr di dr di]
pmaddwd mm0, mm1 ; done, the result is
; [(Dr*Cr-Di*Ci)(Dr*Ci+Di*Cr)]
When an individual result is too large to be represented in 64 bits, the lower 64 bits of the result are written to the destination operand and therefore the result wraps around. These instructions are added in both a 64-bit and 128-bit version; the latter performs 2 independent operations, on the low and high halves of a 128-bit register.
5.6.14 128-bit Shifts

The PSLLDQ/PSRLDQ instructions shift the first operand to the left/right by the number of bytes specified by the immediate operand. The emptied low/high-order bytes are cleared (set to zero).

If the value specified by the immediate operand is greater than 15, then the destination is set to all zeros.
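A brief sketch of the corresponding intrinsics (the 4-byte count is an arbitrary illustration; the byte count must be an immediate constant):

#include <emmintrin.h>

/* PSLLDQ/PSRLDQ shift the whole 128-bit value by bytes; the vacated
   bytes are zeroed. */
__m128i shift_demo(__m128i x)
{
    __m128i left  = _mm_slli_si128(x, 4); /* shift left by 4 bytes */
    __m128i right = _mm_srli_si128(x, 4); /* shift right by 4 bytes */
    return _mm_or_si128(left, right);     /* combined only for illustration */
}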
5.7 MEMORY OPTIMIZATIONS

You can improve memory access using the following techniques:
• Avoiding partial memory accesses
• Increasing the bandwidth of memory fills and video fills
• Prefetching data with Streaming SIMD Extensions. See Chapter 9, "Optimizing Cache Usage."

MMX registers and XMM registers allow you to move large quantities of data without stalling the processor. Instead of loading single array values that are 8, 16, or 32 bits long, consider loading the values in a single quadword or double quadword and then incrementing the structure or array pointer accordingly.

Any data that will be manipulated by SIMD integer instructions should be loaded using either:
• An SIMD integer instruction that loads a 64-bit or 128-bit operand (for example: MOVQ MM0, M64)
• The register-memory form of any SIMD integer instruction that operates on a quadword or double quadword memory operand (for example: PMADDW MM0, M64).

All SIMD data should be stored using an SIMD integer instruction that stores a 64-bit or 128-bit operand (for example: MOVQ M64, MM0).

The goal of the above recommendations is twofold. First, the loading and storing of SIMD data is more efficient using the larger block sizes. Second, following the above recommendations helps to avoid mixing 8-, 16-, or 32-bit load and store operations with SIMD integer load and store operations to the same SIMD data. This prevents situations in which small loads follow large stores to the same area of memory, or large loads follow small stores to the same area of memory. The Pentium II, Pentium III, and Pentium 4 processors may stall in such situations. See Chapter 3 for details.
5.7.1 Partial Memory Accesses

Consider a case with a large load after a series of small stores to the same area of memory (beginning at memory address MEM). The large load stalls in the case shown in Example 5-31.

MOVQ must wait for the stores to write memory before it can access all the data it requires. This stall can also occur with other data types (for example, when bytes or words are stored and then words or doublewords are read from the same area of memory). When you change the code sequence as shown in Example 5-32, the processor can access the data without delay.
Example 5-31. A Large Load after a Series of Small Stores (Penalty)
mov mem, eax ; store dword to address "mem"
mov mem + 4, ebx ; store dword to address "mem + 4"
:
:
movq mm0, mem ; load qword at address "mem", stalls
Example 5-32. Accessing Data Without Delay
movd mm1, ebx ; build data into a qword first
; before storing it to memory
movd mm2, eax
psllq mm1, 32
por mm1, mm2
movq mem, mm1 ; store SIMD variable to "mem" as
; a qword
:
:
movq mm0, mem ; load qword SIMD "mem", no stall
Consider a case with a series of small loads after a large store to the same area of memory (beginning at memory address MEM), as shown in Example 5-33. Most of the small loads stall because they are not aligned with the store. See Section 3.6.4, "Store Forwarding," for details.

The word loads must wait for the quadword store to write to memory before they can access the data they require. This stall can also occur with other data types (for example: when doublewords or words are stored and then words or bytes are read from the same area of memory).

When you change the code sequence as shown in Example 5-34, the processor can access the data without delay.

These transformations, in general, increase the number of instructions required to perform the desired operation. For Pentium II, Pentium III, and Pentium 4 processors, the benefit of avoiding forwarding problems outweighs the performance penalty due to the increased number of instructions.
Example 5-33. A Series of Small Loads After a Large Store
movq mem, mm0 ; store qword to address "mem"
:
:
mov bx, mem + 2 ; load word at "mem + 2" stalls
mov cx, mem + 4 ; load word at "mem + 4" stalls
Example 5-34. Eliminating Delay for a Series of Small Loads after a Large Store
movq mem, mm0 ; store qword to address "mem"
:
:
movq mm1, mem ; load qword at address "mem"
movd eax, mm1 ; transfer "mem + 2" to eax from
; MMX register, not memory
psrlq mm1, 32
shr eax, 16
movd ebx, mm1 ; transfer "mem + 4" to ebx from
; MMX register, not memory
and ebx, 0ffffh
5.7.1.1 Supplemental Techniques for Avoiding Cache Line Splits

Video processing applications sometimes cannot avoid loading data from memory addresses that are not aligned to 16-byte boundaries. An example of this situation is when each line in a video frame is averaged by shifting horizontally half a pixel.

Example 5-35 shows a common operation in video processing that loads data from a memory address not aligned to a 16-byte boundary. As video processing traverses each line in the video frame, it experiences a cache line split for each 64-byte chunk loaded from memory.

SSE3 provides an instruction, LDDQU, for loading from memory addresses that are not 16-byte aligned. LDDQU is a special 128-bit unaligned load designed to avoid cache line splits. If the address of the load is aligned on a 16-byte boundary, LDDQU loads the 16 bytes requested. If the address of the load is not aligned on a 16-byte boundary, LDDQU loads a 32-byte block starting at the 16-byte aligned address immediately below the address of the load request. It then provides the requested 16 bytes. If the address is aligned on a 16-byte boundary, the effective number of memory requests is implementation dependent (one, or more).

LDDQU is designed for loading data from memory without storing modified data back to the same address. Thus, the usage of LDDQU should be restricted to situations where no store-to-load forwarding is expected. For situations where store-to-load forwarding is expected, use regular store/load pairs (either aligned or unaligned, based on the alignment of the data accessed).
Example 5-35. An Example of Video Processing with Cache Line Splits
// Average half-pels horizontally (on the x axis),
// from one reference frame only.
nextLinesLoop:
movdqu xmm0, XMMWORD PTR [edx] // may not be 16B aligned
movdqu xmm1, XMMWORD PTR [edx+1]
movdqu xmm2, XMMWORD PTR [edx+eax]
movdqu xmm3, XMMWORD PTR [edx+eax+1]
pavgb xmm0, xmm1
pavgb xmm2, xmm3
movdqa XMMWORD PTR [ecx], xmm0
movdqa XMMWORD PTR [ecx+eax], xmm2
// (repeat ...)
5.7.2 Increasing Bandwidth of Memory Fills and Video Fills

It is beneficial to understand how memory is accessed and filled. A memory-to-memory fill (for example, a memory-to-video fill) is defined as a 64-byte (cache line) load from memory which is immediately stored back to memory (such as a video frame buffer).

The following are guidelines for obtaining higher bandwidth and shorter latencies for sequential memory fills (video fills). These recommendations are relevant for all Intel architecture processors with MMX technology and refer to cases in which the loads and stores do not hit in the first- or second-level cache.
5.7.2.1 Increasing Memory Bandwidth Using the MOVDQ Instruction

Loading any size data operand will cause an entire cache line to be loaded into the cache hierarchy. Thus, any size load looks more or less the same from a memory bandwidth perspective. However, using many smaller loads consumes more microarchitectural resources than fewer larger loads. Consuming too many resources can cause the processor to stall and reduce the bandwidth that the processor can request of the memory subsystem.

Using MOVDQ to store the data back to UC memory (or WC memory in some cases) instead of using 32-bit stores (for example, MOVD) will reduce by three-quarters the number of stores per memory fill cycle. As a result, using MOVDQ in memory fill cycles can achieve significantly higher effective bandwidth than using MOVD.
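As a hedged sketch (the function name and buffer layout are assumptions), a fill loop that issues one 16-byte store per iteration instead of four 4-byte MOVD stores:

#include <emmintrin.h>

/* dst must be 16-byte aligned; count16 is the number of 16-byte blocks. */
void fill_stream(__m128i *dst, __m128i value, int count16)
{
    for (int i = 0; i < count16; i++)
        _mm_store_si128(&dst[i], value); /* one store covers 16 bytes */
}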
Example 5-36. Video Processing Using LDDQU to Avoid Cache Line Splits
// Average half-pels horizontally (on the x axis),
// from one reference frame only.
nextLinesLoop:
lddqu xmm0, XMMWORD PTR [edx] // may not be 16B aligned
lddqu xmm1, XMMWORD PTR [edx+1]
lddqu xmm2, XMMWORD PTR [edx+eax]
lddqu xmm3, XMMWORD PTR [edx+eax+1]
pavgb xmm0, xmm1
pavgb xmm2, xmm3
movdqa XMMWORD PTR [ecx], xmm0 // results stored elsewhere
movdqa XMMWORD PTR [ecx+eax], xmm2
// (repeat ...)
5.7.2.2 Increasing Memory Bandwidth by Loading and Storing to and from the Same DRAM Page

DRAM is divided into pages, which are not the same as operating system (OS) pages. The size of a DRAM page is a function of the total size of the DRAM and the organization of the DRAM. Page sizes of several kilobytes are common. Like OS pages, DRAM pages are constructed of sequential addresses. Sequential memory accesses to the same DRAM page have shorter latencies than sequential accesses to different DRAM pages.

In many systems the latency for a page miss (that is, an access to a different page instead of the page previously accessed) can be twice as large as the latency of a memory page hit (access to the same page as the previous access). Therefore, if the loads and stores of the memory fill cycle are to the same DRAM page, a significant increase in the bandwidth of the memory fill cycles can be achieved.
5.7.2.3 Increasing UC and WC Store Bandwidth by Using Aligned Stores

Using aligned stores to fill UC or WC memory will yield higher bandwidth than using unaligned stores. If a UC store or some WC stores cross a cache line boundary, a single store will result in two transactions on the bus, reducing the efficiency of the bus transactions. By aligning the stores to the size of the stores, you eliminate the possibility of crossing a cache line boundary, and the stores will not be split into separate transactions.
5.8 CONVERTING FROM 64-BIT TO 128-BIT SIMD INTEGERS

SSE2 defines a superset of 128-bit integer instructions currently available in MMX technology; the operation of the extended instructions remains the same. The superset simply operates on data that is twice as wide. This simplifies porting of 64-bit integer applications. However, there are a few considerations:
• Computation instructions which use a memory operand that may not be aligned to a 16-byte boundary must be replaced with an unaligned 128-bit load (MOVDQU) followed by the same computation operation that uses register operands instead. Use of 128-bit integer computation instructions with memory operands that are not 16-byte aligned will result in a #GP. Unaligned 128-bit loads and stores are not as efficient as the corresponding aligned versions; this fact can reduce the performance gains when using the 128-bit SIMD integer extensions.
• General guidelines on the alignment of memory operands are:
- The greatest performance gains can be achieved when all memory streams are 16-byte aligned.
- Reasonable performance gains are possible if roughly half of all memory streams are 16-byte aligned and the other half are not.
- Little or no performance gain may result if all memory streams are not aligned to 16 bytes. In this case, use of the 64-bit SIMD integer instructions may be preferable.
• Loop counters need to be updated because each 128-bit integer instruction operates on twice the amount of data as its 64-bit integer counterpart.
• Extension of the PSHUFW instruction (shuffle word across a 64-bit integer operand) across a full 128-bit operand is emulated by a combination of the following instructions: PSHUFHW, PSHUFLW, and PSHUFD.
• Use of the 64-bit shift-by-bit instructions (PSRLQ, PSLLQ) is extended to 128 bits by either of the following (a sketch follows this list):
- Use of PSRLQ and PSLLQ, along with masking logic operations
- A code sequence rewritten to use the PSRLDQ and PSLLDQ instructions (shift double quadword operand by bytes)
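A hedged sketch of the first approach, combining PSLLQ with a PSLLDQ-based lane move and an OR as the masking/merge logic (assumes a shift count n with 0 < n < 64; the function name is illustrative):

#include <emmintrin.h>

/* Full 128-bit left shift: shift each qword, then OR in the bits that
   must cross from the low qword into the high qword. */
__m128i sll128_bits(__m128i x, int n)
{
    __m128i hi    = _mm_slli_epi64(x, n);          /* per-qword shift */
    __m128i moved = _mm_slli_si128(x, 8);          /* low qword to high */
    __m128i carry = _mm_srli_epi64(moved, 64 - n); /* cross-lane bits */
    return _mm_or_si128(hi, carry);
}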
5.8.1 SIMD Optimizations and Microarchitectures

Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than Intel NetBurst microarchitecture. The following sections discuss optimizing SIMD code that targets Intel Core Solo and Intel Core Duo processors.

On Intel Core Solo and Intel Core Duo processors, lddqu behaves identically to movdqu by loading 16 bytes of data irrespective of address alignment.
5.8.1.1 Packed SSE2 Integer versus MMX Instructions

In general, 128-bit SIMD integer instructions should be favored over 64-bit MMX instructions on Intel Core Solo and Intel Core Duo processors. This is because:
• Improved decoder bandwidth and more efficient micro-op flows relative to the Pentium M processor
• The wider width of the XMM registers can benefit code that is limited by either decoder bandwidth or execution latency. XMM registers can provide twice the space to store data for in-flight execution. Wider XMM registers can facilitate loop unrolling, or reduce loop overhead by halving the number of loop iterations.

In microarchitectures prior to Intel Core microarchitecture, execution throughput of 128-bit SIMD integer operations is basically the same as for 64-bit MMX operations. Some shuffle/unpack/shift operations do not benefit from the front-end improvements. The net impact of using 128-bit SIMD integer instructions on Intel Core Solo and Intel Core Duo processors is likely to be slightly positive overall, but there may be a few situations where their use will generate an unfavorable performance impact.
Intel Core microarchitecture generally executes SIMD instructions more efficiently than previous microarchitectures in terms of latency and throughput; many of the limitations specific to Intel Core Duo and Intel Core Solo processors do not apply. The same is true of Intel Core microarchitecture relative to Intel NetBurst microarchitecture.
CHAPTER 6
OPTIMIZING FOR SIMD FLOATING-POINT
APPLICATIONS
This chapter discusses rules for optimizing for the single-instruction, multiple-data (SIMD) floating-point instructions available in Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3). The chapter also provides examples that illustrate the optimization techniques for single-precision and double-precision SIMD floating-point applications.
6.1 GENERAL RULES FOR SIMD FLOATING-POINT CODE

The rules and suggestions in this section help optimize floating-point code containing SIMD floating-point instructions. Generally, it is important to understand and balance port utilization to create efficient SIMD floating-point code. Basic rules and suggestions include the following:
• Follow all guidelines in Chapter 3 and Chapter 4.
• Mask exceptions to achieve higher performance. When exceptions are unmasked, software performance is slower.
• Utilize the flush-to-zero and denormals-are-zero modes for higher performance to avoid the penalty of dealing with denormals and underflows.
• Use the reciprocal instructions followed by iteration for increased accuracy. These instructions yield reduced accuracy but execute much faster. Note the following (a sketch follows this list):
- If reduced accuracy is acceptable, use them with no iteration.
- If near full accuracy is needed, use a Newton-Raphson iteration.
- If full accuracy is needed, then use divide and square root, which provide more accuracy but slow down performance.
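A minimal sketch of the reciprocal-plus-iteration pattern, using one Newton-Raphson step x1 = x0*(2 - d*x0) (the function name is illustrative):

#include <xmmintrin.h>

__m128 recip_nr(__m128 d)
{
    __m128 x0  = _mm_rcp_ps(d);            /* RCPPS: ~12-bit estimate */
    __m128 two = _mm_set1_ps(2.0f);
    /* one refinement step roughly doubles the number of accurate bits */
    return _mm_mul_ps(x0, _mm_sub_ps(two, _mm_mul_ps(d, x0)));
}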
6.2 PLANNING CONSIDERATIONS

Whether adapting an existing application or creating a new one, using SIMD floating-point instructions to achieve optimum performance gain requires programmers to consider several issues. In general, when choosing candidates for optimization, look for code segments that are computationally intensive and floating-point intensive. Also consider efficient use of the cache architecture.

The sections that follow answer the questions that should be raised before implementation:
• Can data layout be arranged to increase parallelism or cache utilization?
• Which part of the code benefits from SIMD floating-point instructions?
• Is the current algorithm the most appropriate for SIMD floating-point instructions?
• Is the code floating-point intensive?
• Do either single-precision or double-precision floating-point computations provide enough range and precision?
• Is the result of the computation affected by enabling the flush-to-zero or denormals-are-zero modes?
• Is the data arranged for efficient utilization of the SIMD floating-point registers?
• Is this application targeted for processors without SIMD floating-point instructions?

See also: Section 4.2, "Considerations for Code Conversion to SIMD Programming."
6.3 USING SIMD FLOATING-POINT WITH X87 FLOATING-POINT

Because the XMM registers used for SIMD floating-point computations are separate registers and are not mapped to the existing x87 floating-point stack, SIMD floating-point code can be mixed with x87 floating-point or 64-bit SIMD integer code.

With Intel Core microarchitecture, 128-bit SIMD integer instructions provide substantially higher efficiency than 64-bit SIMD integer instructions. Software should favor using SIMD floating-point and integer instructions with XMM registers where possible.
6.4 SCALAR FLOATING-POINT CODE

There are SIMD floating-point instructions that operate only on the least-significant operand in the SIMD register. These instructions are known as scalar instructions. They allow the XMM registers to be used for general-purpose floating-point computations.

In terms of performance, scalar floating-point code can be equivalent to or exceed x87 floating-point code and has the following advantages:
• SIMD floating-point code uses a flat register model, whereas x87 floating-point code uses a stack model. Using scalar floating-point code eliminates the need to use FXCH instructions. These have performance limits on the Intel Pentium 4 processor.
• Mixing with MMX technology code without penalty.
• Flush-to-zero mode.
• Shorter latencies than x87 floating-point.
When using scalar floating-point instructions, it is not necessary to ensure that the data appears in vector form. However, the optimizations regarding alignment, scheduling, instruction selection, and other optimizations covered in Chapter 3 and Chapter 4 should be observed.
6.5 DATA ALIGNMENT

SIMD floating-point data is 16-byte aligned. Referencing unaligned 128-bit SIMD floating-point data will result in an exception unless MOVUPS or MOVUPD (move unaligned packed single or unaligned packed double) is used. The unaligned instructions used on aligned or unaligned data will also suffer a performance penalty relative to aligned accesses.

See also: Section 4.4, "Stack and Data Alignment."
6.5.1 Data Arrangement

Because SSE and SSE2 incorporate SIMD architecture, arranging data to fully use the SIMD registers produces optimum performance. This implies contiguous data for processing, which leads to fewer cache misses. Correct data arrangement can potentially quadruple data throughput when using SSE, or double throughput when using SSE2. Performance gains can occur because four data elements can be loaded with 128-bit load instructions into XMM registers using SSE (MOVAPS). Similarly, two data elements can be loaded with 128-bit load instructions into XMM registers using SSE2 (MOVAPD).

Refer to Section 4.4, "Stack and Data Alignment," for data arrangement recommendations. Duplicating and padding techniques overcome misalignment problems that occur in some data structures and arrangements. This increases the data space but avoids penalties for misaligned data access.

For some applications (for example: 3D geometry), traditional data arrangement requires some changes to fully utilize the SIMD registers and parallel techniques. Traditionally, the data layout has been an array of structures (AoS). To fully utilize the SIMD registers in such applications, a new data layout, a structure of arrays (SoA), has been proposed, resulting in more optimized performance.
6.5.1.1 Vertical versus Horizontal Computation

The majority of the floating-point arithmetic instructions in SSE/SSE2 are focused on vertical data processing for parallel data elements. This means the destination of each element is the result of an arithmetic operation performed on input operands in the same vertical position (Figure 6-1).

To supplement these homogeneous arithmetic operations on parallel data elements, SSE and SSE2 provide data movement instructions (e.g., SHUFPS) that facilitate moving data elements horizontally.
AoS data structures are often used in 3D geometry computations. SIMD technology can be applied to AoS data structures using a horizontal computation model. This means that the X, Y, Z, and W components of a single vertex structure (that is, of a single vector simultaneously referred to as an XYZ data representation, see Figure 6-2) are computed in parallel, and the array is updated one vertex at a time.

When data structures are organized for the horizontal computation model, sometimes the availability of homogeneous arithmetic operations in SSE/SSE2 may cause inefficiency or require additional intermediate movement between data elements.

Alternatively, the data structure can be organized in the SoA format. The SoA data structure enables a vertical computation technique, and is recommended over horizontal computation for many applications, for the following reasons:
• When computing on a single vector (XYZ), it is common to use only a subset of the vector components; for example, in 3D graphics the W component is sometimes ignored. This means that for single-vector operations, 1 of 4 computation slots is not being utilized. This typically results in a 25% reduction of peak efficiency.
Figure 6-1. Homogeneous Operation on Parallel Data Elements
Figure 6-2. Horizontal Computation Model
• It may become difficult to hide long latency operations. For instance, another common function in 3D graphics is normalization, which requires the computation of a reciprocal square root (that is, 1/sqrt). Both the division and square root are long latency operations. With vertical computation (SoA), each of the 4 computation slots in a SIMD operation is producing a unique result, so the net latency per slot is L/4 where L is the overall latency of the operation. However, for horizontal computation, the four computation slots each produce the same result, so producing four separate results requires a net latency per slot of L.

To utilize all four computation slots, the vertex data can be reorganized to allow computation on each component of four separate vertices, that is, processing multiple vectors simultaneously. This can also be referred to as an SoA form of representing vertices data, shown in Table 6-1.

Organizing data in this manner yields a unique result for each computational slot for each arithmetic operation.

Vertical computation takes advantage of the inherent parallelism in 3D geometry processing of vertices. It assigns the computation of four vertices to the four compute slots of the Pentium III processor, thereby eliminating the disadvantages of the horizontal approach described earlier (using SSE alone). The dot product operation implements the SoA representation of vertices data. A schematic representation of the dot product operation is shown in Figure 6-3.
Table 6-1. SoA Form of Representing Vertices Data
Vx array X1 X2 X3 X4 ..... Xn
Vy array Y1 Y2 Y3 Y4 ..... Yn
Vz array Z1 Z2 Z3 Z4 ..... Zn
Vw array W1 W2 W3 W4 ..... Wn
Figure 6-3 shows how one result would be computed for seven instructions if the data were organized as AoS and using SSE alone: four results would require 28 instructions.
Figure 6-3. Dot Product Operation
Example 6-1. Pseudocode for Horizontal (xyz, AoS) Computation
mulps ; x*x', y*y', z*z'
movaps ; reg->reg move, since next steps overwrite
shufps ; get b,a,d,c from a,b,c,d
addps ; get a+b,a+b,c+d,c+d
movaps ; reg->reg move
shufps ; get c+d,c+d,a+b,a+b from prior addps
addps ; get a+b+c+d,a+b+c+d,a+b+c+d,a+b+c+d
Now consider the case when the data is organized as SoA. Example 6-2 demonstrates how four results are computed for five instructions.

For the most efficient use of the four component-wide registers, reorganizing the data into the SoA format yields increased throughput and hence much better performance for the instructions used.

As seen from this simple example, vertical computation yielded 100% use of the available SIMD registers and produced four results. (The results may vary based on the application.) If the data structures must be in a format that is not friendly to vertical computation, they can be rearranged on the fly to achieve full utilization of the SIMD registers. This operation is referred to as swizzling and the reverse operation is referred to as deswizzling.
6.5.1.2 Data Swizzling

Swizzling data from one format to another may be required in many algorithms when the available instruction set extension is limited (for example: only SSE is available). An example of this is AoS format, where the vertices come as XYZ adjacent coordinates. Rearranging them into SoA format (XXXX, YYYY, ZZZZ) allows more efficient SIMD computations.

For efficient data shuffling and swizzling use the following instructions:
• MOVLPS, MOVHPS load/store and move data on half sections of the registers.
• SHUFPS, UNPCKHPS, and UNPCKLPS unpack data.

To gather data from four different memory locations on the fly, follow these steps:
1. Identify the first half of the 128-bit memory location.
2. Group the different halves together using MOVLPS and MOVHPS to form an XYXY layout in two registers.
3. From the 4 attached halves, get XXXX by using one shuffle, YYYY by using another shuffle.

ZZZZ is derived the same way, but only requires one shuffle. Example 6-3 illustrates the swizzle function.
Example 6-2. Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation
mulps ; x*x' for all 4 x-components of 4 vertices
mulps ; y*y' for all 4 y-components of 4 vertices
mulps ; z*z' for all 4 z-components of 4 vertices
addps ; x*x' + y*y'
addps ; x*x'+y*y'+z*z'
Example 6-4 shows the same data-swizzling algorithm encoded using the Intel C++ Compiler's intrinsics for SSE.
Example 6-3. Swizzling Data
typedef struct _VERTEX_AOS {
float x, y, z, w;
} Vertex_aos; // AoS structure declaration
typedef struct _VERTEX_SOA {
float x[4], y[4], z[4];
float w[4];
} Vertex_soa; // SoA structure declaration
void swizzle_asm (Vertex_aos *in, Vertex_soa *out)
{
// in mem: x1y1z1w1-x2y2z2w2-x3y3z3w3-x4y4z4w4-
// SWIZZLE XYZW --> XXXX
asm {
mov ecx, in // get structure addresses
mov edx, out
movlps xmm7, [ecx] // xmm7 = -- -- y1 x1
movhps xmm7, [ecx+16] // xmm7 = y2 x2 y1 x1
movlps xmm0, [ecx+32] // xmm0 = -- -- y3 x3
movhps xmm0, [ecx+48] // xmm0 = y4 x4 y3 x3
movaps xmm6, xmm7 // xmm6 = y2 x2 y1 x1
shufps xmm7, xmm0, 0x88 // xmm7 = x1 x2 x3 x4 => X
shufps xmm6, xmm0, 0xDD // xmm6 = y1 y2 y3 y4 => Y
movlps xmm2, [ecx+8] // xmm2 = -- -- w1 z1
movhps xmm2, [ecx+24] // xmm2 = w2 z2 w1 z1
movlps xmm1, [ecx+40] // xmm1 = -- -- w3 z3
movhps xmm1, [ecx+56] // xmm1 = w4 z4 w3 z3
movaps xmm0, xmm2 // xmm0 = w2 z2 w1 z1
shufps xmm2, xmm1, 0x88 // xmm2 = z1 z2 z3 z4 => Z
shufps xmm0, xmm1, 0xDD // xmm0 = w1 w2 w3 w4 => W
movaps [edx], xmm7 // store X
movaps [edx+16], xmm6 // store Y
movaps [edx+32], xmm2 // store Z
movaps [edx+48], xmm0 // store W
}
}
NOTE
Avoid creating a dependence chain from previous computations because the MOVHPS/MOVLPS instructions bypass one part of the register. The same issue can occur with the use of an exclusive-OR function within an inner loop in order to clear a register: XORPS XMM0, XMM0; all 0s written to XMM0.

Although the generated result of all zeros does not depend on the specific data contained in the source operand (that is, XOR of a register with itself always produces all zeros), the instruction cannot execute until the instruction that generates XMM0 has completed. In the worst case, this creates a dependence chain that links successive iterations of the loop, even if those iterations are otherwise independent. The performance impact can be significant depending on how many other independent intra-loop computations are performed. Note that on the Pentium 4 processor, the SIMD integer PXOR instructions, if used with the same register, do break the dependence chain, eliminating false dependencies when clearing registers.
Example 6-4. Swizzling Data Using Intrinsics
//Intrinsics version of data swizzle
void swizzle_intrin (Vertex_aos *in, Vertex_soa *out, int stride)
{
__m128 x, y, z, w;
__m128 tmp;
x = _mm_loadl_pi(x,(__m64 *)(in));
x = _mm_loadh_pi(x,(__m64 *)(stride + (char *)(in)));
y = _mm_loadl_pi(y,(__m64 *)(2*stride+(char *)(in)));
y = _mm_loadh_pi(y,(__m64 *)(3*stride+(char *)(in)));
tmp = _mm_shuffle_ps( x, y, _MM_SHUFFLE( 2, 0, 2, 0));
y = _mm_shuffle_ps( x, y, _MM_SHUFFLE( 3, 1, 3, 1));
x = tmp;
z = _mm_loadl_pi(z,(__m64 *)(8 + (char *)(in)));
z = _mm_loadh_pi(z,(__m64 *)(stride+8+(char *)(in)));
w = _mm_loadl_pi(w,(__m64 *)(2*stride+8+(char*)(in)));
w = _mm_loadh_pi(w,(__m64 *)(3*stride+8+(char*)(in)));
tmp = _mm_shuffle_ps( z, w, _MM_SHUFFLE( 2, 0, 2, 0));
w = _mm_shuffle_ps( z, w, _MM_SHUFFLE( 3, 1, 3, 1));
z = tmp;
_mm_store_ps(&out->x[0], x);
_mm_store_ps(&out->y[0], y);
_mm_store_ps(&out->z[0], z);
_mm_store_ps(&out->w[0], w);
}
The same situation can occur for the above MOVHPS/MOVLPS/SHUFPS sequence. Since each MOVHPS/MOVLPS instruction bypasses part of the destination register, the instruction cannot execute until the prior instruction that generates this register has completed. As with the XORPS example, in the worst case this dependence can prevent successive loop iterations from executing in parallel.

A solution is to include a 128-bit load (that is, from a dummy local variable, such as TMP in Example 6-4) to each register to be used with a MOVHPS/MOVLPS instruction. This action effectively breaks the dependence by performing an independent load from a memory or cached location.
6.5.1.3 Data Deswizzling

In the deswizzle operation, we want to arrange the SoA format back into AoS format so the XXXX, YYYY, ZZZZ are rearranged and stored in memory as XYZ. To do this we can use the UNPCKLPS/UNPCKHPS instructions to regenerate the XYXY layout and then store each half (XY) into its corresponding memory location using MOVLPS/MOVHPS. This is followed by another MOVLPS/MOVHPS to store the Z component. Example 6-5 illustrates the deswizzle function:
Example 6-5. Deswizzling Single-Precision SIMD Data
void deswizzle_asm(Vertex_soa *in, Vertex_aos *out)
{
__asm {
mov ecx, in // load structure addresses
mov edx, out
movaps xmm7, [ecx] // load x1 x2 x3 x4 => xmm7
movaps xmm6, [ecx+16] // load y1 y2 y3 y4 => xmm6
movaps xmm5, [ecx+32] // load z1 z2 z3 z4 => xmm5
movaps xmm4, [ecx+48] // load w1 w2 w3 w4 => xmm4
// START THE DESWIZZLING HERE
movaps xmm0, xmm7 // xmm0= x1 x2 x3 x4
unpcklps xmm7, xmm6 // xmm7= x1 y1 x2 y2
movlps [edx], xmm7 // v1 = x1 y1 -- --
movhps [edx+16], xmm7 // v2 = x2 y2 -- --
unpckhps xmm0, xmm6 // xmm0= x3 y3 x4 y4
movlps [edx+32], xmm0 // v3 = x3 y3 -- --
movhps [edx+48], xmm0 // v4 = x4 y4 -- --
movaps xmm0, xmm5 // xmm0= z1 z2 z3 z4
unpcklps xmm5, xmm4 // xmm5= z1 w1 z2 w2
unpckhps xmm0, xmm4 // xmm0= z3 w3 z4 w4
movlps [edx+8], xmm5 // v1 = x1 y1 z1 w1
movhps [edx+24], xmm5 // v2 = x2 y2 z2 w2
movlps [edx+40], xmm0 // v3 = x3 y3 z3 w3
movhps [edx+56], xmm0 // v4 = x4 y4 z4 w4
// DESWIZZLING ENDS HERE
}
}

You may have to swizzle data in the registers, but not in memory. This occurs when two different functions need to process the data in different layouts. In lighting, for example, data comes as RRRR GGGG BBBB AAAA, and you must deswizzle them into RGBA before converting into integers. In this case, use the MOVLHPS/MOVHLPS instructions to do the first part of the deswizzle, followed by shuffle instructions; see Example 6-6 and Example 6-7.
Example 6-6. Deswizzling Data Using the movlhps and shuffle Instructions
void deswizzle_rgb(Vertex_soa *in, Vertex_aos *out)
{
//---deswizzle rgb---
// assume: xmm1=rrrr, xmm2=gggg, xmm3=bbbb, xmm4=aaaa
__asm {
mov ecx, in // load structure addresses
mov edx, out
movaps xmm1, [ecx] // load r4 r3 r2 r1 => xmm1
movaps xmm2, [ecx+16] // load g4 g3 g2 g1 => xmm2
movaps xmm3, [ecx+32] // load b4 b3 b2 b1 => xmm3
movaps xmm4, [ecx+48] // load a4 a3 a2 a1 => xmm4
// Start deswizzling here
movaps xmm7, xmm4 // xmm7= a4 a3 a2 a1
movhlps xmm7, xmm3 // xmm7= a4 a3 b4 b3
movaps xmm6, xmm2 // xmm6= g4 g3 g2 g1
movlhps xmm3, xmm4 // xmm3= a2 a1 b2 b1
movhlps xmm2, xmm1 // xmm2= g4 g3 r4 r3
movlhps xmm1, xmm6 // xmm1= g2 g1 r2 r1
movaps xmm6, xmm2 // xmm6= g4 g3 r4 r3
movaps xmm5, xmm1 // xmm5= g2 g1 r2 r1
shufps xmm2, xmm7, 0xDD // xmm2= a4 b4 g4 r4 =>v4
shufps xmm1, xmm3, 0x88 // xmm1= a1 b1 g1 r1 =>v1
shufps xmm5, xmm3, 0xDD // xmm5= a2 b2 g2 r2 =>v2
shufps xmm6, xmm7, 0x88 // xmm6= a3 b3 g3 r3 =>v3
movaps [edx], xmm1 // v1
movaps [edx+16], xmm5 // v2
movaps [edx+32], xmm6 // v3
movaps [edx+48], xmm2 // v4
// DESWIZZLING ENDS HERE
}
}

6.5.1.4 Using MMX Technology Code for Copy or Shuffling Functions

If there are some parts in the code that are mainly copying, shuffling, or doing logical manipulations that do not require use of SSE code, consider performing these actions with MMX technology code. For example, if texture data is stored in memory as SoA (UUUU, VVVV) and the data needs only to be deswizzled into AoS layout (UV) for the graphics card to process, use either SSE or MMX technology code. Using MMX instructions allows you to conserve XMM registers for other computational tasks. Example 6-8 illustrates how to use MMX technology code for copying or shuffling.

Example 6-7. Deswizzling 128-bit Integer SIMD Data
void mmx_deswizzle(IVertex_soa *in, IVertex_aos *out, int cnt)
{
__asm {
mov ebx, in // assume 16 byte aligned
mov edx, out // assume 16 byte aligned
mov edi, cnt
xor ecx, ecx
nextdq:
movdqa xmm0, [ebx] // xmm0= u4 u3 u2 u1
movdqa xmm1, [ebx+16] // xmm1= v4 v3 v2 v1
movdqa xmm2, xmm0 // xmm2= u4 u3 u2 u1
punpckhdq xmm0, xmm1 // xmm0= v4 u4 v3 u3
punpckldq xmm2, xmm1 // xmm2= v2 u2 v1 u1
movdqa [edx], xmm2 // store v2 u2 v1 u1
movdqa [edx+16], xmm0 // store v4 u4 v3 u3
add ebx, 32 // advance source pointer
add edx, 32 // advance destination pointer
add ecx, 16
cmp ecx, edi
jl nextdq
}
}
6.5.1.5 Horizontal ADD Using SSE

Although vertical computations generally make better use of SIMD performance than horizontal computations, in some cases code must use a horizontal operation.

MOVLHPS/MOVHLPS and shuffle can be used to sum data horizontally. For example, starting with four 128-bit registers, to sum up each register horizontally while having the final results in one register, use the MOVLHPS/MOVHLPS instructions to align the upper and lower parts of each register. This allows you to use a vertical add. With the resulting partial horizontal summation, full summation follows easily.

Figure 6-4 presents a horizontal add using MOVHLPS/MOVLHPS. Example 6-9 and Example 6-10 provide the code for this operation.
Example 6-8. Using MMX Technology Code for Copying or Shuffling
movq mm0, [Uarray+ebx] ; mm0= u1 u2
movq mm1, [Varray+ebx] ; mm1= v1 v2
movq mm2, mm0 ; mm2= u1 u2
punpckhdq mm0, mm1 ; mm0= u1 v1
punpckldq mm2, mm1 ; mm2= u2 v2
movq [Coords+edx], mm0 ; store u1 v1
movq [Coords+8+edx], mm2 ; store u2 v2
movq mm4, [Uarray+8+ebx] ; mm4= u3 u4
movq mm5, [Varray+8+ebx] ; mm5= v3 v4
movq mm6, mm4 ; mm6= u3 u4
punpckhdq mm4, mm5 ; mm4= u3 v3
punpckldq mm6, mm5 ; mm6= u4 v4
movq [Coords+16+edx], mm4 ; store u3 v3
movq [Coords+24+edx], mm6 ; store u4 v4
Figure 6-4. Horizontal Add Using MOVHLPS/MOVLHPS
Example 6-9. Horizontal Add Using MOVHLPS/MOVLHPS
void horiz_add(Vertex_soa *in, float *out) {
__asm {
mov ecx, in // load structure addresses
mov edx, out
movaps xmm0, [ecx] // load A1 A2 A3 A4 => xmm0
movaps xmm1, [ecx+16] // load B1 B2 B3 B4 => xmm1
movaps xmm2, [ecx+32] // load C1 C2 C3 C4 => xmm2
movaps xmm3, [ecx+48] // load D1 D2 D3 D4 => xmm3
// START HORIZONTAL ADD
movaps xmm5, xmm0 // xmm5= A1,A2,A3,A4
movlhps xmm5, xmm1 // xmm5= A1,A2,B1,B2
movhlps xmm1, xmm0 // xmm1= A3,A4,B3,B4
addps xmm5, xmm1 // xmm5= A1+A3,A2+A4,B1+B3,B2+B4
movaps xmm4, xmm2
movlhps xmm2, xmm3 // xmm2= C1,C2,D1,D2
movhlps xmm3, xmm4 // xmm3= C3,C4,D3,D4
addps xmm3, xmm2 // xmm3= C1+C3,C2+C4,D1+D3,D2+D4
movaps xmm6, xmm5 // xmm6= A1+A3,A2+A4,B1+B3,B2+B4
shufps xmm6, xmm3, 0x88 // xmm6= A1+A3,B1+B3,C1+C3,D1+D3
shufps xmm5, xmm3, 0xDD // xmm5= A2+A4,B2+B4,C2+C4,D2+D4
addps xmm6, xmm5 // xmm6= A,B,C,D horizontal sums
// END HORIZONTAL ADD
movaps [edx], xmm6
}
}
Example 6-10. Horizontal Add Using Intrinsics with MOVHLPS/MOVLHPS
void horiz_add_intrin(Vertex_soa *in, float *out)
{
__m128 tmm0,tmm1,tmm2,tmm3,tmm4,tmm5,tmm6;
// Temporary variables
tmm0 = _mm_load_ps(in->x); // tmm0 = A1 A2 A3 A4
tmm1 = _mm_load_ps(in->y); // tmm1 = B1 B2 B3 B4
tmm2 = _mm_load_ps(in->z); // tmm2 = C1 C2 C3 C4
tmm3 = _mm_load_ps(in->w); // tmm3 = D1 D2 D3 D4
tmm5 = tmm0; // tmm5 = A1 A2 A3 A4
tmm5 = _mm_movelh_ps(tmm5, tmm1); // tmm5 = A1 A2 B1 B2
tmm1 = _mm_movehl_ps(tmm1, tmm0); // tmm1 = A3 A4 B3 B4
tmm5 = _mm_add_ps(tmm5, tmm1); // tmm5 = A1+A3 A2+A4 B1+B3 B2+B4
tmm4 = tmm2;
tmm2 = _mm_movelh_ps(tmm2, tmm3); // tmm2 = C1 C2 D1 D2
tmm3 = _mm_movehl_ps(tmm3, tmm4); // tmm3 = C3 C4 D3 D4
tmm3 = _mm_add_ps(tmm3, tmm2); // tmm3 = C1+C3 C2+C4 D1+D3 D2+D4
tmm6 = _mm_shuffle_ps(tmm5, tmm3, 0x88);
// tmm6 = A1+A3 B1+B3 C1+C3 D1+D3
tmm5 = _mm_shuffle_ps(tmm5, tmm3, 0xDD);
// tmm5 = A2+A4 B2+B4 C2+C4 D2+D4
tmm6 = _mm_add_ps(tmm6, tmm5);
// tmm6 = A1+A2+A3+A4 B1+B2+B3+B4
// C1+C2+C3+C4 D1+D2+D3+D4
_mm_store_ps(out, tmm6);
}
6.5.2 Use of CVTTPS2PI/CVTTSS2SI Instructions

The CVTTPS2PI and CVTTSS2SI instructions encode the truncate/chop rounding mode implicitly in the instruction. They take precedence over the rounding mode specified in the MXCSR register. This behavior can eliminate the need to change the rounding mode from round-nearest to truncate/chop, and then back to round-nearest to resume computation.

Avoid frequent changes to the MXCSR register since there is a penalty associated with writing this register. Typically, when using CVTTPS2PI/CVTTSS2SI, rounding control in MXCSR can always be set to round-nearest.
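For example, a truncating conversion needs no MXCSR manipulation at all (a minimal sketch; the function name is illustrative):

#include <xmmintrin.h>

/* CVTTSS2SI: converts the low single-precision element with truncation
   toward zero, regardless of the rounding mode set in MXCSR. */
int float_to_int_chop(__m128 x)
{
    return _mm_cvtt_ss2si(x);
}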
6.5.3 Flush-to-Zero and Denormals-are-Zero Modes

The flush-to-zero (FTZ) and denormals-are-zero (DAZ) modes are not compatible with the IEEE Standard 754. They are provided to improve performance for applications where underflow is common and where the generation of a denormalized result is not necessary.

See also: Section 3.8.2, "Floating-point Modes and Exceptions."
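A hedged sketch of enabling both modes through the MXCSR helper macros (DAZ support should be verified before setting it on older processors):

#include <pmmintrin.h>

void enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          /* MXCSR.FTZ = 1 */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  /* MXCSR.DAZ = 1 */
}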
6.6 SIMD OPTIMIZATIONS AND MICROARCHITECTURES

Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than Intel NetBurst microarchitecture. Intel Core microarchitecture offers significantly more efficient SIMD floating-point capability than previous microarchitectures. In addition, instruction latency and throughput of SSE3 instructions are significantly improved in Intel Core microarchitecture over previous microarchitectures.
6.6.1 SIMD Floating-point Programming Using SSE3

SSE3 enhances SSE and SSE2 with nine instructions targeted for SIMD floating-point programming. In contrast to many SSE/SSE2 instructions offering homogeneous arithmetic operations on parallel data elements and favoring the vertical computation model, SSE3 offers instructions that perform asymmetric arithmetic operations and arithmetic operations on horizontal data elements.

ADDSUBPS and ADDSUBPD are two instructions with asymmetric arithmetic processing capability (see Figure 6-5). HADDPS, HADDPD, HSUBPS and HSUBPD offer horizontal arithmetic processing capability (see Figure 6-6). In addition:
• MOVSLDUP, MOVSHDUP and MOVDDUP load data from memory (or an XMM register) and replicate data elements at once.
Figure 6-5. Asymmetric Arithmetic Operation of the SSE3 Instruction (the high pair is added, X1 + Y1; the low pair is subtracted, X0 - Y0)
6.6.1.1 SSE3 and Complex Arithmetics

The flexibility of SSE3 in dealing with AoS-type data structures can be demonstrated by the example of multiplication and division of complex numbers. For example, a complex number can be stored in a structure consisting of its real and imaginary parts. This naturally leads to the use of an array of structures. Example 6-11 demonstrates using SSE3 instructions to perform multiplication of single-precision complex numbers. Example 6-12 demonstrates using SSE3 instructions to perform division of complex numbers.
Figure 6-6. Horizontal Arithmetic Operation of the SSE3 Instruction HADDPD
Example 6-11. Multiplication of Two Pairs of Single-Precision Complex Numbers
// Multiplication of (ak + i bk) * (ck + i dk)
// a + i b can be stored as a data structure
movsldup xmm0, Src1 ; load real parts into the destination,
; a1, a1, a0, a0
movaps xmm1, Src2 ; load the 2nd pair of complex values,
; i.e. d1, c1, d0, c0
mulps xmm0, xmm1 ; temporary results, a1d1, a1c1, a0d0,
; a0c0
shufps xmm1, xmm1, 0xB1 ; reorder the real and imaginary
; parts, c1, d1, c0, d0
movshdup xmm2, Src1 ; load the imaginary parts into the
; destination, b1, b1, b0, b0
mulps xmm2, xmm1 ; temporary results, b1c1, b1d1, b0c0,
; b0d0
addsubps xmm0, xmm2 ; b1c1+a1d1, a1c1-b1d1, b0c0+a0d0,
; a0c0-b0d0
Example 6-12. Division of Two Pairs of Single-Precision Complex Numbers
// Division of (ak + i bk) / (ck + i dk)
movshdup xmm0, Src1 ; load imaginary parts into the
; destination, b1, b1, b0, b0
movaps xmm1, Src2 ; load the 2nd pair of complex values,
; i.e. d1, c1, d0, c0
mulps xmm0, xmm1 ; temporary results, b1d1, b1c1, b0d0,
; b0c0
shufps xmm1, xmm1, 0xB1 ; reorder the real and imaginary
; parts, c1, d1, c0, d0
movsldup xmm2, Src1 ; load the real parts into the
; destination, a1, a1, a0, a0
mulps xmm2, xmm1 ; temp results, a1c1, a1d1, a0c0, a0d0
addsubps xmm0, xmm2 ; a1c1+b1d1, b1c1-a1d1, a0c0+b0d0,
; b0c0-a0d0
mulps xmm1, xmm1 ; c1c1, d1d1, c0c0, d0d0
movaps xmm2, xmm1 ; c1c1, d1d1, c0c0, d0d0
shufps xmm2, xmm2, 0xB1 ; d1d1, c1c1, d0d0, c0c0
addps xmm2, xmm1 ; c1c1+d1d1, c1c1+d1d1, c0c0+d0d0,
; c0c0+d0d0
divps xmm0, xmm2
shufps xmm0, xmm0, 0xB1 ; (b1c1-a1d1)/(c1c1+d1d1),
; (a1c1+b1d1)/(c1c1+d1d1),
; (b0c0-a0d0)/(c0c0+d0d0),
; (a0c0+b0d0)/(c0c0+d0d0)
In both examples, the complex numbers are stored in arrays of structures. MOVSLDUP, MOVSHDUP and the asymmetric ADDSUBPS allow performing complex arithmetic on two pairs of single-precision complex numbers simultaneously and without any unnecessary swizzling between data elements.

Due to microarchitectural differences, software should implement multiplication of complex double-precision numbers using SSE3 instructions on processors based on Intel Core microarchitecture. On Intel Core Duo and Intel Core Solo processors, software should use scalar SSE2 instructions to implement double-precision complex multiplication. This is because the data path between SIMD execution units is 128 bits in Intel Core microarchitecture, and only 64 bits in previous microarchitectures. Example 6-13 shows two equivalent implementations of double-precision complex multiply of two pairs of complex numbers using vector SSE2 versus SSE3 instructions. Example 6-14 shows the equivalent scalar SSE2 implementation.
Example 6-13. Double-Precision Complex Multiplication of Two Pairs

SSE2 Vector Implementation:
movapd xmm0, [eax] ;y x
movapd xmm1, [eax+16] ;w z
unpcklpd xmm1, xmm1 ;z z
movapd xmm2, [eax+16] ;w z
unpckhpd xmm2, xmm2 ;w w
mulpd xmm1, xmm0 ;z*y z*x
mulpd xmm2, xmm0 ;w*y w*x
xorpd xmm2, xmm7 ;-w*y +w*x
shufpd xmm2, xmm2, 1 ;w*x -w*y
addpd xmm2, xmm1 ;z*y+w*x z*x-w*y
movapd [ecx], xmm2

SSE3 Vector Implementation:
movapd xmm0, [eax] ;y x
movapd xmm1, [eax+16] ;w z
movapd xmm2, xmm1
unpcklpd xmm1, xmm1 ;z z
unpckhpd xmm2, xmm2 ;w w
mulpd xmm1, xmm0 ;z*y z*x
mulpd xmm2, xmm0 ;w*y w*x
shufpd xmm2, xmm2, 1 ;w*x w*y
addsubpd xmm1, xmm2 ;w*x+z*y z*x-w*y
movapd [ecx], xmm1
Example 6-14. Double-Precision Complex Multiplication Using Scalar SSE2
movsd xmm0, [eax] ;x
movsd xmm5, [eax+8] ;y
movsd xmm1, [eax+16] ;z
movsd xmm2, [eax+24] ;w
movsd xmm3, xmm1 ;z
movsd xmm4, xmm2 ;w
mulsd xmm1, xmm0 ;z*x
mulsd xmm2, xmm0 ;w*x
mulsd xmm3, xmm5 ;z*y
mulsd xmm4, xmm5 ;w*y
subsd xmm1, xmm4 ;z*x - w*y
addsd xmm3, xmm2 ;z*y + w*x
movsd [ecx], xmm1
movsd [ecx+8], xmm3
6.6.1.2 SSE3 and Horizontal Computation

Sometimes the AoS type of data organization is more natural in many algebraic formulas. SSE3 enhances the flexibility of SIMD programming for applications that rely on the horizontal computation model. SSE3 offers several instructions that are capable of horizontal arithmetic operations.

With Intel Core microarchitecture, the latency and throughput of SSE3 instructions for horizontal computation have been significantly improved over previous microarchitectures.

Example 6-15 compares using SSE2 and SSE3 to implement the dot product of a pair of vectors consisting of four elements each. The performance of calculating dot products can be further improved by unrolling to calculate four pairs of vectors per iteration. See Example 6-16.

In both cases, the SSE3 versions are faster than the SSE2 implementations.
Example 6-15. Dot Product of Vector Length 4

Optimized for Intel Core Duo Processor:
movaps xmm0, [eax]
mulps xmm0, [eax+16]
movhlps xmm1, xmm0
addps xmm0, xmm1
pshufd xmm1, xmm0, 1
addss xmm0, xmm1
movss [ecx], xmm0

Optimized for Intel Core Microarchitecture:
movaps xmm0, [eax]
mulps xmm0, [eax+16]
haddps xmm0, xmm0
movaps xmm1, xmm0
psrlq xmm0, 32
addss xmm0, xmm1
movss [ecx], xmm0
Example 6-16. Unrolled Implementation of Four Dot Products

SSE2 Implementation:
movaps xmm0, [eax]
mulps xmm0, [eax+16] ;w0*w1 z0*z1 y0*y1 x0*x1
movaps xmm2, [eax+32]
mulps xmm2, [eax+16+32] ;w2*w3 z2*z3 y2*y3 x2*x3
movaps xmm3, [eax+64]
mulps xmm3, [eax+16+64] ;w4*w5 z4*z5 y4*y5 x4*x5
movaps xmm4, [eax+96]
mulps xmm4, [eax+16+96] ;w6*w7 z6*z7 y6*y7 x6*x7
movaps xmm1, xmm0
unpcklps xmm0, xmm2 ;y2*y3 y0*y1 x2*x3 x0*x1
unpckhps xmm1, xmm2 ;w2*w3 w0*w1 z2*z3 z0*z1
movaps xmm5, xmm3
unpcklps xmm3, xmm4 ;y6*y7 y4*y5 x6*x7 x4*x5
unpckhps xmm5, xmm4 ;w6*w7 w4*w5 z6*z7 z4*z5
addps xmm0, xmm1
addps xmm5, xmm3
movaps xmm1, xmm5
movhlps xmm1, xmm0
movlhps xmm0, xmm5
addps xmm0, xmm1
movaps [ecx], xmm0

SSE3 Implementation:
movaps xmm0, [eax]
mulps xmm0, [eax+16]
movaps xmm1, [eax+32]
mulps xmm1, [eax+16+32]
movaps xmm2, [eax+64]
mulps xmm2, [eax+16+64]
movaps xmm3, [eax+96]
mulps xmm3, [eax+16+96]
haddps xmm0, xmm1
haddps xmm2, xmm3
haddps xmm0, xmm2
movaps [ecx], xmm0
6.6.1.3 Packed Floating-Point Performance in Intel Core Duo Processor

Most packed SIMD floating-point code will speed up on Intel Core Solo processors relative to Pentium M processors. This is due to improvement in decoding packed SIMD instructions.

The improvement of packed floating-point performance on the Intel Core Solo processor over the Pentium M processor depends on several factors. Generally, code that is decoder-bound and/or has a mixture of integer and packed floating-point instructions can expect significant gain. Code that is limited by execution latency and has a cycles-per-instruction ratio greater than one will not benefit from decoder improvement.

When targeting complex arithmetics on Intel Core Solo and Intel Core Duo processors, using single-precision SSE3 instructions can deliver higher performance than alternatives. On the other hand, tasks requiring double-precision complex arithmetics may perform better using scalar SSE2 instructions on Intel Core Solo and Intel Core Duo processors. This is because scalar SSE2 instructions can be dispatched through two ports and executed using two separate floating-point units.

Packed horizontal SSE3 instructions (HADDPS and HSUBPS) can simplify the code sequence for some tasks. However, these instructions consist of more than five micro-ops on Intel Core Solo and Intel Core Duo processors. Care must be taken to ensure the latency and decoding penalty of the horizontal instruction does not offset any algorithmic benefits.
CHAPTER 8
MULTICORE AND HYPER-THREADING TECHNOLOGY

This chapter describes software optimization techniques for multithreaded applications running in an environment using either multiprocessor (MP) systems or processors with hardware-based multithreading support. Multiprocessor systems are systems with two or more sockets, each mated with a physical processor package. Intel 64 and IA-32 processors that provide hardware multithreading support include dual-core processors, quad-core processors and processors supporting HT Technology1.
Computational throughput in a multithreading environment can increase as more hardware resources are added to take advantage of thread-level or task-level parallelism. Hardware resources can be added in the form of more than one physical processor, more processor cores per package, and/or more logical processors per core. Therefore, there are some aspects of multithreading optimization that apply across MP, multicore, and HT Technology. There are also some specific microarchitectural resources that may be implemented differently in different hardware multithreading configurations (for example: execution resources are not shared across different cores, but are shared by two logical processors in the same core if HT Technology is enabled). This chapter covers guidelines that apply to these situations.

This chapter covers:
• Performance characteristics and usage models
• Programming models for multithreaded applications
• Software optimization techniques in five specific areas
8.1 PERFORMANCE AND USAGE MODELS

The performance gains of using multiple processors, multicore processors or HT Technology are greatly affected by the usage model and the amount of parallelism in the control flow of the workload. Two common usage models are:
• Multithreaded applications
• Multitasking using single-threaded applications

1. The presence of hardware multithreading support in Intel 64 and IA-32 processors can be detected by checking the feature flag CPUID.01H:EDX[28]. A return value of 1 in bit 28 indicates that at least one form of hardware multithreading is present in the physical processor package. The number of logical processors present in each package can also be obtained from CPUID. The application must check how many logical processors are enabled and made available to the application at runtime by making the appropriate operating system calls. See the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2A for information.
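A sketch of the feature-flag check described in the footnote, assuming the __cpuid intrinsic from <intrin.h> (Microsoft and Intel compilers; other toolchains expose equivalent facilities):

#include <intrin.h>

int has_hw_multithreading(void)
{
    int regs[4];                /* EAX, EBX, ECX, EDX */
    __cpuid(regs, 1);           /* CPUID leaf 01H */
    return (regs[3] >> 28) & 1; /* EDX[28]: hardware multithreading */
}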
8.1.1 Multithreading
When an application employs multithreading to exploit task-level parallelism in a workload, the control flow of the multi-threaded software can be divided into two parts: parallel tasks and sequential tasks.

Amdahl's law describes an application's performance gain as it relates to the degree of parallelism in the control flow. It is a useful guide for selecting the code modules, functions, or instruction sequences that are most likely to realize the most gains from transforming sequential tasks and control flows into parallel code to take advantage of multithreading hardware support.

Figure 8-1 illustrates how performance gains can be realized for any workload according to Amdahl's law. The bar in Figure 8-1 represents an individual task unit or the collective workload of an entire application.
In general, the speed-up of running multiple threads on an MP system with N physical processors, over single-threaded execution, can be expressed as:

   RelativeResponse = Tparallel / Tsequential = (1 - P) + P/N + O

where P is the fraction of the workload that can be parallelized, and O represents the overhead of multithreading, which may vary between operating systems. In this case, performance gain is the inverse of the relative response.

When optimizing application performance in a multithreaded environment, control flow parallelism is likely to have the largest impact on performance scaling with respect to the number of physical processors and to the number of logical processors per physical processor.

Figure 8-1. Amdahl's Law and MP Speed-up (the figure contrasts a single-thread bar, with serial portion 1-P followed by parallel portion P over time Tsequential, against a multi-thread-on-MP bar, with 1-P followed by P/2 on each of two processors plus synchronization overhead, over the shorter time Tparallel)
If the control flow of a multi-threaded application contains a workload in which only 50% can be executed in parallel, the maximum performance gain using two physical processors is only 33%, compared to using a single processor. Using four processors can deliver no more than a 60% speed-up over a single processor. Thus, it is critical to maximize the portion of control flow that can take advantage of parallelism. Improper implementation of thread synchronization can significantly increase the proportion of serial control flow and further reduce the application's performance scaling.
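To make the arithmetic explicit: with P = 0.5 and N = 2 (taking the overhead O as zero), the relative response is (1 - 0.5) + 0.5/2 = 0.75, whose inverse is 1.33, a 33% gain; with N = 4 it is 0.5 + 0.5/4 = 0.625, whose inverse is 1.6, a 60% gain.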
In addition to maximizing the parallelism of control flows, interaction between threads in the form of thread synchronization and imbalance of task scheduling can also significantly impact overall processor scaling.

Excessive cache misses are one cause of poor performance scaling. In a multithreaded execution environment, they can occur from:
• Aliased stack accesses by different threads in the same process
• Thread contentions resulting in cache line evictions
• False-sharing of cache lines between different processors

Techniques that address each of these situations (and many other areas) are described in sections in this chapter.
8.1.2 Multitasking Environment
Hardware multithreading capabilities in Intel 64 and IA-32 processors can exploit task-level parallelism when a workload consists of several single-threaded applications and these applications are scheduled to run concurrently under an MP-aware operating system. In this environment, hardware multithreading capabilities can deliver higher throughput for the workload, although the relative performance of a single task (in terms of time of completion relative to the same task when in a single-threaded environment) will vary, depending on how much shared execution resources and memory are utilized.

For development purposes, several popular operating systems (for example Microsoft Windows* XP Professional and Home, Linux* distributions using kernel 2.4.19 or later2) include OS kernel code that can manage the task scheduling and the balancing of shared execution resources within each physical processor to maximize the throughput.
Because applications run independently under a multitasking environment, thread synchronization issues are less likely to limit the scaling of throughput. This is because the control flow of the workload is likely to be 100% parallel3 (if no inter-processor communication is taking place and if there are no system bus constraints).
2. This code is included in Red Hat* Linux Enterprise AS 2.1.
3. A software tool that attempts to measure the throughput of a multitasking workload is likely to
introduce control flows that are not parallel. Thread synchronization issues must be considered
as an integral part of its performance measuring methodology.
With a multitasking workload, however, bus activities and cache access patterns are likely to affect the scaling of the throughput. Running two copies of the same application or same suite of applications in lock-step can expose an artifact in performance measuring methodology. This is because an access pattern to the first-level data cache can lead to excessive cache misses and produce skewed performance results. Fix this problem by:
• Including a per-instance offset at the start-up of an application
• Introducing heterogeneity in the workload by using different datasets with each instance of the application
• Randomizing the sequence of start-up of applications when running multiple copies of the same suite

When two applications are employed as part of a multitasking workload, there is little synchronization overhead between these two processes. It is also important to ensure each application has minimal synchronization overhead within itself.

An application that uses lengthy spin loops for intra-process synchronization is less likely to benefit from HT Technology in a multitasking workload. This is because critical resources will be consumed by the long spin loops.
8.2 PROGRAMMING MODELS AND MULTITHREADING
Parallelism is the most important concept in designing a multithreaded application and realizing optimal performance scaling with multiple processors. An optimized multithreaded application is characterized by large degrees of parallelism or minimal dependencies in the following areas:
• Workload
• Thread interaction
• Hardware utilization

The key to maximizing workload parallelism is to identify multiple tasks that have minimal inter-dependencies within an application and to create separate threads for parallel execution of those tasks.

Concurrent execution of independent threads is the essence of deploying a multithreaded application on a multiprocessing system. Managing the interaction between threads to minimize the cost of thread synchronization is also critical to achieving optimal performance scaling with multiple processors.

Efficient use of hardware resources between concurrent threads requires optimization techniques in specific areas to prevent contentions of hardware resources. Coding techniques for optimizing thread synchronization and managing other hardware resources are discussed in subsequent sections.

Parallel programming models are discussed next.
8.2.1 Parallel Programming Models
Two common programming models for transforming independent task requirements into application threads are:
• Domain decomposition
• Functional decomposition
8.2.1.1 Domain Decomposition
Usually large compute-intensive tasks use data sets that can be divided into a number of small subsets, each having a large degree of computational independence. Examples include:
• Computation of a discrete cosine transformation (DCT) on two-dimensional data by dividing the two-dimensional data into several subsets and creating threads to compute the transform on each subset
• Matrix multiplication; here, threads can be created to handle the multiplication of one half of the matrix with the multiplier matrix (see the sketch at the end of this section)

Domain decomposition is a programming model based on creating identical or similar threads to process smaller pieces of data independently. This model can take advantage of duplicated execution resources present in a traditional multiprocessor system. It can also take advantage of shared execution resources between two logical processors in HT Technology. This is because a data domain thread typically consumes only a fraction of the available on-chip execution resources.

Section 8.3.5, "Key Practices of Execution Resource Optimization," discusses additional guidelines that can help data domain threads use shared execution resources cooperatively and avoid the pitfalls of creating contentions for hardware resources between two threads.
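The following is a minimal sketch of domain decomposition for the matrix multiplication example above, using Win32* threads; N, NUM_THREADS and the array names are illustrative assumptions. Each thread computes an independent band of result rows, so no synchronization is needed inside the compute loops:

#include <windows.h>

#define N           512
#define NUM_THREADS 2

static double a[N][N], b[N][N], c[N][N];

/* Each thread multiplies its own band of rows of 'a' with 'b'. */
static DWORD WINAPI multiply_rows(LPVOID arg)
{
   int id    = (int)(INT_PTR)arg;
   int first = id * (N / NUM_THREADS);
   int last  = first + (N / NUM_THREADS);
   for (int i = first; i < last; i++)
      for (int j = 0; j < N; j++) {
         double sum = 0.0;
         for (int k = 0; k < N; k++)
            sum += a[i][k] * b[k][j];
         c[i][j] = sum;
      }
   return 0;
}

int main(void)
{
   HANDLE t[NUM_THREADS];
   for (int i = 0; i < NUM_THREADS; i++)
      t[i] = CreateThread(NULL, 0, multiply_rows, (LPVOID)(INT_PTR)i, 0, NULL);
   WaitForMultipleObjects(NUM_THREADS, t, TRUE, INFINITE);
   for (int i = 0; i < NUM_THREADS; i++)
      CloseHandle(t[i]);
   return 0;
}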
8.2.2 Functional Decomposition
Applications usually process a wide variety of tasks with diverse functions and many unrelated data sets. For example, a video codec needs several different processing functions. These include DCT, motion estimation and color conversion. Using a functional threading model, applications can program separate threads to do motion estimation, color conversion, and other functional tasks.

Functional decomposition will achieve more flexible thread-level parallelism if it is less dependent on the duplication of hardware resources. For example, a thread executing a sorting algorithm and a thread executing a matrix multiplication routine are not likely to require the same execution unit at the same time. A design recognizing this can take advantage of traditional multiprocessor systems as well as multiprocessor systems using processors supporting HT Technology.
8.2.3 Specialized Programming Models
Intel Core Duo processors and processors based on Intel Core microarchitecture offer a second-level cache shared by two processor cores in the same physical package. This provides opportunities for two application threads to access some application data while minimizing the overhead of bus traffic.

Multi-threaded applications may need to employ specialized programming models to take advantage of this type of hardware feature. One such scenario is referred to as producer-consumer. In this scenario, one thread writes data into some destination (hopefully in the second-level cache) and another thread executing on the other core in the same physical package subsequently reads data produced by the first thread.

The basic approach for implementing a producer-consumer model is to create two threads; one thread is the producer and the other is the consumer. Typically, the producer and consumer take turns to work on a buffer and inform each other when they are ready to exchange buffers. In a producer-consumer model, there is some thread synchronization overhead when buffers are exchanged between the producer and consumer. To achieve optimal scaling with the number of cores, the synchronization overhead must be kept low. This can be done by ensuring the producer and consumer threads have comparable time constants for completing each incremental task prior to exchanging buffers.

Example 8-1 illustrates the coding structure of single-threaded execution of a sequence of task units, where each task unit (either the producer or consumer) executes serially (shown in Figure 8-2). In the equivalent scenario under multithreaded execution, each producer-consumer pair is wrapped as a thread function and two threads can be scheduled on available processor resources simultaneously.
Example 8-1. Serial Execution of Producer and Consumer Work Items

for (i = 0; i < number_of_iterations; i++) {
   producer (i, buff); // pass buffer index and buffer address
   consumer (i, buff);
}

Figure 8-2. Single-threaded Execution of Producer-consumer Threading Model (the main thread executes the producer and consumer task units P(1), C(1), P(1), C(1), ... serially)
8.2.3.1 Producer-Consumer Threading Models
Figure 8-3 illustrates the basic scheme of interaction between a pair of producer and consumer threads. The horizontal direction represents time. Each block represents a task unit, processing the buffer assigned to a thread.

The gap between each task represents synchronization overhead. The decimal number in the parentheses represents a buffer index. On an Intel Core Duo processor, the producer thread can store data in the second-level cache to allow the consumer thread to continue work with minimal bus traffic.

The basic structure to implement the producer and consumer thread functions with synchronization to communicate the buffer index is shown in Example 8-2.
Figure 8-3. Execution of Producer-consumer Threading Model on a Multicore Processor (the main thread's producer tasks P(1), P(2), P(1), P(2), ... run on one core while the consumer tasks C(1), C(2), C(1), C(2), ... run on the other, staggered by one buffer; P: producer, C: consumer)
Example 8-2. Basic Structure of Implementing Producer Consumer Threads

(a) Basic structure of a producer thread function

void producer_thread()
{
   int iter_num = workamount - 1; // make local copy
   int mode1 = 1; // track usage of two buffers via 0 and 1
   produce(buffs[0],count); // placeholder function
   while (iter_num--) {
      Signal(&signal1,1); // tell the other thread to commence
      produce(buffs[mode1],count); // placeholder function
      WaitForSignal(&end1);
      mode1 = 1 - mode1; // switch to the other buffer
   }
}
(b) Basic structure of a consumer thread function

void consumer_thread()
{
   int mode2 = 0; // first iteration starts with buffer 0, then alternate
   int iter_num = workamount - 1;
   while (iter_num--) {
      WaitForSignal(&signal1);
      consume(buffs[mode2],count); // placeholder function
      Signal(&end1,1);
      mode2 = 1 - mode2;
   }
   consume(buffs[mode2],count);
}
It is possible to structure the producer-consumer model in an interlaced manner such that it can minimize bus traffic and be effective on multicore processors without a shared second-level cache.

In this interlaced variation of the producer-consumer model, each scheduling quantum of an application thread comprises a producer task and a consumer task. Two identical threads are created to execute in parallel. During each scheduling quantum of a thread, the producer task starts first and the consumer task follows after the completion of the producer task; both tasks work on the same buffer. As each task completes, one thread signals to the other thread, notifying its corresponding task to use its designated buffer. Thus, the producer and consumer tasks execute in parallel in two threads. As long as the data generated by the producer reside in either the first or second level cache of the same core, the consumer can access them without incurring bus traffic. The scheduling of the interlaced producer-consumer model is shown in Figure 8-4.

Figure 8-4. Interlaced Variation of the Producer Consumer Model (Thread 0 runs the task sequence P(1), C(1), P(1), C(1), ... while Thread 1 runs P(2), C(2), P(2), C(2), ... in parallel)
Example 8-3 shows the basic structure of a thread function that can be used in this interlaced producer-consumer model.
Example 8-3. Thread Function for an Interlaced Producer Consumer Model

// master thread starts first iteration, other thread must wait
// one iteration
void producer_consumer_thread(int master)
{
   int mode = 1 - master; // track which thread and its designated
                          // buffer index
   unsigned int iter_num = workamount >> 1;
   unsigned int i = 0;
   iter_num += master & workamount & 1;
   if (master) // master thread starts the first iteration
   {
      produce(buffs[mode],count);
      Signal(sigp[1-mode],1); // notify producer task in follower
                              // thread that it can proceed
      consume(buffs[mode],count);
      Signal(sigc[1-mode],1);
      i = 1;
   }
   for (; i < iter_num; i++)
   {
      WaitForSignal(sigp[mode]);
      produce(buffs[mode],count);
      Signal(sigp[1-mode],1); // notify the producer task in
                              // the other thread
      WaitForSignal(sigc[mode]);
      consume(buffs[mode],count);
      Signal(sigc[1-mode],1);
   }
}
8.2.4 Tools for Creating Multithreaded Applications
Programming directly to a multithreading application programming interface (API) is not the only method for creating multithreaded applications. New tools (such as the Intel compiler) have become available with capabilities that make the challenge of creating multithreaded applications easier.

Features available in the latest Intel compilers are:
• Generating multithreaded code using OpenMP* directives4
• Generating multithreaded code automatically from unmodified high-level code5
8.2.4.1 Programming with OpenMP Directives
OpenMP provides a standardized, non-proprietary, portable set of Fortran and C++ compiler directives supporting shared memory parallelism in applications. OpenMP supports directive-based processing. This uses special preprocessors or modified compilers to interpret parallelism expressed in Fortran comments or C/C++ pragmas. Benefits of directive-based processing include:
• The original source can be compiled unmodified.
• It is possible to make incremental code changes. This preserves algorithms in the original code and enables rapid debugging.
• Incremental code changes help programmers maintain serial consistency. When the code is run on one processor, it gives the same result as the unmodified source code.
• Directives are offered to fine-tune thread scheduling imbalance.
• Intel's implementation of the OpenMP runtime adds minimal threading overhead relative to hand-coded multithreading.
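As a minimal sketch of directive-based processing (the function and array names are illustrative), a single pragma is enough to thread a loop; compiled without OpenMP support, the pragma is ignored and the original serial code is preserved:

void scale_array(float *a, float s, int n)
{
   int i;
   /* The compiler splits the iteration space across the available
      threads; no other source change is needed. */
   #pragma omp parallel for
   for (i = 0; i < n; i++)
      a[i] = a[i] * s;
}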
8.2.4.2 Automatic Parallelization of Code
While OpenMP directives allow programmers to quickly transform serial applications into parallel applications, programmers must identify specific portions of the application code that contain parallelism and add compiler directives. Intel Compiler 6.0 supports a new (-Qparallel) option, which can identify loop structures that contain parallelism. During program compilation, the compiler automatically attempts to decompose the parallelism into threads for parallel processing. No other intervention by the programmer is needed.
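For illustration only (exact option spellings are assumptions that vary by compiler version and platform; consult the compiler documentation), enabling auto-parallelization might look like:

   icl /Qparallel myapp.c    (Windows*)
   icc -parallel myapp.c     (Linux*)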
4. Intel Compiler 5.0 and later supports OpenMP directives. Visit http://developer.intel.com/software/products for details.
5. Intel Compiler 6.0 supports auto-parallelization.
8.2.4.3 Supporting Development Tools
Intel® Threading Analysis Tools include Intel® Thread Checker and Intel® Thread Profiler.
8.2.4.4 Intel® Thread Checker
Use Intel Thread Checker to find threading errors (which include data races, stalls and deadlocks) and reduce the amount of time spent debugging threaded applications.

The Intel Thread Checker product is an Intel VTune Performance Analyzer plug-in data collector that executes a program and automatically locates threading errors. As the program runs, Intel Thread Checker monitors memory accesses and other events and automatically detects situations which could cause unpredictable threading-related results.
8.2.4.5 Thread Profiler
Thread Profiler is a plug-in data collector for the Intel VTune Performance Analyzer. Use it to analyze threading performance and identify parallel performance bottlenecks. It graphically illustrates what each thread is doing at various levels of detail using a hierarchical summary. It can identify inactive threads, critical paths and imbalances in thread execution. Data is collapsed into relevant summaries, sorted to identify parallel regions or loops that require attention.
8.3 OPTIMIZATION GUIDELINES
This section summarizes optimization guidelines for tuning multithreaded applications. Five areas are listed (in order of importance):
• Thread synchronization
• Bus utilization
• Memory optimization
• Front-end optimization
• Execution resource optimization

Practices associated with each area are listed in this section. Guidelines for each area are discussed in greater depth in sections that follow.

Most of the coding recommendations improve performance scaling with processor cores and scaling due to HT Technology. Techniques that apply to only one environment are noted.
8.3.1 Key Practices of Thread Synchronization
Key practices for minimizing the cost of thread synchronization are summarized below:
• Insert the PAUSE instruction in fast spin loops and keep the number of loop repetitions to a minimum to improve overall system performance.
• Replace a spin-lock that may be acquired by multiple threads with pipelined locks such that no more than two threads have write accesses to one lock. If only one thread needs to write to a variable shared by two threads, there is no need to acquire a lock.
• Use a thread-blocking API in a long idle loop to free up the processor.
• Prevent false-sharing of per-thread data between two threads.
• Place each synchronization variable alone, separated by 128 bytes, or in a separate cache line.

See Section 8.4, "Thread Synchronization," for details.
8.3.2 Key Practices of System Bus Optimization
Managing bus traffic can significantly impact the overall performance of multithreaded software and MP systems. Key practices of system bus optimization for achieving high data throughput and quick response are:
• Improve data and code locality to conserve bus command bandwidth.
• Avoid excessive use of software prefetch instructions and allow the automatic hardware prefetcher to work. Excessive use of software prefetches can significantly and unnecessarily increase bus utilization if used inappropriately.
• Consider using overlapping multiple back-to-back memory reads to improve effective cache miss latencies.
• Use full write transactions to achieve higher data throughput.

See Section 8.5, "System Bus Optimization," for details.
8.3.3 Key Practices of Memory Optimization
Key practices for optimizing memory operations are summarized below:
• Use cache blocking to improve locality of data access. Target one quarter to one half of the cache size when targeting processors supporting HT Technology.
• Minimize the sharing of data between threads that execute on different physical processors sharing a common bus.
• Minimize data access patterns that are offset by multiples of 64 KBytes in each thread.
• Adjust the private stack of each thread in an application so the spacing between these stacks is not offset by multiples of 64 KBytes or 1 MByte (prevents unnecessary cache line evictions) when targeting processors supporting HT Technology.
• Add a per-instance stack offset when two instances of the same application are executing in lock steps to avoid memory accesses that are offset by multiples of 64 KBytes or 1 MByte when targeting processors supporting HT Technology.

See Section 8.6, "Memory Optimization," for details.
8.3.4 Key Practices of Front-end Optimization
Key practices for front-end optimization on processors that support HT Technology are:
• Avoid excessive loop unrolling to ensure the Trace Cache is operating efficiently.
• Optimize code size to improve locality of the Trace Cache and increase delivered trace length.

See Section 8.7, "Front-end Optimization," for details.
8.3.5 Key Practices of Execution Resource Optimization
Each physical processor has dedicated execution resources. Logical processors in physical processors supporting HT Technology share specific on-chip execution resources. Key practices for execution resource optimization include:
• Optimize each thread to achieve optimal frequency scaling first.
• Optimize multithreaded applications to achieve optimal scaling with respect to the number of physical processors.
• Use on-chip execution resources cooperatively if two threads are sharing the execution resources in the same physical processor package.
• For each processor supporting HT Technology, consider adding functionally uncorrelated threads to increase the hardware resource utilization of each physical processor package.

See Section 8.8, "Using Thread Affinities to Manage Shared Platform Resources," for details.
8.3.6 Generality and Performance Impact
The next five sections cover the optimization techniques in detail. Recommendations discussed in each section are ranked by importance in terms of estimated local impact and generality.
Rankings are subjective and approximate. They can vary depending on coding style, application and threading domain. The purpose of including high, medium and low impact rankings with each recommendation is to provide a relative indicator as to the degree of performance gain that can be expected when a recommendation is implemented.

It is not possible to predict the likelihood of a code instance across many applications, so an impact ranking cannot be directly correlated to application-level performance gain. The ranking on generality is also subjective and approximate.

Coding recommendations that do not impact all three scaling factors are typically categorized as medium or lower.
8.4 THREAD SYNCHRONIZATION
Applications with multiple threads use synchronization techniques in order to ensure correct operation. However, thread synchronization that is improperly implemented can significantly reduce performance.

The best practice to reduce the overhead of thread synchronization is to start by reducing the application's requirements for synchronization. Intel Thread Profiler can be used to profile the execution timeline of each thread and detect situations where performance is impacted by frequent occurrences of synchronization overhead.

Several coding techniques and operating system (OS) calls are frequently used for thread synchronization. These include spin-wait loops, spin-locks, and critical sections, to name a few. Choosing the optimal OS call for the circumstance and implementing synchronization code with parallelism in mind are critical in minimizing the cost of handling thread synchronization.

SSE3 provides two instructions (MONITOR/MWAIT) to help multithreaded software improve synchronization between multiple agents. In the first implementation of MONITOR and MWAIT, these instructions are available to the operating system so that it can optimize thread synchronization in different areas. For example, an operating system can use MONITOR and MWAIT in its system idle loop (known as the C0 loop) to reduce power consumption. An operating system can also use MONITOR and MWAIT to implement its C1 loop to improve the responsiveness of the C1 loop. See Chapter 7 in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.
8.4.1 Choice of Synchronization Primitives
Thread synchronization often involves modifying some shared data while protecting the operation using synchronization primitives. There are many primitives to choose from. Guidelines that are useful when selecting synchronization primitives are:
• Favor compiler intrinsics or an OS-provided interlocked API for atomic updates of simple data operations, such as increment and compare/exchange (a sketch follows this list). This will be more efficient than other more complicated synchronization primitives with higher overhead.
  For more information on using different synchronization primitives, see the white paper "Developing Multi-threaded Applications: A Platform Consistent Approach" at http://www3.intel.com/cd/ids/developer/asmo-na/eng/53797.htm.
• When choosing between different primitives to implement a synchronization construct, Intel Thread Checker and Thread Profiler can be very useful in dealing with multithreading functional correctness issues and performance impact under multi-threaded execution. Additional information on the capabilities of Intel Thread Checker and Thread Profiler is described in Appendix A.
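As a minimal sketch of the first guideline, using the Windows* interlocked API (the counter and owner variables are illustrative), an atomic increment and an atomic compare/exchange avoid heavier synchronization primitives:

#include <windows.h>

volatile LONG counter = 0;  // shared counter updated by several threads
volatile LONG owner   = 0;  // 0 = free, 1 = taken

void update(void)
{
   // Atomic increment: no lock object and no kernel transition.
   InterlockedIncrement(&counter);

   // Atomic compare/exchange: claim ownership only if currently free.
   if (InterlockedCompareExchange(&owner, 1, 0) == 0) {
      // ... short critical work ...
      InterlockedExchange(&owner, 0);  // release
   }
}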
Table 8-1 is useful for comparing the properties of three categories of synchronization objects available to multi-threaded applications.

Table 8-1. Properties of Synchronization Objects

Cycles to acquire and release (if there is contention):
• Operating system synchronization objects: thousands or tens of thousands of cycles
• Lightweight user synchronization: hundreds of cycles
• Synchronization object based on MONITOR/MWAIT: hundreds of cycles

Power consumption:
• Operating system synchronization objects: saves power by halting the core or logical processor if idle
• Lightweight user synchronization: some power saving if using PAUSE
• Synchronization object based on MONITOR/MWAIT: saves more power than PAUSE

Scheduling and context switching:
• Operating system synchronization objects: returns to the OS scheduler if contention exists (can be tuned with earlier spin loop count)
• Lightweight user synchronization: does not return to the OS scheduler voluntarily
• Synchronization object based on MONITOR/MWAIT: does not return to the OS scheduler voluntarily

Ring level:
• Operating system synchronization objects: Ring 0
• Lightweight user synchronization: Ring 3
• Synchronization object based on MONITOR/MWAIT: Ring 0

Miscellaneous:
• Operating system synchronization objects: some objects provide intra-process synchronization and some are for inter-process communication
• Lightweight user synchronization: must lock accesses to the synchronization variable if several threads may write to it simultaneously; otherwise can write without locks
• Synchronization object based on MONITOR/MWAIT: same as lightweight; can be used only on systems supporting MONITOR/MWAIT

Recommended use conditions:
• Operating system synchronization objects: number of active threads is greater than number of cores; waiting thousands of cycles for a signal; synchronization among processes
• Lightweight user synchronization: number of active threads is less than or equal to number of cores; infrequent contention; need inter-process synchronization
• Synchronization object based on MONITOR/MWAIT: same as lightweight objects; MONITOR/MWAIT available
8.4.2 Synchronization for Short Periods
The frequency and duration that a thread needs to synchronize with other threads depends on application characteristics. When a synchronization loop needs very fast response, applications may use a spin-wait loop.

A spin-wait loop is typically used when one thread needs to wait a short amount of time for another thread to reach a point of synchronization. A spin-wait loop consists of a loop that compares a synchronization variable with some pre-defined value. See Example 8-4(a).

On a modern microprocessor with a superscalar speculative execution engine, a loop like this results in the issue of multiple simultaneous read requests from the spinning thread. These requests usually execute out-of-order with each read request being allocated a buffer resource. On detection of a write by a worker thread to a load that is in progress, the processor must guarantee that no violations of memory order occur. The necessity of maintaining the order of outstanding memory operations inevitably costs the processor a severe penalty that impacts all threads.

This penalty occurs on the Pentium M processor, the Intel Core Solo and Intel Core Duo processors. However, the penalty on these processors is small compared with penalties suffered on the Pentium 4 and Intel Xeon processors. There the performance penalty for exiting the loop is about 25 times more severe.

On a processor supporting HT Technology, spin-wait loops can consume a significant portion of the execution bandwidth of the processor. One logical processor executing a spin-wait loop can severely impact the performance of the other logical processor.
User/Source Coding Rule 20. (M impact, H generality) Insert the PAUSE instruction in fast spin loops and keep the number of loop repetitions to a minimum to improve overall system performance.
On processors that use the Intel NetBurst microarchitecture core, the penalty of exiting from a spin-wait loop can be avoided by inserting a PAUSE instruction in the loop. In spite of the name, the PAUSE instruction improves performance by introducing a slight delay in the loop and effectively causing the memory read requests to be issued at a rate that allows immediate detection of any store to the synchronization variable. This prevents the occurrence of a long delay due to memory order violation.
Example 8-4. Spin-wait Loop and PAUSE Instructions

(a) An un-optimized spin-wait loop experiences a performance penalty when exiting the loop. It consumes execution resources without contributing computational work.

do {
   // This loop can run faster than the speed of memory access,
   // other worker threads cannot finish modifying sync_var until
   // outstanding loads from the spinning loops are resolved.
} while( sync_var != constant_value);

(b) Inserting the PAUSE instruction in a fast spin-wait loop prevents a performance penalty to the spinning thread and the worker thread.

do {
   _asm pause
   // Ensure this loop is de-pipelined, i.e. preventing more than one
   // load request to sync_var to be outstanding,
   // avoiding performance penalty when the worker thread updates
   // sync_var and the spinning thread exiting the loop.
} while( sync_var != constant_value);

(c) A spin-wait loop using a test, test-and-set technique to determine the availability of the synchronization variable. This technique is recommended when writing spin-wait loops to run on Intel 64 and IA-32 architecture processors.

Spin_Lock:
   CMP lockvar, 0;    // Check if lock is free.
   JE Get_Lock
   PAUSE;             // Short delay.
   JMP Spin_Lock;
Get_Lock:
   MOV EAX, 1;
   XCHG EAX, lockvar; // Try to get lock.
   CMP EAX, 0;        // Test if successful.
   JNE Spin_Lock;
Critical_Section:
   <critical section code>
   MOV lockvar, 0;    // Release lock.
One example of inserting the PAUSE instruction in a simplified spin-wait loop is shown in Example 8-4(b). The PAUSE instruction is compatible with all Intel 64 and IA-32 processors. On IA-32 processors prior to Intel NetBurst microarchitecture, the PAUSE instruction is essentially a NOP instruction. Additional examples of optimizing spin-wait loops using the PAUSE instruction are available in Application Note AP-949, "Using Spin-Loops on Intel Pentium 4 Processor and Intel Xeon Processor." See http://www3.intel.com/cd/ids/developer/asmo-na/eng/dc/threading/knowledgebase/19083.htm.

Inserting the PAUSE instruction has the added benefit of significantly reducing the power consumed during the spin-wait because fewer system resources are used.
8.4.3 Optimization with Spin-Locks
Spin-locks are typically used when several threads need to modify a synchronization variable and the synchronization variable must be protected by a lock to prevent unintentional overwrites. When the lock is released, however, several threads may compete to acquire it at once. Such thread contention significantly reduces performance scaling with respect to frequency, number of discrete processors, and HT Technology.

To reduce the performance penalty, one approach is to reduce the likelihood of many threads competing to acquire the same lock. Apply a software pipelining technique to handle data that must be shared between multiple threads.

Instead of allowing multiple threads to compete for a given lock, no more than two threads should have write access to a given lock. If an application must use spin-locks, include the PAUSE instruction in the wait loop. Example 8-4(c) shows an example of the test, test-and-set technique for determining the availability of the lock in a spin-wait loop.

User/Source Coding Rule 21. (M impact, L generality) Replace a spin lock that may be acquired by multiple threads with pipelined locks such that no more than two threads have write accesses to one lock. If only one thread needs to write to a variable shared by two threads, there is no need to use a lock.
8.4.4 Synchronization for Longer Periods
When using a spin-wait loop not expected to be released quickly, an application should follow these guidelines:
• Keep the duration of the spin-wait loop to a minimum number of repetitions.
• Applications should use an OS service to block the waiting thread; this can release the processor so that other runnable threads can make use of the processor or available execution resources.
On processors supporting HT Technology, operating systems should use the HLT instruction if one logical processor is active and the other is not. HLT will allow an idle logical processor to transition to a halted state; this allows the active logical processor to use all the hardware resources in the physical package. An operating system that does not use this technique must still execute instructions on the idle logical processor that repeatedly check for work. This idle loop consumes execution resources that could otherwise be used to make progress on the other active logical processor.

If an application thread must remain idle for a long time, the application should use a thread-blocking API or other method to release the idle processor. The techniques discussed here apply to traditional MP systems, but they have an even higher impact on processors that support HT Technology.

Typically, an operating system provides timing services, for example Sleep(dwMilliseconds)6; such services can be used to prevent frequent checking of a synchronization variable.
Another technique to synchronize between worker threads and a control loop is to use a thread-blocking API provided by the OS. Using a thread-blocking API allows the control thread to use fewer processor cycles for spinning and waiting. This gives the OS more time quanta to schedule the worker threads on available processors. Furthermore, using a thread-blocking API also benefits from the system idle loop optimization that the OS implements using the HLT instruction.
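A minimal sketch of this approach with a Win32* event object (the worker/control split and names are illustrative) is shown below; the control thread blocks inside the OS instead of spinning:

#include <windows.h>

HANDLE done_event;  // auto-reset event signaled by the worker

DWORD WINAPI worker(LPVOID arg)
{
   // ... perform the work ...
   SetEvent(done_event);  // signal completion instead of setting a polled flag
   return 0;
}

void control_thread(void)
{
   HANDLE h;
   done_event = CreateEvent(NULL, FALSE, FALSE, NULL);
   h = CreateThread(NULL, 0, worker, NULL, 0, NULL);
   // Blocks in the OS; the processor is free for other runnable threads.
   WaitForSingleObject(done_event, INFINITE);
   CloseHandle(h);
   CloseHandle(done_event);
}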
User/Source Coding Rule 22. (H impact, M generality) Use a thread-blocking API in a long idle loop to free up the processor.

Using a spin-wait loop in a traditional MP system may be less of an issue when the number of runnable threads is less than the number of processors in the system. If the number of threads in an application is expected to be greater than the number of processors (either one processor or multiple processors), use a thread-blocking API to free up processor resources. A multithreaded application adopting one control thread to synchronize multiple worker threads may consider limiting worker threads to the number of processors in a system and using thread-blocking APIs in the control thread.
8.4.4.1 Avoid Coding Pitfalls in Thread Synchronization
Synchronization between multiple threads must be designed and implemented with care to achieve good performance scaling with respect to the number of discrete processors and the number of logical processors per physical processor. No single technique is a universal solution for every synchronization situation.

The pseudo-code example in Example 8-5(a) illustrates a polling loop implementation of a control thread. If there is only one runnable worker thread, an attempt to call a timing service API, such as Sleep(0), may be ineffective in minimizing the cost of thread synchronization.
6. The Sleep() API is not thread-blocking, because it does not guarantee the processor will be released. Example 8-5(a) shows an example of using Sleep(0), which does not always release the processor to another thread.
Because the control thread still behaves like a fast spinning loop, the only runnable worker thread must share execution resources with the spin-wait loop if both are running on the same physical processor that supports HT Technology. If there is more than one runnable worker thread, then calling a thread-blocking API, such as Sleep(0), could still release the processor running the spin-wait loop, allowing the processor to be used by another worker thread instead of the spinning loop.

A control thread waiting for the completion of worker threads can usually implement thread synchronization using a thread-blocking API or a timing service, if the worker threads require significant time to complete. Example 8-5(b) shows an example that reduces the overhead of the control thread in its thread synchronization.
In general, OS function calls should be used with care when synchronizing threads. When using OS-supported thread synchronization objects (critical section, mutex, or semaphore), preference should be given to the OS service that has the least synchronization overhead, such as a critical section.
Example 8-5. Coding Pitfall using Spin Wait Loop

(a) A spin-wait loop attempts to release the processor incorrectly. It experiences a performance penalty if the only worker thread and the control thread run on the same physical processor package.

// Only one worker thread is running,
// the control loop waits for the worker thread to complete.
ResumeWorkThread(thread_handle);
while (task_not_done) {
   Sleep(0); // Returns immediately back to spin loop.
   // ...
}

(b) A polling loop frees up the processor correctly.

// Let a worker thread run and wait for completion.
ResumeWorkThread(thread_handle);
while (task_not_done) {
   Sleep(FIVE_MILISEC);
   // This processor is released for some duration, the processor
   // can be used by other threads.
   // ...
}
8.4.5 Prevent Sharing of Modified Data and False-Sharing
On an Intel Core Duo processor or a processor based on Intel Core microarchitecture, sharing of modified data incurs a performance penalty when a thread running on one core tries to read or write data that is currently present in modified state in the first-level cache of the other core. This causes eviction of the modified cache line back into memory and reading it into the first-level cache of the other core. The latency of such a cache line transfer is much higher than using data in the immediate first-level cache or second-level cache.

False sharing applies to data used by one thread that happens to reside on the same cache line as different data used by another thread. These situations can also incur a performance delay depending on the topology of the logical processors/cores in the platform.

An example of false sharing in a multithreading environment using processors based on Intel NetBurst microarchitecture is when thread-private data and a thread synchronization variable are located within the line size boundary (64 bytes) or sector boundary (128 bytes). When one thread modifies the synchronization variable, the dirty cache line must be written out to memory and updated for each physical processor sharing the bus. Subsequently, data is fetched into each target processor 128 bytes at a time, causing previously cached data to be evicted from its cache on each target processor.

False sharing can incur a performance penalty when the threads are running on logical processors that reside on different physical processors. For processors that support HT Technology, false sharing incurs a performance penalty when two threads run on different cores, different physical processors, or on the two logical processors in the same physical processor package. In the first two cases, the performance penalty is due to cache evictions to maintain cache coherency. In the latter case, the performance penalty is due to memory order machine clear conditions.

False sharing is not expected to have a performance impact with a single Intel Core Duo processor.

User/Source Coding Rule 23. (H impact, M generality) Beware of false sharing within a cache line (64 bytes on Intel Pentium 4, Intel Xeon, Pentium M, Intel Core Duo processors), and within a sector (128 bytes on Pentium 4 and Intel Xeon processors).

When a common block of parameters is passed from a parent thread to several worker threads, it is desirable for each worker thread to create a private copy of frequently accessed data in the parameter block.
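A minimal sketch of such a private copy (param_block and its fields are illustrative assumptions):

struct param_block {
   int   count;
   float scale;
   /* ... other frequently read fields ... */
};

void worker_thread(struct param_block *shared)
{
   /* The private copy lives on this thread's stack, so subsequent
      reads cannot false-share with writes made by other threads. */
   struct param_block local = *shared;
   int i;

   for (i = 0; i < local.count; i++) {
      /* ... use local.scale instead of shared->scale ... */
   }
}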
8.4.6 Placement of Shared Synchronization Variable
On processors based on Intel NetBurst microarchitecture, bus reads typically fetch 128 bytes into a cache, so the optimal spacing to minimize eviction of cached data is 128 bytes. To prevent false-sharing, synchronization variables and system objects (such as a critical section) should be allocated to reside alone in a 128-byte region and aligned to a 128-byte boundary.

Example 8-6 shows a way to minimize the bus traffic required to maintain cache coherency in MP systems. This technique is also applicable to MP systems using processors with or without HT Technology.

On Pentium M, Intel Core Solo, Intel Core Duo processors, and processors based on Intel Core microarchitecture, a synchronization variable should be placed alone and in a separate cache line to avoid false-sharing. Software must not allow a synchronization variable to span a page boundary.

User/Source Coding Rule 24. (M impact, ML generality) Place each synchronization variable alone, separated by 128 bytes, or in a separate cache line.

User/Source Coding Rule 25. (H impact, L generality) Do not place any spin lock variable to span a cache line boundary.
At the code level, false sharing is a special concern in the following cases:
• Global data variables and static data variables that are placed in the same cache line and are written by different threads.
• Objects allocated dynamically by different threads may share cache lines. Make sure that the variables used locally by one thread are allocated in a manner to prevent sharing the cache line with other threads.

Another technique to enforce alignment of synchronization variables and to avoid a cache line being shared is to use compiler directives when declaring data structures. See Example 8-7.
Example 8-6. Placement of Synchronization and Regular Variables
int regVar;
int padding[32];
int SynVar[32*NUM_SYNC_VARS];
int AnotherVar;
Example 8-7. Declaring Synchronization Variables without Sharing a Cache Line
__declspec(align(64)) unsigned __int64 sum;
struct sync_struct {};
__declspec(align(64)) struct sync_struct sync_var;
Other techniques that prevent false-sharing include:
• Organize variables of different types in data structures (because the layout that compilers give to data variables might be different than their placement in the source code).
• When each thread needs to use its own copy of a set of variables, declare the variables with:
  - the directive threadprivate, when using OpenMP
  - the modifier __declspec(thread), when using the Microsoft compiler
• In managed environments that provide automatic object allocation, the object allocators and garbage collectors are responsible for layout of the objects in memory so that false sharing through two objects does not happen.
• Provide classes such that only one thread writes to each object field and close object fields, in order to avoid false sharing.

One should not equate the recommendations discussed in this section with favoring a sparsely populated data layout. The data-layout recommendations should be adopted only when necessary, avoiding unnecessary bloat in the size of the work set.
8.5 SYSTEM BUS OPTIMIZATION
The system bus services requests from bus agents (e.g. logical processors) to fetch data or code from the memory sub-system. The performance impact due to data traffic fetched from memory depends on the characteristics of the workload, the degree of software optimization of memory access, and the locality enhancements implemented in the software code. A number of techniques to characterize memory traffic of a workload are discussed in Appendix A. Optimization guidelines on locality enhancement are also discussed in Section 3.6.10, "Locality Enhancement," and Section 9.6.11, "Hardware Prefetching and Cache Blocking Techniques."

The techniques described in Chapter 3 and Chapter 9 benefit application performance in a platform where the bus system is servicing a single-threaded environment. In a multi-threaded environment, the bus system typically services many more logical processors, each of which can issue bus requests independently. Thus, techniques on locality enhancements, conserving bus bandwidth, and reducing large-stride-cache-miss delay can have a strong impact on processor scaling performance.
8.5.1 Conserve Bus Bandwidth
In a multithreading environment, bus bandwidth may be shared by memory traffic originated from multiple bus agents (these agents can be several logical processors and/or several processor cores). Preserving the bus bandwidth can improve processor scaling performance. Also, effective bus bandwidth typically will decrease if there are significant large-stride cache misses. Reducing the amount of large-stride cache misses (or reducing DTLB misses) will alleviate the problem of bandwidth reduction due to large-stride cache misses.

One way to conserve available bus command bandwidth is to improve the locality of code and data. Improving the locality of data reduces the number of cache line evictions and requests to fetch data. This technique also reduces the number of instruction fetches from system memory.

User/Source Coding Rule 26. (M impact, H generality) Improve data and code locality to conserve bus command bandwidth.

Using a compiler that supports profile-guided optimization can improve code locality by keeping frequently used code paths in the cache. This reduces instruction fetches. Loop blocking can also improve data locality. Other locality enhancement techniques can also be applied in a multithreading environment to conserve bus bandwidth (see Section 9.6, "Memory Optimization Using Prefetch").

Because the system bus is shared between many bus agents (logical processors or processor cores), software tuning should recognize symptoms of the bus approaching saturation. One useful technique is to examine the queue depth of bus read traffic (see Appendix A.2.1.3, "Workload Characterization"). When the bus queue depth is high, locality enhancement to improve cache utilization will benefit performance more than other techniques, such as inserting more software prefetches or masking memory latency with overlapping bus reads. An approximate working guideline for software to operate below bus saturation is to check if bus read queue depth is significantly below 5.

Some MP and workstation platforms may have a chipset that provides two system buses, with each bus servicing one or more physical processors. The guidelines for conserving bus bandwidth described above also apply to each bus domain.
8.5.2 Understand the Bus and Cache Interactions
Be careful when parallelizing code sections with data sets that result in the total working set exceeding the second-level cache and/or in consumed bandwidth exceeding the capacity of the bus. On an Intel Core Duo processor, if only one thread is using the second-level cache and/or bus, then it is expected to get the maximum benefit of the cache and bus systems because the other core does not interfere with the progress of the first thread. However, if two threads use the second-level cache concurrently, there may be performance degradation if one of the following conditions is true:
• Their combined working set is greater than the second-level cache size.
• Their combined bus usage is greater than the capacity of the bus.
• They both have extensive access to the same set in the second-level cache, and at least one of the threads writes to this cache line.

To avoid these pitfalls, multithreading software should try to investigate parallelism schemes in which only one of the threads accesses the second-level cache at a time, or where the second-level cache and the bus usage does not exceed their limits.
8.5.3 Avoid Excessive Software Prefetches
Pentium 4 and Intel Xeon processors have an automatic hardware prefetcher. It can bring data and instructions into the unified second-level cache based on prior reference patterns. In most situations, the hardware prefetcher is likely to reduce system memory latency without explicit intervention from software prefetches. It is also preferable to adjust data access patterns in the code to take advantage of the characteristics of the automatic hardware prefetcher to improve locality or mask memory latency. Processors based on Intel Core microarchitecture also provide several advanced hardware prefetching mechanisms. Data access patterns that can take advantage of earlier generations of hardware prefetch mechanisms generally can take advantage of more recent hardware prefetch implementations.

Using software prefetch instructions excessively or indiscriminately will inevitably cause performance penalties. This is because excessively or indiscriminately using software prefetch instructions wastes the command and data bandwidth of the system bus.

Using software prefetches delays the hardware prefetcher from starting to fetch data needed by the processor core. It also consumes critical execution resources and can result in stalled execution. In some cases, it may be fruitful to evaluate the reduction or removal of software prefetches to migrate towards more effective use of hardware prefetch mechanisms. The guidelines for using software prefetch instructions are described in Chapter 3. The techniques for using the automatic hardware prefetcher are discussed in Chapter 9.

User/Source Coding Rule 27. (M impact, L generality) Avoid excessive use of software prefetch instructions and allow the automatic hardware prefetcher to work. Excessive use of software prefetches can significantly and unnecessarily increase bus utilization if used inappropriately.
8.5.4 Improve Effective Latency of Cache Misses
System memory access latency due to cache misses is affected by bus traffic. This is because bus read requests must be arbitrated along with other requests for bus transactions. Reducing the number of outstanding bus transactions helps improve effective memory access latency.

One technique to improve the effective latency of memory read transactions is to use multiple overlapping bus reads to reduce the latency of sparse reads. In situations where there is little locality of data or when memory reads need to be arbitrated with other bus transactions, the effective latency of scattered memory reads can be improved by issuing multiple memory reads back-to-back to overlap multiple outstanding memory read transactions. The average latency of back-to-back bus reads is likely to be lower than the average latency of scattered reads interspersed with other bus transactions. This is because only the first memory read needs to wait for the full delay of a cache miss.
User/Source Coding Rule 28. (M impact, M generality) Consider using multiple overlapping back-to-back memory reads to improve effective cache miss latencies.
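To make the idea concrete, the following sketch (not from the original text; the array and function names are illustrative) issues several independent reads back-to-back so their cache-miss latencies overlap, rather than interleaving each read with dependent computation:

/* Sketch: overlap the cache misses of sparse reads. Instead of a
   read-compute-read-compute sequence, issue a batch of independent
   reads so several misses are outstanding at once; only the first
   pays the full miss delay. */
long gather_sum(const int *table, const int *idx, int n)
{
    long sum = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        /* four independent loads issued back-to-back */
        int a = table[idx[i]];
        int b = table[idx[i + 1]];
        int c = table[idx[i + 2]];
        int d = table[idx[i + 3]];
        sum += (long)a + b + c + d;  /* dependent work follows the batch */
    }
    for (; i < n; i++)               /* remainder */
        sum += table[idx[i]];
    return sum;
}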
Another technique to reduce effective memory latency is possible if one can adjust the data access pattern such that the access strides causing successive cache misses in the last-level cache are predominantly less than the trigger threshold distance of the automatic hardware prefetcher. See Section 9.6.3, "Example of Effective Latency Reduction with Hardware Prefetch."
User/Source Coding Rule 29. (M impact, M generality) Consider adjusting the sequencing of memory references such that the distribution of distances of successive cache misses of the last-level cache peaks towards 64 bytes.
8.5.5 Use Full Write Transactions to Achieve Higher Data Rate
Write transactions across the bus can write to physical memory either using the full line size of 64 bytes or using less than the full line size. The latter is referred to as a partial write. Typically, writes to writeback (WB) memory addresses are full-size and writes to write-combine (WC) or uncacheable (UC) type memory addresses result in partial writes. Both cached WB store operations and WC store operations utilize a set of six WC buffers (64 bytes wide) to manage the traffic of write transactions. When competing traffic closes a WC buffer before all writes to the buffer are finished, this results in a series of 8-byte partial bus transactions rather than a single 64-byte write transaction.
User/Source Coding Rule 30. (M impact, M generality) Use full write transactions to achieve higher data throughput.
Frequently, multiple partial writes to WC memory can be combined into full-sized writes using a software write-combining technique to separate WC store operations from competing WB store traffic. To implement software write-combining, uncacheable writes to memory with the WC attribute are written to a small, temporary buffer (WB type) that fits in the first-level data cache. When the temporary buffer is full, the application copies the content of the temporary buffer to the final WC destination.
When partial writes are transacted on the bus, the effective data rate to system memory is reduced to only 1/8 of the system bus bandwidth.
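A minimal sketch of the software write-combining technique just described; the buffer size, struct, and helper names are assumptions for illustration, not a prescribed implementation:

#include <string.h>
#include <stddef.h>

#define SWWC_BUF_SIZE 4096           /* assumed size; must fit in L1 */

/* Sketch of software write-combining: stores accumulate in a small
   write-back (WB) staging buffer that stays in the first-level cache;
   when the buffer fills, it is copied out to the WC-mapped destination
   in one burst, so the bus sees 64-byte full-line writes rather than a
   series of 8-byte partial writes. */
typedef struct {
    char   buf[SWWC_BUF_SIZE];       /* temporary WB staging buffer  */
    size_t fill;                     /* bytes currently staged       */
    char  *wc_dest;                  /* cursor into WC destination   */
} swwc_t;

void swwc_flush(swwc_t *s)
{
    memcpy(s->wc_dest, s->buf, s->fill); /* one burst of full-line writes */
    s->wc_dest += s->fill;
    s->fill = 0;
}

void swwc_write(swwc_t *s, const void *src, size_t len)
{
    if (s->fill + len > SWWC_BUF_SIZE)
        swwc_flush(s);
    memcpy(s->buf + s->fill, src, len);  /* stage into the WB buffer */
    s->fill += len;
}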
8.6 MEMORY OPTIMIZATION
Efficient operation of caches is a critical aspect of memory optimization. Efficient operation of caches needs to address the following:
• Cache blocking
• Shared memory optimization
• Eliminating 64-KByte aliased data accesses
• Preventing excessive evictions in first-level cache
8.6.1 Cache Blocking Technique
Loop blocking is useful for reducing cache misses and improving memory access performance. The selection of a suitable block size is critical when applying the loop blocking technique. Loop blocking is applicable to single-threaded applications as well as to multithreaded applications running on processors with or without HT Technology. The technique transforms the memory access pattern into blocks that efficiently fit in the target cache size.
When targeting Intel processors supporting HT Technology, the loop blocking technique for a unified cache can select a block size that is no more than one half of the target cache size, if there are two logical processors sharing that cache. The upper limit of the block size for loop blocking should be determined by dividing the target cache size by the number of logical processors available in a physical processor package. Typically, some cache lines are needed to access data that are not part of the source or destination buffers used in cache blocking, so the block size can be chosen between one quarter and one half of the target cache size (see Chapter 3, "General Optimization Guidelines").
Software can use the deterministic cache parameter leaf of CPUID to discover which subset of logical processors are sharing a given cache (see Chapter 9, "Optimizing Cache Usage"). Therefore, the guideline above can be extended to allow all the logical processors serviced by a given cache to use the cache simultaneously, by placing an upper limit on the block size equal to the total size of the cache divided by the number of logical processors serviced by that cache. This technique can also be applied to single-threaded applications that will be used as part of a multitasking workload.
User/Source Coding Rule 31. (H impact, H generality) Use cache blocking to improve locality of data access. Target one quarter to one half of the cache size when targeting Intel processors supporting HT Technology, or target a block size that allows all the logical processors serviced by a cache to share that cache simultaneously.
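As an illustration, the following loop-blocking sketch (a matrix transpose; the tile size is an assumed tuning parameter, not from the original text) transforms the access pattern into tiles that fit the target cache:

#define BLOCK 64   /* assumed tile edge; tune so the tile's working set is
                      one quarter to one half of the target (possibly shared) cache */

/* Sketch of cache blocking: the outer ii/jj loops walk BLOCK x BLOCK
   tiles so each tile's working set stays cache-resident while it is
   being read and written. */
void transpose_blocked(const double *a, double *b, int n)
{
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int jj = 0; jj < n; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK && i < n; i++)
                for (int j = jj; j < jj + BLOCK && j < n; j++)
                    b[j * n + i] = a[i * n + j];
}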
8.6.2 Shared-Memory Optimization
Maintaining cache coherency between discrete processors frequently involves moving data across a bus that operates at a clock rate substantially slower than the processor frequency.
8.6.2.1 Minimize Sharing of Data between Physical Processors
When two threads are executing on two physical processors and sharing data, reading from or writing to shared data usually involves several bus transactions (including snooping, requests for ownership changes, and sometimes fetching data across the bus). A thread accessing a large amount of shared memory is likely to have poor processor-scaling performance.
User/Source Coding Rule 32. (H impact, M generality) Minimize the sharing of data between threads that execute on different bus agents sharing a common bus. On a platform consisting of multiple bus domains, software should also minimize data sharing across bus domains.
One technique to minimize sharing of data is to copy data to local stack variables if it is to be accessed repeatedly over an extended period. If necessary, results from multiple threads can be combined later by writing them back to a shared memory location. This approach can also minimize time spent synchronizing access to shared data.
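A minimal sketch of this local-copy technique; the function and array names are illustrative, and the final combining store would be synchronized by whatever mechanism the application already uses:

/* Sketch: copy frequently read shared data to a private (stack) copy,
   work on the copy, and publish the result once at the end, so the
   repeated accesses never generate bus transactions for shared lines. */
void worker(const int *shared_state, int n, int *shared_result)
{
    int local[16];                   /* private stack copy              */
    int i, acc = 0;
    for (i = 0; i < 16 && i < n; i++)
        local[i] = shared_state[i];  /* one burst of shared reads       */
    for (i = 0; i < 16 && i < n; i++)
        acc += local[i] * local[i];  /* repeated accesses hit the copy  */
    *shared_result = acc;            /* single write back to shared data;
                                        synchronize this store as needed */
}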
8.6.2.2 Batched Producer-Consumer Model
The key benefit of a threaded producer-consumer design, shown in Figure 8-5, is to minimize bus traffic while sharing data between the producer and the consumer using a shared second-level cache. On an Intel Core Duo processor, when the work buffers are small enough to fit within the first-level cache, re-ordering of producer and consumer tasks is necessary to achieve optimal performance, because fetching data from L2 to L1 is much faster than having a cache line in one core invalidated and fetched from the bus.
Figure 8-5 illustrates a batched producer-consumer model that can be used to overcome the drawback of using small work buffers in a standard producer-consumer model. In a batched producer-consumer model, each scheduling quantum batches two or more producer tasks, each producer working on a designated buffer. The number of tasks to batch is determined by the criterion that the total working set be greater than the first-level cache but smaller than the second-level cache.
Figure 8-5. Batched Approach of Producer Consumer Model
[Figure: a main thread schedules batched producer tasks P(1)..P(6) running ahead of consumer tasks C(1)..C(4).]
Example 8-8 shows the batched implementation of the producer and consumer thread functions.
Example 8-8. Batched Implementation of the Producer Consumer Threads
void producer_thread()
{ int iter_num = workamount - batchsize;
int mode1;
for (mode1=0; mode1 < batchsize; mode1++)
{ produce(buffs[mode1],count); }
while (iter_num--)
{ Signal(&signal1,1);
produce(buffs[mode1],count); // placeholder function
WaitForSignal(&end1);
mode1++;
if (mode1 > batchsize)
mode1 = 0;
}
}
void consumer_thread()
{ int mode2 = 0;
int iter_num = workamount - batchsize;
while (iter_num--)
{ WaitForSignal(&signal1);
consume(buffs[mode2],count); // placeholder function
Signal(&end1,1);
mode2++;
if (mode2 > batchsize)
mode2 = 0;
}
for (int i = 0; i < batchsize; i++)
{ consume(buffs[mode2],count);
mode2++;
if (mode2 > batchsize)
mode2 = 0;
}
}
8.6.3 Eliminate 64-KByte Aliased Data Accesses
The 64-KByte aliasing condition is discussed in Chapter 3. Memory accesses that satisfy the 64-KByte aliasing condition can cause excessive evictions in the first-level data cache. Eliminating 64-KByte aliased data accesses originating from each thread helps improve frequency scaling in general. Furthermore, it enables the first-level data cache to perform efficiently when HT Technology is fully utilized by software applications.
User/Source Coding Rule 33. (H impact, H generality) Minimize data access patterns that are offset by multiples of 64 KBytes in each thread.
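One possible way to honor this rule is to pad per-thread allocations so their base addresses are not spaced 64 KBytes apart; the sketch below (the pad constant and helper name are assumptions, not from the original) illustrates the idea:

#include <stdlib.h>

#define PER_THREAD_PAD 128  /* assumed pad; staggers buffers off any
                               64-KByte-multiple spacing */

/* Sketch: offset each thread's buffer by a small per-thread pad so the
   start addresses of thread-private data are not offset by a multiple
   of 64 KBytes. The caller must keep the unpadded pointer if it needs
   to free the allocation later. */
char *alloc_thread_buffer(int thread_id, size_t size)
{
    char *raw = malloc(size + (size_t)thread_id * PER_THREAD_PAD);
    return raw ? raw + (size_t)thread_id * PER_THREAD_PAD : NULL;
}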
The presence of 64-KByte aliased data accesses can be detected using Pentium 4 processor performance monitoring events. Appendix B includes an updated list of Pentium 4 processor performance metrics. These metrics are based on events accessed using the Intel VTune Performance Analyzer.
Performance penalties associated with 64-KByte aliasing are applicable mainly to current processor implementations of HT Technology or Intel NetBurst microarchitecture. The next section discusses memory optimization techniques that are applicable to multithreaded applications running on processors supporting HT Technology.
8.6.4 Preventing Excessive Evictions in First-Level Data Cache
Cached data in the first-level data cache are indexed to linear addresses but physically tagged. Data in second-level and third-level caches are tagged and indexed to physical addresses. While two logical processors in the same physical processor package execute in separate linear address spaces, they can reference data at the same linear address in the two address spaces but mapped to different physical addresses. When such competing accesses occur simultaneously, they can cause repeated evictions and allocations of cache lines in the first-level data cache. Preventing unnecessary evictions in the first-level data cache by two competing threads improves the temporal locality of the first-level data cache.
Multithreaded applications need to prevent unnecessary evictions in the first-level data cache when:
• Multiple threads within an application access private data on their stacks; some data access patterns can cause excessive evictions of cache lines. Within the same software process, multiple threads have their respective stacks, and these stacks are located at different linear addresses. Frequently the linear addresses of these stacks are spaced apart by some fixed distance that increases the likelihood of a cache line being used by multiple threads.
• Two instances of the same application run concurrently and are executing in lock step (for example, corresponding data in each instance are accessed more or less synchronously); accessing data on the stack (and sometimes accessing data on the heap) by these two processes can also cause excessive evictions of cache lines because of address conflicts.
8.6.4.1 Per-thread Stack Offset
To prevent private stack accesses in concurrent threads from thrashing the first-level data cache, an application can use a per-thread stack offset for each of its threads. The size of these offsets should be multiples of a common base offset. The optimum choice of this common base offset may depend on the memory access characteristics of the threads, but it should be a multiple of 128 bytes.
One effective technique for choosing a per-thread stack offset in an application is to add an equal amount of stack offset each time a new thread is created in a thread pool.[7] Example 8-9 shows a code fragment that implements per-thread stack offsets for three threads using a reference offset of 1024 bytes.
User/Source Coding Rule 34. (H impact, M generality) Adjust the private stack of each thread in an application so that the spacing between these stacks is not offset by multiples of 64 KBytes or 1 MByte, to prevent unnecessary cache line evictions (when using Intel processors supporting HT Technology).
7. For parallel applications written to run with OpenMP, the OpenMP runtime library in Intel® KAP/Pro Toolset automatically provides the stack offset adjustment for each thread.
Example 8-9. Adding an Offset to the Stack Pointer of Three Threads
// Managing per-thread stack offset to create three threads:
// * Code for the thread function.
// * Stack accesses within descendant functions (do_foo1, do_foo2)
//   are less likely to cause data cache evictions because of the
//   stack offset.
Void Func_thread_entry(DWORD *pArg)
{
    DWORD StackOffset = *pArg;
    DWORD var1; // The local variables at this scope may not benefit
    DWORD var2; // from the adjustment of the stack pointer that ensues.
    // Call runtime library routine to offset the stack pointer.
    _alloca(StackOffset);
    do_foo1();
    do_foo2();
}
main()
{
    DWORD Stack_offset, ID_Thread1, ID_Thread2, ID_Thread3;
    Stack_offset = 1024;
    // Stack offset between parent thread and the first child thread.
    ID_Thread1 = CreateThread(Func_thread_entry, &Stack_offset);
    // Call OS thread API.
8.6.4.2 Per-instance Stack Offset
Each instance of an application runs in its own linear address space, but the address layout of data for the stack segments is identical for both instances. When the instances are running in lock step, stack accesses are likely to cause excessive evictions of cache lines in the first-level data cache for some early implementations of HT Technology in IA-32 processors.
Although this situation (two copies of an application running in lock step) is seldom an objective for multithreaded software or a multiprocessor platform, it can happen at an end user's direction. One solution is to allow each application instance to add a suitable linear address offset for its stack. Once this offset is added at start-up, a buffer of linear addresses is established even when two copies of the same application are executing using two logical processors in the same physical processor package. The space has negligible impact on running dissimilar applications and on executing multiple copies of the same application.
However, the buffer space does enable the first-level data cache to be shared cooperatively when two copies of the same application are executing on the two logical processors in a physical processor package.
To establish a suitable stack offset for two instances of the same application running on two logical processors in the same physical processor package, the stack pointer can be adjusted in the entry function of the application using the technique shown in Example 8-10. The size of stack offsets should also be a multiple of a reference offset that may depend on the characteristics of the application's data access pattern. One way to determine the per-instance value of the stack offsets is to choose a pseudo-random number that is also a multiple of the reference offset or 128 bytes. Usually, this per-instance pseudo-random offset can be less than 7 KBytes. Example 8-10 provides a code fragment for adjusting the stack pointer in an application entry function.
User/Source Coding Rule 35. (M impact, L generality) Add a per-instance stack offset when two instances of the same application are executing in lock step, to avoid memory accesses that are offset by multiples of 64 KBytes or 1 MByte, when targeting Intel processors supporting HT Technology.
Example 8-9. Adding an Offset to the Stack Pointer of Three Threads (Contd.)
    Stack_offset = 2048;
    ID_Thread2 = CreateThread(Func_thread_entry, &Stack_offset);
    Stack_offset = 3072;
    ID_Thread3 = CreateThread(Func_thread_entry, &Stack_offset);
}
8.7 FRONT-END OPTIMIZATION
In the Intel NetBurst microarchitecture family of processors, the instructions are decoded into μops, and sequences of μops called traces are stored in the Execution Trace Cache. The Trace Cache is the primary sub-system in the front end of the processor that delivers μop traces to the execution engine. Optimization guidelines for front-end operation in single-threaded applications are discussed in Chapter 3.
For dual-core processors where the second-level unified cache (for data and code) is duplicated for each core (Pentium Processor Extreme Edition, Pentium D processor), there are no special considerations for front-end optimization on behalf of two processor cores in a physical processor.
For dual-core processors where the second-level unified cache is shared by two processor cores (Intel Core Duo processor and processors based on Intel Core microarchitecture), multi-threaded software should consider the increase in code working set due to two threads fetching code from the unified cache as part of front-end and cache optimization. For quad-core processors based on Intel Core microarchitecture, the considerations that apply to Intel Core 2 Duo processors also apply to quad-core processors.
The next two sub-sections discuss guidelines for optimizing the operation of the Execution Trace Cache on processors supporting HT Technology.
Example 8-10. Adding a Pseudo-random Offset to the Stack Pointer in the Entry Function
void main()
{
    char *pPrivate = NULL;
    long myOffset = GetMod7Krandom128X();
        // A pseudo-random number that is a multiple
        // of 128 and less than 7K.
    // Use runtime library routine to reposition the stack pointer.
    _alloca(myOffset);
    // In the rest of the application code below, stack accesses in
    // descendant functions (e.g. do_foo) are less likely to cause data
    // cache evictions because of the stack offsets.
    do_foo();
}
8.7.1 Avoid Excessive Loop Unrolling
Unrolling loops can reduce the number of branches and improve the branch predictability of application code. Loop unrolling is discussed in detail in Chapter 3. Loop unrolling must be used judiciously. Be sure to consider the benefit of improved branch predictability and the cost of increased code size relative to the Trace Cache.
User/Source Coding Rule 36. (M impact, L generality) Avoid excessive loop unrolling to ensure the Trace Cache is operating efficiently.
On HT-Technology-enabled processors, excessive loop unrolling is likely to reduce the Trace Cache's ability to deliver high-bandwidth μop streams to the execution engine.
8.7.2 Optimization for Code Size
When the Trace Cache is continuously and repeatedly delivering μop traces that are pre-built, the scheduler in the execution engine can dispatch μops for execution at a high rate and maximize the utilization of available execution resources. Optimizing application code size by organizing code sequences that are repeatedly executed into sections, each with a footprint that can fit into the Trace Cache, can improve application performance greatly.
On HT-Technology-enabled processors, multithreaded applications should improve code locality of frequently executed sections and target one half of the size of the Trace Cache for each application thread when considering code size optimization. If code size becomes an issue affecting the efficiency of the front end, this may be detected by evaluating the performance metrics discussed in the previous sub-section with respect to loop unrolling.
User/Source Coding Rule 37. (L impact, L generality) Optimize code size to improve locality of the Trace Cache and increase delivered trace length.
8.8 USING THREAD AFFINITIES TO MANAGE SHARED PLATFORM RESOURCES
Each logical processor in an MP system has a unique initial APIC_ID, which can be queried using CPUID. Resources shared by more than one logical processor in a multithreading platform can be mapped into a three-level hierarchy for a non-clustered MP system. Each of the three levels can be identified by a label, which can be extracted from the initial APIC_ID associated with a logical processor. See Chapter 7 of the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A for details. The three levels are:
• Physical processor package: a PACKAGE_ID label can be used to distinguish different physical packages within a cluster.
• Core: a physical processor package consists of one or more processor cores. A CORE_ID label can be used to distinguish different processor cores within a package.
• SMT: a processor core provides one or more logical processors sharing execution resources. An SMT_ID label can be used to distinguish different logical processors in the same processor core.
Typically, each logical processor that is enabled by the operating system and made available to applications for thread scheduling is represented by a bit in an OS construct, commonly referred to as an affinity mask.[8] Software can use an affinity mask to control the binding of a software thread to a specific logical processor at runtime.
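For example, on Windows a thread can be bound through its affinity mask roughly as follows (the mask value 0x1 is only an illustration; real code would derive masks from the topology tables built in Example 8-11):

#include <windows.h>

/* Sketch: bind the calling thread to the logical processor selected
   by a one-bit affinity mask. */
void bind_self_to_lp0(void)
{
    DWORD_PTR mask = 0x1;   /* bit 0: first logical processor started by the OS */
    SetThreadAffinityMask(GetCurrentThread(), mask);
}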
Software can query CPUID on each enabled logical processor to assemble a table for each level of the three-level identifiers. These tables can be used to track the topological relationships between PACKAGE_ID, CORE_ID, and SMT_ID and to construct look-up tables of initial APIC_ID and affinity masks.
The sequence to assemble tables of PACKAGE_ID, CORE_ID, and SMT_ID is shown in Example 8-11. The example uses support routines described in Chapter 7 of the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A. Affinity masks can be used to optimize shared multithreading resources.
8. The number of non-zero bits in the affinity mask provided by the OS at runtime may be less than the total number of logical processors available in the platform hardware, due to various features implemented either in the BIOS or OS.
Example 8-11. Assembling 3-level IDs, Affinity Masks for Each Logical Processor
// The BIOS and/or OS may limit the number of logical processors
// available to applications after system boot.
// The below algorithm will compute topology for the logical processors
// visible to the thread that is computing it.
// Extract the 3-levels of IDs on every processor.
// SystemAffinity is a bitmask of all the processors started by the OS.
// Use OS specific APIs to obtain it.
// ThreadAffinityMask is used to affinitize the topology enumeration
// thread to each processor using OS specific APIs.
// Allocate per processor arrays to store the Package_ID, Core_ID and
// SMT_ID for every started processor.
Some arrangements of affinity-binding can benefit performance more than other arrangements. This applies to:
• Scheduling two domain-decomposition threads to use separate cores or physical packages in order to avoid contention of execution resources in the same core
• Scheduling two functional-decomposition threads to use shared execution resources cooperatively
• Scheduling pairs of memory-intensive threads and compute-intensive threads to maximize processor scaling and avoid resource contention in the same core
Example 8-11. Assembling 3-level IDs, Affinity Masks for Each Logical Processor (Contd.)
typedef struct {
    AFFINITYMASK affinity_mask; // 8 bytes in 64-bit mode,
                                // 4 bytes otherwise.
    unsigned char smt;
    unsigned char core;
    unsigned char pkg;
    unsigned char initialAPIC_ID;
} APIC_MAP_T;
APIC_MAP_T apic_conf[64];

ThreadAffinityMask = 1;
ProcessorNum = 0;
while (ThreadAffinityMask != 0 && ThreadAffinityMask <= SystemAffinity) {
    // Check to make sure we can utilize this processor first.
    if (ThreadAffinityMask & SystemAffinity) {
        Set thread to run on the processor specified in ThreadAffinityMask.
        Wait if necessary and ensure thread is running on specified processor.
        apic_conf[ProcessorNum].initialAPIC_ID = GetInitialAPIC_ID();
        Extract the Package, Core and SMT ID as explained in the
        three-level extraction algorithm.
        apic_conf[ProcessorNum].pkg = PACKAGE_ID;
        apic_conf[ProcessorNum].core = CORE_ID;
        apic_conf[ProcessorNum].smt = SMT_ID;
        apic_conf[ProcessorNum].affinity_mask = ThreadAffinityMask;
        ProcessorNum++;
    }
    ThreadAffinityMask <<= 1;
}
NumStartedLPs = ProcessorNum;
An example using the 3-level hierarchy and the relationships between the initial APIC_ID and the affinity mask to manage thread affinity binding is shown in Example 8-12. The example shows an implementation that builds a lookup table so that the sequence of thread scheduling is mapped to an array of affinity masks such that threads are scheduled first to the primary logical processor of each processor core. This example is also optimized for the situations of scheduling two memory-intensive threads to run on separate cores and scheduling two compute-intensive threads on separate cores.
User/Source Coding Rule 38. (M impact, L generality) Consider using thread affinities to optimize sharing resources cooperatively in the same core and to subscribe dedicated resources in separate processor cores.
Some multicore processor implementations may have a shared cache topology that is not uniform across different cache levels. The deterministic cache parameter leaf of CPUID will report such cache-sharing topology. The 3-level hierarchy and the relationships between the initial APIC_ID and the affinity mask can also be used to manage such a topology.
Example 8-13 illustrates the steps of discovering sibling logical processors in a physical package sharing a target-level cache. The algorithm assumes initial APIC IDs are assigned in a manner that respects bit-field boundaries, with respect to the modular boundary of the subset of logical processors sharing that cache level. Software can query the number of logical processors in hardware sharing a cache using the deterministic cache parameter leaf of CPUID. By comparing the relevant bits in the initial APIC_ID, one can construct a mask to represent sibling logical processors that are sharing the same cache.
Note that the bit-field boundary of the cache-sharing topology is not necessarily the same as the core boundary. Some cache levels can be shared across the core boundary.
Example 8-12. Assembling a Look up Table to Manage Affinity Masks
and Schedule Threads to Each Core First
AFFINITYMASK LuT[64]; // A Lookup table to retrieve the affinity
// mask we want to use from the thread
// scheduling sequence index.
int index =0; // Index to scheduling sequence.
j = 0;
Example 8-12. Assembling a Look up Table to Manage Affinity Masks and Schedule Threads to Each Core First (Contd.)
// Assemble the sequence so the first LP of each core gets consecutive indices.
while (j < NumStartedLPs) {
    // Determine the first LP in each core.
    if (!apic_conf[j].smt) { // This is the first LP in a core
                             // supporting HT.
        LuT[index++] = apic_conf[j].affinity_mask;
    }
    j++;
}
// Now that we have assigned each core to consecutive indices,
// we can finish the table to use the rest of the
// LPs in each core.
nThreadsPerCore = MaxLPPerPackage()/MaxCoresPerPackage();
for (i = 0; i < nThreadsPerCore; i++) {
    for (j = 0; j < NumStartedLPs; j += nThreadsPerCore) {
        // Set the affinity binding for another logical
        // processor in each core.
        if (apic_conf[i+j].smt) {
            LuT[index++] = apic_conf[i+j].affinity_mask;
        }
    }
}
Example 8-13. Discovering the Affinity Masks for Sibling Logical Processors
Sharing the Same Cache
// Logical processors sharing the same cache can be determined by bucketing
// the logical processors with a mask, whose width is determined from the
// maximum number of logical processors sharing that cache level.
// The algorithm below assumes that all processors have an identical cache
// hierarchy, and that the initial APIC ID assignment across the modular
// boundary of the logical processors sharing the target-level cache
// respects bit-field boundaries. This is a requirement similar to those
// applying to core boundaries and package boundaries. The modular boundary
// of those logical processors sharing the target-level cache may coincide
// with the core boundary or lie above the core boundary.
ThreadAffinityMask = 1;
ProcessorNum = 0;
while (ThreadAffinityMask != 0 && ThreadAffinityMask <= SystemAffinity) {
    // Check to make sure we can utilize this processor first.
    if (ThreadAffinityMask & SystemAffinity) {
        Set thread to run on the processor specified in ThreadAffinityMask.
        Wait if necessary and ensure thread is running on specified processor.
        InitialAPIC_ID = GetInitialAPIC_ID();
        Extract the Package, Core and SMT ID as explained in the
        three-level extraction algorithm.
        Extract the CACHE_ID similar to the PACKAGE_ID extraction algorithm.
        // Cache topology may vary for each cache level; one mask for each level.
        // The target level is selected by the input value index.
        CacheIDMask = (uchar) (0xff <<
            FindMaskWidth(MaxLPSharingCache(TargetLevel))); // See Example 8-9.
        CACHE_ID = InitialAPIC_ID & CacheIDMask;
        PackageID[ProcessorNum] = PACKAGE_ID;
        CoreID[ProcessorNum] = CORE_ID;
        SmtID[ProcessorNum] = SMT_ID;
        CacheID[ProcessorNum] = CACHE_ID;
        // Only the target cache is stored in this example.
        ProcessorNum++;
    }
    ThreadAffinityMask <<= 1;
}
NumStartedLPs = ProcessorNum;
Example 8-13. Discovering the Affinity Masks for Sibling Logical Processors
Sharing the Same Cache (Contd.)
CacheIDBucket is an array of unique Cache_ID values. Allocate an array
of NumStartedLPs count of entries in this array for the target cache level.
CacheProcessorMask is a corresponding array of the bit masks of logical
processors sharing the same target-level cache; these are logical
processors with the same Cache_ID.
The algorithm below assumes there is symmetry across the modular
boundary of the target cache topology if more than one socket is populated
in an MP system, and only the topology of the target cache level is discovered.
The topology of other cache levels can be determined in a similar manner.
// Bucket Cache IDs and compute processor mask for the target cache of every package.
CacheNum = 1;
CacheIDBucket[0] = CacheID[0];
ProcessorMask = 1;
CacheProcessorMask[0] = ProcessorMask;
For (ProcessorNum = 1; ProcessorNum < NumStartedLPs; ProcessorNum++) {
    ProcessorMask <<= 1;
    For (i = 0; i < CacheNum; i++) {
        // We may be comparing bit-fields of logical processors
        // residing in a different modular boundary of the cache
        // topology; the code below assumes symmetry across this
        // modular boundary.
        If (CacheID[ProcessorNum] == CacheIDBucket[i]) {
            CacheProcessorMask[i] |= ProcessorMask;
            Break; // Found in existing bucket, skip to next iteration.
        }
    }
    if (i == CacheNum) {
        // Cache_ID did not match any bucket, start new bucket.
        CacheIDBucket[i] = CacheID[ProcessorNum];
        CacheProcessorMask[i] = ProcessorMask;
        CacheNum++;
    }
}
// CacheNum has the number of distinct modules which contain
// sibling logical processors sharing the target cache.
// The CacheProcessorMask[] array has the masks representing those logical
// processors sharing the same target-level cache.
8.9 OPTIMIZATION OF OTHER SHARED RESOURCES
Resource optimization in multi-threaded applications depends on the cache topology and execution resources associated with the hierarchy of the processor topology. Processor topology and an algorithm for software to identify the processor topology are discussed in Chapter 7 of the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.
Typically the bus system is shared by multiple agents at the SMT level and at the processor core level of the processor topology. Thus multi-threaded application design should start with an approach to manage the bus bandwidth available to multiple processor agents sharing the same bus link in an equitable manner. This can be done by improving the data locality of an individual application thread or by allowing two threads to take advantage of a shared second-level cache (where such a shared cache topology is available).
In general, optimizing the building blocks of a multi-threaded application can start from an individual thread. The guidelines discussed in Chapter 3 through Chapter 9 largely apply to multi-threaded optimization.
Tuning Suggestion 3. Optimize single-threaded code to maximize execution throughput first.
At the SMT level, HT Technology typically provides two logical processors sharing execution resources within a processor core. To help multithreaded applications utilize shared execution resources effectively, the rest of this section describes guidelines to deal with common situations as well as those limited situations where execution resource utilization between threads may impact overall performance.
Most applications only use about 20-30% of peak execution resources when running in a single-threaded environment. A useful indicator is the execution throughput measured at the retirement stage (see Appendix A.2.1.3, "Workload Characterization"). In a processor that supports HT Technology, execution throughput seldom reaches 50% of peak retirement bandwidth. Thus, improving single-thread execution throughput should also benefit multithreading performance.
Tuning Suggestion 4. Optimize multithreaded applications to achieve optimal processor scaling with respect to the number of physical processors or processor cores.
Following guidelines such as reducing thread synchronization costs, enhancing locality, and conserving bus bandwidth will allow multithreading hardware to exploit task-level parallelism in the workload and improve MP scaling. In general, reducing dependence on resources shared between physical packages will benefit processor scaling with respect to the number of physical processors. Similarly, heavy reliance on resources shared between different cores is likely to reduce processor-scaling performance. On the other hand, using shared resources effectively can deliver a positive benefit in processor scaling, if the workload does not saturate the critical resource in contention.
Tuning Suggestion 5. Schedule threads that compete for the same execution resources to separate processor cores.
Tuning Suggestion 6. Use on-chip execution resources cooperatively if two logical processors are sharing the execution resources in the same processor core.
8.9.1 Using Shared Execution Resources in a Processor Core
One way to measure the degree of overall resource utilization by a single thread is to use performance-monitoring events to count the clock cycles during which a logical processor is executing code and compare that number to the number of instructions executed to completion. Such performance metrics are described in Appendix B and can be accessed using the Intel VTune Performance Analyzer.
An event ratio like non-halted cycles per instructions retired (non-halted CPI) and non-sleep CPI can be useful in directing code-tuning efforts. The non-sleep CPI metric can be interpreted as the inverse of the overall throughput of a physical processor package. The non-halted CPI metric can be interpreted as the inverse of the throughput of a logical processor.[9]
When a single thread is executing and all on-chip execution resources are available to it, non-halted CPI can indicate the unused execution bandwidth available in the physical processor package. If the value of non-halted CPI is significantly higher than unity and overall on-chip execution resource utilization is low, a multithreaded application can direct tuning efforts to encompass the factors discussed earlier.
An optimized single thread with exclusive use of on-chip execution resources may exhibit a non-halted CPI in the neighborhood of unity.[10] Because the most frequently used instructions typically decode into a single micro-op and have a throughput of no more than two cycles, an optimized thread that retires one micro-op per cycle is only consuming about one third of peak retirement bandwidth. Significant portions of the issue port bandwidth are left unused. Thus, optimizing single-thread performance usually can be complementary with optimizing a multithreaded application to take advantage of the benefits of HT Technology.
On a processor supporting HT Technology, it is possible that an execution unit with lower throughput than one issue every two cycles may find itself in contention from two threads implemented using a data-decomposition threading model. In one scenario, this can happen when the inner loop of both threads relies on executing a low-throughput instruction, such as FDIV, and the execution time of the inner loop is bound by the throughput of FDIV.
Using a functional-decomposition threading model, a multithreaded application can pair up a thread with a critical dependence on a low-throughput resource with other threads that do not have the same dependency.
9. Non-halted CPI can correlate to the resource utilization of an application thread, if the application thread is affinitized to a fixed logical processor.
10. In current implementations of processors based on Intel NetBurst microarchitecture, the theoretical lower bound for either non-halted CPI or non-sleep CPI is 1/3. Practical applications rarely achieve any value close to the lower bound.
User/Source Coding Rule 39. (M impact, L generality) If a single thread consumes half of the peak bandwidth of a specific execution unit (e.g. FDIV), consider adding a thread that seldom or rarely relies on that execution unit, when tuning for HT Technology.
To ensure execution resources are shared cooperatively and efficiently between two logical processors, it is important to reduce stall conditions, especially those conditions causing the machine to flush its pipeline.
The primary indicator of a Pentium 4 processor pipeline stall condition is called Machine Clear. The metric is available through the VTune Analyzer's event sampling capability. When the machine clear condition occurs, all instructions that are in flight (at various stages of processing in the pipeline) must be resolved and then they are either retired or cancelled. While the pipeline is being cleared, no new instructions can be fed into the pipeline for execution. Before a machine clear condition is de-asserted, execution resources are idle.
Reducing the machine clear condition benefits single-thread performance because it increases the frequency scaling of each thread. The impact is even higher on processors supporting HT Technology, because a machine clear condition caused by one thread can impact other threads executing simultaneously.
Several performance metrics can be used to detect situations that may cause a pipeline to be cleared. The primary metric is the Machine Clear Count: it indicates the total number of times a machine clear condition is asserted due to any cause. Possible causes include memory order violations and self-modifying code. Assists while executing x87 or SSE instructions have a similar effect on the processor's pipeline and should be reduced to a minimum.
Write-combining buffers are another example of execution resources shared between two logical processors. With two threads running simultaneously on a processor supporting HT Technology, the writes of both threads count toward the limit of four write-combining buffers. For example, if an inner loop that writes to three separate areas of memory per iteration is run by two threads simultaneously, the total number of cache lines written could be six, and the code loses the benefit of write-combining. Loop fission applied to this situation creates two loops, neither of which is allowed to write to more than two cache lines per iteration.
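A minimal sketch of this loop-fission remedy; the arrays and arithmetic are illustrative, not from the original text:

/* Sketch of loop fission for write-combining buffers: the first loop
   writes three output streams per iteration (a, b, c); the split
   version writes at most two streams per loop, keeping the per-thread
   demand on the write-combining buffers low. */
void fused(float *a, float *b, float *c, const float *x, int n)
{
    for (int i = 0; i < n; i++) {   /* three write streams per iteration */
        a[i] = x[i] + 1.0f;
        b[i] = x[i] * 2.0f;
        c[i] = x[i] - 3.0f;
    }
}

void fissioned(float *a, float *b, float *c, const float *x, int n)
{
    for (int i = 0; i < n; i++) {   /* first loop: two write streams */
        a[i] = x[i] + 1.0f;
        b[i] = x[i] * 2.0f;
    }
    for (int i = 0; i < n; i++)     /* second loop: one write stream */
        c[i] = x[i] - 3.0f;
}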
The rules and tuning suggestions discussed above are summarized in Appendix E.
CHAPTER 9
OPTIMIZING CACHE USAGE
Over the past decade, processor speed has increased while memory access speed has increased at a slower pace. The resulting disparity has made it important to tune applications in one of two ways: either (a) ensure that a majority of data accesses are fulfilled from processor caches, or (b) effectively mask memory latency to utilize peak memory bandwidth as much as possible.
Hardware prefetching mechanisms are enhancements in microarchitecture to facilitate the latter aspect, and will be most effective when combined with software tuning. The performance of most applications can be considerably improved if the data required can be fetched from the processor caches or if memory traffic can take advantage of hardware prefetching effectively.
Standard techniques to bring data into the processor before it is needed involve additional programming, which can be difficult to implement and may require special steps to prevent performance degradation. Streaming SIMD Extensions addressed this issue by providing various prefetch instructions.
Streaming SIMD Extensions also introduced the various non-temporal store instructions. SSE2 extends this support to new data types and also introduces non-temporal store support for the 32-bit integer registers.
This chapter focuses on:
• Hardware Prefetch Mechanism, Software Prefetch and Cacheability Instructions: discusses microarchitectural features and instructions that allow you to affect data caching in an application.
• Memory Optimization Using Hardware Prefetching, Software Prefetch and Cacheability Instructions: discusses techniques for implementing memory optimizations using the above instructions.
NOTE
In a number of cases presented, the prefetching and cache utilization described are specific to current implementations of Intel NetBurst microarchitecture but are largely applicable for future processors.
• Using deterministic cache parameters to manage cache hierarchy.
9.1 GENERAL PREFETCH CODING GUIDELINES
The following guidelines will help you to reduce memory traffic and utilize peak memory system bandwidth more effectively when large amounts of data movement must originate from the memory system:
• Take advantage of the hardware prefetcher's ability to prefetch data that are accessed in linear patterns, in either a forward or backward direction.
• Take advantage of the hardware prefetcher's ability to prefetch data that are accessed in a regular pattern with access strides that are substantially smaller than half of the trigger distance of the hardware prefetch (see Table 2-6).
• Use a current-generation compiler, such as the Intel C++ Compiler, that supports C++ language-level features for Streaming SIMD Extensions. Streaming SIMD Extensions and MMX technology instructions provide intrinsics that allow you to optimize cache utilization. Examples of Intel compiler intrinsics include _mm_prefetch, _mm_stream, _mm_load, and _mm_sfence. For details, refer to the Intel C++ Compiler User's Guide documentation.
• Facilitate compiler optimization by:
  - Minimizing use of global variables and pointers.
  - Minimizing use of complex control flow.
  - Using the const modifier and avoiding the register modifier.
  - Choosing data types carefully (see below) and avoiding type casting.
• Use cache blocking techniques (for example, strip mining) as follows:
  - Improve cache hit rate by using cache blocking techniques such as strip-mining (one-dimensional arrays) or loop blocking (two-dimensional arrays).
  - Explore using the hardware prefetching mechanism if your data access pattern has sufficient regularity to allow alternate sequencing of data accesses (for example, tiling) for improved spatial locality. Otherwise use PREFETCHNTA.
• Balance single-pass versus multi-pass execution:
  - Single-pass, or unlayered, execution passes a single data element through an entire computation pipeline.
  - Multi-pass, or layered, execution performs a single stage of the pipeline on a batch of data elements before passing the entire batch on to the next stage.
  - If your algorithm is single-pass, use PREFETCHNTA. If your algorithm is multi-pass, use PREFETCHT0.
• Resolve memory bank conflict issues. Minimize memory bank conflicts by applying array grouping to group contiguously used data together or by allocating data within 4-KByte memory pages.
• Resolve cache management issues. Minimize the disturbance of temporal data held within the processor's caches by using streaming store instructions.
• Optimize the software prefetch scheduling distance:
  - Far enough ahead to allow interim computations to overlap memory access time.
  - Near enough that prefetched data is not replaced from the data cache.
• Use software prefetch concatenation. Arrange prefetches to avoid unnecessary prefetches at the end of an inner loop and to prefetch the first few iterations of the inner loop inside the next outer loop.
• Minimize the number of software prefetches. Prefetch instructions are not completely free in terms of bus cycles, machine cycles and resources; excessive usage of prefetches can adversely impact application performance.
• Interleave prefetches with computation instructions. For best performance, software prefetch instructions must be interspersed with computational instructions in the instruction sequence (rather than clustered together); see the sketch after this list.
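The sketch below combines several of these guidelines (one prefetch per cache line, a scheduling distance ahead of use, interleaved with computation); the distance value and the use of PREFETCHNTA are illustrative assumptions, not prescribed settings:

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */

#define PF_DIST 256      /* assumed prefetch scheduling distance, in bytes */

/* Sketch: one software prefetch per 64-byte line (16 floats), issued
   PF_DIST bytes ahead and interspersed with the computation so memory
   access time overlaps the arithmetic. */
float sum_of_squares(const float *a, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        if ((i & 15) == 0)   /* once per cache line, not every iteration */
            _mm_prefetch((const char *)&a[i] + PF_DIST, _MM_HINT_NTA);
        sum += a[i] * a[i];  /* interim computation overlapping the miss */
    }
    return sum;
}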
9.2 HARDWARE PREFETCHING OF DATA
Pentium M, Intel Core Solo, and Intel Core Duo processors, and processors based on Intel Core microarchitecture and Intel NetBurst microarchitecture, provide hardware data prefetch mechanisms which monitor application data access patterns and prefetch data automatically. This behavior is automatic and does not require programmer intervention.
For processors based on Intel NetBurst microarchitecture, characteristics of the hardware data prefetcher are:
1. It requires two successive cache misses in the last-level cache to trigger the mechanism; these two cache misses must satisfy the condition that strides of the cache misses are less than the trigger distance of the hardware prefetch mechanism (see Table 2-6).
2. It attempts to stay 256 bytes ahead of current data access locations.
3. It follows only one stream per 4-KByte page (load or store).
4. It can prefetch up to 8 simultaneous, independent streams from eight different 4-KByte regions.
5. It does not prefetch across a 4-KByte boundary. This is independent of paging modes.
6. It fetches data into the second/third-level cache.
7. It does not prefetch UC or WC memory types.
8. It follows load and store streams. It issues Read For Ownership (RFO) transactions for store streams and Data Reads for load streams.
Other than items 2 and 4 discussed above, most other characteristics also apply to Pentium M, Intel Core Solo and Intel Core Duo processors. The hardware prefetcher implemented in the Pentium M processor fetches data to the second-level cache. It can track 12 independent streams in the forward direction and 4 independent streams in the backward direction. The hardware prefetcher of the Intel Core Solo processor can track 16 forward streams and 4 backward streams. On the Intel Core Duo processor, the hardware prefetcher in each core fetches data independently.
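As a rough illustration of an access pattern these prefetchers can follow, a unit-stride walk (below) lets the hardware run ahead within each 4-KByte page, while large-stride or pointer-chasing walks defeat it:

/* Sketch: a forward unit-stride stream is the friendliest case for the
   automatic hardware prefetcher; after the first misses trigger it, the
   prefetcher stays ahead of the loop within each 4-KByte page. */
long sum_sequential(const int *a, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)   /* unit stride: prefetcher-friendly */
        sum += a[i];
    return sum;
}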
Hardware prefetch mechanisms of processors based on Intel Core microarchitecture are discussed in Section 3.7.3 and Section 3.7.4. Despite differences in hardware implementation technique, the overall benefit of hardware prefetching to software is similar between Intel Core microarchitecture and prior microarchitectures.
9.3 PREFETCH AND CACHEABILITY INSTRUCTIONS
The PREFETCH instruction, inserted by programmers or compilers, accesses a minimum of two cache lines of data on the Pentium 4 processor prior to the data actually being needed (one cache line of data on the Pentium M processor). This hides the latency for data access in the time required to process data already resident in the cache.
Many algorithms can provide information in advance about the data that is to be required. In cases where memory accesses are in long, regular data patterns, the automatic hardware prefetcher should be favored over software prefetches.
The cacheability control instructions allow you to control the data caching strategy in order to increase cache efficiency and minimize cache pollution.
Data reference patterns can be classified as follows:
• Temporal: data will be used again soon.
• Spatial: data will be used in adjacent locations (for example, on the same cache line).
• Non-temporal: data which is referenced once and not reused in the immediate future (for example, some multimedia data types, such as the vertex buffer in a 3D graphics application).
These data characteristics are used in the discussions that follow.
9.4 PREFETCH
This section discusses the mechanics of the software PREFETCH instructions. In general, software prefetch instructions should be used to supplement the practice of tuning an access pattern to suit the automatic hardware prefetch mechanism.
9.4.1 Software Data Prefetch
The PREFETCH instruction can hide the latency of data access in performance-critical sections of application code by allowing data to be fetched in advance of actual usage. PREFETCH instructions do not change the user-visible semantics of a program, although they may impact program performance. PREFETCH merely provides a hint to the hardware and generally does not generate exceptions or faults.
PREFETCH loads either non-temporal data or temporal data in the specified cache level. This data access type and the cache level are specified as a hint. Depending on the implementation, the instruction fetches 32 or more aligned bytes (including the specified address byte) into the instruction-specified cache levels.
PREFETCH is implementation-specific; applications need to be tuned to each implementation to maximize performance.
NOTE
Using the PREFETCH instruction is recommended only if data does not fit in cache.
PREFETCH provides a hint to the hardware; it does not generate exceptions or faults except for a few special cases (see Section 9.4.3, "Prefetch and Load Instructions"). However, excessive use of PREFETCH instructions may waste memory bandwidth and result in a performance penalty due to resource constraints.
Nevertheless, PREFETCH can lessen the overhead of memory transactions by preventing cache pollution and by using caches and memory efficiently. This is particularly important for applications that share critical system resources, such as the memory bus. See an example in Section 9.7.2.1, "Video Encoder."
PREFETCH is mainly designed to improve application performance by hiding memory latency in the background. If segments of an application access data in a predictable manner (for example, using arrays with known strides), they are good candidates for using PREFETCH to improve performance.
Use the PREFETCH instructions in:
• Predictable memory access patterns
• Time-consuming innermost loops
• Locations where the execution pipeline may stall if data is not available
A sketch of such a candidate loop follows.
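The sketch below shows one such candidate: a predictable large-stride walk that the hardware prefetcher cannot follow (the stride and the prefetch-ahead distance are illustrative assumptions):

#include <xmmintrin.h>

#define STRIDE 1024   /* assumed stride, larger than the hardware trigger distance */

/* Sketch: each iteration touches one int per 1-KByte stride, a pattern
   the hardware prefetcher does not follow; PREFETCHNTA requests the
   element four iterations ahead so the miss overlaps the loop body.
   Prefetching past the end of the array is harmless, since PREFETCH is
   a hint and does not fault. */
long strided_sum(const char *base, long count)
{
    long sum = 0;
    for (long i = 0; i < count; i++) {
        _mm_prefetch(base + (i + 4) * STRIDE, _MM_HINT_NTA);
        sum += *(const int *)(base + i * STRIDE);
    }
    return sum;
}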
9.4.2 Prefetch Instructions - Pentium 4 Processor Implementation
Streaming SIMD Extensions include four PREFETCH instruction variants: one non-temporal and three temporal. They correspond to two types of operations, temporal and non-temporal.
NOTE
At the time of PREFETCH, if the data is already found in a cache level that is closer to the processor than the cache level specified by the instruction, no data movement occurs.
The non-temporal instruction is:
• PREFETCHNTA: fetch the data into the second-level cache, minimizing cache pollution.
The temporal instructions are:
• PREFETCHT0: fetch the data into all cache levels; that is, to the second-level cache for the Pentium 4 processor.
• PREFETCHT1: this instruction is identical to PREFETCHT0.
• PREFETCHT2: this instruction is identical to PREFETCHT0.
9.4.3 Prefetch and Load Instructions
The Pentium 4 processor has a decoupled execution and memory architecture that allows instructions to be executed independently of memory accesses (if there are no data and resource dependencies). Programs or compilers can use dummy load instructions to imitate PREFETCH functionality, but preloading is not completely equivalent to using PREFETCH instructions.
Currently, PREFETCH provides greater performance than preloading because it:
• Has no destination register; it only updates cache lines.
• Does not stall the normal instruction retirement.
• Does not affect the functional behavior of the program.
• Has no cache line split accesses.
• Does not cause exceptions except when the LOCK prefix is used. The LOCK prefix is not a valid prefix for use with PREFETCH.
• Does not complete its own execution if that would cause a fault.
Currently, the advantages of PREFETCH over preloading instructions are processor-specific. This may change in the future.
There are cases where a PREFETCH will not perform the data prefetch. These include:
• PREFETCH causes a DTLB (Data Translation Lookaside Buffer) miss. This applies to Pentium 4 processors with CPUID signature corresponding to family 15, model 0, 1, or 2. PREFETCH resolves DTLB misses and fetches data on Pentium 4 processors with CPUID signature corresponding to family 15, model 3.
• An access to the specified address causes a fault/exception.
• The memory subsystem runs out of request buffers between the first-level cache and the second-level cache.
• PREFETCH targets an uncacheable memory region (for example, USWC and UC).
• The LOCK prefix is used. This causes an invalid opcode exception.
9.5 CACHEABILITY CONTROL
This section covers the mechanics of cacheability control instructions.
9.5.1 The Non-temporal Store Instructions
This section describes the behavior of streaming stores and reiterates some of the information presented in the previous section.
In Streaming SIMD Extensions, the MOVNTPS, MOVNTPD, MOVNTQ, MOVNTDQ, MOVNTI, MASKMOVQ and MASKMOVDQU instructions are streaming, non-temporal stores. With regard to memory characteristics and ordering, they are similar to the Write-Combining (WC) memory type:
• Write combining: Successive writes to the same cache line are combined.
• Write collapsing: Successive writes to the same byte(s) result in only the last write being visible.
• Weakly ordered: No ordering is preserved between WC stores or between WC stores and other loads or stores.
• Uncacheable and not write-allocating: Stored data is written around the cache and will not generate a read-for-ownership bus request for the corresponding cache line.
9.5.1.1 Fencing
Because streaming stores are weakly ordered, a fencing operation is required to ensure that the stored data is flushed from the processor to memory. Failure to use an appropriate fence may result in data being trapped within the processor and will prevent visibility of this data by other processors or system agents.
WC stores require software to ensure coherence of data by performing the fencing operation. See Section 9.5.4, "FENCE Instructions."
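As a minimal sketch of this requirement (function and flag names are hypothetical), a producer that writes a buffer with streaming stores should fence before publishing a ready flag:

#include <xmmintrin.h>

/* dst and src are assumed 16-byte aligned; n is a multiple of 4 floats. */
void produce(float *dst, const float *src, int n, volatile int *ready)
{
    int i;
    for (i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, _mm_load_ps(src + i));  /* weakly-ordered WC stores */
    _mm_sfence();   /* drain WC buffers; stores become globally visible */
    *ready = 1;     /* publish only after the fence */
}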
9.5.1.2 Streaming Non-temporal Stores
Streaming stores can improve performance by:
• Increasing store bandwidth if the 64 bytes that fit within a cache line are written consecutively (since they do not require read-for-ownership bus requests and 64 bytes are combined into a single bus write transaction).
• Reducing disturbance of frequently used cached (temporal) data (since they write around the processor caches).
Streaming stores allow cross-aliasing of memory types for a given memory region. For instance, a region may be mapped as write-back (WB) using page attribute tables (PAT) or memory type range registers (MTRRs) and yet be written using a streaming store.
9.5.1.3 Memory Type and Non-temporal Stores
Memory type can take precedence over a non-temporal hint, leading to the following considerations:
• If the programmer specifies a non-temporal store to strongly-ordered uncacheable memory (for example, Uncacheable (UC) or Write-Protect (WP) memory types), then the store behaves like an uncacheable store. The non-temporal hint is ignored and the memory type for the region is retained.
• If the programmer specifies the weakly-ordered uncacheable memory type of Write-Combining (WC), then the non-temporal store and the region have the same semantics and there is no conflict.
• If the programmer specifies a non-temporal store to cacheable memory (for example, Write-Back (WB) or Write-Through (WT) memory types), two cases may result:
  - CASE 1: If the data is present in the cache hierarchy, the instruction will ensure consistency. A particular processor may choose different ways to implement this. The following approaches are probable: (a) updating data in place in the cache hierarchy while preserving the memory type semantics assigned to that region, or (b) evicting the data from the caches and writing the new non-temporal data to memory (with WC semantics).
  The approaches (separate or combined) can be different for future processors. Pentium 4, Intel Core Solo and Intel Core Duo processors implement the latter policy (of evicting data from all processor caches). The Pentium M processor implements a combination of both approaches.
  If the streaming store hits a line that is present in the first-level cache, the store data is combined in place within the first-level cache. If the streaming store hits a line present in the second-level cache, the line and stored data are flushed from the second-level cache to system memory.
  - CASE 2: If the data is not present in the cache hierarchy and the destination region is mapped as WB or WT, the transaction will be weakly ordered and is subject to all WC memory semantics. This non-temporal store will not write-allocate. Different implementations may choose to collapse and combine such stores.
9.5.1.4 Write-Combining
Generally, WC semantics require software to ensure coherence with respect to other processors and other system agents (such as graphics cards). Appropriate use of synchronization and a fencing operation must be performed for producer-consumer usage models (see Section 9.5.4, "FENCE Instructions"). Fencing ensures that all system agents have global visibility of the stored data. For instance, failure to fence may result in a written cache line staying within a processor, and the line would not be visible to other agents.
For processors which implement non-temporal stores by updating data in place when it already resides in the cache hierarchy, the destination region should also be mapped as WC. Otherwise, if mapped as WB or WT, there is a potential for speculative processor reads to bring the data into the caches. In such a case, non-temporal stores would then update in place and the data would not be flushed from the processor by a subsequent fencing operation.
The memory type visible on the bus in the presence of memory type aliasing is implementation-specific. As one example, the memory type written to the bus may reflect the memory type for the first store to the line, as seen in program order. Other alternatives are possible. This behavior should be considered reserved, and dependence on the behavior of any particular implementation risks future incompatibility.
9.5.2 Streaming Store Usage Models
The two primary usage domains for streaming stores are coherent requests and non-coherent requests.
9.5.2.1 Coherent Requests
Coherent requests are normal loads and stores to system memory, which may also hit cache lines present in another processor in a multiprocessor environment. With coherent requests, a streaming store can be used in the same way as a regular store that has been mapped with a WC memory type (PAT or MTRR). An SFENCE instruction must be used within a producer-consumer usage model in order to ensure coherency and visibility of data between processors.
Within a single-processor system, the CPU can also re-read the same memory location and be assured of coherence (that is, a single, consistent view of this memory location). The same is true for a multiprocessor (MP) system, assuming an accepted MP software producer-consumer synchronization policy is employed.
9.5.2.2 Non-coherent Requests
Non-coherent requests arise when an I/O device, such as an AGP graphics card, reads or writes system memory using non-coherent requests, which are not reflected on the processor bus and thus will not query the processor's caches. An SFENCE instruction must be used within a producer-consumer usage model in order to ensure coherency and visibility of data between processors. In this case, if the processor is writing data to the I/O device, a streaming store can be used with a processor with any behavior of Case 1 (Section 9.5.1.3) only if the region has also been mapped with a WC memory type (PAT, MTRR).
NOTE
Failure to map the region as WC may allow the line to be speculatively read into the processor caches (via the wrong path of a mispredicted branch).
If the region is not mapped as WC, the streaming store might update in place in the cache, and a subsequent SFENCE would not result in the data being written to system memory. A read of this memory location by a non-coherent I/O device would then return incorrect/out-of-date results. Explicitly mapping the region as WC avoids this by ensuring that any data read from this region is not placed in the processor's caches.
For a processor which solely implements Case 2 (Section 9.5.1.3), a streaming store can be used in this non-coherent domain without requiring the memory region to also be mapped as WC, since any cached data will be flushed to memory by the streaming store.
9.5.3 Streaming Store Instruction Descriptions
MOVNTQ/MOVNTDQ (non-temporal store of packed integer in an MMX technology or Streaming SIMD Extensions register) store data from a register to memory. They are implicitly weakly-ordered, do not write-allocate, and so minimize cache pollution.
MOVNTPS (non-temporal store of packed single precision floating point) is similar to MOVNTQ. It stores data from a Streaming SIMD Extensions register to memory in 16-byte granularity. Unlike MOVNTQ, the memory address must be aligned to a 16-byte boundary or a general protection exception will occur. The instruction is implicitly weakly-ordered, does not write-allocate, and thus minimizes cache pollution.
MASKMOVQ/MASKMOVDQU (non-temporal byte mask store of packed integer in an MMX technology or Streaming SIMD Extensions register) store data from a register to the location specified by the EDI register. The most significant bit in each byte of the second mask register is used to selectively write the data of the first register on a per-byte basis. The instructions are implicitly weakly-ordered (that is, successive stores may not write memory in original program order), do not write-allocate, and thus minimize cache pollution.
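A minimal sketch of a masked, non-temporal store using the corresponding SSE2 intrinsic (the function name is hypothetical; constructing the mask is up to the caller):

#include <emmintrin.h>

/* Write only the bytes of data whose mask byte has its most significant
   bit set; the other destination bytes are left untouched. */
void masked_store(char *dst, __m128i data, __m128i mask)
{
    _mm_maskmoveu_si128(data, mask, dst);   /* MASKMOVDQU */
}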
9.5.4 FENCE Instructions
The following fence instructions are available: SFENCE, LFENCE, and MFENCE.
9.5.4.1 SFENCE Instruction
The SFENCE (STORE FENCE) instruction makes it possible for every STORE instruction that precedes an SFENCE in program order to be globally visible before any STORE that follows the SFENCE. SFENCE provides an efficient way of ensuring ordering between routines that produce weakly-ordered results.
The use of weakly-ordered memory types can be important under certain data sharing relationships (such as a producer-consumer relationship). Using weakly-ordered memory can make assembling the data more efficient, but care must be taken to ensure that the consumer obtains the data that the producer intended it to see.
Some common usage models may be affected by weakly-ordered stores. Examples are:
• Library functions, which use weakly-ordered memory to write results
• Compiler-generated code, which also benefits from writing weakly-ordered results
• Hand-crafted code
The degree to which a consumer of data knows that the data is weakly-ordered can vary for different cases. As a result, SFENCE should be used to ensure ordering between routines that produce weakly-ordered data and routines that consume this data.
9.5.4.2 LFENCE Instruction
The LFENCE (LOAD FENCE) instruction makes it possible for every LOAD instruction that precedes the LFENCE instruction in program order to be globally visible before any LOAD instruction that follows the LFENCE.
The LFENCE instruction provides a means of segregating LOAD instructions from other LOADs.
9.5.4.3 MFENCE Instruction
The MFENCE (MEMORY FENCE) instruction makes it possible for every LOAD/STORE instruction preceding MFENCE in program order to be globally visible before any LOAD/STORE following MFENCE. MFENCE provides a means of segregating certain memory instructions from other memory references.
The use of an LFENCE and an SFENCE is not equivalent to the use of an MFENCE, since the load and store fences are not ordered with respect to each other. In other words, the load fence can be executed before prior stores and the store fence can be executed before prior loads.
MFENCE should be used whenever the cache line flush instruction (CLFLUSH) is used to ensure that speculative memory references generated by the processor do not interfere with the flush. See Section 9.5.5, "CLFLUSH Instruction."
9.5.5 CLFLUSH Instruction
The CLFLUSH instruction invalidates the cache line associated with the linear address that contains the byte address of the memory location, in all levels of the processor cache hierarchy (data and instruction). This invalidation is broadcast throughout the coherence domain. If, at any level of the cache hierarchy, a line is inconsistent with memory (dirty), it is written to memory before invalidation. Other characteristics include:
• The data size affected is the cache coherency size, which is 64 bytes on the Pentium 4 processor.
• The memory attribute of the page containing the affected line has no effect on the behavior of this instruction.
• The CLFLUSH instruction can be used at all privilege levels and is subject to all permission checking and faults associated with a byte load.
CLFLUSH is an unordered operation with respect to other memory traffic, including other CLFLUSH instructions. Software should use a memory fence for cases where ordering is a concern.
As an example, consider a video usage model where a video capture device is using non-coherent AGP accesses to write a capture stream directly to system memory. Since these non-coherent writes are not broadcast on the processor bus, they will not flush copies of the same locations that reside in the processor caches. As a result, before the processor re-reads the capture buffer, it should use CLFLUSH to ensure that stale copies of the capture buffer are flushed from the processor caches. Due to speculative reads that may be generated by the processor, it is important to observe appropriate fencing (using MFENCE).
Example 9-1 provides pseudo-code for CLFLUSH usage.
9.6 MEMORY OPTIMIZATION USING PREFETCH
The Pentium 4 processor has two mechanisms for data prefetch: software-controlled prefetch and automatic hardware prefetch.
Example 9-1. Pseudo-code Using CLFLUSH
while (!buffer_ready) {}
mfence
for (i = 0; i < num_cachelines; i += cacheline_size) {
    clflush (char *)((unsigned int)buffer + i)
}
mfence
prefetchnta buffer[0];
VAR = buffer[0];
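A compilable C rendering of the sequence in Example 9-1 is sketched below (buffer layout and names are hypothetical; the pseudo-code above is the authoritative sequence):

#include <emmintrin.h>

void flush_capture_buffer(volatile int *buffer_ready, char *buffer,
                          int buffer_bytes, int cacheline_size)
{
    int i;
    while (!*buffer_ready) { }     /* wait until the device signals completion */
    _mm_mfence();                  /* order the flag read against the flushes */
    for (i = 0; i < buffer_bytes; i += cacheline_size)
        _mm_clflush(buffer + i);   /* evict stale copies of each line */
    _mm_mfence();                  /* complete flushes before re-reading */
    _mm_prefetch(buffer, _MM_HINT_NTA);
}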
9.6.1 Software-Controlled Prefetch
The software-controlled prefetch is enabled using the four PREFETCH instructions introduced with Streaming SIMD Extensions instructions. These instructions are hints to bring a cache line of data into various levels and modes in the cache hierarchy. The software-controlled prefetch is not intended for prefetching code. Using it can incur significant penalties on a multiprocessor system when code is shared.
Software prefetching has the following characteristics:
• Can handle irregular access patterns which do not trigger the hardware prefetcher.
• Can use less bus bandwidth than hardware prefetching; see below.
• Software prefetches must be added to new code, and do not benefit existing applications.
9.6.2 Hardware Prefetch
Automatic hardware prefetch can bring cache lines into the unified last-level cache based on prior data misses. It will attempt to prefetch two cache lines ahead of the prefetch stream. Characteristics of the hardware prefetcher are:
• It requires some regularity in the data access patterns.
  - If a data access pattern has constant stride, hardware prefetching is effective if the access stride is less than half of the trigger distance of the hardware prefetcher (see Table 2-6).
  - If the access stride is not constant, the automatic hardware prefetcher can mask memory latency if the strides of two successive cache misses are less than the trigger threshold distance (small-stride memory traffic).
  - The automatic hardware prefetcher is most effective if the strides of two successive cache misses remain less than the trigger threshold distance and close to 64 bytes.
• There is a start-up penalty before the prefetcher triggers, and there may be fetches after an array finishes. For short arrays, overhead can reduce effectiveness.
  - The hardware prefetcher requires a couple of misses before it starts operating.
  - Hardware prefetching generates a request for data beyond the end of an array, which is not utilized. This behavior wastes bus bandwidth. In addition, this behavior results in a start-up penalty when fetching the beginning of the next array. Software prefetching may recognize and handle these cases.
• It will not prefetch across a 4-KByte page boundary. A program has to initiate demand loads for the new page before the hardware prefetcher starts prefetching from the new page.
• The hardware prefetcher may consume extra system bandwidth if the application's memory traffic has significant portions with strides of cache misses greater than the trigger distance threshold of hardware prefetch (large-stride memory traffic).
• The effectiveness with existing applications depends on the proportions of small-stride versus large-stride accesses in the application's memory traffic. An application with a preponderance of small-stride memory traffic with good temporal locality will benefit greatly from the automatic hardware prefetcher.
• In some situations, memory traffic consisting of a preponderance of large-stride cache misses can be transformed by re-arrangement of data access sequences to alter the concentration of small-stride cache misses at the expense of large-stride cache misses, to take advantage of the automatic hardware prefetcher.
9.6.3 Example of Effective Latency Reduction with Hardware Prefetch
Consider the situation in which an array is populated with data corresponding to a constant-access-stride, circular pointer chasing sequence (see Example 9-2). The potential of employing the automatic hardware prefetching mechanism to reduce the effective latency of fetching a cache line from memory can be illustrated by varying the access stride between 64 bytes and the trigger threshold distance of hardware prefetch when populating the array for circular pointer chasing.
The effective latency reduction for several microarchitecture implementations is shown in Figure 9-1.
Example 9-2. Populating an Array for Circular Pointer Chasing with Constant Stride
register char **p;
char *next;              // Populating pArray for circular pointer
                         // chasing with constant access stride
                         // p = (char **) *p; loads a value pointing to next load
p = (char **)&pArray;
for (i = 0; i < aperture; i += stride) {
    p = (char **)&pArray[i];
    if (i + stride >= g_array_aperture) {
        next = &pArray[0];
    }
    else {
        next = &pArray[i + stride];
    }
    *p = next;           // populate the address of the next node
}
For a constant-stride access pattern, the benefit of the automatic hardware prefetcher begins at half the trigger threshold distance and reaches maximum benefit when the cache-miss stride is 64 bytes.
9.6.4 Example of Latency Hiding with S/W Prefetch Instruction
Achieving the highest level of memory optimization using PREFETCH instructions requires an understanding of the architecture of a given machine. This section translates the key architectural implications into several simple guidelines for programmers to use.
Figure 9-2 and Figure 9-3 show two scenarios of a simplified 3D geometry pipeline as an example. A 3D-geometry pipeline typically fetches one vertex record at a time and then performs transformation and lighting functions on it. Both figures show two separate pipelines: an execution pipeline and a memory pipeline (front-side bus).
Since the Pentium 4 processor (similar to the Pentium II and Pentium III processors) completely decouples the functionality of execution and memory access, the two pipelines can function concurrently. Figure 9-2 shows "bubbles" in both the execution and memory pipelines. When loads are issued for accessing vertex data, the execution units sit idle and wait until data is returned. On the other hand, the memory bus sits idle while the execution units are processing vertices. This scenario severely decreases the advantage of having a decoupled architecture.
Figure 9-1. Effective Latency Reduction as a Function of Access Stride
[Graph: upper bound of pointer-chasing latency reduction, plotted as effective latency reduction (0% to 120%) versus access stride (64 to 240 bytes), with curves for CPUID family 15 models 3, 4; family 15 models 0, 1, 2; family 6 model 13; family 6 model 14; and family 15 model 6.]
The performance loss caused by poor utilization of resources can be completely eliminated by correctly scheduling the PREFETCH instructions. As shown in Figure 9-3, prefetch instructions are issued two vertex iterations ahead. This assumes that only one vertex gets processed in one iteration and a new data cache line is needed for each iteration.
Figure 9-2. Memory Access Latency and Execution Without Prefetch
[Timeline: the execution pipeline issues loads for vertex data and then sits idle for the memory latency of each vertex (n, n+1, ...); the front-side bus is idle while the execution units process each vertex.]
Figure 9-3. Memory Access Latency and Execution With Prefetch
[Timeline: a prefetch for vertex n is issued two iterations early; while vertex n-2 executes, the front-side bus overlaps the memory latency for vertices n, n+1 and n+2.]
As a result, when iteration n, vertex Vn, is being processed, the requested data is already brought into the cache. In the meantime, the front-side bus is transferring the data needed for iteration n+1, vertex Vn+1. Because there is no dependence between Vn+1 data and the execution of Vn, the latency for data access of Vn+1 can be entirely hidden behind the execution of Vn. Under such circumstances, no bubbles are present in the pipelines and thus the best possible performance can be achieved.
Prefetching is useful for inner loops that have heavy computations, or are close to the boundary between being compute-bound and memory-bandwidth-bound. It is probably not very useful for loops which are predominately memory-bandwidth-bound.
When data is already located in the first-level cache, prefetching can be useless and could even slow down the performance because the extra µops either back up waiting for outstanding memory accesses or may be dropped altogether. This behavior is platform-specific and may change in the future.
9.6.5 Software Prefetching Usage Checklist
The following checklist covers issues that need to be addressed and/or resolved to use the software PREFETCH instruction properly:
• Determine software prefetch scheduling distance.
• Use software prefetch concatenation.
• Minimize the number of software prefetches.
• Mix software prefetch with computation instructions.
• Use cache blocking techniques (for example, strip mining).
• Balance single-pass versus multi-pass execution.
• Resolve memory bank conflict issues.
• Resolve cache management issues.
Subsequent sections discuss the above items.
9.6.6 Software Prefetch Scheduling Distance
Determining the ideal prefetch placement in the code depends on many architectural parameters, including: the amount of memory to be prefetched, cache lookup latency, system memory latency, and an estimate of the computation cycles. The ideal distance for prefetching data is processor- and platform-dependent. If the distance is too short, the prefetch will not hide the latency of the fetch behind computation. If the prefetch is too far ahead, prefetched data may be flushed out of the cache by the time it is required.
Since prefetch distance is not a well-defined metric, for this discussion, we define a new term, prefetch scheduling distance (PSD), which is represented by the number of iterations. For large loops, prefetch scheduling distance can be set to 1 (that is,
schedule prefetch instructions one iteration ahead). For small loop bodies (that is, loop iterations with little computation), the prefetch scheduling distance must be more than one iteration.
A simplified equation to compute PSD is deduced from the mathematical model. For a simplified equation, complete mathematical model, and methodology of prefetch distance determination, see Appendix E, "Summary of Rules and Suggestions."
Example 9-3 illustrates the use of a prefetch within the loop body. The prefetch scheduling distance is set to 3; ESI is effectively the pointer to a line, EDX is the address of the data being referenced, and XMM1-XMM4 are the data used in computation. Example 9-4 uses two independent cache lines of data per iteration. The PSD would need to be increased/decreased if more/fewer than two cache lines are used per iteration.
9.6.7 Software Prefetch Concatenation
Maximum performance can be achieved when the execution pipeline is at maximum throughput, without incurring any memory latency penalties. This can be achieved by prefetching data to be used in successive iterations in a loop. De-pipelining memory generates bubbles in the execution pipeline.
To explain this performance issue, a 3D geometry pipeline that processes 3D vertices in strip format is used as an example. A strip contains a list of vertices whose predefined vertex order forms contiguous triangles. It can be easily observed that the memory pipe is de-pipelined on the strip boundary due to ineffective prefetch arrangement. The execution pipeline is stalled for the first two iterations for each strip. As a result, the average latency for completing an iteration will be 165 clocks. See Appendix E, "Summary of Rules and Suggestions," for a detailed description.
Example 9-3. Prefetch Scheduling Distance
top_loop:
    prefetchnta [edx + esi + 128*3]
    prefetchnta [edx*4 + esi + 128*3]
    . . . . .
    movaps xmm1, [edx + esi]
    movaps xmm2, [edx*4 + esi]
    movaps xmm3, [edx + esi + 16]
    movaps xmm4, [edx*4 + esi + 16]
    . . . . .
    . . . . .
    add esi, 128
    cmp esi, ecx
    jl top_loop
This memory de-pipelining creates inefficiency in both the memory pipeline and the execution pipeline. This de-pipelining effect can be removed by applying a technique called prefetch concatenation. With this technique, the memory access and execution can be fully pipelined and fully utilized.
For nested loops, memory de-pipelining could occur during the interval between the last iteration of an inner loop and the next iteration of its associated outer loop. Without paying special attention to prefetch insertion, loads from the first iteration of an inner loop can miss the cache and stall the execution pipeline waiting for data to be returned, thus degrading the performance.
In Example 9-4, the cache line containing A[II][0] is not prefetched at all and always misses the cache. This assumes that no array A[][] footprint resides in the cache. The penalty of memory de-pipelining stalls can be amortized across the inner loop iterations. However, it may become very harmful when the inner loop is short. In addition, the last prefetches in the last PSD iterations are wasted and consume machine resources. Prefetch concatenation is introduced here in order to eliminate the performance issue of memory de-pipelining.
Prefetch concatenation can bridge the execution pipeline bubbles between the boundary of an inner loop and its associated outer loop. Simply by unrolling the last iteration out of the inner loop and specifying the effective prefetch address for data used in the following iteration, the performance loss of memory de-pipelining can be completely removed. Example 9-5 gives the rewritten code.
This code segment for data prefetching is improved and only the first iteration of the outer loop suffers any memory access latency penalty, assuming the computation
Example 9-4. Using Prefetch Concatenation
for (ii = 0; ii < 100; ii++) {
    for (jj = 0; jj < 32; jj+=8) {
        prefetch a[ii][jj+8]
        computation a[ii][jj]
    }
}
Example 9-5. Concatenation and Unrolling the Last Iteration of Inner Loop
for (ii = 0; ii < 100; ii++) {
    for (jj = 0; jj < 24; jj+=8) {   /* N-1 iterations */
        prefetch a[ii][jj+8]
        computation a[ii][jj]
    }
    prefetch a[ii+1][0]
    computation a[ii][jj]            /* Last iteration */
}
time is larger than the memory latency. Inserting a prefetch of the first data element needed prior to entering the nested loop computation would eliminate or reduce the start-up penalty for the very first iteration of the outer loop. This uncomplicated high-level code optimization can improve memory performance significantly.
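The following is a compilable C sketch of the same concatenation idea (the array shape and the do_computation() hook are hypothetical stand-ins for the pseudo-code's "computation a[ii][jj]"):

#include <xmmintrin.h>

#define ROWS 100
#define COLS 32

extern void do_computation(double *elem);   /* hypothetical per-element work */

void concatenated_loops(double a[ROWS][COLS])
{
    int ii, jj;
    for (ii = 0; ii < ROWS; ii++) {
        for (jj = 0; jj < COLS - 8; jj += 8) {   /* N-1 inner iterations */
            _mm_prefetch((char *)&a[ii][jj + 8], _MM_HINT_NTA);
            do_computation(&a[ii][jj]);
        }
        /* last iteration: prefetch across the outer-loop boundary */
        if (ii + 1 < ROWS)
            _mm_prefetch((char *)&a[ii + 1][0], _MM_HINT_NTA);
        do_computation(&a[ii][jj]);
    }
}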
9.6.8 Minimize Number of Software Prefetches
Prefetch instructions are not completely free in terms of bus cycles, machine cycles and resources, even though they require minimal clock and memory bandwidth.
Excessive prefetching may lead to performance penalties because of issue penalties in the front end of the machine and/or resource contention in the memory sub-system. This effect may be severe in cases where the target loops are small and/or cases where the target loop is issue-bound.
One approach to solve the excessive prefetching issue is to unroll and/or software-pipeline loops to reduce the number of prefetches required. Figure 9-4 presents a code example which implements prefetch and unrolls the loop to remove the redundant prefetch instructions whose prefetch addresses hit the previously issued prefetch instructions. In this particular example, unrolling the original loop once saves six prefetch instructions and nine instructions for conditional jumps in every other iteration.
Figure 9-4. Prefetch and Loop Unrolling
Original loop:
top_loop:
    prefetchnta [edx+esi+32]
    prefetchnta [edx*4+esi+32]
    . . . . .
    movaps xmm1, [edx+esi]
    movaps xmm2, [edx*4+esi]
    . . . . .
    add esi, 16
    cmp esi, ecx
    jl top_loop

Unrolled iteration:
top_loop:
    prefetchnta [edx+esi+128]
    prefetchnta [edx*4+esi+128]
    . . . . .
    movaps xmm1, [edx+esi]
    movaps xmm2, [edx*4+esi]
    . . . . .
    movaps xmm1, [edx+esi+16]
    movaps xmm2, [edx*4+esi+16]
    . . . . .
    movaps xmm1, [edx+esi+96]
    movaps xmm2, [edx*4+esi+96]
    . . . . .
    . . . . .
    add esi, 128
    cmp esi, ecx
    jl top_loop
Figure 9-5 demonstrates the effectiveness of software prefetches in latency hiding. The X axis in Figure 9-5 indicates the number of computation clocks per loop (each iteration is independent). The Y axis indicates the execution time measured in clocks per loop. The secondary Y axis indicates the percentage of bus bandwidth utilization. The tests vary by the following parameters:
• Number of load/store streams: Each load and store stream accesses one 128-byte cache line each per iteration.
• Amount of computation per loop: This is varied by increasing the number of dependent arithmetic operations executed.
• Number of software prefetches per loop: For example, one every 16 bytes, 32 bytes, 64 bytes, 128 bytes.
As expected, the leftmost portion of each of the graphs in Figure 9-5 shows that when there is not enough computation to overlap the latency of memory access, prefetch does not help and the execution is essentially memory-bound. The graphs also illustrate that redundant prefetches do not increase performance.
9.6.9 Mix Software Prefetch with Computation Instructions
It may seem convenient to cluster all of the PREFETCH instructions at the beginning of a loop body or before a loop, but this can lead to severe performance degradation. In order to achieve the best possible performance, PREFETCH instructions must be interspersed with other computational instructions in the instruction sequence rather than clustered together. If possible, they should also be placed apart from loads. This improves the instruction level parallelism and reduces the potential instruction resource stalls.
Figure 9-5. Memory Access Latency and Execution With Prefetch
[Graph: execution time in clocks per loop and bus bandwidth utilization versus computation clocks per loop, for varying numbers of load/store streams and software prefetch frequencies.]
In addition, this mixing reduces the pressure on the memory access resources and in turn reduces the possibility of the prefetch retiring without fetching data.
Figure 9-6 illustrates distributing PREFETCH instructions. A simple and useful heuristic of prefetch spreading for a Pentium 4 processor is to insert a PREFETCH instruction every 20 to 25 clocks. Rearranging PREFETCH instructions could yield a noticeable speedup for code that stresses the cache resource.
NOTE
To avoid instruction execution stalls due to the over-utilization of the resource, PREFETCH instructions must be interspersed with computational instructions.
9.6.10 Software Prefetch and Cache Blocking Techniques
Cache blocking techniques (such as strip-mining) are used to improve temporal locality and the cache hit rate. Strip-mining is a one-dimensional temporal locality optimization for memory. When two-dimensional arrays are used in programs, the loop blocking technique (similar to strip-mining but in two dimensions) can be applied for better memory performance.
Figure 9-6. Spread Prefetch Instructions
Clustered prefetches:
top_loop:
    prefetchnta [ebx+128]
    prefetchnta [ebx+1128]
    prefetchnta [ebx+2128]
    prefetchnta [ebx+3128]
    . . . .
    prefetchnta [ebx+17128]
    prefetchnta [ebx+18128]
    prefetchnta [ebx+19128]
    prefetchnta [ebx+20128]
    movps xmm1, [ebx]
    addps xmm2, [ebx+3000]
    mulps xmm3, [ebx+4000]
    addps xmm1, [ebx+1000]
    addps xmm2, [ebx+3016]
    mulps xmm1, [ebx+2000]
    mulps xmm1, xmm2
    . . . . . . . .
    add ebx, 128
    cmp ebx, ecx
    jl top_loop

Spread prefetches:
top_loop:
    prefetchnta [ebx+128]
    movps xmm1, [ebx]
    addps xmm2, [ebx+3000]
    mulps xmm3, [ebx+4000]
    prefetchnta [ebx+1128]
    addps xmm1, [ebx+1000]
    addps xmm2, [ebx+3016]
    prefetchnta [ebx+2128]
    mulps xmm1, [ebx+2000]
    mulps xmm1, xmm2
    prefetchnta [ebx+3128]
    . . . . . . .
    prefetchnta [ebx+18128]
    . . . . . .
    prefetchnta [ebx+19128]
    . . . . . .
    prefetchnta [ebx+20128]
    add ebx, 128
    cmp ebx, ecx
    jl top_loop
If an application uses a large data set that can be reused across multiple passes of a loop, it will benefit from strip mining. Data sets larger than the cache will be processed in groups small enough to fit into cache. This allows temporal data to reside in the cache longer, reducing bus traffic.
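A minimal strip-mining sketch (the strip size and the pass1()/pass2() hooks are hypothetical): instead of making two full passes over a large array, both passes are made over one cache-sized strip at a time, so the second pass finds its data still cached:

#define N      (1 << 20)
#define STRIP  (1 << 13)   /* tune so one strip fits in the second-level cache */

extern void pass1(float *chunk, int n);   /* hypothetical first pass  */
extern void pass2(float *chunk, int n);   /* hypothetical second pass */

void process_strip_mined(float *data)     /* data has N elements */
{
    int s;
    for (s = 0; s < N; s += STRIP) {
        pass1(data + s, STRIP);
        pass2(data + s, STRIP);   /* strip is still resident from pass1 */
    }
}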
Data set size and temporal locality (data characteristics) fundamentally affect how PREFETCH instructions are applied to strip-mined code. Figure 9-7 shows two simplified scenarios for temporally-adjacent data and temporally-non-adjacent data.
In the temporally-adjacent scenario, subsequent passes use the same data and find it already in second-level cache. Prefetch issues aside, this is the preferred situation. In the temporally non-adjacent scenario, data used in pass m is displaced by pass (m+1), requiring data re-fetch into the first-level cache and perhaps the second-level cache if a later pass reuses the data. If both data sets fit into the second-level cache, load operations in passes 3 and 4 become less expensive.
Figure 9-8 shows how prefetch instructions and strip-mining can be applied to increase performance in both of these scenarios.
Figure 9-7. Cache Blocking - Temporally Adjacent and Non-adjacent Passes
[Diagram: four passes over Datasets A and B; in the temporally-adjacent case consecutive passes reuse the same dataset, while in the temporally non-adjacent case reuse of a dataset is separated by a pass over the other dataset.]
For Pentium 4 processors, the left scenario shows a graphical implementation of using PREFETCHNTA to prefetch data into selected ways of the second-level cache only (SM1 denotes strip-mining one way of the second-level cache), minimizing second-level cache pollution. Use PREFETCHNTA if the data is only touched once during the entire execution pass, in order to minimize cache pollution in the higher level caches. This provides instant availability when the read access is issued, assuming the prefetch was issued far enough ahead.
In the scenario to the right (see Figure 9-8), keeping the data in one way of the second-level cache does not improve cache locality. Therefore, use PREFETCHT0 to prefetch the data. This amortizes the latency of the memory references in passes 1 and 2, and keeps a copy of the data in the second-level cache, which reduces memory traffic and latencies for passes 3 and 4. To further reduce the latency, it might be worth considering extra PREFETCHNTA instructions prior to the memory references in passes 3 and 4.
In Example 9-6, consider the data access patterns of a 3D geometry engine first without strip-mining and then incorporating strip-mining. Note that the 4-wide SIMD instructions of the Pentium III processor can process 4 vertices per every iteration.
Without strip-mining, all the x, y, z coordinates for the four vertices must be re-fetched from memory in the second pass, that is, the lighting loop. This causes under-utilization of cache lines fetched during the transformation loop as well as bandwidth wasted in the lighting loop.
Figure 9-8. Examples of Prefetch and Strip-mining for Temporally Adjacent and Non-Adjacent Passes Loops
[Diagram: for temporally-adjacent passes, PREFETCHNTA brings each dataset into one way of the second-level cache (SM1) for reuse by the following pass; for temporally non-adjacent passes, PREFETCHT0 keeps Datasets A and B in the second-level cache (SM2) for reuse in later passes.]
Now consider the code in Example 9-7, where strip-mining has been incorporated into the loops.
Example 9-6. Data Access of a 3D Geometry Engine without Strip-mining
while (nvtx < MAX_NUM_VTX) {
    prefetchnta vertexi data     // v =[x,y,z,nx,ny,nz,tu,tv]
    prefetchnta vertexi+1 data
    prefetchnta vertexi+2 data
    prefetchnta vertexi+3 data
    TRANSFORMATION code          // use only x,y,z,tu,tv of a vertex
    nvtx+=4
}
while (nvtx < MAX_NUM_VTX) {
    prefetchnta vertexi data     // v =[x,y,z,nx,ny,nz,tu,tv]
                                 // x,y,z fetched again
    prefetchnta vertexi+1 data
    prefetchnta vertexi+2 data
    prefetchnta vertexi+3 data
    compute the light vectors    // use only x,y,z
    LOCAL LIGHTING code          // use only nx,ny,nz
    nvtx+=4
}
Example 9-7. Data Access of a 3D Geometry Engine with Strip-mining
while (nstrip < NUM_STRIP) {
    /* Strip-mine the loop to fit data into one way of the second-level cache */
    while (nvtx < MAX_NUM_VTX_PER_STRIP) {
        prefetchnta vertexi data     // v=[x,y,z,nx,ny,nz,tu,tv]
        prefetchnta vertexi+1 data
        prefetchnta vertexi+2 data
        prefetchnta vertexi+3 data
        TRANSFORMATION code
        nvtx+=4
    }
    while (nvtx < MAX_NUM_VTX_PER_STRIP) {
        /* x y z coordinates are in the second-level cache, no prefetch is required */
        compute the light vectors
        POINT LIGHTING code
        nvtx+=4
    }
}
With strip-mining, all vertex data can be kept in the cache (for example, one way of the second-level cache) during the strip-mined transformation loop and reused in the lighting loop. Keeping data in the cache reduces both bus traffic and the number of prefetches used.
Table 9-1 summarizes the steps of the basic usage model that incorporates only software prefetch with strip-mining. The steps are:
• Do strip-mining: partition loops so that the dataset fits into second-level cache.
• Use PREFETCHNTA if the data is only used once or the dataset fits into 32 KBytes (one way of second-level cache). Use PREFETCHT0 if the dataset exceeds 32 KBytes.
The above steps are platform-specific and provide an implementation example. The variables NUM_STRIP and MAX_NUM_VTX_PER_STRIP can be heuristically determined for peak performance for a specific application on a specific platform.
9.6.11 Hardware Prefetching and Cache Blocking Techniques
Tuning data access patterns for the automatic hardware prefetch mechanism can minimize the memory access costs of the first pass of the read-multiple-times and some of the read-once memory references. An example of a read-once memory reference can be illustrated with a matrix or image transpose, reading from a column-first orientation and writing to a row-first orientation, or vice versa.
Example 9-8 shows a nested loop of data movement that represents a typical matrix/image transpose problem. If the dimensions of the array are large, not only will the footprint of the dataset exceed the last-level cache, but cache misses will
Table 9-1. Software Prefetching Considerations into Strip-mining Code
• Read-Once Array References: Prefetchnta. Evict one way; minimize pollution.
• Read-Multiple-Times Array References, Adjacent Passes: Prefetcht0, SM1. Pay memory access cost for the first pass of each array; amortize the first pass with subsequent passes.
• Read-Multiple-Times Array References, Non-Adjacent Passes: Prefetcht0, SM1 (2nd-level pollution). Pay memory access cost for the first pass of every strip; amortize the first pass with subsequent passes.
occur at large strides. If the dimensions happen to be powers of 2, aliasing conditions due to the finite number of ways of associativity (see "Capacity Limits and Aliasing in Caches") will exacerbate the likelihood of cache evictions.
Example 9-8 (b) shows applying the techniques of tiling with optimal selection of tile size and tile width to take advantage of hardware prefetch. With tiling, one can choose the size of two tiles to fit in the last-level cache. Maximizing the width of each tile for memory read references enables the hardware prefetcher to initiate bus requests to read some cache lines before the code actually references the linear addresses.
9.6.12 Single-pass versus Multi-pass Execution
An algorithm can use single- or multi-pass execution, defined as follows:
• Single-pass, or unlayered execution passes a single data element through an entire computation pipeline.
• Multi-pass, or layered execution performs a single stage of the pipeline on a batch of data elements, before passing the batch on to the next stage.
Example 9-8. Using HW Prefetch to Improve Read-Once Memory Traffic
a) Un-optimized image transpose
// dest and src represent two-dimensional arrays
for (i = 0; i < NUMCOLS; i++) {
    // inner loop reads single column
    for (j = 0; j < NUMROWS; j++) {
        // Each read reference causes large-stride cache miss
        dest[i*NUMROWS + j] = src[j*NUMROWS + i];
    }
}
b) Tiled image transpose
// tilewidth = L2SizeInBytes/2/TileHeight/Sizeof(element)
for (i = 0; i < NUMCOLS; i += tilewidth) {
    for (j = 0; j < NUMROWS; j++) {
        // access multiple elements in the same row in the inner loop
        // access pattern friendly to hw prefetch and improves hit rate
        for (k = 0; k < tilewidth; k++)
            dest[j + (i+k)*NUMROWS] = src[i + k + j*NUMROWS];
    }
}
A characteristic feature of both single-pass and multi-pass execution is that a specific trade-off exists depending on an algorithm's implementation and use of a single-pass or multiple-pass execution. See Figure 9-9.
Multi-pass execution is often easier to use when implementing a general purpose API, where the choice of code paths that can be taken depends on the specific combination of features selected by the application (for example, for 3D graphics, this might include the type of vertex primitives used and the number and type of light sources).
With such a broad range of permutations possible, a single-pass approach would be complicated, in terms of code size and validation. In such cases, each possible permutation would require a separate code sequence. For example, an object with features A, B, C, D can have a subset of features enabled, say, A, B, D. This stage would use one code path; another combination of enabled features would have a different code path. It makes more sense to perform each pipeline stage as a separate pass, with conditional clauses to select different features that are implemented within each stage. By using strip-mining, the number of vertices processed by each stage (for example, the batch size) can be selected to ensure that the batch stays within the processor caches through all passes. An intermediate cached buffer is used to pass the batch of vertices from one stage or pass to the next one.
Single-pass execution can be better suited to applications which limit the number of features that may be used at a given time. A single-pass approach can reduce the amount of data copying that can occur with a multi-pass engine. See Figure 9-9.
The choice of single-pass or multi-pass can have a number of performance implications. For instance, in a multi-pass pipeline, stages that are limited by bandwidth (either input or output) will reflect more of this performance limitation in overall execution time. In contrast, for a single-pass approach, bandwidth limitations can be distributed/amortized across other computation-intensive stages. Also, the choice of which prefetch hints to use is impacted by whether a single-pass or multi-pass approach is used.
Figure 9-9. Single-Pass Vs. Multi-Pass 3D Geometry Engines
[Diagram: comparison of a single-pass engine (transform, then lighting, per vertex in the inner loop, with culling between stages) and a multi-pass engine (transform strip list, cull, then light, each applied to a whole batch per pass); the outer loop processes strips, and culling reduces the number of vertices passed to the lighting stage.]
9.7 MEMORY OPTIMIZATION USING NON-TEMPORAL STORES
Non-temporal stores can also be used to manage data retention in the cache. Uses for non-temporal stores include:
• To combine many writes without disturbing the cache hierarchy
• To manage which data structures remain in the cache and which are transient
Detailed implementations of these usage models are covered in the following sections.
9.7.1 Non-temporal Stores and Software Write-Combining
Use non-temporal stores in the cases when the data to be stored is:
• Write-once (non-temporal)
• Too large, and thus causes cache thrashing
Non-temporal stores do not invoke a cache line allocation, which means they are not write-allocate. As a result, caches are not polluted and no dirty writeback is generated to compete with useful data bandwidth. Without using non-temporal stores, bus bandwidth will suffer when caches start to be thrashed because of dirty writebacks.
In the Streaming SIMD Extensions implementation, when non-temporal stores are written into writeback or write-combining memory regions, these stores are weakly-ordered and will be combined internally inside the processor's write-combining buffer and be written out to memory as a line burst transaction. To achieve the best possible performance, it is recommended to align data along the cache line boundary and write them consecutively in a cache line size while using non-temporal stores. If the consecutive writes are prohibitive due to programming constraints, then software write-combining (SWWC) buffers can be used to enable line burst transactions.
You can declare small SWWC buffers (a cache line for each buffer) in your application to enable explicit write-combining operations. Instead of writing to non-temporal memory space immediately, the program writes data into SWWC buffers and combines them inside these buffers. The program only writes a SWWC buffer out using non-temporal stores when the buffer is filled up, that is, a cache line (128 bytes for the Pentium 4 processor). Although the SWWC method requires explicit instructions for performing temporary writes and reads, this ensures that the transaction on the front-side bus causes a line transaction rather than several partial transactions. Application performance gains considerably from implementing this technique. These SWWC buffers can be maintained in the second-level cache and reused throughout the program.
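A minimal SWWC sketch under these assumptions (a 128-byte buffer, matching the Pentium 4 processor line size quoted above; the function and macro names are hypothetical): scattered writes accumulate in a cache-line-sized local buffer, which is then streamed out as one line burst:

#include <emmintrin.h>

#define SWWC_BYTES 128   /* one cache line on the Pentium 4 processor */

/* line must be 16-byte aligned and fully filled by the program;
   dst must be 16-byte aligned. Issue SFENCE before any dependent read. */
void swwc_flush(const char *line, char *dst)
{
    const __m128i *s = (const __m128i *)line;
    int i;
    for (i = 0; i < SWWC_BYTES / 16; i++)
        _mm_stream_si128((__m128i *)dst + i, _mm_load_si128(s + i));
}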
9.7.2 Cache Management
Streaming instructions (PREFETCH and STORE) can be used to manage data and minimize disturbance of temporal data held within the processor's caches.
In addition, the Pentium 4 processor takes advantage of Intel C++ Compiler support for C++ language-level features for the Streaming SIMD Extensions. Streaming SIMD Extensions and MMX technology instructions provide intrinsics that allow you to optimize cache utilization. Examples of such Intel compiler intrinsics are _mm_prefetch, _mm_stream, _mm_load, and _mm_sfence. For details, refer to the Intel C++ Compiler User's Guide documentation.
The following examples of using prefetching instructions in the operation of a video encoder and decoder, as well as in a simple 8-byte memory copy, illustrate the performance gain from using the prefetching instructions for efficient cache management.
9.7.2.1 Video Encoder
In a video encoder, some of the data used during the encoding process is kept in the processor's second-level cache. This is done to minimize the number of reference streams that must be re-read from system memory. To ensure that other writes do not disturb the data in the second-level cache, streaming stores (MOVNTQ) are used to write around all processor caches.
The prefetching cache management implemented for the video encoder reduces the memory traffic. The second-level cache pollution reduction is ensured by preventing single-use video frame data from entering the second-level cache. Using a non-temporal PREFETCH (PREFETCHNTA) instruction brings data into only one way of the second-level cache, thus reducing pollution of the second-level cache.
If the data brought directly to second-level cache is not re-used, then there is a performance gain from the non-temporal prefetch over a temporal prefetch. The encoder uses non-temporal prefetches to avoid pollution of the second-level cache, increasing the number of second-level cache hits and decreasing the number of polluting write-backs to memory. The performance gain results from the more efficient use of the second-level cache, not only from the prefetch itself.
9.7.2.2 Video Decoder
In the video decoder example, completed frame data is written to local memory of the graphics card, which is mapped to WC (Write-Combining) memory type. A copy of reference data is stored to the WB memory at a later time by the processor in order to generate future data. The assumption is that the size of the reference data is too large to fit in the processor's caches. A streaming store is used to write the data around the cache, to avoid displacing other temporal data held in the caches. Later, the processor re-reads the data using PREFETCHNTA, which ensures maximum bandwidth, yet minimizes disturbance of other cached temporal data by using the non-temporal (NTA) version of prefetch.
9.7.2.3 Conclusions from Video Encoder and Decoder Implementation
These two examples indicate that by using an appropriate combination of non-temporal prefetches and non-temporal stores, an application can be designed to lessen the overhead of memory transactions by preventing second-level cache pollution, keeping useful data in the second-level cache, and reducing costly write-back transactions. Even if an application does not gain performance significantly from having data ready from prefetches, it can improve from more efficient use of the second-level cache and memory. Such a design reduces the encoder's demand for critical resources such as the memory bus. This makes the system more balanced, resulting in higher performance.
9.7.2.4 Optimizing Memory Copy Routines
Creating memory copy routines for large amounts of data is a common task in software optimization. Example 9-9 presents a basic algorithm for the simple memory copy.
This task can be optimized using various coding techniques. One technique uses software prefetch and streaming store instructions. It is discussed in the following paragraph, and a code example is shown in Example 9-10.
The memory copy algorithm can be optimized using the Streaming SIMD Extensions with these considerations:
• Alignment of data
• Proper layout of pages in memory
• Cache size
• Interaction of the translation lookaside buffer (TLB) with memory accesses
• Combining prefetch and streaming-store instructions
The guidelines discussed in this chapter come into play in this simple example. TLB priming is required for the Pentium 4 processor just as it is for the Pentium III processor, since software prefetch instructions will not initiate page table walks on either processor.
Example 9-9. Basic Algorithm of a Simple Memory Copy
#define N 512000
double a[N], b[N];
for (i = 0; i < N; i++) {
    b[i] = a[i];
}
9.7.2.5 TLB Priming
The TLB is a fast memory buffer that is used to improve the performance of the translation of a virtual memory address to a physical memory address by providing fast access to page table entries. If memory pages are accessed and the page table entry
Example 9-10. A Memory Copy Routine Using Software Prefetch
#define PAGESIZE 4096
#define NUMPERPAGE 512  // # of elements to fit a page

double a[N], b[N], temp;
for (kk=0; kk<N; kk+=NUMPERPAGE) {
    temp = a[kk+NUMPERPAGE];  // TLB priming
    // use block size = page size,
    // prefetch entire block, one cache line per loop
    for (j=kk+16; j<kk+NUMPERPAGE; j+=16) {
        _mm_prefetch((char*)&a[j], _MM_HINT_NTA);
    }
    // copy 128 bytes per loop
    for (j=kk; j<kk+NUMPERPAGE; j+=16) {
        _mm_stream_ps((float*)&b[j], _mm_load_ps((float*)&a[j]));
        _mm_stream_ps((float*)&b[j+2], _mm_load_ps((float*)&a[j+2]));
        _mm_stream_ps((float*)&b[j+4], _mm_load_ps((float*)&a[j+4]));
        _mm_stream_ps((float*)&b[j+6], _mm_load_ps((float*)&a[j+6]));
        _mm_stream_ps((float*)&b[j+8], _mm_load_ps((float*)&a[j+8]));
        _mm_stream_ps((float*)&b[j+10], _mm_load_ps((float*)&a[j+10]));
        _mm_stream_ps((float*)&b[j+12], _mm_load_ps((float*)&a[j+12]));
        _mm_stream_ps((float*)&b[j+14], _mm_load_ps((float*)&a[j+14]));
    } // finished copying one block
} // finished copying N elements
_mm_sfence();
is not resident in t he TLB, a TLB miss result s and t he page t able must be read from
memory.
The TLB miss result s in a performance degradat ion since anot her memory access
must be performed ( assuming t hat t he t ranslat ion is not already present in t he
processor caches) t o updat e t he TLB. The TLB can be preloaded wit h t he page t able
ent ry for t he next desired page by accessing ( or t ouching) an address in t hat page.
This is similar t o prefet ch, but inst ead of a dat a cache line t he page t able ent ry is
being loaded in advance of it s use. This helps t o ensure t hat t he page t able ent ry is
resident in t he TLB and t hat t he prefet ch happens as request ed subsequent ly.
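As a minimal illustration of the touch technique (next_page is a hypothetical pointer into the upcoming 4-KByte page; this is the same idea as the temp = a[kk+NUMPERPAGE] load in Example 9-10):

/* Touching one byte in the next page loads its page table entry into
   the TLB before the prefetches to that page are issued. */
volatile char *next_page;      // hypothetical: points into the upcoming page
char dummy = *next_page;       // the load primes the TLB entry (and fills a line)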
9.7.2.6 Using the 8-byte Streaming Stores and Software Prefetch
Example 9-10 presents the copy algorithm that uses second-level cache. The algorithm performs the following steps:
1. Uses blocking technique to transfer 8-byte data from memory into second-level cache using the _MM_PREFETCH intrinsic, 128 bytes at a time to fill a block. The size of a block should be less than one half of the size of the second-level cache, but large enough to amortize the cost of the loop.
2. Loads the data into an XMM register using the _MM_LOAD_PS intrinsic.
3. Transfers the 8-byte data to a different memory location via the _MM_STREAM intrinsics, bypassing the cache. For this operation, it is important to ensure that the page table entry prefetched for the memory is preloaded in the TLB.
In Example 9-10, eight _MM_LOAD_PS and _MM_STREAM_PS intrinsics are used so that all of the data prefetched (a 128-byte cache line) is written back. The prefetch and streaming-stores are executed in separate loops to minimize the number of transitions between reading and writing data. This significantly improves the bandwidth of the memory accesses.
The TEMP = A[KK+NUMPERPAGE] instruction is used to ensure that the page table entry for array A is entered in the TLB prior to prefetching. This is essentially a prefetch itself, as a cache line is filled from that memory location with this instruction. Hence, the prefetching starts from KK+16 in this loop.
This example assumes that the destination of the copy is not temporally adjacent to the code. If the copied data is destined to be reused in the near future, then the streaming store instructions should be replaced with regular 128-bit stores (_MM_STORE_PS). This is required because the implementation of streaming stores on the Pentium 4 processor writes data directly to memory, maintaining cache coherency.
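For reference, a hedged sketch of the cached variant: the same inner loop as Example 9-10 with the non-temporal stores swapped for regular stores (only the first store/load pair is shown; _mm_store_ps requires the same 16-byte alignment as in the example):

// If the destination will be reused soon, keep it in the cache hierarchy:
for (j=kk; j<kk+NUMPERPAGE; j+=16) {
_mm_store_ps((float*)&b[j], _mm_load_ps((float*)&a[j]));
// ... remaining seven store/load pairs as in Example 9-10,
// at offsets j+2, j+4, ..., j+14 ...
}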
9.7.2.7 Using 16-byte Streaming Stores and Hardware Prefetch
An alternate technique for optimizing a large-region memory copy is to take advantage of the hardware prefetcher, 16-byte streaming stores, and a segmented approach to separate bus read and write transactions. See Section 3.6.11, "Minimizing Bus Latency."
The technique employs two stages. In the first stage, a block of data is read from memory into the cache sub-system. In the second stage, cached data are written to their destination using streaming stores.
Example 9-11. Memory Copy Using Hardware Prefetch and Bus Segmentation
void block_prefetch(void *dst,void *src)
{ _asm {
mov edi,dst
mov esi,src
mov edx,SIZE
align 16
main_loop:
xor ecx,ecx
align 16
prefetch_loop:
movaps xmm0, [esi+ecx]
movaps xmm0, [esi+ecx+64]
add ecx,128
cmp ecx,BLOCK_SIZE
jne prefetch_loop
xor ecx,ecx
align 16
cpy_loop:
movdqa xmm0,[esi+ecx]
movdqa xmm1,[esi+ecx+16]
movdqa xmm2,[esi+ecx+32]
movdqa xmm3,[esi+ecx+48]
movdqa xmm4,[esi+ecx+64]
movdqa xmm5,[esi+ecx+16+64]
movdqa xmm6,[esi+ecx+32+64]
movdqa xmm7,[esi+ecx+48+64]
movntdq [edi+ecx],xmm0
movntdq [edi+ecx+16],xmm1
movntdq [edi+ecx+32],xmm2
movntdq [edi+ecx+48],xmm3
movntdq [edi+ecx+64],xmm4
movntdq [edi+ecx+80],xmm5
movntdq [edi+ecx+96],xmm6
movntdq [edi+ecx+112],xmm7
add ecx,128
cmp ecx,BLOCK_SIZE
jne cpy_loop
add esi,ecx
add edi,ecx
sub edx,ecx
jnz main_loop
sfence
}
}
9.7.2.8 Performance Comparisons of Memory Copy Routines
The throughput of a large-region memory copy routine depends on several factors:
• Coding techniques that implement the memory copy task
• Characteristics of the system bus (speed, peak bandwidth, overhead in read/write transaction protocols)
• Microarchitecture of the processor
A comparison of the two coding techniques discussed above and two un-optimized techniques is shown in Table 9-2.
Table 9-2. Relative Performance of Memory Copy Routines
Processor, CPUID Signature and FSB Speed | Byte Sequential | DWORD Sequential | SW prefetch + 8-byte streaming store | 4KB-Block HW prefetch + 16-byte streaming stores
Pentium M processor, 0x6Dn, 400 MHz | 1.3X | 1.2X | 1.6X | 2.5X
Intel Core Solo and Intel Core Duo processors, 0x6En, 667 MHz | 3.3X | 3.5X | 2.1X | 4.7X
Pentium D processor, 0xF4n, 800 MHz | 3.4X | 3.3X | 4.9X | 5.7X
The baseline for performance comparison is the throughput (bytes/sec) of an 8-MByte region memory copy on a first-generation Pentium M processor (CPUID signature 0x69n) with a 400-MHz system bus using the byte-sequential technique similar to that shown in Example 9-9. The degrees of improvement relative to this performance baseline for some recent processors and platforms with higher system bus speed, using different coding techniques, are compared.
The second coding technique moves data at 4-byte granularity using the REP string instruction. The third column compares the performance of the coding technique listed in Example 9-10. The fourth column compares the throughput of fetching 4 KBytes of data at a time (using hardware prefetch to aggregate bus read transactions) and writing to memory via 16-byte streaming stores.
Increases in bus speed are the primary contributor to throughput improvements. The technique shown in Example 9-11 will likely take advantage of the faster bus speed in the platform more efficiently. Additionally, increasing the block size to multiples of 4 KBytes while keeping the total working set within the second-level cache can improve the throughput slightly.
The relative performance figures shown in Table 9-2 are representative of clean microarchitectural conditions within a processor (e.g. looping through a simple sequence of code many times). The net benefit of integrating a specific memory copy routine into an application (full-featured applications tend to create many complicated microarchitectural conditions) will vary for each application.
9.7.3 Deterministic Cache Parameters
If CPUID supports the deterministic parameter leaf, software can use the leaf to query each level of the cache hierarchy. Enumeration of each cache level is by specifying an index value (starting from 0) in the ECX register (see "CPUID-CPU Identification" in Chapter 3 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2A).
The list of parameters is shown in Table 9-3.
The deterministic cache parameter leaf provides a means to implement software with a degree of forward compatibility with respect to enumerating cache parameters. Deterministic cache parameters can be used in several situations, including:
• Determine the size of a cache level.
• Adapt cache blocking parameters to the different sharing topology of a cache level across Hyper-Threading Technology, multicore and single-core processors.
• Determine multithreading resource topology in an MP system (see Chapter 7, "Multiple-Processor Management," of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A).
• Determine cache hierarchy topology in a platform using multicore processors (see Example 8-13).
• Manage threads and processor affinities.
• Determine prefetch stride.
The size of a given level of cache is given by:
(# of Ways) * (Partitions) * (Line_size) * (Sets) = (EBX[31:22] + 1) * (EBX[21:12] + 1) * (EBX[11:0] + 1) * (ECX + 1)
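A minimal sketch of this computation in C, assuming ebx and ecx hold the register values returned by the deterministic cache parameter leaf for the selected sub-leaf:

unsigned long cache_size_bytes =
    (unsigned long)(((ebx >> 22) & 0x3FF) + 1)   // ways of associativity (W)
  * (((ebx >> 12) & 0x3FF) + 1)                  // physical line partitions (P)
  * ((ebx & 0xFFF) + 1)                          // system coherency line size (L)
  * (ecx + 1);                                   // number of sets (S)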
Table 9-3. Deterministic Cache Parameters Leaf
Bit Location | Name | Meaning
EAX[4:0] | Cache Type | 0 = Null (no more caches); 1 = Data Cache; 2 = Instruction Cache; 3 = Unified Cache; 4-31 = Reserved
EAX[7:5] | Cache Level | Starts at 1
EAX[8] | Self Initializing cache level | 1: does not need SW initialization
EAX[9] | Fully Associative cache | 1: Yes
EAX[13:10] | Reserved |
EAX[25:14] | Maximum number of logical processors sharing this cache | Plus 1 encoding
EAX[31:26] | Maximum number of cores in a package | Plus 1 encoding
EBX[11:0] | System Coherency Line Size (L) | Plus 1 encoding (Bytes)
EBX[21:12] | Physical Line partitions (P) | Plus 1 encoding
EBX[31:22] | Ways of associativity (W) | Plus 1 encoding
ECX[31:0] | Number of Sets (S) | Plus 1 encoding
EDX | Reserved |
NOTE: CPUID leaves > 3 and < 80000000H are only visible when IA32_CR_MISC_ENABLES.BOOT_NT4 (bit 22) is clear (the default).
9.7.3.1 Cache Sharing Using Deterministic Cache Parameters
Improving cache locality is an important part of software optimization. For example, a cache blocking algorithm can be designed to optimize block size at runtime for single-processor implementations and a variety of multiprocessor execution environments (including processors supporting HT Technology, or multicore processors).
The basic technique is to place an upper limit on the block size: it must be less than the size of the target cache level divided by the number of logical processors serviced by the target level of cache, as sketched below. This technique is applicable to multithreaded application programming. The technique can also benefit single-threaded applications that are part of a multi-tasking workload.
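A minimal sketch of the rule, assuming hypothetical helpers that return fields of the deterministic cache parameter leaf for the target cache level:

/* cache_level_size(): (W)(P)(L)(S), computed as shown in Section 9.7.3;
   sharers(): EAX[25:14] + 1 of the same leaf. Both helpers are hypothetical. */
unsigned long block_size_upper_limit(unsigned level)
{
    return cache_level_size(level) / sharers(level);
    // choose the actual blocking parameter below this limit
}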
9.7.3.2 Cache Sharing in Single-Core or Multicore
Deterministic cache parameters are useful for managing a shared cache hierarchy in multithreaded applications in more sophisticated situations. A given cache level may be shared by logical processors in a processor core or it may be implemented to be shared by logical processors in a physical processor package.
Using the deterministic cache parameter leaf and the initial APIC_ID associated with each logical processor in the platform, software can extract information on the number and the topological relationship of logical processors sharing a cache level.
See also: Section 8.9.1, "Using Shared Execution Resources in a Processor Core."
9.7.3.3 Determine Prefetch Stride
The prefetch stride (see description of CPUID.01H.EBX) provides the length of the region that the processor will prefetch with the PREFETCHh instructions (PREFETCHT0, PREFETCHT1, PREFETCHT2 and PREFETCHNTA). Software will use the length as the stride when prefetching into a particular level of the cache hierarchy as identified by the instruction used. The prefetch size is relevant for cache types of Data Cache (1) and Unified Cache (3); it should be ignored for other cache types. Software should not assume that the coherency line size is the prefetch stride.
If the prefetch stride field is zero, then software should assume a default size of 64 bytes as the prefetch stride. Software should use the following algorithm to determine what prefetch size to use, depending on whether the deterministic cache parameter mechanism is supported or the legacy mechanism (a sketch follows the list):
• If a processor supports the deterministic cache parameters and provides a non-zero prefetch size, then that prefetch size is used.
• If a processor supports the deterministic cache parameters and does not provide a prefetch size, then the default size for each level of the cache hierarchy is 64 bytes.
• If a processor does not support the deterministic cache parameters but provides a legacy prefetch size descriptor (0xF0 - 64 bytes, 0xF1 - 128 bytes), the descriptor value will be the prefetch size for all levels of the cache hierarchy.
• If a processor does not support the deterministic cache parameters and does not provide a legacy prefetch size descriptor, then 32 bytes is the default size for all levels of the cache hierarchy.
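A minimal sketch of the algorithm above. The helpers cpuid_max_leaf(), cpuid_prefetch_size_field() and legacy_prefetch_descriptor() are hypothetical wrappers around the CPUID instruction and the scan of CPUID.02H descriptor bytes for 0xF0/0xF1:

unsigned prefetch_stride_bytes(void)
{
    if (cpuid_max_leaf() >= 4) {                // deterministic parameters supported
        unsigned size = cpuid_prefetch_size_field();
        return size ? size : 64;                // zero field: default of 64 bytes
    }
    switch (legacy_prefetch_descriptor()) {     // legacy CPUID.02H descriptors
    case 0xF0: return 64;
    case 0xF1: return 128;
    default:   return 32;                       // no descriptor: 32-byte default
    }
}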
CHAPTER 9
64-BIT MODE CODING GUIDELINES
9.1 INTRODUCTION
This chapter describes coding guidelines for application software written to run in 64-bit mode. Some coding recommendations applicable to 64-bit mode are covered in Chapter 3. The guidelines in this chapter should be considered as an addendum to the coding guidelines described in Chapter 3 through Chapter 8.
Software that runs in either compatibility mode or legacy non-64-bit modes should follow the guidelines described in Chapter 3 through Chapter 8.
9.2 CODING RULES AFFECTING 64-BIT MODE
9.2.1 Use Legacy 32-Bit Instructions When Data Size Is 32 Bits
64-bit mode makes 16 general purpose 64-bit registers available to applications. If application data size is 32 bits, there is no need to use 64-bit registers or 64-bit arithmetic.
The default operand size for most instructions is 32 bits. The behavior of those instructions is to make the upper 32 bits all zeros. For example, when zeroing out a register, the following two instruction streams do the same thing, but the 32-bit version saves one instruction byte:
32-bit version:
xor eax, eax; Performs xor on lower 32 bits and zeroes the upper 32 bits.
64-bit version:
xor rax, rax; Performs xor on all 64 bits.
This optimization holds true for the lower 8 general purpose registers: EAX, ECX, EBX, EDX, ESP, EBP, ESI, EDI. To access the data in registers R9-R15, the REX prefix is required. Using the 32-bit form there does not reduce code size.
Assembly/Compiler Coding Rule 65. (H impact, M generality) Use the 32-bit versions of instructions in 64-bit mode to reduce code size unless the 64-bit version is necessary to access 64-bit data or additional registers.
9.2.2 Use Extra Registers to Reduce Register Pressure
64-bit mode makes 8 additional 64-bit general purpose registers and 8 additional XMM registers available to applications. To access the additional registers, a single-byte REX prefix is necessary. Using 8 additional registers can prevent the compiler from needing to spill values onto the stack.
Note that the potential increase in code size, due to the REX prefix, can increase cache misses. This can work against the benefit of using extra registers to access the data. When eight registers are sufficient for an algorithm, don't use the registers that require a REX prefix. This keeps the code size smaller.
Assembly/Compiler Coding Rule 66. (M impact, MH generality) When they are needed to reduce register pressure, use the 8 extra general purpose registers for integer code and 8 extra XMM registers for floating-point or SIMD code.
9.2.3 Use 64-Bit by 64-Bit Multiplies To Produce 128-Bit Results Only When Necessary
Integer multiplies of 64-bit by 64-bit operands that produce a 128-bit result cost more than multiplies that produce a 64-bit result. The upper 64 bits of a result take longer to compute than the lower 64 bits.
If the compiler can determine at compile time that the result of a multiply will not exceed 64 bits, then the compiler should generate the multiply instruction that produces a 64-bit result. If the compiler or assembly programmer cannot determine that the result will be less than 64 bits, then a multiply that produces a 128-bit result is necessary.
Assembly/Compiler Coding Rule 67. (ML impact, M generality) Prefer 64-bit by 64-bit integer multiplies that produce 64-bit results over multiplies that produce 128-bit results.
9.2.4 Sign Extension to Full 64-Bits
When in 64-bit mode, the architecture is optimized to sign-extend to 64 bits in a single µop. In 64-bit mode, when the destination is 32 bits, the upper 32 bits must be zeroed.
Zeroing the upper 32 bits requires an extra µop and is less optimal than sign extending to 64 bits. While sign extending to 64 bits makes the instruction one byte longer, it reduces the number of µops that the trace cache has to store, improving performance.
For example, to sign-extend a byte into ESI, use:
movsx rsi, BYTE PTR[rax]
instead of:
movsx esi, BYTE PTR[rax]
If the next instruction uses the 32-bit form of the esi register, the result will be the same. This optimization can also be used to break an unintended dependency. For example, if a program writes a 16-bit value to a register and then writes the register with an 8-bit value, if bits 15:8 of the destination are not needed, use the sign-extended version of writes when available.
For example:
mov r8w, r9w; Requires a merge to preserve
; bits 63:16.
mov r8b, r10b; Requires a merge to preserve bits 63:8.
Can be replaced with:
movsx r8, r9w ; If bits 63:8 do not need to be
; preserved.
movsx r8, r10b ; If bits 63:8 do not need to
; be preserved.
In the above example, the moves to R8W and R8B both require a merge to preserve the rest of the bits in the register. There is an implicit real dependency on R8 between the 'MOV R8W, R9W' and 'MOV R8B, R10B'. Using MOVSX breaks the real dependency and leaves only the output dependency, which the processor can eliminate through renaming.
Assembly/Compiler Coding Rule 68. (M impact, M generality) Sign extend to 64 bits instead of sign extending to 32 bits, even when the destination will be used as a 32-bit value.
9.3 ALTERNATE CODING RULES FOR 64-BIT MODE
9.3.1 Use 64-Bit Registers Instead of Two 32-Bit Registers for 64-Bit Arithmetic
Legacy 32-bit mode offers the ability to support extended precision integer arithmetic (such as 64-bit arithmetic). However, 64-bit mode offers native support for 64-bit arithmetic. When 64-bit integers are desired, use the 64-bit forms of arithmetic instructions.
In 32-bit legacy mode, getting a 64-bit result from a 32-bit by 32-bit integer multiply requires three registers; the result is stored in 32-bit chunks in the EDX:EAX pair. When the instruction is available in 64-bit mode, using the 32-bit version of the instruction is not the optimal implementation if a 64-bit result is desired. Use the extended registers.
For example, the following code sequence loads the 32-bit values sign-extended into the 64-bit registers and performs a multiply:
movsx rax, DWORD PTR[x]
movsx rcx, DWORD PTR[y]
imul rax, rcx
The 64-bit version above is more efficient than using the following 32-bit version:
mov eax, DWORD PTR[x]
mov ecx, DWORD PTR[y]
imul ecx
In the 32-bit case above, EAX is required to be a source. The result ends up in the EDX:EAX pair instead of in a single 64-bit register.
Assembly/Compiler Coding Rule 69. (ML impact, M generality) Use the 64-bit versions of multiply for 32-bit integer multiplies that require a 64-bit result.
To add two 64-bit numbers in 32-bit legacy mode, the ADD instruction followed by the ADC instruction is used. For example, to add two 64-bit variables (X and Y), the following four instructions could be used:
mov eax, DWORD PTR[X]
mov edx, DWORD PTR[X+4]
add eax, DWORD PTR[Y]
adc edx, DWORD PTR[Y+4]
The result will end up in the two-register pair EDX:EAX.
In 64-bit mode, the above sequence can be reduced to the following:
mov rax, QWORD PTR[X]
add rax, QWORD PTR[Y]
The result is stored in RAX. One register is required instead of two.
Assembly/Compiler Coding Rule 70. (ML impact, M generality) Use the 64-bit versions of add for 64-bit adds.
9.3.2 CVTSI2SS and CVTSI2SD
The CVTSI2SS and CVTSI2SD instructions convert a signed integer in a general-purpose register or memory location to a single-precision or double-precision floating-point value. The signed integer can be either 32 bits or 64 bits.
In processors based on Intel NetBurst microarchitecture, the 32-bit version will execute from the trace cache; the 64-bit version will result in a microcode flow from the microcode ROM and takes longer to execute. In most cases, the 32-bit versions of CVTSI2SS and CVTSI2SD are sufficient.
In processors based on Intel Core microarchitecture, CVTSI2SS and CVTSI2SD are improved significantly over those in Intel NetBurst microarchitecture, in terms of latency and throughput. The improvements apply equally to the 64-bit and 32-bit versions.
9.3.3 Using Software Prefetch
Intel recommends that software developers follow the recommendations in Chapter 3 and Chapter 7 when considering the choice of organizing data access patterns to take advantage of the hardware prefetcher (versus using software prefetch).
Assembly/Compiler Coding Rule 71. (L impact, L generality) If software prefetch instructions are necessary, use the prefetch instructions provided by SSE.
CHAPTER 10
POWER OPTIMIZATION FOR MOBILE USAGES
10.1 OVERVIEW
Mobile computing allows computers to operate anywhere, anytime. Battery life is a key factor in delivering this benefit. Mobile applications require software optimization that considers both performance and power consumption. This chapter provides background on power saving techniques in mobile processors¹ and makes recommendations that developers can leverage to provide longer battery life.
A microprocessor consumes power while actively executing instructions and doing useful work. It also consumes power in inactive states (when halted). When a processor is active, its power consumption is referred to as active power. When a processor is halted, its power consumption is referred to as static power.
ACPI 3.0 (ACPI stands for Advanced Configuration and Power Interface) provides a standard that enables intelligent power management and consumption. It does this by allowing devices to be turned on when they are needed and by allowing control of processor speed (depending on application requirements). The standard defines a number of P-states to facilitate management of active power consumption, and several C-state types² to facilitate management of static power consumption.
Pentium M, Intel Core Solo, Intel Core Duo processors, and processors based on Intel Core microarchitecture implement features designed to enable the reduction of active power and static power consumption. These include:
• Enhanced Intel SpeedStep® Technology, which enables an operating system (OS) to program a processor to transition to lower frequency and/or voltage levels while executing a workload.
• Support for various activity states (for example: Sleep states, ACPI C-states) to reduce static power consumption by turning off power to sub-systems in the processor.
Enhanced Intel SpeedStep Technology provides low-latency transitions between operating points that support P-state usages. In general, a high-numbered P-state operates at a lower frequency to reduce active power consumption. High-numbered C-state types correspond to more aggressive static power reduction. The trade-off is that transitions out of higher-numbered C-states have longer latency.

1. For Intel® Centrino® mobile technology and Intel® Centrino® Duo mobile technology, only processor-related techniques are covered in this manual.
2. The ACPI 3.0 specification defines four C-state types, known as C0, C1, C2 and C3. Microprocessors supporting the ACPI standard implement processor-specific states that map to each ACPI C-state type.
10.2 MOBILE USAGE SCENARIOS
In mobile usage models, heavy loads occur in bursts while working on battery power. Most productivity, web, and streaming workloads require modest performance investments. Enhanced Intel SpeedStep Technology provides an opportunity for an OS to implement policies that track the level of performance history and adapt the processor's frequency and voltage. If demand changes in the last 300 ms³, the technology allows the OS to optimize the target P-state by selecting the lowest possible frequency to meet demand.
Consider, for example, an application that changes processor utilization from 100% to a lower utilization and then jumps back to 100%. The diagram in Figure 10-1 shows how the OS changes processor frequency to accommodate demand and adapt power consumption. The interaction between the OS power management policy and performance history is described below:
1. Demand is high and the processor works at its highest possible frequency (P0).
2. Demand decreases, which the OS recognizes after some delay; the OS sets the processor to a lower frequency (P1).
3. The processor decreases frequency and processor utilization increases to the most effective level, 80-90% of the highest possible frequency. The same amount of work is performed at a lower frequency.
4. Demand decreases and the OS sets the processor to the lowest frequency, sometimes called Low Frequency Mode (LFM).
5. Demand increases and the OS restores the processor to the highest frequency.

3. This chapter uses numerical values representing time constants (300 ms, 100 ms, etc.) on power management decisions as examples to illustrate the order of magnitude or relative magnitude. Actual values vary by implementation and may vary between product releases from the same vendor.

Figure 10-1. Performance History and State Transitions (frequency and power versus CPU demand through steps 1-5 above)
10.3 ACPI C-STATES
When computational demands are less than 100%, part of the time the processor is doing useful work and the rest of the time it is idle. For example, the processor could be waiting on an application time-out set by a Sleep() function, waiting for a web server response, or waiting for a user mouse click. Figure 10-2 illustrates the relationship between active and idle time.

Figure 10-2. Active Time Versus Halted Time of a Processor

When an application moves to a wait state, the OS issues a HLT instruction and the processor enters a halted state in which it waits for the next interrupt. The interrupt may be a periodic timer interrupt or an interrupt that signals an event.
As shown in the illustration of Figure 10-2, the processor is in either the active or the idle (halted) state. ACPI defines four C-state types (C0, C1, C2 and C3). Processor-specific C-states can be mapped to an ACPI C-state type via ACPI standard mechanisms. The C-state types are divided into two categories: active (C0), in which the processor consumes full power; and idle (C1-3), in which the processor is idle and may consume significantly less power.
The index of a C-state type designates the depth of sleep. Higher numbers indicate a deeper sleep state and lower power consumption. They also require more time to wake up (higher exit latency).
C-state types are described below:
C0: The processor is active, performing computations and executing instructions.
C1: This is the lowest-latency idle state, which has very low exit latency. In the C1 power state, the processor is able to maintain the context of the system caches.
C2: This level has improved power savings over the C1 state. The main improvements are provided at the platform level.
C3: This level provides greater power savings than C1 or C2. In C3, the processor stops clock generation and snooping activity. It also allows system memory to enter self-refresh mode.
The basic technique to implement OS power management policy to reduce static power consumption is to evaluate processor idle durations and initiate transitions to higher-numbered C-state types. This is similar to the technique of reducing active power consumption by evaluating processor utilization and initiating P-state transitions. The OS looks at history within a time window and then sets a target C-state type for the next time window, as illustrated in Figure 10-3.

Figure 10-3. Application of C-states to Idle Time

Consider that a processor is at the lowest frequency (LFM, low frequency mode) and utilization is low. During the first time slice window (Figure 10-3 shows an example that uses a 100 ms time slice for C-state decisions), processor utilization is low and the OS decides to go to C2 for the next time slice. After the second time slice, processor utilization is still low and the OS decides to go into C3.
10.3.1 Processor-Specific C4 and Deep C4 States
The Pentium M, Intel Core Solo, Intel Core Duo processors, and processors based on Intel Core microarchitecture⁴ provide additional processor-specific C-states (and associated sub C-states) that can be mapped to the ACPI C3 state type. The processor-specific C-states and sub C-states are accessible using MWAIT extensions and can be discovered using CPUID. One of the processor-specific states used to reduce static power consumption is referred to as the C4 state. C4 provides power savings in the following manner:
• The voltage of the processor is reduced to the lowest possible level that still allows the L2 cache to maintain its state.
• In an Intel Core Solo, Intel Core Duo processor or a processor based on Intel Core microarchitecture, after staying in C4 for an extended time, the processor may enter into a Deep C4 state to save additional static power.
The processor reduces voltage to the minimum level required to safely maintain processor context. Although exiting from a Deep C4 state may require warming the cache, the performance penalty may be low enough such that the benefit of longer battery life outweighs the latency of the Deep C4 state.

4. The Pentium M processor can be detected by CPUID signature with family 6, model 9 or 13; the Intel Core Solo and Intel Core Duo processors have CPUID signature with family 6, model 14; processors based on Intel Core microarchitecture have CPUID signature with family 6, model 15.
10.4 GUIDELINES FOR EXTENDING BATTERY LIFE
Follow the guidelines below to optimize for battery life and adapt for mobile computing usage:
• Adopt a power management scheme to provide just-enough (not the highest) performance to achieve desired features or experiences.
• Avoid using spin loops.
• Reduce the amount of work the application performs while operating on a battery.
• Take advantage of hardware power conservation features using the ACPI C3 state type and coordinate processor cores in the same physical processor.
• Implement transitions to and from system sleep states (S1-S4) correctly.
• Allow the processor to operate at a higher-numbered P-state (lower frequency but higher efficiency in performance-per-watt) when demand for processor performance is low.
• Allow the processor to enter higher-numbered ACPI C-state types (deeper, low-power states) when user demand for processor activity is infrequent.
10.4.1 Adjust Performance to Meet Quality of Features
When a system is battery powered, applications can extend battery life by reducing the performance or quality of features, turning off background activities, or both. Implementing such options in an application increases the processor idle time. Processor power consumption when idle is significantly lower than when active, resulting in longer battery life.
Examples of techniques to use are:
• Reducing the quality/color depth/resolution of video and audio playback.
• Turning off automatic spell check and grammar correction.
• Turning off or reducing the frequency of logging activities.
• Consolidating disk operations over time to prevent unnecessary spin-up of the hard drive.
• Reducing the amount or quality of visual animations.
• Turning off, or significantly reducing, file scanning or indexing activities.
• Postponing possible activities until AC power is present.
Performance/quality/battery life trade-offs may vary during a single session, which makes implementation more complex. An application may need to implement an option page to enable the user to optimize settings for the user's needs (see Figure 10-4).
To be battery-power-aware, an application may use appropriate OS APIs. For Windows XP, these include:
• GetSystemPowerStatus: Retrieves system power information. This status indicates whether the system is running on AC or DC (battery) power, whether the battery is currently charging, and how much battery life remains (see the sketch after this list).
• GetActivePwrScheme: Retrieves the active power scheme (current system power scheme) index. An application can use this API to ensure that the system is running the best power scheme.
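For illustration, a minimal sketch of querying the power source with the Windows API (the reaction taken is a placeholder):

#include <windows.h>

SYSTEM_POWER_STATUS sps;
if (GetSystemPowerStatus(&sps) && sps.ACLineStatus == 0) {
    // ACLineStatus == 0 means running on battery: reduce feature quality,
    // postpone background indexing, consolidate disk operations, etc.
}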
Avoid Using Spin Loops
Spin loops are used to wait for short intervals of time or for synchronization. The main advantage of a spin loop is immediate response time. Using PeekMessage() in the Windows API has the same advantage of immediate response (but is rarely needed in current multitasking operating systems).
However, spin loops and PeekMessage() in message loops require the constant attention of the processor, preventing it from entering lower power states. Use them sparingly and replace them with the appropriate API when possible. For example:
• When an application needs to wait for more than a few milliseconds, it should avoid using spin loops and use the Windows synchronization APIs, such as WaitForSingleObject() (see the sketch below).
• When an immediate response is not necessary, an application should avoid using PeekMessage(). Use WaitMessage() to suspend the thread until a message is in the queue.
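A minimal sketch of the first replacement (hEvent is a hypothetical event handle created elsewhere with CreateEvent()):

// Busy-wait: keeps the core in C0 and blocks lower power states.
// while (!work_ready) { /* spin */ }

// Blocking wait: the OS can halt the core until the event is signaled.
if (WaitForSingleObject(hEvent, INFINITE) == WAIT_OBJECT_0) {
    // process the work item
}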
The Intel® Mobile Platform Software Development Kit⁵ provides a rich set of APIs for mobile software to manage and optimize power consumption of the mobile processor and other components in the platform.
10.4.2 Reducing Amount of Work
When a processor is in the C0 state, the amount of energy a processor consumes from the battery is proportional to the amount of time the processor executes an active workload. The most obvious technique to conserve power is to reduce the number of cycles it takes to complete a workload (usually that equates to reducing the number of instructions that the processor needs to execute, or optimizing application performance).
Optimizing an application starts with having efficient algorithms and then improving them using Intel software development tools, such as Intel VTune Performance Analyzers, Intel compilers, and Intel Performance Libraries.
See Chapter 3 through Chapter 9 for more information about performance optimization to reduce the time to complete application workloads.

5. An evaluation copy may be downloaded at http://www.intel.com/cd/software/products/asmo-na/eng/219691.htm
10.4.3 Platform-Level Optimizations
Applications can save power at the platform level by using devices appropriately and redistributing the workload. The following techniques do not impact performance and may provide additional power conservation:
• Read ahead from CD/DVD data and cache it in memory or on the hard disk to allow the DVD drive to stop spinning.
• Switch off unused devices.
• When developing a network-intensive application, take advantage of opportunities to conserve power. For example, switch to LAN from WLAN whenever both are connected.
• Send data over WLAN in large chunks to allow the WiFi card to enter low power mode between consecutive packets. The saving is based on the fact that after every send/receive operation, the WiFi card remains in high power mode for up to several seconds, depending on the power saving mode. (The purpose of keeping the WiFi card in high power mode is to enable a quick wake up.)
• Avoid frequent disk access. Each disk access forces the device to spin up and stay in high power mode for some period after the last access. Buffer small disk reads and writes to RAM to consolidate disk operations over time. Use the GetDevicePowerState() Windows API to test the disk state and delay the disk access if it is not spinning.
10.4.4 Handling Sleep State Transitions
In some cases, transitioning to a sleep state may harm an application. For example, suppose an application is in the middle of using a file on the network when the system enters suspend mode. Upon resuming, the network connection may not be available and information could be lost.
An application may improve its behavior in such situations by becoming aware of sleep state transitions. It can do this by using the WM_POWERBROADCAST message. This message contains all the necessary information for an application to react appropriately (a sketch follows at the end of this section).
Here are some examples of an application's reaction to sleep mode transitions:
• Saving state/data prior to the sleep transition and restoring state/data after the wake up transition.
• Closing all open system resource handles such as files and I/O devices (this should include duplicated handles).
• Disconnecting all communication links prior to the sleep transition and re-establishing all communication links upon waking up.
• Synchronizing all remote activity, such as writing back to remote files or to remote databases, upon waking up.
• Stopping any ongoing user activity, such as streaming video or a file download, prior to the sleep transition and resuming the user activity after the wake up transition.
Recommendation: Appropriately handling the suspend event enables more robust, better performing applications.
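A minimal sketch of such handling inside a Win32 window procedure (the Save/Restore/Disconnect helpers are hypothetical application functions; the PBT_* constants are from the Windows API):

// Fragment of the application's window procedure:
case WM_POWERBROADCAST:
    switch (wParam) {
    case PBT_APMSUSPEND:            // system is about to sleep
        SaveStateAndData();         // hypothetical: flush state, close handles
        DisconnectLinks();          // hypothetical: drop network connections
        break;
    case PBT_APMRESUMEAUTOMATIC:    // system has woken up
        RestoreStateAndData();      // hypothetical: reload state, reconnect
        break;
    }
    return TRUE;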
10.4.5 Using Enhanced Intel SpeedStep® Technology
Use Enhanced Intel SpeedStep Technology to adjust the processor to operate at a lower frequency and save energy. The basic idea is to divide computations into smaller pieces and use OS power management policy to effect a transition to higher P-states.
Typically, an OS uses a time constant on the order of 10s to 100s of milliseconds⁶ to detect demand on processor workload. For example, consider an application that requires only 50% of processor resources to reach a required quality of service (QOS). The scheduling of tasks occurs in such a way that the processor needs to stay in the P0 state (highest frequency to deliver highest performance) for 0.5 seconds and may then go to sleep for 0.5 seconds. The demand pattern then alternates.
Thus the processor demand switches between 0 and 100% every 0.5 seconds, resulting in an average of 50% of processor resources. As a result, the frequency switches accordingly between the highest and lowest frequency. The power consumption also switches in the same manner, resulting in an average power usage represented by the equation Paverage = (Pmax + Pmin)/2.
Figure 10-4 illustrates the chronological profiles of coarse-grain (> 300 ms) task scheduling and its effect on operating frequency and power consumption.

6. The actual number may vary by OS and by OS release.

The same application can be written in such a way that work units are divided into smaller granularity, with scheduling of each work unit and Sleep() occurring at more frequent intervals (e.g. 100 ms) to deliver the same QOS (operating at full performance 50% of the time). In this scenario, the OS observes that the workload does not require full performance for each 300 ms sampling. Its power management policy may then commence to lower the processor's frequency and voltage while maintaining the level of QOS.
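A hedged sketch of the idea (do_work_quantum() is a hypothetical function performing roughly 50 ms of computation; the times are illustrative, matching the examples above):

// Coarse version: 500 ms of work followed by Sleep(500); the OS sees bursts
// of 100% demand and keeps the processor at P0.
// Finer version: the same 50% duty cycle at a smaller granularity lets the
// OS observe ~50% demand per sampling window and drop to a lower P-state:
for (;;) {
    do_work_quantum();   // hypothetical: ~50 ms of computation
    Sleep(50);           // idle; OS power policy observes the idle time
}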
The relationship between active power consumption, frequency and voltage is expressed by the equation:
Pactive ∝ α × V² × F
In the equation, V is core voltage, F is operating frequency, and α is the activity factor. Typically, the quality of service for 100% performance at 50% duty cycle can be met by 50% performance at 100% duty cycle. Because the slope of frequency scaling efficiency of most workloads will be less than one, reducing the core frequency to 50% can achieve more than 50% of the original performance level. At the same time, reducing the core frequency to 50% allows for a significant reduction of the core voltage.
Because executing instructions at a higher P-state (lower power state) takes less energy per instruction than at the P0 state, the energy saved relative to the half of the duty cycle spent in the P0 state (Pmax/2) more than compensates for the increase over the half of the duty cycle spent at inactive power consumption (Pmin/2). The non-linear relationship of power consumption to frequency and voltage means that changing the task unit to finer granularity will deliver substantial energy savings. This optimization is possible when processor demand is low (such as with media streaming, playing a DVD, or running less resource-intensive applications like a word processor, email or web browsing).
An additional positive effect of continuously operating at a lower frequency is avoiding frequent changes in power draw (from low to high in our case) and battery current, which eventually harm the battery and accelerate its deterioration.

Figure 10-4. Profiles of Coarse Task Scheduling and Power Consumption (frequency and power, average power, and CPU demand over time)
When the lowest possible operating point (highest P-state) is reached, there is no need for dividing computations. Instead, use longer idle periods to allow the processor to enter a deeper low power mode.
10.4.6 Enabling Intel® Enhanced Deeper Sleep
In typical mobile computing usages, the processor is idle most of the time. Conserving battery life must address reducing static power consumption.
Typical OS power management policy periodically evaluates opportunities to reduce static power consumption by moving to lower-power C-states. Generally, the longer a processor stays idle, the deeper the low-power C-state into which OS power management policy directs the processor.
After an application reaches the lowest possible P-state, it should consolidate computations in larger chunks to enable the processor to enter deeper C-states between computations. This technique utilizes the fact that the decision to change frequency is made based on a larger window of time than the period used to decide to enter deep sleep. If the processor is to enter a processor-specific C4 state to take advantage of aggressive static power reduction features, the decision should be based on:
• Whether the QOS can be maintained in spite of the fact that the processor will be in a low-power, long-exit-latency state for a long period.
• Whether the interval in which the processor stays in C4 is long enough to amortize the longer exit latency of this low-power C-state.
Eventually, if the interval is large enough, the processor will be able to enter deeper sleep and save a considerable amount of power. The following guidelines can help applications take advantage of Intel® Enhanced Deeper Sleep:
• Avoid setting higher interrupt rates. Shorter periods between interrupts may keep OSes from entering lower power states. This is because a transition to/from a deep C-state consumes power, in addition to a latency penalty. In some cases, the overhead may outweigh power savings.
• Avoid polling hardware. In an ACPI C3 type state, the processor may stop snooping, and each bus activity (including DMA and bus mastering) requires moving the processor to a lower-numbered C-state type. The lower-numbered state type is usually C2, but may even be C0. The situation is significantly improved in the Intel Core Solo processor (compared to previous generations of the Pentium M processors), but polling will likely prevent the processor from entering into the highest-numbered, processor-specific C-state.
10.4.7 Multicore Considerations
Multicore processors deserve some special considerations when planning power savings. The dual-core architecture in the Intel Core Duo processor and mobile processors based on Intel Core microarchitecture provides additional potential for power savings for multi-threaded applications.
10.4.7.1 Enhanced Intel SpeedStep® Technology
Using domain composition, a single-threaded application can be transformed to take advantage of multicore processors. A transformation into two domain threads means that each thread will execute roughly half of the original number of instructions. Dual-core architecture enables running two threads simultaneously, each thread using dedicated resources in a processor core. In an application that is targeted for mobile usages, this instruction count reduction for each thread enables the physical processor to operate at a lower frequency relative to a single-threaded version. This in turn enables the processor to operate at a lower voltage, saving battery life.
Note that the OS views each logical processor or core in a physical processor as a separate entity and computes CPU utilization independently for each logical processor or core. On demand, the OS will choose to run at the highest frequency available in a physical package. As a result, a physical processor with two cores will often work at a higher frequency than it needs to satisfy the target QOS.
For example, if one thread requires 60% of single-threaded execution cycles and the other thread requires 40% of the cycles, the OS power management may direct the physical processor to run at 60% of its maximum frequency.
However, it may be possible to divide work equally between threads so that each of them requires 50% of execution cycles. As a result, both cores should be able to operate at 50% of the maximum frequency (as opposed to 60%). This will allow the physical processor to work at a lower voltage, saving power.
So, while planning and tuning your application, make threads as symmetric as possible in order to operate at the lowest possible frequency-voltage point.
10.4.7.2 Thread Migration Considerations
Interaction of OS scheduling and multicore-unaware power management policy may create some situations of performance anomaly for multi-threaded applications. The problem can arise for a multithreaded application that allows threads to migrate freely.
When one full-speed thread is migrated from one core to another core that has idled for a period of time, an OS without a multicore-aware P-state coordination policy may mistakenly decide that each core demands only 50% of processor resources (based on idle history). The processor frequency may be reduced by such multicore-unaware P-state coordination, resulting in a performance anomaly. See Figure 10-5.

Figure 10-5. Thread Migration in a Multicore Processor (activity alternating between Core 1 and Core 2, each idle half the time)
Software applications have a couple of choices to prevent this from happening:
• Thread affinity management: A multi-threaded application can enumerate processor topology and assign processor affinity to application threads to prevent thread migration (see the sketch below). This can work around the issue of an OS lacking a multicore-aware P-state coordination policy.
• Upgrade to an OS with a multicore-aware P-state coordination policy: Some newer OS releases may include a multicore-aware P-state coordination policy. The reader should consult with specific OS vendors.
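A minimal sketch of affinity management on Windows (the 1:1 mapping of a core index to an affinity bit is an assumption; real code should derive the mask from the enumerated processor topology):

#include <windows.h>

// Pin a thread to one logical processor so that idle history on the other
// core cannot mislead per-core P-state decisions.
void pin_thread_to_core(HANDLE thread, DWORD core_index)
{
    DWORD_PTR mask = (DWORD_PTR)1 << core_index; // one bit per logical processor
    SetThreadAffinityMask(thread, mask);
}

For example, pin_thread_to_core(GetCurrentThread(), 0) keeps the calling thread on the first logical processor.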
10.4.7.3 Multicore Considerations for C-States
There are two issues that impact C-states on multicore processors.
Multicore-unaware C-state Coordination May Not Fully Realize Power Savings
When each core in a multicore processor meets the requirements necessary to enter a different C-state type, multicore-unaware hardware coordination causes the physical processor to enter the lowest possible C-state type (a lower-numbered C-state has less power saving). For example, if Core 1 meets the requirement to be in ACPI C1 and Core 2 meets the requirement for ACPI C3, multicore-unaware OS coordination takes the physical processor to ACPI C1. See Figure 10-6.
Enabling Both Cores to Take Advantage of Intel Enhanced Deeper Sleep
To best utilize processor-specific C-states (e.g., Intel Enhanced Deeper Sleep) to conserve battery life in multithreaded applications, a multi-threaded application should synchronize threads to work simultaneously and sleep simultaneously using OS synchronization primitives. By keeping the package in a fully idle state longer (satisfying the ACPI C3 requirement), the physical processor can transparently take advantage of the processor-specific Deep C4 state, if it is available.
Multi-threaded applications need to identify and correct load imbalances of their threaded execution before implementing coordinated thread synchronization. Identifying thread imbalance can be accomplished using performance monitoring events. The Intel Core Duo processor provides an event for this purpose. The event (Serial_Execution_Cycle) increments under the following conditions:
• The core is actively executing code in the C0 state.
• The second core in the physical processor is in an idle state (C1-C4).
This event enables software developers to find code that is executing serially, by comparing Serial_Execution_Cycle and Unhalted_Ref_Cycles. Changing sections of serialized code to execute in two parallel threads enables coordinated thread synchronization to achieve better power savings.
Although Serial_Execution_Cycle is available only on Intel Core Duo processors, the load-imbalance situation of application threads usually remains the same for symmetric application threads on symmetrically configured multicore processors, irrespective of differences in their underlying microarchitecture. For this reason, the technique to identify load-imbalance situations can be applied to multi-threaded applications in general, and is not specific to Intel Core Duo processors.
Figure 10-6. Progression to Deeper Sleep (Thread 1 on core 1 and Thread 2 on core 2 alternating between simultaneous Active and Sleep phases, allowing the CPU package to progress to Deeper Sleep)
APPENDIX A
APPLICATION PERFORMANCE TOOLS
Intel offers an array of application performance tools that are optimized to take advantage of the Intel architecture (IA)-based processors. This appendix introduces these tools and explains their capabilities for developing the most efficient programs without having to write assembly code.
The following performance tools are available:
• Intel® C++ Compiler and Intel® Fortran Compiler: Intel compilers generate highly optimized executable code for Intel 64 and IA-32 processors. The compilers support advanced optimizations that include vectorization for MMX technology, the Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2), Streaming SIMD Extensions 3 (SSE3), and Supplemental Streaming SIMD Extensions 3 (SSSE3).
• VTune Performance Analyzer: The VTune analyzer collects, analyzes, and displays Intel architecture-specific software performance data from the system-wide view down to a specific line of code.
• Intel Performance Libraries: The Intel Performance Library family consists of a set of software libraries optimized for Intel architecture processors. The library family includes the following:
- Intel® Math Kernel Library (Intel® MKL)
- Intel® Integrated Performance Primitives (Intel® IPP)
• Intel Threading Tools: Intel Threading Tools consist of the following:
- Intel Thread Checker
- Thread Profiler
A.1 COMPILERS
Intel compilers support several general optimization settings, including /O1, /O2, /O3, and /fast. Each of them enables a number of specific optimization options. In most cases, /O2 is recommended over /O1 because the /O2 option enables function expansion, which helps programs that have many calls to small functions. The /O1 option may sometimes be preferred when code size is a concern. The /O2 option is on by default.
The /Od (-O0 on Linux) option disables all optimizations. The /O3 option enables more aggressive optimizations, most of which are effective only in conjunction with the processor-specific optimizations described below.
The /fast option maximizes speed across the entire program. For most Intel 64 and IA-32 processors, the /fast option is equivalent to /O3 /Qipo /QxP (-O3 -ipo -static -xP on Linux). For Mac OS, the "-fast" option is equivalent to "-O3 -ipo".
All the command-line options are described in the Intel® C++ Compiler documentation.
A.1.1 Recommended Optimization Settings for Intel 64 and IA-32 Processors
64-bit addressable code can only run in 64-bit mode of processors that support Intel 64 architecture. The optimal compiler settings for 64-bit code generation are different from those for 32-bit code generation. Table A-1 lists recommended compiler options for generating 32-bit code for Intel 64 and IA-32 processors. Table A-1 also applies to code targeted to run in compatibility mode on an Intel 64 processor, but does not apply to running in 64-bit mode. Table A-2 lists recommended compiler options for generating 64-bit code for Intel 64 processors; it only applies to code targeted to run in 64-bit mode. Intel compilers provide separate compiler binaries to generate 64-bit code versus 32-bit code. The 64-bit compiler binary generates only 64-bit addressable code.
Table A-1. Recommended IA-32 Processor Optimization Options
Need | Recommendation | Comments
Best performance on Intel Core 2 processor family and Intel Xeon processor 3000 and 5100 series, utilizing SSSE3 and other processor-specific instructions | /QxT (-xT on Linux) | Single code path. Will not run on earlier processors that do not support SSSE3.
Best performance on Intel Core 2 processor family and Intel Xeon processor 3000 and 5100 series, utilizing SSSE3; runs on non-Intel processors supporting SSE2 | /QaxT /QxW (-axT -xW on Linux) | Multiple code paths are generated. Be sure to validate your application on all systems where it may be deployed.
Best performance on IA-32 processors with SSE3 instruction support | /QxP (-xP on Linux) | Single code path. Will not run on earlier processors that do not support SSE3.
Best performance on IA-32 processors with SSE2 instruction support | /QaxN (-axN on Linux); optimized for Pentium 4 and Pentium M processors, plus an optimized, generic code path to run on other processors | Multiple code paths are generated. Use /QxN (-xN for Linux) if you know your application will not be run on processors older than the Pentium 4 or Pentium M processors.
Best performance on IA-32 processors with SSE3 instruction support, for multiple code paths | /QaxP /QxW (-axP -xW on Linux); optimized for the Pentium 4 processor and the Pentium 4 processor with SSE3 instruction support | Generates two code paths: one for the Pentium 4 processor, one for the Pentium 4 processor or non-Intel processors with SSE3 instruction support.
Table A-2. Recommended Processor Optimization Options for 64-bit Code

Need: Best performance on Intel Core 2 processor family and Intel Xeon processor
3000 and 5100 series, utilizing SSSE3 and other processor-specific instructions
Recommendation: /QxT (-xT on Linux)
Comments: Single code path. Will not run on earlier processors that do not support
SSSE3.

Need: Best performance on Intel Core 2 processor family and Intel Xeon processor
3000 and 5100 series, utilizing SSSE3; runs on non-Intel processors supporting
SSE2
Recommendation: /QaxT /QxW (-axT -xW on Linux)
Comments: Multiple code paths are generated. Be sure to validate your application
on all systems where it may be deployed.

Need: Best performance on other processors supporting Intel 64 architecture,
utilizing SSE3 where possible
Recommendation: /QxP (-xP on Linux)
Comments: A single code path is generated. Will not run on processors that do not
support Intel 64 architecture and SSE3.

Need: Best performance on other processors supporting Intel 64 architecture,
utilizing SSE3 where possible, while still running on older Intel as well as non-Intel
x86-64 processors supporting SSE2
Recommendation: /QaxP /QxW (-axP -xW on Linux)
Comments: Multiple code paths are generated. Be sure to validate your application
on all systems where it may be deployed.
A.1.2 Vectorization and Loop Optimization
The Intel C++ and Fortran Compilers' vectorization feature can detect sequential
data access by the same instruction and transform the code to use SSE, SSE2,
SSE3, and SSSE3, depending on the target processor platform. The vectorizer
supports the following features:
• Multiple data types: float/double, char/short/int/long (both signed and
unsigned), and _Complex float/double are supported.
• Step-by-step diagnostics: Through the /Qvec-report[n] (-vec-report[n] on Linux
and Mac OS) switch (see Table A-3), the vectorizer can identify, line-by-line and
variable-by-variable, what code was vectorized, what code was not vectorized,
and, more importantly, why it was not vectorized. This feedback gives the
developer the information necessary to slightly adjust or restructure code, with
dependency directives and restrict keywords, to allow vectorization to occur.
• Advanced dynamic data-alignment strategies: Alignment strategies include loop
peeling and loop unrolling. Loop peeling can generate aligned loads, enabling
faster application performance. Loop unrolling matches the prefetch of a full
cache line and allows better scheduling.
• Portable code: By using appropriate Intel compiler switches to take advantage
of new processor features, developers can avoid the need to rewrite source code.
The processor-specific vectorizer switch options are -Qx[K,W,N,P,T] and
-Qax[K,W,N,P,T]. The compiler provides a number of other vectorizer switch
options that allow you to control vectorization. The latter switches require the
-Qx[K,W,N,P,T] or -Qax[K,W,N,P,T] switch to be on. The default is off.
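The effect of pointer disambiguation can be sketched with a short, hypothetical
loop (this fragment is illustrative, not taken from the compiler documentation).
With the restrict qualifier accepted (see the /Qrestrict switch in Table A-3), the
compiler knows dst does not alias a or b and is free to vectorize:

void vadd(float *restrict dst, const float *restrict a,
          const float *restrict b, int n)
{
    /* unit-stride accesses with no aliasing: a vectorization candidate */
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

Without restrict (or equivalent dependency directives), the compiler must assume
the stores through dst may overlap the loads and typically reports the loop as not
vectorized in the /Qvec-report output.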
Table A-3. Vectorization Control Switch Options

-Qvec_report[n]  Controls the vectorizer's diagnostic levels, where n is either 0, 1, 2, or 3.
-Qrestrict       Enables pointer disambiguation with the restrict qualifier.
A.1.2.1 Multithreading with OpenMP*
Both the Intel C++ and Fortran Compilers support shared memory parallelism
using OpenMP compiler directives, library functions and environment variables.
OpenMP directives are activated by the compiler switch /Qopenmp (-openmp on
Linux and Mac OS). The available directives are described in the Compiler User's
Guides available with the Intel C++ and Fortran Compilers. For information about
the OpenMP standard, see http://www.openmp.org.
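As a minimal sketch (this fragment is illustrative, not from the Compiler User's
Guides), a loop whose iterations are independent can be spread across threads
with a single directive when built with /Qopenmp:

#include <omp.h>

void scale(float *x, float s, int n)
{
    /* each iteration touches a distinct element, so the loop can be
       split across the OpenMP thread team */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        x[i] *= s;
}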
A.1.2.2 Automatic Multithreading
Both the Intel C++ and Fortran Compilers can generate multithreaded code
automatically for simple loops with no dependencies. This is activated by the
compiler switch /Qparallel (-parallel on Linux and Mac OS).
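For illustration, a dependence-free loop of the kind the auto-parallelizer can
thread under /Qparallel might look like the following hypothetical fragment; no
source changes or directives are required:

void saxpy(float *y, const float *x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* no cross-iteration dependence */
}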
A.1.3 Inline Expansion of Library Functions (/Oi, /Oi-)
The compiler inlines a number of standard C, C++, and math library functions by
default. This usually results in faster execution. Sometimes, however, inline
expansion of library functions can cause unexpected results. For an explanation,
see the Intel C++ Compiler documentation.
A.1.4 Floating-point Arithmetic Precision (/Op, /Op-, /Qprec,
/Qprec_div, /Qpc, /Qlong_double)
These options provide a means of controlling optimizations that might result in a
small change in the precision of floating-point arithmetic.
A.1.5 Rounding Control Option (/Qrcr, /Qrcd)
The compiler uses the -Qrcd option to improve the performance of code that
requires floating-point calculations. The optimization is obtained by controlling the
change of the rounding mode.
The -Qrcd option disables the change to truncation of the rounding mode in
floating-point-to-integer conversions.
For more on code optimization options, see the Intel C++ Compiler documentation.
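The behavior being traded off can be sketched as follows (a hypothetical
illustration, assuming /Qrcd leaves the processor in its default round-to-nearest
mode): C semantics require the conversion to truncate toward zero, which on x87
code otherwise forces a rounding-mode change around every conversion.

#include <stdio.h>

int main(void)
{
    float f = 2.75f;
    int   i = (int)f;   /* 2 under standard truncation; under /Qrcd the
                           current rounding mode is used, so the result
                           may round to 3 */
    printf("%d\n", i);
    return 0;
}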
A.1.6 Interprocedural and Profile-Guided Optimizations
The following are two methods to improve the performance of your code based on
its unique profile and procedural dependencies.
A.1.6.1 Interprocedural Optimization (IPO)
You can use the /Qip (-ip on Linux and Mac OS) option to analyze your code and
apply optimizations between procedures within each source file. Use multifile IPO
with /Qipo (-ipo on Linux and Mac OS) to enable the optimizations between
procedures in separate source files.
A.1.6.2 Profile-Guided Optimization (PGO)
PGO creates an instrumented program from your source code and special code
from the compiler. Each time this instrumented code is executed, the compiler
generates a dynamic information file. When you compile a second time, the
dynamic information files are merged into a summary file. Using the profile
information in this file, the compiler attempts to optimize the execution of the most
heavily travelled paths in the program.
Profile-guided optimization is particularly beneficial for the Pentium 4 and Intel
Xeon processor family. It greatly enhances the optimization decisions the compiler
makes regarding instruction cache utilization and memory paging. Also, because
PGO uses execution-time information to guide the optimizations, branch prediction
can be significantly enhanced by reordering branches and basic blocks to keep the
most commonly used paths in the microarchitecture pipeline, as well as generating
the appropriate branch hints for the processor.
When you use PGO, consider the following guidelines:
• Minimize the changes to your program after instrumented execution and before
feedback compilation. During feedback compilation, the compiler ignores
dynamic information for functions modified after that information was
generated.
NOTE
The compiler issues a warning that the dynamic information
corresponds to a modified function.
• Repeat the instrumentation compilation if you make many changes to your
source files after execution and before feedback compilation.
For more on code optimization options, see the Intel C++ Compiler documentation.
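A minimal sketch of the two-pass PGO flow on Windows, assuming the /Qprof-gen
and /Qprof-use switches described in the compiler documentation (-prof-gen and
-prof-use on Linux):

icl /Qprof-gen myapp.c        (build an instrumented binary)
myapp.exe                     (run representative workloads; dynamic
                               information files are written)
icl /Qprof-use /O2 myapp.c    (feedback compilation using the merged
                               summary file)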
A.1.7 Auto-Generation of Vectorized Code
This section covers several high-level language examples that show how
programmers can use the Intel Compiler to generate vectorized code automatically.
The following examples illustrate the likely differences between two compiler
switches.
Example 10-1. Storing Absolute Values
int dst[1024], src[1024];
for (i = 0; i < 1024; i++) {
  dst[i] = (src[i] >= 0) ? src[i] : -src[i];
}
Example 10-2. Auto-Generated Code of Storing Absolutes

Compiler Switch QxW:
movdqa xmm1, _src[eax*4]
pxor xmm0, xmm0
pcmpgtd xmm0, xmm1
pxor xmm1, xmm0
psubd xmm1, xmm0
movdqa _dst[eax*4], xmm1
add eax, 4
cmp eax, 1024
jb $B1$3

Compiler Switch QxT:
pabsd xmm0, _src[eax*4]
movdqa _dst[eax*4], xmm0
add eax, 4
cmp eax, 1024
jb $B1$3
Example 10-3. Changes Signs
int dst[1024], src[1024];
for (i = 0; i < 1024; i++) {
  if (src[i] == 0)
    { dst[i] = 0; }
  else if (src[i] < 0)
    { dst[i] = -dst[i]; }
}
Example 10-4. Auto-Generated Code of Sign Conversion

Compiler Switch QxW:
$B1$3:
mov edx, _src[eax*4]
add eax, 1
test edx, edx
jne $B1$5
$B1$4:
mov _dst[eax*4], 0
jmp $B1$7
ALIGN 4
$B1$5:
jge $B1$7
$B1$6:
mov edx, _dst[eax*4]
neg edx
mov _dst[eax*4], edx
$B1$7:
cmp eax, 1024
jl $B1$3

Compiler Switch QxT:
$B1$3:
movdqa xmm0, _dst[eax*4]
psignd xmm0, _src[eax*4]
movdqa _dst[eax*4], xmm0
add eax, 4
cmp eax, 1024
jb $B1$3
Example 10-5. Data Conversion
int dst[1024];
unsigned char src[1024];
for (i = 0; i < 1024; i++) {
  dst[i] = src[i];
}
Example 10-6. Auto-Generated Code of Data Conversion

Compiler Switch QxW:
$B1$2:
xor eax, eax
pxor xmm0, xmm0
$B1$3:
movd xmm1, _src[eax]
punpcklbw xmm1, xmm0
punpcklwd xmm1, xmm0
movdqa _dst[eax*4], xmm1
add eax, 4
cmp eax, 1024
jb $B1$3

Compiler Switch QxT:
$B1$2:
movdqa xmm0, _2il0fl2t$1DD
xor eax, eax
$B1$3:
movd xmm1, _src[eax]
pshufb xmm1, xmm0
movdqa _dst[eax*4], xmm1
add eax, 4
cmp eax, 1024
jb $B1$3

_2il0fl2t$1DD 0ffffff00H, 0ffffff01H, 0ffffff02H, 0ffffff03H

The Intel Compiler can use PALIGNR to generate code that avoids the penalties
associated with unaligned loads.

Example 10-7. Unaligned Data Operation
__declspec(align(16)) float src[1024], dst[1024];
for (i = 2; i < 1024-2; i++)
  dst[i] = src[i-2] - src[i-1] - src[i+2];

Example 10-8. Auto-Generated Code to Avoid Unaligned Loads

Compiler Switch QxW:
$B2$2:
movups xmm0, _src[eax+4]
movaps xmm1, _src[eax]
movaps xmm4, _src[eax+16]
movsd xmm3, _src[eax+20]
subps xmm1, xmm0
subps xmm1, _src[eax+16]
movss xmm2, _src[eax+28]
movhps xmm2, _src[eax+32]
movups _dst[eax+8], xmm1
shufps xmm3, xmm2, 132
subps xmm4, xmm3
subps xmm4, _src[eax+32]
movlps _dst[eax+24], xmm4
movhps _dst[eax+32], xmm4
add eax, 32
cmp eax, 4064
jb $B2$2

Compiler Switch QxT:
$B2$2:
movaps xmm2, _src[eax+16]
movaps xmm0, _src[eax]
movdqa xmm3, _src[eax+32]
movdqa xmm1, xmm2
palignr xmm3, xmm2, 4
palignr xmm1, xmm0, 4
subps xmm0, xmm1
subps xmm0, _src[eax+16]
movups _dst[eax+8], xmm0
subps xmm2, xmm3
subps xmm2, _src[eax+32]
movlps _dst[eax+24], xmm2
movhps _dst[eax+32], xmm2
add eax, 32
cmp eax, 4064
jb $B2$2

A.2 INTEL® VTUNE® PERFORMANCE ANALYZER
The Intel VTune Performance Analyzer is a powerful software-profiling tool for
Microsoft Windows and Linux. The VTune analyzer helps you understand the
performance characteristics of your software at all levels: system, application, and
microarchitecture.
The sections that follow describe the major features of the VTune analyzer and
briefly explain how to use them. For more details on these features, run the VTune
analyzer and see the online documentation.
All features are available for Microsoft Windows. On Linux, sampling and call graph
are available.

A.2.1 Sampling
Sampling allows you to profile all active software on your system, including the
operating system, device drivers, and application software. It works by
occasionally interrupting the processor and collecting the instruction address,
process ID, and thread ID. After the sampling activity completes, the VTune
analyzer displays the data by process, thread, software module, function, or line of
source. There are two methods for generating samples: time-based sampling and
event-based sampling.
A.2.1.1 Time-based Sampling
Time-based sampling (TBS) uses an operating system's (OS) timer to periodically
interrupt the processor to collect samples. The sampling interval is user definable.
TBS is useful for identifying the software on your computer that is taking the most
CPU time. This feature is only available in the Windows version of the VTune
analyzer.
A.2.1.2 Event-based Sampling
Event-based sampling (EBS) can be used to provide detailed information on the
behavior of the microprocessor as it executes software. Some of the events that
can be used to trigger sampling include clockticks, cache misses, and branch
mispredictions. The VTune analyzer indicates where microarchitectural events,
specific to the Intel Core microarchitecture, Pentium 4, Pentium M and Intel Xeon
processors, occur the most often. On processors based on Intel Core
microarchitecture, it is possible to collect up to 5 events (three events using
fixed-function counters, two events using general-purpose counters) at a time from
a list of over 400 events (see Appendix A, "Performance Monitoring Events" of the
Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B). On
Pentium M processors, the VTune analyzer can collect two different events at a
time. The number of events that the VTune analyzer can collect at once on the
Pentium 4 and Intel Xeon processor depends on the events selected.
Event-based samples are collected periodically after a specific number of processor
events have occurred while the program is running. The program is interrupted,
allowing the interrupt handling driver to collect the instruction pointer (IP), load
module, thread and process IDs. The instruction pointer is then used to derive the
function and source line number from the debug information created at compile
time. The data can be displayed as horizontal bar charts or, in more detail, as
spreadsheets that can be exported for further manipulation and easy
dissemination.
A.2.1.3 Workload Characterization
Using event-based sampling and processor-specific events can provide useful
insights into the nature of the interaction between a workload and the
microarchitecture. A few metrics useful for workload characterization are discussed
in Appendix B. The event lists available on various Intel processors can be found in
Appendix A, "Performance Monitoring Events" of the Intel 64 and IA-32
Architectures Software Developer's Manual, Volume 3B.
A.2.2 Call Graph
Call graph helps you understand the relationships between the functions in your
application by providing timing and caller/callee (functions called) information. Call
graph works by instrumenting the functions in your application. Instrumentation is
the process of modifying a function so that performance data can be captured
when the function is executed. Instrumentation does not change the functionality
of the program. However, it can reduce performance. The VTune analyzer can
detect modules as they are loaded by the operating system, and instrument them
at run-time. Call graph can be used to profile Win32*, Java*, and Microsoft .NET*
applications. Call graph only works for application (ring 3) software.
Call graph profiling provides the following information on the functions called by
your application: total time, self-time, total wait time, wait time, callers, callees,
and the number of calls. This data is displayed using three different views: function
summary, call graph, and call list. These views are all synchronized.
The Function Summary View can be used to focus the data displayed in the call
graph and call list views. This view displays all the information about the functions
called by your application in a sortable table format. However, it does not provide
callee and caller information. It just provides timing information and the number of
times a function is called.
The Call Graph View depicts the caller/callee relationships. Each thread in the
application is the root of a call tree. Each node (box) in the call tree represents a
function. Each edge (line with an arrow) connecting two nodes represents the call
from the parent to the child function. If the mouse pointer is hovered over a node,
a tool tip will pop up displaying the function's timing information.
The Call List View is useful for analyzing programs with large, complex call trees.
This view displays only the caller and callee information for the single function that
you select in the Function Summary View. The data is displayed in a table format.
A.2.3 Counter Monitor
Counter monitor helps you identify system-level performance bottlenecks. It
periodically polls software and hardware performance counters. The performance
counter data can help you understand how your application is impacting the
performance of the computer's various subsystems. Counter monitor data can be
displayed in real-time and logged to a file. The VTune analyzer can also correlate
performance counter data with sampling data. This feature is only available in the
Windows version of the VTune analyzer.
A.3 INTEL® PERFORMANCE LIBRARIES
The Intel Performance Library family contains a variety of specialized libraries that
have been optimized for performance on Intel processors. These optimizations
take advantage of appropriate architectural features, including MMX technology,
Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2) and
Streaming SIMD Extensions 3 (SSE3). The library set includes the Intel Math
Kernel Library (MKL) and the Intel Integrated Performance Primitives (IPP).
• The Intel Math Kernel Library for Linux and Windows: MKL is composed of highly
optimized mathematical functions for engineering, scientific and financial
applications requiring high performance on Intel platforms. The functional areas
of the library include linear algebra consisting of LAPACK and BLAS, Discrete
Fourier Transforms (DFT), vector transcendental functions (vector math
library/VML) and vector statistical functions (VSL). Intel MKL is optimized for
the latest features and capabilities of the Intel Pentium 4 processor, Pentium M
processor, Intel Xeon processors and Intel® Itanium® 2 processors. A call
sketch appears after this list.
• Intel® Integrated Performance Primitives for Linux* and Windows*: IPP is a
cross-platform software library which provides a range of library functions for
video decode/encode, audio decode/encode, image color conversion, computer
vision, data compression, string processing, signal processing, image
processing, JPEG decode/encode, speech recognition, speech decode/encode,
and cryptography, plus math support routines for such processing capabilities.
Intel IPP is optimized for the broad range of Intel microprocessors: Intel Core 2
processor family, Dual-core Intel Xeon processors, Intel Pentium 4 processor,
Pentium M processor, Intel Xeon processors, the Intel Itanium architecture,
Intel® SA-1110 and Intel® PCA application processors based on the Intel
XScale® microarchitecture. With a single API across the range of platforms,
users can have platform compatibility and reduced cost of development.
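As an illustration of the C interface, a single call can replace a hand-tuned matrix
multiply. This is a minimal sketch assuming the standard CBLAS header
(mkl_cblas.h) from an MKL installation and linkage against the MKL libraries:

#include <mkl_cblas.h>

/* C = A * B for n-by-n row-major matrices, using MKL's optimized DGEMM */
void multiply(const double *A, const double *B, double *C, int n)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}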
A.3.1 Benefits Summary
The overall benefits the libraries provide to application developers are as follows:
• Time-to-Market: Low-level building block functions that support rapid
application development, improving time to market.
• Performance: Highly-optimized routines with a C interface that give
assembly-level performance in a C/C++ development environment (MKL also
supports a Fortran interface).
• Platform tuned: Processor-specific optimizations that yield the best
performance for each Intel processor.
• Compatibility: Processor-specific optimizations with a single application
programming interface (API) to reduce development costs while providing
optimum performance.
• Threaded application support: Applications can be threaded with the assurance
that the MKL and IPP functions are safe for use in a threaded environment.
A.3.2 Optimizations with the Intel® Performance Libraries
The Intel Performance Libraries implement a number of optimizations that are
discussed throughout this manual. Examples include architecture-specific tuning
such as loop unrolling, instruction pairing and scheduling; and memory
management with explicit and implicit data prefetching and cache tuning.
The libraries take advantage of the parallelism in the SIMD instructions using MMX
technology, Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2
(SSE2), and Streaming SIMD Extensions 3 (SSE3). These techniques improve the
performance of computationally intensive algorithms and deliver hand-coded
performance in a high-level language development environment.
For performance-sensitive applications, the Intel Performance Libraries free the
application developer from the time-consuming task of assembly-level
programming for a multitude of frequently used functions. The time required for
prototyping and implementing new application features is substantially reduced
and, most important, the time to market is substantially improved. Finally,
applications developed with the Intel Performance Libraries benefit from new
architectural features of future generations of Intel processors simply by relinking
the application with upgraded versions of the libraries.
A.4 INTEL® THREADING ANALYSIS TOOLS
The Intel® Threading Analysis Tools consist of the Intel Thread Checker 3.0, the
Thread Profiler 3.0, and the Intel Threading Building Blocks 1.0 (see footnote 1).
The Intel Thread Checker and Thread Profiler support Windows and Linux. The Intel
Threading Building Blocks 1.0 supports Windows, Linux, and Mac OS.
1 For additional threading resources, visit
http://www3.intel.com/cd/software/products/asmo-na/eng/index.htm
A.4.1 Intel® Thread Checker 3.0
The Intel Thread Checker locates programming errors (for example: data races,
stalls and deadlocks) in threaded applications. Use the Intel Thread Checker to find
threading errors and reduce the amount of time you spend debugging your
threaded application.
The Intel Thread Checker product is an Intel VTune Performance Analyzer plug-in
data collector that executes your program and automatically locates threading
errors. As your program runs, the Intel Thread Checker monitors memory accesses
and other events and automatically detects situations which could cause
unpredictable threading-related results. The Intel Thread Checker detects thread
deadlocks, stalls, data race conditions and more.
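The kind of defect the tool flags can be sketched with a small, hypothetical
fragment: multiple threads perform an unsynchronized read-modify-write on a
shared counter, a classic data race.

#include <omp.h>

int count_positive(const int *v, int n)
{
    int hits = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        if (v[i] > 0)
            hits++;   /* unsynchronized update shared across threads:
                         a data race the Intel Thread Checker reports */
    return hits;
}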
A.4.2 Intel Thread Profiler 3.0
The thread profiler is a plug-in data collector for the Intel VTune Performance
Analyzer. Use it to analyze threading performance and identify parallel
performance problems. The thread profiler graphically illustrates what each thread
is doing at various levels of detail using a hierarchical summary. It can identify
inactive threads, critical paths and imbalances in thread execution, and so on.
Mountains of data are collapsed into relevant summaries, sorted to identify parallel
regions or loops that require attention. Its intuitive, color-coded displays make it
easy to assess your application's performance.
Figure A-1 shows the execution timeline of a multi-threaded application when run
in (a) a single-threaded environment, (b) a multi-threaded environment capable of
executing two threads simultaneously, and (c) a multi-threaded environment
capable of executing four threads simultaneously. In Figure A-1, the color-coded
timelines of the three hardware configurations are super-imposed to compare
processor scaling performance and illustrate the imbalance of thread execution.

Figure A-1. Intel Thread Profiler Showing Critical Paths of Threaded Execution
Timelines

A load imbalance problem is visually identified in the two-way platform by noting
that there is a significant portion of the timeline during which one logical processor
had no task to execute. In the four-way platform, one can easily identify those
portions of the timeline of three logical processors, each having no task to execute.

A.4.3 Intel Threading Building Blocks 1.0
The Intel Threading Building Blocks is a C++ template-based runtime library that
simplifies threading for scalable, multi-core performance. It can help avoid
rewriting, retesting, and retuning common parallel data structures and algorithms.
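A minimal sketch of the library's loop template follows (assuming TBB 1.0-style
headers and an explicit grain size; this fragment is illustrative, not from the TBB
documentation):

#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

class Scale {
    float *a_;
public:
    explicit Scale(float *a) : a_(a) {}
    void operator()(const tbb::blocked_range<size_t> &r) const {
        for (size_t i = r.begin(); i != r.end(); ++i)
            a_[i] *= 2.0f;            // body chunks run on worker threads
    }
};

void scale_all(float *a, size_t n)
{
    tbb::task_scheduler_init init;    // starts the task scheduler
    tbb::parallel_for(tbb::blocked_range<size_t>(0, n, 1024), Scale(a));
}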
A.5 INTEL® SOFTWARE COLLEGE
You can find information on classroom training offered by the Intel Software
College at http://developer.intel.com/software/college. Find general information
for developers at http://softwarecommunity.intel.com/isn/home/.
APPENDIX B
USING PERFORMANCE MONITORING EVENTS
Performance monitoring events provide facilities to characterize the interaction
between programmed sequences of instructions and microarchitectural
sub-systems. Performance monitoring events are described in Chapter 18 and
Appendix A of the Intel 64 and IA-32 Architectures Software Developer's Manual,
Volume 3B.
The first part of this appendix provides information on how to use performance
events specific to processors based on the Intel NetBurst microarchitecture.
Section B.5 discusses similar topics for performance events available on Intel Core
Solo and Intel Core Duo processors.
B.1 PENTIUM® 4 PROCESSOR PERFORMANCE METRICS
The descriptions of Intel Pentium 4 processor performance metrics use terminology
that is specific to the Intel NetBurst microarchitecture and to implementations in
the Pentium 4 and Intel Xeon processors. The performance metrics in Table B-1
through Table B-13 apply to processors with a CPUID signature that matches family
encoding 15, model encoding 0, 1, 2, 3, 4, or 6. Several new performance metrics
are available to IA-32 processors with a CPUID signature that matches family
encoding 15, model encoding 3; the new metrics are listed in Table B-11.
The performance metrics listed in Tables B-1 through B-7 may be applicable to
processors that support HT Technology. See Appendix B.4, "Using Performance
Metrics with Hyper-Threading Technology."
B.1.1 Pentium® 4 Processor-Specific Terminology
B.1.1.1 Bogus, Non-bogus, Retire
Branch mispredictions incur a large penalty on microprocessors with deep
pipelines. In general, the direction of branches can be predicted with a high degree
of accuracy by the front end of the Intel Pentium 4 processor, such that most
computations can be performed along the predicted path while waiting for the
resolution of the branch.
In the event of a misprediction, instructions and µops that were scheduled to
execute along the mispredicted path must be cancelled. These instructions and
µops are referred to as bogus instructions and bogus µops. A number of Pentium 4
processor performance monitoring events, for example, instr_retired and
uops_retired, can count instructions or µops that are retired based on the
characterization of bogus versus non-bogus.
In the event descriptions in Table B-1, the term "bogus" refers to instructions or
µops that must be cancelled because they are on a path taken from a mispredicted
branch. The terms "retired" and "non-bogus" refer to instructions or µops along the
path that results in committed architectural state changes as required by the
program execution. Instructions and µops are either bogus or non-bogus, but not
both.
B.1.1.2 Bus Ratio
Bus Ratio is the ratio of the processor clock to the bus clock. In the Bus Utilization
metric, it is the bus_ratio.
B.1.1.3 Replay
In order to maximize performance for the common case, the Intel NetBurst
microarchitecture sometimes aggressively schedules µops for execution before all
the conditions for correct execution are guaranteed to be satisfied. In the event
that all of these conditions are not satisfied, µops must be re-issued. This
mechanism is called replay.
Some occurrences of replays are caused by cache misses, dependence violations
(for example, store forwarding problems), and unforeseen resource constraints. In
normal operation, some number of replays is common and unavoidable. An
excessive number of replays indicates a performance problem.
B.1.1.4 Assist
When the hardware needs the assistance of microcode to deal with some event,
the machine takes an assist. One example of such a situation is an underflow
condition in the input operands of a floating-point operation.
The hardware must internally modify the format of the operands in order to
perform the computation. Assists clear the entire machine of µops before they
begin to accumulate, and are costly. The assist mechanism on the Pentium 4
processor is similar in principle to that on the Pentium II processors, which also
have an assist event.
B.1.1.5 Tagging
Tagging is a means of marking µops to be counted at retirement. See Appendix A
of the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B,
for the description of tagging mechanisms.
The same event can happen more than once per µop. The tagging mechanisms
allow a µop to be tagged once during its lifetime. The "retired" suffix is used for
metrics that increment a count once per µop, rather than once per event. For
example, a µop may encounter a cache miss more than once during its lifetime,
but the "Misses Retired" metric (for example, 1st-Level Cache Misses Retired) will
increment only once for that µop.
B.1.2 Counting Clocks
The count of cycles (known as clock ticks) forms a fundamental basis for
measuring how long a program takes to execute. The count is also used as part of
efficiency ratios like cycles-per-instruction (CPI). Some processor clocks may stop
ticking under certain circumstances:
• The processor is halted (for example: during I/O). There may be nothing for the
CPU to do while servicing a disk read request and the processor may halt to
save power. When HT Technology is enabled, both logical processors must be
halted for performance-monitoring-related counters to be powered down.
• The processor is asleep, either as a result of being halted for a while or as part
of a power-management scheme. There are different levels of sleep. In the
deeper sleep levels, the time-stamp counter stops counting.
Three mechanisms to count processor clock cycles for monitoring performance are:
• Non-Halted Clock Ticks: Clocks when the specified logical processor is neither
halted nor in any power-saving state. These can be measured on a
per-logical-processor basis, when HT Technology is enabled.
• Non-Sleep Clock Ticks: Clocks when the physical processor is not in any of the
sleep modes, nor power-saving states. These cannot be measured on a
per-logical-processor basis.
• Time-stamp Counter: Clocks when the physical processor is not in deep sleep.
These cannot be measured on a per-logical-processor basis.
The first two metrics use performance counters and can cause an interrupt upon
overflow for sampling. They may also be useful for cases where it is easier for a
tool to read a performance counter instead of the time-stamp counter. The
time-stamp counter is accessed using an RDTSC instruction.
For applications with a significant amount of I/O, there are two ratios of interest:
• Non-Halted CPI: Non-halted clock ticks/instructions retired measures the CPI
for the phases where the CPU was being used. This ratio can be measured on a
per-logical-processor basis, when HT Technology is enabled.
• Nominal CPI: Time-stamp counter ticks/instructions retired measures the CPI
over the entire duration of the program, including those periods when the
machine is halted while waiting for I/O.
The distinction between the two CPI is important for processors that support HT
Technology. Non-halted CPI should use the "non-halted clock ticks" performance
metric in the numerator. Nominal CPI should use "non-sleep clock ticks" in the
numerator. "Non-sleep clock ticks" is the same as the "clock ticks" metric in
previous editions of this manual.
B.1.2.1 Non-Halted Clock Ticks
Non-halted clock ticks can be obtained by programming the appropriate ESCR and
CCCR following the recipe listed in the general metrics category in Table B-1. In
addition, the T0_OS/T0_USR/T1_OS/T1_USR bits may be specified to qualify a
specific logical processor and kernel as opposed to user mode.
B.1.2.2 Non-Sleep Clock Ticks
Performance monitoring counters can be configured to count clocks whenever the
performance monitoring hardware is not powered down. To count non-sleep clock
ticks with a performance-monitoring counter:
• Select one of the 18 counters.
• Select any of the possible ESCRs whose events the selected counter can count,
and set its event select to anything other than no_event. This may not seem
necessary, but the counter may be disabled in some cases if this is not done.
• Turn threshold comparison on in the CCCR by setting the compare bit to 1.
• Set the threshold to 15 and the complement to 1 in the CCCR. Since no event
can ever exceed this threshold, the threshold condition is met every cycle and
the counter counts every cycle. Note that this overrides any qualification (for
example: by CPL) specified in the ESCR.
• Enable counting in the CCCR for that counter by setting the enable bit.
The counts produced by the Non-halted and Non-sleep metrics are equivalent in
most cases if each physical package supports one logical processor and is not in
any power-saving state. The operating system may execute the HLT instruction
and place a physical processor in a power-saving state.
On processors that support HT Technology, each physical package can support two
or more logical processors. Current implementations of HT Technology provide two
logical processors for each physical processor.
While both logical processors can execute two threads simultaneously, one logical
processor may be halted to allow the other to execute without having to share
execution resources. Non-halted clock ticks can be qualified to count the number of
clock cycles for a logical processor that is not halted (the count may include the
clock cycles required to complete a transition into a halted state).
A physical processor that supports HT Technology enters into a power-saving state
if all logical processors are halted.
Non-sleep clock ticks use is based on the filtering mechanism in the CCCR: the
count continues to increment as long as one logical processor is not halted or in a
power-saving state. An application may indirectly cause a processor to enter into a
power-saving state by using an OS service that transfers control to the OS's idle
loop. The system idle loop may place the processor into a power-saving state after
an implementation-dependent period if there is no work to do.
B.1.2.3 Time-Stamp Counter
The time-stamp counter increments whenever the sleep pin is not asserted or
when the clock signal on the system bus is active. It is read using the RDTSC
instruction. The difference in values between two reads (modulo 2**64) gives the
number of processor clocks between reads.
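A minimal sketch of reading the counter from user code, assuming GCC-style
inline assembly (RDTSC returns the 64-bit count in EDX:EAX):

#include <stdint.h>

static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* usage: uint64_t t0 = read_tsc();  ... measured region ...
          uint64_t cycles = read_tsc() - t0;  (modulo 2**64) */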
The time-stamp counter and "non-sleep clock ticks" counts should agree in
practically all cases if the physical processor is not in a power-saving state.
However, it is possible to have both logical processors in a physical package
halted, which results in most of the chip (including the performance monitoring
hardware) being powered down. In this situation, it is possible for the time-stamp
counter to continue incrementing because the clock signal on the system bus is
still active, but "non-sleep clock ticks" may no longer increment because the
performance monitoring hardware is in a power-saving state.
B.2 METRICS DESCRIPTIONS AND CATEGORIES
Performance metrics for Intel Pentium 4 and Intel Xeon processors are listed in
Table B-1 through Table B-7. These performance metrics consist of recipes to
program specific Pentium 4 and Intel Xeon processor performance monitoring
events to obtain event counts that represent: number of instructions, cycles, or
occurrences. The tables also include ratios that are derived from counts of other
performance metrics.
On processors that support HT Technology, performance counters and associated
model-specific registers (MSRs) are extended to support HT Technology. A subset
of performance monitoring events allows the event counts to be qualified by logical
processor. The interface for qualification of performance monitoring events by
logical processor is documented in the Intel 64 and IA-32 Architectures Software
Developer's Manual, Volumes 3A & 3B. Other performance monitoring events
produce counts that are independent of which logical processor is associated with
microarchitectural events. Which performance metrics support HT Technology
qualification is listed in Table B-11 and Table B-12.
In Table B-1 through Table B-7, recipes for programming performance metrics
using performance-monitoring events are arranged as follows:
• Column 1 specifies the metric. The metric may be a single-event metric; for
example, the metric Instructions Retired is based on the counts of the
performance monitoring event instr_retired, using a specific set of event mask
bits. Or the metric may be an expression built up from other metrics. For
example, IPC is derived from two single-event metrics.
• Column 2 provides a description of the metric in column 1. Please refer to
Appendix B.1.1, "Pentium 4 Processor-Specific Terminology," for terms that are
specific to the Pentium 4 processor's capabilities.
• Column 3 specifies the performance monitoring events or algebraic expressions
that form metrics. There are several metrics that require yet another sub-event
in addition to the counting event. The additional sub-event information is
included in column 3 as various tags. These are described in Appendix B.3,
"Performance Metrics and Tagging Mechanisms." For event names that appear
in this column, refer to the Intel 64 and IA-32 Architectures Software
Developer's Manual, Volumes 3A & 3B.
• Column 4 specifies the event mask bits for setting up count events. The
addresses of various model-specific registers (MSR), the event mask bits in
Event Select Control registers (ESCR), and the bit fields in Counter
Configuration Control registers (CCCR) are described in the Intel 64 and IA-32
Architectures Software Developer's Manual, Volumes 3A & 3B.
Metrics listed in Table B-1 through Table B-7 cover the following categories:
• General: Operation not specific to any sub-system of the microarchitecture
• Branching: Branching activities
• Trace Cache and Front End: Front end activities and trace cache operation
modes
• Memory: Memory operation related to the cache hierarchy
• Bus: Activities related to the Front-Side Bus (FSB)
• Characterization: Operations specific to the processor core
• Machine Clear
Table B-1. Performance Metrics - General

Metric: Non-sleep clock ticks
Description: The number of clock ticks while a processor is not in any sleep mode.
Event/Expression: See the explanation on counting clocks in Section B.1.2.

Metric: Non-halted clock ticks
Description: The number of clock ticks during which the processor is neither halted
nor in sleep.
Event/Expression: Global_power_events
Event mask: RUNNING

Metric: Instructions Retired
Description: Non-bogus instructions executed to completion. May count more than
once for some instructions with complex µop flow or if instructions were
interrupted before retirement. The count may vary depending on the
microarchitectural state when counting begins.
Event/Expression: Instr_retired
Event mask: NBOGUSNTAG | NBOGUSTAG

Metric: Non-Sleep CPI
Description: Cycles per instruction for a physical processor package.
Event/Expression: (Non-Sleep Clock Ticks) / (Instructions Retired)

Metric: Non-Halted CPI
Description: Cycles per instruction for a logical processor.
Event/Expression: (Non-Halted Clock Ticks) / (Instructions Retired)

Metric: µops Retired
Description: Non-bogus µops executed to completion.
Event/Expression: Uops_retired
Event mask: NBOGUS

Metric: UPC
Description: µops per cycle for a logical processor.
Event/Expression: (µops Retired) / (Non-Halted Clock Ticks)

Metric: Speculative µops Retired
Description: Number of µops retired. This includes instructions executed to
completion and speculatively executed in the path of branch mispredictions.
Event/Expression: Uops_retired
Event mask: NBOGUS | BOGUS
Table B-2. Performance Metrics - Branching

Metric: Branches Retired
Description: All branch instructions executed to completion.
Event/Expression: Branch_retired
Event mask: MMTM | MMNM | MMTP | MMNP

Metric: Tagged Mispredicted Branches Retired
Description: Counts the number of retired branch instructions that were
mispredicted. This stat can be used with precise event-based sampling.
Event/Expression: Replay_event; set the following replay tag:
Tagged_mispred_branch
Event mask: NBOGUS

Metric: Mispredicted Branches Retired
Description: Mispredicted branch instructions executed to completion. This stat is
often used in a per-instruction ratio.
Event/Expression: Mispred_branch_retired
Event mask: NBOGUS

Metric: Misprediction Ratio
Description: Misprediction rate per branch.
Event/Expression: (Mispredicted Branches Retired) / (Branches Retired)

Metric: All returns
Description: Number of return branches.
Event/Expression: retired_branch_type
Event mask: RETURN

Metric: All indirect branches
Description: All returns and indirect calls and indirect jumps.
Event/Expression: retired_branch_type
Event mask: INDIRECT

Metric: All calls
Description: All direct and indirect calls.
Event/Expression: retired_branch_type
Event mask: CALL

Metric: Mispredicted returns
Description: Number of mispredicted returns, including all causes.
Event/Expression: retired_mispred_branch_type
Event mask: RETURN

Metric: All conditionals
Description: Number of branches that are conditional jumps. This may overcount if
the branch is from build mode or there is a machine clear near the branch.
Event/Expression: retired_branch_type
Event mask: CONDITIONAL

Metric: Mispredicted indirect branches
Description: All mispredicted returns and indirect calls and indirect jumps.
Event/Expression: retired_mispred_branch_type
Event mask: INDIRECT

Metric: Mispredicted calls
Description: All mispredicted indirect calls.
Event/Expression: retired_branch_type
Event mask: CALL

Metric: Mispredicted conditionals
Description: Number of mispredicted branches that are conditional jumps.
Event/Expression: retired_mispred_branch_type
Event mask: CONDITIONAL
Table B-3. Performance Metrics - Trace Cache and Front End

Metric: Page Walk Miss ITLB
Description: Number of page walk requests due to ITLB misses.
Event/Expression: page_walk_type
Event mask: ITMISS

Metric: ITLB Misses
Description: Number of ITLB lookups that result in a miss. "Page Walk Miss ITLB"
is less speculative than "ITLB Misses" and is the recommended alternative.
Event/Expression: ITLB_reference
Event mask: MISS

Metric: TC Flushes
Description: Number of TC flushes. The counter will count twice for each
occurrence; divide the count by two to get the number of flushes.
Event/Expression: TC_misc
Event mask: FLUSH

Metric: Logical Processor 0 Deliver Mode
Description: Number of cycles that the trace and delivery engine (TDE) is
delivering traces associated with logical processor 0, regardless of the operating
modes of the TDE for traces associated with logical processor 1. If a physical
processor supports only one logical processor, all traces are associated with logical
processor 0. This was formerly known as "Trace Cache Deliver Mode."
Event/Expression: TC_deliver_mode
Event mask: SS | SB | SI

Metric: Logical Processor 1 Deliver Mode
Description: Number of cycles that the trace and delivery engine (TDE) is
delivering traces associated with logical processor 1, regardless of the operating
modes of the TDE for traces associated with logical processor 0. This metric is
applicable only if a physical processor supports HT Technology and has two logical
processors per package.
Event/Expression: TC_deliver_mode
Event mask: SS | BS | IS

Metric: % Logical Processor N In Deliver Mode
Description: Fraction of all non-halted cycles for which the trace cache is
delivering µops associated with a given logical processor.
Event/Expression: (Logical Processor N Deliver Mode) * 100 / (Non-Halted Clock
Ticks)

Metric: Logical Processor 0 Build Mode
Description: Number of cycles that the trace and delivery engine (TDE) is building
traces associated with logical processor 0, regardless of the operating modes of
the TDE for traces associated with logical processor 1. If a physical processor
supports only one logical processor, all traces are associated with logical
processor 0.
Event/Expression: TC_deliver_mode
Event mask: BB | BS | BI

Metric: Logical Processor 1 Build Mode
Description: Number of cycles that the trace and delivery engine (TDE) is building
traces associated with logical processor 1, regardless of the operating modes of
the TDE for traces associated with logical processor 0. This metric is applicable
only if a physical processor supports HT Technology and has two logical processors
per package.
Event/Expression: TC_deliver_mode
Event mask: BB | SB | IB

Metric: Trace Cache Misses
Description: Number of times that significant delays occurred in order to decode
instructions and build a trace because of a TC miss.
Event/Expression: BPU_fetch_request
Event mask: TCMISS

Metric: TC to ROM Transfers
Description: Twice the number of times that ROM microcode is accessed to decode
complex instructions instead of building or delivering traces. Divide the count by 2
to get the number of occurrences.
Event/Expression: tc_ms_xfer
Event mask: CISC

Metric: Speculative TC-Built µops
Description: Number of speculative µops originating when the TC is in build mode.
Event/Expression: uop_queue_writes
Event mask: FROM_TC_BUILD

Metric: Speculative TC-Delivered µops
Description: Number of speculative µops originating when the TC is in deliver
mode.
Event/Expression: uop_queue_writes
Event mask: FROM_TC_DELIVER

Metric: Speculative Microcode µops
Description: Number of speculative µops originating from the microcode ROM. Not
all µops of an instruction from the microcode ROM will be included.
Event/Expression: uop_queue_writes
Event mask: FROM_ROM
Table B-4. Performance Metrics - Memory

Metric: Page Walk DTLB All Misses
Description: Number of page walk requests due to DTLB misses from either load
or store.
Event/Expression: page_walk_type
Event mask: DTMISS

Metric: 1st Level Cache Load Misses Retired
Description: Number of retired µops that experienced 1st-level cache load misses.
This stat is often used in a per-instruction ratio.
Event/Expression: Replay_event; set the following replay tag:
1stL_cache_load_miss_retired
Event mask: NBOGUS

Metric: 2nd Level Cache Load Misses Retired
Description: Number of retired load µops that experienced 2nd-level cache
misses. This stat is known to undercount when loads are spaced apart.
Event/Expression: Replay_event; set the following replay tag:
2ndL_cache_load_miss_retired
Event mask: NBOGUS

Metric: DTLB Load Misses Retired
Description: Number of retired load µops that experienced DTLB misses.
Event/Expression: Replay_event; set the following replay tag:
DTLB_load_miss_retired
Event mask: NBOGUS

Metric: DTLB Store Misses Retired
Description: Number of retired store µops that experienced DTLB misses.
Event/Expression: Replay_event; set the following replay tag:
DTLB_store_miss_retired
Event mask: NBOGUS

Metric: DTLB Load and Store Misses Retired
Description: Number of retired load or store µops that experienced DTLB misses.
Event/Expression: Replay_event; set the following replay tag:
DTLB_all_miss_retired
Event mask: NBOGUS

Metric: 64-KByte Aliasing Conflicts (see note 1)
Description: Number of 64-KByte aliasing conflicts. A memory reference causing a
64-KByte aliasing conflict can be counted more than once in this stat. The
performance penalty resulting from a 64-KByte aliasing conflict can vary from
unnoticeable to considerable. Some implementations of the Pentium 4 processor
family can incur significant penalties for loads that alias to preceding stores.
Event/Expression: Memory_cancel
Event mask: 64K_CONF

Metric: Split Load Replays
Description: Number of load references to data that spanned two cache lines.
Event/Expression: Memory_complete
Event mask: LSC

Metric: Split Loads Retired
Description: Number of retired load µops that spanned two cache lines.
Event/Expression: Replay_event; set the following replay tag: Split_load_retired
Event mask: NBOGUS

Metric: Split Store Replays
Description: Number of store references spanning across a cache line boundary.
Event/Expression: Memory_complete
Event mask: SSC

Metric: Split Stores Retired
Description: Number of retired store µops spanning two cache lines.
Event/Expression: Replay_event; set the following replay tag: Split_store_retired
Event mask: NBOGUS

Metric: MOB Load Replays
Description: Number of replayed loads related to the Memory Order Buffer (MOB).
This metric counts only the case where the store-forwarding data is not an aligned
subset of the stored data.
Event/Expression: MOB_load_replay
Event mask: PARTIAL_DATA, UNALGN_ADDR

Metric: 2nd Level Cache Read Misses (see note 2)
Description: Number of 2nd-level cache read misses (load and RFO misses).
Beware of granularity differences.
Event/Expression: BSQ_cache_reference
Event mask: RD_2ndL_MISS

Metric: 2nd Level Cache Read References (see note 2)
Description: Number of 2nd-level cache read references (loads and RFOs). Beware
of granularity differences.
Event/Expression: BSQ_cache_reference
Event mask: RD_2ndL_HITS, RD_2ndL_HITE, RD_2ndL_HITM, RD_2ndL_MISS

Metric: 3rd Level Cache Read Misses (see note 2)
Description: Number of 3rd-level cache read misses (load and RFO misses).
Beware of granularity differences.
Event/Expression: BSQ_cache_reference
Event mask: RD_3rdL_MISS

Metric: 3rd Level Cache Read References (see note 2)
Description: Number of 3rd-level cache read references (loads and RFOs). Beware
of granularity differences.
Event/Expression: BSQ_cache_reference
Event mask: RD_3rdL_HITS, RD_3rdL_HITE, RD_3rdL_HITM, RD_3rdL_MISS

Metric: 2nd Level Cache Reads Hit Shared
Description: Number of 2nd-level cache read references (loads and RFOs) that hit
a cache line in shared state. Beware of granularity differences.
Event/Expression: BSQ_cache_reference
Event mask: RD_2ndL_HITS

Metric: 2nd Level Cache Reads Hit Modified
Description: Number of 2nd-level cache read references (loads and RFOs) that hit
a cache line in modified state. Beware of granularity differences.
Event/Expression: BSQ_cache_reference
Event mask: RD_2ndL_HITM

Metric: 2nd Level Cache Reads Hit Exclusive
Description: Number of 2nd-level cache read references (loads and RFOs) that hit
a cache line in exclusive state. Beware of granularity differences.
Event/Expression: BSQ_cache_reference
Event mask: RD_2ndL_HITE

Metric: 3rd Level Cache Reads Hit Shared
Description: Number of 3rd-level cache read references (loads and RFOs) that hit
a cache line in shared state. Beware of granularity differences.
Event/Expression: BSQ_cache_reference
Event mask: RD_3rdL_HITS

Metric: 3rd Level Cache Reads Hit Modified
Description: Number of 3rd-level cache read references (loads and RFOs) that hit
a cache line in modified state. Beware of granularity differences.
Event/Expression: BSQ_cache_reference
Event mask: RD_3rdL_HITM

Metric: 3rd Level Cache Reads Hit Exclusive
Description: Number of 3rd-level cache read references (loads and RFOs) that hit
a cache line in exclusive state. Beware of granularity differences.
Event/Expression: BSQ_cache_reference
Event mask: RD_3rdL_HITE

Metric: MOB Load Replays Retired
Description: Number of retired load µops that experienced replays related to the
MOB.
Event/Expression: Replay_event; set the following replay tag:
MOB_load_replay_retired
Event mask: NBOGUS

Metric: Loads Retired
Description: Number of retired load operations that were tagged at the front end.
Event/Expression: Front_end_event; set the following front end tag:
Memory_loads
Event mask: NBOGUS

Metric: Stores Retired
Description: Number of retired store operations that were tagged at the front end.
This stat is often used in a per-instruction ratio.
Event/Expression: Front_end_event; set the following front end tag:
Memory_stores
Event mask: NBOGUS

Metric: All WCB Evictions
Description: Number of times a WC buffer eviction occurred due to any cause.
This can be used to distinguish 64-KByte aliasing cases that contribute more
significantly to the performance penalty, for example: stores that are 64-KByte
aliased. A high count of this metric when there is no significant contribution due to
the write combining buffer full condition may indicate the above situation.
Event/Expression: WC_buffer
Event mask: WCB_EVICTS

Metric: WCB Full Evictions
Description: Number of times a WC buffer eviction occurred when all of the WC
buffers were allocated.
Event/Expression: WC_buffer
Event mask: WCB_FULL_EVICT

NOTES:
1. A memory reference causing a 64K aliasing conflict can be counted more than
once in this stat. The resulting performance penalty can vary from unnoticeable to
considerable. Some implementations of the Pentium 4 processor family can incur
significant penalties from loads that alias to preceding stores.
2. Currently, bugs in this event can cause both overcounting and undercounting by
as much as a factor of 2.
B-17
USING PERFORMANCE MONITORING EVENTS
Table B-5. Performance Metrics - Bus

Bus Accesses from the Processor
  Description: Number of all bus transactions allocated in the IO Queue from this processor. Beware of granularity issues with this event. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  Event: IOQ_allocation
  Mask: 1a. ReqA0, ALL_READ, ALL_WRITE, OWN, PREFETCH (CPUID model < 2); 1b. ReqA0, ALL_READ, ALL_WRITE, MEM_WB, MEM_WT, MEM_WP, MEM_WC, MEM_UC, OWN, PREFETCH (CPUID model >= 2). 2: Enable edge filtering¹ in the CCCR.

Non-prefetch Bus Accesses from the Processor
  Description: Number of all bus transactions allocated in the IO Queue from this processor, excluding prefetched sectors. Beware of granularity issues with this event. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  Event: IOQ_allocation
  Mask: 1a. ReqA0, ALL_READ, ALL_WRITE, OWN (CPUID model < 2); 1b. ReqA0, ALL_READ, ALL_WRITE, MEM_WB, MEM_WT, MEM_WP, MEM_WC, MEM_UC, OWN (CPUID model >= 2). 2: Enable edge filtering¹ in the CCCR.

Prefetch Ratio
  Description: Fraction of all bus transactions (including retries) that were for HW or SW prefetching.
  Expression: (Bus Accesses - Non-prefetch Bus Accesses) / (Bus Accesses)

FSB Data Ready
  Description: Number of front-side bus clocks that the bus is transmitting data driven by this processor. This includes full reads|writes and partial reads|writes and implicit writebacks.
  Event: FSB_data_activity
  Mask: 1: DRDY_OWN, DRDY_DRV. 2: Enable edge filtering¹ in the CCCR.

Bus Utilization
  Description: Percentage of time that the bus is actually occupied.
  Expression: (FSB Data Ready) * Bus_ratio * 100 / Non-Sleep Clock Ticks

Reads from the Processor
  Description: Number of all read (includes RFOs) transactions on the bus that were allocated in the IO Queue from this processor (includes prefetches). Beware of granularity issues with this event. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  Event: IOQ_allocation
  Mask: 1a. ReqA0, ALL_READ, OWN, PREFETCH (CPUID model < 2); 1b. ReqA0, ALL_READ, MEM_WB, MEM_WT, MEM_WP, MEM_WC, MEM_UC, OWN, PREFETCH (CPUID model >= 2). 2: Enable edge filtering¹ in the CCCR.

Writes from the Processor
  Description: Number of all write transactions on the bus allocated in the IO Queue from this processor (excludes RFOs). Beware of granularity issues with this event. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  Event: IOQ_allocation
  Mask: 1a. ReqA0, ALL_WRITE, OWN (CPUID model < 2); 1b. ReqA0, ALL_WRITE, MEM_WB, MEM_WT, MEM_WP, MEM_WC, MEM_UC, OWN (CPUID model >= 2). 2: Enable edge filtering¹ in the CCCR.

Reads Non-prefetch from the Processor
  Description: Number of all read transactions (includes RFOs but excludes prefetches) on the bus that originated from this processor. Beware of granularity issues with this event. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  Event: IOQ_allocation
  Mask: 1a. ReqA0, ALL_READ, OWN (CPUID model < 2); 1b. ReqA0, ALL_READ, MEM_WB, MEM_WT, MEM_WP, MEM_WC, MEM_UC, OWN (CPUID model >= 2). 2: Enable edge filtering¹ in the CCCR.

All WC from the Processor
  Description: Number of Write Combining memory transactions on the bus that originated from this processor. Beware of granularity issues with this event. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  Event: IOQ_allocation
  Mask: 1a. ReqA0, MEM_WC, OWN (CPUID model < 2); 1b. ReqA0, ALL_READ, ALL_WRITE, MEM_WC, OWN (CPUID model >= 2). 2: Enable edge filtering¹ in the CCCR.

All UC from the Processor
  Description: Number of UC (Uncacheable) memory transactions on the bus that originated from this processor. Beware of granularity issues (for example, a store of a dqword to UC memory requires two entries in the IOQ allocation). Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  Event: IOQ_allocation
  Mask: 1a. ReqA0, MEM_UC, OWN (CPUID model < 2); 1b. ReqA0, ALL_READ, ALL_WRITE, MEM_UC, OWN (CPUID model >= 2). 2: Enable edge filtering¹ in the CCCR.

Bus Accesses from All Agents
  Description: Number of all bus transactions that were allocated in the IO Queue by all agents. Beware of granularity issues with this event. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  Event: IOQ_allocation
  Mask: 1a. ReqA0, ALL_READ, ALL_WRITE, OWN, OTHER, PREFETCH (CPUID model < 2); 1b. ReqA0, ALL_READ, ALL_WRITE, MEM_WB, MEM_WT, MEM_WP, MEM_WC, MEM_UC, OWN, OTHER, PREFETCH (CPUID model >= 2). 2: Enable edge filtering¹ in the CCCR.

Bus Accesses Underway from the processor²
  Description: Accrued sum of the durations of all bus transactions by this processor. Divide by Bus Accesses from the Processor to get bus request latency. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  Event: IOQ_active_entries
  Mask: 1a. ReqA0, ALL_READ, ALL_WRITE, OWN, PREFETCH (CPUID model < 2); 1b. ReqA0, ALL_READ, ALL_WRITE, MEM_WB, MEM_WT, MEM_WP, MEM_WC, MEM_UC, OWN, PREFETCH (CPUID model >= 2).

Bus Reads Underway from the processor²
  Description: Accrued sum of the durations of all read (includes RFOs) transactions by this processor. Divide by Reads from the Processor to get bus read request latency. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  Event: IOQ_active_entries
  Mask: 1a. ReqA0, ALL_READ, OWN, PREFETCH (CPUID model < 2); 1b. ReqA0, ALL_READ, MEM_WB, MEM_WT, MEM_WP, MEM_WC, MEM_UC, OWN, PREFETCH (CPUID model >= 2).

Non-prefetch Reads Underway from the processor²
  Description: Accrued sum of the durations of read (includes RFOs but excludes prefetches) transactions that originate from this processor. Divide by Reads Non-prefetch from the Processor to get non-prefetch read request latency. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  Event: IOQ_active_entries
  Mask: 1a. ReqA0, ALL_READ, OWN (CPUID model < 2); 1b. ReqA0, ALL_READ, MEM_WB, MEM_WT, MEM_WP, MEM_WC, MEM_UC, OWN (CPUID model >= 2).

All UC Underway from the processor²
  Description: Accrued sum of the durations of all UC transactions by this processor. Divide by All UC from the Processor to get UC request latency. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  Event: IOQ_active_entries
  Mask: 1a. ReqA0, MEM_UC, OWN (CPUID model < 2); 1b. ReqA0, ALL_READ, ALL_WRITE, MEM_UC, OWN (CPUID model >= 2).

All WC Underway from the processor²
  Description: Accrued sum of the durations of all WC transactions by this processor. Divide by All WC from the Processor to get WC request latency. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  Event: IOQ_active_entries
  Mask: 1a. ReqA0, MEM_WC, OWN (CPUID model < 2); 1b. ReqA0, ALL_READ, ALL_WRITE, MEM_WC, OWN (CPUID model >= 2).

Bus Writes Underway from the processor²
  Description: Accrued sum of the durations of all write transactions by this processor. Divide by Writes from the Processor to get bus write request latency. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  Event: IOQ_active_entries
  Mask: 1a. ReqA0, ALL_WRITE, OWN (CPUID model < 2); 1b. ReqA0, ALL_WRITE, MEM_WB, MEM_WT, MEM_WP, MEM_WC, MEM_UC, OWN (CPUID model >= 2).

Bus Accesses Underway from All Agents²
  Description: Accrued sum of the durations of entries by all agents on the bus. Divide by Bus Accesses from All Agents to get bus request latency. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between a CPUID model field value of 2 and model values less than 2.
  Event: IOQ_active_entries
  Mask: 1a. ReqA0, ALL_READ, ALL_WRITE, OWN, OTHER, PREFETCH (CPUID model < 2); 1b. ReqA0, ALL_READ, ALL_WRITE, MEM_WB, MEM_WT, MEM_WP, MEM_WC, MEM_UC, OWN, OTHER, PREFETCH (CPUID model >= 2).

Write WC Full (BSQ)
  Description: Number of write (but neither writeback nor RFO) transactions to WC-type memory.
  Event: BSQ_allocation
  Mask: 1: REQ_TYPE1|REQ_LEN0|REQ_LEN1|MEM_TYPE0|REQ_DEM_TYPE. 2: Enable edge filtering¹ in the CCCR.

Write WC Partial (BSQ)
  Description: Number of partial write transactions to WC-type memory. This event may undercount WC partials that originate from DWord operands.
  Event: BSQ_allocation
  Mask: 1: REQ_TYPE1|REQ_LEN0|MEM_TYPE0|REQ_DEM_TYPE. 2: Enable edge filtering¹ in the CCCR.

Writes WB Full (BSQ)
  Description: Number of writeback (evicted from cache) transactions to WB-type memory. These writebacks may not have a corresponding FSB IOQ transaction if a 3rd level cache is present.
  Event: BSQ_allocation
  Mask: 1: REQ_TYPE0|REQ_TYPE1|REQ_LEN0|REQ_LEN1|MEM_TYPE1|MEM_TYPE2|REQ_CACHE_TYPE|REQ_DEM_TYPE. 2: Enable edge filtering¹ in the CCCR.

Reads Non-prefetch Full (BSQ)
  Description: Number of read (excludes RFOs and HW|SW prefetches) transactions to WB-type memory. Beware of granularity issues with this event.
  Event: BSQ_allocation
  Mask: 1: REQ_LEN0|REQ_LEN1|MEM_TYPE1|MEM_TYPE2|REQ_CACHE_TYPE|REQ_DEM_TYPE. 2: Enable edge filtering¹ in the CCCR.

Reads Invalidate Full-RFO (BSQ)
  Description: Number of read invalidate (RFO) transactions to WB-type memory.
  Event: BSQ_allocation
  Mask: 1: REQ_TYPE0|REQ_LEN0|REQ_LEN1|MEM_TYPE1|MEM_TYPE2|REQ_CACHE_TYPE|REQ_ORD_TYPE|REQ_DEM_TYPE. 2: Enable edge filtering¹ in the CCCR.

UC Reads Chunk (BSQ)
  Description: Number of 8-byte aligned UC read transactions. Read requests associated with 16-byte operands may under-count.
  Event: BSQ_allocation
  Mask: 1: REQ_LEN0|REQ_ORD_TYPE|REQ_DEM_TYPE. 2: Enable edge filtering¹ in the CCCR.

UC Reads Chunk Split (BSQ)
  Description: Number of UC read transactions spanning an 8-byte boundary. Read requests may under-count if the data chunk straddles a 64-byte boundary.
  Event: BSQ_allocation
  Mask: 1: REQ_LEN0|REQ_SPLIT_TYPE|REQ_ORD_TYPE|REQ_DEM_TYPE. 2: Enable edge filtering¹ in the CCCR.

UC Write Partial (BSQ)
  Description: Number of UC write transactions. Beware of granularity issues between BSQ and FSB IOQ events.
  Event: BSQ_allocation
  Mask: 1: REQ_TYPE0|REQ_LEN0|REQ_SPLIT_TYPE|REQ_ORD_TYPE|REQ_DEM_TYPE. 2: Enable edge filtering¹ in the CCCR.

IO Reads Chunk (BSQ)
  Description: Number of 8-byte aligned IO port read transactions.
  Event: BSQ_allocation
  Mask: 1: REQ_LEN0|REQ_ORD_TYPE|REQ_IO_TYPE|REQ_DEM_TYPE. 2: Enable edge filtering¹ in the CCCR.

IO Writes Chunk (BSQ)
  Description: Number of IO port write transactions.
  Event: BSQ_allocation
  Mask: 1: REQ_TYPE0|REQ_LEN0|REQ_ORD_TYPE|REQ_IO_TYPE|REQ_DEM_TYPE. 2: Enable edge filtering¹ in the CCCR.

WB Writes Full Underway (BSQ)³
  Description: Accrued sum of the durations of writeback (evicted from cache) transactions to WB-type memory. Divide by Writes WB Full (BSQ) to estimate average request latency. Beware of effects of writebacks from the 2nd-level cache that are quickly satisfied from the 3rd-level cache (if present).
  Event: BSQ_active_entries
  Mask: REQ_TYPE0|REQ_TYPE1|REQ_LEN0|REQ_LEN1|MEM_TYPE1|MEM_TYPE2|REQ_CACHE_TYPE|REQ_DEM_TYPE.

UC Reads Chunk Underway (BSQ)³
  Description: Accrued sum of the durations of UC read transactions. Divide by UC Reads Chunk (BSQ) to estimate average request latency. Estimated latency may be affected by undercount in allocated entries.
  Event: BSQ_active_entries
  Mask: 1: REQ_LEN0|REQ_ORD_TYPE|REQ_DEM_TYPE. 2: Enable edge filtering¹ in the CCCR.

Write WC Partial Underway (BSQ)³
  Description: Accrued sum of the durations of partial write transactions to WC-type memory. Divide by Write WC Partial (BSQ) to estimate average request latency. Allocated entries of WC partials that originate from DWord operands are not included.
  Event: BSQ_active_entries
  Mask: 1: REQ_TYPE1|REQ_LEN0|MEM_TYPE0|REQ_DEM_TYPE. 2: Enable edge filtering¹ in the CCCR.

NOTES:
1. Set the following CCCR bits to make edge triggered: Compare=1; Edge=1; Threshold=0.
2. Must program both MSR_FSB_ESCR0 and MSR_FSB_ESCR1.
3. Must program both MSR_BSU_ESCR0 and MSR_BSU_ESCR1.
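For the two derived expressions in Table B-5 (Prefetch Ratio and Bus Utilization), the following minimal sketch shows the arithmetic once raw counts have been collected. The variable names and the raw count values, including the bus ratio, are purely illustrative assumptions, not part of the metric definitions:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Hypothetical raw counts from the events in Table B-5. */
    uint64_t bus_accesses         = 100000;  /* IOQ_allocation, with prefetches */
    uint64_t nonprefetch_accesses =  80000;  /* IOQ_allocation, no prefetches   */
    uint64_t fsb_data_ready       = 250000;  /* FSB_data_activity               */
    uint64_t nonsleep_clock_ticks = 9000000; /* Non-Sleep Clock Ticks           */
    unsigned bus_ratio            = 4;       /* core clocks per bus clock       */

    /* Prefetch Ratio = (Bus Accesses - Non-prefetch Bus Accesses) / Bus Accesses */
    double prefetch_ratio = (double)(bus_accesses - nonprefetch_accesses)
                          / (double)bus_accesses;

    /* Bus Utilization = FSB Data Ready * Bus_ratio * 100 / Non-Sleep Clock Ticks */
    double bus_utilization = (double)fsb_data_ready * bus_ratio * 100.0
                           / (double)nonsleep_clock_ticks;

    printf("Prefetch Ratio:  %.3f\n", prefetch_ratio);
    printf("Bus Utilization: %.1f%%\n", bus_utilization);
    return 0;
}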
Table B-6. Performance Metrics - Characterization

x87 Input Assists
  Description: Number of occurrences of x87 input operands needing assistance to handle an exception condition. This stat is often used in a per-instruction ratio.
  Event: X87_assists
  Mask: PREA

x87 Output Assists
  Description: Number of occurrences of x87 operations needing assistance to handle an exception condition.
  Event: X87_assists
  Mask: POAO, POAU

SSE Input Assists
  Description: Number of occurrences of SSE/SSE2 floating-point operations needing assistance to handle an exception condition. The number of occurrences includes speculative counts.
  Event: SSE_input_assist
  Mask: ALL

Packed SP Retired¹
  Description: Non-bogus packed single-precision instructions retired.
  Event: Execution_event; set this execution tag: Packed_SP_retired
  Mask: NONBOGUS0

Packed DP Retired¹
  Description: Non-bogus packed double-precision instructions retired.
  Event: Execution_event; set this execution tag: Packed_DP_retired
  Mask: NONBOGUS0

Scalar SP Retired¹
  Description: Non-bogus scalar single-precision instructions retired.
  Event: Execution_event; set this execution tag: Scalar_SP_retired
  Mask: NONBOGUS0

Scalar DP Retired¹
  Description: Non-bogus scalar double-precision instructions retired.
  Event: Execution_event; set this execution tag: Scalar_DP_retired
  Mask: NONBOGUS0

64-bit MMX Instructions Retired¹
  Description: Non-bogus 64-bit integer SIMD instructions (MMX instructions) retired.
  Event: Execution_event; set this execution tag: 64_bit_MMX_retired
  Mask: NONBOGUS0

128-bit MMX Instructions Retired¹
  Description: Non-bogus 128-bit integer SIMD instructions retired.
  Event: Execution_event; set this execution tag: 128_bit_MMX_retired
  Mask: NONBOGUS0

X87 Retired²
  Description: Non-bogus x87 floating-point instructions retired.
  Event: Execution_event; set this execution tag: X87_FP_retired
  Mask: NONBOGUS0

Stalled Cycles of Store Buffer Resources (non-standard³)
  Description: Duration of stalls due to lack of store buffers.
  Event: Resource_stall
  Mask: SBFULL

Stalls of Store Buffer Resources (non-standard³)
  Description: Number of allocation stalls due to lack of store buffers.
  Event: Resource_stall
  Mask: SBFULL (also set the following CCCR bits: Compare=1; Edge=1; Threshold=0)

NOTES:
1. Most MMX technology instructions, Streaming SIMD Extensions and Streaming SIMD Extensions 2 instructions decode into a single µop. There are some instructions that decode into several µops; in these limited cases, the metrics count the number of µops that are actually tagged.
2. Most commonly used x87 instructions (e.g., fmul, fadd, fdiv, fsqrt, fstp, etc.) decode into a single µop. However, transcendental and some other x87 instructions decode into several µops; in these limited cases, the metrics will count the number of µops that are actually tagged.
3. This metric may not be supported in all models of the Pentium 4 processor family.
Table B-7. Performance Metrics - Machine Clear

Machine Clear Count
  Description: Number of cycles that the entire pipeline of the machine is cleared, for all causes.
  Event: Machine_clear
  Mask: CLEAR (also set the following CCCR bits: Compare=1; Edge=1; Threshold=0)

Memory Order Machine Clear
  Description: Number of times the entire pipeline of the machine is cleared due to memory-ordering issues.
  Event: Machine_clear
  Mask: MOCLEAR

Self-modifying Code Clear
  Description: Number of times the entire pipeline of the machine is cleared due to self-modifying code issues.
  Event: Machine_clear
  Mask: SMCCLEAR
B.2.1 Trace Cache Events

The trace cache is not directly comparable to an instruction cache. The two are organized very differently. For example, a trace can span many lines' worth of instruction-cache data. As with most microarchitectural elements, trace cache performance is only an issue if something else is not a bigger bottleneck. If an application is bus bandwidth bound, the bandwidth at which the front end is delivering µops to the core may be irrelevant. When front-end bandwidth is an issue, the trace cache, in deliver mode, can issue µops to the core faster than either the decoder (build mode) or the microcode store (the MS ROM). Thus, the percent of time in trace cache deliver mode, or similarly, the percentage of all bogus and non-bogus µops from the trace cache, can be a useful metric for determining front-end performance.

The metric that is most analogous to an instruction cache miss is a trace cache miss. An unsuccessful lookup of the trace cache (colloquially, a miss) is not interesting, per se, if we are in build mode and don't find a trace available. We just keep building traces. The only penalty in that case is that we continue to have a lower front-end bandwidth. The trace cache miss metric that is currently used is not just any TC miss, but rather one that is incurred while the machine is already in deliver mode (for example, when a 15-20 cycle penalty is paid). Again, care must be exercised. A small average number of TC misses per instruction does not indicate good front-end performance if the percentage of time in deliver mode is also low.
B.2.2 Bus and Memory Metrics

In order to correctly interpret the observed counts of performance metrics related to bus events, it is helpful to understand transaction sizes, when entries are allocated in different queues, and how sectoring and prefetching affect counts.

Figure B-1 is a simplified block diagram of the sub-systems connected to the IOQ unit in the front side bus sub-system and the BSQ unit that interfaces to the IOQ. A two-way SMP configuration is illustrated. 1st level cache misses and writebacks (also called core references) result in references to the 2nd level cache. The Bus Sequence Queue (BSQ) holds requests from the processor core or prefetcher that are to be serviced on the front side bus (FSB), or in the local XAPIC. If a 3rd level cache is present on-die, the BSQ also holds writeback requests (dirty, evicted data) from the 2nd level cache. The FSB's IOQ holds requests that have gone out onto the front side bus.
Core references are nominally 64 bytes, the size of a 1st level cache line. Smaller sizes are called partials (uncacheable and write combining reads; uncacheable, write-through and write-protect writes; and all I/O). Writeback locks, streaming stores and write combining stores may be full line or partials. Partials are not relevant for cache references, since they are associated with non-cached data. Likewise, writebacks (due to the eviction of dirty data) and RFOs (reads for ownership due to program stores) are not relevant for non-cached data.

Figure B-1. Relationships Between Cache Hierarchy, IOQ, BSQ and FSB
(Block diagram: in each of two SMP processors, the 1st level data cache and unified 2nd level cache feed the BSQ, which connects through the 3rd level cache to the FSB_IOQ; both FSB_IOQs connect to the chip set and system memory.)

The granularity at which the core references are counted by the different bus and memory metrics listed in Table B-1 varies, depending on the underlying performance-monitoring events from which these bus and memory metrics are derived. The granularities of core references are listed below, according to the performance monitoring events documented in Appendix A of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B.
B.2.2.1 Reads due to program loads
• BSQ_cache_reference: 128 bytes for misses (on current implementations), 64 bytes for hits
• BSQ_allocation: 128 bytes for hits or misses (on current implementations), smaller for partials' hits or misses
• BSQ_active_entries: 64 bytes for hits or misses, smaller for partials' hits or misses
• IOQ_allocation, IOQ_active_entries: 64 bytes, smaller for partials' hits or misses

B.2.2.2 Reads due to program writes (RFOs)
• BSQ_cache_reference: 64 bytes for hits or misses
• BSQ_allocation: 64 bytes for hits or misses (the granularity for misses may change in future implementations of BSQ_allocation), smaller for partials' hits or misses
• BSQ_active_entries: 64 bytes for hits or misses, smaller for partials' hits or misses
• IOQ_allocation, IOQ_active_entries: 64 bytes for hits or misses, smaller for partials' hits or misses

B.2.2.3 Writebacks (dirty evictions)
• BSQ_cache_reference: 64 bytes
• BSQ_allocation: 64 bytes
• BSQ_active_entries: 64 bytes
• IOQ_allocation, IOQ_active_entries: 64 bytes
The count of IOQ allocations may exceed the count of corresponding BSQ allocations on current implementations for several reasons, including:
• Partials: In the FSB IOQ, any transaction smaller than 64 bytes is broken up into one to eight partials, each being counted separately as a one- to eight-byte chunk. In the BSQ, allocations of partials get a count of one. Future implementations will count each partial individually.
• Different transaction sizes: Allocations of non-partial programmatic load requests get a count of one per 128 bytes in the BSQ on current implementations, and a count of one per 64 bytes in the FSB IOQ. Allocations of RFOs get a count of one per 64 bytes for earlier processors and for the FSB IOQ (this granularity may change in future implementations).
• Retries: If the chipset requests a retry, the FSB IOQ allocations get one count per retry.

There are two noteworthy cases where there may be BSQ allocations without FSB IOQ allocations. The first is UC reads and writes to the local XAPIC registers. Second, if a cache line is evicted from the 2nd-level cache but it hits in the on-die 3rd-level cache, then a BSQ entry is allocated but no FSB transaction is necessary, and there will be no allocation in the FSB IOQ. The difference in the number of write transactions of the writeback (WB) memory type for the FSB IOQ and the BSQ can be an indication of how often this happens. It is less likely to occur for applications with poor locality of writes to the 3rd-level cache, and of course cannot happen when no 3rd-level cache is present.
B.2.3 Usage Notes for Specific Metrics

The difference between the metrics Reads from the Processor and Reads Non-prefetch from the Processor is nominally the number of hardware prefetches.

The paragraphs below cover several performance metrics that are based on the Pentium 4 processor performance-monitoring event BSQ_cache_reference. The metrics are:
• 2nd-Level Cache Read Misses
• 2nd-Level Cache Read References
• 3rd-Level Cache Read Misses
• 3rd-Level Cache Read References
• 2nd-Level Cache Reads Hit Shared
• 2nd-Level Cache Reads Hit Modified
• 2nd-Level Cache Reads Hit Exclusive
• 3rd-Level Cache Reads Hit Shared
• 3rd-Level Cache Reads Hit Modified
• 3rd-Level Cache Reads Hit Exclusive

These metrics based on BSQ_cache_reference may be useful as an indicator of the relative effectiveness of the 2nd-level cache, and the 3rd-level cache if present. But due to the current implementation of BSQ_cache_reference in Pentium 4 and Intel Xeon processors, they should not be used to calculate cache hit rates or cache miss rates. The following three paragraphs describe some of the issues related to BSQ_cache_reference, so that its results can be better interpreted.

Current implementations of the BSQ_cache_reference event do not distinguish between programmatic read and write misses. Programmatic writes that miss must get the rest of the cache line and merge the new data. Such a request is called a read for ownership (RFO). To the BSQ_cache_reference hardware, both a programmatic read and an RFO look like a data bus read, and are counted as such. Further distinction between programmatic reads and RFOs may be provided in future implementations.

Current implementations of the BSQ_cache_reference event can suffer from perceived over- or under-counting. References are based on BSQ allocations, as described above. Consequently, read misses are generally counted once per 128-byte line BSQ allocation (whether one or both sectors are referenced), but read and write (RFO) hits and most write (RFO) misses are counted once per 64-byte line, the size of a core reference. This makes the event counts for read misses appear to have a 2-times overcounting with respect to read and write (RFO) hits and write (RFO) misses. This granularity mismatch cannot always be corrected for, making it difficult to correlate to the number of programmatic misses and hits. If the user knows that both sectors in a 128-byte line are always referenced soon after each other, then the number of read misses can be multiplied by two to adjust miss counts to a 64-byte granularity.
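A trivial sketch of that adjustment follows; the helper name is illustrative, and the rescaling is valid only under the both-sectors-referenced assumption just stated:

#include <stdint.h>

/* Rescale BSQ_cache_reference read-miss counts (one count per
 * 128-byte line) to the 64-byte granularity used for hits and
 * RFO misses. Only valid if both sectors of each 128-byte line
 * are known to be referenced soon after each other. */
static inline uint64_t read_misses_64B(uint64_t read_misses_128B)
{
    return 2 * read_misses_128B;
}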
Prefetches themselves are not counted as either hits or misses, as of Pentium 4 and Intel Xeon processors with a CPUID signature of 0xF21. However, Pentium 4 processor implementations with a CPUID signature of 0xF07 and earlier have the problem that reads to lines that are already being prefetched are counted as hits in addition to misses, thus overcounting hits.

The number of Reads Non-prefetch from the Processor is a good approximation of the number of outermost cache misses due to loads or RFOs, for the writeback memory type.
B.2.4 Usage Notes on Bus Activities

A number of performance metrics in Table B-1 are based on IOQ_active_entries and BSQ_active_entries. The next three paragraphs provide information about the various bus transaction underway metrics. These metrics nominally measure the end-to-end latency of transactions entering the BSQ (the aggregate sum of the allocation-to-deallocation durations for the BSQ entries used for all individual transactions in the processor). They can be divided by the corresponding number-of-transactions metrics (those that measure allocations) to approximate an average latency per transaction. However, that approximation can be significantly higher than the number of cycles it takes to get the first chunk of data for the demand fetch (load), because the entire transaction must be completed before deallocation. That latency includes deallocation overheads, and the time to get the other half of the 128-byte line, which is called an adjacent-sector prefetch. Since adjacent-sector prefetches have lower priority than demand fetches, there is a high probability on a heavily utilized system that the adjacent-sector prefetch will have to wait until the next bus arbitration cycle from that processor. On current implementations, the granularities at which BSQ_allocation and BSQ_active_entries count can differ, leading to a possible 2-times overcounting of latencies for non-partial programmatic loads.
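As a concrete illustration of the division described above, the following minimal sketch approximates average bus read request latency from the two event counts. The raw count values are hypothetical and the variable names are invented for illustration:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Hypothetical counts from the two counters configured per
     * Table B-5 for "Bus Reads Underway from the processor"
     * (IOQ_active_entries) and "Reads from the Processor"
     * (IOQ_allocation). */
    uint64_t ioq_active_entries = 1500000;
    uint64_t ioq_allocations    = 12000;

    double avg_latency = (double)ioq_active_entries / (double)ioq_allocations;
    printf("approx. bus read request latency: %.1f cycles\n", avg_latency);

    /* Caveat from the text: this end-to-end latency includes
     * deallocation overhead and the adjacent-sector prefetch, so
     * it can be much larger than the latency to the first chunk
     * of demand data. */
    return 0;
}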
Users of the bus transaction underway metrics would be best served by employing them for relative comparisons across BSQ latencies of all transactions. Users that want to do cycle-by-cycle or type-by-type analysis should be aware that this event is known to be inaccurate for the UC Reads Chunk Underway and Write WC Partial Underway metrics. Relative changes to the average of all BSQ latencies should be viewed as an indication that overall memory performance has changed. That memory performance change may or may not be reflected in the measured FSB latencies.

For Pentium 4 and Intel Xeon processor implementations with an integrated 3rd-level cache, BSQ entries are allocated for all 2nd-level writebacks (replaced lines), not just those that become bus accesses (i.e., are also 3rd-level misses). This can decrease the average measured BSQ latencies for workloads that frequently thrash (miss or prefetch a lot into) the 2nd-level cache but hit in the 3rd-level cache. This effect may be less of a factor for workloads that miss all on-chip caches, since all BSQ entries due to such references will become bus transactions.
B.3 PERFORMANCE METRICS AND TAGGING MECHANISMS

A number of metrics require more tags to be specified in addition to programming a counting event. For example, the metric Split Loads Retired requires specifying a split_load_retired tag in addition to programming the replay_event to count at retirement. This section describes three sets of tags that are used in conjunction with three at-retirement counting events: front_end_event, replay_event, and execution_event. Please refer to Appendix A of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B for the description of the at-retirement events.

B.3.1 Tags for replay_event

Table B-8 provides a list of the tags that are used by various metrics in Tables B-1 through B-7. These tags enable you to mark µops at an earlier stage of execution and count the µops at retirement using the replay_event. These tags require at least two MSRs (see Table B-8, column 2 and column 3) to tag the µops so they can be detected at retirement. Some tags require an additional MSR (see Table B-8, column 4) to select the event types for these tagged µops. The event names referenced in column 4 are those from the Pentium 4 processor performance monitoring events (Section B.2).
Table B-8. Metrics That Utilize Replay Tagging Mechanism
(For each replay metric tag¹: bit fields to set in IA32_PEBS_ENABLE and MSR_PEBS_MATRIX_VERT, any additional MSR programming, and the event mask parameter for replay_event.)

• 1stL_cache_load_miss_retired: IA32_PEBS_ENABLE bits 0, 24, 25; MSR_PEBS_MATRIX_VERT bit 0; additional MSR: none; event mask: NBOGUS
• 2ndL_cache_load_miss_retired: IA32_PEBS_ENABLE bits 1, 24, 25; MSR_PEBS_MATRIX_VERT bit 0; additional MSR: none; event mask: NBOGUS
• DTLB_load_miss_retired: IA32_PEBS_ENABLE bits 2, 24, 25; MSR_PEBS_MATRIX_VERT bit 0; additional MSR: none; event mask: NBOGUS
• DTLB_store_miss_retired: IA32_PEBS_ENABLE bits 2, 24, 25; MSR_PEBS_MATRIX_VERT bit 1; additional MSR: none; event mask: NBOGUS
• DTLB_all_miss_retired: IA32_PEBS_ENABLE bits 2, 24, 25; MSR_PEBS_MATRIX_VERT bits 0 and 1; additional MSR: none; event mask: NBOGUS
• Tagged_mispred_branch: IA32_PEBS_ENABLE bits 15, 16, 24, 25; MSR_PEBS_MATRIX_VERT bit 4; additional MSR: none; event mask: NBOGUS
• MOB_load_replay_retired: IA32_PEBS_ENABLE bits 9, 24, 25; MSR_PEBS_MATRIX_VERT bit 0; additional MSR: select MOB_load_replay and set the PARTIAL_DATA and UNALGN_ADDR bits; event mask: NBOGUS
• Split_load_retired: IA32_PEBS_ENABLE bits 10, 24, 25; MSR_PEBS_MATRIX_VERT bit 0; additional MSR: select the Load_port_replay event on SAAT_CR_ESCR1 and set the SPLIT_LD bit; event mask: NBOGUS
• Split_store_retired: IA32_PEBS_ENABLE bits 10, 24, 25; MSR_PEBS_MATRIX_VERT bit 1; additional MSR: select the Store_port_replay event on SAAT_CR_ESCR0 and set the SPLIT_ST bit; event mask: NBOGUS

NOTES:
1. Certain kinds of µops cannot be tagged. These include I/O operations, UC and locked accesses, returns, and far transfers.
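As a rough illustration of the tagging recipe in Table B-8, the sketch below programs the two tagging MSRs for the Split_load_retired tag. The bit positions come from the table; the wrmsr helper and the MSR index values are assumptions standing in for a kernel-mode or driver interface and should be verified against the MSR listings in Volume 3B:

#include <stdint.h>

/* Hypothetical kernel-mode helper; real code would use a driver
 * or the operating system's MSR interface. */
extern void wrmsr(uint32_t msr, uint64_t value);

#define IA32_PEBS_ENABLE      0x3F1  /* assumed MSR index */
#define MSR_PEBS_MATRIX_VERT  0x3F2  /* assumed MSR index */

/* Tag µops for the Split Loads Retired metric (Table B-8):
 * bits 10, 24 and 25 in IA32_PEBS_ENABLE, bit 0 in
 * MSR_PEBS_MATRIX_VERT; the tagged µops are then counted at
 * retirement with replay_event using the NBOGUS event mask. */
static void tag_split_load_retired(void)
{
    wrmsr(IA32_PEBS_ENABLE, (1ULL << 10) | (1ULL << 24) | (1ULL << 25));
    wrmsr(MSR_PEBS_MATRIX_VERT, 1ULL << 0);
    /* The additional-MSR step from the table (select the
     * Load_port_replay event on SAAT_CR_ESCR1 and set the
     * SPLIT_LD bit) is also required and is omitted here. */
}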
B.3.2 Tags for front_end_event

Table B-9 provides a list of the tags that are used by various metrics derived from the front_end_event. The event names referenced in column 2 can be found in the Pentium 4 processor performance monitoring events.

B.3.3 Tags for execution_event

Table B-10 provides a list of the tags that are used by various metrics derived from the execution_event. These tags require programming an upstream ESCR to select an event mask with its TagUop and TagValue bit fields. The event mask for the downstream ESCR is specified in column 4. The event names referenced in column 4 can be found in the Pentium 4 processor performance monitoring events.
Table B-9. Metrics That Utilize the Front-end Tagging Mechanism
• Memory_loads¹: additional MSR: set the TAGLOADS bit in Uop_Type; event mask parameter for front_end_event: NBOGUS
• Memory_stores¹: additional MSR: set the TAGSTORES bit in Uop_Type; event mask parameter for front_end_event: NBOGUS

NOTES:
1. There may be some undercounting of front end events when there is an overflow or underflow of the floating point stack.
Table B-10. Metrics That Utilize the Execution Tagging Mechanism
(For each execution metric tag: the upstream ESCR programming, the tag value in the upstream ESCR, and the event mask parameter for execution_event.)

• Packed_SP_retired: set the ALL bit in the event mask and the TagUop bit in the ESCR of packed_SP_uop; tag value: 1; event mask: NBOGUS0
• Scalar_SP_retired: set the ALL bit in the event mask and the TagUop bit in the ESCR of scalar_SP_uop; tag value: 1; event mask: NBOGUS0
• Scalar_DP_retired: set the ALL bit in the event mask and the TagUop bit in the ESCR of scalar_DP_uop; tag value: 1; event mask: NBOGUS0
• 128_bit_MMX_retired: set the ALL bit in the event mask and the TagUop bit in the ESCR of 128_bit_MMX_uop; tag value: 1; event mask: NBOGUS0
• 64_bit_MMX_retired: set the ALL bit in the event mask and the TagUop bit in the ESCR of 64_bit_MMX_uop; tag value: 1; event mask: NBOGUS0
• X87_FP_retired: set the ALL bit in the event mask and the TagUop bit in the ESCR of x87_FP_uop; tag value: 1; event mask: NBOGUS0
Table B-11. New Metrics for Pentium 4 Processor (Family 15, Model 3)

Instructions Completed
  Description: Non-bogus instructions completed and retired.
  Event: instr_completed
  Mask: NBOGUS

Speculative Instructions Completed
  Description: Number of instructions decoded and executed speculatively.
  Event: instr_completed
  Mask: BOGUS
B.4 USING PERFORMANCE METRICS WITH HYPER-THREADING TECHNOLOGY

On Intel Xeon processors that support HT Technology, the performance metrics listed in Tables B-1 through B-7 may be qualified to associate the counts with a specific logical processor, provided the relevant performance monitoring events support qualification by logical processor. Within the subset of those performance metrics that support qualification by logical processors, some of them can be programmed with parallel ESCRs and CCCRs to collect separate counts for each logical processor simultaneously. For some metrics, qualification by logical processor is supported but there are not sufficient MSRs for simultaneous counting of the same metric on both logical processors. In both cases, it is also possible to program the relevant ESCR for a performance metric that supports qualification by logical processor to produce counts that are, typically, the sum of contributions from both logical processors.

A number of performance metrics are based on performance monitoring events that do not support qualification by logical processor. Any attempts to program the relevant ESCRs to qualify counts by logical processor will not produce different results. The results obtained in this manner should not be summed together.

The performance metrics listed in Tables B-1 through B-7 fall into three categories:
• Logical processor specific and supporting parallel counting
• Logical processor specific but constrained by ESCR limitations
• Logical processor independent and not supporting parallel counting

Table B-12 lists performance metrics in the first and second category. Table B-13 lists performance metrics in the third category.

There are four specific performance metrics related to the trace cache that are exceptions to the three categories above. They are:
• Logical Processor 0 Deliver Mode
• Logical Processor 1 Deliver Mode
• Logical Processor 0 Build Mode
• Logical Processor 1 Build Mode

Each of these four metrics cannot be qualified by programming bits 0 to 4 in the respective ESCR. However, it is possible and useful to collect two of these four metrics simultaneously.
Table B-12. Metrics Supporting Qualification by Logical Processor and Parallel Counting

General Metrics
  µops Retired
  Instructions Retired
  Instructions Completed
  Speculative Instructions Completed
  Non-Halted Clock Ticks
  Speculative Uops Retired

Branching Metrics
  Branches Retired
  Tagged Mispredicted Branches Retired
  Mispredicted Branches Retired
  All returns
  All indirect branches
  All calls
  All conditionals
  Mispredicted returns
  Mispredicted indirect branches
  Mispredicted calls
  Mispredicted conditionals

TC and Front End Metrics
  Trace Cache Misses
  ITLB Misses
  TC to ROM Transfers
  TC Flushes
  Speculative TC-Built µops
  Speculative TC-Delivered µops
  Speculative Microcode µops

Memory Metrics
  Split Load Replays¹
  Split Store Replays¹
  MOB Load Replays¹
  64k Aliasing Conflicts
  1st-Level Cache Load Misses Retired
  2nd-Level Cache Load Misses Retired
  DTLB Load Misses Retired
  Split Loads Retired¹
  Split Stores Retired¹
  MOB Load Replays Retired
  Loads Retired
  Stores Retired
  DTLB Store Misses Retired
  DTLB Load and Store Misses Retired
  2nd-Level Cache Read Misses
  2nd-Level Cache Read References
  3rd-Level Cache Read Misses
  3rd-Level Cache Read References
  2nd-Level Cache Reads Hit Shared
  2nd-Level Cache Reads Hit Modified
  2nd-Level Cache Reads Hit Exclusive
  3rd-Level Cache Reads Hit Shared
  3rd-Level Cache Reads Hit Modified
  3rd-Level Cache Reads Hit Exclusive

Bus Metrics
  Bus Accesses from the Processor¹
  Non-prefetch Bus Accesses from the Processor¹
  Reads from the Processor¹
  Writes from the Processor¹
  Reads Non-prefetch from the Processor¹
  All WC from the Processor¹
  All UC from the Processor¹
  Bus Accesses from All Agents¹
  Bus Accesses Underway from the processor¹
  Bus Reads Underway from the processor¹
  Non-prefetch Reads Underway from the processor¹
  All UC Underway from the processor¹
  All WC Underway from the processor¹
  Bus Writes Underway from the processor¹
  Bus Accesses Underway from All Agents¹
  Write WC Full (BSQ)¹
  Write WC Partial (BSQ)¹
  Writes WB Full (BSQ)¹
  Reads Non-prefetch Full (BSQ)¹
  Reads Invalidate Full-RFO (BSQ)¹
  UC Reads Chunk (BSQ)¹
  UC Reads Chunk Split (BSQ)¹
  UC Write Partial (BSQ)¹
  IO Reads Chunk (BSQ)¹
  IO Writes Chunk (BSQ)¹
  WB Writes Full Underway (BSQ)¹
  UC Reads Chunk Underway (BSQ)¹
  Write WC Partial Underway (BSQ)¹

Characterization Metrics
  x87 Input Assists
  x87 Output Assists
  Machine Clear Count
  Memory Order Machine Clear
  Self-Modifying Code Clear
  Scalar DP Retired
  Scalar SP Retired
  Packed DP Retired
  Packed SP Retired
  128-bit MMX Instructions Retired
  64-bit MMX Instructions Retired
  x87 Instructions Retired
  Stalled Cycles of Store Buffer Resources
  Stalls of Store Buffer Resources

NOTES:
1. Parallel counting is not supported due to ESCR restrictions.

Table B-13. Metrics Independent of Logical Processors

General Metrics
  Non-Sleep Clock Ticks

TC and Front End Metrics
  Page Walk Miss ITLB

Memory Metrics
  Page Walk DTLB All Misses
  All WCB Evictions
  WCB Full Evictions

Bus Metrics
  Bus Data Ready from the Processor

Characterization Metrics
  SSE Input Assists
B.5 USING PERFORMANCE EVENTS OF INTEL CORE SOLO AND INTEL CORE DUO PROCESSORS

There are performance events specific to the microarchitecture of Intel Core Solo and Intel Core Duo processors. See also: Appendix A of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B.

B.5.1 Understanding the Results in a Performance Counter

Each performance event detects a well-defined microarchitectural condition occurring in the core while the core is active. A core is active when:
• It's running code (excluding the halt instruction).
• It's being snooped by the other core or a logical processor on the platform. This can also happen when the core is halted.

Some microarchitectural conditions are applicable to a sub-system shared by more than one core, and some performance events provide an event mask (or unit mask) that allows qualification at the physical processor boundary or at bus agent boundary.
Some events allow qualifications that permit the counting of microarchitectural conditions associated with a particular core versus counts from all cores in a physical processor (see L2 and bus related events in Appendix A of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B).

When a multi-threaded workload does not use all cores continuously, a performance counter counting a core-specific condition may progress to some extent on the halted core and stop progressing; or a unit mask may be qualified to continue counting occurrences of the condition attributed to either processor core. Typically, one can adjust the highest two bits (bits 15:14 of the IA32_PERFEVTSELx MSR) in the unit mask field to distinguish such asymmetry (see Chapter 18, "Debugging and Performance Monitoring", of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B).
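To make the bit-field adjustment above concrete, here is a minimal sketch that packs an IA32_PERFEVTSELx value. The helper name and the sample counts of the fields are illustrative; per the text, the top two bits of the unit mask field (bits 15:14) carry the core specificity, with 11B selecting all cores (01B is assumed here to select the counting core; verify against Chapter 18):

#include <stdint.h>

/* Pack an IA32_PERFEVTSELx value with an explicit core-specificity
 * sub-field. core_spec occupies bits 15:14, i.e. the highest two
 * bits of the unit mask field discussed above. */
static uint64_t make_perfevtsel(uint8_t event, uint8_t umask_low6,
                                unsigned core_spec)
{
    uint64_t v = 0;
    v |= (uint64_t)event;                        /* bits 7:0   event select      */
    v |= ((uint64_t)umask_low6 & 0x3F) << 8;     /* bits 13:8  lower unit mask   */
    v |= (uint64_t)(core_spec & 3) << 14;        /* bits 15:14 core specificity  */
    v |= 1ULL << 16;                             /* USR: count user-mode events  */
    v |= 1ULL << 17;                             /* OS: count kernel-mode events */
    v |= 1ULL << 22;                             /* EN: enable the counter       */
    return v;
}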
There are three cycle-counting events which will not progress on a halted core, even if the halted core is being snooped. These are: Unhalted core cycles, Unhalted reference cycles, and Unhalted bus cycles. All three events are detected for the unit selected by event 3CH.

Some events detect microarchitectural conditions but are limited in their ability to identify the originating core or physical processor. For example, bus_drdy_clocks may be programmed with a unit mask of 20H to include all agents on a bus. In this case, the performance counter in each core will report nearly identical values. Performance tools interpreting counts must take into account that it is only necessary to equate bus activity with the event count from one core (and not use the sum from each core).

The above is also applicable when the core-specificity sub-field (bits 15:14 of the IA32_PERFEVTSELx MSR) within an event mask is programmed with 11B. The result reported by the performance counter on each core will be nearly identical.
B.5.2 Ratio Interpretation

Ratios of two events are useful for analyzing various characteristics of a workload. It may be possible to acquire such ratios at multiple granularities, for example: (1) per-application thread, (2) per logical processor, (3) per core, and (4) per physical processor.

The first ratio is most useful from a software development perspective, but requires multi-threaded applications to manage processor affinity explicitly for each application thread. The other options provide insights on hardware utilization.

In general, collect measurements (for all events in a ratio) in the same run. This should be done because:
• If measuring ratios for a multi-threaded workload, getting results for all events in the same run enables you to understand which event counter values belong to each thread.
• Some events, such as writebacks, may have non-deterministic behavior for different runs. In such a case, only measurements collected in the same run yield meaningful ratio values.
B.5.3 Notes on Selected Events

This section provides event-specific notes for interpreting performance events listed in Appendix A of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B.

• L2_Reject_Cycles, event number 30H: This event counts the cycles during which the L2 cache rejected new access requests.
• L2_No_Request_Cycles, event number 32H: This event counts cycles during which no requests from the L1 or prefetches to the L2 cache were issued.
• Unhalted_Core_Cycles, event number 3C, unit mask 00H: This event counts the smallest unit of time recognized by an active core. In many operating systems, the idle task is implemented using the HLT instruction. In such operating systems, clock ticks for the idle task are not counted. A transition due to Enhanced Intel SpeedStep Technology may change the operating frequency of a core. Therefore, using this event to initiate time-based sampling can create artifacts.
• Unhalted_Ref_Cycles, event number 3C, unit mask 01H: This event guarantees a uniform interval for each cycle being counted. Specifically, counts increment at bus clock cycles while the core is active. The cycles can be converted to the core clock domain by multiplying by the bus ratio which sets the core clock frequency (see the sketch at the end of this section).
• Serial_Execution_Cycles, event number 3C, unit mask 02H: This event counts the bus cycles during which the core is actively executing code (non-halted) while the other core in the physical processor is halted.
• L1_Pref_Req, event number 4FH, unit mask 00H: This event counts the number of times the Data Cache Unit (DCU) requests to prefetch a data cache line from the L2 cache. Requests can be rejected when the L2 cache is busy. Rejected requests are re-submitted.
• DCU_Snoop_to_Share, event number 78H, unit mask 01H: This event counts the number of times the DCU is snooped for a cache line needed by the other core. The cache line is missing in the L1 instruction cache or data cache of the other core; or it is set for read-only, when the other core wants to write to it. These snoops are done through the DCU store port. Frequent DCU snoops may conflict with stores to the DCU, and this may increase store latency and impact performance.
• Bus_Not_In_Use, event number 7DH, unit mask 00H: This event counts the number of bus cycles for which the core does not have a transaction waiting for completion on the bus.
• Bus_Snoops, event number 77H, unit mask 00H: This event counts the number of CLEAN, HIT, or HITM responses to external snoops detected on the bus. In a single-processor system, CLEAN and HIT responses are not likely to happen. In a multiprocessor system this event indicates an L2 miss in one processor that did not find the missed data on other processors. In a single-processor system, an HITM response indicates that an L1 miss (instruction or data) found the missed cache line in the other core in a modified state. In a multiprocessor system, this event also indicates that an L1 miss (instruction or data) found the missed cache line in another core in a modified state.
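A trivial sketch of the bus-to-core cycle conversion mentioned for Unhalted_Ref_Cycles follows; the helper name is illustrative and the bus ratio value is platform-specific:

#include <stdint.h>

/* Convert Unhalted_Ref_Cycles (counted in the bus clock domain)
 * to core clocks, per the note above. bus_ratio is the multiplier
 * that sets the core clock frequency from the bus clock. */
static inline uint64_t ref_cycles_to_core_cycles(uint64_t unhalted_ref_cycles,
                                                 uint64_t bus_ratio)
{
    return unhalted_ref_cycles * bus_ratio;
}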
B.6 DRILL-DOWN TECHNIQUES FOR PERFORMANCE ANALYSIS

Software performance intertwines code and microarchitectural characteristics of the processor. Performance monitoring events provide insights into these interactions. Each microarchitecture often provides a large set of performance events that target different sub-systems within the microarchitecture. Having a methodical approach to select key performance events will likely improve a programmer's understanding of the performance bottlenecks and improve the efficiency of code-tuning effort.

Recent generations of Intel 64 and IA-32 processors feature microarchitectures using an out-of-order execution engine. They are also accompanied by an in-order front end and retirement logic that enforces program order. Superscalar hardware, buffering and speculative execution often complicate the interpretation of performance events and software-visible performance bottlenecks.

This section discusses a methodology of using performance events to drill down on likely areas of performance bottleneck. By narrowing down to a small set of performance events, the programmer can take advantage of the Intel VTune Performance Analyzer to correlate performance bottlenecks with source code locations and apply coding recommendations discussed in Chapter 3 through Chapter 8. Although the general principles of our method can be applied to different microarchitectures, this section will use performance events available in processors based on Intel Core microarchitecture for simplicity.

Performance tuning usually centers around reducing the time it takes to complete a well-defined workload. Performance events can be used to measure the elapsed time between the start and end of a workload. Thus, reducing the elapsed time of completing a workload is equivalent to reducing measured processor cycles.

The drill-down methodology can be summarized as four phases of performance event measurements to help characterize interactions of the code with key pipe stages or sub-systems of the microarchitecture. The relation of the performance event drill-down methodology to the software tuning feedback loop is illustrated in Figure B-2.
Typically, the logic in performance monitoring hardware measures microarchitectural conditions that vary across different counting domains, ranging from cycles, micro-ops, and address references to instances. The drill-down methodology attempts to provide an intuitive, cycle-based view across different phases by making suitable approximations that are described below:

Figure B-2. Performance Events Drill-Down and Software Tuning Feedback Loop
(Flow: a start-to-finish view of Total_Cycles_Completion; an RS view splitting cycles into Issuing_uops and Not_issuing_uops; an execution view splitting into Retiring_uops, Non_retiring_uops and Stalled; a stalls drill-down into causes such as store forwarding, LCP and cache misses; a tuning focus of identifying hot-spot code and applying fixes such as code layout and branch-misprediction fixes or vectorizing with SIMD; for tuning consistency, apply one fix at a time and repeat from the top.)

• Total cycle measurement: This is the start-to-finish view of the total number of cycles to complete the workload of interest. In typical performance tuning situations, the metric Total_cycles can be measured by the event CPU_CLK_UNHALTED.CORE. See Appendix A, "Performance Monitoring Events", of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B.

• Cycle composition at issue port: The reservation station (RS) dispatches micro-ops for execution so that the program can make forward progress. Hence the metric Total_cycles can be decomposed as consisting of two exclusive components: Cycles_not_issuing_uops, representing cycles that the RS is not issuing micro-ops for execution, and Cycles_issuing_uops, cycles that the RS is issuing micro-ops for execution. The latter component includes µops in the architected code path or in the speculative code path.

• Cycle composition of OOO execution: The out-of-order engine provides multiple execution units that can execute µops in parallel. If one execution unit stalls, it does not necessarily imply that program execution is stalled. Our methodology attempts to construct a cycle-composition view that approximates the progress of program execution. The three relevant metrics are: Cycles_stalled, Cycles_not_retiring_uops, and Cycles_retiring_uops.

• Execution stall analysis: From the cycle compositions of overall program execution, the programmer can narrow down the selection of performance events to further pinpoint unproductive interaction between the workload and a microarchitectural sub-system.
When cycles lost to a stalled microarchitectural sub-system, or to unproductive speculative execution, are identified, the programmer can use the VTune Analyzer to correlate each significant performance impact to a source code location. If the performance impact of stalls or misprediction is insignificant, VTune can also identify the source locations of hot functions, so the programmer can evaluate the benefits of vectorization on those hot functions.
B.6.1 Cycle Composition at Issue Port

Recent processor microarchitectures employ out-of-order engines that execute streams of µops natively, while decoding program instructions into µops in the front end. The metric Total_cycles alone is opaque with respect to decomposing cycles that are productive or non-productive for program execution. To establish a consistent cycle-based decomposition, we construct two metrics that can be measured using performance events available in processors based on Intel Core microarchitecture. These are:

• Cycles_not_issuing_uops: This can be measured by the event RS_UOPS_DISPATCHED, setting the INV bit and specifying a counter mask (CMASK) value of 1 in the target performance event select (IA32_PERFEVTSELx) MSR (see Chapter 18 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B). In the VTune Analyzer, the special values for CMASK and INV are already configured for the VTune event name RS_UOPS_DISPATCHED.CYCLES_NONE.

• Cycles_issuing_uops: This can be measured using the event RS_UOPS_DISPATCHED, clearing the INV bit and specifying a counter mask (CMASK) value of 1 in the target performance event select MSR.

Note that the cycle decomposition view here is approximate in nature; it does not distinguish specificities, such as whether the RS is full or empty, or transient situations where the RS is empty but some in-flight µops are retiring.
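A minimal sketch of the two counter configurations described above follows. The field positions are the architectural ones from Chapter 18; the event select value for RS_UOPS_DISPATCHED is an assumption used for illustration and must be taken from Appendix A for the target processor:

#include <stdint.h>

#define RS_UOPS_DISPATCHED_EVT 0xA0  /* assumed event select value */

/* Pack an IA32_PERFEVTSELx value with a counter mask and
 * optional inversion of the CMASK comparison. */
static uint64_t perfevtsel(uint8_t event, uint8_t umask,
                           uint8_t cmask, int inv)
{
    uint64_t v = 0;
    v |= (uint64_t)event;               /* bits 7:0   event select  */
    v |= (uint64_t)umask << 8;          /* bits 15:8  unit mask     */
    v |= 1ULL << 16;                    /* USR: count in user mode  */
    v |= 1ULL << 17;                    /* OS: count in kernel mode */
    v |= 1ULL << 22;                    /* EN: enable the counter   */
    v |= (uint64_t)(inv ? 1 : 0) << 23; /* INV: invert CMASK test   */
    v |= (uint64_t)cmask << 24;         /* bits 31:24 counter mask  */
    return v;
}

/* Cycles_issuing_uops:     CMASK = 1, INV = 0 (cycles with >= 1 dispatch) */
/* Cycles_not_issuing_uops: CMASK = 1, INV = 1 (cycles with no dispatches) */
static void configure(uint64_t *issuing, uint64_t *not_issuing)
{
    *issuing     = perfevtsel(RS_UOPS_DISPATCHED_EVT, 0x00, 1, 0);
    *not_issuing = perfevtsel(RS_UOPS_DISPATCHED_EVT, 0x00, 1, 1);
}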
B.6.2 Cycle Composition of OOO Execution

In an OOO engine, speculative execution is an important part of making forward progress of the program. But speculative execution of µops in the shadow of a mispredicted code path represents unproductive work that consumes execution resources and execution bandwidth.

Cycles_not_issuing_uops, by definition, represents the cycles that the OOO engine is stalled (Cycles_stalled). As an approximation, this can be interpreted as the cycles that the program is not making forward progress.

The µops that are issued for execution do not necessarily end in retirement. Those µops that do not reach retirement do not help forward progress of program execution. Hence, a further approximation is made in the formalism of decomposition of Cycles_issuing_uops into:

• Cycles_non_retiring_uops: Although there isn't a direct event to measure the cycles associated with non-retiring µops, we will derive this metric from available performance events and several assumptions:
  - A constant issue rate of µops flowing through the issue port. Thus, we define: uops_rate = Dispatch_uops / Cycles_issuing_uops, where Dispatch_uops can be measured with RS_UOPS_DISPATCHED, clearing the INV bit and the CMASK.
  - We approximate the number of non-productive, non-retiring µops by [non_productive_uops = Dispatch_uops - executed_retired_uops], where executed_retired_uops represents productive µops contributing towards forward progress that consumed execution bandwidth.
  - The executed_retired_uops can be approximated by the sum of two contributions: num_retired_uops (measured by the event UOPS_RETIRED.ANY) and num_fused_uops (measured by the event UOPS_RETIRED.FUSED).
  Thus, Cycles_non_retiring_uops = non_productive_uops / uops_rate.

• Cycles_retiring_uops: This can be derived from Cycles_retiring_uops = num_retired_uops / uops_rate.
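The derivation above can be collected into a small routine, sketched below under the stated assumptions; the raw-count parameter names are illustrative labels for the events they correspond to:

#include <stdint.h>

typedef struct {
    double cycles_stalled;            /* = Cycles_not_issuing_uops */
    double cycles_non_retiring_uops;
    double cycles_retiring_uops;
} cycle_decomp;

/* Implements: uops_rate             = Dispatch_uops / Cycles_issuing_uops
 *             non_productive_uops   = Dispatch_uops - (retired + fused)
 *             Cycles_non_retiring   = non_productive_uops / uops_rate
 *             Cycles_retiring       = num_retired_uops / uops_rate        */
static cycle_decomp decompose(
    uint64_t cycles_issuing_uops,     /* RS_UOPS_DISPATCHED, CMASK=1        */
    uint64_t cycles_not_issuing_uops, /* RS_UOPS_DISPATCHED, CMASK=1, INV=1 */
    uint64_t dispatch_uops,           /* RS_UOPS_DISPATCHED, no CMASK       */
    uint64_t num_retired_uops,        /* UOPS_RETIRED.ANY                   */
    uint64_t num_fused_uops)          /* UOPS_RETIRED.FUSED                 */
{
    cycle_decomp d;
    double uops_rate = (double)dispatch_uops / (double)cycles_issuing_uops;
    double executed_retired_uops = (double)num_retired_uops
                                 + (double)num_fused_uops;
    double non_productive_uops = (double)dispatch_uops - executed_retired_uops;

    d.cycles_stalled           = (double)cycles_not_issuing_uops;
    d.cycles_non_retiring_uops = non_productive_uops / uops_rate;
    d.cycles_retiring_uops     = (double)num_retired_uops / uops_rate;
    return d;
}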
The cycle-decomposition methodology here does not distinguish situations where productive µops and non-productive µops may be dispatched in the same cycle into the OOO engine. This approximation may be reasonable because, heuristically, a high contribution of non-retiring µops likely correlates to situations of congestion in the OOO engine that subsequently cause the program to stall.

Evaluations of these three components, Cycles_non_retiring_uops, Cycles_stalled and Cycles_retiring_uops, relative to the Total_cycles, can help steer tuning effort in the following directions:
• If the contribution from Cycles_non_retiring_uops is high, focusing on code layout and reducing branch mispredictions will be important.
• If both the contributions from Cycles_non_retiring_uops and Cycles_stalled are insignificant, the focus for performance tuning should be directed to vectorization or other techniques to improve retirement throughput of hot functions.
• If the contribution from Cycles_stalled is high, additional drill-down may be necessary to locate bottlenecks that lie deeper in the microarchitecture pipeline.
B.6.3 Drill-Down on Performance Stalls

In some situations, it may be useful to evaluate cycles lost to stalls associated with various stress points in the microarchitecture and sum up the contributions from each candidate stress point. This approach implies a very gross simplification and introduces complications that may be difficult to reconcile with the superscalar nature and buffering in an OOO engine.

Due to the variations of counting domains associated with different performance events, cycle-based estimation of the performance impact at each stress point may carry different degrees of error due to over-estimation or under-estimation of exposures. Over-estimation is likely to occur when the overall performance impact for a given cause is estimated by multiplying the per-instance cost by an event count that measures the number of occurrences of that microarchitectural condition. Consequently, the sum of multiple contributions of lost cycles due to different stress points may exceed the more accurate metric Cycles_stalled.

However, an approach that sums up lost cycles associated with individual stress points may still be beneficial as an iterative indicator to measure the effectiveness of the code tuning loop when tuning code to fix the performance impact of each stress point. The remainder of this subsection will discuss a few common causes of performance bottlenecks that can be counted by performance events and fixed by following coding recommendations described in this manual.
The following items discuss several common stress points of the microarchitecture:
• L2 Miss Impact: An L2 load miss may expose the full latency of the memory sub-system. The latency of accessing system memory varies with different chipsets, generally on the order of more than a hundred cycles. Server chipsets tend to exhibit longer latency than desktop chipsets. The number of L2 cache miss references can be measured by MEM_LOAD_RETIRED.L2_LINE_MISS.
An estimation of overall L2 miss impact by multiplying system memory latency with the number of L2 misses ignores the OOO engine's ability to handle multiple outstanding load misses. Multiplying latency by the number of L2 misses implies that each L2 miss occurs serially.
To improve the accuracy of estimating L2 miss impact, an alternative technique should also be considered, using the event BUS_REQUEST_OUTSTANDING with a CMASK value of 1. This alternative technique effectively measures the cycles that the OOO engine is waiting for data from the outstanding bus read requests. It can overcome the over-estimation of multiplying memory latency with the number of L2 misses.
• L2 Hit Impact: Memory accesses served from the L2 incur the cost of L2 latency (see Table 2-3). The number of cache line references that hit the L2 can be measured by the difference between two events: MEM_LOAD_RETIRED.L1D_LINE_MISS - MEM_LOAD_RETIRED.L2_LINE_MISS.
An estimation of overall L2 hit impact by multiplying the L2 hit latency with the number of L2 hit references ignores the OOO engine's ability to handle multiple outstanding load misses.
• L1 DTLB Miss Impact: The cost of a DTLB lookup miss is about 10 cycles. The event MEM_LOAD_RETIRED.DTLB_MISS measures the number of load micro-ops that experienced a DTLB miss.
• LCP Impact: The overall impact of LCP stalls can be directly measured by the event ILD_STALLS. The event ILD_STALLS measures the number of times the slow decoder was triggered; the cost of each instance is 6 cycles.
• Store forwarding stall Impact: When a store forwarding situation does not meet the address or size requirements imposed by hardware, a stall occurs. The delay varies for different store forwarding stall situations. Consequently, there are several performance events that provide fine-grain specificity to detect different store-forwarding stall conditions. These include:
  - A load blocked by a preceding store to an unknown address: this situation can be measured by the event Load_Blocks.Sta. The per-instance cost is about 5 cycles.
  - A load partially overlapping with a preceding store, or a 4-KByte aliased address between a load and a preceding store: these two situations can be measured by the event Load_Blocks.Overlap_store.
  - A load spanning across a cache line boundary: this can be measured by Load_Blocks.Until_Retire. The per-instance cost is about 20 cycles.
B.7 EVENT RATIOS FOR INTEL CORE MICROARCHITECTURE
Appendix B.6 provides examples of using performance events to quickly diagnose performance bottlenecks. This section provides additional information on using performance events to evaluate metrics that can help in a wide range of performance analysis, workload characterization, and performance tuning. Note that many performance event names in the Intel Core microarchitecture carry the format of XXXX.YYY. This notation derives from the general convention that XXXX typically corresponds to a unique event select code in the performance event select register (IA32_PERFEVSELx), while YYY corresponds to a unique sub-event mask that uniquely defines a specific microarchitectural condition (see Chapter 18 and Appendix A of the "Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B").
B.7.1 Clocks Per Instructions Retired Ratio (CPI)
1. Clocks Per Instruction Retired Ratio (CPI): CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY.
The Intel Core microarchitecture is capable of reaching a CPI as low as 0.25 in ideal situations, but most code has a higher CPI. A greater CPI value for a given workload indicates more opportunity for code tuning to improve performance. CPI is an overall metric; it does not indicate which microarchitectural sub-system may be contributing to a high CPI value. A minimal sketch of computing the ratio follows.
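A minimal sketch, assuming the two raw counts have already been read from the counters; the sample values in the comment are hypothetical.

/* CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY */
double cpi(double core_cycles, double inst_retired)
{
    return core_cycles / inst_retired;
}

/* Example: cpi(2.0e9, 1.6e9) yields 1.25; values far above the
   ideal 0.25 suggest headroom for code tuning. */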
The following subsections define a list of event ratios that are useful for characterizing interactions with the front end, execution, and memory.
B.7.2 Front-end Ratios
2. RS Full Ratio: RESOURCE_STALLS.RS_FULL / CPU_CLK_UNHALTED.CORE * 100
3. ROB Full Ratio: RESOURCE_STALLS.ROB_FULL / CPU_CLK_UNHALTED.CORE * 100
4. Load or Store Buffer Full Ratio: RESOURCE_STALLS.LD_ST / CPU_CLK_UNHALTED.CORE * 100
When the ROB Full Ratio, RS Full Ratio, and Load or Store Buffer Full Ratio are all low but CPI is high, it is likely that the front end cannot supply instructions and micro-ops at a rate high enough to fill the buffers in the out-of-order engine, which is therefore starved waiting for micro-ops to execute. In this case check further for other front-end performance issues; a sketch of this heuristic follows.
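The interpretation above can be expressed as a simple heuristic over raw counts. A sketch follows; the 1.0 CPI and 5% thresholds are illustrative assumptions, not values specified by this manual.

#include <stdbool.h>

struct counts {
    double core_cycles;  /* CPU_CLK_UNHALTED.CORE */
    double inst_retired; /* INST_RETIRED.ANY */
    double rs_full;      /* RESOURCE_STALLS.RS_FULL */
    double rob_full;     /* RESOURCE_STALLS.ROB_FULL */
    double ld_st_full;   /* RESOURCE_STALLS.LD_ST */
};

static bool likely_front_end_starved(const struct counts *c)
{
    double cpi       = c->core_cycles / c->inst_retired;
    double rs_ratio  = 100.0 * c->rs_full / c->core_cycles;    /* Ratio 2 */
    double rob_ratio = 100.0 * c->rob_full / c->core_cycles;   /* Ratio 3 */
    double lsb_ratio = 100.0 * c->ld_st_full / c->core_cycles; /* Ratio 4 */

    /* Low buffer-full ratios together with a high CPI suggest the
       out-of-order engine is starved for micro-ops, not backed up. */
    return cpi > 1.0 && rs_ratio < 5.0 && rob_ratio < 5.0 && lsb_ratio < 5.0;
}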
B.7.2.1 Code Locality
5. Instruction Fetch Stall: CYCLES_L1I_MEM_STALLED / CPU_CLK_UNHALTED.CORE * 100
The Instruction Fetch Stall ratio is the percentage of cycles during which the Instruction Fetch Unit (IFU) cannot provide cache lines for decoding due to cache and Instruction TLB (ITLB) misses. A high value for this ratio indicates potential opportunities to improve performance by reducing the working set size of code pages and instructions being executed, hence improving code locality.
6. ITLB Miss Rate: ITLB_MISS_RETIRED / INST_RETIRED.ANY
A high ITLB Miss Rate indicates that the executed code is spread over too many pages and causes many Instruction TLB misses. Retired ITLB misses cause the pipeline to naturally drain, while the miss stalls fetching of more instructions.
7. L1 Instruction Cache Miss Rate: L1I_MISSES / INST_RETIRED.ANY
A high value for L1 Instruction Cache Miss Rate indicates that the code working set is bigger than the L1 instruction cache. Reducing the code working set may improve performance.
8. L2 Instruction Cache Line Miss Rate: L2_IFETCH.SELF.I_STATE / INST_RETIRED.ANY
An L2 Instruction Cache Line Miss Rate higher than zero indicates that instruction cache line misses from the L2 cache may have a noticeable impact on program performance.
B.7.2.2 Branching and Front-end
9. BACLEAR Performance Impact: 7 * BACLEARS / CPU_CLK_UNHALTED.CORE
A high value for the BACLEAR Performance Impact ratio usually indicates that the code has so many branches that they cannot all be consumed by the Branch Prediction Unit.
10. Taken Branch Bubble: (BR_TKN_BUBBLE_1 + BR_TKN_BUBBLE_2) / CPU_CLK_UNHALTED.CORE
A high value for the Taken Branch Bubble ratio indicates that the code contains many taken branches coming one after the other, causing bubbles in the front end. This may affect performance only if it is not covered by execution latencies and stalls later in the pipe.
B.7.2.3 Stack Pointer Tracker
11. ESP Synchronization: ESP.SYNCH / ESP.ADDITIONS
The ESP Synchronization ratio calculates the ratio of explicit ESP uses (for example, by a load or store instruction) to implicit uses (for example, by a PUSH or POP instruction). The expected ratio value is 0.2 or lower. If the ratio is higher, consider rearranging your code to avoid ESP synchronization events.
B.7.2.4 Macro-fusion
12. Macro-Fusion: UOPS_RETIRED.MACRO_FUSION / INST_RETIRED.ANY
The Macro-Fusion ratio calculates how many of the retired instructions were fused into a single micro-op. You may find this ratio high for a 32-bit binary executable but significantly lower for the equivalent 64-bit binary, with the 64-bit binary performing slower than the 32-bit one. A possible reason is that the 32-bit binary benefited significantly from macro-fusion.
B.7.2.5 Length Changing Prefix (LCP) Stalls
13. LCP Delays Detected: ILD_STALL / CPU_CLK_UNHALTED.CORE
A high value of the LCP Delays Detected ratio indicates that many Length Changing Prefix (LCP) delays occur in the measured code.
B.7.2.6 Self Modifying Code Detection
14. Self Modifying Code Clear Performance Impact: MACHINE_NUKES.SMC * 150 / CPU_CLK_UNHALTED.CORE * 100
A program that writes into code sections and shortly afterwards executes the generated code may incur severe penalties. Self Modifying Code Performance Impact estimates the percentage of cycles that the program spends on self-modifying code penalties.
B.7.3 Branch Prediction Ratios
Appendix B.7.2.2 discusses branching that impacts front-end performance. This section describes event ratios that are commonly used to characterize branch mispredictions.
B.7.3.1 Branch Mispredictions
15. Branch Misprediction Performance Impact: RESOURCE_STALLS.BR_MISS_CLEAR / CPU_CLK_UNHALTED.CORE * 100
With the Branch Misprediction Performance Impact, you can tell the percentage of cycles that the processor spends recovering from branch mispredictions.
16. Branch Misprediction per Micro-Op Retired: BR_INST_RETIRED.MISPRED / UOPS_RETIRED.ANY
The ratio Branch Misprediction per Micro-Op Retired indicates whether the code suffers from many branch mispredictions. In this case, improving the predictability of branches can have a noticeable impact on the performance of your code.
In addition, the performance impact of each branch misprediction might be high. This happens if the code prior to the mispredicted branch has a high CPI, for example due to cache misses that cannot be parallelized with following code because of the branch misprediction. Reducing the CPI of this code will reduce the misprediction performance impact. See other ratios to identify these cases.
You can use the precise event BR_INST_RETIRED.MISPRED to detect the actual targets of the mispredicted branches. This may help you identify the mispredicted branch.
B.7.3.2 Virtual Tables and Indirect Calls
17. Virtual Table Usage: BR_IND_CALL_EXEC / INST_RETIRED.ANY
A high value for the ratio Virtual Table Usage indicates that the code includes many indirect calls. The destination address of an indirect call is hard to predict.
18. Virtual Table Misuse: BR_CALL_MISSP_EXEC / BR_INST_RETIRED.MISPRED
A high value of the Branch Misprediction Performance Impact ratio (Ratio 15) together with a high Virtual Table Misuse ratio indicates that significant time is spent on mispredicted indirect function calls.
In addition to explicit use of function pointers in C code, indirect calls are used for implementing inheritance, abstract classes, and virtual methods in C++.
B.7.3.3 Mispredicted Returns
19. Mispredicted Return Instruction Rate: BR_RET_MISSP_EXEC / BR_RET_EXEC
The processor has a special mechanism that tracks CALL-RETURN pairs. The processor assumes that every CALL instruction has a matching RETURN instruction. If a RETURN instruction restores a return address which is not the one stored during the matching CALL, the code incurs a misprediction penalty.
B.7.4 Execution Ratios
This section covers event ratios that can provide insights into the interactions of micro-ops with the RS, ROB, execution units, and so on.
B.7.4.1 Resource Stalls
A high value for the RS Full Ratio (Ratio 2) indicates that the Reservation Station (RS) often gets full with μops due to long dependency chains. The μops that get into the RS cannot execute because they wait for their operands to be computed by previous μops, or they wait for a free execution unit. This prevents exploiting the parallelism provided by the multiple execution units.
A high value for the ROB Full Ratio (Ratio 3) indicates that the reorder buffer (ROB) often gets full with μops. This usually implies long-latency operations, such as L2 cache demand misses.
B.7.4.2 ROB Read Port Stalls
20. ROB Read Port Stall Rate: RAT_STALLS.ROB_READ_PORT / CPU_CLK_UNHALTED.CORE
The ratio ROB Read Port Stall Rate identifies ROB read port stalls. However, it should be used only if the number of resource stalls, as indicated by the Resource Stall Ratio, is low.
B.7.4.3 Partial Register Stalls
21. Partial Register Stalls Ratio: RAT_STALLS.PARTIAL_CYCLES / CPU_CLK_UNHALTED.CORE * 100
Frequent accesses to registers that cause partial stalls increase access latency and decrease performance. Partial Register Stalls Ratio is the percentage of cycles during which partial stalls occur.
B.7.4.4 Partial Flag Stalls
22. Partial Flag Stalls Ratio: RAT_STALLS.FLAGS / CPU_CLK_UNHALTED.CORE
Partial flag stalls have a high penalty and can be easily avoided. However, in some cases the Partial Flag Stalls Ratio might be high although there are no real flag stalls. A few instructions partially modify the RFLAGS register and may cause partial flag stalls. The most common are the shift instructions (SAR, SAL, SHR, and SHL) and the INC and DEC instructions.
B.7.4.5 Bypass Between Execution Domains
23. Delayed Bypass to FP Operation Rate: DELAYED_BYPASS.FP / CPU_CLK_UNHALTED.CORE
24. Delayed Bypass to SIMD Operation Rate: DELAYED_BYPASS.SIMD / CPU_CLK_UNHALTED.CORE
25. Delayed Bypass to Load Operation Rate: DELAYED_BYPASS.LOAD / CPU_CLK_UNHALTED.CORE
A domain bypass adds one cycle to instruction latency. You can use the above ratios to identify frequent domain bypasses in the code.
B.7.4.6 Floating Point Performance Ratios
26. Floating Point Instructions Ratio: X87_OPS_RETIRED.ANY / INST_RETIRED.ANY * 100
Significant floating-point activity indicates that specialized optimizations for floating-point algorithms may be applicable.
27. FP Assist Performance Impact: FP_ASSIST * 80 / CPU_CLK_UNHALTED.CORE * 100
A floating-point assist is activated for non-regular FP values like denormals and NaNs. FP assists are extremely slow compared to regular FP execution, and different assists incur different penalties. FP Assist Performance Impact estimates the overall impact.
28. Divider Busy: IDLE_DURING_DIV / CPU_CLK_UNHALTED.CORE * 100
A high value for the Divider Busy ratio indicates that for many cycles the divider is busy while no other execution unit or load operation is in progress. This ratio ignores L1 data cache misses and L2 cache misses that can be executed in parallel and hide the divider penalty.
29. Floating-Point Control Word Stall Ratio: RESOURCE_STALLS.FPCW / CPU_CLK_UNHALTED.CORE * 100
Frequent modifications of the Floating-Point Control Word (FPCW) might significantly decrease performance. The main reason for changing the FPCW is to change the rounding mode when doing FP to integer conversions.
B.7.5 Memory Sub-System - Access Conflicts Ratios
A high value for the Load or Store Buffer Full Ratio (Ratio 4) indicates that the load buffer or store buffer is frequently full, so new micro-ops cannot enter the execution pipeline. This can reduce execution parallelism and decrease performance.
30. Load Rate: L1D_CACHE_LD.MESI / CPU_CLK_UNHALTED.CORE
One memory read operation can be served by a core each cycle. A high Load Rate indicates that execution may be bound by memory read operations.
31. Store Order Block: STORE_BLOCK.ORDER / CPU_CLK_UNHALTED.CORE * 100
The Store Order Block ratio is the percentage of cycles during which store operations that miss the L2 cache block later stores from committing data to the memory sub-system. This behavior can further cause the store buffer to fill up (see Ratio 4).
B.7.5.1 Loads Blocked by the L1 Data Cache
32. Loads Blocked by L1 Data Cache Rate: LOAD_BLOCK.L1D / CPU_CLK_UNHALTED.CORE
A high value for "Loads Blocked by L1 Data Cache Rate" indicates that load operations are blocked by the L1 data cache due to lack of resources, usually as a result of many simultaneous L1 data cache misses.
B.7.5.2 4K Aliasing and Store Forwarding Block Detection
33. Loads Blocked by Overlapping Store Rate: LOAD_BLOCK.OVERLAP_STORE / CPU_CLK_UNHALTED.CORE
4K aliasing and store forwarding blocks are two different scenarios in which loads are blocked by preceding stores, for different reasons. Both scenarios are detected by the same event: LOAD_BLOCK.OVERLAP_STORE. A high value for "Loads Blocked by Overlapping Store Rate" indicates that either 4K aliasing or store forwarding blocks may affect performance.
B.7.5.3 Load Block by Preceding Stores
34. Loads Blocked by Unknown Store Address Rate: LOAD_BLOCK.STA / CPU_CLK_UNHALTED.CORE
A high value for "Loads Blocked by Unknown Store Address Rate" indicates that loads are frequently blocked by preceding stores with unknown addresses, which implies a performance penalty.
35. Loads Blocked by Unknown Store Data Rate: LOAD_BLOCK.STD / CPU_CLK_UNHALTED.CORE
A high value for "Loads Blocked by Unknown Store Data Rate" indicates that loads are frequently blocked by preceding stores with unknown data, which implies a performance penalty.
B.7.5.4 Memory Disambiguation
The memory disambiguation feature of the Intel Core microarchitecture eliminates most of the unnecessary load blocks caused by stores with unknown addresses. When this feature fails (possibly due to flaky load-store disambiguation cases), the event LOAD_BLOCK.STA will be counted, and so will MEMORY_DISAMBIGUATION.RESET.
B.7.5.5 Load Operation Address Translation
36. L0 DTLB Miss due to Loads - Performance Impact: DTLB_MISSES.L0_MISS_LD * 2 / CPU_CLK_UNHALTED.CORE
A high number of DTLB0 misses indicates that the data set used by the workload spans more pages than the DTLB0 covers. The high number of misses is expected to impact workload performance only if the CPI (Ratio 1) is low, around 0.8. Otherwise, it is likely that the DTLB0 miss cycles are hidden by other latencies.
B.7.6 Memory Sub-System - Cache Misses Ratios
B.7.6.1 Locating Cache Misses in the Code
The Intel Core microarchitecture provides precise events for retired load instructions that miss the L1 data cache or the L2 cache. As precise events, they provide the instruction pointer of the instruction following the one that caused the event. Therefore, the instruction that comes immediately prior to the pointed instruction is the one that causes the cache miss. These events are most helpful for quickly identifying which loads to focus on to fix a performance problem. The events are:
• MEM_LOAD_RETIRE.L1D_MISS
• MEM_LOAD_RETIRE.L1D_LINE_MISS
• MEM_LOAD_RETIRE.L2_MISS
• MEM_LOAD_RETIRE.L2_LINE_MISS
B.7.6.2 L1 Data Cache Misses
37. L1 Data Cache Miss Rate: L1D_REPL / INST_RETIRED.ANY
A high value for L1 Data Cache Miss Rate indicates that the code misses the L1 data cache too often and pays the penalty of accessing the L2 cache. See also "Loads Blocked by L1 Data Cache Rate" (Ratio 32).
You can separately count cache misses due to loads, stores, and locked operations using the events L1D_CACHE_LD.I_STATE, L1D_CACHE_ST.I_STATE, and L1D_CACHE_LOCK.I_STATE, respectively.
B.7.6.3 L2 Cache Misses
38. L2 Cache Miss Rate: L2_LINES_IN.SELF.ANY / INST_RETIRED.ANY
A high L2 Cache Miss Rate indicates that the running workload has a data set larger than the L2 cache. Some of the data might be evicted without being used. Unless all the required data is brought ahead of time by the hardware prefetcher or software prefetching instructions, bringing data from memory has a significant impact on performance.
39. L2 Cache Demand Miss Rate: L2_LINES_IN.SELF.DEMAND / INST_RETIRED.ANY
A high value for L2 Cache Demand Miss Rate indicates that the hardware prefetchers are not exploited to bring the data this workload needs. Data is brought from memory only when needed, and the workload bears memory latency for each such access.
B.7.7 Memory Sub-system - Prefetching
B.7.7.1 L1 Data Prefetching
The event L1D_PREFETCH.REQUESTS is counted whenever the DCU attempts to prefetch cache lines from the L2 (or memory) to the DCU. If you expect the DCU prefetchers to work and to count this event, but instead you detect the event MEM_LOAD_RETIRE.L1D_MISS, it might be that the IP prefetcher suffers from load instruction address collisions between several loads.
B.7.7.2 L2 Hardware Prefetching
With the event L2_LD.SELF.PREFETCH.MESI you can count the number of prefetch requests that were made to the L2 by the L2 hardware prefetchers. The actual number of cache lines prefetched to the L2 is counted by the event L2_LD.SELF.PREFETCH.I_STATE.
B.7.7.3 Software Prefetching
The events for software prefetching cover each level of prefetching separately.
40. Useful PrefetchNTA Ratio: SSE_PRE_MISS.NTA / SSE_PRE_EXEC.NTA * 100
41. Useful PrefetchT0 Ratio: SSE_PRE_MISS.L1 / SSE_PRE_EXEC.L1 * 100
42. Useful PrefetchT1 and PrefetchT2 Ratio: SSE_PRE_MISS.L2 / SSE_PRE_EXEC.L2 * 100
A low value for any of the prefetch usefulness ratios indicates that some of the SSE prefetch instructions prefetch data that is already in the caches.
43. Late PrefetchNTA Ratio: LOAD_HIT_PRE / SSE_PRE_EXEC.NTA
44. Late PrefetchT0 Ratio: LOAD_HIT_PRE / SSE_PRE_EXEC.L1
45. Late PrefetchT1 and PrefetchT2 Ratio: LOAD_HIT_PRE / SSE_PRE_EXEC.L2
A high value for any of the late prefetch ratios indicates that software prefetch instructions are issued too late, and the load operations that use the prefetched data end up waiting for the cache line to arrive. A sketch of increasing the prefetch distance follows.
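A common fix is to increase the prefetch distance so the line arrives before the demand load. A minimal sketch using the PREFETCHNTA intrinsic follows; the distance of 16 elements is a tuning assumption that must be calibrated per workload.

#include <xmmintrin.h>

#define PREFETCH_DISTANCE 16 /* elements ahead; workload-dependent assumption */

float sum_with_prefetch(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        /* Issue PREFETCHNTA early; if it is issued too late, LOAD_HIT_PRE
           rises and the Late PrefetchNTA Ratio becomes high. */
        if (i + PREFETCH_DISTANCE < n)
            _mm_prefetch((const char *)&a[i + PREFETCH_DISTANCE], _MM_HINT_NTA);
        s += a[i];
    }
    return s;
}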
B.7.8 Memory Sub-system - TLB Miss Ratios
46. TLB miss penalty: PAGE_WALKS.CYCLES / CPU_CLK_UNHALTED.CORE * 100
A high value for the TLB miss penalty ratio indicates that many cycles are spent on TLB misses. Reducing the number of TLB misses may improve performance. This ratio does not include DTLB0 miss penalties (see Ratio 36).
The following ratios help to focus on the kinds of memory accesses that cause TLB misses most frequently. See ITLB Miss Rate (Ratio 6) for TLB misses due to instruction fetch.
47. DTLB Miss Rate: DTLB_MISSES.ANY / INST_RETIRED.ANY
A high value for DTLB Miss Rate indicates that the code accesses too many data pages within a short time, causing many Data TLB misses.
48. DTLB Miss Rate due to Loads: DTLB_MISSES.MISS_LD / INST_RETIRED.ANY
A high value for DTLB Miss Rate due to Loads indicates that the code loads data from too many pages within a short time, causing many Data TLB misses. DTLB misses due to load operations may have a significant impact, since a DTLB miss increases the load operation latency. This ratio does not include DTLB0 miss penalties (see Ratio 36).
To precisely locate load instructions that caused DTLB misses, you can use the precise event MEM_LOAD_RETIRE.DTLB_MISS.
49. DTLB Miss Rate due to Stores: DTLB_MISSES.MISS_ST / INST_RETIRED.ANY
A high value for DTLB Miss Rate due to Stores indicates that the code accesses too many data pages within a short time, causing many Data TLB misses due to store
operations. These misses can impact performance if they do not occur in parallel with other instructions. In addition, if there are many stores in a row, some of them missing the DTLB, stalls may occur due to a full store buffer.
B.7.9 Memory Sub-system - Core Interaction
B.7.9.1 Modified Data Sharing
50. Modified Data Sharing Ratio: EXT_SNOOP.ALL_AGENTS.HITM / INST_RETIRED.ANY
Frequent occurrences of modified data sharing may be due to two threads using and modifying data laid out in one cache line. Modified data sharing causes L2 cache misses. When it happens unintentionally (also known as false sharing), it usually causes demand misses that have a high penalty. When false sharing is removed, code performance can dramatically improve; a padding sketch follows.
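The usual remedy is to keep data that different threads modify in different cache lines. The sketch below pads per-thread counters to a cache-line boundary; the 64-byte line size is an assumption to verify for the target processor.

#define CACHE_LINE_SIZE 64 /* assumed line size; check the target processor */

struct padded_counter {
    volatile long value;
    char pad[CACHE_LINE_SIZE - sizeof(long)]; /* keep neighbors off this line */
};

/* One entry per thread: counters[0] and counters[1] now reside in
   separate cache lines, so updates by two cores no longer produce
   HITM snoops for the same line. */
struct padded_counter counters[2];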
51. Local Modified Data Sharing Ratio: EXT_SNOOP.THIS_AGENT.HITM / INST_RETIRED.ANY
The Modified Data Sharing Ratio indicates the amount of total modified data sharing observed in the system. For systems with several processors, you can use the Local Modified Data Sharing Ratio to indicate the amount of modified data sharing between two cores in the same processor. (In systems with one processor the two ratios are similar.)
B.7.9.2 Fast Synchronization Penalty
52. Locked Operations Impact: (L1D_CACHE_LOCK_DURATION + 20 * L1D_CACHE_LOCK.MESI) / CPU_CLK_UNHALTED.CORE * 100
Fast synchronization is frequently implemented using locked memory accesses. A high value for Locked Operations Impact indicates that the locked operations used in the workload have a high penalty. The latency of a locked operation depends on the location of the data: L1 data cache, L2 cache, another core's cache, or memory.
B.7.9.3 Simultaneous Extensive Stores and Load Misses
53. Store Block by Snoop Ratio: (STORE_BLOCK.SNOOP / CPU_CLK_UNHALTED.CORE) * 100
A high value for Store Block by Snoop Ratio indicates that store operations are frequently blocked, reducing performance. This happens when one core executes a dense stream of stores while the other core in the processor frequently snoops it for cache lines missing in its own L1 data cache.
B.7.10 Memory Sub-system - Bus Characterization
B.7.10.1 Bus Utilization
54. Bus Utilization: BUS_TRANS_ANY.ALL_AGENTS * 2 / CPU_CLK_UNHALTED.BUS * 100
Bus Utilization is the percentage of bus cycles used for transferring bus transactions of any type. In single-processor systems most of the bus transactions carry data. In multiprocessor systems some of the bus transactions are used to coordinate cache states to keep data coherent.
55. Data Bus Utilization: BUS_DRDY_CLOCKS.ALL_AGENTS / CPU_CLK_UNHALTED.BUS * 100
Data Bus Utilization is the percentage of bus cycles used for transferring data among all bus agents in the system, including processors and memory. High bus utilization indicates heavy traffic between the processor(s) and memory. Memory sub-system latency can impact the performance of the program. For compute-intensive applications with high bus utilization, look for opportunities to improve data and code locality. For other types of applications (for example, copying large amounts of data from one memory area to another), try to maximize bus utilization.
56. Bus Not Ready Ratio: BUS_BNR_DRV.ALL_AGENTS * 2 / CPU_CLK_UNHALTED.BUS * 100
Bus Not Ready Ratio estimates the percentage of bus cycles during which new bus transactions cannot start. A high value for Bus Not Ready Ratio indicates that the bus is highly loaded. As a result of the Bus Not Ready (BNR) signal, new bus transactions might be deferred, and their latency will have a higher impact on program performance.
57. Burst Read in Bus Utilization: BUS_TRANS_BRD.SELF * 2 / CPU_CLK_UNHALTED.BUS * 100
A high value for Burst Read in Bus Utilization indicates that the bus and memory latency of burst read operations may impact the performance of the program.
58. RFO in Bus Utilization: BUS_TRANS_RFO.SELF * 2 / CPU_CLK_UNHALTED.BUS * 100
A high value for RFO in Bus Utilization indicates that the latency of Read For Ownership (RFO) transactions may impact the performance of the program. RFO transactions may have a higher impact on program performance than other burst read operations (for example, those resulting from loads that missed the L2). See also Ratio 31.
B.7.10.2 Modified Cache Lines Eviction
59. L2 Modified Lines Eviction Rate: L2_M_LINES_OUT.SELF.ANY / INST_RETIRED.ANY
When a new cache line is brought from memory, an existing cache line, possibly modified, is evicted from the L2 cache to make space for the new line. Frequent evictions of modified lines from the L2 cache increase the latency of L2 cache misses and consume bus bandwidth.
60. Explicit WB in Bus Utilization: BUS_TRANS_WB.SELF * 2 / CPU_CLK_UNHALTED.BUS * 100
Explicit Write-back in Bus Utilization considers modified cache line evictions not only from the L2 cache but also from the L1 data cache. It represents the percentage of bus cycles used for explicit write-backs from the processor to memory.
APPENDIX C
INSTRUCTION LATENCY AND THROUGHPUT
This appendix contains tables showing the latency and throughput associated with commonly used instructions (1). The instruction timing data varies across processor families/models. The appendix contains the following sections:
• Appendix C.1, "Overview": Provides an overview of issues related to instruction selection and scheduling.
• Appendix C.2, "Definitions": Presents definitions.
• Appendix C.3, "Latency and Throughput": Lists instruction throughput and latency associated with commonly-used instructions.
C.1 OVERVIEW
This appendix provides information to assembly language programmers and compiler writers. The information aids in the selection of instruction sequences (to minimize chain latency) and in the arrangement of instructions (to assist hardware processing). The performance impact of applying the information has been shown to be on the order of several percent. This is for applications not dominated by other performance factors, such as:
• cache miss latencies
• bus bandwidth
• I/O bandwidth
Instruction selection and scheduling matters when the programmer has already addressed the performance issues discussed in Chapter 2:
• observe store forwarding restrictions
• avoid cache line and memory order buffer splits
• do not inhibit branch prediction
• minimize the use of xchg instructions on memory locations
1. Although instruction latency may be useful in some limited situations (e.g., a tight loop with a dependency chain that exposes instruction latency), software optimization on super-scalar, out-of-order microarchitectures will, in general, benefit much more from increasing the effective throughput of the larger-scale code path. Coding techniques that rely on instruction latency alone to influence the scheduling of instructions are likely to be sub-optimal, as they tend to interfere with the out-of-order machine or restrict the amount of instruction-level parallelism.
While several items on the above list involve selecting the right instruction, this appendix focuses on the following issues. These are listed in priority order, though which item contributes most to performance varies by application:
• Maximize the flow of μops into the execution core. Instructions which consist of more than four μops require additional steps from microcode ROM. Instructions with longer μop flows incur a delay in the front end and reduce the supply of μops to the execution core.
In Pentium 4 and Intel Xeon processors, transfers to microcode ROM often reduce how efficiently μops can be packed into the trace cache. Where possible, it is advisable to select instructions with four or fewer μops. For example, a 32-bit integer multiply with a memory operand fits in the trace cache without going to microcode, while a 16-bit integer multiply to memory does not.
• Avoid resource conflicts. Interleaving instructions so that they don't compete for the same port or execution unit can increase throughput. For example, alternate PADDQ and PMULUDQ (each has a throughput of one issue per two clock cycles). When interleaved, they can achieve an effective throughput of one instruction per cycle because they use the same port but different execution units (see the sketch after this list). Selecting instructions with fast throughput also helps to preserve issue port bandwidth, hide latency, and allow for higher software performance.
• Minimize the latency of dependency chains that are on the critical path. For example, an operation to shift left by two bits executes faster when encoded as two adds than when it is encoded as a shift. If latency is not an issue, the shift results in a denser byte encoding.
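A sketch of the PADDQ/PMULUDQ interleaving idea, written with SSE2 intrinsics; it assumes the compiler preserves the alternation of the two independent operations in the issue stream.

#include <emmintrin.h>

void add_mul_interleaved(__m128i *acc_add, __m128i *acc_mul,
                         const __m128i *a, const __m128i *b, int n)
{
    for (int i = 0; i < n; i++) {
        /* PADDQ and PMULUDQ share an issue port but use different
           execution units, so alternating them can reach an effective
           throughput of one instruction per cycle. */
        *acc_add = _mm_add_epi64(*acc_add, a[i]); /* PADDQ   */
        *acc_mul = _mm_mul_epu32(*acc_mul, b[i]); /* PMULUDQ */
    }
}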
In addition to the general and specific rules, coding guidelines, and the instruction data provided in this manual, you can take advantage of the software performance analysis and tuning toolset available at http://developer.intel.com/software/products/index.htm. The tools include the Intel VTune Performance Analyzer, with its performance-monitoring capabilities.
C.2 DEFINITIONS
The data is listed in several tables. The tables contain the following:
• Instruction Name: The assembly mnemonic of each instruction.
• Latency: The number of clock cycles that are required for the execution core to complete the execution of all of the μops that form an instruction.
• Throughput: The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many instructions, the throughput of an instruction can be significantly less than its latency.
C.3 LATENCY AND THROUGHPUT
This section presents the latency and throughput information for commonly-used instructions, including: MMX technology, Streaming SIMD Extensions, subsequent generations of SIMD instruction extensions, and most of the frequently used general-purpose integer and x87 floating-point instructions.
Due to the complexity of dynamic execution and the out-of-order nature of the execution core, instruction latency data may not be sufficient to accurately predict realistic performance of actual code sequences by simply adding instruction latencies.
Instruction latency data is useful when tuning a dependency chain. However, dependency chains limit the out-of-order core's ability to execute micro-ops in parallel. Instruction throughput data is useful when tuning parallel code unencumbered by dependency chains; the sketch below contrasts the two regimes.
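The following sketch illustrates the distinction. Both functions perform the same additions; the first is a single dependency chain bound by add latency, while the second splits the work into four independent accumulators so it is bound by add throughput instead.

double sum_one_chain(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i]; /* each add must wait for the previous one */
    return s;
}

double sum_four_chains(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];     /* four independent chains can overlap */
        s1 += a[i + 1]; /* in the execution units, exposing    */
        s2 += a[i + 2]; /* throughput rather than latency      */
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i]; /* remainder */
    return (s0 + s1) + (s2 + s3);
}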
Numeric data in the tables is:
• approximate and subject to change in future implementations of the microarchitecture.
• not meant to be used as a reference for instruction-level performance benchmarks. Comparison of instruction-level performance of microprocessors that are based on different microarchitectures is a complex subject and requires information that is beyond the scope of this manual.
Comparisons of latency and throughput data between different microarchitectures can be misleading.
Appendix C.3.1 provides latency and throughput data for the register-to-register instruction type. Appendix C.3.3 discusses how to adjust latency and throughput specifications for the register-to-memory and memory-to-register instructions.
In some cases, the latency or throughput figures given are just one half of a clock. This occurs only for the double-speed ALUs.
C.3.1 Latency and Throughput with Register Operands
Instruction latency and throughput data are presented in Table C-1 through Table C-10. The tables include Supplemental Streaming SIMD Extensions 3, Streaming SIMD Extensions 3, Streaming SIMD Extensions 2, Streaming SIMD Extensions, MMX technology, and most common Intel 64 and IA-32 instructions. Instruction latency and throughput for different processor microarchitectures are in separate columns.
Processor instruction timing data may vary from one implementation to another. Intel 64 and IA-32 processors with different implementation characteristics can be identified by the encoded values of display_family and display_model. The definitions of display_family and display_model can be found in the reference pages of CPUID (see "Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2A"). The tables of instruction latency data are grouped by an abbreviated form of the hex values DisplayFamilyValue_DisplayModelValue. Processors based on Intel NetBurst microarchitecture have a DisplayFamilyValue of 0FH; the DisplayModelValue of processors based on Intel NetBurst microarchitecture ranges over 0, 1, 2, 3, 4, and 6. The data shown for 0F_03H also applies to 0F_04H and 0F_06H.
Pentium M processor instruction timing data is shown in the columns represented by DisplayFamilyValue_DisplayModelValue of 06_09H and 06_0DH.
Intel Core Solo and Intel Core Duo processors are represented by 06_0EH. Processors based on Intel Core microarchitecture are represented by 06_0FH.
Table C-1. Supplemental Streaming SIMD Extension 3 SIMD Instructions
                                     Latency (1)   Throughput
DisplayFamily_DisplayModel           06_0FH        06_0FH
PALIGNR mm1, mm2, imm                2             1
PALIGNR xmm1, xmm2, imm              2             1
PHADDD mm1, mm2                      3             2
PHADDD xmm1, xmm2                    5             3
PHADDW/PHADDSW mm1, mm2              5             4
PHADDW/PHADDSW xmm1, xmm2            6             4
PHSUBD mm1, mm2                      3             2
PHSUBD xmm1, xmm2                    5             3
PHSUBW/PHSUBSW mm1, mm2              5             4
PHSUBW/PHSUBSW xmm1, xmm2            6             4
PMADDUBSW mm1, mm2                   3             1
PMADDUBSW xmm1, xmm2                 3             1
PMULHRSW mm1, mm2                    3             1
PMULHRSW xmm1, xmm2                  3             1
PSHUFB mm1, mm2                      1             1
PSHUFB xmm1, xmm2                    3             2
PSIGNB/PSIGND/PSIGNW mm1, mm2        1             0.5
PSIGNB/PSIGND/PSIGNW xmm1, xmm2      1             0.5
See Appendix C.3.2, "Table Footnotes"
Table C-2. Streaming SIMD Extension 3 SIMD Floating-point Instructions
                              Latency (1)   Throughput   Execution Unit
DisplayFamily_DisplayModel    0F_03H        0F_03H       0F_03H
ADDSUBPD/ADDSUBPS             5             2            FP_ADD
HADDPD/HADDPS                 13            4            FP_ADD,FP_MISC
HSUBPD/HSUBPS                 13            4            FP_ADD,FP_MISC
MOVDDUP xmm1, xmm2            4             2            FP_MOVE
MOVSHDUP xmm1, xmm2           6             2            FP_MOVE
MOVSLDUP xmm1, xmm2           6             2            FP_MOVE
See Appendix C.3.2, "Table Footnotes"
Table C-2a. Streaming SIMD Extension 3 SIMD Floating-point Instructions
                              Latency (1)       Throughput
DisplayFamily_DisplayModel    06_0FH  06_0EH    06_0FH  06_0EH
ADDSUBPD/ADDSUBPS             3       4         1       2
HADDPD xmm1, xmm2             5       4         2       2
HADDPS xmm1, xmm2             9       6         4       4
HSUBPD xmm1, xmm2             5       4         2       2
HSUBPS xmm1, xmm2             9       6         4       4
MOVDDUP xmm1, xmm2            1       1         1       1
MOVSHDUP xmm1, xmm2           2       2         1       2
MOVSLDUP xmm1, xmm2           2       2
See Appendix C.3.2, "Table Footnotes"
Table C-3. Streaming SIMD Extension 2 128-bit Integer Instructions
                                        Latency (1)     Throughput      Execution Unit (2)
DisplayFamily_DisplayModel              0F_03H  0F_02H  0F_03H  0F_02H  0F_02H
CVTDQ2PS (3) xmm, xmm                   5       5       2       2       FP_ADD
CVTPS2DQ (3) xmm, xmm                   5       5       2       2       FP_ADD
CVTTPS2DQ (3) xmm, xmm                  5       5       2       2       FP_ADD
MOVD xmm, r32                           6       6       2       2       MMX_MISC,MMX_SHFT
MOVD r32, xmm                           10      10      1       1       FP_MOVE,FP_MISC
MOVDQA xmm, xmm                         6       6       1       1       FP_MOVE
MOVDQU xmm, xmm                         6       6       1       1       FP_MOVE
MOVDQ2Q mm, xmm                         8       8       2       2       FP_MOVE,MMX_ALU
MOVQ2DQ xmm, mm                         8       8       2       2       FP_MOVE,MMX_SHFT
MOVQ xmm, xmm                           2       2       2       2       MMX_SHFT
PACKSSWB/PACKSSDW/PACKUSWB xmm, xmm     4       4       2       2       MMX_SHFT
PADDB/PADDW/PADDD xmm, xmm              2       2       2       2       MMX_ALU
PADDSB/PADDSW/PADDUSB/PADDUSW xmm, xmm  2       2       2       2       MMX_ALU
PADDQ mm, mm                            2       2       1       1       FP_MISC
PSUBQ mm, mm                            2       2       1       1       FP_MISC
PADDQ/PSUBQ (3) xmm, xmm                6       6       2       2       FP_MISC
PAND xmm, xmm                           2       2       2       2       MMX_ALU
PANDN xmm, xmm                          2       2       2       2       MMX_ALU
PAVGB/PAVGW xmm, xmm                    2       2       2       2       MMX_ALU
PCMPEQB/PCMPEQD/PCMPEQW xmm, xmm        2       2       2       2       MMX_ALU
PCMPGTB/PCMPGTD/PCMPGTW xmm, xmm        2       2       2       2       MMX_ALU
PEXTRW r32, xmm, imm8                   7       7       2       2       MMX_SHFT,FP_MISC
PINSRW xmm, r32, imm8                   4       4       2       2       MMX_SHFT,MMX_MISC
PMADDWD xmm, xmm                        9       8       2       2       FP_MUL
PMAX xmm, xmm                           2       2       2       2       MMX_ALU
PMIN xmm, xmm                           2       2       2       2       MMX_ALU
PMOVMSKB (3) r32, xmm                   7       7       2       2       FP_MISC
PMULHUW/PMULHW/PMULLW (3) xmm, xmm      9       8       2       2       FP_MUL
PMULUDQ mm, mm                          9 8 1                           FP_MUL
PMULUDQ xmm, xmm                        9       8       2       2       FP_MUL
POR xmm, xmm                            2       2       2       2       MMX_ALU
PSADBW xmm, xmm                         4       4       2       2       MMX_ALU
PSHUFD xmm, xmm, imm8                   4       4       2       2       MMX_SHFT
PSHUFHW xmm, xmm, imm8                  2       2       2       2       MMX_SHFT
PSHUFLW xmm, xmm, imm8                  2       2       2       2       MMX_SHFT
PSLLDQ xmm, imm8                        4       4       2       2       MMX_SHFT
PSLLW/PSLLD/PSLLQ xmm, xmm/imm8         2       2       2       2       MMX_SHFT
PSRAW/PSRAD xmm, xmm/imm8               2       2       2       2       MMX_SHFT
PSRLDQ xmm, imm8                        4       4       2       2       MMX_SHFT
PSRLW/PSRLD/PSRLQ xmm, xmm/imm8         2       2       2       2       MMX_SHFT
PSUBB/PSUBW/PSUBD xmm, xmm              2       2       2       2       MMX_ALU
PSUBSB/PSUBSW/PSUBUSB/PSUBUSW xmm, xmm  2       2       2       2       MMX_ALU
PUNPCKHBW/PUNPCKHWD/PUNPCKHDQ xmm, xmm  4       4       2       2       MMX_SHFT
PUNPCKHQDQ xmm, xmm                     4       4       2       2       MMX_SHFT
PUNPCKLBW/PUNPCKLWD/PUNPCKLDQ xmm, xmm  2       2       2       2       MMX_SHFT
PUNPCKLQDQ (3) xmm, xmm                 4       4       1       1       FP_MISC
PXOR xmm, xmm                           2       2       2       2       MMX_ALU
See Appendix C.3.2, "Table Footnotes"
Table C-3a. Streaming SIMD Extension 2 128-bit Integer Instructions
                                        Latency (1)                  Throughput
DisplayFamily_DisplayModel              06_0FH 06_0EH 06_0DH 06_09H  06_0FH 06_0EH 06_0DH 06_09H
CVTDQ2PS xmm, xmm                       3 4 1 2
CVTPS2DQ xmm, xmm                       3      4      4      4       1      2      2      2
CVTTPS2DQ xmm, xmm                      3      4      4      4       1      2      2      2
MASKMOVDQU xmm, xmm                     8 2
MOVD xmm, r32                           1      1      1      1       0.5    0.5    0.5    0.5
MOVD xmm, r64                           1      N/A    N/A    N/A     0.5    N/A    N/A    N/A
MOVD r32, xmm                           1      1      1      1       1      1      1      1
MOVD r64, xmm                           1      N/A    N/A    N/A     0.33   N/A    N/A    N/A
MOVDQA xmm, xmm                         1      1      1      1       0.33   1      1      1
MOVDQU xmm, xmm                         1      1      1      1       0.5    1      1      1
MOVDQ2Q mm, xmm                         1 1 1 1 0.5 0.5 0.5
MOVQ2DQ xmm, mm                         1      1      1      1       1      1      1      1
MOVQ xmm, xmm                           1 1 1 0.33 1 1
PACKSSWB/PACKSSDW/PACKUSWB xmm, xmm     4      2      2      2       3      2      2      2
PADDB/PADDW/PADDD xmm, xmm              1      1      1      1       0.33   1      1      1
PADDSB/PADDSW/PADDUSB/PADDUSW xmm, xmm  1      1      1      1       0.33   1      1      1
PADDQ mm, mm                            2      2      2      2       1      1      1      1
PSUBQ mm, mm                            2      2      2      2       1      1      1      1
PADDQ/PSUBQ (3) xmm, xmm                2      3      3      3       1      2      2      2
PAND xmm, xmm                           1      1      1      1       0.33   1      1      1
PANDN xmm, xmm                          1 1 1 0.33 1 1 1
PAVGB/PAVGW xmm, xmm                    1      1      1      1       0.5    1      1      1
PCMPEQB/PCMPEQD/PCMPEQW xmm, xmm        1      1      1      1       0.33   1      1      1
PCMPGTB/PCMPGTD/PCMPGTW xmm, xmm        1      1      1      1       0.33   1      1      1
PEXTRW r32, xmm, imm8                   2      3      3      3       1      2      2      2
PINSRW xmm, r32, imm8                   3      2      2      2       1      2      2      2
PMADDWD xmm, xmm                        3      4      4      4       1      2      2      2
PMAX xmm, xmm                           1      1      1      1       0.5    1      1      1
PMIN xmm, xmm                           1      1      1      1       0.5    1      1      1
PMOVMSKB (3) r32, xmm                   1 1 4 1 1 1 3
PMULHUW/PMULHW/PMULLW xmm, xmm          3      4      4      4       1      2      2      2
PMULUDQ mm, mm                          3      4      4      4       1      1      1      1
PMULUDQ xmm, xmm                        3      8      8      8       1      2      2      2
POR xmm, xmm                            1      1      1      1       0.33   1      1      1
PSADBW xmm, xmm                         3      7      7      7       1      2      2      2
PSHUFD xmm, xmm, imm8                   2      2      2      2       2      2      2      2
PSHUFHW xmm, xmm, imm8                  1      1      1      1       1      1      1      1
PSHUFLW xmm, xmm, imm8                  1      1      1      1       1      1      1      1
PSLLDQ xmm, imm8                        2      4      4      4       2      3      3      3
PSLLW/PSLLD/PSLLQ xmm, xmm/imm8         2      2      2      2       1      2      2      2
PSRAW/PSRAD xmm, xmm/imm8               2      2      2      2       1      2      2      2
PSRLDQ xmm, imm8                        2 4 4 2 4 4
PSRLW/PSRLD/PSRLQ xmm, xmm/imm8         2      2      2      2       1      2      2      2
PSUBB/PSUBW/PSUBD xmm, xmm              1      1      1      1       0.33   1      1      1
PSUBSB/PSUBSW/PSUBUSB/PSUBUSW xmm, xmm  1      1      1      1       0.33   1      1      1
PUNPCKHBW/PUNPCKHWD/PUNPCKHDQ xmm, xmm  3      2      2      2       3      2      2      2
PUNPCKHQDQ xmm, xmm                     1      1      1      1       1      1      1      1
PUNPCKLBW/PUNPCKLWD/PUNPCKLDQ xmm, xmm  3      2      2      2       3      2      2      2
PUNPCKLQDQ xmm, xmm                     1      1      1      1       1      1      1      1
PXOR xmm, xmm                           1      1      1      1       0.33   1      1      1
See Appendix C.3.2, "Table Footnotes"
Table C-4. Streaming SIMD Extension 2 Double-precision Floating-point Instructions
                                Latency (1)     Throughput      Execution Unit (2)
DisplayFamily_DisplayModel      0F_03H  0F_02H  0F_03H  0F_02H  0F_02H
ADDPD xmm, xmm                  5       4       2       2       FP_ADD
ADDSD xmm, xmm                  5       4       2       2       FP_ADD
ANDNPD (3) xmm, xmm             4       4       2       2       MMX_ALU
ANDPD (3) xmm, xmm              4       4       2       2       MMX_ALU
CMPPD xmm, xmm, imm8            5       4       2       2       FP_ADD
CMPSD xmm, xmm, imm8            5       4       2       2       FP_ADD
COMISD xmm, xmm                 7       6       2       2       FP_ADD,FP_MISC
CVTDQ2PD xmm, xmm               8       8       3       3       FP_ADD,MMX_SHFT
CVTPD2PI mm, xmm                12      11      3       3       FP_ADD,MMX_SHFT,MMX_ALU
CVTPD2DQ xmm, xmm               10      9       2       2       FP_ADD,MMX_SHFT
CVTPD2PS (3) xmm, xmm           11      10      2       2       FP_ADD,MMX_SHFT
CVTPI2PD xmm, mm                12      11      2       4       FP_ADD,MMX_SHFT,MMX_ALU
CVTPS2PD (3) xmm, xmm           3 2 2                           FP_ADD,MMX_SHFT,MMX_ALU
CVTSD2SI r32, xmm               9       8       2       2       FP_ADD,FP_MISC
CVTSD2SS (3) xmm, xmm           17      16      2       4       FP_ADD,MMX_SHFT
CVTSI2SD (3) xmm, r32           16      15      2       3       FP_ADD,MMX_SHFT,MMX_MISC
CVTSS2SD (3) xmm, xmm           9       8       2       2
CVTTPD2PI mm, xmm               12      11      3       3       FP_ADD,MMX_SHFT,MMX_ALU
CVTTPD2DQ xmm, xmm              10      9       2       2       FP_ADD,MMX_SHFT
CVTTSD2SI r32, xmm              8       8       2       2       FP_ADD,FP_MISC
DIVPD xmm, xmm                  70      69      70      69      FP_DIV
DIVSD xmm, xmm                  39      38      39      38      FP_DIV
MAXPD xmm, xmm                  5       4       2       2       FP_ADD
MAXSD xmm, xmm                  5       4       2       2       FP_ADD
MINPD xmm, xmm                  5       4       2       2       FP_ADD
MINSD xmm, xmm                  5       4       2       2       FP_ADD
MOVAPD xmm, xmm                 6       6       1       1       FP_MOVE
MOVMSKPD r32, xmm               6       6       2       2       FP_MISC
MOVSD xmm, xmm                  6       6       2       2       MMX_SHFT
MOVUPD xmm, xmm                 6       6       1       1       FP_MOVE
MULPD xmm, xmm                  7       6       2       2       FP_MUL
MULSD xmm, xmm                  7       6       2       2       FP_MUL
ORPD (3) xmm, xmm               4       4       2       2       MMX_ALU
SHUFPD (3) xmm, xmm, imm8       6       6       2       2       MMX_SHFT
SQRTPD xmm, xmm                 70      69      70      69      FP_DIV
SQRTSD xmm, xmm                 39      38      39      38      FP_DIV
SUBPD xmm, xmm                  5       4       2       2       FP_ADD
SUBSD xmm, xmm                  5       4       2       2       FP_ADD
UCOMISD xmm, xmm                7       6       2       2       FP_ADD,FP_MISC
UNPCKHPD xmm, xmm               6       6       2       2       MMX_SHFT
UNPCKLPD (3) xmm, xmm           4       4       2       2       MMX_SHFT
XORPD (3) xmm, xmm              4       4       2       2       MMX_ALU
See Appendix C.3.2, "Table Footnotes"
Table C-4a. Streaming SIMD Extension 2 Double-precision Floating-point Instructions
                            Latency (1)                  Throughput
DisplayFamily_DisplayModel  06_0FH 06_0EH 06_0DH 06_09H  06_0FH 06_0EH 06_0DH 06_09H
ADDPD xmm, xmm              3      4      4      4       1      2      2      2
ADDSD xmm, xmm              3      3      3      3       1      1      1      1
ANDNPD xmm, xmm             1      1      1      1       1      1      1      1
ANDPD xmm, xmm              1      1      1      1       1      1      1      1
CMPPD xmm, xmm, imm8        3      4      4      4       1      2      2      2
CMPSD xmm, xmm, imm8        3      3      3      3       1      1      1      1
COMISD xmm, xmm             1      1      1      1       1      1      1      1
CVTDQ2PD xmm, xmm           5 1 4
CVTDQ2PS xmm, xmm           4 1
CVTPD2PI mm, xmm            5 1 3
CVTPD2DQ xmm, xmm           4 5 1 3
CVTPD2PS xmm, xmm           4      5      3      3       1      2      2      2
CVTPI2PD xmm, mm            4 5 5 5 1 4
CVT[T]PS2DQ xmm, xmm        3 1
CVTPS2PD xmm, xmm           2      3      3      3       2      3      3      3
CVTSD2SI r32, xmm           3      4      4      4       1      1      1      1
CVT[T]SD2SI r64, xmm        3      N/A    N/A    N/A     1      N/A    N/A    N/A
CVTSD2SS xmm, xmm           4      4      4      4       1      1      1      1
CVTSI2SD xmm, r32           4 4 4 1 1 1 1
CVTSI2SD xmm, r64           4      N/A    N/A    N/A     1      N/A    N/A    N/A
CVTSS2SD xmm, xmm           2      2      2      2       2      2      2      2
CVTTPD2PI mm, xmm           5 5 5 1 3
CVTTPD2DQ xmm, xmm          4 1
CVTTSD2SI r32, xmm          3      4      4      4       1      1      1      1
DIVPD xmm, xmm              32     63     63     63      31     62     62     62
DIVSD xmm, xmm              32     32     32     32      31     31     31     31
MAXPD xmm, xmm              3      4      4      4       1      2      2      2
MAXSD xmm, xmm              3      3      3      3       1      1      1      1
MINPD xmm, xmm              3      4      4      4       1      2      2      2
MINSD xmm, xmm              3      3      3      3       1      1      1      1
MOVAPD xmm, xmm             1      1      1      1       0.33   1      1      1
MOVMSKPD r32, xmm           1      1      1      1       1      1      1      1
MOVMSKPD r64, xmm           1      N/A    N/A    N/A     1      N/A    N/A    N/A
MOVSD xmm, xmm              1      1      1      1       0.33   0.5    0.5    0.5
MOVUPD xmm, xmm             1      1      1      1       0.5    1      1      1
MULPD xmm, xmm              5      7      7      7       1      4      4      4
MULSD xmm, xmm              5      5      5      5       1      2      2      2
ORPD xmm, xmm               1      1      1      1       1      1      1      1
SHUFPD xmm, xmm, imm8       1      2      2      2       1      2      2      2
SQRTPD xmm, xmm             58     115    115    115     57     114    114    114
SQRTSD xmm, xmm             58     58     58     58      57     57     57     57
SUBPD xmm, xmm              3      4      4      4       1      2      2      2
SUBSD xmm, xmm              3      3      3      3       1      1      1      1
UCOMISD xmm, xmm            1 1 1 1 1 1
UNPCKHPD xmm, xmm           1 1 1 1 1 1
UNPCKLPD xmm, xmm           1 1 1 1 1 1
XORPD (3) xmm, xmm          1 1 1 1 1 1
See Appendix C.3.2, "Table Footnotes"
Table C-5. Streaming SIMD Extension Single-precision Floating-point Instructions
                            Latency (1)     Throughput      Execution Unit (2)
DisplayFamily_DisplayModel  0F_03H  0F_02H  0F_03H  0F_02H  0F_02H
ADDPS xmm, xmm              5       4       2       2       FP_ADD
ADDSS xmm, xmm              5       4       2       2       FP_ADD
ANDNPS (3) xmm, xmm         4       4       2       2       MMX_ALU
ANDPS (3) xmm, xmm          4       4       2       2       MMX_ALU
CMPPS xmm, xmm              5       4       2       2       FP_ADD
CMPSS xmm, xmm              5       4       2       2       FP_ADD
COMISS xmm, xmm             7       6       2       2       FP_ADD,FP_MISC
CVTPI2PS xmm, mm            12      11      2       4       MMX_ALU,FP_ADD,MMX_SHFT
CVTPS2PI mm, xmm            8       7       2       2       FP_ADD,MMX_ALU
CVTSI2SS (3) xmm, r32       12      11      2       2       FP_ADD,MMX_SHFT,MMX_MISC
CVTSS2SI r32, xmm           9       8       2       2       FP_ADD,FP_MISC
CVTTPS2PI mm, xmm           8       7       2       2       FP_ADD,MMX_ALU
CVTTSS2SI r32, xmm          9       8       2       2       FP_ADD,FP_MISC
DIVPS xmm, xmm              40      39      17      39      FP_DIV
DIVSS xmm, xmm              32      23      17      23      FP_DIV
MAXPS xmm, xmm              5       4       2       2       FP_ADD
MAXSS xmm, xmm              5       4       2       2       FP_ADD
MINPS xmm, xmm              5       4       2       2       FP_ADD
MINSS xmm, xmm              5       4       2       2       FP_ADD
MOVAPS xmm, xmm             6       6       1       1       FP_MOVE
MOVHLPS (3) xmm, xmm        6       6       2       2       MMX_SHFT
MOVLHPS (3) xmm, xmm        4       4       2       2       MMX_SHFT
MOVMSKPS r32, xmm           6       6       2       2       FP_MISC
MOVSS xmm, xmm              4       4       2       2       MMX_SHFT
MOVUPS xmm, xmm             6       6       1       1       FP_MOVE
MULPS xmm, xmm              7       6       2       2       FP_MUL
MULSS xmm, xmm              7       6       2       2       FP_MUL
ORPS (3) xmm, xmm           4       4       2       2       MMX_ALU
RCPPS (3) xmm, xmm          6       6       4       4       MMX_MISC
RCPSS (3) xmm, xmm          6       6       2       2       MMX_MISC,MMX_SHFT
RSQRTPS (3) xmm, xmm        6       6       4       4       MMX_MISC
RSQRTSS (3) xmm, xmm        6       6       4       4       MMX_MISC,MMX_SHFT
SHUFPS (3) xmm, xmm, imm8   6       6       2       2       MMX_SHFT
SQRTPS xmm, xmm             40      39      40      39      FP_DIV
SQRTSS xmm, xmm             32      23      32      23      FP_DIV
SUBPS xmm, xmm              5       4       2       2       FP_ADD
SUBSS xmm, xmm              5       4       2       2       FP_ADD
UCOMISS xmm, xmm            7       6       2       2       FP_ADD,FP_MISC
UNPCKHPS (3) xmm, xmm       6       6       2       2       MMX_SHFT
UNPCKLPS (3) xmm, xmm       4       4       2       2       MMX_SHFT
XORPS (3) xmm, xmm          4       4       2       2       MMX_ALU
FXRSTOR                     150
FXSAVE                      100
See Appendix C.3.2
Table C-5a. Streaming SIMD Extension Single-precision Floating-point Instructions
                            Latency (1)                  Throughput
DisplayFamily_DisplayModel  06_0FH 06_0EH 06_0DH 06_09H  06_0FH 06_0EH 06_0DH 06_09H
ADDPS xmm, xmm              3      4      4      4       1      2      2      2
ADDSS xmm, xmm              3      3      3      3       1      1      1      1
ANDNPS xmm, xmm             1 2 1 2
ANDPS xmm, xmm              1 2 1 2
CMPPS xmm, xmm              3      4      4      4       1      2      2      2
CMPSS xmm, xmm              3      3      3      3       1      1      1      1
COMISS xmm, xmm             1      1      1      1       1      1      1      1
CVTPI2PS xmm, mm            3      3      3      3       1      1      1      1
CVTPS2PI mm, xmm            3 3 1 1
CVTSI2SS xmm, r32           4 4 1 2
CVTSS2SI r32, xmm           3      4      4      4       1      1      1      1
CVT[T]SS2SI r64, xmm        4      N/A    N/A    N/A     1      N/A    N/A    N/A
CVTTPS2PI mm, xmm           3      3      3      3       1      1      1      1
CVTTSS2SI r32, xmm          3      4      4      4       1      1      1      1
DIVPS xmm, xmm              18     35     35     35      17     34     34     34
DIVSS xmm, xmm              18     18     18     18      17     17     17     17
MAXPS xmm, xmm              3      4      4      4       1      2      2      2
MAXSS xmm, xmm              3      3      3      3       1      1      1      1
MINPS xmm, xmm              3      4      4      4       1      2      2      2
MINSS xmm, xmm              3      3      3      3       1      1      1      1
MOVAPS xmm, xmm             1      1      1      1       0.33   1      1      1
MOVHLPS xmm, xmm            1      1      1      1       1      0.5    0.5    0.5
MOVLHPS xmm, xmm            1      1      1      1       1      0.5    0.5    0.5
MOVMSKPS r32, xmm           1      1      1      1       1      1      1      1
MOVMSKPS r64, xmm           1      N/A    N/A    N/A     1      N/A    N/A    N/A
MOVSS xmm, xmm              1      1      1      1       0.33   0.5    0.5    0.5
MOVUPS xmm, xmm             1      1      1      1       0.5    1      1      1
MULPS xmm, xmm              4      5      5      5       1      2      2      2
MULSS xmm, xmm              4      4      4      4       1      1      1      1
ORPS xmm, xmm               1 2 0.33 2
RCPPS xmm, xmm              3 2 1 2
RCPSS xmm, xmm              3 1 1 1
RSQRTPS xmm, xmm            3 2 2 2
RSQRTSS xmm, xmm            3 2 1
SHUFPS xmm, xmm, imm8       4 2 3 2
SQRTPS xmm, xmm             29 29+28 28 58
SQRTSS xmm, xmm             29 30 28 29
SUBPS xmm, xmm              3 4 1 2
SUBSS xmm, xmm              3 3 1 1
UCOMISS xmm, xmm            1 1 1 1
UNPCKHPS xmm, xmm           4 3 3 2
UNPCKLPS xmm, xmm           4 3 3 2
XORPS xmm, xmm              1 2 0.33 2
FXRSTOR
FXSAVE
See Appendix C.3.2, "Table Footnotes"
Table C-6. Streaming SIMD Extension 64-bit Integer Instructions
                          Latency (1)     Throughput      Execution Unit
CPUID                     0F_03H  0F_02H  0F_03H  0F_02H  0F_02H
PAVGB/PAVGW mm, mm        2       2       1       1       MMX_ALU
PEXTRW r32, mm, imm8      7       7       2       2       MMX_SHFT,FP_MISC
PINSRW mm, r32, imm8      4       4       1       1       MMX_SHFT,MMX_MISC
PMAX mm, mm               2       2       1       1       MMX_ALU
PMIN mm, mm               2       2       1       1       MMX_ALU
PMOVMSKB (3) r32, mm      7       7       2       2       FP_MISC
PMULHUW (3) mm, mm        9       8       1       1       FP_MUL
PSADBW mm, mm             4       4       1       1       MMX_ALU
PSHUFW mm, mm, imm8       2       2       1       1       MMX_SHFT
See Appendix C.3.2, "Table Footnotes"
Table C-6a. Streaming SIMD Extension 64-bit Integer Instructions
                            Latency (1)                  Throughput
DisplayFamily_DisplayModel  06_0FH 06_0EH 06_0DH 06_09H  06_0FH 06_0EH 06_0DH 06_09H
MASKMOVQ mm, mm             3 1
PAVGB/PAVGW mm, mm          1      1      1      1       0.5    0.5    0.5    0.5
PEXTRW r32, mm, imm8        2*     2      2      2       1      1      1      1
PINSRW mm, r32, imm8        1      1      1      1       1      1      1      1
PMAX mm, mm                 1      1      1      1       0.5    0.5    0.5    0.5
PMIN mm, mm                 1      1      1      1       0.5    0.5    0.5    0.5
PMOVMSKB r32, mm            1 1 1 1 1 1 1
PMULHUW mm, mm              3      3      3      3       1      1      1      1
PSADBW mm, mm               3      5      5      5       1      2      2      2
PSHUFW mm, mm, imm8         1      1      1      1       1      1      1      1
See Appendix C.3.2, "Table Footnotes"
Table C-7. MMX Technology 64-bit Instructions
                                      Latency (1)     Throughput      Execution Unit (2)
DisplayFamily_DisplayModel            0F_03H  0F_02H  0F_03H  0F_02H  0F_02H
MOVD mm, r32                          2       2       1       1       MMX_ALU
MOVD (3) r32, mm                      5       5       1       1       FP_MISC
MOVQ mm, mm                           6       6       1       1       FP_MOV
PACKSSWB/PACKSSDW/PACKUSWB mm, mm     2       2       1       1       MMX_SHFT
PADDB/PADDW/PADDD mm, mm              2       2       1       1       MMX_ALU
PADDSB/PADDSW/PADDUSB/PADDUSW mm, mm  2       2       1       1       MMX_ALU
PAND mm, mm                           2       2       1       1       MMX_ALU
PANDN mm, mm                          2       2       1       1       MMX_ALU
PCMPEQB/PCMPEQD/PCMPEQW mm, mm        2       2       1       1       MMX_ALU
PCMPGTB/PCMPGTD/PCMPGTW mm, mm        2       2       1       1       MMX_ALU
PMADDWD (3) mm, mm                    9       8       1       1       FP_MUL
PMULHW/PMULLW (3) mm, mm              9       8       1       1       FP_MUL
POR mm, mm                            2       2       1       1       MMX_ALU
PSLLQ/PSLLW/PSLLD mm, mm/imm8         2       2       1       1       MMX_SHFT
PSRAW/PSRAD mm, mm/imm8               2       2       1       1       MMX_SHFT
PSRLQ/PSRLW/PSRLD mm, mm/imm8         2       2       1       1       MMX_SHFT
PSUBB/PSUBW/PSUBD mm, mm              2       2       1       1       MMX_ALU
PSUBSB/PSUBSW/PSUBUSB/PSUBUSW mm, mm  2       2       1       1       MMX_ALU
PUNPCKHBW/PUNPCKHWD/PUNPCKHDQ mm, mm  2       2       1       1       MMX_SHFT
PUNPCKLBW/PUNPCKLWD/PUNPCKLDQ mm, mm  2       2       1       1       MMX_SHFT
PXOR mm, mm                           2       2       1       1       MMX_ALU
EMMS (1)                              12 12
See Appendix C.3.2, "Table Footnotes"
Table C-8. MMX Technology 64-bit Instructions
                                      Latency (1)                  Throughput
DisplayFamily_DisplayModel            06_0FH 06_0EH 06_0DH 06_09H  06_0FH 06_0EH 06_0DH 06_09H
MOVD mm, r32                          1      1      1      1       0.5    0.5    0.5    0.5
MOVD r32, mm                          1      1      1      1       0.33   0.5    0.5    0.5
MOVQ mm, mm                           1      1      1      1       0.5    0.5    0.5    0.5
PACKSSWB/PACKSSDW/PACKUSWB mm, mm     1      1      1      1       1      1      1      1
PADDB/PADDW/PADDD mm, mm              1      1      1      1       0.33   1      1      1
PADDSB/PADDSW/PADDUSB/PADDUSW mm, mm  1      1      1      1       0.33   1      1      1
PAND mm, mm                           1      1      1      1       0.33   0.5    0.5    0.5
PANDN mm, mm                          1      1      1      1       0.33   0.5    0.5    0.5
PCMPEQB/PCMPEQD/PCMPEQW mm, mm        1      1      1      1       0.33   0.5    0.5    0.5
PCMPGTB/PCMPGTD/PCMPGTW mm, mm        1      1      1      1       0.33   0.5    0.5    0.5
PMADDWD mm, mm                        3      3      3      3       1      1      1      1
PMULHW/PMULLW (3) mm, mm              3      3      3      3       1      1      1      1
POR mm, mm                            1      1      1      1       0.33   0.5    0.5    0.5
PSLLQ/PSLLW/PSLLD mm, mm/imm8         1      1      1      1       1      1      1      1
PSRAW/PSRAD mm, mm/imm8               1      1      1      1       1      1      1      1
PSRLQ/PSRLW/PSRLD mm, mm/imm8         1      1      1      1       1      1      1      1
PSUBB/PSUBW/PSUBD mm, mm              1      1      1      1       0.33   0.5    0.5    0.5
PSUBSB/PSUBSW/PSUBUSB/PSUBUSW mm, mm  1      1      1      1       0.33   0.5    0.5    0.5
PUNPCKHBW/PUNPCKHWD/PUNPCKHDQ mm, mm  1      1      1      1       1      1      1      1
PUNPCKLBW/PUNPCKLWD/PUNPCKLDQ mm, mm  1      1      1      1       1      1      1      1
PXOR mm, mm                           1      1      1      1       0.33   0.5    0.5    0.5
EMMS (1)                              6 6 6 5 5 5
See Appendix C.3.2, "Table Footnotes"
C-21
INSTRUCTION LATENCY AND THROUGHPUT
Table C-9. x87 Floating-point Instructions

                                      Latency¹          Throughput        Execution Unit²
DisplayFamily_DisplayModel            0F_03H   0F_02H   0F_03H   0F_02H   0F_02H
FABS                                  3        2        1        1        FP_MISC
FADD                                  6        5        1        1        FP_ADD
FSUB                                  6        5        1        1        FP_ADD
FMUL                                  8        7        2        2        FP_MUL
FCOM                                  3        2        1        1        FP_MISC
FCHS                                  3        2        1        1        FP_MISC
FDIV Single Precision                 30       23       30       23       FP_DIV
FDIV Double Precision                 40       38       40       38       FP_DIV
FDIV Extended Precision               44       43       44       43       FP_DIV
FSQRT SP                              30       23       30       23       FP_DIV
FSQRT DP                              40       38       40       38       FP_DIV
FSQRT EP                              44       43       44       43       FP_DIV
F2XM1⁴                                100-200  90-150   60
FCOS⁴                                 180-280  190-240  130
FPATAN⁴                               220-300  150-300  140
FPTAN⁴                                240-300  225-250  170
FSIN⁴                                 160-200  160-180  130
FSINCOS⁴                              170-250  160-220  140
FYL2X⁴                                100-250  140-190  85
FYL2XP1⁴                              140-190  85
FSCALE⁴                               60                7
FRNDINT⁴                              30                11
FXCH⁵                                 0                 1                 FP_MOVE
FLDZ⁶                                 0
FINCSTP/FDECSTP⁶                      0
See Appendix C.3.2, "Table Footnotes"
Table C-9a. x87 Floating-point Instructions

                                      Latency¹                      Throughput
DisplayFamily_DisplayModel            06_0FH 06_0EH 06_0DH 06_09H   06_0FH 06_0EH 06_0DH 06_09H
FABS                                  1      1      1      1        1      1      1      1
FADD                                  3      3      3      3        1      1      1
FSUB                                  3      3      3      3        1      1      1      1
FMUL                                  5      5      5      5        2      2      2      2
FCOM                                  1      1      1      1        1      1      1      1
FCHS
FDIV Single Precision                 18     17
FDIV Double Precision                 32     31
FDIV Extended Precision               38     37
FSQRT Single Precision                29     28
FSQRT Double Precision                58     58     58     58       58     58     58     58
F2XM1⁴                                69     69     69              67     67     67
FCOS⁴                                 119    119    119             117    117    117
FPATAN⁴                               147    147    147             147    147    147
FPTAN⁴                                123    123    123             83     83     83
FSIN⁴                                 119    119    119             116    116    116
FSINCOS⁴                              119    119    119             85     85     85
FYL2X⁴                                96     96     96              92     92     92
FYL2XP1⁴                              98     98     98              93     93     93
FSCALE⁴                               17     17     17              15     15     15
FRNDINT⁴                              21     21     21              20     20     20
FXCH⁵
FLDZ⁶                                 1      1      1      1        1      1      1      1
FINCSTP/FDECSTP⁶                      1      1      1               1      1      1
See Appendix C.3.2, "Table Footnotes"
Table C-10. General Purpose Instructions

                                      Latency¹          Throughput        Execution Unit²
DisplayFamily_DisplayModel            0F_03H   0F_02H   0F_03H   0F_02H   0F_02H
ADC/SBB reg, reg                      8        8        3        3
ADC/SBB reg, imm                      8        6        2        2        ALU
ADD/SUB                               1        0.5      0.5      0.5      ALU
AND/OR/XOR                            1        0.5      0.5      0.5      ALU
BSF/BSR                               16       8        2        4
BSWAP                                 1        7        0.5      1        ALU
BTC/BTR/BTS                           8-9               1
CLI                                   26
CMP/TEST                              1        0.5      0.5      0.5      ALU
DEC/INC                               1        1        0.5      0.5      ALU
IMUL r32                              10       14       1        3        FP_MUL
IMUL imm32                            14                1        3        FP_MUL
IMUL                                  15-18             5
IDIV                                  66-80    56-70    30       23
IN/OUT¹                               <225              40
Jcc⁷                                  Not Applicable    0.5               ALU
LOOP                                  8                 1.5               ALU
MOV                                   1        0.5      0.5      0.5      ALU
MOVSB/MOVSW                           1        0.5      0.5      0.5      ALU
MOVZB/MOVZW                           1        0.5      0.5      0.5      ALU
NEG/NOT/NOP                           1        0.5      0.5      0.5      ALU
POP r32                               1.5               1                 MEM_LOAD, ALU
PUSH                                  1.5               1                 MEM_STORE, ALU
RCL/RCR reg, 1⁸                       6        4        1        1
ROL/ROR                               1        4        0.5      1
RET                                   8                 1                 MEM_LOAD, ALU
SAHF                                  1        0.5      0.5      0.5      ALU
SAL/SAR/SHL/SHR                       1        4        0.5      1
SCAS                                  4                 1.5               ALU, MEM_LOAD
SETcc                                 5                 1.5               ALU
STI                                   36
STOSB                                 5                 2                 ALU, MEM_STORE
XCHG                                  1.5      1.5      1        1        ALU
CALL                                  5                 1                 ALU, MEM_STORE
MUL                                   10       14-18    1        5
DIV                                   66-80    56-70    30       23
See Appendix C.3.2, "Table Footnotes"
Table C-10a. General Purpose Instructions

                                      Latency¹                      Throughput
DisplayFamily_DisplayModel            06_0FH 06_0EH 06_0DH 06_09H   06_0FH 06_0EH 06_0DH 06_09H
ADC/SBB reg, reg                      2      2      2      2        0.33   2      2      2
ADC/SBB reg, imm                      2      1      1      1        0.33   0.5    0.5    0.5
ADD/SUB                               1      1      1      1        0.33   0.5    0.5    0.5
AND/OR/XOR                            1      1      1      1        0.33   0.5    0.5    0.5
BSF/BSR                               2      2      2      2        1      1      1      1
BSWAP                                 2      2      2      2        0.5    1      1      1
BT                                    1                             0.33
BTC/BTR/BTS                           1      1      1      1        0.33   0.5    0.5    0.5
CBW                                   1                             0.33
CLC/CMC                               1                             0.33
CLI                                   9      11     11     11       9      11     11     11
CMOV                                  2                             0.5
CMP/TEST                              1      1      1      1        0.5    0.5    0.5    0.5
DEC/INC                               1      1      1      1        0.33   0.5    0.5    0.5
IMUL r32                              3      4      4      4        0.5    1      1      1
IMUL imm32                            3      4      4      4        0.5    1      1      1
IDIV                                  22     22     38              22     22     38
MOVSB/MOVSW                           1      1      1      1        0.33   0.5    0.5    0.5
MOVZB/MOVZW                           1      1      1      1        0.33   0.5    0.5    0.5
NEG/NOT/NOP                           1      1      1      1        0.33   0.5    0.5    0.5
PUSH                                  3      3      3      3        1      1      1      1
RCL/RCR                               1      1      1               1      1
ROL/ROR                               1      1      1      1        0.33   1      1      1
SAHF                                  1      1      1      1        0.33   0.5    0.5    0.5
SAL/SAR/SHL/SHR                       1                             0.33
SETcc                                 1      1      1      1        0.33   0.5    0.5    0.5
XCHG                                  3      2      2      2        1      1      1      1
See Appendix C.3.2, "Table Footnotes"
C.3.2 Table Footnotes

The following footnotes refer to all tables in this appendix.
1. Latency information for many instructions that are complex (> 4 µops) is an
estimate based on conservative (worst-case) assumptions. Actual performance of
these instructions by the out-of-order core execution unit can range from
somewhat faster to significantly faster than the latency data shown in these
tables.
2. The names of execution units apply to processor implementations of the Intel
NetBurst microarchitecture with a CPUID signature of family 15, model encoding
= 0, 1, 2. They include: ALU, FP_EXECUTE, FP_MOVE, MEM_LOAD, MEM_STORE.
See Figure 2-5 for execution units and ports in the out-of-order core. Note the
following:
• The FP_EXECUTE unit is actually a cluster of execution units, roughly
  consisting of seven separate execution units.
• The FP_ADD unit handles x87 and SIMD floating-point add and subtract
  operations.
• The FP_MUL unit handles x87 and SIMD floating-point multiply operations.
• The FP_DIV unit handles x87 and SIMD floating-point divide and square-root
  operations.
• The MMX_SHFT unit handles shift and rotate operations.
• The MMX_ALU unit handles SIMD integer ALU operations.
• The MMX_MISC unit handles reciprocal MMX computations and some integer
  operations.
• The FP_MISC designates other execution units in port 1 that are separated
  from the six units listed above.
3. It may be possible to construct repetitive calls to some Intel 64 and IA-32
instructions in code sequences to achieve latency that is one or two clock cycles
faster than the more realistic number listed in this table.
4. Latency and throughput of transcendental instructions can vary substantially in a
dynamic execution environment. Only an approximate value or a range of values
is given for these instructions.
5. The FXCH instruction has 0 latency in code sequences. However, it is limited to
an issue rate of one instruction per clock cycle.
6. The load constant instructions, FINCSTP, and FDECSTP have 0 latency in code
sequences.
7. Selection of conditional jump instructions should be based on the recommen-
dation of Section 3.4.1, "Branch Prediction Optimization," to improve the
predictability of branches. When branches are predicted successfully, the latency
of Jcc is effectively zero.
8. RCL/RCR with a shift count of 1 are optimized. RCL/RCR with a shift count other
than 1 execute more slowly. This applies to the Pentium 4 and Intel Xeon
processors.
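As an illustration of how to read these tables (a sketch, not part of the manual),
consider dependent versus independent ADDs. Using the 0F_03H columns of
Table C-10 (latency 1, throughput 0.5 for ADD/SUB), a dependence chain is paced by
latency while independent operations are paced by throughput:

    ; dependent: each ADD waits for the previous result
    ; cost is roughly 3 * latency = 3 cycles
    add eax, ebx
    add eax, ecx
    add eax, edx

    ; independent: limited only by throughput
    ; cost is roughly 3 * 0.5 = 1.5 cycles
    add eax, ebx
    add ecx, edx
    add esi, edi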
C.3.3 Latency and Throughput with Memory Operands

Typically, instructions with a memory address as the source operand add one more
µop to the reg, reg flavor of the same instruction. However, the throughput in most
cases remains the same because the load operation utilizes port 2 without affecting
port 0 or port 1.

Many instructions accept a memory address as either the source operand or the
destination operand. The former is commonly referred to as a load operation, the
latter as a store operation.

The latency of an instruction that performs either a load or a store operation is
typically longer than the latency of the corresponding register-to-register flavor of
the same instruction. This is because load and store operations require access to the
cache hierarchy and, in some cases, the memory sub-system.

For the sake of simplicity, all data being requested is assumed to reside in the first-
level data cache (cache hit). In general, instructions with load operations that
execute in the integer ALU units require two more clock cycles than the
corresponding register-to-register flavor of the same instruction. Throughput of
these instructions with a load operation remains the same as for the register-to-
register flavor of the instructions.

Floating-point, MMX technology, Streaming SIMD Extensions and Streaming SIMD
Extensions 2 instructions with load operations require 6 more clocks in latency than
the register-only version of the instructions, but throughput remains the same.

When store operations are on the critical path, their results can generally be
forwarded to a dependent load in as few as zero cycles. Thus, the latency to
complete a store is not relevant here.
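The following sketch (not from the manual) restates the rule of thumb above for an
integer ALU instruction, assuming a first-level cache hit:

    add eax, ebx        ; register-to-register flavor: 1-cycle latency
    add eax, [esi]      ; load flavor: one extra µop and about two extra
                        ; cycles of latency; the load uses port 2, so
                        ; throughput is unchanged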
APPENDIX D
STACK ALIGNMENT
This appendix details the alignment of stacks of data for Streaming SIMD
Extensions and Streaming SIMD Extensions 2.
D.1 STACK FRAMES

This section describes the stack alignment conventions for both ESP-based (normal)
and EBP-based (debug) stack frames. A stack frame is a contiguous block of memory
allocated to a function for its local memory needs. It contains space for the function's
parameters, return address, local variables, register spills, parameters needing to be
passed to other functions that a stack frame may call, and possibly others. It is
typically delineated in memory by a stack frame pointer (ESP) that points to the base
of the frame for the function and from which all data are referenced via appropriate
offsets. The convention on Intel 64 and IA-32 is to use the ESP register as the stack
frame pointer for normal optimized code, and to use EBP in place of ESP when debug
information must be kept. Debuggers use the EBP register to find the information
about the function via the stack frame.

It is important to ensure that the stack frame is aligned to a 16-byte boundary upon
function entry to keep local __m128 data, parameters, and XMM register spill
locations aligned throughout a function invocation. The Intel C++ Compiler for
Win32* Systems supports the conventions presented here, which help to prevent
memory references from incurring penalties due to misaligned data by keeping them
aligned to 16-byte boundaries. In addition, this scheme supports improved alignment
for __m64 and double type data by enforcing that these 64-bit data items are at
least eight-byte aligned (they will now be 16-byte aligned).

For variables allocated in the stack frame, the compiler cannot guarantee that the
base of the variable is aligned unless it also ensures that the stack frame itself is
16-byte aligned. Previous software conventions, as implemented in most compilers,
only ensure that individual stack frames are 4-byte aligned. Therefore, a function
called from a Microsoft-compiled function, for example, can only assume that the
frame pointer it used is 4-byte aligned.

Earlier versions of the Intel C++ Compiler for Win32 Systems attempted to provide
8-byte aligned stack frames by dynamically adjusting the stack frame pointer in the
prologue of main and preserving 8-byte alignment of the functions it compiles. This
technique is limited in its applicability for the following reasons:
• The main function must be compiled by the Intel C++ Compiler.
• There may be no functions in the call tree compiled by some other compiler (as
  might be the case for routines registered as callbacks).
• Support is not provided for proper alignment of parameters.
The solution to this problem is to have the function's entry point assume only 4-byte
alignment. If the function has a need for 8-byte or 16-byte alignment, then code can
be inserted to dynamically align the stack appropriately, resulting in one of the stack
frames shown in Figure 4-1.

As an optimization, an alternate entry point can be created that can be called when
proper stack alignment is provided by the caller. Using call graph profiling of the
VTune analyzer, calls to the normal (unaligned) entry point can be optimized into
calls to the (alternate) aligned entry point when the stack can be proven to be
properly aligned. Furthermore, a function alignment requirement attribute can be
modified throughout the call graph so as to cause the least number of calls to
unaligned entry points.

As an example of this, suppose function F has only a stack alignment requirement of
4, but it calls function G at many call sites, and in a loop. If G's alignment
requirement is 16, then by promoting F's alignment requirement to 16, and making
all calls to G go to its aligned entry point, the compiler can minimize the number of
times that control passes through the unaligned entry points. Example D-1 and
Example D-2 in the following sections illustrate this technique. Note the entry points
foo and foo.aligned; the latter is the alternate aligned entry point.
Figure 4-1. Stack Frames Based on Alignment Type
[Figure: side-by-side stack layouts. The EBP-based aligned frame holds parameters,
return address, padding, previous EBP, SEH/CEH record, local variables and spill
slots, the EBP-frame saved register area, and parameter passing space, with EBP,
ESP, and the parameter pointer marking the frame. The ESP-based aligned frame
holds parameters, return address, padding, register save area, local variables and
spill slots, and __cdecl/__stdcall parameter passing space, with ESP and the
parameter pointer marking the frame.]
D.1.1 Aligned ESP-Based Stack Frames

This section discusses data and parameter alignment and the declspec(align)
extended attribute, which can be used to request alignment in C and C++ code. In
creating ESP-based stack frames, the compiler adds padding between the return
address and the register save area as shown in Example 4-9. This frame can be used
only when debug information is not requested, there is no need for exception
handling support, inlined assembly is not used, and there are no calls to alloca within
the function.

If the above conditions are not met, an aligned EBP-based frame must be used.
When using this type of frame, the sum of the sizes of the return address, saved
registers, local variables, register spill slots, and parameter space must be a multiple
of 16 bytes. This causes the base of the parameter space to be 16-byte aligned. In
addition, any space reserved for passing parameters for stdcall functions also must
be a multiple of 16 bytes. This means that the caller needs to clean up some of the
stack space when the size of the parameters pushed for a call to a stdcall function is
not a multiple of 16. If the caller does not do this, the stack pointer is not restored to
its pre-call value.

In Example D-1, we have 12 bytes on the stack after the point of alignment from the
caller: the return pointer, EBX and EDX. Thus, we need to add four more bytes to the
stack pointer to achieve alignment. Assuming 16 bytes of stack space are needed for
local variables, the compiler adds 16 + 4 = 20 bytes to ESP, making ESP aligned to a
0 mod 16 address.
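As a minimal sketch of requesting alignment from C/C++ (not taken from the
manual; shown with the Microsoft-compatible __declspec(align()) spelling of the
attribute and hypothetical variable names):

    #include <xmmintrin.h>

    __declspec(align(16)) float coeffs[4];  /* 16-byte aligned storage */

    void scale4(void)
    {
        /* _mm_load_ps requires a 16-byte aligned operand */
        __m128 v = _mm_load_ps(coeffs);
        v = _mm_add_ps(v, v);
        _mm_store_ps(coeffs, v);
    }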
Example D-1. Aligned esp-Based Stack Frame
void _cdecl foo (int k)
{
    int j;

foo:                        // See Note A below
    push ebx
    mov  ebx, esp
    sub  esp, 0x00000008
    and  esp, 0xfffffff0
    add  esp, 0x00000008
    jmp  common

foo.aligned:
    push ebx
    mov  ebx, esp

common:                     // See Note B below
    push edx
    sub  esp, 20
    j = k;
    mov  edx, [ebx + 8]
    mov  [esp + 16], edx
    foo(5);
    mov  [esp], 5
    call foo.aligned
    return j;
    mov  eax, [esp + 16]
    add  esp, 20
    pop  edx
    mov  esp, ebx
    pop  ebx
    ret
}
// NOTES:
// (A) Aligned entry points assume that parameter block beginnings are aligned. This places the
// stack pointer at a 12 mod 16 boundary, as the return pointer has been pushed. Thus, the
// unaligned entry point must force the stack pointer to this boundary.
// (B) The code at the common label assumes the stack is at an 8 mod 16 boundary, and adds
// sufficient space to the stack so that the stack pointer is aligned to a 0 mod 16 boundary.
D.1.2 Aligned EBP-Based Stack Frames

In EBP-based frames, padding is also inserted immediately before the return
address. However, this frame is slightly unusual in that the return address may
actually reside in two different places in the stack. This occurs whenever padding
must be added and exception handling is in effect for the function. Example D-2
shows the code generated for this type of frame. The stack location of the return
address is aligned 12 mod 16. This means that the value of EBP always satisfies the
condition (EBP & 0x0f) == 0x08. In this case, the sum of the sizes of the return
address, the previous EBP, the exception handling record, the local variables, and
the spill area must be a multiple of 16 bytes. In addition, the parameter passing
space must be a multiple of 16 bytes. For a call to a stdcall function, it is necessary
for the caller to reserve some stack space if the size of the parameter block being
pushed is not a multiple of 16.
Example D-2. Aligned ebp-based Stack Frames
void _stdcall foo (int k)
{
    int j;

foo:
    push ebx
    mov  ebx, esp
    sub  esp, 0x00000008
    and  esp, 0xfffffff0
    add  esp, 0x00000008    // esp is (8 mod 16) after add
    jmp  common

foo.aligned:
    push ebx                // esp is (8 mod 16) after push
    mov  ebx, esp

common:
    push ebp                // this slot will be used for
                            // duplicate return pt
    push ebp                // esp is (0 mod 16) after push
                            // (rtn,ebx,ebp,ebp)
    mov  ebp, [ebx + 4]     // fetch return pointer and store
    mov  [esp + 4], ebp     // relative to ebp
                            // (rtn,ebx,rtn,ebp)
    mov  ebp, esp           // ebp is (0 mod 16)
    sub  esp, 28            // esp is (4 mod 16)
                            // see Note A below
    push edx                // esp is (0 mod 16) after push
                            // goal is to make esp and ebp
                            // (0 mod 16) here
    j = k;
    mov  edx, [ebx + 8]     // k is (0 mod 16) if caller
                            // aligned its stack
    mov  [ebp - 16], edx    // j is (0 mod 16)
    foo(5);
    add  esp, -4            // normal call sequence to
                            // unaligned entry
    mov  [esp], 5
    call foo                // for stdcall, callee
                            // cleans up stack
    foo.aligned(5);
    add  esp, -16           // aligned entry, this should
                            // be a multiple of 16
    mov  [esp], 5
    call foo.aligned
    add  esp, 12            // see Note B below
    return j;
    mov  eax, [ebp - 16]
    pop  edx
    mov  esp, ebp
    pop  ebp
    mov  esp, ebx
    pop  ebx
    ret  4
}
// NOTES:
// (A) Here we allow for local variables. However, this value should be adjusted so that, after
// pushing the saved registers, esp is 0 mod 16.
// (B) Just prior to the call, esp is 0 mod 16. To maintain alignment, esp should be adjusted by 16.
// When a callee uses the stdcall calling sequence, the stack pointer is restored by the callee. The
// final addition of 12 compensates for the fact that only 4 bytes were passed, rather than
// 16, and thus the caller must account for the remaining adjustment.
D.1.3 Stack Frame Optimizations

The Intel C++ Compiler provides certain optimizations that may improve the way
aligned frames are set up and used. These optimizations are as follows:
• If a procedure is defined to leave the stack frame 16-byte-aligned and it calls
  another procedure that requires 16-byte alignment, then the callee's aligned
  entry point is called, bypassing all of the unnecessary aligning code.
• If a static function requires 16-byte alignment, and it can be proven to be called
  only by other functions that require 16-byte alignment, then that function will not
  have any alignment code in it. That is, the compiler will not use EBX to point to
  the argument block and it will not have alternate entry points, because this
  function will never be entered with an unaligned frame.
D.2 INLINED ASSEMBLY AND EBX

When using aligned frames, the EBX register generally should not be modified in
inlined assembly blocks since EBX is used to keep track of the argument block.
Programmers may modify EBX only if they do not need to access the arguments and
provided they save EBX and restore it before the end of the function (since ESP is
restored relative to EBX in the function's epilog).

NOTE
Do not use the EBX register in inline assembly functions that use
dynamic stack alignment for double, __m64, and __m128 local
variables unless you save and restore EBX each time you use it. The
Intel C++ Compiler uses the EBX register to control alignment of
variables of these types, so the use of EBX, without preserving it, will
cause unexpected program execution.
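A minimal sketch of the save/restore discipline described in the note (the function
and variable names are hypothetical; not from the manual):

    int load_and_double(int *p)     /* hypothetical helper */
    {
        int r;
        __asm {
            mov  eax, p             // read the argument while EBX still
                                    // tracks the argument block
            push ebx                // save EBX before modifying it
            mov  ebx, eax
            mov  eax, [ebx]
            add  eax, eax
            pop  ebx                // restore EBX before the block ends
            mov  r, eax
        }
        return r;
    }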
APPENDIX E
SUMMARY OF RULES AND SUGGESTIONS
This appendix summarizes the rules and suggestions specified in this manual. Please
be reminded that coding recommendations are ranked in importance according to
these two criteria:
• Local impact (referred to earlier as "impact"): the difference that a recommen-
  dation makes to performance for a given instance.
• Generality: how frequently such instances occur across all application domains.
Again, understand that this ranking is intentionally very approximate, and can vary
depending on coding style, application domain, and other factors. Throughout the
chapter you observed references to these criteria using the high, medium, and low
priorities for each recommendation. In places where there was no priority assigned,
the local impact or generality has been determined not to be applicable.
E.1 ASSEMBLY/COMPILER CODING RULES
Assembler/Compiler Coding Rule 1. (MH impact, M generality) Arrange code to
make basic blocks contiguous and eliminate unnecessary branches. (3-7)
Assembler/Compiler Coding Rule 2. (M impact, ML generality) Use the SETCC
and CMOV instructions to eliminate unpredictable conditional branches where
possible. Do not do this for predictable branches. Do not use these instructions to
eliminate all unpredictable conditional branches (because using these instructions
will incur execution overhead due to the requirement for executing both paths of
a conditional branch). In addition, converting a conditional branch to SETCC or
CMOV trades off control flow dependence for data dependence and restricts the
capability of the out-of-order engine. When tuning, note that all Intel 64 and
IA-32 processors usually have very high branch prediction rates. Consistently
mispredicted branches are generally rare. Use these instructions only if the
increase in computation time is less than the expected cost of a mispredicted
branch. (3-7)
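As a sketch of this rule (not part of the rule text), an unpredictable minimum
computation can be converted from a branch to CMOV:

    ; branchy form: costly if the comparison is unpredictable
        cmp   eax, ebx
        jle   keep
        mov   eax, ebx
    keep:

    ; branchless form: control dependence becomes data dependence
        cmp   eax, ebx
        cmovg eax, ebx      ; eax = min(eax, ebx)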
Assembler/Compiler Coding Rule 3. (M impact, H generality) Arrange code to
be consistent with the static branch prediction algorithm: make the fall-through
code following a conditional branch be the likely target for a branch with a forward
target, and make the fall-through code following a conditional branch be the
unlikely target for a branch with a backward target. (3-10)
Assembler/Compiler Coding Rule 4. (MH impact, MH generality) Near calls
must be matched with near returns, and far calls must be matched with far
returns. Pushing the return address on the stack and jumping to the routine to be
called is not recommended since it creates a mismatch in calls and returns. (3-12)
Assembler/Compiler Coding Rule 5. (MH impact, MH generality) Selectively
inline a function if doing so decreases code size or if the function is small and the
call site is frequently executed. (3-12)
Assembler/Compiler Coding Rule 6. (H impact, H generality) Do not inline a
function if doing so increases the working set size beyond what will fit in the trace
cache. (3-12)
Assembler/Compiler Coding Rule 7. (ML impact, ML generality) If there are
more than 16 nested calls and returns in rapid succession, consider transforming
the program with inlining to reduce the call depth. (3-12)
Assembler/Compiler Coding Rule 8. (ML impact, ML generality) Favor inlining
small functions that contain branches with poor prediction rates. If a branch
misprediction results in a RETURN being prematurely predicted as taken, a
performance penalty may be incurred. (3-12)
Assembler/Compiler Coding Rule 9. (L impact, L generality) If the last
statement in a function is a call to another function, consider converting the call
to a jump. This will save the call/return overhead as well as an entry in the return
stack buffer. (3-12)
Assembler/Compiler Coding Rule 10. (M impact, L generality) Do not put
more than four branches in a 16-byte chunk. (3-12)
Assembler/Compiler Coding Rule 11. (M impact, L generality) Do not put
more than two end-loop branches in a 16-byte chunk. (3-12)
Assembler/Compiler Coding Rule 12. (M impact, H generality) All branch
targets should be 16-byte aligned. (3-13)
Assembler/Compiler Coding Rule 13. (M impact, H generality) If the body of
a conditional is not likely to be executed, it should be placed in another part of
the program. If it is highly unlikely to be executed and code locality is an issue,
it should be placed on a different code page. (3-13)
Assembler/Compiler Coding Rule 14. (M impact, L generality) When indirect
branches are present, try to put the most likely target of an indirect branch
immediately following the indirect branch. Alternatively, if indirect branches are
common but they cannot be predicted by branch prediction hardware, then follow
the indirect branch with a UD2 instruction, which will stop the processor from
decoding down the fall-through path. (3-13)
Assembler/Compiler Coding Rule 15. (H impact, M generality) Unroll small
loops until the overhead of the branch and induction variable accounts (generally)
for less than 10% of the execution time of the loop. (3-16)
Assembler/Compiler Coding Rule 16. (H impact, M generality) Avoid unrolling
loops excessively; this may thrash the trace cache or instruction cache. (3-16)
Assembler/Compiler Coding Rule 17. (M impact, M generality) Unroll loops
that are frequently executed and have a predictable number of iterations to
reduce the number of iterations to 16 or fewer. Do this unless it increases code
size so that the working set no longer fits in the trace or instruction cache. If the
loop body contains more than one conditional branch, then unroll so that the
number of iterations is 16/(# conditional branches). (3-16)
Assembler/Compiler Coding Rule 18. (ML impact, M generality) For improving
fetch/decode throughput, give preference to the memory flavor of an instruction
over the register-only flavor of the same instruction, if such an instruction can
benefit from micro-fusion. (3-17)
Assembler/Compiler Coding Rule 19. (M impact, ML generality) Employ
macro-fusion where possible using instruction pairs that support macro-fusion.
Prefer TEST over CMP if possible. Use unsigned variables and unsigned jumps
when possible. Try to logically verify that a variable is non-negative at the time
of comparison. Avoid CMP or TEST of the MEM-IMM flavor when possible. However,
do not add other instructions to avoid using the MEM-IMM flavor. (3-19)
Assembler/Compiler Coding Rule 20. (M impact, ML generality) Software can
enable macro fusion when it can be logically determined that a variable is non-
negative at the time of comparison; use TEST appropriately to enable macro-
fusion when comparing a variable with 0. (3-21)
Assembler/Compiler Coding Rule 21. (MH impact, MH generality) Favor
generating code using imm8 or imm32 values instead of imm16 values. (3-22)
Assembler/Compiler Coding Rule 22. (M impact, ML generality) Ensure
instructions using the 0xF7 opcode byte do not start at offset 14 of a fetch line;
and avoid using these instructions to operate on 16-bit data, upcast short data to
32 bits. (3-23)
Assembler/Compiler Coding Rule 23. (MH impact, MH generality) Break up
a loop's long sequence of instructions into shorter blocks of no more than
18 instructions. (3-23)
Assembler/Compiler Coding Rule 24. (MH impact, M generality) Avoid
unrolling loops containing LCP stalls, if the unrolled block exceeds 18 instructions.
(3-23)
Assembler/Compiler Coding Rule 25. (M impact, M generality) Avoid putting
explicit references to ESP in a sequence of stack operations (POP, PUSH, CALL,
RET). (3-24)
Assembler/Compiler Coding Rule 26. (ML impact, L generality) Use simple
instructions that are less than eight bytes in length. (3-24)
Assembler/Compiler Coding Rule 27. (M impact, MH generality) Avoid using
prefixes to change the size of immediate and displacement. (3-24)
Assembler/Compiler Coding Rule 28. (M impact, H generality) Favor single-
micro-operation instructions. Also favor instructions with shorter latencies. (3-25)
Assembler/Compiler Coding Rule 29. (M impact, L generality) Avoid prefixes,
especially multiple non-0F-prefixed opcodes. (3-25)
Assembler/Compiler Coding Rule 30. (M impact, L generality) Do not use
many segment registers. (3-25)
Assembler/Compiler Coding Rule 31. (ML impact, M generality) Avoid using
complex instructions (for example, ENTER, LEAVE, or LOOP) that have more than
four µops and require multiple cycles to decode. Use sequences of simple
instructions instead. (3-25)
Assembler/Compiler Coding Rule 32. (M impact, H generality) INC and DEC
instructions should be replaced with ADD or SUB instructions, because ADD and
SUB overwrite all flags, whereas INC and DEC do not, therefore creating false
dependencies on earlier instructions that set the flags. (3-26)
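A one-line illustration of Rule 32 (a sketch, not part of the rule text):

    inc  eax            ; writes only part of EFLAGS (CF is preserved),
                        ; creating a false dependence on the earlier
                        ; instruction that last wrote the flags
    add  eax, 1         ; overwrites all arithmetic flags; no false
                        ; dependence, at the cost of a longer encoding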
Assembler/Compiler Coding Rule 33. (ML impact, L generality) If an LEA
instruction using the scaled index is on the critical path, a sequence with ADDs
may be better. If code density and bandwidth out of the trace cache are the
critical factor, then use the LEA instruction. (3-27)
Assembler/Compiler Coding Rule 34. (ML impact, L generality) Avoid ROTATE
by register or ROTATE by immediate instructions. If possible, replace with a
ROTATE by 1 instruction. (3-27)
Assembler/Compiler Coding Rule 35. (M impact, ML generality) Use
dependency-breaking-idiom instructions to set a register to 0, or to break a false
dependence chain resulting from re-use of registers. In contexts where the
condition codes must be preserved, move 0 into the register instead. This
requires more code space than using XOR and SUB, but avoids setting the
condition codes. (3-28)
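Rendered as code, the idioms of Rule 35 look like this (a sketch):

    xor  eax, eax       ; dependency-breaking idiom: EAX is zeroed with
    sub  ebx, ebx       ; no dependence on the prior register value;
                        ; both forms write the condition codes
    mov  ecx, 0         ; larger encoding, but leaves EFLAGS untouched;
                        ; use this form when flags must be preserved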
Assembler/Compiler Coding Rule 36. (M impact, MH generality) Break
dependences on portions of registers between instructions by operating on 32-bit
registers instead of partial registers. For moves, this can be accomplished with
32-bit moves or by using MOVZX. (3-29)
Assembler/Compiler Coding Rule 37. (M impact, M generality) Try to use zero
extension or operate on 32-bit operands instead of using moves with sign
extension. (3-29)
Assembler/Compiler Coding Rule 38. (ML impact, L generality) Avoid placing
instructions that use 32-bit immediates which cannot be encoded as sign-
extended 16-bit immediates near each other. Try to schedule µops that have no
immediate immediately before or after µops with 32-bit immediates. (3-29)
Assembler/Compiler Coding Rule 39. (ML impact, M generality) Use the TEST
instruction instead of AND when the result of the logical AND is not used. This
saves µops in execution. Use a TEST of a register with itself instead of a CMP of
the register to zero; this saves the need to encode the zero and saves encoding
space. Avoid comparing a constant to a memory operand. It is preferable to load
the memory operand and compare the constant to a register. (3-30)
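A sketch of Rule 39 (the label and variable names are hypothetical):

    test ecx, ecx       ; sets flags from ECX without writing a result
    jz   is_zero        ; and without encoding an immediate zero

    cmp  ecx, 0         ; avoid: same effect, larger encoding
    jz   is_zero

    mov  eax, [counter] ; preferable to comparing a constant directly
    cmp  eax, 42        ; to the memory operand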
Assembler/Compiler Coding Rule 40. (ML impact, M generality) Eliminate
unnecessary compare-with-zero instructions by using the appropriate conditional
jump instruction when the flags are already set by a preceding arithmetic
instruction. If necessary, use a TEST instruction instead of a compare. Be certain
that any code transformations made do not introduce problems with
overflow. (3-30)
Assembler/Compiler Coding Rule 41. (H impact, MH generality) For small
loops, placing loop invariants in memory is better than spilling loop-carried
dependencies. (3-32)
Assembler/Compiler Coding Rule 42. (M impact, ML generality) Avoid
introducing dependences with partial floating-point register writes, e.g. from the
MOVSD XMMREG1, XMMREG2 instruction. Use the MOVAPD XMMREG1, XMMREG2
instruction instead. (3-38)
Assembler/Compiler Coding Rule 43. (ML impact, L generality) Instead of
using MOVUPD XMMREG1, MEM for an unaligned 128-bit load, use MOVSD
XMMREG1, MEM; MOVSD XMMREG2, MEM+8; UNPCKLPD XMMREG1, XMMREG2.
If the additional register is not available, then use MOVSD XMMREG1, MEM;
MOVHPD XMMREG1, MEM+8. (3-38)
Assembler/Compiler Coding Rule 44. (M impact, ML generality) Instead of
using MOVUPD MEM, XMMREG1 for a store, use MOVSD MEM, XMMREG1;
UNPCKHPD XMMREG1, XMMREG1; MOVSD MEM+8, XMMREG1. (3-38)
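Rendered as code, the load sequence of Rule 43 (MEM is a placeholder address):

    ; instead of:  movupd xmm1, [mem]
    movsd    xmm1, [mem]        ; load the low 8 bytes
    movsd    xmm2, [mem+8]      ; load the high 8 bytes
    unpcklpd xmm1, xmm2         ; combine into one 128-bit register

    ; if no additional register is available:
    movsd    xmm1, [mem]
    movhpd   xmm1, [mem+8]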
Assembler/Compiler Coding Rule 45. (H impact, H generality) Align data on
natural operand size address boundaries. If the data will be accessed with vector
instruction loads and stores, align the data on 16-byte boundaries. (3-48)
Assembler/Compiler Coding Rule 46. (H impact, M generality) Pass
parameters in registers instead of on the stack where possible. Passing
arguments on the stack requires a store followed by a reload. While this sequence
is optimized in hardware by providing the value to the load directly from the
memory order buffer without the need to access the data cache if permitted by
store-forwarding restrictions, floating-point values incur a significant latency in
forwarding. Passing floating-point arguments in (preferably XMM) registers should
save this long-latency operation. (3-50)
Assembler/Compiler Coding Rule 47. (H impact, M generality) A load that
forwards from a store must have the same address start point and therefore the
same alignment as the store data. (3-52)
Assembler/Compiler Coding Rule 48. (H impact, M generality) The data of a
load which is forwarded from a store must be completely contained within the
store data. (3-52)
Assembler/Compiler Coding Rule 49. (H impact, ML generality) If it is
necessary to extract a non-aligned portion of stored data, read out the smallest
aligned portion that completely contains the data and shift/mask the data as
necessary. This is better than incurring the penalties of a failed
store-forward. (3-52)
Assembler/Compiler Coding Rule 50. (MH impact, ML generality) Avoid
several small loads after large stores to the same area of memory by using a
single large read and register copies as needed. (3-52)
Assembler/Compiler Coding Rule 51. (H impact, MH generality) Where it is
possible to do so without incurring other penalties, prioritize the allocation of
variables to registers, as in register allocation and for parameter passing, to
minimize the likelihood and impact of store-forwarding problems. Try not to
store-forward data generated from a long-latency instruction, for example, MUL
or DIV. Avoid store-forwarding data for variables with the shortest store-load
distance. Avoid store-forwarding data for variables with many and/or long
dependence chains, and especially avoid including a store forward on a loop-
carried dependence chain. (3-56)
Assembler/Compiler Coding Rule 52. (M impact, MH generality) Calculate
store addresses as early as possible to avoid having stores block loads. (3-56)
Assembler/Compiler Coding Rule 53. (H impact, M generality) Try to arrange
data structures such that they permit sequential access. (3-58)
Assembler/Compiler Coding Rule 54. (H impact, M generality) If 64-bit data
is ever passed as a parameter or allocated on the stack, make sure that the stack
is aligned to an 8-byte boundary. (3-59)
Assembler/Compiler Coding Rule 55. (H impact, M generality) Avoid having a
store followed by a non-dependent load with addresses that differ by a multiple
of 4 KBytes. Also, lay out data or order computation to avoid having cache lines
that have linear addresses that are a multiple of 64 KBytes apart in the same
working set. Avoid having more than 4 cache lines that are some multiple of 2
KBytes apart in the same first-level cache working set, and avoid having more
than 8 cache lines that are some multiple of 4 KBytes apart in the same first-level
cache working set. (3-62)
Assembler/Compiler Coding Rule 56. (M impact, L generality) If (hopefully
read-only) data must occur on the same page as code, avoid placing it
immediately after an indirect jump. For example, follow an indirect jump with its
most likely target, and place the data after an unconditional branch. (3-63)
Assembler/Compiler Coding Rule 57. (H impact, L generality) Always put code
and data on separate pages. Avoid self-modifying code wherever possible. If code
is to be modified, try to do it all at once and make sure the code that performs
the modifications and the code being modified are on separate 4-KByte pages or
on separate aligned 1-KByte subpages. (3-64)
Assembler/Compiler Coding Rule 58. (H impact, L generality) If an inner loop
writes to more than four arrays (four distinct cache lines), apply loop fission to
break up the body of the loop such that only four arrays are being written to in
each iteration of each of the resulting loops. (3-65)
Assembler/Compiler Coding Rule 59. (H impact, M generality) Minimize
changes to bits 8-12 of the floating-point control word. Changes for more than
two values (each value being a combination of the following bits: precision,
rounding and infinity control, and the rest of the bits in FCW) leads to delays that
are on the order of the pipeline depth. (3-81)
Assembler/Compiler Coding Rule 60. (H impact, L generality) Minimize the
number of changes to the rounding mode. Do not use changes in the rounding
mode to implement the floor and ceiling functions if this involves a total of more
than two values of the set of rounding, precision, and infinity bits. (3-83)
Assembler/Compiler Coding Rule 61. (H impact, L generality) Minimize the
number of changes to the precision mode. (3-84)
Assembler/Compiler Coding Rule 62. (M impact, M generality) Use FXCH only
where necessary to increase the effective name space. (3-84)
Assembler/Compiler Coding Rule 63. (M impact, M generality) Use Streaming
SIMD Extensions 2 or Streaming SIMD Extensions unless you need an x87
feature. Most SSE2 arithmetic operations have shorter latency than their x87
counterparts and they eliminate the overhead associated with the management
of the x87 register stack. (3-85)
Assembler/Compiler Coding Rule 64. (M impact, L generality) Try to use
32-bit operands rather than 16-bit operands for FILD. However, do not do so at
the expense of introducing a store-forwarding problem by writing the two halves
of the 32-bit memory operand separately. (3-86)
Assembler/Compiler Coding Rule 65. (H impact, M generality) Use the 32-bit
versions of instructions in 64-bit mode to reduce code size unless the 64-bit
version is necessary to access 64-bit data or additional registers. (9-2)
Assembler/Compiler Coding Rule 66. (M impact, MH generality) When they
are needed to reduce register pressure, use the 8 extra general purpose registers
for integer code and 8 extra XMM registers for floating-point or SIMD code. (9-2)
Assembler/Compiler Coding Rule 67. (ML impact, M generality) Prefer 64-bit
by 64-bit integer multiplies that produce 64-bit results over multiplies that
produce 128-bit results. (9-2)
Assembler/Compiler Coding Rule 68. (M impact, M generality) Sign extend to
64 bits instead of sign extending to 32 bits, even when the destination will be
used as a 32-bit value. (9-3)
Assembler/Compiler Coding Rule 69. (ML impact, M generality) Use the
64-bit versions of multiply for 32-bit integer multiplies that require a 64-bit
result. (9-4)
Assembler/Compiler Coding Rule 70. (ML impact, M generality) Use the
64-bit versions of add for 64-bit adds. (9-4)
Assembler/Compiler Coding Rule 71. (L impact, L generality) If software
prefetch instructions are necessary, use the prefetch instructions provided by
SSE. (9-5)
E.2 USER/SOURCE CODING RULES
User/Source Coding Rule 1. (M impact, L generality) If an indirect branch has
two or more common taken targets and at least one of those targets is correlated
with branch history leading up to the branch, then convert the indirect branch to
a tree where one or more indirect branches are preceded by conditional branches
to those targets. Apply this "peeling" procedure to the common target of an
indirect branch that correlates to branch history. (3-14)
User/Source Coding Rule 2. (H impact, M generality) Use the smallest possible
floating-point or SIMD data type, to enable more parallelism with the use of a
(longer) SIMD vector. For example, use single precision instead of double
precision where possible. (3-39)
User/Source Coding Rule 3. (M impact, ML generality) Arrange the nesting of
loops so that the innermost nesting level is free of inter-iteration dependencies.
Especially avoid the case where the store of data in an earlier iteration happens
lexically after the load of that data in a future iteration, something which is called
a lexically backward dependence. (3-39)
User/Source Coding Rule 4. (M impact, ML generality) Avoid the use of
conditional branches inside loops and consider using SSE instructions to eliminate
branches. (3-39)
User/Source Coding Rule 5. (M impact, ML generality) Keep induction (loop)
variable expressions simple. (3-39)
User/Source Coding Rule 6. (H impact, M generality) Pad data structures
defined in the source code so that every data element is aligned to a natural
operand size address boundary. (3-56)
User/Source Coding Rule 7. (M impact, L generality) Beware of false sharing
within a cache line (64 bytes) and within a sector of 128 bytes on processors
based on Intel NetBurst microarchitecture. (3-59)
User/Source Coding Rule 8. (H impact, ML generality) Consider using a special
memory allocation library with address offset capability to avoid aliasing. (3-62)
User/Source Coding Rule 9. (M impact, M generality) When padding variable
declarations to avoid aliasing, the greatest benefit comes from avoiding aliasing
on second-level cache lines, suggesting an offset of 128 bytes or more. (3-62)
User/Source Coding Rule 10. (H impact, H generality) Optimization techniques
such as blocking, loop interchange, loop skewing, and packing are best done by
the compiler. Optimize data structures either to fit in one-half of the first-level
cache or in the second-level cache; turn on loop optimizations in the compiler to
enhance locality for nested loops. (3-66)
User/Source Coding Rule 11. (M impact, ML generality) If there is a blend of
reads and writes on the bus, changing the code to separate these bus
transactions into read phases and write phases can help performance. (3-67)
User/Source Coding Rule 12. (H impact, H generality) To achieve effective
amortization of bus latency, software should favor data access patterns that
result in higher concentrations of cache miss patterns, with cache miss strides
that are significantly smaller than half the hardware prefetch trigger
threshold. (3-67)
User/Source Coding Rule 13. (M impact, M generality) Enable the compiler's
use of SSE, SSE2 or SSE3 instructions with appropriate switches. (3-77)
User/Source Coding Rule 14. (H impact, ML generality) Make sure your
application stays in range to avoid denormal values, underflows. (3-78)
User/Source Coding Rule 15. (M impact, ML generality) Do not use double
precision unless necessary. Set the precision control (PC) field in the x87 FPU
control word to "Single Precision". This allows single precision (32-bit)
computation to complete faster on some operations (for example, divides due to
early out). However, be careful of introducing more than a total of two values for
the floating-point control word, or there will be a large performance penalty. See
Section 3.8.3. (3-78)
User/Source Coding Rule 16. (H impact, ML generality) Use fast float-to-int
routines, FISTTP, or SSE2 instructions. If coding these routines, use the FISTTP
instruction if SSE3 is available, or the CVTTSS2SI and CVTTSD2SI instructions if
coding with Streaming SIMD Extensions 2. (3-78)
User/Source Coding Rule 17. (M impact, ML generality) Removing data
dependence enables the out-of-order engine to extract more ILP from the code.
When summing up the elements of an array, use partial sums instead of a single
accumulator. (3-78)
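A sketch of the partial-sums transformation in Rule 17 (illustrative C, not from the
manual):

    /* single accumulator: every addition depends on the previous one */
    float sum1(const float *a, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* four partial sums: four independent dependence chains */
    float sum4(const float *a, int n)
    {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)      /* remainder */
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }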
User/Source Coding Rule 18. (M impact, ML generality) Usually, math libraries
take advantage of the transcendental instructions (for example, FSIN) when
evaluating elementary functions. If there is no critical need to evaluate the
transcendental functions using the extended precision of 80 bits, applications
should consider an alternate, software-based approach, such as a look-up-table-
based algorithm using interpolation techniques. It is possible to improve
transcendental performance with these techniques by choosing the desired
numeric precision and the size of the look-up table, and by taking advantage of
the parallelism of the SSE and the SSE2 instructions. (3-78)
User/Source Coding Rule 19. (H impact, ML generality) Denormalized floating-
point constants should be avoided as much as possible. (3-79)
User/Source Coding Rule 20. (M impact, H generality) Insert the PAUSE
instruction in fast spin loops and keep the number of loop repetitions to a
minimum to improve overall system performance. (7-17)
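A minimal spin-wait sketch for Rule 20 (the lock variable and its protocol are
hypothetical; not from the manual):

    #include <emmintrin.h>          /* _mm_pause */

    volatile long lock_var;         /* hypothetical lock word: 0 = free */

    void spin_until_free(void)
    {
        while (lock_var != 0)
            _mm_pause();            /* PAUSE: reduces the exit mispredict
                                       penalty and yields resources to a
                                       sibling logical processor */
    }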
User / Sour ce Codi ng Rul e 21. ( M i mpact , L gener al i t y) Replace a spin lock t hat
may be acquired by mult iple t hreads wit h pipelined locks such t hat no more t han
t wo t hreads have writ e accesses t o one lock. I f only one t hread needs t o writ e t o
a variable shared by t wo t hreads, t here is no need t o use a lock. . .. .. .. .. .. .. . 7- 18
User / Sour ce Codi ng Rul e 22. ( H i mpact , M gener al i t y) Use a t hread- blocking
API in a long idle loop t o free up t he processor .. . .. .. .. .. .. .. .. .. .. .. . .. .. .. .. .. .. .. . 7- 19
User / Sour ce Codi ng Rul e 23. ( H i mpact , M gener al i t y) Beware of false sharing
wit hin a cache line ( 64 byt es on I nt el Pent ium 4, I nt el Xeon, Pent ium M, I nt el
E-10
SUMMARY OF RULES AND SUGGESTIONS
Core Duo processors) , and wit hin a sect or ( 128 byt es on Pent ium 4 and I nt el Xeon
processors) .. .. .. .. .. . .. .. .. .. .. .. .. .. .. . .. .. .. .. .. .. .. .. .. . .. .. .. .. .. .. .. .. .. .. . .. .. .. .. .. .. .. .. . 7- 21
User/Source Coding Rule 24. (M impact, ML generality) Place each synchronization variable alone, separated by 128 bytes or in a separate cache line. ... 7-22
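A sketch of Rules 23 and 24 combined: pad each synchronization variable so it occupies its own 128-byte sector (the structure layout is illustrative; a compiler-specific attribute such as __declspec(align(128)) or __attribute__((aligned(128))) should also be applied so the base address itself is sector-aligned):

#define SECTOR_SIZE 128     /* Pentium 4 / Intel Xeon sector size cited above */

struct padded_sync {
    volatile long flag;
    char pad[SECTOR_SIZE - sizeof(long)];
};

/* One per thread: each flag now lives in its own sector, so a write by
   one thread does not invalidate the line holding another thread's flag. */
static struct padded_sync sync_vars[2];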
User/Source Coding Rule 25. (H impact, L generality) Do not allow a spin lock variable to span a cache line boundary. ... 7-22
User/Source Coding Rule 26. (M impact, H generality) Improve data and code locality to conserve bus command bandwidth. ... 7-24
User/Source Coding Rule 27. (M impact, L generality) Avoid excessive use of software prefetch instructions and allow the automatic hardware prefetcher to work. Excessive software prefetches can significantly and unnecessarily increase bus utilization. ... 7-25
User/Source Coding Rule 28. (M impact, M generality) Consider overlapping multiple back-to-back memory reads to improve effective cache miss latencies. ... 7-26
User/Source Coding Rule 29. (M impact, M generality) Consider adjusting the sequencing of memory references such that the distribution of distances of successive cache misses of the last level cache peaks towards 64 bytes. ... 7-26
User/Source Coding Rule 30. (M impact, M generality) Use full write transactions to achieve higher data throughput. ... 7-26
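A sketch of a full-line write using streaming stores (illustrative; dst is assumed to be 64-byte aligned so the four 16-byte stores cover exactly one line):

#include <emmintrin.h>

void write_full_line(float *dst, __m128 v)
{
    /* Four adjacent 16-byte non-temporal stores fill the whole 64-byte
       line, so the bus sees one full write transaction instead of
       several partial ones. */
    _mm_stream_ps(dst,      v);
    _mm_stream_ps(dst + 4,  v);
    _mm_stream_ps(dst + 8,  v);
    _mm_stream_ps(dst + 12, v);
}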
User/Source Coding Rule 31. (H impact, H generality) Use cache blocking to improve locality of data access. Target one quarter to one half of the cache size when targeting Intel processors supporting HT Technology, or target a block size that allows all the logical processors serviced by a cache to share that cache simultaneously. ... 7-27
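A minimal blocking sketch (BLOCK is a tuning parameter to be sized against the target cache, not a value from the manual):

#define BLOCK 4096    /* elements per tile; size a tile to 1/4 to 1/2 of the cache */

void process(float *data, int n)
{
    int base, i, pass;
    for (base = 0; base < n; base += BLOCK) {
        int end = (base + BLOCK < n) ? (base + BLOCK) : n;
        /* Both passes touch the same tile while it is still cache-resident,
           instead of streaming over the whole array twice. */
        for (pass = 0; pass < 2; pass++)
            for (i = base; i < end; i++)
                data[i] = data[i] * 1.5f + 1.0f;
    }
}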
User/Source Coding Rule 32. (H impact, M generality) Minimize the sharing of data between threads that execute on different bus agents sharing a common bus. On a platform consisting of multiple bus domains, also minimize data sharing across bus domains. ... 7-28
User/Source Coding Rule 33. (H impact, H generality) Minimize data access patterns that are offset by multiples of 64 KBytes in each thread. ... 7-30
User/Source Coding Rule 34. (H impact, M generality) Adjust the private stack of each thread in an application so that the spacing between these stacks is not offset by multiples of 64 KBytes or 1 MByte to prevent unnecessary cache line evictions (when using Intel processors supporting HT Technology). ... 7-31
User/Source Coding Rule 35. (M impact, L generality) Add per-instance stack offset when two instances of the same application are executing in lock steps to
avoid memory accesses that are offset by multiples of 64 KByte or 1 MByte, when targeting Intel processors supporting HT Technology. ... 7-32
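A sketch of the stack-offset technique behind Rules 34 and 35 (illustrative; _alloca is the Microsoft compiler's name, alloca() elsewhere, and the 2-KByte step is an assumed value):

#include <malloc.h>   /* _alloca */

void thread_entry(int thread_id)
{
    /* Reserve a thread-dependent slice of stack up front so the private
       stacks are no longer offset by exact multiples of 64 KBytes. */
    volatile char *offset = (volatile char *)_alloca((thread_id + 1) * 2048);
    offset[0] = 0;    /* touch the allocation so it is not optimized away */
    /* ... the thread's real work starts here ... */
}

The same idea applies per process: a per-instance offset keeps two lock-stepped copies of an application from colliding in the first-level cache.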
User/Source Coding Rule 36. (M impact, L generality) Avoid excessive loop unrolling to ensure the Trace cache is operating efficiently. ... 7-34
User/Source Coding Rule 37. (L impact, L generality) Optimize code size to improve locality of Trace cache and increase delivered trace length. ... 7-34
User/Source Coding Rule 38. (M impact, L generality) Consider using thread affinity to optimize sharing resources cooperatively in the same core and subscribing dedicated resources in separate processor cores. ... 7-37
User/Source Coding Rule 39. (M impact, L generality) If a single thread consumes half of the peak bandwidth of a specific execution unit (e.g. fdiv), consider adding a thread that seldom relies on that execution unit, when tuning for HT Technology. ... 7-43
E.3 TUNING SUGGESTIONS
Tuning Suggestion 1. In rare cases, a performance problem may be caused by executing data on a code page as instructions. This is very likely to happen when execution is following an indirect branch that is not resident in the trace cache. If this is clearly causing a performance problem, try moving the data elsewhere, or inserting an illegal opcode or a PAUSE instruction immediately after the indirect branch. Note that the latter two alternatives may degrade performance in some circumstances. ... 3-63
Tuning Suggestion 2. If a load is found to miss frequently, either insert a prefetch before it or (if issue bandwidth is a concern) move the load up to execute earlier. ... 3-70
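A sketch of the prefetch alternative (PDIST, the prefetch-scheduling distance, is an assumed tuning parameter; see the prefetch chapter for how to choose it):

#include <xmmintrin.h>

#define PDIST 16      /* iterations ahead; tune to the memory latency */

float accumulate(const float *a, int n)
{
    float sum = 0.0f;
    int i;
    for (i = 0; i < n; i++) {
        if (i + PDIST < n)
            _mm_prefetch((const char *)&a[i + PDIST], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}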
Tuning Suggestion 3. Optimize single-threaded code to maximize execution throughput first. ... 7-41
Tuning Suggestion 4. Optimize multithreaded applications to achieve optimal processor scaling with respect to the number of physical processors or processor cores. ... 7-41
Tuning Suggestion 5. Schedule threads that compete for the same execution resource to separate processor cores. ... 7-41
Tuning Suggestion 6. Use on-chip execution resources cooperatively if two logical processors are sharing the execution resources in the same processor core. ... 7-42
Tuning Suggestion 7.
INDEX
Numerics
64-bit mode
arithmetic, 9-3
coding guidelines, 9-1
compiler settings, A-2
CVTSI2SD instruction, 9-4
CVTSI2SS instruction, 9-4
default operand size, 9-1
introduction, 2-45
legacy instructions, 9-1
multiplication notes, 9-2
register usage, 9-2, 9-3
REX prefix, 9-1
sign-extension, 9-2
software prefetch, 9-5
A
absolute difference of signed numbers, 5-20
absolute difference of unsigned numbers, 5-20
absolute value, 5-21
active power, 10-1
ADDSUBPD instruction, 6-17
ADDSUBPS instruction, 6-17, 6-19
algorithm to avoid changing the rounding mode, 3-82
alignment
arrays, 3-56
code, 3-12
stack, 3-59
structures, 3-56
Amdahl's law, 7-2
AoS format, 4-20
application performance tools, A-1
arrays
aligning, 3-56
assembler/compiler coding rules, E-1
assist, B-2
automatic vectorization, 4-12, 4-13
B
battery life
guidelines for extending, 10-5
mobile optimization, 10-1
OS APIs, 10-6
quality trade-offs, 10-5
bogus, non-bogus, retire, B-1
branch prediction
choosing types, 3-13
code examples, 3-8
eliminating branches, 3-7
optimizing, 3-6
unrolling loops, 3-15
bus ratio, B-2
C
C4-state, 10-4
cache management
blocking techniques, 8-22
cache level, 8-5
CLFLUSH instruction, 8-12
coding guidelines, 8-1
compiler choices, 8-2
compiler intrinsics, 8-2
CPUID instruction, 3-5, 8-37
function leaf, 3-5
optimizing, 8-1
simple memory copy, 8-32
smart cache, 2-36
video decoder, 8-31
video encoder, 8-31
See also: optimizing cache utilization
call graph profiling, A-11
CD/DVD, 10-7
changing the rounding mode, 3-82
classes (C/C++), 4-11
CLFLUSH instruction, 8-12
clipping to an arbitrary signed range, 5-25
clipping to an arbitrary unsigned range, 5-27
clock ticks
in performance metrics, B-6
nominal CPI, B-3
non-halted clock ticks, B-3
non-halted CPI, B-3
non-sleep clock ticks, B-3
time-stamp counter, B-3
See also: performance monitoring events
coding techniques, 4-7, 7-23
64-bit guidelines, 9-1
absolute difference of signed numbers, 5-20
absolute difference of unsigned numbers, 5-20
absolute value, 5-21
clipping to an arbitrary signed range, 5-25
clipping to an arbitrary unsigned range, 5-27
conserving power, 10-7
data in segment, 3-63
generating constants, 5-19
interleaved pack with saturation, 5-8
interleaved pack without saturation, 5-10
latency and throughput, C-1
methodologies, 4-8
non-interleaved unpack, 5-10
optimization options, A-2
rules, 3-5, E-1
signed unpack, 5-7
simplified clip to arbitrary signed range, 5-26
sleep transitions, 10-7
suggestions, 3-5, E-1
summary of rules, E-1
tuning hints, 3-5, E-1
unsigned unpack, 5-6
See also: floating-point code
coherent requests, 8-9
command-line options
floating-point arithmetic precision, A-5
inline expansion of library functions, A-5
rounding control, A-5
vectorizer switch, A-4
comparing register values, 3-27, 3-29
compatibility mode, 9-1
compatibility model, 2-45
compiler intrinsics
_mm_load, 8-2, 8-31
_mm_prefetch, 8-2, 8-31
_mm_stream, 8-2, 8-31
compilers
branch prediction support, 3-16
documentation, 1-3
general recommendations, 3-2
plug-ins, A-1
supported alignment options, 4-16
See also: Intel C++ Compiler & Intel Fortran
Compiler
computation
intensive code, 4-6
converting 64-bit to 128-bit SIMD integers, 5-36
converting code to MMX technology, 4-4
CPUID instruction
AP-485, 1-3
cache parameters, 8-37
function leaf, 8-37
function leaf 4, 3-5
Intel compilers, 3-4
MMX support, 4-2
SSE support, 4-2
SSE2 support, 4-3
SSE3 support, 4-3
SSSE3 support, 4-4
strategy for use, 3-4
C-states, 10-1, 10-3
CVTSI2SD instruction, 9-4
CVTSI2SS instruction, 9-4
CVTTPS2PI instruction, 6-16
CVTTSS2SI instruction, 6-16
D
data
access pattern of array, 3-58
aligning arrays, 3-56
aligning structures, 3-56
alignment, 4-13
arrangement, 6-3
code segment and, 3-63
deswizzling, 6-10, 6-11
prefetching, 2-37
swizzling, 6-7
swizzling using intrinsics, 6-8
__declspec(align), D-3
deeper sleep, 10-4
denormals-are-zero (DAZ), 6-16
deterministic cache parameters
cache sharing, 8-37, 8-39
multicore, 8-39
overview, 8-37
prefetch stride, 8-39
domain decomposition, 7-5
Dual-core Intel Xeon processors, 2-1
E
EBP-based stack frames, D-4
eliminating branches, 3-9
EMMS instruction, 5-2, 5-3
guidelines for using, 5-3
Enhanced Intel SpeedStep Technology
description of, 10-8
multicore processors, 10-11
usage scenario, 10-2
ESP-based stack frames, D-3
extract word instruction, 5-12
F
fencing operations, 8-7
LFENCE instruction, 8-11
MFENCE instruction, 8-11
FIST instruction, 3-82
FLDCW instruction, 3-82
floating-point code
arithmetic precision options, A-5
copying, shuffling, 6-12
data arrangement, 6-3
data deswizzling, 6-10
data swizzling using intrinsics, 6-8
guidelines for optimizing, 3-77
horizontal ADD, 6-13
improving parallelism, 3-84
memory access stall information, 3-53
operations with integer operands, 3-86
operations, integer operands, 3-86
optimizing, 3-77
planning considerations, 6-1
rules and suggestions, 6-1
scalar code, 6-2
transcendental functions, 3-86
unrolling loops, 3-15
vertical versus horizontal computation, 6-3
See also: coding techniques
flush-to-zero (FTZ), 6-16
front end
branching ratios, B-52
characterizing mispredictions, B-53
HT Technology, 7-33
key practices, 7-13
loop unrolling, 7-13, 7-34
multithreading, 7-33
optimization, 3-6
optimization for code size, 7-34
Pentium M processor, 3-24
tagging mechanisms, B-37
trace cache, 7-13
trace cache events, B-30
functional decomposition, 7-5
FXCH instruction, 3-85, 6-2
G
generating constants, 5-19
GetActivePwrScheme, 10-6
GetSystemPowerStatus, 10-6
H
HADDPD instruction, 6-17
HADDPS instruction, 6-17, 6-22
hardware multithreading
support for, 3-5
hardware prefetch
cache blocking techniques, 8-26
description of, 8-3
latency reduction, 8-14
memory optimization, 8-12
operation, 8-13
horizontal computations, 6-13
hotspots
definition of, 4-6
identifying, 4-6
VTune analyzer, 4-6
HSUBPD instruction, 6-17
HSUBPS instruction, 6-17, 6-22
Hyper-Threading Technology
avoid excessive software prefetches, 7-25
bus optimization, 7-12
cache blocking technique, 7-27
conserve bus command bandwidth, 7-23
eliminate 64-K-aliased data accesses, 7-30
excessive loop unrolling, 7-34
front-end optimization, 7-33
full write transactions, 7-26
functional decomposition, 7-5
improve effective latency of cache misses, 7-25
memory optimization, 7-26
minimize data sharing between physical
processors, 7-27
multitasking environment, 7-3
optimization, 7-1
optimization for code size, 7-34
optimization guidelines, 7-11
optimization with spin-locks, 7-18
overview, 2-37
parallel programming models, 7-5
performance metrics, B-39
per-instance stack offset, 7-32
per-thread stack offset, 7-31
pipeline, 2-40
placement of shared synchronization variable,
7-21
prevent false-sharing of data, 7-21
preventing excessive evictions in first-level data
cache, 7-30
processor resources, 2-38
shared execution resources, 7-41
shared-memory optimization, 7-27
synchronization for longer periods, 7-18
synchronization for short periods, 7-16
system bus optimization, 7-23
thread sync practices, 7-12
thread synchronization, 7-14
tools for creating multithreaded applications, 7-10
I
IA-32e mode, 2-45
IA32_PERFEVSELx MSR, B-50
increasing bandwidth
memory fills, 5-35
video fills, 5-35
indirect branch, 3-13
inline assembly, 5-4
inline expansion library functions option, A-5
inlined-asm, 4-9
insert word instruction, 5-13
instruction latency/throughput
overview, C-1
instruction scheduling, 3-63
Intel 64 and IA-32 processors, 2-1
Intel 64 architecture
and IA-32 processors, 2-45
features of, 2-45
IA-32e mode, 2-45
Intel Advanced Digital Media Boost, 2-3
Intel Advanced Memory Access, 2-13
Intel Advanced Smart Cache, 2-2, 2-17
Intel Core Duo processors, 2-1, 2-36
128-bit integers, 5-37
data prefetching, 2-37
front end, 2-36
microarchitecture, 2-36
packed FP performance, 6-22
performance events, B-42
prefetch mechanism, 8-3
processor perspectives, 3-3
shared cache, 2-43
SIMD support, 4-1
special programming models, 7-6
static prediction, 3-9
Intel Core microarchitecture, 2-1, 2-2
advanced smart cache, 2-17
branch prediction unit, 2-6
event ratios, B-50
execution core, 2-9
execution units, 2-10
issue ports, 2-10
front end, 2-5
instruction decode, 2-8
instruction fetch unit, 2-6
instruction queue, 2-7
advanced memory access, 2-13
micro-fusion, 2-9
pipeline overview, 2-3
special programming models, 7-6
stack pointer tracker, 2-8
static prediction, 3-11
Intel Core Solo processors, 2-1
128-bit SIMD integers, 5-37
data prefetching, 2-37
front end, 2-36
microarchitecture, 2-36
performance events, B-42
prefetch mechanism, 8-3
processor perspectives, 3-3
SIMD support, 4-1
static prediction, 3-9
Intel Core2 Duo processors, 2-1
processor perspectives, 3-3
Intel C++ Compiler, 3-1
64-bit mode settings, A-2
branch prediction support, 3-16
description, A-1
IA-32 settings, A-2
multithreading support, A-5
OpenMP, A-5
optimization settings, A-1
related Information, 1-3
stack frame support, D-1
Intel Debugger
description, A-1
Intel Enhanced Deeper Sleep
C-state numbers, 10-3
enabling, 10-10
multiple-cores, 10-13
Intel Fortran Compiler
description, A-1
multithreading support, A-5
OpenMP, A-5
optimization settings, A-1
related information, 1-3
Intel Integrated Performance Primitives
for Linux, A-13
for Windows, A-13
Intel Math Kernel Library for Linux, A-12
Intel Math Kernel Library for Windows, A-12
Intel Mobile Platform SDK, 10-6
Intel NetBurst microarchitecture, 2-1
core, 2-22, 2-25
design goals, 2-20
front end, 2-22
introduction, 2-19
out-of-order core, 2-25
pipeline, 2-20, 2-23
prefetch characteristics, 8-3
processor perspectives, 3-3
retirement, 2-23
trace cache, 3-11
Intel Pentium D processors, 2-1, 2-41
Intel Pentium M processors, 2-1
core, 2-35
data prefetching, 2-34
front end, 2-33
microarchitecture, 2-32
retirement, 2-35
Intel Performance Libraries, A-12
benefits, A-13
optimizations, A-13
Intel performance libraries
description, A-1
Intel Performance Tools, 3-1, A-1
Intel Smart Cache, 2-36
Intel Smart Memory Access, 2-2
Intel software network link, 1-3
Intel Thread Checker, 7-11
example output, A-14
Intel Thread Profiler
Intel Threading Tools, 7-11
Intel Threading Tools, A-1, A-14
Intel VTune Performance Analyzer
call graph, A-11
code coach, 4-6
coverage, 3-2
description, A-1
related information, 1-3
Intel Wide Dynamic Execution, 2-2
interleaved pack with saturation, 5-8
interleaved pack without saturation, 5-10
interprocedural optimization, A-6
introduction
chapter summaries, 1-2
optimization features, 2-1
processors covered, 1-1
references, 1-3
IPO. See interprocedural optimization
L
large load stalls, 3-54
latency, 8-4, 8-15
legacy mode, 9-1
LFENCE instruction, 8-11
links to web data, 1-3
load instructions and prefetch, 8-6
loading-storing to-from same DRAM page, 5-36
loop
blocking, 4-23
unrolling, 8-20, A-5
M
MASKMOVDQU instruction, 8-7
memory bank conflicts, 8-2
memory optimizations
loading-storing to-from same DRAM page, 5-36
overview, 5-31
partial memory accesses, 5-32
performance, 4-18
reference instructions, 3-27
using aligned stores, 5-36
using prefetch, 8-12
MFENCE instruction, 8-11
micro-op fusion, 2-36
misaligned data access, 4-13
misalignment in the FIR filter, 4-15
mobile computing
ACPI standard, 10-1, 10-3
active power, 10-1
battery life, 10-1, 10-5, 10-6
C4-state, 10-4
CD/DVD, WLAN, WiFi, 10-7
C-states, 10-1, 10-3
deep sleep transitions, 10-7
deeper sleep, 10-4, 10-10
Intel Mobile Platform SDK, 10-6
OS APIs, 10-6
OS changes processor frequency, 10-2
OS synchronization APIs, 10-6
overview, 10-1
performance options, 10-5
platform optimizations, 10-7
P-states, 10-1
Speedstep technology, 10-8
spin-loops, 10-6
state transitions, 10-2
static power, 10-1
WM_POWERBROADCAST message, 10-7
MOVAPD instruction, 6-3
MOVAPS instruction, 6-3
MOVDDUP instruction, 6-17
move byte mask to integer, 5-14
MOVHLPS instruction, 6-13
MOVHPS instruction, 6-7, 6-10
MOVLHPS instruction, 6-13
MOVLPS instruction, 6-7, 6-10
MOVNTDQ instruction, 8-7
MOVNTI instruction, 8-7
MOVNTPD instruction, 8-7
MOVNTPS instruction, 8-7
MOVNTQ instruction, 8-7
MOVQ instruction, 5-35
MOVSHDUP instruction, 6-17, 6-19
MOVSLDUP instruction, 6-17, 6-19
MOVUPD instruction, 6-3
MOVUPS instruction, 6-3
multicore processors
architecture, 2-1
C-state considerations, 10-12
energy considerations, 10-10
features of, 2-41
functional example, 2-41
pipeline and core, 2-43
SpeedStep technology, 10-11
thread migration, 10-11
multiprocessor systems
dual-core processors, 7-1
HT Technology, 7-1
optimization techniques, 7-1
See also: multithreading & Hyper-Threading
Technology
multithreading
Amdahl's law, 7-2
application tools, 7-10
bus optimization, 7-12
compiler support, A-5
dual-core technology, 3-5
environment description, 7-1
guidelines, 7-11
hardware support, 3-5
HT technology, 3-5
Intel Core microarchitecture, 7-6
parallel & sequential tasks, 7-2
programming models, 7-4
shared execution resources, 7-41
specialized models, 7-6
thread sync practices, 7-12
See Hyper-Threading Technology
N
Newton-Raphson iteration, 6-1
non-coherent requests, 8-9
non-halted clock ticks, B-4
non-interleaved unpack, 5-10
non-sleep clock ticks, B-4
non-temporal stores, 8-8, 8-30
NOP, 3-30
O
OpenMP compiler directives, 7-10, A-5
optimization
branch prediction, 3-6
branch type selection, 3-13
eliminating branches, 3-7
features, 2-1
general techniques, 3-1
spin-wait and idle loops, 3-9
static prediction, 3-9
unrolling loops, 3-15
optimizing cache utilization
cache management, 8-31
examples, 8-11
non-temporal store instructions, 8-7
prefetch and load, 8-6
prefetch instructions, 8-5
prefetching, 8-5
SFENCE instruction, 8-10, 8-11
streaming, non-temporal stores, 8-7
See also: cache management
OS APIs, 10-6
P
pack instructions, 5-8
packed average (byte or word), 5-29
packed multiply high unsigned, 5-28
packed shuffle word, 5-15
packed signed integer word maximum, 5-28
packed sum of absolute differences, 5-28
parallelism, 4-7, 7-5
partial memory accesses, 5-32
PAUSE instruction, 3-9, 7-12
PAVGB instruction, 5-29
PAVGW instruction, 5-29
PeekMessage(), 10-6
Pentium 4 processors
inner loop iterations, 3-15
static prediction, 3-9
Pentium M processors
prefetch mechanisms, 8-3
processor perspectives, 3-3
static prediction, 3-9
Pentium Processor Extreme Edition, 2-1, 2-41
performance models
Amdahl's law, 7-2
multithreading, 7-2
parallelism, 7-1
usage, 7-1
performance monitoring events
analysis techniques, B-45
Bus_Not_In_Use, B-44
Bus_Snoops, B-45
DCU_Snoop_to_Share, B-44
drill-down techniques, B-45
event ratios, B-50
HT Technology, B-39
Intel Core Duo processors, B-42
Intel Core Solo processors, B-42
Intel NetBurst microarchitecture, B-1
Intel Xeon processors, B-1
L1_Pref_Req, B-44
L2_No_Request_Cycles, B-44
L2_Reject_Cycles, B-44
metrics and categories, B-5
Pentium 4 processor, B-1
performance counter, B-42
ratio interpretation, B-43
See also: clock ticks
Serial_Execution_Cycles, B-44
Unhalted_Core_Cycles, B-44
Unhalted_Ref_Cycles, B-44
performance tools, 3-1
PEXTRW instruction, 5-12
PGO. See profile-guided optimization
PINSRW instruction, 5-13
PMINSW instruction, 5-28
PMINUB instruction, 5-28
PMOVMSKB instruction, 5-14
PMULHUW instruction, 5-28
predictable memory access patterns, 8-5
prefetch
64-bit mode, 9-5
coding guidelines, 8-2
compiler intrinsics, 8-2
concatenation, 8-19
deterministic cache parameters, 8-37
hardware mechanism, 8-3
characteristics, 8-13
latency, 8-14
how instructions designed, 8-5
innermost loops, 8-5
instruction considerations
cache block techniques, 8-22
checklist, 8-17
concatenation, 8-18
hint mechanism, 8-4
minimizing number, 8-20
scheduling distance, 8-17
single-pass execution, 8-2, 8-27
spread with computations, 8-21
strip-mining, 8-24
summary of, 8-4
instruction variants, 8-5
latency hiding/reduction, 8-15
load Instructions, 8-6
memory access patterns, 8-5
memory optimization with, 8-12
minimizing number of, 8-20
scheduling distance, 8-2, 8-17
software data, 8-4
spreading, 8-22
when introduced, 8-1
PREFETCHNTA instruction, 8-6, 8-24
usage guideline, 8-2
PREFETCHT0 instruction, 8-6, 8-24
usage guideline, 8-2
PREFETCHT1 instruction, 8-6
PREFETCHT2 instruction, 8-6
producer-consumer model, 7-6
profile-guided optimization, A-6
PSADBW instruction, 5-28
PSHUF instruction, 5-15
P-states, 10-1
Q
-Qparallel, 7-10
R
ratios, B-50
branching and front end, B-52
references, 1-3
releases of, 2-48
replay, B-2
rounding control option, A-5
rules, E-1
S
sampling
event-based, A-11
scheduling distance (PSD), 8-17
Self-modifying code, 3-63
SFENCE instruction, 8-10
SHUFPS instruction, 6-3, 6-7
signed unpack, 5-7
SIMD
auto-vectorization, 4-12
cache instructions, 8-1
classes, 4-11
coding techniques, 4-7
data alignment for MMX, 4-16
data and stack alignment, 4-13
data alignment for 128-bits, 4-16
example computation, 2-45
history, 2-45
identifying hotspots, 4-6
instruction selection, 4-25
loop blocking, 4-23
memory utilization, 4-18
microarchitecture differences, 4-26
MMX technology support, 4-2
padding to align data, 4-14
parallelism, 4-7
SSE support, 4-2
SSE2 support, 4-3
SSE3 support, 4-3
SSSE3 support, 4-4
stack alignment for 128-bits, 4-15
strip-mining, 4-22
using arrays, 4-14
vectorization, 4-7
VTune capabilities, 4-6
SIMD floating-point instructions
copying, shuffling, 6-12
data arrangement, 6-3
data deswizzling, 6-10
data swizzling, 6-7
different microarchitectures, 6-16
general rules, 6-1
horizontal ADD, 6-13
Intel Core Duo processors, 6-22
Intel Core Solo processors, 6-22
planning considerations, 6-1
reciprocal instructions, 6-1
scalar code, 6-2
SSE3 complex math, 6-18
SSE3 FP programming, 6-17
using
ADDSUBPS, 6-19
CVTTPS2PI, 6-16
CVTTSS2SI, 6-16
FXCH, 6-2
HADDPS, 6-22
HSUBPS, 6-22
MOVAPD, 6-3
MOVAPS, 6-3
MOVHLPS, 6-13
MOVHPS, 6-7, 6-10
MOVLHPS, 6-13
MOVLPS, 6-7, 6-10
MOVSHDUP, 6-19
MOVSLDUP, 6-19
MOVUPD, 6-3
MOVUPS, 6-3
SHUFPS, 6-3, 6-7
UNPACKHPS, 6-7
UNPACKLPS, 6-7
UNPCKHPS, 6-10
UNPCKLPS, 6-10
vertical vs horizontal computation, 6-3
with x87 FP instructions, 6-2
SIMD technology, 2-48
SIMD-integer instructions
64-bits to 128-bits, 5-36
data alignment, 5-4
data movement techniques, 5-6
extract word, 5-12
insert word, 5-13
integer intensive, 5-1
memory optimizations, 5-31
move byte mask to integer, 5-14
optimization by architecture, 5-37
packed average (byte or word), 5-29
packed multiply high unsigned, 5-28
packed shuffle word, 5-15
packed signed integer word maximum, 5-28
packed sum of absolute differences, 5-28
rules, 5-1
signed unpack, 5-7
unsigned unpack, 5-6
using
EMMS, 5-2
MOVDQ, 5-35
MOVQ2DQ, 5-18
PABSW, 5-20
PACKSSDW, 5-8
PADDQ, 5-30
PALIGNR, 5-4
PAVGB, 5-29
PAVGW, 5-29
PEXTRW, 5-12
PINSRW, 5-13
PMADDWD, 5-30
PMAXSW, 5-28
PMAXUB, 5-28
PMINSW, 5-28
PMINUB, 5-28
PMOVMSKB, 5-14
PMULHUW, 5-28
PMULHW, 5-28
PMULUDQ, 5-30
PSADBW, 5-28
PSHUF, 5-15
PSHUFB, 5-21, 5-23
PSHUFLW, 5-17
PSLLDQ, 5-31
PSRLDQ, 5-31
PSUBQ, 5-30
PUNPCKHQDQ, 5-18
PUNPCKLQDQ, 5-18
simplified 3D geometry pipeline, 8-15
simplified clipping to an arbitrary signed range, 5-27
single vs multi-pass execution, 8-27
sleep transitions, 10-7
smart cache, 2-36
SoA format, 4-20
software write-combining, 8-30
spin-loops, 10-6
optimization, 3-9
PAUSE instruction, 3-9
related information, 1-3
SSE, 2-48
SSE2, 2-48
SSE3, 2-49
SSSE3, 2-49
stack
aligned EBP-based frames, D-4
aligned ESP-based frames, D-3
alignment 128-bit SIMD, 4-15
alignment stack, 3-59
dynamic alignment, 3-59
frame optimizations, D-6
inlined assembly & EBX, D-7
Intel C++ Compiler support for, D-1
overview, D-1
state transitions, 10-2
static branch prediction algorithm, 3-10
static power, 10-1
static prediction, 3-9
streaming stores, 8-7
coherent requests, 8-9
improving performance, 8-7
non-coherent requests, 8-9
strip-mining, 4-22, 4-23, 8-24, 8-25
prefetch considerations, 8-26
structures
aligning, 3-56
suggestions, E-1
summary of coding rules, E-1
swizzling data
See data swizzling.
system bus optimization, 7-23
T
tagging, B-2
tagging mechanisms
execution_event, B-37
front_end_event, B-37
replay_event, B-35
time-based sampling, A-11
time-consuming innermost loops, 8-5
time-stamp counter, B-5
non-sleep clock ticks, B-5
RDTSC instruction, B-5
sleep pin, B-5
TLB. See translation lookaside buffer
trace cache
events, B-30
translation lookaside buffer, 8-32
transcendental functions, 3-86
U
unpack instructions, 5-10
UNPACKHPS instruction, 6-7
UNPACKLPS instruction, 6-7
UNPCKHPS instruction, 6-10
UNPCKLPS instruction, 6-10
unrolling loops
benefits of, 3-15
code examples, 3-16
limitation of, 3-15
unsigned unpack, 5-6
using MMX code for copy, shuffling, 6-12
V
vector class library, 4-12
vectorized code
auto generation, A-6
automatic vectorization, 4-12
high-level examples, A-6
parallelism, 4-7
SIMD architecture, 4-7
switch options, A-4
vertical vs horizontal computation, 6-3
W
WaitForSingleObject(), 10-6
WaitMessage(), 10-6
weakly ordered stores, 8-7
WiFi, 10-7
WLAN, 10-7
workload characterization
retirement throughput, A-11
write-combining
buffer, 8-30
memory, 8-30
semantics, 8-8
X
XCHG EAX,EAX, support for, 3-31
INTEL SALES OFFICES
ASIA PACIFIC
Australia
Intel Corp.
Level 2
448 St Kilda Road
Melbourne VIC
3004
Australia
Fax:613-9862 5599
China
Intel Corp.
Rm 709, Shaanxi
Zhongda Int'l Bldg
No.30 Nandajie Street
Xian AX710002
China
Fax:(86 29) 7203356
Intel Corp.
Rm 2710, Metropolian
Tower
68 Zourong Rd
Chongqing CQ
400015
China
Intel Corp.
C1, 15 Flr, Fujian
Oriental Hotel
No. 96 East Street
Fuzhou FJ
350001
China
Intel Corp.
Rm 5803 CITIC Plaza
233 Tianhe Rd
Guangzhou GD
510613
China
Intel Corp.
Rm 1003, Orient Plaza
No. 235 Huayuan Street
Nangang District
Harbin HL
150001
China
Intel Corp.
Rm 1751 World Trade
Center, No 2
Han Zhong Rd
Nanjing JS
210009
China
Intel Corp.
Hua Xin International
Tower
215 Qing Nian St.
ShenYang LN
110015
China
Intel Corp.
Suite 1128 CITIC Plaza
Jinan
150 Luo Yuan St.
Jinan SN
China
Intel Corp.
Suite 412, Holiday Inn
Crowne Plaza
31, Zong Fu Street
Chengdu SU
610041
China
Fax:86-28-6785965
Intel Corp.
Room 0724, White Rose
Hotel
No 750, MinZhu Road
WuChang District
Wuhan UB
430071
China
India
Intel Corp.
Paharpur Business
Centre
21 Nehru Place
New Delhi DH
110019
India
Intel Corp.
Hotel Rang Sharda, 6th
Floor
Bandra Reclamation
Mumbai MH
400050
India
Fax:91-22-6415578
Intel Corp.
DBS Corporate Club
31A Cathedral Garden
Road
Chennai TD
600034
India
Intel Corp.
DBS Corporate Club
2nd Floor, 8 A.A.C. Bose
Road
Calcutta WB
700017
India
Japan
Intel Corp.
Kokusai Bldg 5F, 3-1-1,
Marunouchi
Chiyoda-Ku, Tokyo
1000005
Japan
Intel Corp.
2-4-1 Terauchi
Toyonaka-Shi
Osaka
5600872
Japan
Malaysia
Intel Corp.
Lot 102 1/F Block A
Wisma Semantan
12 Jalan Gelenggang
Damansara Heights
Kuala Lumpur SL
50490
Malaysia
Thailand
Intel Corp.
87 M. Thai Tower, 9th Fl.
All Seasons Place,
Wireless Road
Lumpini, Patumwan
Bangkok
10330
Thailand
Viet Nam
Intel Corp.
Hanoi Tung Shing
Square, Ste #1106
2 Ngo Quyen St
Hoan Kiem District
Hanoi
Viet Nam
EUROPE & AFRICA
Belgium
Intel Corp.
Woluwelaan 158
Diegem
1831
Belgium
Czech Rep
Intel Corp.
Nahorni 14
Brno
61600
Czech Rep
Denmark
Intel Corp.
Soelodden 13
Maaloev
DK2760
Denmark
Germany
Intel Corp.
Sandstrasse 4
Aichner
86551
Germany
Intel Corp.
Dr Weyerstrasse 2
Juelich
52428
Germany
Intel Corp.
Buchenweg 4
Wildberg
72218
Germany
Intel Corp.
Kemnader Strasse 137
Bochum
44797
Germany
Intel Corp.
Klaus-Schaefer Strasse
16-18
Erftstadt NW
50374
Germany
Intel Corp.
Heldmanskamp 37
Lemgo NW
32657
Germany
Italy
Intel Corp Italia Spa
Milanofiori Palazzo E/4
Assago
Milan
20094
Italy
Fax:39-02-57501221
Netherlands
Intel Corp.
Strausslaan 31
Heesch
5384CW
Netherlands
Poland
Intel Poland
Developments, Inc
Jerozolimskie Business
Park
Jerozolimskie 146c
Warsaw
2305
Poland
Fax:+48-22-570 81 40
Portugal
Intel Corp.
PO Box 20
Alcabideche
2765
Portugal
Spain
Intel Corp.
Calle Rioja, 9
Bajo F Izquierda
Madrid
28042
Spain
South Africa
Intel SA Corporation
Bldg 14, South Wing,
2nd Floor
Uplands, The Woodlands
Western Services Road
Woodmead
2052
Sth Africa
Fax:+27 11 806 4549
Intel Corp.
19 Summit Place,
Halfway House
Cnr 5th and Harry
Galaun Streets
Midrand
1685
Sth Africa
United Kingdom
Intel Corp.
The Manse
Silver Lane
Needingworth CAMBS
PE274SL
UK
Intel Corp.
2 Cameron Close
Long Melford SUFFK
CO109TS
UK
Israel
Intel Corp.
MTM Industrial Center,
P.O.Box 498
Haifa
31000
Israel
Fax:972-4-8655444
LATIN AMERICA &
CANADA
Argentina
Intel Corp.
Dock IV - Bldg 3 - Floor 3
Olga Cossettini 240
Buenos Aires
C1107BVA
Argentina
Brazil
Intel Corp.
Rua Carlos Gomez
111/403
Porto Alegre
90480-003
Brazil
Intel Corp.
Av. Dr. Chucri Zaidan
940 - 10th Floor
San Paulo
04583-904
Brazil
Intel Corp.
Av. Rio Branco,
1 - Sala 1804
Rio de Janeiro
20090-003
Brazil
Colombia
Intel Corp.
Carrera 7 No. 71021
Torre B, Oficina 603
Santefe de Bogota
Colombia
Mexico
Intel Corp.
Av. Mexico No. 2798-9B,
S.H.
Guadalajara
44680
Mexico
Intel Corp.
Torre Esmeralda II,
7th Floor
Blvd. Manuel Avila
Comacho #36
Mexico City DF
11000
Mexico
Intel Corp.
Piso 19, Suite 4
Av. Batallon de San
Patricio No 111
Monterrey, Nuevo León
66269
Mexico
Canada
Intel Corp.
168 Bonis Ave, Suite 202
Scarborough
M1T 3V6
Canada
Fax:416-335-7695
Intel Corp.
3901 Highway #7,
Suite 403
Vaughan
L4L 8L5
Canada
Fax:905-856-8868
Intel Corp.
999 CANADA PLACE,
Suite 404,#11
Vancouver BC
V6C 3E2
Canada
Fax:604-844-2813
Intel Corp.
2650 Queensview Drive,
Suite 250
Ottawa ON
K2B 8H6
Canada
Fax:613-820-5936
Intel Corp.
190 Attwell Drive,
Suite 500
Rexdale ON
M9W 6H8
Canada
Fax:416-675-2438
Intel Corp.
171 St. Clair Ave. E,
Suite 6
Toronto ON
Canada
Intel Corp.
1033 Oak Meadow Road
Oakville ON
L6M 1J6
Canada
USA
California
Intel Corp.
551 Lundy Place
Milpitas CA
95035-6833
USA
Fax:408-451-8266
Intel Corp.
1551 N. Tustin Avenue,
Suite 800
Santa Ana CA
92705
USA
Fax:714-541-9157
Intel Corp.
Executive Center del Mar
12230 El Camino Real
Suite 140
San Diego CA
92130
USA
Fax:858-794-5805
Intel Corp.
1960 E. Grand Avenue,
Suite 150
El Segundo CA
90245
USA
Fax:310-640-7133
Intel Corp.
23120 Alicia Parkway,
Suite 215
Mission Viejo CA
92692
USA
Fax:949-586-9499
Intel Corp.
30851 Agoura Road
Suite 202
Agoura Hills CA
91301
USA
Fax:818-874-1166
Intel Corp.
28202 Cabot Road,
Suite #363 & #371
Laguna Niguel CA
92677
USA
Intel Corp.
657 S Cendros Avenue
Solana Beach CA
90075
USA
Intel Corp.
43769 Abeloe Terrace
Fremont CA
94539
USA
Intel Corp.
1721 Warburton, #6
Santa Clara CA
95050
USA
Colorado
Intel Corp.
600 S. Cherry Street,
Suite 700
Denver CO
80222
USA
Fax:303-322-8670
Connecticut
Intel Corp.
Lee Farm Corporate Pk
83 Wooster Heights
Road
Danbury CT
6810
USA
Fax:203-778-2168
Florida
Intel Corp.
7777 Glades Road
Suite 310B
Boca Raton FL
33434
USA
Fax:813-367-5452
Georgia
Intel Corp.
20 Technology Park,
Suite 150
Norcross GA
30092
USA
Fax:770-448-0875
Intel Corp.
Three Northwinds Center
2500 Northwinds
Parkway, 4th Floor
Alpharetta GA
30092
USA
Fax:770-663-6354
Idaho
Intel Corp.
910 W. Main Street, Suite
236
Boise ID
83702
USA
Fax:208-331-2295
Illinois
Intel Corp.
425 N. Martingale Road
Suite 1500
Schaumburg IL
60173
USA
Fax:847-605-9762
Intel Corp.
999 Plaza Drive
Suite 360
Schaumburg IL
60173
USA
Intel Corp.
551 Arlington Lane
South Elgin IL
60177
USA
Indiana
Intel Corp.
9465 Counselors Row,
Suite 200
Indianapolis IN
46240
USA
Fax:317-805-4939
Massachusetts
Intel Corp.
125 Nagog Park
Acton MA
01720
USA
Fax:978-266-3867
Intel Corp.
59 Composit Way
suite 202
Lowell MA
01851
USA
Intel Corp.
800 South Street,
Suite 100
Waltham MA
02154
USA
Maryland
Intel Corp.
131 National Business
Parkway, Suite 200
Annapolis Junction MD
20701
USA
Fax:301-206-3678
Michigan
Intel Corp.
32255 Northwestern
Hwy., Suite 212
Farmington Hills MI
48334
USA
Fax:248-851-8770
Minnesota
Intel Corp.
3600 W 80Th St
Suite 450
Bloomington MN
55431
USA
Fax:952-831-6497
North Carolina
Intel Corp.
2000 CentreGreen Way,
Suite 190
Cary NC
27513
USA
Fax:919-678-2818
New Hampshire
Intel Corp.
7 Suffolk Park
Nashua NH
03063
USA
New Jersey
Intel Corp.
90 Woodbridge Center
Dr, Suite. 240
Woodbridge NJ
07095
USA
Fax:732-602-0096
New York
Intel Corp.
628 Crosskeys Office Pk
Fairport NY
14450
USA
Fax:716-223-2561
Intel Corp.
888 Veterans Memorial
Highway
Suite 530
Hauppauge NY
11788
USA
Fax:516-234-5093
Ohio
Intel Corp.
3401 Park Center Drive
Suite 220
Dayton OH
45414
USA
Fax:937-890-8658
Intel Corp.
56 Milford Drive
Suite 205
Hudson OH
44236
USA
Fax:216-528-1026
Oregon
Intel Corp.
15254 NW Greenbrier
Parkway, Building B
Beaverton OR
97006
USA
Fax:503-645-8181
Pennsylvania
Intel Corp.
925 Harvest Drive
Suite 200
Blue Bell PA
19422
USA
Fax:215-641-0785
Intel Corp.
7500 Brooktree
Suite 213
Wexford PA
15090
USA
Fax:714-541-9157
Texas
Intel Corp.
5000 Quorum Drive,
Suite 750
Dallas TX
75240
USA
Fax:972-233-1325
Intel Corp.
20445 State Highway
249, Suite 300
Houston TX
77070
USA
Fax:281-376-2891
Intel Corp.
8911 Capital of Texas
Hwy, Suite 4230
Austin TX
78759
USA
Fax:512-338-9335
Intel Corp.
7739 La Verdura Drive
Dallas TX
75248
USA
Intel Corp.
77269 La Cabeza Drive
Dallas TX
75249
USA
Intel Corp.
3307 Northland Drive
Austin TX
78731
USA
Intel Corp.
15190 Prestonwood
Blvd. #925
Dallas TX
75248
USA
Washington
Intel Corp.
2800 156Th Ave. SE
Suite 105
Bellevue WA
98007
USA
Fax:425-746-4495
Intel Corp.
550 Kirkland Way
Suite 200
Kirkland WA
98033
USA
Wisconsin
Intel Corp.
405 Forest Street
Suites 109/112
Oconomowoc WI
53066
USA