
Two-Level Adaptive Training Branch Prediction

Tse-Yu Yeh and Yale N. Patt

Department of Electrical Engineering and Computer Science
The University of Michigan
Ann Arbor, Michigan 48109-2122

Abstract

This paper proposes a new dynamic branch predictor, the Two-Level Adaptive Training scheme, which alters the branch prediction algorithm on the basis of information collected at run-time. Several configurations of the Two-Level Adaptive Training Branch Predictor are introduced, simulated, and compared to simulations of other known static and dynamic branch prediction schemes. Two-Level Adaptive Training Branch Prediction achieves 97 percent accuracy on nine of the ten SPEC benchmarks, while the other schemes achieve at most 93 percent. Since a prediction miss requires flushing the speculative execution already in progress, the relevant metric is the miss rate: 3 percent for Two-Level Adaptive Training vs. 7 percent (best case) for the other schemes. This represents a 100 percent improvement in reducing the number of pipeline flushes required.

1 Introduction

Pipelining has been one of the most effective ways to improve machine performance, beginning at least as early as [18] and continuing to the present [6]. As pipelines get deeper, the negative effect of branches on performance increases. Among the different types of branches, conditional branches impede performance the most, in two ways: the condition has to be resolved before the branch direction is determined, and the target address has to be calculated before the target instruction can be fetched. Unconditional branches have only to wait for the target address to be calculated. In conventional computers, the stalls incurred before issuing the target instruction result in pipeline bubbles, and when the pipeline is deep the resulting performance loss is considerable.

There are two ways to reduce this performance loss. The first is to reduce the number of pipeline bubbles, by fetching and decoding the branch as early as possible and prefetching the target instruction. The second is to reduce the stalls by predicting the branch and initiating speculative fetch and execution on the predicted path before the branch is resolved.

Branch prediction schemes can be classified into static schemes and dynamic schemes, depending on the information used to make the predictions. Static prediction schemes can be as simple as predicting that all branches are taken, or that all are not taken.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1991 ACM 0-89791-460-0/91/0011/0051 $1.50

Predicting that all branches will be taken achieves approximately 68 percent accuracy, as reported in [13]. Predictions can also be based on the opcode, since certain classes of branch instructions tend to be taken in one direction more than the other. The Backward Taken and Forward Not Taken scheme [16] is fairly effective in loop-bound programs, because it mispredicts only once over all iterations of a loop; however, it can lead to many mispredictions on programs with irregular branch paths. Profiling [12, 5] can also be used to make static predictions, by measuring the taken tendencies of the branches in sample runs of a program and presetting a prediction bit in the opcode of each static branch. However, the profiling has to be performed in advance, and the branch tendencies observed with the profiling data sets may differ from those that occur at run-time.

Trace-driven simulations of the Two-Level Adaptive Training schemes, as well as of the other prediction schemes, were performed using benchmarks from the SPEC benchmark suite; the average prediction accuracy of Two-Level Adaptive Training reaches 97 percent.

This paper is organized as follows: Section 2 gives an introduction to the proposed Two-Level Adaptive Training Branch Prediction scheme. Section 3 discusses implementation methods. Section 4 reports the simulation methodology and the simulated models, including the selection of benchmarks. Section 5 reports the simulation results for both the proposed scheme and the other schemes, static and dynamic. The paper closes with some concluding remarks.

Dynamic branch prediction schemes use run-time knowledge of the execution history of branches to make their predictions. Lee and Smith [13] proposed the Branch Target Buffer, a structure which records branch history information; each entry uses 2-bit saturating up-down counters, and the prediction is based on the state of the counter in the buffer entry, which changes dynamically as the branch executes.

2 Two-Level Adaptive Training Branch Prediction

Another dynamic scheme, Static Training, was also proposed by Lee and Smith [13]. It uses statistics collected from a pre-run of the program, together with the history pattern consisting of the last k run-time execution results of the branch, to make a prediction. The major disadvantage of Static Training is that the program has to be run first to accumulate the statistics: a pre-run of the program and training data sets are necessary, and the statistics collected for the training data sets may not be applicable to different data sets.

There is serious performance degradation in deep-pipelined superscalar machines caused by branch prediction misses, due to the large amount of speculative execution that has to be discarded [1, 8]. This is the motivation for proposing a new, higher-accuracy dynamic branch prediction scheme.

The Two-Level Adaptive Training scheme uses two levels of branch history information to make predictions. The first level is the history of the last k occurrences of the branch. The second level is the branch behavior for the last s occurrences of the unique pattern formed by those k outcomes. The major advantage over Static Training is that the history information is collected on the fly, by updating the contents of the history registers and the pattern history bits in the pattern history table; no pre-runs of the program are necessary, and the predictor adapts to the current execution of the program.

2.1 Concept of Two-Level Adaptive Training Branch Prediction
The Two-Level Adaptive Training scheme has two major data structures: the branch history register (HR) table and the pattern history table (PT). Prediction is based on the branch history pattern of the last k occurrences of the branch, instead of the accumulated statistics used by Static Training. Each entry of the history register table contains a history register which records the outcomes of the most recent executions of a particular branch. The history register is a shift register: a bit representing the result of each execution is shifted in, and the oldest bit is shifted out. The content of the history register is used to index the pattern history table, and the prediction is made by checking the pattern history bits contained in the indexed pattern table entry. The history register table is indexed by branch instruction addresses.

Figure 1: The structure of the Two-Level Adaptive Training scheme. [Figure shows the branch history register table feeding the pattern history table (PT).]

Figure 2: The state transition diagrams of the finite-state machines (Last-Time and automata A1 through A4) used for updating the pattern history bits in a pattern history table entry.

Since each history register is associated with one particular branch, the history register table is called a per-address history register table (PHRT); since all history registers access the same pattern table, the pattern table is called a global pattern table. The structure of the Two-Level Adaptive Training scheme is shown in Figure 1.

The prediction of a branch is based on the history pattern of the last k outcomes of executing the branch; therefore, k bits are needed in each history register to keep track of the history. If the branch was taken, a 1 is recorded; if not, a 0 is recorded. Since at most 2^k different patterns can appear in a history register, the pattern history table contains 2^k entries, each indexed by one distinct history pattern.

When a conditional branch B_i is being predicted, the content of its history register HR_i, denoted R_{i,c-k} R_{i,c-k+1} ... R_{i,c-1}, which records the outcomes of the last k executions of the branch, is used to address the pattern history table. The pattern history bits S_c in the addressed entry are then used for predicting the branch. The prediction of the branch is

    z_c = λ(S_c),    (1)

where λ is the prediction decision function.

After the conditional branch is resolved, the outcome R_{i,c} is shifted left into the history register HR_i, with the least significant (oldest) bit shifted out, and is also used to update the pattern history bits in the pattern table entry addressed by R_{i,c-k} R_{i,c-k+1} ... R_{i,c-1}. After the update, the content of the history register becomes R_{i,c-k+1} R_{i,c-k+2} ... R_{i,c}, and the state represented by the pattern history bits becomes S_{c+1}. The state transition is performed by the transition function δ, which takes the old pattern history bits S_c and the outcome R_{i,c} as inputs to generate the new pattern history bits:

    S_{c+1} = δ(S_c, R_{i,c}).    (2)

A straightforward implementation of the transition function δ is a combinational logic circuit. The pattern history bits in a pattern table entry thus comprise the state of a finite-state machine characterized by equations 1 and 2; it is a Moore machine, whose output, the prediction, is characterized by equation 1. The state transition diagrams of the finite-state machines used in this study for updating a pattern history table entry are shown in Figure 2.

The Last-Time automaton records only the outcome of the last execution in which the same history pattern appeared, so only one bit is needed: the prediction is simply what happened the last time the same history pattern appeared. Automaton A2 is a saturating up-down counter, similar to the counter used in Lee and Smith's Branch Target Buffer design [13], but applied to each history pattern rather than to each branch.
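The prediction and update steps of equations 1 and 2 can be sketched in simulator form. The Python sketch below is our illustration, not code from the paper: it assumes an ideal history register table (a dictionary holding one k-bit register per branch address), a global pattern table of automaton A2 two-bit saturating counters, and initialization toward "taken" as described in Section 4.2.

```python
class TwoLevelPredictor:
    """Sketch of Two-Level Adaptive Training with an ideal HRT and automaton A2."""

    def __init__(self, k=12):
        self.k = k
        self.hrt = {}                     # branch address -> k-bit history register
        self.pt = [3] * (2 ** k)          # pattern table: 2-bit counters, start at "taken"

    def predict(self, addr):
        hist = self.hrt.get(addr, (1 << self.k) - 1)   # registers start as all 1s
        return self.pt[hist] >= 2          # lambda: predict taken iff counter >= 2

    def update(self, addr, taken):
        hist = self.hrt.get(addr, (1 << self.k) - 1)
        # delta: saturating up-down counter update of the pattern history bits
        if taken:
            self.pt[hist] = min(self.pt[hist] + 1, 3)
        else:
            self.pt[hist] = max(self.pt[hist] - 1, 0)
        # shift the outcome into the history register, discarding the oldest bit
        self.hrt[addr] = ((hist << 1) | int(taken)) & ((1 << self.k) - 1)
```

Driving `predict` and `update` with (address, taken) pairs from an instruction trace mirrors the trace-driven simulation methodology of Section 4.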

The A2 counter is incremented when the branch is taken and decremented when it is not taken; the branch is predicted as taken when the counter value is greater than or equal to two, and predicted as not taken otherwise. Automata A1, A3, and A4 are similar four-state automata.

Both Static Training and Two-Level Adaptive Training are dynamic branch predictors, because their predictions are based on run-time information, i.e. the dynamic branch history. The major difference between them is that in Static Training the output of the pattern history table is preset by profiling: for a given history pattern, the prediction is determined before execution, so the same prediction is always made when the same history pattern appears, even if the branch behaves differently with a different data set. In Two-Level Adaptive Training, the pattern history bits change dynamically in accordance with the execution, so the predictor adjusts to the current branch behavior of the program. Two-Level Adaptive Training can therefore remain highly accurate over many different programs and data sets, while the accuracy of Static Training may be degraded when the execution behavior changes; on the contrary, it is hard for Static Training to predict well on programs whose branches behave differently with different data sets.

3 Implementation Methods

3.1 Implementations of the Per-address History Register Table

It is not feasible in real implementations for each static branch to have its own history register. Therefore, the history register table is implemented as a cache, and several approaches are proposed for implementing it.

The first approach is a set-associative cache, called the Associative History Register Table (AHRT). A fixed number of history registers are grouped together as a set; the lower part of the branch address is used to index into the table, and the higher part is used as a tag. Within a set, the Least-Recently-Used (LRU) algorithm is used for replacement. When a conditional branch is to be predicted, the entry for the branch is located first; if the AHRT does not have an entry for the branch, a new history register is allocated.

The second approach is a hash table, called the Hash History Register Table (HHRT). Since an entry in the HHRT does not store a tag, collisions can occur when two different branches hash into the same entry, and the history information is then corrupted by interference between the branches; the HHRT is, however, cheaper to implement than the AHRT.

The third approach is the Ideal History Register Table (IHRT), in which a history register can be found for each static branch, so no history information is lost. The IHRT is not practical, but it was simulated to show how much accuracy is lost in the practical designs due to interference and misses in the history register table. The AHRT was simulated with 512 entries and with 256 entries, both 4-way set-associative; the HHRT was also simulated with 512 entries and with 256 entries.

3.2 Prediction Latency

The Two-Level Adaptive Training Branch Predictor needs two table lookups to make a prediction: one into the history register table and one into the pattern table. It is hard to squeeze the two lookups into one cycle, which is usually the requirement for determining the next fetch address in a high-performance processor. One solution to this problem is to predict the next execution of a branch at the time the branch is resolved: when the history register and the pattern table entry are updated, the prediction for the next execution is made and stored as a prediction bit in the history register table entry. The prediction is then available when the branch is fetched, and the pattern table does not have to be accessed in the cycle in which the prediction is required.

Another problem occurs when a branch is executed again before the execution result of its previous occurrence has been confirmed, as is usually the case when a tight loop is being executed by a deep-pipelined or superscalar machine. Since this kind of branch has a high tendency to be taken, the branch is predicted taken; otherwise, the machine would stall until the previous result is confirmed.
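The AHRT organization above can be made concrete with a small sketch. This is our illustration under stated assumptions (set selection by the low address bits, tags from the high bits, LRU tracked with an ordered dictionary, and newly allocated registers initialized to all 1s as in Section 4.2); it is not the paper's hardware design.

```python
from collections import OrderedDict

class AHRT:
    """Sketch of a set-associative History Register Table with LRU replacement."""

    def __init__(self, entries=512, ways=4, k=12):
        self.sets = entries // ways        # lower address bits select a set
        self.ways = ways
        self.k = k
        # each set is an OrderedDict: tag -> history register, oldest first
        self.table = [OrderedDict() for _ in range(self.sets)]

    def lookup(self, addr):
        index, tag = addr % self.sets, addr // self.sets
        s = self.table[index]
        if tag in s:
            s.move_to_end(tag)             # mark as most recently used
            return s[tag]
        if len(s) >= self.ways:
            s.popitem(last=False)          # evict the least recently used entry
        s[tag] = (1 << self.k) - 1         # newly allocated register: all 1s (taken)
        return s[tag]

    def store(self, addr, hist):
        self.table[addr % self.sets][addr // self.sets] = hist & ((1 << self.k) - 1)
```

Replacing the dictionary in the earlier predictor sketch with this table models the accuracy loss that the IHRT comparison is meant to measure.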

Figure 3: Distribution of dynamic instructions.

Figure 4: Distribution of dynamic branch instructions.

4 Methodology and Simulation Model

Trace-driven simulations were used in this study. A Motorola 88100 instruction-level simulator (ISIM) is used for generating instruction traces. The traces are fed into the branch prediction simulator, which decodes instructions, predicts branches with the branch prediction scheme being studied, verifies the predictions against the execution results in the traces, and collects statistics on prediction accuracy.

  Benchmark    Static Cond. Br.    Benchmark    Static Cond. Br.
  eqntott           277            espresso          556
  gcc              6922            li                489
  doduc            1149            fpppp             653
  matrix300         213            spice2g6          606
  tomcatv           370

Table 1: The number of static conditional branches in each benchmark.

The branch instructions in the M88100 instruction set [4] are classified into four classes: conditional branches, unconditional immediate branches, subroutine returns, and other branches on registers. Conditional branches have to wait for their condition codes in order to decide the branch direction; the branch prediction proposed in this paper is for the conditional branch class. A subroutine return can be predicted by using a return address stack, as proposed by Kaeli and Emma [2]: the return address is pushed onto the stack when a subroutine call is detected and popped when a return is detected, so the prediction may miss only when the return address stack overflows. The target address of an unconditional immediate branch is calculated by adding the offset in the instruction to the program counter, so the target address can be generated early without prediction. Branches on registers have to wait for the register value to become ready.

4.1 Description of Traces

Nine benchmarks from the SPEC benchmark suite are used in this study. Five are floating point benchmarks and four are integer benchmarks. The floating point benchmarks include doduc, fpppp, matrix300, spice2g6, and tomcatv, and the integer ones include eqntott, espresso, gcc, and li. Nasa7 is not included because it takes too long to capture the repetitive loop execution behavior of all seven of its kernels. Matrix300 and tomcatv have very few static conditional branches and regular loop-bound branch behavior, so a very high prediction accuracy is attainable on them. The integer benchmarks tend to have many conditional branches and irregular branch behavior, which is where the mettle of a branch predictor is tested; thus, the study focuses on the integer benchmarks. The benchmarks were simulated for twenty million instructions; fpppp and gcc finish before twenty million instructions are executed.

The distribution of the dynamic instructions is shown in Figure 3. About 24 percent of the dynamic instructions are branch instructions for the integer benchmarks, and about 5 percent for the floating point benchmarks. The distribution of the dynamic branch instructions is shown in Figure 4.
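The return address stack described above admits a short sketch (our illustration; the fixed depth and the policy of discarding the oldest entry on overflow are assumptions, and overflow is the case in which a return prediction can miss):

```python
class ReturnAddressStack:
    """Sketch of subroutine-return prediction via a bounded stack."""

    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)              # overflow: discard the oldest return address
        self.stack.append(return_addr)     # push on a subroutine call

    def ret(self):
        # pop on a subroutine return; an empty stack means the prediction must miss
        return self.stack.pop() if self.stack else None
```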

Table 2 lists the configurations of the simulated branch predictors, using the naming convention described in Section 4.2. The Two-Level Adaptive Training (AT) configurations vary the history register table implementation (a 256- or 512-entry four-way set-associative AHRT, a 256- or 512-entry HHRT, or the IHRT), the history register length (12-, 10-, 8-, or 6-bit shift registers), and the pattern table automaton (A1, A2, A3, A4, or Last-Time), for example AT(AHRT(512,12SR), PT(2^12,A2)). The Static Training (ST) configurations use a pattern table of preset prediction bits, for example ST(AHRT(512,12SR), PT(2^12,PB), Same), with the training data set either the same as (Same) or different from (Diff) the testing data set. The Lee and Smith (LS) designs, for example LS(AHRT(512,A2)), keep an A2 automaton or a Last-Time bit directly in each history register table entry, so no pattern table is specified.

Abbreviations: AT = Two-Level Adaptive Training; ST = Static Training; LS = Lee and Smith's Branch Target Buffer design; AHRT = Four-way Set-Associative History Register Table; HHRT = Hash History Register Table; IHRT = Ideal History Register Table; SR = Shift Register; Atm = Automaton; LT = Last-Time; PB = Preset Prediction Bit.

Table 2: Configurations of simulated branch predictors.

Since about 60 percent of branches are taken according to our simulation results, the history registers are initialized to contain more 1s than 0s: each register is set to all 1s at the beginning of program execution, and is re-initialized when its entry is re-allocated for a different branch. The pattern history bits of automata A1, A2, A3, A4, and Last-Time, and the automaton entries in the Lee and Smith designs, are likewise initialized at the beginning of execution to states from which branches are likely to be predicted taken.

As can be seen from Figure 4, about 80 percent of the dynamic branch instructions are conditional branches; the conditional branch class should therefore be studied to improve overall branch prediction accuracy. The number of static conditional branches in the trace of each benchmark is listed in Table 1.

4.2 Simulation Model

Several configurations were simulated for the Two-Level Adaptive Training scheme. For the per-address history register table (PHRT), the two practical implementations, the associative HRT (AHRT) and the hash HRT (HHRT), were simulated along with the ideal HRT (IHRT). The Static Training schemes were simulated with the same HRT implementations; the important difference is that for Static Training the prediction for a given history pattern is pre-determined by profiling. Lee and Smith's Branch Target Buffer designs were also simulated for comparison purposes, with the two practical HRT implementations and with an IHRT.

In order to distinguish the different branch prediction schemes, the naming convention Scheme(History(Size, Entry_Content), Pattern(Size, Entry_Content), Data) is used. Scheme specifies the prediction scheme: Two-Level Adaptive Training (AT), Static Training (ST), or Lee and Smith's Branch Target Buffer design (LS). History(Size, Entry_Content) specifies the implementation used for keeping the history information of branches: the AHRT, the HHRT, or the IHRT. Size is the number of entries in the table, and Entry_Content specifies the content of each entry, for example a 12-bit shift register. Pattern(Size, Entry_Content) specifies the pattern history table implementation: Size is the number of entries, and Entry_Content can be any of the automata shown in Figure 2. Pattern is not specified in the Lee and Smith designs, because there is no pattern table; the automaton is kept in the history register table entry. Data specifies how the data sets are used for training: Same means the same data set is used for both training and testing, and Diff means different data sets are used. If Data is not specified, no training set is needed, as in the Two-Level Adaptive Training schemes and the Lee and Smith Branch Target Buffer designs. The configurations simulated in this study are listed in Table 2.

The static schemes simulated include Always Taken, Backward Taken and Forward Not Taken, and a simple profiling scheme. Profiling is done by counting the frequency of taken and not-taken executions for each static branch in a profiling run; the branch is then predicted in the direction it takes most frequently. Since the data set used for profiling is the same data set used for execution in this study, the prediction accuracy of the profiling scheme was calculated by taking the ratio of the sum, over every static branch, of the larger of its two direction counts to the total number of dynamic conditional branch instructions; this is the best accuracy possible for the profiling scheme.
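The profiling accuracy calculation described above can be written directly (our illustration; the (address, taken) trace format is assumed):

```python
from collections import Counter

def profiled_accuracy(trace):
    """Best-case accuracy of per-branch static prediction from a profile.

    trace: iterable of (branch_address, taken) pairs. Each static branch is
    predicted in the direction it takes most frequently, so the number of
    correct predictions is the larger of its taken/not-taken counts.
    """
    counts = Counter()                     # (address, direction) -> occurrences
    for addr, taken in trace:
        counts[(addr, taken)] += 1
    addrs = {addr for addr, _ in counts}
    correct = sum(max(counts[(a, True)], counts[(a, False)]) for a in addrs)
    return correct / sum(counts.values())
```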

5
The run Static signs, ures the egory mean the ing This ferent

Simulation
simulation with and results the Two-Level Schemes, static 10 show as Tot all the some

Results
presented Adaptive the the On G branch the in this section schemes, Buffer schemes. accuracy axis, the G Mean benchmarks, across shows axis the the were the deFigacross catFigure different
r90-Z.awl

Training Target prediction

5:

Two-Level state transition

Adaptive

Training

schemes

using

Training 5 through nine labeled across

Branch prediction

automata.
ImdEemA8A2m

benchmarks.

horizontal shows Int mean

Mean

geometric shows and predifall float-

benchmarks, across The from with the geometric vertical

geometric point

mean shows

all integer

FP G Mean diction accuracy section branch

benchmarks. scaled concludes prediction

0 ~v-k$y ~ : v
,2.>;.V$Z...... I...
-

Ad@iveTn2mhw -.

mug Ditrelentxmr

n.n . - ...................$$.

.. ....... .... .......................

c c

a ATIAHITISW.1UlUPT(401qMLl ATIHHIAW2,1 ~Tw2v2L)

0,22

- -----------------------

. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . .

76 percent a comparison

to 100 percent. between

a c

#u-

- . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . .

.. .. .. .. .

3
in the state the

* ATIHRTI!6, MI%?T(4wLA21.I

* ATlN41T@2t.m2Wl(401&A2L) - ATWRTP2t1 z3RI,PT(4s90,W

schemes.

0.8 -- . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5.1 Two-Level Adaptive Training Schemes

The Two-Level Adaptive Training schemes were simulated with different state transition automata, different history register lengths, and different implementations of the history register table (HRT) to show their effects on the prediction accuracy. The scheme using the ideal HRT (IHRT) shows the accuracy the Two-Level Adaptive Training simulations can achieve without the effect of history table misses. Lee and Smith's Static Training scheme, which also uses state transition automata, is used as a comparison. In order to show the curves clearly in the following figures, each scheme is shown with the transition automaton with which it performs best; the automaton A2 usually performs best among the transition automata in this study.

5.1.1 Effect of State Transition Automata

Figure 5 shows the efficiency of different state transition automata. Four state transition automata, A1, A2, A3, and A4, and Last-Time were simulated. The experiments indicated that the four-state automata A2, A3, and A4 perform similarly; the Two-Level Adaptive Training schemes using them achieve a prediction accuracy around 97 percent, while the scheme using Last-Time is about 1 percent worse, and A1 was inferior to the other four-state automata. Last-Time records only what happened the last time the branch was executed; the four-state finite-state automata maintain more history information and are therefore more tolerant to noise in the execution history.

5.1.2 Effect of History Register Table Implementation

Figure 6 shows the effects of the HRT implementations on the prediction accuracy of the Two-Level Adaptive Training scheme. Every implementation was simulated with the same automaton and the same history register length. The scheme using the IHRT performs the best, the 512-entry AHRT scheme the second, the 256-entry AHRT scheme the third, the 512-entry HHRT scheme the fourth, and the 256-entry HHRT scheme the worst. This order is due to the decreasing hit ratio of the history register table: as the hit ratio decreases, the interference between branches increases and the prediction accuracy decreases.

Figure 6: Two-Level Adaptive Training schemes using different history register table implementations.
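As a concrete illustration of the automata compared in Section 5.1.1, the sketch below models Last-Time as a one-bit predictor and a four-state automaton as a two-bit saturating up-down counter. The counter model is an illustrative assumption (the paper defines the specific automata A1 through A4 elsewhere); the loop-style outcome stream shows why a four-state automaton tolerates a single contrary outcome better than Last-Time.

```python
# Per-branch predictors: a four-state saturating counter (stand-in for
# an automaton like A2 -- an assumption for illustration) vs. Last-Time.

class SaturatingCounter:
    """Four states 0..3; predict taken when the counter is 2 or 3."""
    def __init__(self):
        self.state = 2                      # start weakly taken
    def predict(self):
        return self.state >= 2
    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

class LastTime:
    """One state: predict whatever the branch did last time."""
    def __init__(self):
        self.last = True
    def predict(self):
        return self.last
    def update(self, taken):
        self.last = taken

def mispredicts(automaton, outcomes):
    misses = 0
    for taken in outcomes:
        if automaton.predict() != taken:
            misses += 1
        automaton.update(taken)
    return misses

# A loop branch: taken nine times, then not taken at loop exit, repeated.
stream = ([True] * 9 + [False]) * 10
print(mispredicts(SaturatingCounter(), stream))  # -> 10: only the exits
print(mispredicts(LastTime(), stream))           # -> 19: exits plus re-entries
```

On this stream the counter mispredicts only the loop exits, while Last-Time also mispredicts the first iteration after each exit, consistent with the roughly 1 percent gap reported above.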

Benchmark    Training Data Set    Testing Data Set
eqntott      NA                   int.pri~.eqn
espresso     cps                  bca
gcc          cexp.i               dbxout.i
li           tower of hanoi       eight queens
doduc        doducin              tiny doducin
fpppp        NA                   natoms
matrix300    NA                   NA
spice2g6     greycode.in          short greycode.in
tomcatv      NA                   NA

Table 3: Training and testing data sets of each benchmark.
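The three HRT organizations of Section 5.1.2 differ only in how a branch address locates its history entry. The sketch below is a minimal model under assumed sizes (64 sets and 4 ways are illustrative, not the paper's configurations): the IHRT never misses, the HHRT has no tags so colliding branches interfere, and the AHRT uses tags with LRU replacement so a miss only restarts one branch's history.

```python
# Minimal models of the three history register table organizations.
# Each entry stands in for a branch's history register.
from collections import OrderedDict

class IHRT:
    """Ideal HRT: unbounded, one entry per static branch, never misses."""
    def __init__(self):
        self.table = {}
    def entry(self, addr):
        return self.table.setdefault(addr, 0)

class HHRT:
    """Hash HRT: no tags, so branches whose addresses collide share an
    entry and interfere with each other's history."""
    def __init__(self, n):
        self.table = [0] * n
    def slot(self, addr):
        return addr % len(self.table)

class AHRT:
    """Set-associative HRT: tags detect misses; LRU replacement evicts
    the coldest branch, and a miss restarts only that branch's history."""
    def __init__(self, n_sets, ways=4):
        self.sets = [OrderedDict() for _ in range(n_sets)]
        self.ways = ways
    def entry(self, addr):
        s = self.sets[addr % len(self.sets)]
        if addr in s:
            s.move_to_end(addr)            # hit: refresh LRU order
        elif len(s) == self.ways:
            s.popitem(last=False)          # miss in a full set: evict LRU
        s.setdefault(addr, 0)              # miss: new history starts empty
        return s[addr]

# Two branches 64 entries apart collide in the hash table but occupy
# distinct ways of the same set in the associative table.
h, a = HHRT(64), AHRT(64)
print(h.slot(0x100) == h.slot(0x140))   # -> True: interference
a.entry(0x100); a.entry(0x140)
print(len(a.sets[0]))                   # -> 2: tracked separately
```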

5.1.3 Effect of History Register Length

Figure 7 shows the effect of history register length on the prediction accuracy of the Two-Level Adaptive Training schemes. The schemes were simulated with history registers of different lengths. According to the simulation results, lengthening the history registers often improves the prediction accuracy until the asymptote is reached; increasing the history register length by 2 bits increases the accuracy about 0.5 percent.

Figure 7: Two-Level Adaptive Training schemes using history registers of different lengths.

5.2 Static Training

Static Training makes predictions by examining the history pattern of a branch and the statistics gathered from profiling executions of the program with a training data set. The profiling executions calculate, for each branch and history pattern, the probabilities that the branch with the given history pattern will be taken or not taken, so the predictions are known beforehand. The cost to implement Static Training is not any less expensive than Two-Level Adaptive Training, because the history registers and the pattern table are required by both schemes; however, the logic for the state transition in the Static Training scheme is simpler. Although the statistics can be gathered by software, Static Training still needs hardware support to keep track of the execution history of each branch at run-time: the history registers used in the Static Training schemes must be the same as those used in the Two-Level Adaptive Training schemes. When a branch is being predicted, its recorded execution history is used to index into the pattern table, which contains the preset prediction for each pattern; the prediction is then made from the preset prediction bit. Because the number of static branches varies from one program to another, the number of history registers required changes; hardware like the IHRT, which is big enough to hold all the static branches of regular programs, is required to offer the same support.

In order to make a fair comparison, the Static Training schemes were simulated both trained and tested on the same data set and trained and tested on different data sets. Five of the nine benchmarks were trained with one data set and tested with another. The other four benchmarks, eqntott, matrix300, fpppp, and tomcatv, were excluded from the different-data-set experiments because there are no other applicable data sets, or the applicable data sets are too similar to each other. The training and testing data sets used in this study are shown in Table 3.

The Static Training schemes were simulated with configurations similar to those of the Two-Level Adaptive Training schemes in Figure 6; the prediction accuracies are shown in Figure 8. The highest accuracy, about 97 percent, is achieved by the Static Training scheme using 12-bit history registers and an IHRT when trained and tested on the same data set, about the same as that achieved by the Two-Level Adaptive Training scheme using 12-bit history registers and a 512-entry 4-way AHRT. When the schemes are trained and tested on different data sets, the prediction accuracy is about 1 percent lower. For gcc and espresso, however, the drop is about 5 percent, because the branch behavior of these programs varies with the input data; for the floating point benchmarks, the degradations are not so apparent and are within 0.5 percent. In order to consider the effects of practical implementations, the two practical HRT implementations were simulated in addition to the IHRT.
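A minimal model of the Static Training flow described above: a profiling pass over a training trace tallies taken/not-taken counts per (branch, history pattern) pair, the pattern table is preset to the majority outcome, and the table is then used read-only while the hardware history registers keep shifting. The 4-bit history length and the branch address are illustrative assumptions, not the paper's configuration.

```python
# Sketch of Static Training: profile once, preset the pattern table,
# then predict with the preset bits while maintaining history at run-time.
from collections import defaultdict

HIST_BITS = 4
MASK = (1 << HIST_BITS) - 1

def profile(trace):
    """trace: list of (branch_addr, taken) pairs from a training run."""
    counts = defaultdict(lambda: [0, 0])       # (addr, pattern) -> [not-taken, taken]
    hist = defaultdict(int)
    for addr, taken in trace:
        counts[(addr, hist[addr])][taken] += 1
        hist[addr] = ((hist[addr] << 1) | taken) & MASK
    # Preset prediction bit: taken iff taken count exceeds not-taken count.
    return {k: c[1] > c[0] for k, c in counts.items()}

def run(trace, preset):
    """Predict a trace using the preset table; history registers still shift."""
    hist, correct = defaultdict(int), 0
    for addr, taken in trace:
        pred = preset.get((addr, hist[addr]), True)    # unseen pattern: guess taken
        correct += (pred == taken)
        hist[addr] = ((hist[addr] << 1) | taken) & MASK
    return correct / len(trace)

# Hypothetical branch that strictly alternates taken / not taken.
alt = [(0x40, i % 2) for i in range(400)]
table = profile(alt)
print(run(alt, table))   # -> 0.9975: only the warm-up iteration mispredicts
```

On this alternating trace the pattern table is nearly perfect, where a single per-branch taken/not-taken bit would be right only half the time; this per-pattern information is what both Static Training and Two-Level Adaptive Training exploit.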

Figure 8: Prediction accuracy of Static Training schemes.

Figure 9: Prediction accuracy of Branch Target Buffer designs, BTFN, Always Taken, and the Profiling scheme.

5.3 Other Schemes

Figure 9 shows the simulation results of Lee and Smith's Branch Target Buffer designs, the Backward Taken and Forward Not-taken scheme (BTFN), the Always Taken scheme, and the profiling scheme. The Branch Target Buffer designs were simulated using the automata A1, A2, A3, A4, and Last-Time; only the results of the designs using A2 and Last-Time are shown in the figure, because the results of the designs using A1, A3, and A4 are similar to those using A2. Three buffer configurations, with 2 to 512 sets, were simulated. The designs using A2 are bounded at about 93 percent, lower than the Two-Level Adaptive Training schemes with practical HRT implementations, and the designs using Last-Time predict about 4 to 5 percent lower than the ones using A2. Some data points of the other schemes fall below 76 percent.

The BTFN scheme is effective for the loop-bound benchmarks but not for the other benchmarks. For the loop-bound benchmarks, like matrix300 and tomcatv, the prediction accuracy is as high as 98 percent; for the other benchmarks, however, it is often markedly lower, around 70 percent. The average accuracy of the BTFN scheme is approximately 69 percent. The Always Taken scheme is simple, but its accuracy changes from one benchmark to another; its average accuracy is about 60 percent.

The profiling scheme simulated here is to run the program once to accumulate the statistics of how many times each branch is taken and how many times it is not taken. The prediction bit in the branch opcode is set or cleared depending on whether the taken count of the branch is larger than the not-taken count or not; the prediction is made according to this prediction bit. This scheme is fairly simple and its run-time cost is low. Since the profiling for the Static Training schemes using different training and testing data sets is not complete, the average accuracy is not graphed. The average prediction accuracy of the profiling scheme is about 92.5 percent.

5.4 Comparison of Schemes

Figure 10 illustrates the comparison between the schemes mentioned above. The schemes are chosen on the basis of their prediction accuracy and lower costs. The 512-entry 4-way AHRT was chosen for all the uses of HRT, because it is simple enough to be implemented. At the top of the graph is the Two-Level Adaptive Training scheme, whose average prediction accuracy is about 97 percent. As can be seen from the graph, the Static Training scheme predicts about 1 to 5 percent lower than the top curve. The profiling scheme predicts almost as well as Lee and Smith's Branch Target Buffer design, which predicts around 92.5 percent. The scheme which predicts with the last result of the branch execution achieves about 89 percent accuracy.

Figure 10: Comparison of branch prediction schemes.
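The profiling scheme of Section 5.3 reduces to one preset bit per branch. A sketch under assumed branch addresses:

```python
# Sketch of the profiling scheme: one run counts taken vs. not-taken per
# branch, and a single prediction bit is encoded in the branch opcode
# (modeled here as a dict; the addresses are hypothetical).
from collections import Counter

def set_prediction_bits(trace):
    taken, total = Counter(), Counter()
    for addr, t in trace:
        taken[addr] += t
        total[addr] += 1
    # Bit is set iff the branch was taken more often than not.
    return {addr: taken[addr] > total[addr] - taken[addr] for addr in total}

def accuracy(trace, bits):
    hits = sum(bits[addr] == bool(t) for addr, t in trace)
    return hits / len(trace)

# One mostly-taken branch and one mostly-not-taken branch.
trace = [(0xA0, t) for t in [1]*90 + [0]*10] + \
        [(0xB4, t) for t in [0]*40 + [1]*20]
bits = set_prediction_bits(trace)
print(bits[0xA0], bits[0xB4])   # -> True False
print(accuracy(trace, bits))    # -> 0.8125, i.e. (90 + 40) / 160
```

Unlike Static Training, the bit ignores history patterns, which is one reason its accuracy tops out lower (about 92.5 percent on average in these simulations).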

6 Concluding Remarks

This paper proposes a new branch predictor, Two-Level Adaptive Training. The Two-Level Adaptive Training scheme predicts a branch by examining the history of the last n branches and the behavior of the branch for the last occurrences of that unique history pattern.

The Two-Level Adaptive Training schemes were simulated with three HRT configurations: the IHRT, which is an ideal table large enough to hold all static branches; the AHRT, which is a set-associative cache; and the HHRT, which is a hash table. The scheme using an AHRT usually has higher prediction accuracy than the scheme using an HHRT of the same size, because the HHRT has a higher miss rate; a scheme using a large enough AHRT can obtain accuracy equivalent to the scheme using the IHRT. The schemes were also simulated with various history register lengths; according to the simulation results, the prediction accuracy is usually improved by lengthening the history registers. In addition, several other dynamic and static prediction schemes were simulated: Lee and Smith's Branch Target Buffer designs, Static Training, Always Taken, Backward Taken and Forward Not-taken, and a simple profiling scheme.

As seen from the simulation results, the Two-Level Adaptive Training scheme has an average prediction accuracy of 97 percent on the nine benchmarks from the SPEC benchmark suite, which is about 4 percent better than the prediction accuracy of the most accurate of the other static or dynamic schemes. Since a prediction miss causes flushing of the speculative execution already in the pipeline, the number of pipeline flushes required by the other schemes is about 100 percent more than that required by Two-Level Adaptive Training; the higher prediction accuracy therefore means considerable performance improvement for a high-performance processor.

Deep-pipelining and superscalar execution are effective methods for exploiting instruction level parallelism in a single instruction stream to improve processor performance. The effectiveness of these machines, however, critically depends on a good branch predictor. Two-Level Adaptive Training Branch Prediction is proposed as a way to support high performance processors by minimizing the penalty associated with mispredicted branches.

References

[1] M. Butler, T-Y. Yeh, Y.N. Patt, M. Alsup, H. Scales, and M. Shebanow, "Single Instruction Stream Parallelism Is Greater Than Two," Proceedings of the 18th International Symposium on Computer Architecture, (May 1991), pp. 276-286.

[2] D. R. Kaeli and P. G. Emma, "Branch History Table Prediction of Moving Target Branches Due to Subroutine Returns," Proceedings of the 18th International Symposium on Computer Architecture, (May 1991), pp. 34-42.

[3] Tse-Yu Yeh, "Two-Level Adaptive Training Branch Prediction," Technical Report, University of Michigan, (1991).

[4] Motorola Inc., M88100 User's Manual, Phoenix, Arizona, (March 13, 1989).

[5] W.W. Hwu, T. M. Conte, and P. P. Chang, "Comparing Software and Hardware Schemes for Reducing the Cost of Branches," Proceedings of the 16th International Symposium on Computer Architecture, (May 1989).

[6] N.P. Jouppi and D. Wall, "Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines," Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, (April 1989), pp. 272-282.

[7] D. J. Lilja, "Reducing the Branch Penalty in Pipelined Processors," IEEE Computer, (July 1988), pp. 47-55.

[8] W.W. Hwu and Y.N. Patt, "Checkpoint Repair for Out-of-order Execution Machines," IEEE Transactions on Computers, (December 1987), pp. 1496-1514.

[9] P. G. Emma and E. S. Davidson, "Characterization of Branch and Data Dependencies in Programs for Evaluating Pipeline Performance," IEEE Transactions on Computers, (July 1987), pp. 859-876.

[10] J. A. DeRosa and H. Levy, "An Evaluation of Branch Architectures," Proceedings of the 14th International Symposium on Computer Architecture, (June 1987), pp. 10-16.

[11] D.R. Ditzel and H.R. McLellan, "Branch Folding in the CRISP Microprocessor: Reducing Branch Delay to Zero," Proceedings of the 14th International Symposium on Computer Architecture, (June 1987), pp. 2-9.

[12] S. McFarling and J. Hennessy, "Reducing the Cost of Branches," Proceedings of the 13th International Symposium on Computer Architecture, (1986), pp. 396-403.

[13] J. Lee and A. J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design," IEEE Computer, (January 1984), pp. 6-22.

[14] T.R. Gross and J. Hennessy, "Optimizing Delayed Branches," Proceedings of the 15th Annual Workshop on Microprogramming, (Oct. 1982), pp. 114-120.

[15] D.A. Patterson and C.H. Sequin, "RISC I: A Reduced Instruction Set VLSI Computer," Proceedings of the 8th International Symposium on Computer Architecture, (May 1981), pp. 443-458.

[16] J.E. Smith, "A Study of Branch Prediction Strategies," Proceedings of the 8th International Symposium on Computer Architecture, (May 1981), pp. 135-148.

[17] L.E. Shar and E.S. Davidson, "A Multiminiprocessor System Implemented Through Pipelining," IEEE Computer, (Feb. 1974), pp. 42-51.

[18] T. C. Chen, "Parallelism, Pipelining, and Computer Efficiency," Computer Design, Vol. 10, No. 1, (Jan. 1971), pp. 69-74.
