1 of 18
John Gibson, Albert Wang, CS 152 – Section 101
Tim Wong, John Truong
Final Project
I. Abstract
The goal of this project is to construct a working superscalar processor with
branch prediction. The memory module from lab 6 needed to be reworked to function
properly. For this phase we decided to emphasize robustness and functionality (i.e., a
working processor) rather than speed, so the memory was scaled back to a direct-mapped,
write-through cache.
Although the superscalar architecture itself was straightforward, the primary
complication lay in increasing the number of ports in the register file and cache to
support 2 pipelines. We introduced “striping” in the cache to handle this situation
with relatively few stalls.
Cache, Striping, Dual Issue – This part involved writing a direct-mapped,
write-through cache with bursting, striping instructions within the cache, and
adapting the cache to read two instructions at once.
Initial revision: Tim
Testing: Tim, Albert
Integration: Everybody
Testing: Everybody
Section 0: Superscalar
superscalar.sch
Because our 5-stage pipelined processor was already working reasonably well,
extending it to a superscalar architecture was relatively straightforward. The two
pipelines are referred to as the EVEN and ODD pipelines. Alternatively, the control
signals distinguish the two pipelines as Pipeline 1 (EVEN) and Pipeline 2 (ODD). This is
slightly confusing, but we were able to keep the names straight between ourselves,
and we decided that going back to change all the names would be tedious and could
introduce annoying bugs if we were not careful.
Each pipeline maintains its own copies of the instructions, PCs, and control signals
it processes. The goal is to isolate each pipeline as much as possible in order to
simplify debugging and minimize complexity.
With this project, we had the opportunity to use many of the lessons we learned from
Lab 6’s non-functioning cache. Most notably, we kept the “Keep It Simple, Stupid” motto
in mind throughout the design process. Because we wanted to reduce the complexity of
our design, we decided to limit the functionality of the pipelines. For instance, all branch
and jump instructions must be processed in the EVEN pipeline, whereas all memory
instructions must be processed in the ODD pipeline. We also kept an invariant that the
earlier instruction must always be in the EVEN pipeline. The rationale behind this
decision was that we wanted to keep the pipelines “synched” so that forwarding, hazards,
and prediction mechanisms would be easier to design and test. Although this invariant
inevitably increases our CPI, our goal was to have a working processor first and then
include additional “features.” Keeping this in mind, we tried to design our processor so
that it would be easy to integrate optimizations later.
Restricting the pipeline reduced the number of corner cases we had to worry about.
The “branch pipeline” was intentionally set as the EVEN pipeline (the earlier one), so
that branch delay slots could be handled more cleanly. Since branch and jump
instructions will always be sent to the EVEN pipeline with their delay slots in the ODD
pipeline, our distributor doesn't have to keep state and remember that a delay slot
instruction has to be fetched.
Restricting the pipelines also reduced the complexity of forwarding between the
pipelines, because data does not need to be forwarded to the memory stage of the EVEN
pipeline, nor does data have to be forwarded to the decode stage of the ODD pipeline.
When multiple components request a stall or a bubble, the stall arbiter decides which
stall takes precedence. Until the final project, stalls had been handled in an ad hoc
manner with several simple logic gates and latch signals. While the ad hoc system was
easy to use for lab 5 (only the hazard unit could stall, so no arbitration was necessary), we
began to see stalling issues in lab 6 when we created two additional components (the data
and instruction caches) that needed to stall the processor. However, with a few more
gates we were able to retain our old stalling system. Unfortunately, this system became
inadequate during the development of lab 7 when we created three new stalling signals
that needed to be handled. The first is bubble, which is asserted when the instruction in
the decode stage of the ODD pipeline is dependent upon the instruction in the decode
stage of the EVEN pipeline. The second and third signals are jumpflush and branchflush,
which are asserted when a jump is detected or a bad guess is made by the branch predictor.
This proved to be far too many signals to handle with simple logic gates so a new module
was created to give preference to the various signals.
pipeline. Finally it fills the decode instruction of the ODD pipeline with the
instruction from the even fetched instruction unless the even fetched instruction is
a jump or a branch.
5. Jump Flush / Branch Flush (Only one of these should ever be asserted at once, so
they have equal priority) – The flush signals reset the instructions entering the
decode stage.
The stall arbiter selects the stall signal with the highest priority and propagates it to
the rest of the processor. The actual reset and write-enable signals to the instruction and
PC registers are determined by a set of OR and NOR gates fed with the appropriate stall
signals. This probably should have been done in a separate behavioral Verilog module,
but because the signals could be decided with a single level of gates, we decided it was
easiest to place the gates directly on the pipeline.
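The arbiter's job can be modeled as a simple priority encoder over the competing requests. The sketch below is a behavioral model in Python (not the Verilog module itself), and the exact priority ordering shown is illustrative, since part of the report's own numbered priority list is missing; the signal names are taken from the ones the text mentions.

```python
# Behavioral sketch of a stall arbiter: several components may request a
# stall, bubble, or flush in the same cycle, and the arbiter forwards only
# the highest-priority request to the rest of the processor.
# NOTE: this ordering is an assumption for illustration only.
PRIORITY = ["dcache_stall", "icache_stall", "hazard_stall",
            "bubble", "jump_flush", "branch_flush"]

def arbitrate(requests):
    """Return the highest-priority request asserted this cycle, or None."""
    for signal in PRIORITY:
        if requests.get(signal):
            return signal
    return None
```

For example, if `bubble` and `branch_flush` are asserted in the same cycle, only `bubble` propagates; everything lower in the list is suppressed for that cycle.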
A superscalar processor requires two new instructions every cycle, so we had
to modify our 32-bit-wide write-through cache to create a wider cache that could provide
64 bits of data to the pipeline. Fortunately, the instruction cache never has to accept
stores from the pipeline, which made developing the new instruction cache substantially
easier.
We widened the instruction cache to fetch two instructions in one cycle. The original
design was a BRAM with a single 64-bit port. We used Coregen to create a BRAM with
64-bit entries, and we halved the number of entries to keep the size of the instruction
cache constant. This had the side effect of making our loads from RAM four cycles faster,
because a cache line could now be filled in just four writes to cache instead of eight. The
downside to this design was that we could only load even/odd pairs of words, not
odd/even pairs. Unfortunately, the distributor is designed to fetch words in odd/even
pairs as well as even/odd pairs (because of a jump/branch, or a stall caused by a memory
instruction in the EVEN pipeline). Without modification, we would have incurred a one-cycle
penalty every time we wanted an odd/even pair. We thought of several solutions to
this problem.
First, Prof. Kubi suggested that we build a stream buffer that would always be a few
cycles ahead and then we could select both odd/even and even/odd pairs as long as the
buffer was full. To keep the buffer full, we would have had to keep the fetch stage a few
cycles ahead of the rest of the processor. This would have made branches and jumps very
costly, though our branch prediction unit would have offset some of this penalty. Unfortunately,
managing a stream buffer sounded complicated; because we wanted to keep our design as
simple as possible, we decided not to use this approach.
An alternative would have been to use a dual-ported BRAM with 32-bit port widths,
which would have allowed us to select any two words simultaneously. However,
given Prof. Kubi's disdain for dual-ported BRAMs and the fact that using dual porting
here would rob us of the opportunity to use dual-porting to enhance the loading speed of
DRAM requests, we decided not to go this route either.
The final option was to stripe the instructions across two separate BRAMs with one
containing odd words and the other containing even words. This way we could always
select both an odd and an even word in a single cycle regardless of the word’s position in
cache. This modification was straightforward and we were able to make the change
without touching the cache controller.
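The striping scheme can be illustrated with a small behavioral model (Python, not the Verilog/BRAM implementation): even word addresses live in one bank, odd word addresses in the other, so any adjacent pair is readable in one cycle regardless of its alignment.

```python
# Sketch of instruction striping across two single-ported banks.
# Even word addresses go to one bank, odd word addresses to the other,
# so both an even/odd and an odd/even pair can be read in one "cycle".

def make_banks(words):
    even = words[0::2]   # word addresses 0, 2, 4, ...
    odd = words[1::2]    # word addresses 1, 3, 5, ...
    return even, odd

def fetch_pair(even, odd, i):
    """Read instruction words i and i+1 together, one access per bank."""
    if i % 2 == 0:
        return even[i // 2], odd[i // 2]          # even/odd pair
    else:
        return odd[i // 2], even[i // 2 + 1]      # odd/even pair
```

The key point the model captures is that words i and i+1 always fall in different banks, so the two reads never conflict for a port.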
The memory subsystem is just a schematic that wires together several major
components. It contains the Memory Mapped I/O, the Instruction Memory Wrapper, the
Instruction Cache, the Data Cache, the Cache Arbiter, and the DRAM Interface as well as
a few shift registers and logic gates. The MM I/O unit sits above the Data Cache and is
in charge of intercepting requests to the Memory-Mapped I/O space before they can reach
the Data Cache. The Instruction Memory Wrapper performs a similar task. It prevents
the Instruction Cache from attempting to load instructions while the Boot ROM code is
executing. The Cache Arbiter is connected to both the Instruction Cache and the Data
Cache and routes their requests to the DRAM Interface. Data transfers to and from
DRAM are done through the shift registers. We decided to implement the Memory
Subsystem as a schematic because it makes it easier to visualize the connections.
Note that we re-worked Lab 6's cache architecture so that it functions correctly. In doing so,
we simplified it to a direct-mapped, write-through design.
Section 4: Instruction Distribution
superdistributor.v
With dual issue and the restrictions we put on our pipelines concerning memory and
branching instructions, the processor needs a way to distribute instructions so as to avoid
structural hazards; that is the purpose of the distributor module. It does a simple decode of the
opcode and funct code of the instructions coming from the instruction cache. If the earlier
instruction coming from cache is a memory instruction, the distributor sends it down the ODD
pipeline, sends a NOP down the EVEN pipeline, and requests that the cache load the two
instructions following the memory instruction. If a branch or jump is detected as the later
instruction, the earlier instruction is sent down the EVEN pipeline, a NOP down the ODD
pipeline, and the distributor requests that the branch or jump be fetched again with its
delay slot instruction (Figure 4.1). Refetching the branch or jump together with its delay
slot helps particularly when dealing with branch prediction: normally, when a branch is
predicted after the instruction is fetched, the newly predicted PC is immediately sent into
the instruction cache. If a branch happened to be in the ODD pipeline, we would have the
problem of needing to fetch the delay slot instruction before predicting the branch.
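The routing rules above can be sketched as a small Python model (instructions are tagged strings here rather than decoded MIPS opcode/funct fields, and the opcode lists are illustrative):

```python
# Simplified model of the distributor's routing rules: memory ops must
# issue in the ODD pipeline, branches/jumps in the EVEN pipeline, and
# the earlier instruction always goes EVEN.
NOP = "nop"

def is_mem(instr):
    return instr.split()[0] in ("lw", "sw")

def is_branch_or_jump(instr):
    return instr.split()[0] in ("beq", "bne", "j", "jal")

def distribute(earlier, later):
    """Route one fetched pair into (even_slot, odd_slot, consumed),
    where `consumed` is how many of the pair actually issued."""
    if is_mem(earlier):
        # A memory op must issue in ODD; EVEN gets a NOP, and the cache
        # is asked for the two instructions after the memory op.
        return NOP, earlier, 1
    if is_branch_or_jump(later):
        # A branch/jump must issue in EVEN paired with its delay slot,
        # so it is refetched together with the instruction after it.
        return earlier, NOP, 1
    return earlier, later, 2
```

Note that a branch arriving as the *earlier* instruction needs no special case: it naturally lands in EVEN with its delay slot in ODD.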
We tested this unit with directed vectors. Instructions were placed in a file for our
testbench to read from. The testbench imitated the cache, sending the instructions to the
distributor. Output from the distributor was displayed and verified.
Section 5: Forwarding
superforward.v
The forwarding unit for our superscalar processor was built primarily from the
forwarding unit of our 5-stage pipeline. Our design decision to have branches and
jumps serviced only in the EVEN pipeline meant that we did not have to forward to the
ID stage of the ODD pipeline. Likewise, since memory instructions can only be serviced
in the ODD pipeline, we did not have to forward to the MEM stage of the EVEN
pipeline. Other than this, for each conditional clause that determined the selection of a
forwarding mux, all that needed to be done was to add two additional else clauses that
took into account the extra forwarding sources and the order in which forwarding should
be considered. For example, among instructions in the same stage, the forwarded data
from the instruction in the ODD pipeline should take precedence over the data from the
EVEN pipeline, because the ODD pipeline always has the later instruction and thus
should always have the most recent data. (Figure 5.1)
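The precedence rule can be sketched as follows (a Python model of one forwarding-mux decision, not the Verilog; register numbers and tuple shapes are illustrative):

```python
# Sketch of forwarding-mux priority within one stage: when both
# pipelines' instructions in the same stage write the register we need,
# the ODD pipeline wins, because it always holds the later instruction
# and therefore the most recent value.

def select_forward(reg, even_wb, odd_wb):
    """even_wb / odd_wb: (dest_reg, value) or None for each pipeline's
    instruction in the forwarding stage; returns the value to forward,
    or None if neither instruction produces `reg`."""
    if odd_wb and odd_wb[0] == reg:      # later instruction checked first
        return odd_wb[1]
    if even_wb and even_wb[0] == reg:
        return even_wb[1]
    return None
```

This is exactly the "extra else clause" shape described above: the ODD-pipeline source is simply tested before the EVEN-pipeline one.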
Section 6: Hazards
superhazard.v (handles stall signals to the pipeline when a hazard is detected)
superhazardbrains.v (detects hazards within and between pipelines)
superdepend.v (detects dependencies within the same cycle)
The superscalar processor must deal with many of the same hazards that occur in the
5-stage pipelined processor of labs 5 and 6, but, like the forwarding unit, it must also
consider the other pipeline.
Read After Write (RAW) hazards:
Just as in the regular 5-stage pipeline, forwarding cannot solve all RAW hazards.
Fortunately, these hazards are easily enumerated because they only occur when branch,
jump, or memory instructions are in the pipeline(s). All that needed to be done was
to adapt existing code to identify which pipeline these instructions occur in and to
duplicate code to consider instructions in the opposite pipeline. The fact that our design
stipulated that only branches and jumps could occur in the EVEN pipeline and that only
memory instructions could occur in the ODD pipeline made this task easier.
A special case occurs when two dependent instructions appear in the
decode stage. In this case the instruction in the ODD pipeline is dependent on the
instruction in the EVEN pipeline, because the EVEN pipeline always contains the
earlier instruction. In our original design, the hazard unit stalls the fetch stage, sends the
earlier instruction down the EVEN pipeline, sends a NOP down the ODD pipeline, and
asserts a NOP in the EVEN pipeline in the decode stage. In this way, the problem is
reduced to a forwarding issue and the processor is unstalled. We initially chose this design
because we thought that handling the dependency in the distributor unit would force the
distributor to do extra decoding and logic in the fetch stage, which we wanted to
avoid in order to keep from extending the cycle time for that stage.
However, we found a way around this. In our new design, the processor still detects
the dependency in the decode stage, but instead of stalling the processor and NOPing the
EVEN pipeline, the hazard unit sends a request to the distributor to send the dependent
instruction to the EVEN pipeline's decode stage and to send the earlier of the two
instructions that the distributor has already fetched from the cache to the ODD pipeline's
decode stage (see Figure 6.1). This only works, however, when the dependent instruction
is not a memory instruction and the earlier instruction is not a branch or jump instruction.
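The dependence check and the guard on the optimized swap can be sketched in Python (a model of the logic, not superdepend.v itself; instructions are modeled as a written register plus a list of read registers):

```python
# Sketch of the same-cycle dependence check: the ODD decode instruction
# depends on the EVEN one when it reads a register the EVEN instruction
# writes. Instructions are modeled as (dest_reg_or_None, [source_regs]).

def depends(even_instr, odd_instr):
    dest, _ = even_instr
    _, sources = odd_instr
    return dest is not None and dest in sources

def can_swap(dependent_is_mem, moved_is_branch_or_jump):
    # The optimized distributor may move the dependent instruction into
    # the EVEN decode slot only if it is not a memory op and the already-
    # fetched instruction moved into ODD is not a branch or jump.
    return not dependent_is_mem and not moved_is_branch_or_jump
```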
superdepend.v
Since our optimized distributor was added late in the project, we only tested this
distribution scheme with integrated testing. Furthermore, since we were simultaneously
running versions of the processor on the board and in simulation, this optimization ran
only in simulation, not on the board.
Figure 6.1: Optimized Distributor
Section 7: Branch Prediction
branchpredictor.v
The branch predictor is a table with the BTB and BHT sitting side by side. It is a
fully associative table with 8 entries. We chose this configuration over a large direct-mapped
table because we figured that, for the types of programs we would be running,
any loops we came across would not be nested deeply enough to warrant such a large
history table. See Figure 7.1 for a representation of the branch predictor.
The BHT uses 2-bit saturating counters as its predictors. A value of 0 or 1 means the
branch is predicted to be taken, whereas a value of 2 or 3 predicts that the branch will not
be taken. Figure 7.2 shows how the 2-bit predictors are updated.
The replacement policy for the branch predictor table is one that merely replaces
entries in sequence. It fills in new branch entries in order from 1 through 8 using a
counter, and then jumps back to 1 again. This design was chosen over a policy such as
LRU for simplicity.
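The counter encoding and the sequential replacement policy can be modeled together in a short Python sketch. The exact state transitions are defined by Figure 7.2; the update rule below assumes the usual saturate-toward-the-outcome behavior, and the initial counter value for a newly inserted branch (weakly not-taken) is an assumption:

```python
# Sketch of the BHT's 2-bit saturating counters using the report's
# encoding (0/1 predict taken, 2/3 predict not taken), plus the simple
# sequential replacement of the 8-entry fully associative table.

def predict_taken(counter):
    return counter <= 1                 # 0 or 1 => predict taken

def update(counter, taken):
    # Saturate toward 0 when taken, toward 3 when not taken
    # (assumed direction; the authoritative diagram is Figure 7.2).
    return max(0, counter - 1) if taken else min(3, counter + 1)

class RoundRobinTable:
    """8-entry table whose entries are replaced in sequence, not by LRU."""
    def __init__(self, size=8):
        self.entries = [None] * size
        self.next = 0
    def insert(self, pc, counter=2):    # assumed: new branches start not-taken
        self.entries[self.next] = (pc, counter)
        self.next = (self.next + 1) % len(self.entries)
```

The round-robin pointer simply wraps after the eighth entry, which is why this policy needs only a small counter instead of per-entry usage tracking.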
Here are some things to notice about our branch predictor. First, the branch predictor
has the ultimate say on what the next PC will be. Even the distributor has to go through the
branch predictor, in case we have just fetched a branch or must recover from a
mispredicted branch. The branch predictor is also responsible for flushing the IF/ID stage
when a branch has been predicted incorrectly. The first time the current PC of a branch is
used to fetch the instruction, the branch predictor will not have the instruction in its table,
and the branch is predicted to be not taken. If the branch is resolved in the ID stage as a
branch that should be taken, then the predictor flushes the IF/ID stage and sends the new
branch PC to the instruction cache. In this way, the branch predictor always has
the final say in what the next PC should be.
We wrote a test bench to test our branch predictor, using directed vectors.
branchpredictor_tf.tf
Figure 7.2: Branch Predictor State Diagram
Knowing that the cache would be the origin of many of our problems, we decided that
time could be better spent if we split our efforts and tested the processor with and
without the cache simultaneously. To do this, we made an SRAM version of the processor
that had SRAM blocks, as in lab 5, in place of the cache. This way we could test the
distribution, forwarding, and hazard units separately from the cache.
As in other labs, we automated the tests as much as possible. After testing for some
basic functionality (add, beq, ori), we run one or several instructions designed to test
specific functionality. After each "micro-test," the result is compared to the desired result.
If the result is wrong, a flag is set in a designated register. After completion of the
tests, the flag register is output to I/O space and/or a break signal is set, indicating a
problem.
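The micro-test pattern the assembly tests follow can be modeled in a few lines of Python (the function and test values here are illustrative, not taken from the actual test programs):

```python
# Model of the "micro-test" pattern: each test compares a computed
# result against the expected one and, on mismatch, sets a bit in a
# designated flag register, which is reported at the end of the run.

def run_micro_tests(tests):
    """tests: list of (compute, expected) pairs; returns the flag
    register, one bit per failed micro-test (0 means all passed)."""
    flags = 0
    for i, (compute, expected) in enumerate(tests):
        if compute() != expected:
            flags |= 1 << i        # record which micro-test failed
    return flags
```

Encoding failures as bits means a single register dumped to I/O space identifies exactly which micro-tests went wrong.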
While performing integrated testing, after each set of large changes, we re-ran our
suite of regression tests to verify that no new bugs were introduced. The regression tests
consisted of all the tests from the prior labs, the new tests provided by the TAs, and a new
superscalar test that tested forwarding between the different pipelines and ensured the
correct distribution of instructions was being sent down the pipelines.
Initially, we re-ran the original multicycle pipelined processor test (mipstest.s) from
Lab 5 to test the processor. We then created two new tests specifically for the superscalar
architecture (ss_test1.s, ss_test2.s), which were designed to test inter- and intra-pipeline
forwarding, the distributor, and the branch predictor modules. For instance, they ensure
that only memory operations are sent down the ODD pipeline and that branch/jump
instructions are sent down only the EVEN pipeline.
Noteworthy Bugs
As expected, our integrated testing revealed timing and stall-priority problems with
loads across cache lines. These bugs were painstakingly solved by John G., who
wrote the supercontroller and stall arbiter that handle those cases.
Our integrated tests also turned up a forwarding bug. The quick_sort test
revealed that a forwarding path between the end of the MEM stage and the EX stage
for the store-word register was missing. This path was not a problem in lab 5,
but became a problem in lab 6 when we reworked the timing on memory instructions for
the data cache.
V. Results
About half an hour before the final project demo, we finally got a version
of the processor running corner.mem on the board. This version has a DRAM
clock running at 27 MHz and a processor clock running at 4 MHz. Unfortunately, the
debouncers are implemented strangely, so some "pre-processor initialization"
(hammering the pushbuttons – our buttons toggle instead of exhibiting the normal
behavior) needs to be done. The main problem we discovered was the clock boundary
between the DRAM and main processor interfaces. Our primary concern is that the
arbiter sometimes fails to read the DRAMdone signal output by the DRAMinterface.
Device utilization summary:
Number of External GCLKIOBs 1 out of 4 25%
Number of External IOBs 173 out of 512 33%
Number of LOCed External IOBs 153 out of 173 88%
Number of BLOCKRAMs 52 out of 160 32%
Number of SLICEs 7134 out of 19200 37%
Number of GCLKs 3 out of 4 75%
Unfortunately, we were unable to re-run all the tests for the perfect-memory
processor. A large number of clock cycles were devoted to memory stalls. As a result,
branch prediction, jump prediction (tested for performance, but not included because we
had insufficient time to verify its correctness), and the bubble optimization had very
little effect on performance.
Branch prediction most likely had little effect on these tests either
because of the lack of branches or because the branch history table needed to be
larger. Also, some last-minute tests revealed that the bubble optimization decreased the
total cycle count of Jason Ding's quick_sort from 5000 cycles (base) to 4500. Most likely,
however, the largest performance gains would have come from increasing the set
associativity of the cache or moving to a write-back cache.
VI. Conclusion
Taking the extra time to properly and thoroughly design the entire superscalar
processor at the beginning definitely paid off for this project. The primary problem we
needed to resolve was the cache-line boundary issue, which we solved with a
supercontroller. Very few bugs had to do with the parallel pipelines; rather, the
majority of our time was spent implementing a write-through cache that supported dual
issue and loads across cache lines, and attempting to get the processor working on the
Calinx boards. However, by sticking to a simple design, we ended up with a
functioning processor in simulation and a somewhat functioning processor on the board,
as well as a few optimizations.
VII. Hours
John Gibson: 55.5 hours
John Truong: 46 hours
Albert Wang: 75.5 hours
Tim Wong: 60 hours
VIII. Appendix
Online notebook
online_notebook_final.txt
Schematics
Superscalar Datapath
superscalar.sch
Memory Subsystem
memorysubsystemWT.sch
Data Cache
datacachewt.sch
Instruction Cache
instcacheWT.sch
Verilog Modules
Branch Predictor
branchpredictor.v
Cache Controller
cachecontrolWT.v
Distributor
superdistributor.v
Forwarding Logic
superforward.v
Hazard Logic
superhazard.v
superhazardbrains.v
superdepend.v
Stall Arbiter
stallarbiter.v