1 of 18
John Gibson, Albert Wang, CS 152 – Section 101
Tim Wong, John Truong
Final Project
I. Abstract
The goal of this project is to construct a working superscalar processor with
branch prediction. The memory module from lab 6 needed to be reworked to function
properly. For this phase we decided to emphasize robustness and functionality (i.e., a
working processor) rather than speed, so the memory was scaled back to a direct-mapped,
write-through cache.
Although the superscalar architecture itself was straightforward, the primary
complication lay in increasing the number of ports in the register file and cache to
support 2 pipelines. We introduced “striping” in the cache to handle this situation
with relatively few stalls.
Cache, Striping, Dual Issue – This part involved writing a direct-mapped,
write-through cache with bursting, striping instructions within the cache, and
adapting the cache to read two instructions at once.
Initial revision: Tim
Testing: Tim, Albert
Integration: Everybody
Testing: Everybody
Section 0: Superscalar
superscalar.sch
Because our 5-stage pipelined processor was already working reasonably well,
extending it to a superscalar architecture was relatively straightforward. The two
pipelines are referred to as the EVEN and ODD pipelines. Alternatively, the control
signals distinguish the two pipelines as Pipeline 1 (EVEN) and Pipeline 2 (ODD). This is
slightly confusing, but we were able to keep the names straight between ourselves,
and we decided that going back to change all the names would be tedious and could
introduce annoying bugs if we were not careful.
Each pipeline maintains its own copies of the instructions, PCs, and control signals
it processes. The goal is to isolate each pipeline as much as possible in order to
simplify debugging and minimize complexity.
With this project, we had the opportunity to use many of the lessons we learned from
Lab 6’s non-functioning cache. Most notably, we kept the “Keep It Simple, Stupid” motto
in mind throughout the design process. Because we wanted to reduce the complexity of
our design, we decided to limit the functionality of the pipelines. For instance, all branch
and jump instructions must be processed in the EVEN pipeline, whereas all memory
instructions must be processed in the ODD pipeline. We also kept an invariant that the
earlier instruction must always be in the EVEN pipeline. The rationale behind this
decision was that we wanted to keep the pipelines “synched” so that forwarding, hazards,
and prediction mechanisms would be easier to design and test. Although this invariant
inevitably increases our CPI, our goal was to have a working processor first and then
include additional “features.” Keeping this in mind, we tried to design our processor so
that it would be easy to integrate optimizations later.
Restricting the pipeline reduced the number of corner cases we had to worry about.
The “branch pipeline” was intentionally set as the EVEN pipeline (the earlier one), so
that branch delay slots could be handled more cleanly. Since branch and jump
instructions will always be sent to the EVEN pipeline with their delay slots in the ODD
pipeline, our distributor doesn't have to keep state and remember that a delay slot
instruction has to be fetched.
Restricting the pipelines also reduced the complexity of forwarding between the
pipelines, because data does not need to be forwarded to the memory stage of the EVEN
pipeline, nor does data have to be forwarded to the decode stage of the ODD pipeline.
When multiple components request a stall or a bubble, the stall arbiter decides which
stall takes precedence. Until the final project, stalls had been handled in an ad hoc
manner with several simple logic gates and latch signals. While the ad hoc system was
easy to use for lab 5 (only the hazard unit could stall, so no arbitration was necessary), we
began to see stalling issues in lab 6 when we created two additional components (the data
and instruction caches) that needed to stall the processor. However, with a few more
gates we were able to retain our old stalling system. Unfortunately, this system became
inadequate during the development of lab 7 when we created three new stalling signals
that needed to be handled. The first is bubble, which is asserted when the instruction in
the decode stage of the ODD pipeline is dependent upon the instruction in the decode
stage of the EVEN pipeline. The second and third signals are jumpflush and branchflush,
which are asserted when a jump is detected or a bad guess is made by the branch predictor.
This proved to be far too many signals to handle with simple logic gates so a new module
was created to give preference to the various signals.
pipeline. Finally it fills the decode instruction of the ODD pipeline with the
instruction from the even fetched instruction unless the even fetched instruction is
a jump or a branch.
5. Jump Flush / Branch Flush (Only one of these should ever be asserted at once, so
they have equal priority) – The flush signals reset the instructions entering the
decode stage.
The stall arbiter selects the stall signal with the highest priority and propagates it to
the rest of the processor. The actual reset and write-enable signals to the instruction and
PC registers are determined by a set of OR and NOR gates fed with the appropriate stall
signals. This probably should have been done in a separate behavioral Verilog module,
but because the signals could be decided with a single level of gates, we decided it was
easiest to place the gates directly on the pipeline.
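The arbiter's job can be modeled as a simple priority encoder over the competing requests. The sketch below is a behavioral model in Python (not the Verilog module itself), and the exact priority ordering shown is illustrative, since part of the report's own numbered priority list is missing; the signal names are taken from the ones the text mentions.

```python
# Behavioral sketch of a stall arbiter: several components may request a
# stall, bubble, or flush in the same cycle, and the arbiter forwards only
# the highest-priority request to the rest of the processor.
# NOTE: this ordering is an assumption for illustration only.
PRIORITY = ["dcache_stall", "icache_stall", "hazard_stall",
            "bubble", "jump_flush", "branch_flush"]

def arbitrate(requests):
    """Return the highest-priority request asserted this cycle, or None."""
    for signal in PRIORITY:
        if requests.get(signal):
            return signal
    return None
```

For example, if `bubble` and `branch_flush` are asserted in the same cycle, only `bubble` propagates; everything lower in the list is suppressed for that cycle.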
A superscalar processor requires two new instructions every cycle, so we had
to modify our 32-bit-wide write-through cache to create a wider cache that could provide
64 bits of data to the pipeline. Fortunately, the instruction cache never has to accept
stores from the pipeline, which made developing the new instruction cache substantially
easier.
We widened the instruction cache to fetch two instructions in one cycle. The original
design was a BRAM with a single 64-bit port. We used Coregen to create a BRAM with
64-bit entries, and we halved the number of entries to keep the size of the instruction
cache constant. This had the side effect of making our loads from RAM four cycles faster,
because a cache line could now be filled in just four writes to cache instead of eight. The
downside to this design was that we could only load even/odd pairs of words, not
odd/even pairs. Unfortunately, the distributor is designed to fetch words in odd/even
pairs as well as even/odd pairs (because of a jump/branch, or a stall caused by a memory
instruction in the EVEN pipeline). Without modification, we would have incurred a one-cycle
penalty every time we wanted an odd/even pair. We thought of several solutions to
this problem.
First, Prof. Kubi suggested that we build a stream buffer that would always be a few
cycles ahead and then we could select both odd/even and even/odd pairs as long as the
buffer was full. To keep the buffer full, we would have had to keep the fetch stage a few
cycles ahead of the rest of the processor. This would have made branches and jumps very
costly, though our branch prediction unit would have offset some of this penalty. Unfortunately,
managing a stream buffer sounded complicated; because we wanted to keep our design as
simple as possible, we decided not to use this approach.
An alternative would have been to use a dual-ported BRAM with 32-bit port widths,
which would have allowed us to select any two words simultaneously. However,
given Prof. Kubi's disdain for dual-ported BRAMs and the fact that using dual porting
here would rob us of the opportunity to use dual-porting to enhance the loading speed of
DRAM requests, we decided not to go this route either.
The final option was to stripe the instructions across two separate BRAMs with one
containing odd words and the other containing even words. This way we could always
select both an odd and an even word in a single cycle regardless of the word’s position in
cache. This modification was straightforward and we were able to make the change
without touching the cache controller.
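The striping scheme can be illustrated with a small behavioral model (Python, not the Verilog/BRAM implementation): even word addresses live in one bank, odd word addresses in the other, so any adjacent pair is readable in one cycle regardless of its alignment.

```python
# Sketch of instruction striping across two single-ported banks.
# Even word addresses go to one bank, odd word addresses to the other,
# so both an even/odd and an odd/even pair can be read in one "cycle".

def make_banks(words):
    even = words[0::2]   # word addresses 0, 2, 4, ...
    odd = words[1::2]    # word addresses 1, 3, 5, ...
    return even, odd

def fetch_pair(even, odd, i):
    """Read instruction words i and i+1 together, one access per bank."""
    if i % 2 == 0:
        return even[i // 2], odd[i // 2]          # even/odd pair
    else:
        return odd[i // 2], even[i // 2 + 1]      # odd/even pair
```

The key point the model captures is that words i and i+1 always fall in different banks, so the two reads never conflict for a port.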
The memory subsystem is just a schematic that wires together several major
components. It contains the Memory Mapped I/O, the Instruction Memory Wrapper, the
Instruction Cache, the Data Cache, the Cache Arbiter, and the DRAM Interface as well as
a few shift registers and logic gates. The MM I/O unit sits above the Data Cache and is
in charge of intercepting requests to the Memory-Mapped I/O space before they can reach
the Data Cache. The Instruction Memory Wrapper performs a similar task. It prevents
the Instruction Cache from attempting to load instructions while the Boot ROM code is
executing. The Cache Arbiter is connected to both the Instruction Cache and the Data
Cache and routes their requests to the DRAM Interface. Data transfers to and from
DRAM are done through the shift registers. We decided to implement the Memory
Subsystem as a schematic because it makes it easier to visualize the connections.
Note that we re-worked Lab 6's cache architecture so that it functions correctly. In doing so,
we simplified it to a direct-mapped, write-through design.
Section 4: Instruction Distribution
superdistributor.v
With dual issue and the restrictions we put on our pipelines concerning memory and
branching instructions, the processor needs a way to distribute instructions so as to avoid
structural hazards; that is the purpose of the distributor module. It does a simple decode of the
opcode and funct code of the instructions coming from the instruction cache. If the earlier
instruction coming from cache is a memory instruction, the distributor sends it down the ODD
pipeline, sends a NOP down the EVEN pipeline, and requests that the cache load the two
instructions following the memory instruction. If a branch or jump is detected as the later
instruction, the earlier instruction is sent down the EVEN pipeline, a NOP down the ODD
pipeline, and the distributor requests that the branch or jump be fetched again with its
delay slot instruction (Figure 4.1). Refetching the branch or jump together with its delay
slot helps particularly when dealing with branch prediction: normally, when a branch is
predicted after the instruction is fetched, the newly predicted PC is immediately sent into
the instruction cache. If a branch happened to be in the ODD pipeline, we would have the
problem of needing to fetch the delay slot instruction before predicting the branch.
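The routing rules above can be sketched as a small Python model (instructions are tagged strings here rather than decoded MIPS opcode/funct fields, and the opcode lists are illustrative):

```python
# Simplified model of the distributor's routing rules: memory ops must
# issue in the ODD pipeline, branches/jumps in the EVEN pipeline, and
# the earlier instruction always goes EVEN.
NOP = "nop"

def is_mem(instr):
    return instr.split()[0] in ("lw", "sw")

def is_branch_or_jump(instr):
    return instr.split()[0] in ("beq", "bne", "j", "jal")

def distribute(earlier, later):
    """Route one fetched pair into (even_slot, odd_slot, consumed),
    where `consumed` is how many of the pair actually issued."""
    if is_mem(earlier):
        # A memory op must issue in ODD; EVEN gets a NOP, and the cache
        # is asked for the two instructions after the memory op.
        return NOP, earlier, 1
    if is_branch_or_jump(later):
        # A branch/jump must issue in EVEN paired with its delay slot,
        # so it is refetched together with the instruction after it.
        return earlier, NOP, 1
    return earlier, later, 2
```

Note that a branch arriving as the *earlier* instruction needs no special case: it naturally lands in EVEN with its delay slot in ODD.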
We tested this unit with directed vectors. Instructions were placed in a file for our
testbench to read from. The testbench imitated the cache, sending the instructions to the
distributor. Output from the distributor was displayed and verified.
Section 5: Forwarding
superforward.v
The forwarding unit for our superscalar processor was built primarily from the
forwarding unit of our 5-stage pipeline. Our design decision to have branches and
jumps serviced only in the EVEN pipeline meant that we did not have to forward to the
ID stage of the ODD pipeline. Likewise, since memory instructions can only be serviced
in the ODD pipeline, we did not have to forward to the MEM stage of the EVEN
pipeline. Other than this, for each conditional clause that determined the selection of a
forwarding mux, all that needed to be done was to add two additional else clauses that
took into account the extra forwarding sources and the order in which forwarding should
be considered. For example, among instructions in the same stage, the forwarded data
from the instruction in the ODD pipeline should take precedence over the data from the
EVEN pipeline, because the ODD pipeline always has the later instruction and thus
should always have the most recent data. (Figure 5.1)
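The precedence rule can be sketched as follows (a Python model of one forwarding-mux decision, not the Verilog; register numbers and tuple shapes are illustrative):

```python
# Sketch of forwarding-mux priority within one stage: when both
# pipelines' instructions in the same stage write the register we need,
# the ODD pipeline wins, because it always holds the later instruction
# and therefore the most recent value.

def select_forward(reg, even_wb, odd_wb):
    """even_wb / odd_wb: (dest_reg, value) or None for each pipeline's
    instruction in the forwarding stage; returns the value to forward,
    or None if neither instruction produces `reg`."""
    if odd_wb and odd_wb[0] == reg:      # later instruction checked first
        return odd_wb[1]
    if even_wb and even_wb[0] == reg:
        return even_wb[1]
    return None
```

This is exactly the "extra else clause" shape described above: the ODD-pipeline source is simply tested before the EVEN-pipeline one.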
Section 6: Hazards
superhazard.v (handles stall signals to the pipeline when a hazard is detected)
superhazardbrains.v (detects hazards within and between pipelines)
superdepend.v (detects dependencies within the same cycle)
The superscalar processor must deal with many of the same hazards that occur in the
5-stage pipelined processor of labs 5 and 6, but, like the forwarding unit, it must also
consider the other pipeline.
Read After Write (RAW) hazards:
Just as in the regular 5-stage pipeline, forwarding cannot solve all RAW hazards.
Fortunately, these hazards are easily enumerated because they only occur when branch,
jump, or memory instructions are in the pipeline(s). All that needed to be done was
to adapt existing code to identify which pipeline these instructions occur in and to
duplicate code to consider instructions in the opposite pipeline. The fact that our design
stipulated that only branches and jumps could occur in the EVEN pipeline and that only
memory instructions could occur in the ODD pipeline made this task easier.
A special case occurs when two dependent instructions appear in the
decode stage. In this case the instruction in the ODD pipeline is dependent on the
instruction in the EVEN pipeline, because the EVEN pipeline always contains the
earlier instruction. In our original design, the hazard unit stalls the fetch stage, sends the
earlier instruction down the EVEN pipeline, sends a NOP down the ODD pipeline, and
asserts a NOP in the EVEN pipeline in the decode stage. In this way, the problem is
reduced to a forwarding issue and the processor is unstalled. We initially chose this design
because we thought that handling the dependency in the distributor unit would force the
distributor to do extra decoding and logic in the fetch stage, which we wanted to
avoid in order to keep from extending the cycle time for that stage.
However, we found a way around this. In our new design, the processor still detects
the dependency in the decode stage, but instead of stalling the processor and NOPing the
EVEN pipeline, the hazard unit sends a request to the distributor to send the dependent
instruction to the EVEN pipeline's decode stage and to send the earlier of the two
instructions that the distributor has already fetched from the cache to the ODD pipeline's
decode stage (see Figure 6.1). This only works, however, when the dependent instruction
is not a memory instruction and the earlier instruction is not a branch or jump instruction.
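The dependence check and the guard on the optimized swap can be sketched in Python (a model of the logic, not superdepend.v itself; instructions are modeled as a written register plus a list of read registers):

```python
# Sketch of the same-cycle dependence check: the ODD decode instruction
# depends on the EVEN one when it reads a register the EVEN instruction
# writes. Instructions are modeled as (dest_reg_or_None, [source_regs]).

def depends(even_instr, odd_instr):
    dest, _ = even_instr
    _, sources = odd_instr
    return dest is not None and dest in sources

def can_swap(dependent_is_mem, moved_is_branch_or_jump):
    # The optimized distributor may move the dependent instruction into
    # the EVEN decode slot only if it is not a memory op and the already-
    # fetched instruction moved into ODD is not a branch or jump.
    return not dependent_is_mem and not moved_is_branch_or_jump
```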
superdepend.v
Since our optimized distributor was added late in the project, we only tested this
distribution scheme with integrated testing. Furthermore, since we were simultaneously
running versions of the processor on the board and in simulation, this optimization ran
only in simulation, not on the board.
Figure 6.1: Optimized Distributor
Section 7: Branch Prediction
branchpredictor.v
The branch predictor is a table with the BTB and BHT sitting side by side. It is a
fully associative table with 8 entries. We chose this configuration over a large direct-mapped
table because we figured that, for the types of programs we would be running,
any loops we came across would not be nested deeply enough to warrant such a large
history table. See Figure 7.1 for a representation of the branch predictor.
The BHT uses 2-bit saturating counters as its predictors. A value of 0 or 1 means the
branch is predicted to be taken, whereas a value of 2 or 3 predicts that the branch will not
be taken. Figure 7.2 shows how the 2-bit predictors are updated.
The replacement policy for the branch predictor table is one that merely replaces
entries in sequence. It fills in new branch entries in order from 1 through 8 using a
counter, and then jumps back to 1 again. This design was chosen over a policy such as
LRU for simplicity.
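The counter encoding and the sequential replacement policy can be modeled together in a short Python sketch. The exact state transitions are defined by Figure 7.2; the update rule below assumes the usual saturate-toward-the-outcome behavior, and the initial counter value for a newly inserted branch (weakly not-taken) is an assumption:

```python
# Sketch of the BHT's 2-bit saturating counters using the report's
# encoding (0/1 predict taken, 2/3 predict not taken), plus the simple
# sequential replacement of the 8-entry fully associative table.

def predict_taken(counter):
    return counter <= 1                 # 0 or 1 => predict taken

def update(counter, taken):
    # Saturate toward 0 when taken, toward 3 when not taken
    # (assumed direction; the authoritative diagram is Figure 7.2).
    return max(0, counter - 1) if taken else min(3, counter + 1)

class RoundRobinTable:
    """8-entry table whose entries are replaced in sequence, not by LRU."""
    def __init__(self, size=8):
        self.entries = [None] * size
        self.next = 0
    def insert(self, pc, counter=2):    # assumed: new branches start not-taken
        self.entries[self.next] = (pc, counter)
        self.next = (self.next + 1) % len(self.entries)
```

The round-robin pointer simply wraps after the eighth entry, which is why this policy needs only a small counter instead of per-entry usage tracking.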
Here are some things to notice about our branch predictor. First, the branch predictor
has the ultimate say on what the next PC will be. Even the distributor has to go through the
branch predictor, in case we have just fetched a branch or must recover from a
mispredicted branch. The branch predictor is also responsible for flushing the IF/ID stage
when a branch has been predicted incorrectly. The first time the current PC of a branch is
used to fetch the instruction, the branch predictor will not have the instruction in its table,
and the branch is predicted to be not taken. If the branch is resolved in the ID stage as a
branch that should be taken, then the predictor flushes the IF/ID stage and sends the new
branch PC to the instruction cache. In this way, the branch predictor always has
the final say in what the next PC should be.
We wrote a test bench to test our branch predictor, using directed vectors.
branchpredictor_tf.tf
Figure 7.2: Branch Predictor State Diagram
Knowing that the cache would be the origin of many of our problems, we decided that
time could be better spent if we split our efforts and tested the processor with and
without the cache simultaneously. To do this, we made an SRAM version of the processor
that had SRAM blocks, as in lab 5, in place of the cache. This way we could test the
distribution, forwarding, and hazard units separately from the cache.
As in other labs, we automated the tests as much as possible. After testing for some
basic functionality (add, beq, ori), we run one or several instructions designed to test
specific functionality. After each "micro-test," the result is compared to the desired result.
If the result is wrong, a flag is set in a designated register. After completion of the
tests, the flag register is output to I/O space and/or a break signal is set, indicating a
problem.
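The micro-test pattern the assembly tests follow can be modeled in a few lines of Python (the function and test values here are illustrative, not taken from the actual test programs):

```python
# Model of the "micro-test" pattern: each test compares a computed
# result against the expected one and, on mismatch, sets a bit in a
# designated flag register, which is reported at the end of the run.

def run_micro_tests(tests):
    """tests: list of (compute, expected) pairs; returns the flag
    register, one bit per failed micro-test (0 means all passed)."""
    flags = 0
    for i, (compute, expected) in enumerate(tests):
        if compute() != expected:
            flags |= 1 << i        # record which micro-test failed
    return flags
```

Encoding failures as bits means a single register dumped to I/O space identifies exactly which micro-tests went wrong.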
While performing integrated testing, after each set of large changes, we re-ran our
suite of regression tests to verify that no new bugs were introduced. The regression tests
consisted of all the tests from the prior labs, the new tests provided by the TAs, and a new
superscalar test that tested forwarding between the different pipelines and ensured the
correct distribution of instructions was being sent down the pipelines.
Initially, we re-ran the original multicycle pipelined processor test (mipstest.s) from
Lab 5 to test the processor. We then created two new tests specifically for the superscalar
architecture (ss_test1.s, ss_test2.s), which were designed to test inter- and intra-pipeline
forwarding, the distributor, and the branch predictor modules. For instance, they ensure
that only memory operations are sent down the ODD pipeline and that branch/jump
instructions are sent down only the EVEN pipeline.
Noteworthy Bugs
As expected, our integrated testing revealed timing and stall-priority problems with
loads across cache lines. These bugs were painstakingly solved by John G., who
wrote the supercontroller and stall arbiter that handle those cases.
Our integrated tests also turned up a forwarding bug. The quick_sort test
revealed that a forwarding path between the end of the MEM stage and the EX stage
for the store-word register was missing. This path was not a problem in lab 5,
but became a problem in lab 6 when we reworked the timing on memory instructions for
the data cache.
V. Results
About half an hour before the final project demo, we finally got a version
of the processor running corner.mem on the board. This version has a DRAM
clock running at 27 MHz and a processor clock running at 4 MHz. Unfortunately, the
debouncers are implemented strangely, so some "pre-processor initialization"
(hammering the pushbuttons – our buttons toggle instead of exhibiting the normal
behavior) needs to be done. The main problem we discovered was the clock boundary
between the DRAM and main processor interfaces. Our primary concern is that the
arbiter sometimes fails to read the DRAMdone signal output by the DRAMinterface.
Device utilization summary:
Number of External GCLKIOBs 1 out of 4 25%
Number of External IOBs 173 out of 512 33%
Number of LOCed External IOBs 153 out of 173 88%
Number of BLOCKRAMs 52 out of 160 32%
Number of SLICEs 7134 out of 19200 37%
Number of GCLKs 3 out of 4 75%
Unfortunately, we were unable to re-run all the tests for the perfect-memory
processor. A large number of clock cycles were devoted to memory stalls. As a result,
branch prediction, jump prediction (tested for performance, but not included because we
had insufficient time to verify its correctness), and the bubble optimization had very
little effect on performance.
Branch prediction most likely had little effect on these tests either
because of the lack of branches or because the branch history table needed to be
larger. Also, some last-minute tests revealed that the bubble optimization decreased the
total cycle count of Jason Ding's quick_sort from 5000 cycles (base) to 4500. Most likely,
however, the largest performance gains would have come from increasing the set
associativity of the cache or moving to a write-back cache.
VI. Conclusion
Taking the extra time to properly and thoroughly design the entire superscalar
processor at the beginning definitely paid off for this project. The primary problem we
needed to resolve was the cache-line boundary issue, which we solved with a
supercontroller. Very few bugs had to do with the parallel pipelines; rather, the
majority of our time was spent implementing a write-through cache that supported dual
issue and loads across cache lines, and attempting to get the processor working on the
Calinx boards. However, by sticking to a simple design, we ended up with a
functioning processor in simulation and a somewhat functioning processor on the board,
as well as a few optimizations.
VII. Hours
John Gibson: 55.5 hours
John Truong: 46 hours
Albert Wang: 75.5 hours
Tim Wong: 60 hours
VIII. Appendix
Online notebook
online_notebook_final.txt
Schematics
Superscalar Datapath
superscalar.sch
Memory Subsystem
memorysubsystemWT.sch
Data Cache
datacachewt.sch
Instruction Cache
instcacheWT.sch
Verilog Modules
Branch Predictor
branchpredictor.v
Cache Controller
cachecontrolWT.v
Distributor
superdistributor.v
Forwarding Logic
superforward.v
Hazard Logic
superhazard.v
superhazardbrains.v
superdepend.v
Stall Arbiter
stallarbiter.v