
EE560 Project CMT Core

It is a 4-thread, single-core, MIPS-ISA-based processor. We are enhancing our CPU design to
accommodate multiple threads and to include Mult and Div execution units.

Instructions Supported:
ADD rd, rs, rt ------------- 000 (ALU op code)
SUB rd, rs, rt ------------- 001
AND rd, rs, rt ------------- 010
OR rd, rs, rt --------------- 011
SLT rd, rs, rt -------------- 100
LW rt, offset(rs)
SW rt, offset(rs)
BEQ rs, rt, offset
JMP target
MULT
DIV
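
A minimal reference sketch of the five R-type ALU operations and their 3-bit op codes from the table above, in Python, purely for illustration (the helper names are not part of the design):

# Illustrative mapping of the 3-bit ALU op codes listed above.
ALU_OPCODES = {
    "ADD": 0b000,
    "SUB": 0b001,
    "AND": 0b010,
    "OR":  0b011,
    "SLT": 0b100,
}

def alu_execute(op, a, b):
    """Reference behavior of the five R-type ALU operations."""
    if op == "ADD":
        return a + b
    if op == "SUB":
        return a - b
    if op == "AND":
        return a & b
    if op == "OR":
        return a | b
    if op == "SLT":
        return 1 if a < b else 0
    raise ValueError(f"unknown ALU op {op}")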

The design will be a six-stage pipeline; the stages are listed below. In our simplified design, there is
no virtual memory, hence no TLB/PT. We are not supporting exceptions and interrupts.
Fetch (IF)
Thread Select (TS)
Decode (ID)
Execute (EX)
Memory (MEM)
Write Back (WB)

Fetch
There is a unified I-cache and four PCs, one pointing to the next instruction of each thread. A PC can
be updated with PC+4, with a branch/jump target PC, or with a rolled-back PC (when we roll back an
instruction). Unlike the single-thread 5-stage CPU project, here the thread ID and PC must travel with
the instruction through the pipeline. The IF and ID stage resources are four times the size of their
single-thread counterparts, as each thread has its own independent copy. Each entry of the PC array
carries a valid bit indicating whether that thread is currently active.
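
A behavioral sketch of the per-thread next-PC selection just described, assuming rollback has the highest priority, then a taken branch/jump, then PC+4 (the class and signal names are illustrative, not actual project names):

class FetchStage:
    """Four PCs, one per thread, each with a valid (thread active) bit."""
    def __init__(self, num_threads=4, reset_pc=0):
        self.pc    = [reset_pc] * num_threads
        self.valid = [False] * num_threads

    def next_pc(self, tid, taken_target=None, rollback_pc=None):
        """Priority: rolled-back PC > branch/jump target > sequential PC+4."""
        if rollback_pc is not None:
            return rollback_pc          # re-fetch a rolled-back instruction
        if taken_target is not None:
            return taken_target         # taken branch or jump resolved in EX
        return self.pc[tid] + 4         # default sequential fetch

    def update(self, tid, taken_target=None, rollback_pc=None):
        self.pc[tid] = self.next_pc(tid, taken_target, rollback_pc)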

Thread Select
Thread scheduling is a fair round robin. A thread is kept out of scheduling if it is unavailable, not
ready to execute, or about to incur a long-latency operation. A switched-out thread is brought back
into the scheduling sequence once the instruction it is waiting on has completed. A thread that
becomes ready after a long latency is given priority to execute: it joins the sequence of threads to be
scheduled at the top of the scheduling stack.

Each thread is switched out every clock to provide fair resource sharing among all of them. The
following is the state diagram per thread. Each thread wakes up in the WAIT state on RESET. When
the valid bit of a specific thread goes high, it moves to the READY state. All threads in READY are
scheduled one after another in round robin, so after every clock each thread moves from RUN back to
READY. When only one thread is present, it loops around in the RUN state until it finishes or a
second thread becomes ready.
The thread that is currently scheduled is written into the TS/ID stage register, and the Program
Counter, the Thread Status Register (the register holding the state of the thread as WAIT, READY, or
RUN), and the LRU logic are updated accordingly.
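
The following is a behavioral sketch of this WAIT/READY/RUN machine together with the fair round-robin pick. The deque stands in for the LRU logic, and the method names are assumptions for illustration:

from collections import deque

WAIT, READY, RUN = "WAIT", "READY", "RUN"

class ThreadScheduler:
    def __init__(self, num_threads=4):
        self.state = [WAIT] * num_threads   # every thread wakes up in WAIT on RESET
        self.ready_q = deque()              # round-robin order of READY threads

    def set_valid(self, tid):
        """Valid bit goes high: WAIT -> READY, join the back of the sequence."""
        if self.state[tid] == WAIT:
            self.state[tid] = READY
            self.ready_q.append(tid)

    def suspend(self, tid):
        """Long-latency instruction issued: take the thread out of scheduling."""
        self.state[tid] = WAIT
        if tid in self.ready_q:
            self.ready_q.remove(tid)

    def wake_priority(self, tid):
        """A thread returning from a long-latency wait goes to the front."""
        self.state[tid] = READY
        self.ready_q.appendleft(tid)

    def select(self):
        """One thread is granted the pipeline each clock, round-robin."""
        for t, s in enumerate(self.state):
            if s == RUN:
                self.state[t] = READY       # last clock's RUN thread steps back
        if not self.ready_q:
            return None                     # no thread ready this clock
        tid = self.ready_q.popleft()
        self.state[tid] = RUN
        self.ready_q.append(tid)            # rejoin at the back for fairness;
        return tid                          # a lone thread just loops in RUN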

Decode
In this stage we decode the instruction. This stage holds the register file. To avoid unnecessary
stalling, instead of implementing an early branch (in the ID stage) as in EE457, we implement a
medium branch in the EX stage. The threads have their own dedicated sets of registers: each thread
has a 32x32 register file. This can be implemented as four 32x32 register files or as one big 128x32
register file. One big register file is cheaper than four small ones, so we choose to implement one big
file.
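
A sketch of how the single 128x32 file can be indexed: the 2-bit thread ID is concatenated above the 5-bit register number to form a 7-bit address (an illustrative Python model, not the actual RTL):

class CompositeRegFile:
    """One 128x32 file: 4 threads x 32 registers, 2 read ports, 1 write port."""
    def __init__(self):
        self.mem = [0] * 128

    def _addr(self, tid, reg):
        # {tid[1:0], reg[4:0]}: thread ID forms the upper address bits
        return (tid << 5) | (reg & 0x1F)

    def read(self, tid, rs, rt):
        # Two read ports, used in the ID stage
        return self.mem[self._addr(tid, rs)], self.mem[self._addr(tid, rt)]

    def write(self, tid, rd, value):
        # Single write port; $zero stays hard-wired to 0
        if rd != 0:
            self.mem[self._addr(tid, rd)] = value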

It has two read ports, and we support one write port. We could have two write ports, one for the
regular instruction stream and one for long-latency instructions, but conflicts would still be possible:
the multiply and divide streams might arrive to write into the register file on the same clock, in which
case we would need to stall one of the long-latency instructions (divide or multiply). Since our thread
scheduler puts a thread into the WAIT state after it encounters a long-latency instruction, we will
never hit a WAW hazard or a resource conflict on the register file within a single thread. But even
with two write ports on the register file, a mult from the pipelined multiply unit of one thread, a
divide from the divide unit of another thread, and yet another integer instruction from a third thread
could all attempt to write to the register file at the same time. So in our implementation we wish to
manage with one write port on the 128x32 composite register file.

Since we perform only in-order execution, we will not provide an ROB, etc. So once a long-latency
instruction is issued by the decode unit, we need to suspend issuing further instructions from that
thread. However, the load-word instruction is "speculated" to be an instruction that is going to hit,
and instructions after the load word are issued. If the load incurs a cache miss, then we need to roll
back the instructions of that thread currently occupying stages in the pipeline.
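
A sketch of this speculate-hit-then-rollback policy, under the assumption that the thread is refetched starting at the load itself once the miss is serviced (the Instr fields and helper names are illustrative; suspend() refers to the scheduler sketch above):

from dataclasses import dataclass

@dataclass
class Instr:
    tid: int    # thread ID
    seq: int    # per-thread program-order sequence number
    pc: int     # address of the instruction

def rollback_on_miss(stages, load, pcs, scheduler):
    """stages: list of Instr-or-None, one slot per pipeline stage."""
    for i, instr in enumerate(stages):
        # Squash every younger in-flight instruction of the missing thread.
        if instr is not None and instr.tid == load.tid and instr.seq > load.seq:
            stages[i] = None
    pcs[load.tid] = load.pc          # refetch starting at the load itself
    scheduler.suspend(load.tid)      # thread waits until the miss is serviced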

Execute
There is an integer execution unit, which takes 1 clock. There is a multiply execution unit, which is a
3-stage pipeline. When a multiply is issued, it is sent to the execution unit and its thread is kept out of
scheduling until that instruction completes. The divider is a 7-clock multi-cycle design, so it can only
handle one divide operation at a time. If another divide instruction arrives too soon, its thread is
rolled back. Once the pending divide finishes executing, both threads are added back into the
execution sequence.
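
A sketch of the divider structural hazard: the unit is busy for 7 clocks, and a second divide arriving in that window signals the caller to roll its thread back (names are illustrative):

class Divider:
    """7-clock multi-cycle divider; strictly one operation in flight."""
    LATENCY = 7

    def __init__(self):
        self.busy_until = 0     # clock at which the current divide completes
        self.owner = None       # thread occupying the unit

    def try_issue(self, tid, now):
        """Returns True if accepted; False means the caller must roll the
        issuing thread back and reschedule it from the IF stage."""
        if now < self.busy_until:
            return False        # structural hazard: unit occupied
        self.owner = tid
        self.busy_until = now + self.LATENCY
        return True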

As we have made our register file single-ported for writing, we need to make sure there are no write
conflicts between different threads. This can be done either by using a scheduling sequence or by
bringing the results from the long-latency execution units back into the pipeline and writing them
from the WB stage. The branch is executed in the EX stage. So if the branch is taken and there are
any junior instructions of that thread in the pipeline stages behind the execute stage, we need to flush
them and redirect the thread to the branch target address.

Long-latency instructions and rolling back: another thing we could do is have shadow registers and
save the context of the long-latency thread. In that case we would have 3 shadow locations for 3
threads to sit aside, so that the remaining thread or threads can go ahead in the normal pipeline. When
the long-latency executions are done, those threads are added back into the normal sequence and
continue to execute.

The problem with this method is the following. If there is a junior branch that is taken, then we need
to flush the instruction in the mult or div execution unit if one is present. Moreover, the multiplier is
pipelined, so we would need to do a comparison against all of its stages. This method could be
implemented if we were going to inject the long-latency instruction back into the pipeline, selectively
roll back, and also take care of hazards. But we are not supporting it. When a long-latency instruction
is encountered, the thread is taken out of the scheduling sequence for some time and is rescheduled
from the IF stage.

With a shadow register, we set our long-latency instruction aside and let the other threads walk over
it. That set-aside instruction still has to run in the execution unit. Our multiplier is a pipeline, so if
there is a flush in some other thread, we would need to selectively flush the instructions in that
pipeline, which requires comparison units, and we would also need to take care of hazards. Instead of
a shadow register, we roll back the long-latency instruction: the corresponding thread is set aside out
of the scheduling sequence and rescheduled from the IF stage.

There is selection logic to send the mult, div, and int results into the stream so that there is never a
conflict when they are written back into the register file.
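
A sketch of that selection logic as a fixed-priority pick among the three result streams; the div > mult > int priority shown is an assumption for illustration, since the scheduler is what actually guarantees at most one valid candidate per clock:

def select_writeback(div_result, mult_result, int_result):
    """Each argument is None or a (tid, rd, value) tuple; at most one is
    expected to be valid per clock by construction of the scheduler."""
    for result in (div_result, mult_result, int_result):   # div > mult > int
        if result is not None:
            return result       # steer this result into the single write port
    return None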

Memory
Here we do the memory operation. This stage is similar to the block-RAM design that we did
previously. As in previous years, the data cache is emulated. The emulation is extended to model a
non-blocking cache, which can allow up to 3 pending read transactions and one pending write
transaction while still serving the remaining fourth thread, as long as that fourth thread is causing all
hits. Memory stores go into a store buffer and are not considered long-latency transactions. There is
one common store buffer for all four threads, since the cache is a PIPT cache. We are not supporting
virtual memory, but if virtual memory were supported, then I am not sure whether TLB access for
stores would be performed in the MEM stage or at the head of the store buffer; we need to consult
others on this. Let us assume that TLB access for stores is performed at the head of the store buffer
and that stores are not considered long-latency instructions. WAW across threads is not the
responsibility of the hardware anyway. WAW within a thread is taken care of by the in-order
retirement of the stores. The WAR problem is not encountered because a load-word instruction
incurring a cache miss causes a rollback. RAW can occur; hence, when a load-word instruction
comes to the MEM stage, it first checks whether there is a pending store in the store buffer with a
matching address. If so, it (the load-word) gets its data through forwarding from the youngest store
instruction with the matching address.
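
A sketch of the common store buffer with this forwarding rule: a load in MEM searches from youngest to oldest for a matching address, and stores drain to the cache in order (entry format, depth, and method names are illustrative):

class StoreBuffer:
    """One buffer shared by all four threads (the cache is PIPT)."""
    def __init__(self, depth=4):
        self.entries = []                 # oldest first: (addr, data) pairs
        self.depth = depth

    def push(self, addr, data):
        if len(self.entries) >= self.depth:
            raise RuntimeError("store buffer full: stall the store in MEM")
        self.entries.append((addr, data))

    def forward(self, load_addr):
        """Return data from the youngest matching store, else None (read cache)."""
        for addr, data in reversed(self.entries):   # youngest match wins
            if addr == load_addr:
                return data
        return None

    def retire_oldest(self):
        """Stores drain to the cache in order (handles WAW within a thread)."""
        return self.entries.pop(0) if self.entries else None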

Write Back
There are no changes in this stage. We took care of scheduling in the EX stage, so there are no
clashes when we write the data back.
