# Solution Manual for Modern Processor Design by John Paul Shen and Mikko H. Lipasti

This book emerged from the course Superscalar Processor Design, which has been taught at Carnegie Mellon University since 1995. Superscalar Processor Design is a mezzanine course targeting seniors and first-year graduate students. Quite a few of the more aggressive juniors have taken the course in the spring semester of their junior year. The prerequisite to this course is the Introduction to Computer Architecture course. The objectives for the Superscalar Processor Design course include: (1) to teach modern processor design skills at the microarchitecture level of abstraction; (2) to cover current microarchitecture techniques for achieving high performance via the exploitation of instruction-level parallelism (ILP); and (3) to impart insights and hands-on experience for the effective design of contemporary high-performance microprocessors for mobile, desktop, and server markets. In addition to covering the contents of this book, the course contains a project component that involves the microarchitectural design of a future-generation superscalar microprocessor.

In the posts that follow, I am going to post solutions for this textbook (Modern Processor Design by John Paul Shen and Mikko H. Lipasti). If you run into any difficulty or want to suggest anything, feel free to comment. :)

http://targetiesnow.blogspot.in/p/solution-manual-for-modern-processor.html

Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Exercise 1.6 and 1.7 Solution

Q.1.6: A program's run time is determined by the product of instructions per program, cycles per instruction, and clock frequency. Assume the following instruction mix for a MIPS-like RISC instruction set: 15% stores, 25% loads, 15% branches, and 35% integer arithmetic, 5% integer shift, and 5% integer multiply. Given that load instructions require two cycles, branches require four cycles, integer ALU instructions require one cycle, and integer multiplies require ten cycles, compute the overall CPI.
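
The weighted sum behind Q.1.6 can be sketched as follows. This is only an illustration, not the posted solution: the problem statement gives no cycle count for stores, so one cycle is assumed here, and shifts are treated as one-cycle integer ALU instructions. The names `MIX`, `CYCLES`, and `overall_cpi` are made up for this sketch.

```python
# Instruction mix from Q.1.6, as fractions of all executed instructions.
MIX = {
    "store": 0.15,
    "load": 0.25,
    "branch": 0.15,
    "alu": 0.35,
    "shift": 0.05,
    "multiply": 0.05,
}

# Cycles per instruction class. Load, branch, ALU and multiply come from Q.1.6.
# ASSUMPTION: stores take 1 cycle (not stated in the problem).
# ASSUMPTION: shifts count as 1-cycle integer ALU instructions.
CYCLES = {
    "store": 1,
    "load": 2,
    "branch": 4,
    "alu": 1,
    "shift": 1,
    "multiply": 10,
}


def overall_cpi(mix, cycles):
    """Return the average CPI as the mix-weighted sum of per-class cycles."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix must sum to 100%")
    return sum(fraction * cycles[name] for name, fraction in mix.items())


if __name__ == "__main__":
    print(f"Overall CPI = {overall_cpi(MIX, CYCLES):.2f}")
```

Changing the assumed store latency in `CYCLES` shifts the result by 0.15 per extra cycle, so the assumption is worth checking against the course's own convention.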

Q.1.7: Given the parameters of Problem 6, consider a strength-reducing optimization that converts multiplies by a compile-time constant into a sequence of shifts and adds. For this instruction mix, 50% of the multiplies can be converted to shift-add sequences with an average length of three instructions. Assuming a fixed frequency, compute the change in instructions per program, cycles per instruction, and overall program speedup.
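
A minimal sketch of the bookkeeping for Q.1.7, normalised to one original instruction. It assumes a baseline CPI of 2.15, which is what the Q.1.6 mix gives if stores are taken to need one cycle (an assumption, since the problem does not say), and it assumes every added shift or add takes one cycle. The function name and constants are illustrative only.

```python
BASE_CPI = 2.15          # ASSUMPTION: Q.1.6 baseline with 1-cycle stores
MUL_FRACTION = 0.05      # multiplies are 5% of the original mix
CONVERTED = 0.50         # half of the multiplies are strength-reduced
SEQ_LEN = 3              # average shift-add sequence length
MUL_CYCLES = 10
SHIFT_ADD_CYCLES = 1     # ASSUMPTION: added shifts and adds are 1-cycle


def strength_reduction(base_cpi=BASE_CPI):
    """Return (relative instruction count, new CPI, speedup)."""
    removed = MUL_FRACTION * CONVERTED      # multiplies taken out
    added = removed * SEQ_LEN               # shift/add instructions put in
    new_count = 1.0 - removed + added       # instructions per original instruction
    # Cycles per ORIGINAL instruction after the optimization.
    new_cycles = base_cpi - removed * MUL_CYCLES + added * SHIFT_ADD_CYCLES
    new_cpi = new_cycles / new_count
    # At fixed frequency, speedup is the ratio of total cycles.
    speedup = base_cpi / new_cycles
    return new_count, new_cpi, speedup


if __name__ == "__main__":
    count, cpi, speedup = strength_reduction()
    print(f"instructions: x{count:.3f}, CPI: {cpi:.3f}, speedup: {speedup:.3f}")
```

Note that the speedup compares total cycles, not CPI values, because the instruction count changes as well.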

Ex 1.8, 1.9 and 1.10 Solution: Modern Processor Design by John Paul Shen and Mikko H. Lipasti :

Q.1.8: Recent processors like the Pentium 4 processors do not implement single-cycle shifts. Given the scenario of Problem 7, assume that s = 50% of the additional integer and shift instructions introduced by strength reduction are shifts, and shifts now take four cycles to execute. Recompute the cycles per instruction and overall program speedup. Is strength reduction still a good optimization?
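
One way to set up Q.1.8 is sketched below. Several things are assumed and not taken from the problem text: stores take one cycle (giving 2.15 CPI with single-cycle shifts), the original 5% of shifts also slow down to four cycles in the baseline, and the non-shift part of each added sequence takes one cycle. The function name is hypothetical.

```python
def slow_shift_strength_reduction(s=0.50, shift_cycles=4):
    """Return (new CPI, speedup over the un-optimized baseline).

    s            -- fraction of the added instructions that are shifts
    shift_cycles -- latency of a shift instruction
    """
    # ASSUMPTION: 2.15 is the CPI with 1-cycle shifts and 1-cycle stores;
    # the original 5% of shifts now cost shift_cycles instead of 1.
    base_cycles = 2.15 + 0.05 * (shift_cycles - 1)
    removed = 0.05 * 0.50            # multiplies converted
    added = removed * 3              # shift/add instructions introduced
    # Average cost of one added instruction: shift with probability s, else 1-cycle add.
    added_cycles = added * (s * shift_cycles + (1.0 - s) * 1)
    new_cycles = base_cycles - removed * 10 + added_cycles
    new_count = 1.0 - removed + added
    return new_cycles / new_count, base_cycles / new_cycles


if __name__ == "__main__":
    cpi, speedup = slow_shift_strength_reduction()
    print(f"CPI = {cpi:.3f}, speedup = {speedup:.3f}")
```

A speedup above 1.0 from this function means the optimization still pays off under the stated assumptions.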

Q.1.9: Given the assumptions of Problem 8, solve for the break-even ratio s (percentage of additional instructions that are shifts). That is, find the value of s (if any) for which program performance is identical to the baseline case without strength reduction (Problem 6).
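
The break-even condition in Q.1.9 can be sketched as a balance per converted multiply: the ten cycles saved must equal the cycles spent on the three replacement instructions. The sketch below assumes the non-shift replacement instructions take one cycle; the function name and parameters are illustrative. The latency of the original shifts does not appear because it affects both sides equally.

```python
def break_even_shift_ratio(mul_cycles=10, seq_len=3, shift_cycles=4, alu_cycles=1):
    """Solve seq_len * (s*shift_cycles + (1-s)*alu_cycles) == mul_cycles for s."""
    # Rearranged: s = (mul_cycles/seq_len - alu_cycles) / (shift_cycles - alu_cycles)
    return (mul_cycles / seq_len - alu_cycles) / (shift_cycles - alu_cycles)


if __name__ == "__main__":
    s = break_even_shift_ratio()
    print(f"break-even s = {s:.1%}")
```

If the returned value falls outside the range 0 to 1, no break-even point exists for those latencies.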

Q.1.10: Given the assumptions of Problem 8, assume you are designing the shift unit on the Pentium 4 processor. You have concluded there are two possible implementation options for the shift unit: 4-cycle shift latency at a frequency of 2 GHz, or 2-cycle shift latency at 1.9 GHz. Assume the rest of the pipeline could run at 2 GHz, and hence the 2-cycle shifter would set the entire processor's frequency to 1.9 GHz. Which option will provide better overall performance?
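
A rough comparison of the two shifter options in Q.1.10, expressed as time per original instruction (cycles divided by frequency). The sketch assumes strength reduction is applied with s = 50% as in Problem 8, that stores take one cycle, and that the original 5% of shifts use the same shifter latency; none of these is spelled out in the problem, so treat the numbers as illustrative.

```python
def time_per_original_instruction_ns(shift_cycles, freq_ghz, s=0.50):
    """Average nanoseconds spent per instruction of the original program."""
    # ASSUMPTION: 2.15 CPI baseline with 1-cycle shifts and 1-cycle stores.
    base_cycles = 2.15 + 0.05 * (shift_cycles - 1)
    removed = 0.05 * 0.50                     # multiplies converted
    added = removed * 3                       # replacement instructions
    added_cycles = added * (s * shift_cycles + (1.0 - s) * 1)
    cycles = base_cycles - removed * 10 + added_cycles
    return cycles / freq_ghz                  # cycles / (cycles per ns)


if __name__ == "__main__":
    option_a = time_per_original_instruction_ns(shift_cycles=4, freq_ghz=2.0)
    option_b = time_per_original_instruction_ns(shift_cycles=2, freq_ghz=1.9)
    print(f"4-cycle shift @ 2.0 GHz: {option_a:.4f} ns")
    print(f"2-cycle shift @ 1.9 GHz: {option_b:.4f} ns")
```

The option with the smaller time wins; under different assumptions (for example, without strength reduction) the comparison may come out the other way, so it is worth rerunning with the parameters your course uses.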

Solution: http://targetiesnow.blogspot.in/2013/11/modern-processor-design-by-john-paul_13.html

Q.2.4: Consider that you would like to add a load-immediate instruction to the TYP instruction set and pipeline. This instruction extracts a 16-bit immediate value from the instruction word, sign-extends the immediate value to 32 bits, and stores the result in the destination register specified in the instruction word. Since the extraction and sign-extension can be accomplished without the ALU, your colleague suggests that such instructions be able to write their results into the register in the decode (ID) stage. Using the hazard detection algorithm described in Figure 2-15, identify what additional hazards such a change might introduce.

Q.2.5: Ignoring pipeline interlock hardware (discussed in Problem 6), what additional pipeline resources does the change outlined in Problem 4 require? Discuss these resources and their cost.

Q.2.6: Considering the change outlined in Problem 4, redraw the pipeline interlock hardware shown in Figure 2-18 to correctly handle the load-immediate instructions.

Solution: http://targetiesnow.blogspot.in/2013/11/ex-24-25-26-solution-modern-processor.html

Ex 2.8, 2.9 & 2.15 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti :

Q.2.8: Consider adding a store instruction with an indexed addressing mode to the TYP pipeline. This store differs from the existing store with register+immediate addressing mode by computing its effective address as the sum of two source registers, that is, stx r3, r4, r5 performs MEM[r4+r5]<-r3. Describe the additional pipeline resources needed to support such an instruction in the TYP pipeline. Discuss the advantages and disadvantages of such an instruction.

Q.2.9: Consider adding a load-update instruction with register+immediate and post-update addressing mode. In this addressing mode, the effective address for the load is computed as register+immediate, and the resulting address is written back into the base register. That is, lwu r3,8(r4) performs r3<-MEM[r4+8]; r4<-r4+8. Describe the additional pipeline resources needed to support such an instruction in the TYP pipeline.
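
The semantics described for lwu can be written out as a tiny functional model. This is only a sketch of the instruction's architectural effect, with registers and memory modelled as Python dictionaries; it says nothing about how the TYP pipeline would implement it, and the function name `lwu` and its argument names are chosen here for illustration.

```python
def lwu(regs, mem, rd, base, imm):
    """Model lwu rd, imm(base): rd <- MEM[base+imm]; base <- base+imm."""
    addr = regs[base] + imm      # effective address: register + immediate
    regs[rd] = mem[addr]         # first register result: the loaded value
    regs[base] = addr            # second register result: the updated base


if __name__ == "__main__":
    regs = {"r3": 0, "r4": 100}
    mem = {108: 42}
    lwu(regs, mem, "r3", "r4", 8)
    print(regs)
```

The model makes visible that a single instruction produces two register results, which is a useful starting point when listing the resources the question asks about.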

Q.2.15: The IBM study of pipelined processor performance assumed an instruction mix based on popular C programs in use in the 1980s. Since then, object-oriented languages like C++ and Java have become much more common. One of the effects of these languages is that object inheritance and polymorphism can be used to replace conditional branches with virtual function calls. Given the IBM instruction mix and CPI shown in the following table, perform the following transformations to reflect the use of C++/Java, and recompute the overall CPI and speedup or slowdown due to this change: Replace 50% of taken conditional branches with a load instruction followed by a jump register instruction (the load and jump register implement a virtual function call). Replace 25% of not-taken branches with a load instruction followed by a jump register instruction.
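
The two transformations in Q.2.15 can be sketched as a function over an instruction mix. The table referred to in the problem is not reproduced here, so the numbers in `PLACEHOLDER_MIX` and `PLACEHOLDER_CPI` below are made-up placeholders, NOT the IBM figures; substitute the values from the book's table. The category names and the function name are also assumptions of this sketch.

```python
# PLACEHOLDER values for illustration only; these are NOT the IBM table.
PLACEHOLDER_MIX = {
    "alu": 0.40,
    "load": 0.25,
    "store": 0.15,
    "branch_taken": 0.12,
    "branch_not_taken": 0.08,
}
PLACEHOLDER_CPI = {
    "alu": 1,
    "load": 2,
    "store": 2,
    "branch_taken": 3,
    "branch_not_taken": 1,
    "jump_register": 3,   # hypothetical cost of a jump register instruction
}


def apply_virtual_calls(mix, cpi, taken_frac=0.50, not_taken_frac=0.25):
    """Return (relative instruction count, new CPI, speedup) after the rewrite."""
    counts = dict(mix)
    moved_taken = taken_frac * mix["branch_taken"]
    moved_not_taken = not_taken_frac * mix["branch_not_taken"]
    moved = moved_taken + moved_not_taken
    counts["branch_taken"] -= moved_taken
    counts["branch_not_taken"] -= moved_not_taken
    # Each replaced branch becomes one load plus one jump register.
    counts["load"] += moved
    counts["jump_register"] = counts.get("jump_register", 0.0) + moved
    old_cycles = sum(mix[k] * cpi[k] for k in mix)
    new_cycles = sum(counts[k] * cpi[k] for k in counts)
    new_count = sum(counts.values())
    # A speedup below 1.0 is a slowdown.
    return new_count, new_cycles / new_count, old_cycles / new_cycles


if __name__ == "__main__":
    count, new_cpi, speedup = apply_virtual_calls(PLACEHOLDER_MIX, PLACEHOLDER_CPI)
    print(f"instructions: x{count:.3f}, CPI: {new_cpi:.3f}, speedup: {speedup:.3f}")
```

Because every replaced branch turns into two instructions, the instruction count grows, so the speedup has to be computed from total cycles and not from the CPI values alone.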

Solution: http://targetiesnow.blogspot.in/2013/11/ex-28-29-215-solution-modernprocessor.html

Q.2.16: In a TYP-based pipeline design with a data cache, load instructions check the tag array for a cache hit in parallel with accessing the data array to read the corresponding memory location. Pipelining stores to such a cache is more difficult, since the processor must check the tag first, before it overwrites the data array. Otherwise, in the case of a cache miss, the wrong memory location may be overwritten by the store. Design a solution to this problem that does not require sending the store down the pipe twice, or stalling the pipe for every store instruction. Referring to Figure 2-15, are there any new RAW, WAR, and/or WAW memory hazards?

Q.2.17: The MIPS pipeline shown in Table 2-7 employs a two-phase clocking scheme that makes efficient use of a shared TLB, since instruction fetch accesses the TLB in phase one and data fetch accesses in phase two. However, when resolving a conditional branch, both the branch target address and the branch fall-through address need to be translated during phase one in parallel with the branch condition check in phase one of the ALU stage to enable instruction fetch from either the target or the fall-through during phase two. This seems to imply a dual-ported TLB. Suggest an architected solution to this problem that avoids dual-porting the TLB.

Solution:

Q.3.1: Given the following benchmark code, and assuming a virtually-addressed fully-associative cache with infinite capacity and 64 byte blocks, compute the overall miss rate (number of misses divided by number of references). Assume that all variables except array locations reside in registers, and that arrays A, B, and C are placed consecutively in memory.

    double A[1024], B[1024], C[1024];
    for (int i = 0; i < 1000; i += 2) {
        A[i] = 35.0 * B[i] + C[i+1];
    }

Q.3.3: Given the example code in Problem 1, and assuming a virtually-addressed two-way set associative cache of capacity 8KB and 64 byte blocks, compute the overall miss rate (number of misses divided by number of references). Assume that all variables except array locations reside in registers, and that arrays A, B, and C are placed consecutively in memory.
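
Both cache questions can be checked with a small trace-driven simulation. The sketch below makes several assumptions that the problems leave open: doubles are 8 bytes, array A starts on a block boundary, each iteration references B[i], then C[i+1], then A[i], writes allocate a block on a miss, and replacement is LRU. The infinite fully-associative cache is modelled as a single set with no limit on ways. All names are illustrative.

```python
from collections import OrderedDict

BLOCK_BYTES = 64
DOUBLE_BYTES = 8                      # ASSUMPTION: 8-byte doubles
ARRAY_BYTES = 1024 * DOUBLE_BYTES
A_BASE = 0                            # ASSUMPTION: A is block-aligned
B_BASE = A_BASE + ARRAY_BYTES         # arrays placed consecutively
C_BASE = B_BASE + ARRAY_BYTES


def trace():
    """Yield the byte address of every array reference made by the loop."""
    for i in range(0, 1000, 2):
        # ASSUMPTION: reads of B[i] and C[i+1] precede the write to A[i].
        yield B_BASE + i * DOUBLE_BYTES
        yield C_BASE + (i + 1) * DOUBLE_BYTES
        yield A_BASE + i * DOUBLE_BYTES


def miss_rate(num_sets, ways):
    """Simulate an LRU, write-allocate cache; ways=None means unlimited ways."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = refs = 0
    for addr in trace():
        refs += 1
        block = addr // BLOCK_BYTES
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)          # mark as most recently used
        else:
            misses += 1
            lines[block] = True
            if ways is not None and len(lines) > ways:
                lines.popitem(last=False)     # evict the least recently used
    return misses / refs


if __name__ == "__main__":
    print("Q.3.1 miss rate:", miss_rate(num_sets=1, ways=None))
    two_way_sets = 8 * 1024 // BLOCK_BYTES // 2   # 8KB, 64B blocks, 2 ways
    print("Q.3.3 miss rate:", miss_rate(num_sets=two_way_sets, ways=2))
```

In the two-way case each array is exactly as large as the cache, so A[i], B[i], and C[i+1] all map to the same set; changing the reference order or the replacement policy in the sketch is a quick way to see how sensitive the result is to those assumptions.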

Solution: