
ECE 452: Computer Organization and Design Spring 2010 Sample Final

Colorado State University May 2010

Name: Student ID number:

___________________________________ ___________________________________

Instructions:
Write all answers in the white space provided below the question. If you require more space, you may use the back of the sheet or extra sheets of paper. Only 1 sheet of handwritten notes (both sides) on regular sized paper is allowed, along with a calculator. There are 16 regular questions for a total score of 120. You have 2 hours. Write legibly and give clear answers showing all your steps. Try to attempt all questions, as partial credit will be given for a correct approach. Make reasonable assumptions if there is ambiguity in any question. Not all questions are of equal difficulty. Please review the entire set of questions first and then budget your time carefully.

Q1. [5 points] Suppose that we can improve the floating point instruction performance of a machine by a factor of 15 (the same floating point instructions run 15 times faster on this new machine). What percent of the instructions must be floating point to achieve a speedup of at least 4?

We will use Amdahl's Law for this question. Let x be the percentage of floating point instructions. Since the speedup is 4, if the original program executed in 100 cycles, the new program runs in 100/4 = 25 cycles:

100/4 = x/15 + (100 - x)

Solving for x, we get x = 80.36. At least 80.36% of the instructions must be floating point.
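The algebra generalizes: for a target overall speedup S and an enhancement that is k times faster, Amdahl's Law gives the required fraction f = (1 - 1/S)/(1 - 1/k). A short Python sketch (the helper name is mine, not part of the exam):

```python
def amdahl_fraction(target_speedup, improvement_factor):
    """Fraction of original execution time that must be enhanced so that
    the overall speedup reaches target_speedup, when the enhanced portion
    runs improvement_factor times faster.
    From Amdahl's Law: S = 1 / ((1 - f) + f / k)."""
    s, k = target_speedup, improvement_factor
    return (1 - 1 / s) / (1 - 1 / k)

# Q1: floating point runs 15x faster, and we want an overall speedup of 4
f = amdahl_fraction(4, 15)
print(f"{100 * f:.2f}% of execution must be floating point")  # ~80.36%
```

The same helper reproduces Q2(a) (`amdahl_fraction(2, 10)` = 0.5556) and Q3 (`amdahl_fraction(1.5, 1.75)` = 0.7778).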

Q2. [10 points] Assume that a design team is considering enhancing a machine by adding MMX (multimedia extension instruction) hardware to a processor. When a computation is run in MMX mode on the MMX hardware, it is 10 times faster than the normal mode of execution. Call the percentage of time that could be spent using the MMX mode the percentage of media enhancement.

(a) What percentage of media enhancement is needed to achieve an overall speedup of 2?

We will use Amdahl's Law for this question:
Execution time with media enhancement = (execution time improved by media enhancement)/(amount of improvement) + execution time unaffected
Let x be the percentage of media enhancement needed to achieve an overall speedup of 2. Then:

100/2 = x/10 + (100 - x)

Solving for x, we get x = 55.56.

(b) What percentage of the run-time is spent in MMX mode if a speedup of 2 is achieved? (Hint: You will need to calculate the new overall time.)

The new overall time is 100/2 = 50. Of the original 100 time units, 55.56 are enhanced, and they now take 55.56/10 = 5.56 time units in MMX mode. Let x be the percentage of the new run-time that is spent in MMX mode (for a speedup of 2):

x = (5.56/50) * 100 = 11.11

(c) What percentage of media enhancement is needed to achieve one-half the maximum speedup attainable from using the MMX mode?

The maximum speedup using MMX mode occurs when the whole program can run in media enhancement mode. The maximum speedup in this case is 10, and one-half of this is 5. Plugging in 5 instead of 2 in (a):

100/5 = x/10 + (100 - x)

Solving for x, we get x = 88.89.
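All three parts can be checked numerically; this is a plain-Python restatement of the formulas above, written for this note:

```python
k = 10  # MMX speedup factor

# (a) fraction f of original time that must be enhanced for speedup S:
#     1/S = (1 - f) + f/k  ->  f = (1 - 1/S) / (1 - 1/k)
f_a = (1 - 1 / 2) / (1 - 1 / k)
print(f"(a) {100 * f_a:.2f}%")      # 55.56%

# (b) share of the NEW run-time spent in MMX mode: the enhanced part
#     now takes f/k of the original time; the new total is 1/S
share = (f_a / k) / (1 / 2)
print(f"(b) {100 * share:.2f}%")    # 11.11%

# (c) half of the maximum speedup (k = 10) is 5
f_c = (1 - 1 / 5) / (1 - 1 / k)
print(f"(c) {100 * f_c:.2f}%")      # 88.89%
```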

Q3. [5 points] A designer wants to improve the overall performance of a given machine with respect to a target benchmark suite and is considering an enhancement X that applies to 50% of the original dynamically-executed instructions, and speeds each of them up by a factor of 3. The designer's manager has some concerns about the complexity and the cost-effectiveness of X and suggests that the designer should consider an alternative enhancement Y. Enhancement Y, if applied only to some (as yet unknown) fraction of the original dynamically-executed instructions, would make them only 75% faster. Determine what percentage of all dynamically-executed instructions should be optimized using enhancement Y in order to achieve the same overall speedup as obtained using enhancement X.

We will use Amdahl's Law for this problem:
Execution time after improvement = (execution time affected by improvement)/(amount of improvement) + execution time unaffected
Execution time using X = 50/3 + (100 - 50) = 66.67, so the speedup is 100/66.67 = 1.5.
Let x be the percentage of dynamically-executed instructions to which Y is applied. Since "75% faster" means an improvement factor of 1.75:
Execution time using Y = x/1.75 + (100 - x)
Setting the speedup 100/(execution time using Y) = 1.5 and solving for x, we get x = 77.78.

Q4. [10 points] Write a MIPS assembly language version of the following C code segment:

for (i = 0; i < 98; i++) { C[i] = A[i+1] - A[i] * B[i+2]; }

Arrays A, B and C start at memory location A000hex, B000hex and C000hex respectively. Try to reduce the total number of instructions and the number of expensive instructions such as multiplies. The MIPS assembly sequence is as follows:
      li   $s0, 0xA000      # load base address of A
      li   $s1, 0xB000      # load base address of B
      li   $s2, 0xC000      # load base address of C
      li   $t0, 0           # i = 0
      li   $t5, 98          # loop bound
loop: lw   $t1, 0($s0)      # load A[i]
      lw   $t2, 8($s1)      # load B[i+2]
      mul  $t3, $t1, $t2    # A[i] * B[i+2]
      lw   $t1, 4($s0)      # load A[i+1]
      sub  $t2, $t1, $t3    # A[i+1] - A[i] * B[i+2]
      sw   $t2, 0($s2)      # C[i] = A[i+1] - A[i] * B[i+2]
      addi $s0, $s0, 4      # advance to A[i+1]
      addi $s1, $s1, 4      # advance to B[i+1]
      addi $s2, $s2, 4      # advance to C[i+1]
      addi $t0, $t0, 1      # increment index variable
      bne  $t0, $t5, loop   # repeat until i == 98
done: nop
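As a cross-check on the loop's arithmetic, the C semantics can be modeled directly in Python (the array contents below are arbitrary test values, not from the exam):

```python
# Reference model of: for (i = 0; i < 98; i++) C[i] = A[i+1] - A[i]*B[i+2];
A = list(range(100))       # A is indexed up to A[98]
B = list(range(100, 200))  # B is indexed up to B[99]
C = [0] * 98

for i in range(98):
    C[i] = A[i + 1] - A[i] * B[i + 2]

print(C[:3])  # [1, -101, -205]
```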

Q5. [5 points] Representations (a) In the 32-bit IEEE format, what is the encoding for negative zero? The representation of negative zero in 32-bit IEEE format is
1 00000000 00000000000000000000000

(b) In the 32-bit IEEE format, what is the encoding for positive infinity? The representation for positive infinity in 32-bit IEEE format is
0 11111111 00000000000000000000000

(c) In one sentence for each, state the purpose of guard, rounding, and sticky bits for floating point arithmetic. The guard bit is the first extra bit kept to the right of the least significant bit position during an arithmetic operation, to prevent loss of significance. The round bit is the second extra bit, kept to the right of the guard bit, to preserve precision during intermediate operations. The sticky bit records (as a logical OR) whether any 1s have been shifted off to the right beyond the guard and round bits.

Q6. [5 points] Consider the following sequence of actual outcomes for a single static branch. T means the branch is taken. N means the branch is not taken. For this question, assume that this is the only branch in the program.

TTTNTNTTTNTNTTTNTN

Assume that we try to predict this sequence with a BHT using one-bit counters. The counters in the BHT are initialized to the N state. Which of the branches in this sequence would be mis-predicted?

A one-bit counter always predicts the previous actual outcome, so the first branch (the counter starts in the N state) and every branch following a change of direction is mispredicted:

Branch:      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
Outcome:     T  T  T  N  T  N  T  T  T  N  T  N  T  T  T  N  T  N
Prediction:  N  T  T  T  N  T  N  T  T  T  N  T  N  T  T  T  N  T
Mispredict:  x  .  .  x  x  x  x  .  .  x  x  x  x  .  .  x  x  x

Branches 1, 4, 5, 6, 7, 10, 11, 12, 13, 16, 17, and 18 are mispredicted (12 of the 18).
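The sequence can also be run through a tiny simulator (the function below is mine, not from the exam):

```python
def one_bit_predictor(outcomes, init='N'):
    """Simulate a single 1-bit branch predictor: predict the last actual
    outcome; return the 1-based indices of mispredicted branches."""
    state, missed = init, []
    for i, actual in enumerate(outcomes, start=1):
        if actual != state:
            missed.append(i)
        state = actual  # a 1-bit counter simply remembers the last outcome
    return missed

print(one_bit_predictor("TTTNTNTTTNTNTTTNTN"))
# [1, 4, 5, 6, 7, 10, 11, 12, 13, 16, 17, 18]
```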


Q7. [10 points] This question covers your understanding of dependences between instructions. Using the code below, list all of the dependence types (RAW, WAR, WAW). List the dependences in the respective table (example INST-X to INST-Y) by writing in the instruction numbers involved with the dependence.
I0: A = B + C;
I1: C = A + B;
I2: D = A * C;
I3: A = B / C * D;
I4: C = F + D;
I5: F = A + G;
I6: G = F + D;

RAW (true) dependences: I0 -> I1 (A), I0 -> I2 (A), I1 -> I2 (C), I1 -> I3 (C), I2 -> I3 (D), I2 -> I4 (D), I2 -> I6 (D), I3 -> I5 (A), I5 -> I6 (F)

WAR (anti) dependences: I0 -> I1 (C), I1 -> I3 (A), I2 -> I3 (A), I2 -> I4 (C), I3 -> I4 (C), I4 -> I5 (F), I5 -> I6 (G)

WAW (output) dependences: I0 -> I3 (A), I1 -> I4 (C)

(b) Given four instructions, how many unique comparisons (between register sources and destinations) are necessary to find all of the RAW, WAR, and WAW dependences. Answer for the case of four instructions, and then derive a general equation for N instructions. Assume that all instructions have one register destination and two register sources.
For four instructions, the number of unique comparisons is (2(3) + 2(2) + 2(1)) + (2(3) + 2(2) + 2(1)) + (3 + 2 + 1) = 30. The first summand is for RAW comparisons (each instruction's two sources against every earlier destination), the second is for WAR comparisons (each destination against every earlier instruction's two sources), and the last is for WAW comparisons (each destination against every earlier destination). For N instructions, the RAW and WAR terms each contribute 2 * N(N-1)/2 = N(N-1) comparisons and the WAW term contributes N(N-1)/2, so the general equation is 5N(N-1)/2.
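The bookkeeping in part (a) can be automated. Here is a short Python sketch (the tuple encoding of the instructions is mine; only the register names matter, not the operators) that tracks each register's most recent writer and its readers since that write:

```python
def find_deps(insts):
    """insts: list of (dest, sources). Return RAW, WAR, WAW dependence
    pairs (i, j) with i < j, honoring intervening redefinitions."""
    raw, war, waw = [], [], []
    last_write = {}  # register -> index of its most recent writer
    reads = {}       # register -> readers since that register's last write
    for j, (dest, srcs) in enumerate(insts):
        for s in srcs:
            if s in last_write:
                raw.append((last_write[s], j))  # write at i, read at j
        for i in reads.get(dest, []):
            war.append((i, j))                  # read at i, write at j
        if dest in last_write:
            waw.append((last_write[dest], j))   # write at i and at j
        last_write[dest] = j
        reads[dest] = []
        for s in srcs:
            reads.setdefault(s, []).append(j)
    return raw, war, waw

# Registers of I0-I6 from part (a)
insts = [('A', ('B', 'C')), ('C', ('A', 'B')), ('D', ('A', 'C')),
         ('A', ('B', 'C', 'D')), ('C', ('F', 'D')), ('F', ('A', 'G')),
         ('G', ('F', 'D'))]
raw, war, waw = find_deps(insts)
print("RAW:", sorted(raw))
print("WAR:", sorted(war))
print("WAW:", sorted(waw))

# Part (b): comparison count for N = 4 instructions, 5N(N-1)/2
n = 4
print(5 * n * (n - 1) // 2)  # 30
```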

Q8. [5 points] Pipelining is used because it improves instruction throughput. Increasing the depth of pipelining cuts the amount of work performed at each pipeline stage, allowing more instructions to exist in the processor at the same time and instructions to complete at a higher rate. However, throughput will not improve as pipelining is increased indefinitely. Give two reasons for this. First, pipelining has a fixed (or relatively fixed) absolute overhead per stage, resulting from latch delay and clock/data skew; this means the latency of a pipeline stage cannot be driven to zero. Second, increasing the pipeline depth lengthens hazard penalties, increasing the CPI. For instance, deepening the pipeline between the fetch and execute stages increases the branch misprediction penalty.

Q9. [5 points] Consider an architecture that uses virtual memory, a two-level page table for address translation, as well as a TLB to speed up address translations. Further assume that this machine uses caches to speed up memory accesses. Recall that all addresses used by a program are virtual addresses. Further recall that main memory in the microarchitecture is indexed using physical addresses. The virtual memory subsystem and cache memories could interact in several ways. In particular, the cache memories could be accessed using virtual addresses. We will refer to this scheme as a virtually indexed, virtually tagged cache. The cache could be indexed using virtual addresses, but the tag compare could happen with physical addresses (virtually indexed, physically tagged). Finally, the cache could be accessed using only the physical address. Describe the virtues and drawbacks for each of these systems. Be sure to consider the case where two virtual addresses map to the same physical address.

Q10. [10 points] The memory architecture of a machine X is summarized in the following table.

Virtual address: 54 bits
Page size: 16 K bytes
PTE size: 4 bytes

(a) Assume that there are 8 bits reserved for operating system functions (protection, replacement, valid, modified, and hit/miss bits, all overhead) other than those required by the hardware translation algorithm. Derive the largest physical memory size (in bytes) allowed by this PTE format. Make sure you consider all the fields required by the translation algorithm.

Each PTE is 4 bytes (32 bits) and the page size is 16 K bytes:
(1) The page offset needs log2(16 * 2^10) = 14 bits.
(2) The PTE has 32 - 8 (overhead) = 24 bits available for the physical page number.
The largest physical memory size is therefore 2^24 pages * 2^14 bytes/page = 2^38 bytes = 256 GB.

(b) How large (in bytes) is the page table?

The page table is indexed by the virtual page number, which uses 54 - 14 = 40 bits. The number of entries is therefore 2^40. Each PTE has 4 bytes, so the total size of the page table is 2^42 bytes, which is 4 terabytes.
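The arithmetic can be checked in a few lines of Python (parameters taken from the solution: 54-bit virtual addresses, 16 KB pages, 4-byte PTEs, 8 overhead bits):

```python
from math import log2

page_offset_bits = int(log2(16 * 1024))  # 16 KB pages -> 14 offset bits
ppn_bits = 32 - 8                        # 24 PTE bits left for the PPN
phys_bytes = 2 ** (ppn_bits + page_offset_bits)
print(phys_bytes == 2 ** 38)             # 2^38 bytes = 256 GB

vpn_bits = 54 - page_offset_bits         # 40-bit virtual page number
page_table_bytes = (2 ** vpn_bits) * 4   # one 4-byte PTE per entry
print(page_table_bytes == 2 ** 42)       # 2^42 bytes = 4 TB
```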

Q11. [5 points] Assume an instruction cache miss rate for gcc of 2% and a data cache miss rate of 4%. If a machine has a CPI of 2 without any memory stalls and the miss penalty is 40 cycles for all misses, determine how much faster a machine would run with a perfect cache that never missed. Assume 36% of instructions are loads/stores.

Let the instruction count be I.
Instruction cache stall cycles = I * 0.02 * 40 = 0.80I
Data cache stall cycles = I * 0.36 * 0.04 * 40 = 0.576I
Total memory stall cycles = 1.376I
CPI with perfect memory = 2.0; CPI with memory stalls = 2 + 1.376 = 3.376
The machine with the perfect cache is faster by 3.376/2, or about 1.69.

Q12. [10 points] (a) What is the average time to read or write a 512-byte sector for a typical disk rotating at 7200 RPM? The advertised average seek time is 8 ms, the transfer rate is 20 MB/sec, and the controller overhead is 2 ms. Assume that the disk is idle so that there is no waiting time.

Disk access time = seek time + rotational delay + transfer time + controller overhead
= 8 + (0.5 * 60,000/7200) + (512/(20 * 2^20)) * 1000 + 2
= 8 + 4.17 + 0.02 + 2 = 14.19 ms

(b) A program repeatedly performs a three-step process: It reads in a 4-KB block of data from disk, does some processing on that data, and then writes out the result as another 4-KB block elsewhere on the disk. Each block is contiguous and randomly located on a single track on the disk. The disk drive rotates at 7200 RPM, has an average seek time of 8 ms, and has a transfer rate of 20 MB/sec. The controller overhead is 2 ms. No other program is using the disk or processor, and there is no overlapping of disk operation with processing. The processing step takes 20 million clock cycles, and the clock rate is 400 MHz. What is the overall speed of the system in blocks processed per second, assuming no other overhead?

Disk read time for a 4 KB block = 8 + 4.17 + (4096/(20 * 2^20)) * 1000 + 2 = 8 + 4.17 + 0.20 + 2 = 14.36 ms
Processing time = 20 * 10^6 cycles / (400 * 10^6 cycles/sec) = 0.05 s = 50 ms
Disk write time for a 4 KB block = 14.36 ms
Total time to process one block = 2 * 14.36 + 50 = 78.72 ms
Number of blocks processed per second = 1000/78.72 = 12.7
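Both parts follow from one service-time formula. A Python sketch (the function name and defaults are mine; MB is taken as 2^20 bytes, as in the solution):

```python
def disk_time_ms(nbytes, rpm=7200, seek_ms=8.0, mb_per_s=20, ctrl_ms=2.0):
    """Average service time: seek + half a rotation + transfer + controller."""
    rotation_ms = 0.5 * 60_000 / rpm              # half a revolution, in ms
    transfer_ms = nbytes / (mb_per_s * 2 ** 20) * 1000
    return seek_ms + rotation_ms + transfer_ms + ctrl_ms

t_sector = disk_time_ms(512)         # part (a): ~14.2 ms
t_block = disk_time_ms(4 * 1024)     # ~14.4 ms per 4 KB transfer
processing_ms = 20e6 / 400e6 * 1000  # 20M cycles at 400 MHz = 50 ms
total_ms = 2 * t_block + processing_ms
print(f"{1000 / total_ms:.1f} blocks/second")  # 12.7 blocks/second
```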

Q13. [5 points] What are the advantages and disadvantages of fine-grained multithreading, coarse-grained multithreading, and simultaneous multithreading? Fine-grained multithreading can hide the throughput losses that arise from both short and long stalls, but it slows down the execution of individual threads, especially those without stalls. Coarse-grained multithreading switches only on long stalls, so it is less likely to slow down the execution of an individual thread; however, it has limited ability to overcome throughput losses from short stalls, and it incurs a relatively high start-up overhead on each switch. Simultaneous multithreading dynamically issues operations from multiple threads in the same cycle. This covers the throughput losses from both short and long stalls and does not suffer from high switching overhead, but SMT may still slow down the execution of an individual thread even when that thread has no stalls, since threads share issue slots and other resources.

Q14. [10 points] Consider a system with two multiprocessors with the following configurations: (a) Machine 1, a NUMA machine with two processors, each with local memory of 512 MB, with local memory access latency of 20 cycles per word and remote memory access latency of 60 cycles per word. (b) Machine 2, a UMA machine with two processors, with a shared memory of 1 GB with access latency of 40 cycles per word. Suppose an application has two threads running on the two processors, and each of them needs to access an entire array of 4096 words. Is it possible to partition this array across the local memories of the NUMA machine so that the application runs faster on it than on the UMA machine? If so, specify the partitioning. If not, by how many cycles must the UMA memory latency be worsened for some partitioning on the NUMA machine to enable a faster run than the UMA machine? Assume that memory operations dominate the execution time.

Suppose we place x words on one processor's local memory and (T - x) words on the other, where T = 4096. Since each thread reads the whole array:
Execution time on the NUMA machine = max(20x + 60(T - x), 60x + 20(T - x)) = max(60T - 40x, 20T + 40x)
This is minimized at x = T/2, giving 40T cycles.
Execution time on the UMA machine = 40T cycles.
So we can't make the NUMA machine faster than the UMA machine. However, if the UMA access were just one cycle slower (that is, 41 cycles access latency), the NUMA machine with the even split would be faster.
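A brute-force check over all partitions confirms the result (plain Python, written for this note):

```python
T = 4096  # words each thread must access

def numa_time(x, local=20, remote=60):
    """x words placed in processor 1's local memory, T - x in processor 2's;
    each thread reads the whole array, so finish time = the slower thread."""
    t1 = local * x + remote * (T - x)   # thread on processor 1
    t2 = remote * x + local * (T - x)   # thread on processor 2
    return max(t1, t2)

best = min(numa_time(x) for x in range(T + 1))  # best possible partitioning
uma = 40 * T                                    # UMA: 40 cycles per word
print(best, uma, best < uma)  # 163840 163840 False -> NUMA never wins
```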

Q15. [5 points] Describe any 2 flow control schemes used in NoCs. See lecture slides for the answer.

Q16. [15 points] What is the difference between the following, and give an example of a routing scheme for each: (a) minimal and non-minimal routing schemes? See lecture slides for the answer.

(b) algorithmic and non-algorithmic routing schemes? See lecture slides for the answer

(c) ordered and unordered routing schemes?

See lecture slides for the answer
