
Solutions to Chapter 1 Exercises

1.1: To do the calculation properly we should fit all the points, but the main point is conveyed by a simple example comparing the i4004 and the Alpha 21164. Over 24 years, die size increases by a factor of 33.2, transistor count by a factor of 4,043, and clock rate by a factor of 600. Using the formula

    AnnualRate = \sqrt[n]{X} = X^{1/n},

where X is the overall ratio and n is the number of years, we get annual growth rates of 15.7%, 41.3%, and 30.5%, respectively.
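
The same calculation is easy to script. A minimal sketch (the ratios below are the ones quoted above; pow comes from the standard math library):

#include <stdio.h>
#include <math.h>

/* Annual growth rate implied by an overall ratio X over n years: X^(1/n) - 1 */
static double annual_rate(double ratio, double years)
{
    return pow(ratio, 1.0 / years) - 1.0;
}

int main(void)
{
    printf("Die size:    %.1f%%\n", 100.0 * annual_rate(33.2, 24));   /* 15.7% */
    printf("Transistors: %.1f%%\n", 100.0 * annual_rate(4043.0, 24)); /* 41.3% */
    printf("Clock rate:  %.1f%%\n", 100.0 * annual_rate(600.0, 24));  /* 30.5% */
    return 0;
}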

1.2: The performance curve is of the form S_i = S_0 × X^i, so each year is X times faster than the year before. A crude estimate is obtained by fitting the curve through the first and last points (a least-squares fit would be more accurate) and gives the following.

Table 1: Performance Growth Rates

Machine           Year   SpecInt   SpecFP   Linpack   Linpack n=1000   Peak FP
Sun 4/260         1987         9        6       1.1              1.1       3.3
AlphaServer 2100  1994       200      291        43              129       190
Annual Factor               1.56     1.74      1.69             1.98      1.80
                             56%      74%       69%              98%       80%

Floating-point performance has been increasing at a faster rate than integer performance. If you plot the actual data points against this curve, you will see that the AlphaServer 2100 was actually falling off the curve for floating-point performance. Newer data will show a return to a steady growth rate.

1.3:

    T_p \ge T_1 \left( s + \frac{1 - s}{p} \right)

    S_p \le \frac{1}{s + \frac{1 - s}{p}}

The sequential work takes as long as on one processor. If the remainder is perfectly parallelized, it is accelerated by a factor of p. If there is redundant work performed, communication, or synchronization, this factor may be reduced.
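
A minimal sketch of the speedup bound (the sequential fraction s = 0.1 below is just an example value):

#include <stdio.h>

/* Amdahl's law: upper bound on speedup with sequential fraction s on p processors */
static double speedup_bound(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void)
{
    /* example: 10% sequential work caps the speedup near 10 */
    for (int p = 1; p <= 1024; p *= 4)
        printf("p = %4d  S_p <= %.2f\n", p, speedup_bound(0.1, p));
    return 0;
}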

1.4. To get a lower bound on the time on a k-way issue machine, where the maximum instruction-level parallelism is p, we view all issue slots with no more than k instructions as taking a single cycle, and take all the work in slots with more than k-way parallelism and do it k instructions at a time.
    T_k \ge T_1 \left( \sum_{i=0}^{k} f_i + \frac{1}{k} \sum_{i=k+1}^{p} i\, f_i \right)

A better estimate, but not necessarily an underestimate, is obtained by assuming each highly parallel slot is performed k instructions at a time.

    T_k \approx T_1 \left( \sum_{i=0}^{k} f_i + \sum_{i=k+1}^{p} \left\lceil \frac{i}{k} \right\rceil f_i \right)
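
A small sketch of the second estimate, assuming f[i] is the fraction of issue slots with parallelism i (the profile values below are made up for illustration):

#include <stdio.h>

#define P 8  /* maximum instruction-level parallelism in the profile */

/* Estimated time on a k-issue machine, in units of T_1, assuming each
   slot with more than k instructions is issued k instructions at a time. */
static double t_k(const double f[P + 1], int k)
{
    double t = 0.0;
    for (int i = 0; i <= P; i++) {
        if (i <= k)
            t += f[i];                      /* one cycle per slot */
        else
            t += ((i + k - 1) / k) * f[i];  /* ceil(i/k) cycles per slot */
    }
    return t;
}

int main(void)
{
    /* hypothetical profile: fraction of slots with parallelism 0..8 */
    double f[P + 1] = {0.05, 0.30, 0.25, 0.15, 0.10, 0.06, 0.04, 0.03, 0.02};
    for (int k = 1; k <= P; k++)
        printf("k = %d  T_k/T_1 = %.3f\n", k, t_k(f, k));
    return 0;
}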

1.5. This will change from year to year.

1.6. Every processor performs a fetch&inc on the shared variable. Each can wait for all the rest to get ranked by waiting for the counter to reach the number of processes.

int proc_counter = 0;                      /* shared counter, initially zero */

my_proc = fetch_and_inc(&proc_counter);    /* atomic fetch&inc returns the old value */

while (proc_counter < number_procs)        /* spin until every process has a rank */
    ;

You cannot determine the number of processors this way, because one may simply be very slow and may not have reached the counter yet.
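
On modern shared-memory hardware the same idea can be expressed with C11 atomics; a minimal sketch (thread creation is omitted, and NUM_PROCS is an assumed compile-time constant):

#include <stdatomic.h>

#define NUM_PROCS 8                  /* assumed number of participating threads */

atomic_int proc_counter = 0;         /* shared counter, initially zero */

int get_rank(void)
{
    /* atomic fetch-and-add returns the previous value: ranks 0..NUM_PROCS-1 */
    int my_rank = atomic_fetch_add(&proc_counter, 1);

    /* spin until every thread has taken a rank */
    while (atomic_load(&proc_counter) < NUM_PROCS)
        ;
    return my_rank;
}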

1.7. R = 0.25 µs and W = 40 MB/s. Since MB is 2^20 bytes and a µs is 10^-6 s, W is about 41.9 bytes per µs, so 1/W ≈ 0.024 µs per byte.

Even with relatively small messages, the channel time dominates the routing delay with cut-through routing.
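
As a rough check, assuming a single cut-through hop so that the total time is approximately R + n/W, a 128-byte message takes

    T(128) \approx 0.25\,\mu\mathrm{s} + 128 \times 0.024\,\mu\mathrm{s} \approx 3.3\,\mu\mathrm{s},

of which only 0.25 µs is routing delay.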

1.8. Ignoring the boundaries, with a per-element decomposition there are 4 × 1024 × 1024 × 8 bytes = 32 MB communicated per step. The most efficient decomposition is into 8×8 blocks of 128×128 elements each. Then, except for the boundaries, each block communicates

    4 \frac{n}{\sqrt{p}}

words, i.e. 4 × 1024/8 = 512 words, or 4 KB, per step.
1.9.

    Time_{NoPipe}(m) = 100\, m
    Time_{Pipe}(m) = 10 (m - 1) + 90
    Time(T, m) = 10 (m - 1) + \max(90, T)

The average message rate, in millions of messages per second, is

    AveMsgRate(m) = \frac{1000\, m}{Time(m)\ \mathrm{ns}}
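
A small sketch tabulating the two cases (m is the number of messages; times are in ns and rates in millions of messages per second, matching the formulas above):

#include <stdio.h>

static double time_no_pipe(int m) { return 100.0 * m; }
static double time_pipe(int m)    { return 10.0 * (m - 1) + 90.0; }

int main(void)
{
    for (int m = 1; m <= 1024; m *= 4) {
        /* rate in M msgs/s = 1000 * m / time_in_ns */
        printf("m = %5d  no-pipe: %6.2f M/s  pipe: %6.2f M/s\n",
               m,
               1000.0 * m / time_no_pipe(m),
               1000.0 * m / time_pipe(m));
    }
    return 0;
}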

1.10.

    T(n_{1/2}) = T_0 + \frac{n_{1/2}}{B} = 2 T_0

    \frac{n_{1/2}}{B} = T_0, \quad \text{so} \quad n_{1/2} = T_0 B

1.11. x = -n_{1/2}, the x-intercept of the line T(n) = T_0 + n/B.

1.12.

    L = \text{Arbitration} + \text{Memory} + \text{Transfer} = 2 \times 25\,\mathrm{ns} + 100\,\mathrm{ns} + \frac{32}{8} \times 25\,\mathrm{ns} = 250\,\mathrm{ns}

The bandwidth obtained on this transfer is 32 bytes / 250 ns = 128 MB/s.

1.13. Latency = 3600 ns; BW = 32 bytes / 3600 ns ≈ 8.89 MB/s.

1.14. 8 KB.

1.15. Copying n bytes takes n × 12.5 ns. The message time is 1000 ns + n × 12.5 ns, so the combined time is 1000 ns + n × 25 ns. The effect of the copy is to cut the transfer bandwidth in half. If it took 10 µs for a fast system call, this would be the time to copy 400 bytes.

1.16. Instruction misses, loads, and stores each contribute 0.32 bytes per instruction of traffic. Thus, at 100 MIPS, the traffic is 96 MB/s per processor. With a 250 MB/s bus, only one processor can be kept below the 125 MB/s "half-power point".

1.17. Instruction misses contribute 0.01 × 20 = 0.2 cycles per instruction, and load misses contribute 0.05 × 0.2 × 20 = 0.2 cycles per instruction, for a CPI of 1 + 0.2 + 0.2 = 1.4. With one processor, the memory is utilized for 0.4 cycles out of every 1.4, or 29%.

1.18. As more processors are added, each must wait longer to get access to the bus, so each slows down. It is a closed system, in which each processor must wait for its previous operation to finish before it issues another. We think of each processor as alternating through a cycle of Z units of work, on average, followed by a miss, which involves waiting for the bus and then holding the bus resource for S units. For the data in 1.17 (0.02 misses per instruction at a CPI of 1.4, i.e. 70 ticks per miss), S = 20 ticks and so Z = 50 ticks. Thus, with only one processor, the length of this cycle is 70 ticks and the bus utilization is 28.6%. This means that with a second processor we would see an average queue length of 0.286 and would need to wait 20 + 0.286 × 20 = 25.7 ticks, on average, for each miss. Let R(n) be the time to service a memory request with n processors; then R(1) = S. The time for each processor to make it around the cycle is T(n) = Z + R(n). With n processors doing this, the service rate is X(n) = n / T(n). The trick then is to see how the waiting increases with the number of processors. With n processors, the expected queue length is Q(n) = X(n) R(n) (Little's law). Finally, the expected time to service a request is S plus the time to wait for the others, so we have the following.

    T(n) = Z + R(n)
    X(n) = \frac{n}{T(n)}
    Q(n) = X(n)\, R(n)
    R(n) = S \left( 1 + Q(n - 1) \right)

Calculating this recurrence, we can see how each processor slows down as the bus saturates and the waiting time increases, as shown in the following graphs.
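
A minimal sketch of the recurrence (Z = 50 and S = 20 are the values derived above):

#include <stdio.h>

int main(void)
{
    const double Z = 50.0, S = 20.0;   /* compute ticks and bus-service ticks */
    double R = S;                      /* R(1) = S: no one else to wait for */

    for (int n = 1; n <= 16; n++) {
        double T = Z + R;              /* T(n) = Z + R(n): one full cycle */
        double X = n / T;              /* X(n): aggregate request rate */
        double Q = X * R;              /* Q(n): expected queue length */
        printf("n = %2d  R = %6.1f  T = %6.1f  bus util = %5.1f%%\n",
               n, R, T, 100.0 * n * S / T);
        R = S * (1.0 + Q);             /* R(n+1) = S (1 + Q(n)) */
    }
    return 0;
}

For n = 1 this reproduces the numbers above: a 70-tick cycle, 28.6% bus utilization, and R(2) = 25.7 ticks.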
