Chapter 5

System Buses

5.1 A bus transaction is a sequence of actions that completes a well-defined activity. Some examples of such activities are memory read, memory write, I/O read, burst read, and so on.

5.2 A bus transaction may perform one or more bus operations. For example, a Pentium burst read transfers four words; thus, this bus transaction consists of four memory read operations.

5.3 The address bus width determines the memory-addressing capacity of the system. Typically, 32-bit processors such as the Pentium use 32-bit addresses, and 64-bit processors use 64-bit addresses.

5.4 System performance improves with a wider data bus because we can move more bytes in parallel; thus, a wider data bus increases the data transfer rate. For example, the Pentium uses a 64-bit data bus, whereas the Itanium uses a 128-bit data bus. Therefore, if all other parameters are the same, the Itanium has double the bandwidth of the Pentium.
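The bandwidth comparison in the answer above can be sketched as a one-line calculation. The 100 MHz bus clock and the function name below are hypothetical, chosen only to show that doubling the width doubles the peak rate when everything else is equal:

```python
def peak_bandwidth(data_bus_bits, bus_clock_hz):
    """Peak bandwidth in bytes/second: bytes per transfer times transfer rate."""
    return (data_bus_bits // 8) * bus_clock_hz

# Hypothetical 100 MHz bus clock, identical for both processors.
pentium = peak_bandwidth(64, 100_000_000)   # 64-bit data bus
itanium = peak_bandwidth(128, 100_000_000)  # 128-bit data bus
print(itanium / pentium)  # -> 2.0: twice the width, twice the bandwidth
```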

5.5 As the name implies, dedicated buses have separate buses to carry data and address information. For example, a 64-bit processor with 64 data and 64 address lines requires 128 pins just for these two buses. If we want to move 128 bits of data, as in the Itanium, we need 192 pins! The obvious advantage of these designs is the performance we can get out of them. To reduce the cost of such systems, we might use multiplexed bus designs, in which buses are not dedicated to a function; instead, data and address information are time-multiplexed on a shared bus. Multiplexed bus designs reduce the cost, but they also reduce system performance.

5.6 In synchronous buses, a bus clock signal provides the timing information for all actions on the bus; other signals change relative to the falling or rising edge of the clock. In asynchronous buses, there is no clock signal; instead, four-way handshaking is used to perform a bus transaction. Asynchronous buses therefore allow more flexibility in timing, and their main advantage is that they eliminate the dependence on the bus clock. However, synchronous buses are easier to implement, as they do not use handshaking. Almost all system buses are synchronous, partly for historical reasons: in the early days, the difference between the speeds of various devices was not as great as it is now, and since synchronous buses are simpler to implement, designers chose them.

5.7 In synchronous buses, all timing must be a multiple of the bus clock. For example, if memory requires slightly more time than the default amount, we have to add a complete bus cycle. Of course, we can increase the bus frequency to counter this problem, but that introduces problems such as bus skew, increased power consumption, the need for faster circuits, and so on. Thus, choosing an appropriate bus clock frequency is very important for synchronous buses. In determining the clock frequency, we have to consider all devices that will be attached to the bus. When these devices are heterogeneous, in the sense that their operating speeds differ, we have to operate the bus at the speed of the slowest device in the system.
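The "add a complete bus cycle" effect can be illustrated with a small calculation. The access times and the function name below are hypothetical; the point is only that timing is rounded up to whole clock cycles:

```python
import math

def bus_cycles_needed(access_time_ns, bus_clock_mhz):
    """On a synchronous bus, a device's response time is rounded up to a
    whole number of bus clock cycles, so even a tiny overrun costs a
    complete extra cycle."""
    period_ns = 1000 / bus_clock_mhz
    return math.ceil(access_time_ns / period_ns)

# Hypothetical numbers: a 33.333 MHz bus has a 30 ns clock period.
print(bus_cycles_needed(30, 33.333))  # -> 1: fits exactly in one cycle
print(bus_cycles_needed(31, 33.333))  # -> 2: 1 ns too slow costs a full cycle
```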

5.8 The READY signal is needed in synchronous buses because the default timing allowed by the processor is sometimes insufficient for a slow device to respond. The processor samples the READY line when it expects data on the data bus; it reads the data only if the READY signal is active, and waits otherwise. Thus, slower devices can use this line to indicate that they need more time to respond to the request.
For example, in a memory read cycle, a slow memory may not be able to supply data when the processor expects it. In this case, the processor should not presume that whatever is present on the data bus is the actual data supplied by the memory. That is why the processor always reads the value of the READY line to see whether the memory has actually placed the data on the data bus. If this line is inactive (indicating not-ready status), the processor waits one more cycle and samples the READY line again; it inserts wait states as long as the READY line remains inactive. Once the line is active, it reads the data and terminates the read cycle.
Asynchronous buses do not require the READY signal, as they use handshaking to perform a bus transaction (i.e., there is no default timing as in synchronous buses).
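The wait-state insertion described above can be sketched as a simple loop. The function name and the sample READY sequence are hypothetical, used only to illustrate the sampling behavior:

```python
def read_cycle(ready_samples):
    """Simulate a synchronous read cycle: the processor samples READY once
    per bus cycle and inserts a wait state while the line is inactive (0)."""
    wait_states = 0
    for ready in ready_samples:
        if ready:             # device has placed valid data on the bus
            return wait_states
        wait_states += 1      # not ready: insert one more wait state
    raise TimeoutError("device never asserted READY")

# A fast device responds immediately; a slow memory needs two extra cycles.
print(read_cycle([1]))        # -> 0 wait states
print(read_cycle([0, 0, 1]))  # -> 2 wait states inserted
```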

5.9 Processors provide block transfer operations that read or write several contiguous locations of a memory block. Such block transfers are more efficient than transferring each word individually. A cache line fill is an example that requires reading several contiguous memory locations: data movement between the cache and main memory is in units of the cache line size, so if the cache line size is 32 bytes, each line fill requires 32 bytes of data from memory. The Pentium uses 32-byte cache lines and provides a block transfer operation that transfers four 64-bit values from memory; a single block transfer can therefore fill a 32-byte cache line. Without block transfer, we would need four separate memory read cycles to fill a cache line, which takes more time than the block transfer operation.
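The line-fill arithmetic above reduces to one division. The function name below is hypothetical; the Pentium figures (32-byte line, 64-bit data bus) are from the answer:

```python
def transfers_per_line_fill(line_size_bytes, data_bus_bits):
    """Number of bus data transfers needed to fill one cache line."""
    return line_size_bytes // (data_bus_bits // 8)

# Pentium: a 32-byte cache line over a 64-bit (8-byte) data bus.
print(transfers_per_line_fill(32, 64))  # -> 4 transfers, done as one burst
```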

5.10 The main disadvantage of the static mechanism is that its bus allocation follows a predetermined pattern rather than actual need: a master may be given the bus even if it does not need it. This kind of allocation leads to inefficient use of the bus. This inefficiency is avoided by dynamic bus arbitration, which uses a demand-driven allocation scheme.

5.11 A fair allocation policy does not allow starvation. Fairness can be defined in several ways; for example, it can be defined to handle bus requests within a priority class, or requests from several priority classes. Some examples of fairness are: (i) all bus requests in a predefined window must be satisfied before granting requests from the next window; (ii) a bus request should not be pending for more than some specified time t milliseconds. For example, in the PCI bus, we can specify fairness by indicating the maximum delay to grant a request.

5.12 A potential disadvantage of nonpreemptive policies is that a bus master may hold the bus for a long time, depending on the transaction type. For example, long block transfers can hold the bus for extended periods, which may cause problems for services that need the bus immediately. Preemptive policies force the current master to release the bus without completing its current bus transaction.

5.13 A drawback of the transaction-based release policy is that if a single master requests the bus most of the time, we unnecessarily incur arbitration overhead for each bus transaction. This is typically the case in single-processor systems, in which the CPU uses the bus most of the time and DMA requests are relatively infrequent. In demand-based release, the current master releases the bus only if there is a request from another bus master; otherwise, it continues to use the bus. Typically, this check is done at the completion of each transaction. This policy leads to more efficient use of the bus.

5.14 Centralized implementations suffer from single-point failures due to the presence of the central arbiter. This causes two main problems:
1. If the central arbiter fails, there will be no arbitration.
2. The central arbiter can become a bottleneck limiting the performance of the whole system.
The distributed implementation avoids these problems. However, the arbitration logic has to be distributed among the masters. In contrast, in the centralized organization, the bus masters don't have the arbitration logic.

5.15 The daisy-chaining scheme has three potential problems:
1. It implements a fixed-priority policy, which can lead to starvation problems. From our discussion, it should be clear that a master closer to the arbiter (in the chain) has a higher priority.
2. The bus arbitration time varies and is proportional to the number of masters, because the grant signal has to propagate from master to master. If each master takes d time units to propagate the bus grant signal from its input to its output, a master in the ith position in the chain experiences a delay of (i − 1)d time units before receiving the grant signal.
3. The scheme is not fault tolerant. If a master fails, it may fail to pass the bus grant signal to the masters down the chain.
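The position-dependent grant delay in the daisy chain can be sketched as a one-line calculation. The per-master propagation delay of 2 time units and the function name below are hypothetical, chosen only to illustrate the (i − 1)·d relationship:

```python
def grant_delay(position, per_master_delay):
    """Delay before the master at the given 1-based chain position sees
    the grant signal: (position - 1) * per_master_delay time units."""
    return (position - 1) * per_master_delay

# Hypothetical 2-time-unit propagation delay per master in the chain.
print([grant_delay(i, 2) for i in (1, 2, 5)])  # -> [0, 2, 8]
```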

5.16 The hybrid scheme limits the three potential problems associated with the daisy-chaining scheme:
1. It limits the fixed-priority policy of the daisy-chaining scheme to a class.
2. In the daisy-chaining scheme, the bus arbitration time varies and is proportional to the number of masters; the hybrid scheme limits this delay because the class size is small.
3. The daisy-chaining scheme is not fault tolerant: if a master fails, it may fail to pass the bus grant signal to the masters down the chain. In the hybrid scheme, this problem is confined to the class in which the node failure occurs.
The independent request/grant lines scheme rectifies the problems associated with the daisy-chaining scheme, but it is expensive to implement. The hybrid scheme reduces the cost by applying this scheme only at the class level.

5.17 The ISA bus was closely associated with the system bus used by the IBM PC. It operates at an 8.33 MHz clock with a maximum bandwidth of about 8 MB/s. This bandwidth was sufficient at the time, as memories were slower and there were no multimedia applications, windowed GUIs, and the like to worry about. In current systems, however, the ISA bus can only support slow I/O devices, and even this limited use is disappearing due to the presence of the USB. Current systems use the PCI bus, which is processor independent and can provide a peak bandwidth of up to 528 MB/s.

5.18 The main reason is to save connector pins. Even with multiplexing, 32-bit PCI uses 120-pin connectors, while the 64-bit version needs an additional 64 pins. A drawback of a multiplexed address/data bus is that it needs additional time to turn the bus around.

5.19 The four Byte Enable lines identify the bytes of data to be transferred. Each BE# line identifies one byte of the 32-bit data: BE0# identifies byte 0, BE1# identifies byte 1, and so on. Thus, we can specify any combination of bytes to be transferred. The two extreme cases are: C/BE# = 0000, which indicates transfer of all four bytes, and C/BE# = 1111, which indicates a null data phase (no byte transfer). In a bus transaction with multiple data phases, the byte enable value can be specified for each data phase, so we can transfer just the bytes of interest in each phase. The extreme case of a null data phase is useful, for example, when we want to skip one or more 32-bit values in the middle of a burst data transfer. If the null data phase were not allowed, we would have to terminate the current bus transaction, request the bus again via the arbiter, and restart the transfer with a new address.
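Since the BE# lines are active low, decoding them is a small bit-manipulation exercise. The function name below is hypothetical; the encodings (0000 = all bytes, 1111 = null data phase) follow the answer above:

```python
def enabled_bytes(be_bits):
    """Decode the active-low byte enables (BE3#..BE0# as a 4-bit value)
    into the list of enabled byte lanes of the 32-bit data bus.
    A 0 bit means the corresponding byte IS transferred."""
    return [lane for lane in range(4) if not (be_bits >> lane) & 1]

print(enabled_bytes(0b0000))  # -> [0, 1, 2, 3]: all four bytes transferred
print(enabled_bytes(0b1111))  # -> []: null data phase, no bytes transferred
print(enabled_bytes(0b1100))  # -> [0, 1]: only the low two bytes
```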

5.20 The current bus master drives the FRAME# signal low to indicate the start of a bus transaction. This signal also indicates the length of the bus transaction: it is held low until the final data phase of the transaction.

5.21 PCI uses centralized bus arbitration with independent grant and request lines: each device has separate grant (GNT#) and request (REQ#) lines connected to the central arbiter. The PCI specification does not mandate a particular arbitration policy; however, it mandates that the policy be fair, to avoid starvation.
A device that is not the current master can request the bus by asserting its REQ# line. The arbitration takes place while the current bus master is using the bus. When the arbiter notifies a master that it can use the bus for the next transaction, that master must wait until the current bus master has released the bus (i.e., the bus is idle). The bus idle condition is indicated when both FRAME# and IRDY# are high.
PCI uses hidden bus arbitration in the sense that the arbiter works while another bus master is running its transaction on the PCI bus. This overlapped bus arbitration increases PCI bus utilization by not keeping the bus idle during arbitration.
PCI devices must request the bus for each transaction; however, a transaction may consist of an address phase and one or more data phases. For efficiency, data should be transferred in burst mode. The PCI specification has safeguards to prevent a single master from monopolizing the bus and to force a master to release the bus.
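The idle condition a newly granted master must check can be written as a one-line predicate. The function name below is hypothetical; the condition (both FRAME# and IRDY# deasserted, i.e., high) is the one stated in the answer:

```python
def bus_idle(frame_n, irdy_n):
    """PCI bus idle check: the bus is idle only when both FRAME# and
    IRDY# are deasserted (high, represented here as 1). A master that
    has been granted the bus must see this before starting."""
    return frame_n == 1 and irdy_n == 1

print(bus_idle(1, 1))  # -> True: the next master may start its transaction
print(bus_idle(0, 1))  # -> False: FRAME# low, a transaction is in progress
```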

5.22 PCI allows hierarchical PCI bus systems, which are typically built using PCI-to-PCI bridges (e.g., the Intel 21152 PCI-to-PCI bridge chip). This chip connects two independent PCI buses: a primary and a secondary. The bridge improves performance for the following reasons:
1. It allows concurrent operation of the two PCI buses. For example, a master and a target on the same PCI bus can communicate while the other PCI bus is busy.
2. The bridge also provides traffic filtering, which minimizes the traffic crossing over to the other side.
This traffic separation, along with concurrent operation, improves overall system performance for bandwidth-hungry applications such as multimedia.

5.23 Most PCI buses tend to operate at a 33 MHz clock speed due to serious challenges in implementing the 66 MHz design. To understand this problem, look at the timing of the two buses. The 33 MHz bus cycle of 30 ns leaves about 7 ns of setup time for the target. When we double the clock frequency, all values are cut in half. The reduction in setup time is critical, as it leaves only about 3 ns for the target to respond. As a result of this difficulty, most PCI buses operate at 33 MHz.
PCI-X solves this problem by using a register-to-register protocol, as opposed to the immediate protocol implemented by PCI. In the PCI-X register-to-register protocol, the signal sent by the master device is stored in a register until the next clock. Thus, the receiver has one full clock cycle to respond to the master's request, which makes it possible to increase the frequency to 133 MHz. At this frequency, one clock period corresponds to about 7.5 ns, about the same period allowed for the decode phase in the 33 MHz PCI implementation. We pay for this increase in frequency by adding one additional cycle to each bus transaction, but this overhead is more than compensated for by the higher frequency.
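The clock-period arithmetic behind this argument is easy to check. The helper name below is hypothetical; the frequencies are those discussed above:

```python
def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a given frequency in MHz."""
    return 1000 / freq_mhz

print(round(clock_period_ns(33.333), 1))   # -> 30.0 ns: the 33 MHz PCI cycle
print(round(clock_period_ns(66.666), 1))   # -> 15.0 ns: doubling halves it all
print(round(clock_period_ns(133.333), 1))  # -> 7.5 ns: a full registered PCI-X
                                           # cycle, comparable to the ~7 ns
                                           # decode time of 33 MHz PCI
```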

5.24 With the increasing demand for high-performance video due to applications such as 3D graphics and full-motion video, the PCI bus is reaching its performance limit. In response to these demands, Intel introduced the AGP exclusively to support high-performance 3D graphics and full-motion video applications. The AGP is not a bus in the sense that it does not connect multiple devices; it is a port that connects exactly two devices: the CPU and the video card.
To see the bandwidth demand of full-motion video, consider a 640 × 480 resolution screen. For true color, we need three bytes per pixel; thus, each frame requires 640 * 480 * 3 = 920 KB. Full-motion video uses a frame rate of 30 frames/second, so the required bandwidth is 920 * 30/1000 = 27.6 MB/s. For a higher resolution of 1024 × 768, it goes up to about 70.8 MB/s. We actually need twice this bandwidth when displaying video from hard disks or DVDs, because the data have to traverse the bus twice: once from the disk to system memory, and again from memory to the graphics adapter. The 32-bit, 33 MHz PCI, with its 133 MB/s bandwidth, can barely support this data transfer rate. The 64-bit PCI can comfortably handle full-motion video, but the video data transfer uses half its bandwidth. Since the video unit is a specialized subsystem, there is no reason for it to be attached to a general-purpose bus like the PCI. We can solve many of the bandwidth problems by designing a special interconnection to supply the video data; by taking the video load off the PCI bus, we also improve the performance of the overall system. Intel proposed the AGP precisely for these reasons.

5.25 As we have seen in the text, the AGP is targeted at 3D graphical display applications, which have high memory bandwidth requirements. One of the performance enhancements the AGP uses is pipelining. AGP pipelined transfers can be interrupted by PCI transactions; this ability to intervene in a pipelined AGP transfer allows the bus master to maintain a high pipeline depth for improved performance.

5.26 The STSCHG# signal gives I/O status change information for multifunction PC cards. In a pure I/O PC card, we do not normally require this signal. However, in multifunction PC cards containing memory and I/O functions, this signal is needed to report the status of the signals removed from the memory interface (READY, WP, BVD1, and BVD2). A configuration register (called the pin replacement register) in the attribute memory maintains the status of these removed signals. For example, since the BVD signals are removed from the memory interface, this register keeps the BVD information to report the status of the battery. When a status change occurs, the STSCHG# signal is asserted, and the host can read the pin replacement register to get the status.
