
CHAPTER 18

Network Protocols

P. M. Melliar-Smith and Louise E. Moser

Traditional network protocols supported the movement of bits from source to destination. With the inadequate communication media of the past, that was achievement enough. But communication media are improving quite rapidly, though perhaps not rapidly enough, and are becoming cheaper, faster, more flexible, more dependable, and ubiquitous. The network protocols of the future will be designed specifically to support applications, and application requirements will determine the communication services that are provided. Ingenious protocol mechanisms will no longer be required to overcome the inadequacies of the network infrastructure. Communication services are what is important to the users; protocol mechanisms should remain invisible to them.

Many network protocols are built hierarchically as a stack of protocols, as shown in Figure 18.1. The implementation of each protocol in the stack exploits the services provided by the protocols below it. The service provided by each protocol is freestanding and does not depend on whether the implementation of that protocol exploits other protocols. One approach to assembling a protocol stack is to allow the user to construct the stack by choosing protocols from a toolkit of microprotocols [491, 548].

In this chapter, we address the topic of network protocols: their characteristics and the services they provide. First, we consider the grid applications discussed in Chapters 3 through 6 and the generic types of communication services they need (Section 18.1). In Section 18.2, we identify four different classes of network protocols that provide these types of services. In Section 18.3, we discuss different types of message delivery services needed by

FIGURE 18.1 Protocol stack.

the applications. In Section 18.4, we address the issue of maintaining consistency of replicated information, which leads into a discussion in Section 18.5 of group communication and group membership protocols. In Section 18.6, we consider the issues of resource management, including flow control, congestion control, latency, and jitter. With these services and issues in mind, we then present in Section 18.7 an overview of several of the more interesting recently developed protocols. Finally, we discuss in Section 18.8 the future of network protocols.

18.1 APPLICATION REQUIREMENTS
The classical network protocol provides a point-to-point, sender-initiated data transfer service, delivering messages in the order in which they were sent, with good reliability and with throughput as the measure of performance. However, each application has its own reasons for communicating and its own types of information to be communicated, which may impose very different requirements on the network protocols. Teleimmersion and collaborative virtual environments (see Chapter 6) are some of the most challenging applications for network protocols because of their varied, demanding requirements. These applications require the following protocols:
- Data streaming protocols for audio and video, often multicast at high rates
- Reliable data transport protocols for collaboration, again often multicast
- Coordinated control protocols and membership protocols for collaboration
- Protocols for the integration of complex, independently developed modules

Realtime instrumentation applications (see Chapter 4) also have demanding requirements and are currently limited by a reluctance to commit such critical applications to the unreliability of the Internet. These applications require the following protocols:
- Data transport protocols that provide reliable delivery of messages and also unreliable timestamped delivery of messages containing the most recent available data
- Coordinated control protocols with fault tolerance
- Protocols for the integration of complex, independently developed modules

Data-intensive applications (see Chapter 5) are still in their infancy. These applications will undoubtedly require the following protocols:
- Data transport protocols for the efficient, rapid, and reliable transport of huge datasets
- Protocols for the integration of complex, independently developed modules
- Protocols for the encapsulation and integration of existing modules and movement of program code to remote sites where the data are located

Distributed supercomputing applications (see Chapter 3), both on interconnected supercomputers and on clusters of small computers, are perhaps the best developed of the four types of applications. These applications require the following protocols:
- Low-latency, reliable data transport protocols, sometimes multicast
- Low-latency control and synchronization protocols that are scalable to quite large numbers of nodes
- Protocols for the encapsulation and integration of existing modules and movement of program code to remote sites at which the computation is to be performed


18.2 CLASSES OF NETWORK PROTOCOLS


Despite the multitude of network protocols that have been proposed, a few major classes of protocols are starting to emerge. We consider four of these below. There is a common tendency to describe the mechanisms of the protocols, even though it is the protocol services that are important to the users. We endeavor to emphasize the services.

18.2.1 Data Transport Protocols


Data transport protocols are the workhorses of the Internet and are exemplified by the venerable but highly effective Transmission Control Protocol (TCP). They typically provide reliable, source-ordered delivery of messages with excellent flow control and buffer management. Most data transport protocols are point to point, but multicast protocols are becoming more common. The Xpress Transfer Protocol (XTP) [34, 526] is an example of a data transport protocol that has been designed specifically to achieve high performance, particularly for parallel computations. Quite different is the Scalable Reliable Multicast (SRM) protocol [194], a reliable data transfer protocol intended for multicasting to large groups over the Internet, as required, for example, in collaborative virtual environment applications. XTP and SRM are described in more detail in Sections 18.7.1 and 18.7.2, respectively.

18.2.2 Streaming Protocols

Streaming protocols are typically used for audio, video, and multimedia and also for certain instrumentation applications. These applications do not require reliable delivery and can accept message loss, particularly if selective-loss algorithms are used. Multicasting is common in these applications, but unreliable source-ordered delivery suffices. Even when the traffic is bursty, flow control is ineffective, and bandwidth reservation is preferable. Low jitter is important. An example of a streaming protocol is the Real-Time Transport Protocol (RTP) [498], described in Section 18.7.4, which might be used in conjunction with the Resource reSerVation Protocol (RSVP) [76, 584], described in Section 18.7.5. Unfortunately, the datagram orientation of the Internet does not yet support streaming protocols well; the connection-oriented, negotiated quality-of-service approach of ATM generally provides a better foundation for streaming protocols.


18.2.3 Group Communication Protocols


What distinguishes group communication protocols from the data transport and streaming protocols described above is that group communication protocols are concerned with more than just movement of data. In a distributed system, to permit effective cooperation and to provide fault tolerance, a group of several processes must all keep copies of the same data, and the copies of the data must be kept consistent as the application executes. Maintaining consistency is difficult for the application programmer, particularly in the presence of faults. Group communication protocols assist application programmers in maintaining the consistency of replicated data by maintaining the memberships of process groups and by multicasting messages to those process groups.

The InterGroup Protocols (IGPs) [56], described in Section 18.7.3, are group communication protocols being developed for distributed collaborative virtual environments and realtime instrumentation applications, to allow scientists and engineers to collaborate and conduct experiments over the Internet. The IGP protocols might also be suitable for other applications. They pay special attention to scaling to large sizes and to minimizing the effect on latency of long propagation times across a large network.

18.2.4 Distributed Object Protocols


A new area of protocol development is distributed object protocols for heterogeneous distributed systems, exemplified by the Internet Inter-Orb Protocol (IIOP) [388] developed by the OMG for CORBA (see also Chapter 9). CORBA supports the use of existing legacy code, provides interoperability across diverse platforms, and hides the distributed nature of the computation and the location of objects from the application. However, the overheads of CORBA's remote method invocation are currently still high. IIOP, described in Section 18.7.6, allows ORBs from different vendors to work together. CORBA and IIOP are well suited for distributed collaborative virtual environments and realtime instrumentation applications, which involve the integration of complex distributed systems from existing software.

Also important is Java's Remote Method Invocation (RMI) [290], described in Section 18.7.7, which supports the movement of code to remote machines (see also Chapter 9). While similar in some respects to CORBA, Java RMI provides less support for the integration of complex systems and no support for existing code. Currently, Java is less efficient than CORBA, but improved performance should come soon. Java and RMI are particularly well suited for data-intensive and distributed supercomputing applications, as these can


require movement of programs. The integration of Java with CORBA achieves the best of both technologies.

18.3 TYPES OF MESSAGE DELIVERY SERVICES


The most basic form of message delivery service is unreliable message delivery, which provides only a best-effort service with no guarantees of delivery. Unreliable message delivery is used for audio and video streaming protocols, such as are needed in teleimmersion. Most applications require a more stringent service, reliable message delivery, which ensures that each message sent is eventually received, possibly after multiple retransmissions, despite loss of messages by the communication medium.

Many recent network protocols extend the point-to-point service of the traditional network protocol to a multicast service [151]. Multicasting allows efficient delivery of the same information to several, possibly many, destinations, a service needed particularly by collaborative virtual environments but also by other applications. Multicasting is conceptually simple but involves many complications in the protocol mechanisms to achieve efficiency. Typical multicast protocols provide only one-to-many multicasting, where a single sender sends to multiple destinations, but collaborative virtual environments require many-to-many multicasting.

Another important criterion for a message delivery service is the order in which messages are delivered, including causally ordered delivery, source-ordered delivery, group-ordered delivery, and totally ordered delivery, as described below.

Causally ordered delivery [328] requires that (1) if process P sends message M before it sends message M′ and process Q delivers both messages, then Q delivers M before it delivers M′, and (2) if process P receives message M before it sends message M′ and process Q delivers both messages, then Q delivers M before it delivers M′. Delivery of messages in causal order prevents anomalies in the processing of data contained in the messages, but it does not maintain the consistency of replicated data.
Source-ordered delivery, or FIFO delivery, requires that messages from a particular source are delivered in the order in which they were sent. For multimedia streaming protocols, such as are required for teleimmersion, source-ordered delivery suffices. Data distribution, such as in data-intensive applications, also requires only source-ordered delivery.

Group-ordered delivery requires that, if processes P and Q are members of a process group G, then P and Q deliver the messages originated by the processes in G in the same total order. A reliable group-ordered message

FIGURE 18.2 Types of message delivery services, where A, B, and C are processors and A1, A2, and A3 are messages sent by processor A, and so forth.

delivery service helps to maintain the consistency of replicated data when processes that constitute a group have copies of the same data and must update that data.

Totally ordered delivery requires that, if process P delivers message M before it delivers message M′ and process Q delivers both messages, then Q delivers M before it delivers M′. Totally ordered delivery is important where systemwide consistency across many groups is required, as in realtime instrumentation applications. Various types of message delivery services are illustrated in Figure 18.2.
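The causal-ordering requirement above can be stated operationally with vector clocks, in the style of classic causal multicast protocols. The sketch below is illustrative; the function names and dictionary-based clocks are our own, not taken from any particular protocol.

```python
# Sketch of the standard vector-clock delivery condition for causal order.
# A process holds one counter per sender; a message carries the sender's
# vector clock at the moment it was sent.

def can_deliver(msg_clock, sender, local_clock):
    """A message from `sender` with vector timestamp `msg_clock` is
    deliverable iff it is the next message from that sender and every
    causally prior message from other processes has been delivered."""
    for p, count in msg_clock.items():
        if p == sender:
            if count != local_clock.get(p, 0) + 1:
                return False   # not the next message from the sender
        elif count > local_clock.get(p, 0):
            return False       # a causally earlier message is still missing
    return True

def deliver(msg_clock, sender, local_clock):
    """Advance the local clock after delivering the message."""
    local_clock[sender] = local_clock.get(sender, 0) + 1

# Example: P sends M (clock {P: 1}), then M' (clock {P: 2}).
clock_q = {}
assert can_deliver({"P": 2}, "P", clock_q) is False  # M' arrived first: hold it
assert can_deliver({"P": 1}, "P", clock_q) is True   # M is deliverable
deliver({"P": 1}, "P", clock_q)
assert can_deliver({"P": 2}, "P", clock_q) is True   # now M' may follow
```

Note that this condition enforces only causal order; producing the single total order needed for group-ordered or totally ordered delivery requires additional machinery, such as a sequencer or a token.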

18.4 MAINTAINING CONSISTENCY
In a distributed system, processing and data may be replicated for increased reliability or availability, or faster access. In particular, several processes may perform different tasks of the application and may each have a copy of the

FIGURE 18.3 Total ordering on messages helps to maintain the consistency of replicated data.

same data so that they can cooperate. Alternatively, several processes may be replicated for fault tolerance, with each replica performing the same task of the application and holding a copy of the same data. In both cases, the copies of the replicated data must be kept consistent as the application executes.

Group communication protocols [56, 62, 162, 372, 401, 548] help to maintain the consistency of replicated data. Such protocols distribute update messages to all processes holding copies of the replicated data using a multicast service. They provide a reliable, totally ordered message delivery service, which ensures that each process performs the same sequence of updates in the same order, as shown in Figure 18.3. This simplifies the application programming required to maintain the consistency of the replicated data. Because they are designed to support fault tolerance in addition to coordination of distributed computation, these protocols are very robust and handle a wide range of faults, including processor faults and network partitioning. They include excellent fault detection, group membership, and recovery protocols that provide a consistent view of faults and membership changes systemwide [62, 400].

Early group communication protocols were inefficient, but performance has improved. Current group communication protocols operate very efficiently in the benign private world of a LAN, as efficiently as other protocols despite the more elaborate services they provide. Effective strategies are, however, still being developed for the hostile world of the Internet [56], where resources must be shared with unknown and uncooperative users.

Applications for these protocols can be purely computational (e.g., distributed supercomputing applications), where tasks might be allocated to processors by a distributed scheduler, which is coordinated by a group communication protocol. Alternatively, these protocols can be used to mediate human interaction, as in a collaborative virtual environment, where a group communication protocol can ensure that all users see the same updates to a shared workspace. The protocols can also be used in the control or monitoring of a physical device, as in a realtime instrumentation application, where different processes control different parts of the instrumentation, coordinated by a group communication protocol.
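The reason totally ordered delivery simplifies application programming can be seen in a toy replicated-state example. This is an illustration of the principle, not code from any real group communication system.

```python
# Toy illustration: if every replica applies the same totally ordered
# sequence of updates to the same initial state, the copies stay identical.

def apply_updates(state, ordered_updates):
    """Apply a totally ordered sequence of (key, value) updates."""
    for key, value in ordered_updates:
        state[key] = value
    return state

# The group communication protocol delivers the same total order everywhere.
total_order = [("x", 1), ("y", 2), ("x", 3)]

replica_a = apply_updates({}, total_order)
replica_b = apply_updates({}, total_order)
assert replica_a == replica_b == {"x": 3, "y": 2}

# Without a total order, replicas that see the updates in different
# orders can diverge:
diverged = apply_updates({}, [("x", 3), ("y", 2), ("x", 1)])
assert diverged != replica_a
```

The protocol, not the application, guarantees the identical delivery order, so the application can treat each replica as an independent state machine.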

18.5 GROUP MEMBERSHIP
A process group [62] is a set of processes that collaborate on an application task. Messages related to the application task are multicast to the members of the process group. A distributed application may consist of hundreds of process groups, and a process group may contain hundreds of processes. The processes may join and leave the groups voluntarily, as in a collaborative virtual environment, or because a process or processor has failed, as in a realtime instrumentation application.

Naming mechanisms that work well for simple applications on small, stable networks are inappropriate for large complex dynamic applications. For example, if a processor fails, its processes may be replaced by other processes on different processors. Thus, messages are not addressed to a specific process on a specific processor. Rather, each message is addressed to the process group and is delivered to all of the processes that have joined that group. This strategy must be supported by mechanisms that allocate unique names to process groups and by a name server.

Some applications and many protocol mechanisms need to know the current group membership for each process group so that, for example, a distributed scheduling algorithm can assign tasks on a distributed supercomputer. It is also important to establish a precise relationship between message ordering and membership changes. For example, if state is being transferred to a new process on another processor, it is essential to establish whether the messages were processed before the membership change and thus are reflected

FIGURE 18.4 When state is transferred from one process to another, it is important that both processes agree on which messages should be processed before the state transfer and which should be processed after.

in the state being transferred, or whether the messages were processed after the membership change and thus should be processed by the new process, as shown in Figure 18.4. The concept of ordering membership changes relative to data messages is called virtual synchrony [62].

The maintenance of group membership is quite difficult (indeed, theoretically impossible without the use of a fault detector), but several effective practical membership protocols have been developed. As shown in Figure 18.5, the typical approach involves the following phases:

- A discovery phase of broadcast solicitations and responses that establishes which processes are alive and can communicate
- A commitment phase using a form of two-phase commit that determines the membership
- A recovery phase in which messages from the prior membership are collected and delivered

FIGURE 18.5 The phases of a membership protocol.

Liveness is ensured by repeating these steps as often as is required to complete the algorithm while at each step excluding at least one additional process thought to be faulty in the previous membership attempt. This guarantees termination in finite time, but at the expense of possibly allowing, in unfavorable circumstances, each nonfaulty process to form a membership containing only itself. In practice, with appropriate values of the time-outs, the approach is robust and effective.

The problem with such a membership protocol, and many other protocols that maintain network topology, connectivity, or distance information, is that the cost is proportional to N², where N is the number of processes in the group. Furthermore, the interval between membership changes is inversely proportional to N. As N becomes large, the system may spend too much of

its time in the membership protocol. Worse still, to ensure virtual synchrony, many group communication systems suspend message delivery during membership changes. As N becomes large, they may deliver no messages at all. Research is under way into weaker forms of membership that scale better while remaining useful to the applications, such as the collaborative virtual environments and distributed supercomputing applications, that may involve many processes.
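The discovery phase and the retry rule that guarantees termination can be sketched as a simple loop. This sketch simulates only the skeleton of the approach: the solicitations, the two-phase-commit votes of the commitment phase, and the message recovery are elided, and the `responds` callback stands in for a fault detector.

```python
# Sketch of the retry loop of a membership protocol. Discovery is
# simulated by a `responds` callback; commitment and recovery are elided.
# The key property: each failed attempt excludes at least one suspected
# process, so the loop terminates in a finite number of attempts.

def run_membership(candidates, responds):
    """`candidates` is the prior membership; `responds(p)` reports whether
    process p answered the discovery solicitation within the time-out."""
    members = set(candidates)
    while members:
        alive = {p for p in members if responds(p)}  # discovery phase
        if alive == members:
            return sorted(members)                   # commitment succeeds
        members = alive    # exclude suspected processes and retry
    return []              # degenerate case: everyone was excluded

# S has failed; Q and R still respond (as in Figure 18.5).
new_membership = run_membership({"Q", "R", "S"},
                                lambda p: p in {"Q", "R"})
assert new_membership == ["Q", "R"]
```

In the worst case described in the text, mutually suspicious processes would each end with a membership containing only themselves; with well-chosen time-outs that outcome is rare in practice.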

18.6 RESOURCE MANAGEMENT

The great advantage that the Internet has demonstrated over traditional telecommunication networks is increased sharing of network resources, which leads to reduced costs. Many network connections have intermittent and bursty traffic. Where a network resource can be shared between enough connections, typically more than 20, quite high utilization of the resource can be obtained without serious degradation of the service provided to each connection. Multiple bursty traffic sources will, however, always be able to overwhelm the network; thus, flow control and congestion control will remain important.

18.6.1 Flow Control and Congestion Control

With modern communication equipment, packet or cell loss is almost always caused by congestion, either in the switches and routers of the network or in the input buffers at the destinations, rather than by corruption of packets or cells by the medium. Consequently, the TCP protocol uses packet loss as an indication that it should reduce its transmission rate. The TCP backoff algorithm, which employs rapid rate reduction on packet loss and slow rate increase in the absence of packet loss, is very effective for data transmission, as for example in data-intensive and distributed supercomputing applications. It allows many TCP connections to share the network fairly and efficiently without explicit coordination.

The strategy suffers, however, from several problems. First, it is not appropriate for data streaming, such as the multimedia connections of a teleimmersion application, where the transmission rate is relatively high but predictable and constant. Such connections are best served by an explicit reservation strategy, such as ATM provides and RSVP has proposed for the Internet. A second problem is that connections that observe the TCP backoff strategy can be squeezed out by rogue connections that do not reduce their transmission rate in the presence of congestion and packet loss. This problem is

FIGURE 18.6 Latency is the time from message origination at the source to message delivery at the destination(s). Jitter is the variance in the latency.

addressed by ATM's traffic-policing mechanisms and, for the Internet, by random early drop (RED) algorithms in Internet switches [193]. RED algorithms explicitly discriminate, by dropping packets, against connections that do not reduce their transmission rate in the presence of congestion.

TCP's backoff strategy also becomes less effective for connections that communicate at high rates over long distances. The time to detect packet loss on a 1 Gb/s transcontinental link corresponds to about 5 MB of transmitted data, given the 40 ms round-trip time. At, say, 40 Gb/s, this becomes over 200 MB, and switch buffering requirements become unreasonable.
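The buffering figures quoted above follow directly from the bandwidth-delay product: the amount of data in flight before a loss can be detected is roughly one round-trip time at the link rate. A quick check:

```python
# Bandwidth-delay product: data "in flight" during one round-trip time,
# which is a lower bound on the buffering needed to detect and recover
# from packet loss without stalling the link.

def bandwidth_delay_product_bytes(rate_bits_per_s, rtt_s):
    return rate_bits_per_s * rtt_s / 8

# A 1 Gb/s transcontinental link with a 40 ms round-trip time:
assert bandwidth_delay_product_bytes(1e9, 0.040) == 5e6    # 5 MB
# At 40 Gb/s, the same round trip holds 200 MB in flight:
assert bandwidth_delay_product_bytes(40e9, 0.040) == 2e8   # 200 MB
```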

18.6.2 Throughput, Latency, and Jitter


Critical performance metrics for network protocols and for the applications are throughput, latency, and jitter. Throughput is the number of user data bits communicated per second. Latency is the time from message origination at the source to message delivery at the destination(s), and jitter is the variance in the latency, as shown in Figure 18.6. Because of the inadequate performance of the network infrastructure, protocols have focused on maximizing the throughput of the network, which has been the bottleneck. As network throughput has improved, attention has shifted to latency and jitter. Protocol mechanisms that increase the throughput of the network may also increase the latency. Long latencies are undesirable for some applications, such as distributed supercomputing and realtime instrumentation and control. Protocols have been developed, particularly for distributed supercomputing, with mechanisms to reduce the latency caused by the protocol stack and the operating

FIGURE 18.7 Multicast protocols incur a high risk of packet loss at the input buffers of the destinations.

system (see Chapter 20). Such mechanisms are effective, however, only for lightly loaded systems in which queuing delays are not significant. For heavily loaded complex systems, high throughput reduces queuing delays and is the most effective approach for reducing overall system latency.

Jitter is highly undesirable for audio, video, and multimedia applications, such as teleimmersion. Special protocols that timestamp every packet, such as RTP [498], allow accurate measurement of latency, which is effective at helping to reduce or mask jitter. Also important for multimedia applications, particularly those that use MPEG encoding of streams, are protocols and switches that select intelligently the packets to be dropped when congestion occurs [193].
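Timestamping every packet, as RTP does, lets a receiver estimate jitter incrementally. The estimator below follows the general shape of the RTP interarrival-jitter calculation, a smoothed running average of the differences in packet transit times; the gain of 1/16 matches the RTP specification, but the surrounding code is our own sketch.

```python
# Sketch of an RTP-style interarrival jitter estimator: a smoothed
# running average of the difference in transit times (arrival time minus
# send timestamp) of successive packets, with the 1/16 gain used by RTP.

def update_jitter(jitter, prev_transit, transit):
    d = abs(transit - prev_transit)
    return jitter + (d - jitter) / 16.0

# Transit times for a few packets, in ms; one packet is delayed 8 ms.
transits = [50.0, 50.0, 58.0, 50.0]
jitter = 0.0
for prev, cur in zip(transits, transits[1:]):
    jitter = update_jitter(jitter, prev, cur)

assert jitter > 0.0   # variation in transit time shows up as jitter
assert jitter < 8.0   # smoothed, so well below the 8 ms spike itself
```

A receiver can use such an estimate to size its playout buffer, masking jitter at the cost of a little added latency.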

18.6.3 Multicast/Acknowledgment Mechanisms

Good performance of group communication protocols, and even the feasibility of Internet video broadcasting, requires that multicasting be able to reach many destinations with a single transmission, sharing the bandwidth and avoiding wasteful multiple transmissions of the same information. With a modern high-performance network and multiple sources multicasting simultaneously, as in a collaborative virtual environment, it is easy to transmit information faster than the receivers can process it, resulting in a bottleneck at the input buffers of the receivers, as shown in Figure 18.7. When using high-performance LANs, rather than the slower Internet, the best group communication protocols focus their flow control mechanisms on preventing input buffer overflow [401].

Many recent protocols exploit negative acknowledgments rather than positive acknowledgments to achieve higher performance and higher reliability [194]. Positive acknowledgment protocols, also called sender-based protocols,

FIGURE 18.8 Positive and negative acknowledgments. With many receivers in a multicast group, there is an implosion of acknowledgments at the sender.

require the receiver to send an acknowledgment to the sender if it has received the message. If the sender does not receive an acknowledgment within a certain period of time, it retransmits the message. In contrast, negative acknowledgment protocols, or receiver-based protocols, require the receiver to send a negative acknowledgment to the sender if it has not received the message within a certain period of time. With a large number of receivers in a multicast group, positive acknowledgment protocols can result in the sender being overwhelmed by an implosion of positive acknowledgments, as shown in Figure 18.8, making a negative acknowledgment protocol more appropriate.

To achieve reliable delivery, positive acknowledgments are typically combined with negative acknowledgments. This combination allows messages to be removed from the sender's buffers when those messages are no longer needed for possible future retransmission. As larger amounts of data are transmitted over longer distances at higher speeds, the buffers must become larger, and buffer management becomes more important.
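The receiver-based idea can be sketched in a few lines: the receiver tracks which sequence numbers it has seen and reports only the gaps, so control traffic back to a multicast sender is proportional to the losses rather than to the number of receivers. The function name is illustrative.

```python
# Sketch of a receiver-based (negative acknowledgment) scheme: the
# receiver reports only the sequence numbers it is missing, instead of
# acknowledging every packet, so a multicast sender is not overwhelmed
# by an implosion of positive acknowledgments.

def missing_sequence_numbers(received, highest_sent):
    """Return the gaps a receiver would report in a NACK."""
    return [n for n in range(1, highest_sent + 1) if n not in received]

# The receiver got packets 1, 2, and 5 out of 5 sent:
nack = missing_sequence_numbers({1, 2, 5}, 5)
assert nack == [3, 4]    # only the gaps travel back to the sender

# With no losses, no control traffic is generated at all:
assert missing_sequence_numbers({1, 2, 3}, 3) == []
```

The weakness, addressed by the positive/negative combination described above, is that silence from a receiver tells the sender nothing about when its buffered copies can safely be discarded.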

18.6.4 Translucency
The limited performance of current communication networks, coupled with rapidly growing and changing demands on network protocols, has encouraged


translucency, or access by the application programmer to the mechanisms of the network protocols. Examples are the following:

- Network-aware applications adapt their own behavior and demands on the network to reflect the current network load, latency, or rate of message loss. This strategy may allow applications to operate more effectively under adverse network conditions. The disadvantage is that the strategy increases application programming costs. Building a better network may be cheaper.
- Application-level framing allows the application to be aware of, and even manipulate directly, the actual packets or frames transmitted over the communication medium. The advantage can be lower overheads for applications such as teleimmersion; the disadvantage is that the application program loses portability and adaptability to future networks.
- Microprotocol toolkits allow the user to assemble the protocol stack by choosing the protocols from a toolkit of microprotocols [491, 548]. The advantage is that a skilled user can construct a protocol suited to the needs of the application. In general, however, network protocols constructed by the user from a microprotocol toolkit are less efficient than more highly integrated protocol implementations, because the interfaces between the microprotocols incur overhead and inhibit code optimization. Additional inefficiency is caused by microprotocols that must be designed to be freestanding so that they can be used with or without other microprotocols.

As high-performance networks become readily available, as the networking environment stabilizes, and as the needs of applications become better understood, the advantages of translucency will diminish, and the costs of custom network programming in the application will be harder to justify, even for performance-sensitive applications such as teleimmersion and distributed supercomputing. The predominant consideration will be the convenience and efficiency of a standard, fully integrated protocol providing a simple, well-defined, and stable service to the application.
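The microprotocol idea, composing a stack from freestanding layers, can be sketched as function composition over a send path. The layer names here are invented for illustration; real toolkits such as those cited above define richer layer interfaces, and it is exactly those interfaces that introduce the overhead noted in the text.

```python
# Sketch of a microprotocol toolkit: each microprotocol is a freestanding
# layer that wraps the send path of the layer below it. Layer names are
# invented for illustration.

def fragment(send, mtu=4):
    """Split outgoing data into chunks no larger than `mtu`."""
    def wrapped(data):
        for i in range(0, len(data), mtu):
            send(data[i:i + mtu])
    return wrapped

def sequence(send):
    """Attach an increasing sequence number to each outgoing packet."""
    counter = {"n": 0}
    def wrapped(data):
        counter["n"] += 1
        send((counter["n"], data))
    return wrapped

# Assemble a stack by choosing and ordering layers; the "wire" at the
# bottom just records the packets it is handed.
wire = []
stack = fragment(sequence(wire.append))

stack(b"12345678")
assert wire == [(1, b"1234"), (2, b"5678")]
```

Because each layer must work with or without the others, every packet pays the cost of crossing each layer boundary, which is the inefficiency the text attributes to user-assembled stacks.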

18.6.5 Achieving Adequate Network Performance

The primary requirement of adequate network performance is adequate bandwidth, which is currently a problem but will become less of a problem in the future. Efficient and cost-effective use of the network will, however, depend on sharing the cost of the bandwidth with many users, and on a pricing mechanism that restrains the growth in demand and also funds the increase


in available bandwidth. Other resources, such as buffer space, must also be shared. Users who require exclusive reservation of resources, such as ATM AAL1 users, are likely to pay much higher costs than users who agree to share resources. The rapid growth and effectiveness of the Internet depend in part on the low costs made possible by packet-switching protocols that share the available bandwidth among multiple packets simultaneously. The congestion of the Internet results from the lack of an effective pricing mechanism.

As networks continue to develop, the greatest contributions of network protocols will result from improved services rather than improved mechanisms. Substantial performance improvements result from transmitting information once only instead of many times in a collaborative virtual environment, from transmitting the right information in a data-intensive application, or from not needing to transmit the information at all in a distributed supercomputing application. Caching, now used quite extensively to speed access to popular sites on the Web, is an example of a mechanism that improves network performance by reducing the need to transmit information. As yet, other network protocols have made little use of caching beyond buffering messages for possible retransmission after message loss. Improved protocols and improved network performance, resulting from transmitting the right information at the right time, will result from understanding better the services that the applications need.

18.7

EXAMPLE NETWORK PROTOCOLS


We describe here a few of the more interesting, recently developed network protocols. The different characteristics and mechanisms of these protocols make them appropriate for different aspects of grid applications such as those considered in this book.

18.7.1

Xpress Transfer Protocol


The Xpress Transfer Protocol (XTP) is a transport-level protocol designed for distributed parallel applications operating over clusters of computers. For flexibility and efficiency, it supports several different communication paradigms, as discussed below. XTP supports both unicasting from a single sender to a single receiver and one-to-many multicasting from a single sender to a group of receivers. Data flows in one direction from the sender to the receiver(s), with control traffic flowing in the opposite direction. Multiple instances, or contexts, of XTP can be active at a node, providing partial support for many-to-many multicasting.

Normally, the receivers do not inform the sender of missing packets until the sender asks for such information; however, an option is available that allows the receivers to notify the sender immediately after detecting a packet loss. A control packet indicating lost packets contains the highest consecutive sequence number of any of the packets received by a receiver and the ranges of sequence numbers of missing packets. Retransmissions are multicast to the group and are either selective or go-back-N. The sender may disable error control but may still ask the receivers for flow control and rate control information.

XTP supports both flow control and rate control algorithms, which provide the requested quality of service. Flow control, in which further data is sent when acknowledgments are received for prior transmissions, is appropriate for both data and control. Rate control, in which data is transmitted at a constant rate, is appropriate for data streaming. Flow control considers end-to-end buffer space; rate control considers processor speed and congestion.

Multicast groups are managed by the user, based on information that the sender maintains about the receivers. The user specifies how the initial group of receivers is formed, and the criteria for admission and removal once the multicast group is established. The sender learns which receivers are active by periodically soliciting control packets.

The scalability of XTP is limited by its method of error recovery. As with other sender-initiated error recovery approaches, the sender may be unable to maintain state for large receiver groups, and the throughput of the sender may be slowed by control packet implosion and processing and by retransmissions. Multicasting retransmissions to the entire multicast group consumes unnecessary bandwidth on links to receivers that have already received the packet.
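The contents of such a control packet can be sketched as follows. This is an illustrative reconstruction of the reception summary described above, not code from any XTP implementation; the function and field names are assumptions.

```python
def reception_summary(received):
    """Summarize received sequence numbers, XTP-style: the highest
    consecutive sequence number seen (starting from 0), plus the
    ranges of sequence numbers still missing below the highest
    sequence number seen. Illustrative sketch only."""
    received = sorted(set(received))
    if not received:
        return {"highest_consecutive": -1, "missing": []}
    # Highest consecutive sequence number, counting up from 0.
    highest = -1
    for seq in received:
        if seq == highest + 1:
            highest = seq
        else:
            break
    # Gaps between the consecutive prefix and the largest number seen.
    missing = []
    expected = highest + 1
    for seq in received:
        if seq > expected:
            missing.append((expected, seq - 1))
        if seq >= expected:
            expected = seq + 1
    return {"highest_consecutive": highest, "missing": missing}
```

A receiver that has seen packets 0-2, 5-6, and 9 would report a highest consecutive sequence number of 2 and the missing ranges (3, 4) and (7, 8), from which the sender can choose selective or go-back-N retransmission.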

18.7.2

Scalable Reliable Multicast Protocol


The Scalable Reliable Multicast (SRM) protocol is a lightweight protocol that supports the notion of application-level framing and is intended for applications such as collaborative virtual environments. SRM aims to provide reliable delivery of packets over the Internet, in that all packets should eventually be delivered to all members of the multicast group; no delivery order, either source ordered or group ordered, is enforced [194]. To receive packets sent to the group, a receiver sends a join message on the local subnet announcing that it is interested in receiving the packets. Each receiver voluntarily joins and leaves the group without affecting the transmission of packets to other group members.

Each group member is individually responsible for its own reception of packets by detecting packet loss and requesting retransmission. Packet loss is detected by finding a gap in the sequence numbers of the packets from a particular source. To ensure that the last packet in a session is received, each member periodically multicasts a session message that reports the highest sequence number of packets that it has received from the current senders. The session messages are also used to determine the current participants in the session and to estimate the distance between nodes.

When a node detects a missing packet, it schedules a repair request (retransmission request) for a random time in the future. This random delay is determined by the distance between the node and the sender of the packet. When its repair request timer for the missing packet goes off, the node multicasts a repair request for the packet, doubles its request timer delay, and waits for the repair. If the node receives a request for the packet from another node before its request timer goes off, it does an exponential backoff and resets its request timer.

When a node receives a repair request and it has the requested packet, the node schedules a repair for a random time in the future. This random delay is determined by the distance between the node and the node requesting the repair. When its repair timer for the packet goes off, the node multicasts the repair. If the node receives a repair for the packet from another node before its repair timer goes off, it cancels its repair timer.

The request/repair timer algorithm introduces some additional delay before retransmission to reduce implosion and to reduce the number of duplicates. It combines the systematic setting of the timer as a function of distance (between the given node and the source of the lost packet or of the repair request) and randomization (for the nodes that are at an equal distance from the source of the missing packet or repair request). Because the repair timer delay depends on the distance to the node requesting the repair, a nearby node is likely to time out first and retransmit the packet.
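The timer-setting rule combining distance and randomization can be sketched as below. The constants and names are illustrative, not values taken from the SRM specification.

```python
import random

def request_delay(distance, backoff_count=0, c1=2.0, c2=2.0):
    """SRM-style repair-request delay: drawn uniformly from the
    interval [c1*d, (c1 + c2)*d], where d is the estimated one-way
    distance (delay) to the source of the missing packet, and doubled
    once for each exponential backoff after a suppressed request.
    The constants c1 and c2 are placeholders."""
    base = random.uniform(c1 * distance, (c1 + c2) * distance)
    return base * (2 ** backoff_count)
```

Nodes closer to the source draw from a lower interval and therefore tend to time out first; nodes at an equal distance are separated only by the random draw, which reduces, but does not eliminate, duplicate requests.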

18.7.3

The InterGroup Protocols


The InterGroup Protocols (IGPs) are intended for large-scale networks, like the Internet, with highly variable delays in packet reception and with relatively high packet loss rates, and for large-scale applications with relatively few senders and many receivers [56]. The IGPs are particularly appropriate for collaborative virtual environments and real-time instrumentation applications.

472

18

Network Protocols

Each Data packet has in its header a unique source identifier, a sequence number, and a timestamp obtained from the sender's local clock. The sequence numbers are used to detect missing packets from the sender and to provide reliable source-ordered delivery of packets. The timestamps provide reliable group-ordered delivery of packets, while maintaining the causal order of messages sent within the group. Each multicast group consists of two subgroups: the sender group and the receiver group.

Superimposed on the underlying multicast backbone (the MBone of the Internet) is a forest of trees that includes all of the receivers in the multicast group. This forest of trees is used to collect acknowledgments, negative acknowledgments, and fault detection packets to reduce congestion and to achieve scalability. Data packets and retransmissions are multicast by using the MBone. Only the senders multicast Data packets, but the senders are also receivers in that they receive Data packets. Both senders and receivers may send retransmissions of Data packets and are required to send Alive packets. An Alive packet serves as a heartbeat and also as a positive acknowledgment, since it contains the highest timestamp up to which the node multicasting the Alive packet has received all packets. Each node sends an Alive packet on an infrequent periodic basis, but only to its parent in the tree, and the information in the packet is then propagated up the tree. Upon receiving the information, a root node (i.e., sender) multicasts its Alive packet to the group. If a node detects that it has missed a packet, it sends a Nack (negative acknowledgment) packet. Under high packet loss rates, Nack packets are sent more frequently than Alive packets, but again a node sends a Nack packet only to its parent in the tree.

The InterGroup Protocols use approximately synchronized clocks at the senders and receivers, obtained by running a clock synchronization algorithm occasionally or by using a Global Positioning System. At a minimum, these clocks are Lamport clocks that respect causality [328]. If every member of the sender group is sending packets, the maximum delivery time of a packet is D + Δ, where D is the diameter of the group and Δ is the interpacket generation time of the slowest sender. The maximum delivery time is kept low by using Alive packets from the senders.

To maintain consistency of packet delivery, both senders and receivers must know the membership of the sender group. For scalability, only the senders are responsible for detecting faults of the senders in the sender group and for repairs of the sender group. Before a node can deliver a packet from a particular sender, it must have received a packet, from each sender in the sender group, with a timestamp at least as great as that of the packet to be delivered. Otherwise, the node may deliver a packet from one sender with a higher timestamp before it receives a packet from another sender with a lower timestamp.

Both senders and receivers must know the membership of the receiver group, either implicitly or explicitly. This information is used for garbage collection, so that a node can remove a Data packet from its buffers when it knows that all nonfaulty members of the receiver group have received the packet and thus that it will never need to retransmit the packet subsequently. Again, for scalability, to maintain the membership of the receiver group, each node is responsible only for fault detection of its children in the tree.
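The delivery condition just described, in which a packet is held until every sender has been heard from at an equal or greater timestamp, can be sketched as below. The class and method names are hypothetical, and Alive packets are modeled simply as timestamp updates carrying no payload.

```python
import heapq

class GroupOrderedDelivery:
    """Sketch of InterGroup-style group-ordered delivery: a Data
    packet is held until a packet (Data or Alive) with an equal or
    greater timestamp has been received from every member of the
    sender group, then delivered in timestamp order. Illustrative
    only; names are not from the InterGroup Protocols."""
    def __init__(self, senders):
        self.latest = {s: -1 for s in senders}  # highest ts seen per sender
        self.pending = []                       # min-heap of (ts, sender, payload)

    def receive(self, sender, ts, payload=None):
        """Record a packet; payload=None models an Alive packet.
        Returns the packets that become deliverable, in ts order."""
        self.latest[sender] = max(self.latest[sender], ts)
        if payload is not None:
            heapq.heappush(self.pending, (ts, sender, payload))
        return self._drain()

    def _drain(self):
        # Every sender has been heard from up to this timestamp.
        horizon = min(self.latest.values())
        out = []
        while self.pending and self.pending[0][0] <= horizon:
            out.append(heapq.heappop(self.pending))
        return out
```

Note how an Alive packet from a slow sender advances the horizon and releases buffered packets from the other senders, which is why the maximum delivery time is kept low by the Alive packets.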

18.7.4

Real-Time Transport Protocol


The Real-Time Transport Protocol (RTP) [498] provides end-to-end delivery services for data with real-time characteristics, such as interactive audio and video (see also Chapter 19). RTP is intended primarily for multimedia conferencing with multiple participants, as in teleimmersion applications, but it is also applicable to interactive distributed simulation, storage of continuous data, and so on. RTP typically runs on top of the User Datagram Protocol (UDP) to utilize its multiplexing and checksum services, and supports data transfer to multiple destinations using multicasting provided by the underlying network. RTP works in conjunction with the Real-Time Transport Control Protocol (RTCP). The Real-Time Transport Protocol provides real-time transmission of data packets, while the Real-Time Transport Control Protocol monitors quality of service and conveys session control information to the participants in a session.

Among the services provided by RTP are payload type identification, marker identification, sequence numbering, timestamping, and delivery monitoring. The payload type identifies the format of the RTP payload (e.g., H.261 for video), and the marker identifies significant events for the payload (e.g., the beginning of a talk spurt). The sequence numbers allow the receivers to detect packet loss and to restore the sender's packet sequence, but may also be used to determine the position of a packet in a sequence of packets as, for example, in video transmission. The timestamp, obtained from the sender's local clock at the instant at which a data packet is generated, is used for synchronization and jitter calculations.

RTP itself does not provide any mechanisms to ensure timely delivery or other quality-of-service guarantees. It does not guarantee delivery or prevent out-of-order delivery, nor does it assume that the underlying network provides such services. For jitter-sensitive applications, a receiver can buffer packets and then exploit the sequence numbers and timestamps to deliver a delayed but almost jitter-free message sequence.

RTCP is based on periodic transmission of control packets to all of the participants in a session. The traffic is monitored and statistics are gathered on the number of packets lost, the highest sequence number received, jitter, and so forth. These statistics, transmitted in control packets, are used as feedback for diagnosing problems in the network, controlling congestion, handling packet errors, and improving timely delivery. In addition, RTCP conveys minimal session control information, without membership control or parameter negotiation, as the participants in a session enter or leave the session. Although RTCP collects and distributes control and quality-of-service information to the participants in the session, enforcement is left to the application.
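A receiver-side jitter buffer of the kind that RTP's sequence numbers make possible might look like the following sketch, which holds each packet for a fixed playout delay after arrival and releases packets in sequence order. The names and the fixed-delay policy are assumptions for illustration, not part of RTP itself; a real receiver would typically adapt the delay using the RTP timestamps.

```python
import heapq

class PlayoutBuffer:
    """Sketch of a receiver-side jitter buffer: packets are held for
    a fixed playout delay after arrival and released in
    sequence-number order, trading added latency for smoothness."""
    def __init__(self, playout_delay):
        self.delay = playout_delay
        self.heap = []  # min-heap of (sequence_number, arrival_time, payload)

    def arrive(self, seq, now, payload):
        """Record a packet arriving at local time `now`."""
        heapq.heappush(self.heap, (seq, now, payload))

    def release(self, now):
        """Return, in sequence order, the payloads whose playout
        time has been reached."""
        out = []
        while self.heap and now >= self.heap[0][1] + self.delay:
            out.append(heapq.heappop(self.heap)[2])
        return out
```

A packet that arrives out of order within the playout window is quietly reordered; a packet that arrives later than the playout delay is delivered late, and the application must decide whether it is still useful.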

18.7.5

Resource reSerVation Protocol


The Resource reSerVation Protocol (RSVP) [76, 584] is designed for an integrated services Internet and is suitable for applications such as teleimmersion. More details on RSVP are provided in Chapter 19. RSVP makes resource reservations for both unicast and multicast applications, adapting dynamically to changing group memberships as well as to changing routes. RSVP operates on top of the Internet Protocol IPv4 or IPv6 and occupies the place of a transport protocol in the protocol stack. However, it does not transport application data, nor does it route application data; rather, it is a control protocol.

RSVP is based on the concept of a session, which is composed of at least one data stream or flow and is defined in relation to a destination. A flow is any subset of the packets in a session that are sent by a particular source to a particular destination or group of destinations. RSVP is used to request resource reservations in each node along the path of a flow and to request specific qualities of service from the network for that flow.

RSVP is receiver oriented in that the receiver of a flow initiates and maintains the resource reservation used for that flow. Thus, it aims to accommodate large groups, dynamic group memberships, and heterogeneous receiver requirements. It carries a resource reservation request to all of the switches, routers, and hosts along the reverse data path to the source. Since the membership of a large multicast group and the topology of the corresponding multicast tree are likely to change with time, RSVP sends periodic refresh messages to maintain the state in the routers and hosts along the reserved paths. This state is referred to as soft state because it is built and destroyed incrementally.

Quality of service is implemented for a particular flow by a mechanism called traffic control that includes a packet classifier and a packet scheduler. The packet classifier determines the QoS for each packet, while the packet scheduler achieves the promised QoS for each outgoing interface. During reservation setup, a QoS request is passed to two local decision modules, admission control and policy control. Admission control determines whether the node has sufficient available resources to supply the requested QoS; policy control determines whether the user has administrative permission to make the reservation.
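The soft-state idea can be sketched as a reservation table in which entries vanish unless refreshed by a periodic message. The lifetime value and all names here are illustrative placeholders, not RSVP protocol constants.

```python
class SoftStateTable:
    """Sketch of RSVP-style soft state in a router: each reservation
    carries an expiration time and is silently discarded unless a
    refresh message arrives before it expires. Illustrative only."""
    def __init__(self, lifetime):
        self.lifetime = lifetime
        self.expiry = {}  # flow id -> expiration time

    def refresh(self, flow, now):
        """Install or refresh the reservation for a flow; the same
        operation serves both purposes, which is what lets soft state
        track changing routes and memberships automatically."""
        self.expiry[flow] = now + self.lifetime

    def expire(self, now):
        """Drop and return reservations not refreshed in time."""
        dead = [f for f, t in self.expiry.items() if t <= now]
        for f in dead:
            del self.expiry[f]
        return dead

    def active(self):
        return set(self.expiry)
```

If a receiver leaves the group, or the route changes so that refresh messages stop traversing this node, the reservation simply times out; no explicit teardown message is required (though RSVP does provide one to reclaim resources sooner).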

18.7.6

CORBA GIOP/IIOP
CORBA is based on the client-server model and supports interoperability at the user level (language transparency) via the OMG's Interface Definition Language (IDL) and at the communication level via inter-ORB protocols [388]. CORBA is also discussed in Chapters 9 and 10. When a client object invokes a method of a server object, the call is bound to a static stub generated from the IDL specification of the server object, or the operation is invoked dynamically. The stub passes the call and its parameters to the Object Request Broker (ORB), which determines the location of the server object, marshals (encodes and packages) the call and its parameters into a message, and sends the message across the network to the processor hosting the server object. The ORB at the server unmarshals the message and passes the call and its parameters to a skeleton, which invokes the operation and returns the results to the client object, again via the ORB.

The CORBA 2.0 specification provides the technology that enables ORBs from different vendors to communicate with each other, known as the ORB Interoperability Architecture or, more specifically, the General Inter-ORB Protocol (GIOP). GIOP is designed to map ORB requests and responses to any connection-oriented medium. The protocol consists of the Common Data Representation, the GIOP message formats, and the GIOP message transport assumptions. The Common Data Representation (CDR) is a transfer syntax that maps every IDL-defined data type into a low-level representation that allows ORBs on diverse platforms to communicate with each other. CDR handles variations in byte ordering (little-endian vs. big-endian) across the different machines hosting the ORBs. It also handles memory alignments to natural boundaries within GIOP messages.

The GIOP message formats include Request, Reply, LocateRequest, LocateReply, CancelRequest, CloseConnection, MessageError, and Fragment. Each of these message formats is specified in IDL and consists of a standard GIOP header, followed by a message format-specific header and, when necessary, a body of the message. The Request header identifies the target object of the invocation and the operation to be performed on that object, and the message body contains the parameters of the operation. The Reply header indicates a successful (or unsuccessful) operation, and the message body contains the results (or the exception).

The GIOP message transport assumptions mandate that the underlying transport is a connection-oriented, reliable byte stream. Moreover, the transport must provide notification of connection loss. Clearly, TCP/IP is a candidate for such an underlying transport, and the mapping of GIOP to TCP/IP is standardized by the OMG as the Internet Inter-ORB Protocol (IIOP). All CORBA 2.0-compliant ORBs must be able to communicate over IIOP, which effectively allows the ORBs to use the Internet as a communication bus. In using IIOP, objects publish their object references, or names, as Interoperable Object References (IORs) and register them with CORBA's Naming Service. These object references allow object addressing and discovery even in a network of heterogeneous ORBs.
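The byte-ordering aspect of CDR can be sketched as follows: the sender writes data in its native byte order and flags that order in the message, so the receiver swaps bytes only when the two orders differ. This is an illustrative sketch of the idea only, not the actual GIOP header layout, and the function names are invented.

```python
import struct

def marshal_long(value, little_endian):
    """CDR-style marshaling sketch for a 4-byte signed integer:
    one flag octet giving the sender's byte order (1 = little-endian,
    0 = big-endian), followed by the value in that order."""
    order = '<' if little_endian else '>'
    flag = bytes([1 if little_endian else 0])
    return flag + struct.pack(order + 'i', value)

def unmarshal_long(data):
    """Receiver side: read the flag octet, then interpret the value
    in the byte order the sender declared."""
    little_endian = data[0] == 1
    order = '<' if little_endian else '>'
    return struct.unpack(order + 'i', data[1:5])[0]
```

The benefit of this "receiver-makes-right" style is that two machines with the same native byte order never pay for a swap; only a mismatched pair does.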

18.7.7

Java RMI
While CORBA provides interoperability between different languages, Java Remote Method Invocation (RMI) [290] assumes the homogeneous environment of the Java Virtual Machine (JVM) (see Chapter 9). Like CORBA, RMI is based on the client-server model. For RMI, marshaling and unmarshaling of objects are handled by the Java Object Serialization system, which converts the data into a sequence of bytes that are sent over the network as a flat stream.

The RMI model consists of a suite of classes to support distributed object computing by allowing a client object to invoke methods defined in a server object running on a remote machine. Each remote object has (1) a remote interface that declares the methods that can be invoked remotely on the remote object and extends the interface Remote and (2) a remote object (server) that implements the interface and extends the class UnicastRemoteObject. Remote method invocation is the action of invoking a method defined in a remote interface on a remote object that implements this interface.

When a remote object is created, it is registered in a registry on the same machine. The registry stores the remote object's name along with a reference to a stub for the remote object. In order for the registry lookup to be effective, the remote object must have been bound previously to the registry. An application using RMI first makes contact with a remote object by finding the name of the object in the registry. The client then downloads the stub, which acts as a proxy for the remote object. The client can then invoke a method on the remote object exactly as it would invoke a method locally.

The arguments passed to a remote method can include both remote and nonremote objects. Remote objects are passed by reference; the remote reference passed is a reference to the stub for the remote object. Objects that are not remote are passed by value, by copying over the network, and therefore must implement the Serializable interface. Thus, Java RMI allows distributed applications written in Java to pass data, references to remote objects, and complete objects (including the code associated with those objects) from one machine to another. The ability to load code across the network distinguishes the Java RMI system from other distributed computing frameworks, such as CORBA. This is made possible because both client and server objects are implemented within the Java Virtual Machine and because Java bytecodes form a portable binary format.

18.8

THE FUTURE OF NETWORK PROTOCOLS


In the past, the primary limitations on computer applications were processing speed and memory size. Modern processors have removed these limitations for most applications. Currently, communication bandwidth is a major limitation for distributed applications, but the end of this limitation is in sight. The primary limitation on our ability to build distributed application systems in the future will be the complexity of the application programs. The network protocols of the future must, therefore, reflect this new primary limitation. They must be designed to simplify application programming, even at the expense of less efficient processing and use of communication bandwidth.

The application programmer of the future will focus on the application, rather than on communication. The invocation of a remote operation across the network will be no different from a local invocation. If an application must be fault tolerant, then replication of data will be automatic and invisible, as will be recovery from a fault. If an application involves multimedia, then the streams will be transmitted and delivered in a synchronized and jitter-free manner without involvement of the application programmer.

Network protocols are critical to the future of computing. Almost all computing in the future will be distributed computing; network protocols provide the glue that holds such computing together. Much of the computing of the future will involve the integration of components developed independently with little coordination, and the use of those components in ways that could not have been foreseen by their developers. Such integration is possible only with simple, well-defined, stable interfaces; such characteristics have been achieved by many network protocols but by little else in computing. If we treasure that simplicity, precise definition, and stability, our current and future network protocols will provide the interfaces from which the distributed application systems of the future will be built.

FURTHER READING
For more information on the topics covered in this chapter, see www.mkp.com/grids and also the following references:

- Gouda's text [241] on network protocol design is very readable.
- Halsall's text [259] is a comprehensive reference book.
- Holzmann's book [278], written by an expert, discusses how to design protocols.
- Stallings' IEEE tutorial [522] is a bit dated but still excellent.
