THE VIRTUAL INTERFACE ARCHITECTURE

Dave Dunning, Greg Regnier, Gary McAlpine, Don Cameron, Bill Shubert, Frank Berry, Anne Marie Merritt, Ed Gronke, and Chris Dodd
Intel Corporation

This protected, zero-copy, user-level network interface architecture reduces the system overhead for sending and receiving messages between high-performance CPU/memory subsystems and networks to less than 10 microseconds when integrated in silicon.

Network bandwidths are increasing, and latencies through these networks are decreasing. Unfortunately, applications have not been able to take full advantage of these performance improvements because of the many layers of user-level and kernel-level software that must be traversed to get to and from the network.

To address this problem, Intel Corporation, Compaq Computer Corporation, and Microsoft Corporation jointly authored the VI Architecture specification. VIA significantly reduces the software overhead between a high-performance CPU/memory subsystem and a high-performance network. Access http://www.viarch.org/ for a copy of the specification.

VIA defines a set of functions, data structures, and associated semantics for moving data into and out of a process’ memory. It achieves low-latency, high-bandwidth communication and data exchange between processes running on two nodes within a computing cluster, with minimal CPU usage. VIA gives a user process direct access to the network interface, avoiding intermediate copies of data and bypassing the operating system in a fully protected fashion. Avoiding interrupts and context switches whenever possible minimizes CPU usage.

This article presents the mechanisms that support protected, zero-copy, user-level access and the performance data of two related implementations.

Interprocess communication
VIA attacks the problem of the relatively low achievable performance of interprocess communication (IPC) within a cluster. (Cluster computing consists of short-distance, low-latency, high-bandwidth IPCs between multiple building blocks. Cluster building blocks include servers, workstations, and I/O subsystems, all of which connect directly to a network.) IPC performance depends on the software overhead to send and receive messages and/or data and on the time required to move a message or data across the network. The number of software layers that are traversed and the number of interrupts, context switches, and data copies incurred when crossing those boundaries contribute to the software overhead.

Note that the time to traverse the software stack layers and the time to field interrupts and complete context switches do not depend on the number of bytes being sent or received. Only the time to complete the data copies depends on the number of bytes being moved. Clearly, the time to move a message or data across the network depends on the number of bytes moved.

Software overhead is reduced primarily by faster processors, which execute the protocol layers in less time. To facilitate the increase in clock frequency, designers have used deeper pipelines, intelligent branch prediction algorithms in hardware, more on-board registers, and larger, faster caches. These innovations lead to executing code paths faster. However, they also lead to a larger penalty (usually measured in clock cycles) paid when interrupting the code sequence, switching to another process with its associated context, and flushing a portion of the instruction cache. Therefore, an increase in processor clock frequency does not necessarily lead to a proportional reduction in the software overhead of each message.

With the introduction of technologies such as fast Ethernet and OC-3 ATM, network bandwidths have increased from 1 Mbps to 10 Mbps. We see 100-Mbps to 150-Mbps bandwidths and roadmaps to 1 Gbps to 2 Gbps. These impressive increases in bandwidth have reduced the time to complete bulk data transfers. However, a common misconception is that the data rate of the network’s physical layer is a good metric for assessing communication performance. It must be coupled to the software overhead associated with the communication traffic as well as the traffic pattern. See Martin et al.1 for a more comprehensive discussion of the relationships between latency, overhead, and bandwidth in a cluster environment.


Key design issues and resolutions

The Virtual Interface Architecture is connection oriented; each VI instance (VI) is specifically connected to another VI and thus can only send to and receive from its connected VI. Bailey et al.1 describe the need to dispatch packets quickly and to start processing them just as quickly. The high weighting placed on message-passing (data movement) performance led to the connection architecture. Requiring connections between VIs allows for lower latency between source and destination and simpler VIA-compliant implementations.

Connecting two VIs removes one level of protection checking from the critical path of sending and receiving packets. The sender has permission to send to the receiver because the kernel agent has already established the connection. The receiver is not required to verify that the sender is a valid source of data. Upon arrival, the data is funneled directly to the VI receive queue, reducing the amount of processing required, and therefore reducing the latency to receive a packet.

VIA’s connection-oriented nature also facilitates a simpler NIC design, which reduces head-of-line blocking. We define head-of-line blocking as a small message that gets stuck behind a large message. If VI pairs were not connected, a small message could be queued behind a large message with a different endpoint. Then, to prevent a large message from blocking the small message, the scheduler would need to check ahead on the VI and begin processing the small message. The small message would likely complete first, but its done status would be blocked until either the large message completed or the NIC reordered the descriptors, neither of which is an appealing design option.

Avoiding this blocking is one service that is desirable in NICs. The next level of service is a minimum bandwidth guarantee, a maximum bandwidth not to be exceeded, and a maximum latency not to be exceeded. Although not required in the current revision of VIA, connection-oriented VIs simplify the design of a scheduler that can support these qualities of service.

Reliability guarantees. The reliability attributes of the physical network below VIA are separated into three different categories: an unreliable datagram service, a reliable delivery service, and a reliable reception service.

Unreliable datagram service guarantees that all data that is delivered has not been corrupted during transmission and will only be delivered once. There is no guarantee that all data sent was delivered, nor that the delivered packets arrive in the order in which they were sent. If an error occurs in the network, the data is discarded and the connection between the two VIs remains intact.

Reliable delivery service guarantees that data sent is received uncorrupted, only once, and in the order that it was sent. When data has been transferred to a network supporting reliable delivery service, the descriptor can be marked complete. All errors in the network must be detected and reported. Note that most transport errors will be reported in a receive descriptor, not the send descriptor, of a given transaction. When an error is detected, the connection on which the error occurred is disconnected.

Reliable reception service guarantees that data sent is received uncorrupted, only once, and in the order that it was sent. Reliable reception service differs from reliable delivery service in that a descriptor cannot be marked complete until the data has been transferred into the memory at the remote endpoint. Like the reliable delivery service, all errors in the network must be detected and reported; when an error occurs, the connection on which the error occurred is disconnected. Unlike reliable delivery, when using the reliable reception service, a transport error is reported in the send descriptor associated with the transaction, and no other descriptors on that queue of that VI can be completed.

Remote-DMA/read. This is an optional feature in VIA. The only communication operations required to achieve the goals of the architecture are send and receive.2 As the architecture progressed, it became apparent that remote-DMA/write was a very useful operation, allowing a node to “put” data into remote memory without consuming a descriptor. The added complexity of implementations was fairly small, being very similar to a send operation. The trade-offs between application performance and design complexity with associated costs could not be quantitatively evaluated. Therefore, the semantics of remote-DMA/read are defined but not required in VIA.

Pinned memory regions. We based the decision to require that memory regions used for descriptors and data buffers be pinned on performance. We considered methods to allow caching of virtual-to-physical translations, but they would not yield the performance associated with pinning and registering the memory. Nonvariable time for virtual-to-physical address translation is especially critical at the receiving endpoint. We did not consider backing data up onto the network or discarding data at a receiving endpoint to be acceptable.

References
1. M.L. Bailey et al., “PathFinder: A Pattern-Based Packet Classifier,” Proc. First Symp. Operating Systems Design and Implementation, Usenix Assoc., Sunset Beach, Calif., Nov. 1994, pp. 115-123.
2. P. Pierce and G. Regnier, “Fast Messages: Efficient, Portable Communication for Workstation Clusters and MPPs,” IEEE Concurrency, Vol. 5, No. 2, Apr.-June 1997, pp. 60-72.


Communication traffic
Without knowing the traffic pattern within a system, we cannot assess the relative importance of decreasing software overhead versus increasing bandwidth. If the majority of messages are large, higher bandwidth is of greater importance because the software overhead can be amortized over the entire message, making the average per-byte overhead cost relatively low. Thus, for large message sizes, the network bandwidth is the dominant contributor to message latency. If the majority of messages are small, the average per-byte overhead cost is relatively high, and the dominant contributor to message latency becomes the software overhead.

Studies of LAN traffic patterns2,3 show that the typical pattern is bimodal, with over 80% of messages being 200 bytes or smaller and approximately 8% of messages over 8,192 bytes. The most significant problem confronting communication performance in clusters is the magnitude of software overhead in virtual memory operating environments when sending and receiving messages. As confirmed by Clark et al.,4 it is not simply one or two of the software layers that must be traversed.

To further quantify the problem, we conducted a series of simple tests between two 133-MHz Pentium Pro processor servers. We inserted instrumentation hooks in a version of the Windows NT operating system and collected cycle counts associated with the various software layers and the TCP/IP protocol stack. We recorded measurements for the send and receive paths through the Windows NT protocol stack in a set of trials large enough to assure reasonable statistical accuracy. Figures 1a and 1b summarize the send and receive results of these experiments. Note that no single layer can be tuned to significantly reduce the latency.

Figure 1. Sending (a) and receiving (b) overhead in two Pentium Pro servers. [Stacked-percentage charts: the send path divides into user to AFD, AFD to TCP, TCP to driver, and driver; the receive path into ISR to handler, driver, NDIS, TCP, sockets AFD, and AFD to user.]

Making the common case (a 200-byte message) fast is difficult. The following application of Amdahl’s law,5 which points out the law of diminishing returns, displays the magnitude of the problem. For a latency budget, we conservatively assumed a 200-MHz processor averaging one clock per instruction, a network with a 1.0-Gbps physical bandwidth, and a 200-byte message. In a balanced system design, the overhead portion of sending a 200-byte message must balance the bandwidth-dependent portion of the message latency (1.6 µs). The overhead comes from executed software instructions and hardware processing. The budget allocated between the instructions executed and the hardware processing time will vary with the implementation, but the upper bound on software overhead is less than 1.6 µs, or fewer than 320 instructions, in this simple example.
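Spelled out, the budget arithmetic is as follows (a back-of-the-envelope restatement of the figures above, not an additional measurement):

    bandwidth-dependent time = 200 bytes × 8 bits/byte ÷ 1.0 Gbps = 1.6 µs
    instruction budget = 1.6 µs × 200 MHz × 1 instruction/clock = 320 instructions

Any time spent on hardware processing comes out of the same 1.6-µs budget, which is why the software overhead must stay strictly below 320 instructions.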
VIA description
We use the following two terms when describing VIA: user and kernel agent. A user is the software layer using the architecture; it could be an application or a communication services layer. The kernel agent is a driver running in protected (kernel) mode. It must set up the necessary tables and structures that allow communication between cooperating processes. It is not in the critical path for data transfers unless specifically requested by a user. We describe its functions throughout this article.

Major tenets. VIA defines and specifies a simple set of operations that moves data between network-connected endpoints with latencies closer to those of memory operations than to the longer latencies of network operations. VIA accomplishes low latency in a message-passing environment by following these rules:

• Eliminate any intermediate copies of the data.
• Avoid traps into the operating system whenever possible, to avoid context switches in the CPU as well as cache thrashing.
• Eliminate the need for a driver running in protected kernel mode to multiplex a hardware resource (the network interface) between multiple concurrent processes.
• Minimize the number of instructions a process must execute to initiate data movement.
• Remove the constraint of requiring an interrupt when initiating and/or completing an I/O operation.
• Define a simple set of operations that send and receive data.
• Keep the architecture simple enough to be emulated in software as well as integrated in silicon.

VI instances. VIA presents the illusion to each process that it owns the interface to the network. The construct used to create this illusion is a VI instance. Each VI consists of one send queue and one receive queue, and is owned and maintained by a single process.

A process can own many VIs, and many processes may own many VIs that all contain active work to be processed. The kernel can also own VIs. See Figure 2.

Figure 2. VI queues. [Diagram: several user processes, and the kernel agent, each own VIs; each VI pairs a send queue and a receive queue with per-VI control and synchronization, all multiplexed onto the network interface, which handles control and interrupts.]

Each VI queue is formed by a linked list of variable-length descriptors. To add a descriptor to a queue, the user builds the descriptor and posts it onto the tail of the appropriate work queue. That same user pulls the completed descriptors off the head of the same work queue they were posted on.

Posting a descriptor includes 1) linking the descriptor being posted to the descriptor currently on the tail of the desired work queue and 2) notifying the network interface controller (NIC) that work (an additional descriptor) has been added. Notification consists of writing to a memory-mapped register called a doorbell. Posting a descriptor exchanges its ownership from the owning process to the NIC.

The process that owns a VI can post four types of descriptors. Send, remote-DMA/write, and remote-DMA/read descriptors are placed on the send queue of a VI. Receive descriptors are placed on the receive queue of a VI.
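To make the queue structure concrete, here is a minimal C sketch of a work queue and the two-step posting path just described. The structure layouts and the vi_post() helper are our own illustrative names, not the interface defined by the specification.

```c
/* Sketch of a VI work queue: a linked list of descriptors plus a
 * memory-mapped doorbell register. Illustrative, not the VIPL API. */
#include <stddef.h>
#include <stdint.h>

typedef struct descriptor {
    volatile uint32_t  status;    /* null when posted; NIC writes done/error bits */
    uint32_t           length;    /* total bytes to move */
    struct descriptor *next;      /* link to the next posted descriptor */
    /* ... control, address, and data segments follow (see Figure 4) ... */
} descriptor_t;

typedef struct vi_queue {
    descriptor_t      *head;      /* oldest posted descriptor */
    descriptor_t      *tail;      /* most recently posted descriptor */
    volatile uint32_t *doorbell;  /* memory-mapped NIC doorbell register */
} vi_queue_t;

/* Post a descriptor: 1) link it onto the queue tail, 2) ring the doorbell.
 * Ownership of the descriptor passes from the process to the NIC; note
 * that no operating-system trap is involved. */
void vi_post(vi_queue_t *q, descriptor_t *desc)
{
    desc->next   = NULL;
    desc->status = 0;                  /* not yet complete */
    if (q->tail != NULL)
        q->tail->next = desc;          /* step 1: link to current tail */
    else
        q->head = desc;                /* queue was empty */
    q->tail = desc;
    *q->doorbell = 1;                  /* step 2: notify the NIC */
}
```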
Synchronization. VIA provides both polling and blocking mechanisms to synchronize between the user process and completed operations. When descriptor processing completes, the NIC writes a done bit and includes any error bits associated with that descriptor in its specified fields. This act transfers ownership of the descriptor from the NIC back to the process that originally posted it.

The head of each queue of each VI can be polled (as long as the VI is not associated with a completion queue, as discussed later). Polling checks the descriptor at the head of the work queue to see if it has been marked complete. If it is complete, the function call removes the descriptor from the head of the queue and returns the descriptor’s address to the calling process. Otherwise it returns a unique status, and the head of the queue does not change.

The user may also use a blocking call to check the head of the queue for a completed descriptor. If it is complete, the function call removes the descriptor from the head of the queue and returns the address of the descriptor to the calling process. Otherwise the function call requests the operating system to remove the process from the active run list until the descriptor on the head of the queue completes. Then an interrupt will be generated. VIA supports both an interrupt to awaken the process and a callback with an associated handler, through two different blocking calls.
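The polling path can be sketched in a few lines, continuing the structures from the previous sketch (the architecture’s actual call returns a distinct status code, rather than a null pointer, when the head descriptor is incomplete):

```c
/* Done/error bits written into descriptor->status by the NIC; the bit
 * positions are illustrative assumptions. */
#define DESC_DONE  0x80000000u
#define DESC_ERROR 0x40000000u

/* Poll the head of a work queue. If the head descriptor is complete,
 * dequeue it and return it (ownership returns to the process);
 * otherwise leave the queue unchanged and return NULL. */
descriptor_t *vi_poll(vi_queue_t *q)
{
    descriptor_t *d = q->head;
    if (d == NULL || !(d->status & DESC_DONE))
        return NULL;                   /* head not complete; queue unchanged */
    q->head = d->next;
    if (q->head == NULL)
        q->tail = NULL;
    return d;
}
```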
Figure 3. Example of one completion queue. [Diagram: descriptors are posted on three work queues; the NIC submits the addresses of completed descriptors onto a single completion queue; the process polls or waits on the completion queue, then dequeues the completed descriptor.]

Completion queues. These queues (see Figure 3) are an additional construct that allows the coalescing of completion notifications from multiple work queues into a single queue. The two work queues of one VI can be associated with completion queues independently of one another. The work queues of the same VI can be associated with different completion queues; it is possible to associate only one work queue of a VI with a completion queue.

When a descriptor on a VI that is associated with a completion queue completes, the NIC marks the descriptor done and places a pointer to that descriptor on the tail of the associated completion queue. If a VI work queue is associated with a completion queue, synchronization of completed descriptors must occur either by polling or by waiting on the completion queue, not on the work queue itself.
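One plausible shape for a completion queue is a fixed-size ring that the NIC fills with pointers to completed descriptors; the specification leaves the representation to the implementation, so the following sketch is an assumption (it reuses descriptor_t from the earlier sketches):

```c
/* Sketch of a completion queue as a NIC-written ring of descriptor
 * pointers. CQ_SLOTS and the ring form are illustrative assumptions. */
#define CQ_SLOTS 256u                       /* power of two */

typedef struct {
    descriptor_t *volatile slot[CQ_SLOTS];  /* entries written by the NIC */
    uint32_t head;                          /* next slot the process reads */
} completion_queue_t;

/* Poll the completion queue: returns a completed descriptor, or NULL if
 * no new completion has arrived. The caller then dequeues the returned
 * descriptor from the work queue it was posted on. */
descriptor_t *cq_poll(completion_queue_t *cq)
{
    uint32_t idx = cq->head & (CQ_SLOTS - 1u);
    descriptor_t *d = cq->slot[idx];
    if (d == NULL)
        return NULL;
    cq->slot[idx] = NULL;                   /* consume the notification */
    cq->head++;
    return d;
}
```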


Descriptors. These constructs describe work to be done by the network interface. Send and receive descriptors contain one control segment and a variable number of data segments. Remote-DMA/write and remote-DMA/read descriptors contain one additional address segment following the control segment and preceding the data segment(s).

The control segment contains the descriptor type, the immediate data if present, a queue fence if present, the number of segments that follow the control segment, the virtual address of the next descriptor on the queue and its associated memory handle, the status of the descriptor (initially null when posted and filled in by the NIC upon completion), and the total length (in bytes) of the data to be moved by the descriptor.

The address segment contains the remote memory region’s virtual address and associated memory handle.

Each data segment describes a memory region and contains the local memory region’s virtual address, its associated memory handle, and the length (in bytes) of data associated with that region of memory.

For remote-DMA/write descriptors, the address/memory handle pair in the address segment is the starting location of the region where the data is placed. Though only one remote memory region address is supported per descriptor, a user can specify many local memory regions in each remote-DMA/write descriptor. Therefore it is possible to gather data but not possible to scatter data in one remote-DMA/write descriptor.

For remote-DMA/read descriptors, the address/memory handle pair in the address segment is the starting location of the region for the data. Again, although only one remote memory region address is supported per descriptor, a user can specify many local memory regions in each remote-DMA/read descriptor. Therefore it is possible to scatter data but not possible to gather data in one remote-DMA/read descriptor. Figure 4 illustrates the format of the descriptors.

Figure 4. Descriptor formats. [Diagram: a control segment (control, memory handle, next-descriptor virtual address; status, length, immediate data, VI number), an optional address segment (reserved, memory handle, remote virtual address), and one or more data segments (length, memory handle, buffer virtual address).]
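Read as C structures, the segments of Figure 4 might look like the following; the field widths and ordering are illustrative choices on our part, not layouts mandated by the specification:

```c
/* Illustrative descriptor segment layouts, following Figure 4. */
#include <stdint.h>

typedef struct {                  /* control segment */
    uint32_t control;             /* descriptor type, queue fence, segment count */
    uint32_t memory_handle;       /* handle for the next-descriptor address */
    uint64_t next_va;             /* virtual address of the next descriptor */
    uint32_t status;              /* null at post time; written by the NIC */
    uint32_t length;              /* total bytes moved by this descriptor */
    uint32_t immediate;           /* optional 32 bits of immediate data */
    uint32_t vi_number;           /* VI# field shown in Figure 4 */
} control_seg_t;

typedef struct {                  /* address segment (remote-DMA only) */
    uint32_t reserved;
    uint32_t memory_handle;       /* remote memory handle */
    uint64_t remote_va;           /* starting virtual address at the remote end */
} address_seg_t;

typedef struct {                  /* data segment (one per local region) */
    uint32_t length;              /* bytes in this region */
    uint32_t memory_handle;       /* local memory handle */
    uint64_t buffer_va;           /* local buffer virtual address */
} data_seg_t;
```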
Immediate data. VIA permits 32 bits of immediate data to be specified in each descriptor. When immediate data is present in a send descriptor, the NIC moves the field directly into the receive descriptor. The presence of immediate data in a remote-DMA/write descriptor causes a receive descriptor to be consumed at the remote VI, with the immediate data field transferred into that consumed receive descriptor. The presence of immediate data in a remote-DMA/read descriptor is benign; the 32-bit immediate data field can be written into and will remain unchanged when the operation is complete, with no descriptor being consumed at the remote VI. Potential uses for immediate data include synchronization of remote-DMA/writes at the remote VI as well as sequence numbers used by a software layer using that VI.

Work queue ordering. VIA maintains ordering and data consistency rules within one VI but not between different VIs. Descriptors posted on a VI are processed in FIFO order. VIA doesn’t specify the order of completion for descriptors posted on different VIs; the order depends on the scheduling algorithm implemented.

This rule is easily maintained with sends and remote-DMA/writes because sends behave like remote-DMA/writes without a remote address segment. The receive descriptor on the head of the queue determines the remote address(es). Remote-DMA/reads make it more difficult to balance data consistency with keeping the data pipelines full. Remote-DMA/reads are round-trip transactions and are not complete until the requested data is returned from the remote node/endpoint to the initiating node/endpoint. The VI-NIC designer must decide on an implementation that trades off performance with complexity. Two possible implementations are described here.

One implementation would stop processing descriptors on a send queue when a remote-DMA/read descriptor is encountered, until the read is complete. With many active VIs and a well-designed scheduler, the NIC can continue to make progress by multiplexing between VIs. With only a few active VIs, the performance to or from the network(s) may drop due to the inability to pipeline the execution of multiple descriptors from one VI.

An alternative implementation is to process the request portion of a remote-DMA/read descriptor and continue processing descriptors on that VI’s send queue. The benefit of this approach is the ability to continue processing additional descriptors posted on that VI. However, either the NIC or a software layer (usually the application) must ensure that memory consistency is maintained. Specifically, a write to a remote address that follows a read from that same address cannot “pass” the read, or the read may return incorrect data. To allow and encourage pipelining within a VI, VIA defines a queue fence bit. The queue fence bit forces all descriptors on the queue to be completed before further descriptor processing can continue. This queue fence bit ensures that memory consistency is maintained even when the NIC does not guarantee it.


As described previously, if a NIC executes more than one descriptor from a VI, it is possible for a send or remote-DMA/write descriptor to complete before a remote-DMA/read descriptor completes. The completion of the send and/or remote-DMA/write descriptor(s) is not visible until the descriptor on the head of the queue is also complete.

Work queue scheduling. There is no implicit ordering relationship between the execution of descriptors placed on different VIs. The scheduling of service for each active VI depends on the algorithm used in the NIC, the message sizes associated with the active descriptors, and the underlying transport used.

Memory protection. VIA provides memory protection for all VI operations to ensure that a user process cannot send out of, or receive into, memory that it does not own. A mechanism called protection tags, programmed by the kernel agent in a translation and protection table (TPT), provides the memory protection.

Protection tags are unique identifiers that are associated with VIs and memory regions. A user, prior to creating a new VI or registering a memory region, must obtain a protection tag. When a user requests the creation of a VI or requests that a region of memory be registered, a protection tag is an input parameter to the requesting call. The kernel agent checks to ensure that the user owns the protection tag specified in the calling request.

When a user creates a new VI, the user must specify a protection tag as an input parameter. The kernel agent checks to ensure that the requesting user is the owner of that protection tag. If that check fails, a new VI is not created.

When a user registers a memory region, the user must specify the protection tag to associate with the region, whether remote-DMA/write is enabled, and whether remote-DMA/read is enabled. The kernel agent checks to ensure that the process registering the memory region owns the region and the specified protection tag. The kernel agent also checks to see if the memory region being registered is marked as read only. If the requesting user does not own the specified protection tag, the kernel agent rejects the entire registration request. If the memory region is marked as read only by the operating system, the agent rejects a request to enable remote-DMA/write. Otherwise, the memory region is registered, and the kernel agent programs the appropriate entry(s) and associated bits into the translation and protection table and returns the memory handle for that region.

A process can own many VIs, each with the same or different protection tags. Each protection tag is unique; different processes are not allowed to share protection tags. VIs can only access memory regions registered with the same protection tag. Therefore, not all VIs owned by a process can necessarily access all memory regions owned by that process.

Virtual address translation. An equally important task performed by the kernel agent occurs when registering a memory region: giving the NIC a method of translating virtual addresses to physical addresses.

When a user requests registration of a memory region, the kernel agent performs ownership checks, pins the pages into physical memory, and probes the region for the virtual-to-physical address translations. The kernel agent enters the physical page addresses into the translation and protection table and returns a memory handle. The memory handle is an opaque, 32-bit data structure. The user can now refer to that memory region using the virtual address and memory handle pair without having to worry about crossing page boundaries.

The use of a memory handle/virtual address pair is a unique feature of VIA. In contrast to the Hamlyn6 and U-Net7 architectures, in which addresses are specified as a tag and offset, VIA allows the direct use of virtual addresses. This means that the application does not need to keep track of the mappings of virtual addresses to tags.

A VI-NIC is responsible for taking a memory handle/virtual address pair, determining the physical address, and verifying that the user has the appropriate rights for the requested operation. The designer of a VI-NIC is free to decide the format of a memory handle and the way the translation and protection information is stored and retrieved.

In our implementation, when a memory region of n pages is registered, the kernel agent locates n consecutive entries in the translation and protection table and assigns a memory handle such that the calculation [(virtual address >> page offset bits) − memory handle] results in the index into the table of the first entry in the series.

The page offset bits refer to the number of bits in the virtual address that are used to store the page offset (12 for a 4,096-byte page). The >> symbol represents a logical shift-right operation, and we use unsigned binary arithmetic. After the page offset bits in the virtual address have been shifted right (truncated), the calculation ignores the remaining upper bits that extend beyond the number of bits in the memory handle. The n entries in the table contain the physical addresses of the n pages. The page offset bits are not entered in the table because they do not differ from the virtual-address page offset bits.
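The index calculation falls directly out of the description above. In the following sketch, the table entry format is our assumption, while the index arithmetic follows the article’s formula; unsigned 32-bit arithmetic makes the truncation of upper address bits automatic:

```c
/* Sketch of the TPT lookup implied by the memory-handle scheme: the
 * handle is assigned at registration so that shifting out the page
 * offset and subtracting the handle yields the index of the region's
 * first table entry. */
#include <stdint.h>

#define PAGE_OFFSET_BITS 12u               /* 4,096-byte pages */

typedef struct {
    uint64_t phys_page;                    /* page-aligned physical address */
    uint32_t protection_tag;               /* must match the VI's tag */
    uint32_t attrs;                        /* e.g., RDMA-write/read enable bits */
} tpt_entry_t;

/* Index of the TPT entry that translates (memory handle, virtual address). */
static inline uint32_t tpt_index(uint32_t handle, uint64_t va)
{
    return (uint32_t)(va >> PAGE_OFFSET_BITS) - handle;
}

/* Translate to a physical address (protection checks omitted here). */
uint64_t tpt_translate(const tpt_entry_t *tpt, uint32_t handle, uint64_t va)
{
    const tpt_entry_t *entry = &tpt[tpt_index(handle, va)];
    return entry->phys_page | (va & ((1u << PAGE_OFFSET_BITS) - 1u));
}
```

Because a region’s pages occupy consecutive table entries, crossing a page boundary simply selects the next entry.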
There are two additional consequences of registering memory regions as opposed to the more common use of registering pages. First, the user is not concerned about crossing page boundaries; the network interface must therefore be aware of the page size, so that when a virtual address crosses over to another page, the controller receives a new translation. (The physical pages need not be contiguous.) Second, although the user may request that only a partial page be registered, that entire page is actually registered, and therefore the entire page is subject to the protection attributes requested for the portion that was registered.

Prototype projects
We split the VI prototype program into two separate projects. The first project was a proof of concept for the overall architectural concepts. The second project aimed to validate a proposed hardware implementation.

Architecture validation prototype. We built a prototype that used a standard commodity Ethernet NIC. This controller has no VI-specific hardware support or intelligent subsystem, so we emulated the VI-NIC functionality entirely in software running on the host.


Figure 5. VI versus UDP latency. [Line graph: latency (µs) versus data payload size (bytes) for Pro100B UDP/IP and Pro100B VI emulation.]

Figure 6. VI versus UDP bandwidth. [Line graph: bandwidth (Mbps) versus data payload size (bytes) for Pro100B UDP/IP and Pro100B VI emulation.]

Goals. We established three goals prior to starting the project:

• Implement all features in Version 0.9 of the VIA specification, the version current at that time.
• Demonstrate at least a twofold reduction in overhead compared to UDP on a similar platform. An important component of the architecture validation is to demonstrate that VI provides performance gains over legacy protocol stacks. We selected UDP because it is most similar to the VI unreliable service.
• Provide a stable VI development platform. We aimed to support at least four nodes interconnected through a network, with each node supporting 1,000 (emulated) VIs.

NIC description. We selected Intel Ethernet Pro 100B NICs and Ethernet switches because the controllers and switches are commodity items, hardware and tools such as network analyzers are readily available, and an NT-NDIS driver was available for UDP performance comparisons.

Prototype implementation. The architecture validation prototype implemented the VI-NIC functionality in software. This approach provided the quickest route to validate the architectural concepts and an easily duplicated hardware platform for VI application experiments.
Performance measurements and results. The host nodes used for the performance tests were IA-32 server systems with dual 200-MHz Pentium Pro processors, the Intel 82450GX PCIset, 64 Mbytes of memory, and the Intel EtherExpress Pro/100B network adapter. We used Microsoft Windows NT 3.51 with the EtherExpress Pro/100B network adapter NDIS driver as the software environment for the UDP testing. The VIA performance tests were run on Microsoft Windows NT 4.0 with an Intel-supplied VI driver.

We measured the UDP/IP performance using a ttcp test that we modified only to give timing and round-trip information. We wrote a VI test program to measure the equivalent functions as ttcp.
Application send latency. This test measures the time to copy the contents of an application’s data buffer to another application’s data buffer across an interconnect using the VI interface. Since VIA semantics require a preposted receive buffer, the test program includes the time to post a receive buffer. The test program calculates application send latency by sending a packet to a remote node that then sends the packet back to the sender. Multiple round-trip operations provide an average time per round trip. One half of the average round-trip time is the single-packet application send latency. Figure 5 compares VI to UDP latency.
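In outline, the round-trip measurement looks like the following sketch. It reuses the hypothetical vi_post()/vi_poll() helpers from the earlier sketches; the actual test used the VI library calls, with a remote process echoing each packet:

```c
/* Ping-pong latency sketch: post a receive (VIA requires a preposted
 * receive buffer), send a packet, and spin until the echoed packet
 * completes the receive. Half the mean round-trip time is the
 * single-packet application send latency. */
#include <time.h>

double send_latency_us(vi_queue_t *sendq, vi_queue_t *recvq,
                       descriptor_t *send_desc, descriptor_t *recv_desc,
                       int iterations)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++) {
        vi_post(recvq, recv_desc);      /* prepost the receive buffer */
        vi_post(sendq, send_desc);      /* ping */
        while (vi_poll(recvq) == NULL)  /* spin until the pong arrives */
            ;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double total_us = (t1.tv_sec  - t0.tv_sec)  * 1e6 +
                      (t1.tv_nsec - t0.tv_nsec) * 1e-3;
    return total_us / iterations / 2.0; /* half the mean round trip */
}
```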


Application send bandwidth. These tests measure the rate at which large amounts of data can be sent from one application to another application across the network. The test program calculates application send bandwidth by initiating multiple concurrent sends, then continuing to issue sends as quickly as possible for the duration of the test time. The receiving node must have multiple receives outstanding throughout the test. It calculates bandwidth by simply dividing the total amount of data sent by the total test time. Figure 6 illustrates the comparison of VI and UDP bandwidth for various message sizes.

Hardware validation prototype. To further validate VIA concepts and verify performance goals, we built a prototype system using a NIC with an on-board RISC CPU. We moved most of the VIA emulation from the driver running on the host CPU to code running on the RISC CPU on the NIC.

Note that throughout the following description, we refer to a NIC coupled with the VIA emulation in software running on the host node as a VI-emulated NIC. We refer to a NIC coupled with VIA that is implemented in hardware and/or running as code on the NIC board as a VI-NIC.

Goals. We had three goals. First, we aimed to validate VIA concepts and provide a basis for experimentation with applications using VIA in place of traditional network interfaces. We required that prototypes support at least four nodes interconnected with a network, each node supporting at least 200 VIs and the mapping of 64 Mbytes of memory per NIC.

We also wanted to demonstrate the performance gains available with a VI-NIC and compare them with a VI-emulated NIC. The performance advantages of a VI-NIC should be clearly visible as compared to emulating the VI interface using a low-cost NIC.

Finally, we hoped to expose potential problems and issues with VI-NIC designs. Creation of a functional VI-NIC and the software to drive it allows discovery of any issues that went undiscovered in the original VIA proof of concept.

Figure 7. Application-to-application latency test results. [Line graph: latency (µs) versus data payload size (bytes) for Myrinet-VI emulated in kernel agent and Myrinet-VI emulated in NIC.]

Figure 8. Application-to-application bandwidth test results. [Line graph: bandwidth (Mbps) versus data payload size (bytes) for Myrinet-VI emulated in kernel agent and Myrinet-VI emulated in NIC.]

Prototype nodes. The host nodes used for the performance tests were Micron Millenia PRO2 systems with 64 Mbytes of DRAM, dual 200-MHz Pentium Pro processors each with 256 Kbytes of second-level cache, a 440FX chipset, and Windows NT 4.0 with Service Pack 3.

Hardware environment. We selected Myricom’s Myrinet NICs and switches for several reasons. They provide a data rate of at least 1.28 Gbps. The programmable RISC CPU on the NIC allows for rapid development and modification of the prototype. Enough memory (1 Mbyte) is present to allow the control program, the memory protection and translation table, and per-VI context information to reside on the NIC. Tools and documentation were available for developing the network interface’s control program, and example source code was available as a starting point. The NICs were available as off-the-shelf products and were compatible with the host hardware (based on the PCI-32 I/O bus and Pentium Pro processors).

Prototype implementations. We created two prototype environments: a VI-emulated NIC and an emulation of a VI-NIC. For the VI-emulated NIC environment, we programmed the Myrinet NIC to emulate a low-cost gigabit NIC; the driver software supported the VI emulation. This provided only a means to place packets onto a network and to extract packets from the network. These types of controllers understand only physical addresses and deal with network data on a packet basis. This leaves the work of memory address translation and per-connection multiplexing and demultiplexing of the packet stream to the host CPU. We refer to this environment in the performance graphs as “VI emulated in kernel agent.”

For the emulated VI-NIC environment, we programmed the Myrinet NIC to emulate an example VI-NIC hardware design. We modified the driver software to support this emulation. We refer to this environment in the performance graphs as “VI emulated in NIC.”

VI-NIC characteristics. A VI-NIC works directly with the send and receive descriptors, using virtual addresses. The network adapter handles all the memory translation and protection functions, and the multiplexing and demultiplexing of packets to connections. The hardware and software map the doorbell registers directly into the application’s memory space, allowing the posting of descriptors to avoid a kernel transition. The emulation that generated the results we present did not emulate directly accessible doorbell registers, and thus did require a kernel transition to post a descriptor.

Performance measurements
We measured both application-to-application latency and bandwidth.

Latency. The application-to-application latency test is the same test we used to gather the VI Ethernet latency numbers described earlier. Figure 7 summarizes the results of this test.

Bandwidth. The application-to-application bandwidth test measures the rate at which large amounts of data can be copied from one application to another application across an interconnect using VIA. The test program calculates application-to-application bandwidth by initiating multiple round-trip operations (with an average of 50 outstanding) and obtaining an average time per round trip. Using one half of the average round-trip time and the data packet payload size, we calculated the application-to-application bandwidth for streamed data. Figure 8 summarizes the results.

CPU utilization. The application CPU load test measures the amount of CPU time that is unavailable to the application when passing messages. This test program measures the total cost of system activity related to message passing that affects the application. It includes the execution of message-passing code (library calls), kernel transitions (if any), interrupt handling, and CPU instruction cache flushing (due to execution of instructions outside of the main application). It also includes cache snooping and memory bandwidth contention due to movement of data to and from the NIC.

This program calculates the overhead to pass a message by constructing a situation in which the test application can consume all of the CPU resources while performing its work. The program uses a vector sum for the workload. We ran the application twice, once without passing messages and once while passing messages. Each run continues for a fixed amount of time, and we recorded the number of vector sums completed.


Figure 9. CPU utilization test results. [Bar chart: application time lost per message (µs) versus data payload size (bytes) for Myrinet-VI emulated in kernel agent and Myrinet-VI emulated in NIC.]

The first run yields an average time to perform a vector sum. The second run yields the overhead of the message-passing operations performed during the test run. The second run completes some number of vector sums (always smaller in quantity than in the first test run). The time consumed by passing messages is determined by subtracting the amount of time spent running the application (the number of vector sums performed in the second test run multiplied by the average time per vector sum from the first test run) from the total test time. The overhead per message is the total message-passing time divided by the number of messages passed. See Figure 9 for the results.
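The overhead calculation reduces to a few lines; the variable names are ours:

```c
/* CPU overhead per message, per the methodology above: total run time,
 * minus the time accounted for by completed vector sums (useful work),
 * divided by the number of messages passed during the run. */
double overhead_per_message_us(double total_time_us,
                               double sums_completed,  /* from run 2 */
                               double us_per_sum,      /* from run 1 */
                               double messages_passed)
{
    double app_time_us = sums_completed * us_per_sum;
    double msg_time_us = total_time_us - app_time_us;
    return msg_time_us / messages_passed;
}
```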
Performance results. Both of the application-to-application tests showed that the VI-NIC performance improvements resulted from the elimination of host data copies at the receiving node and of host system interrupts. The quantity of code executed by the NIC’s processor limited the performance gains.

The test of the application CPU cost per message clearly revealed the entire cost of handling host system interrupts and performing host system data copies. In the VI-NIC results, the slight increase in CPU time associated with increasing packet size is attributable to the PCI data transfers competing with the application for host system memory bandwidth.

Performance issues and projections
Our prototype environment was successful in increasing performance and reducing the CPU cycles consumed on the host, but it did limit performance compared to a fully integrated VI-NIC.

Hardware performance limitations. Although the selected hardware’s capabilities were well matched to the requirements, certain limitations constrained the performance available when emulating the VI-NIC hardware. The host system’s 200-MHz Pentium Pro processors overshadow the 33-MHz instruction rate of the on-board controller. Also, due to the Myrinet NIC architecture, the transfer of a block of data takes four data copy operations: local host to local adapter memory, local adapter memory to network (mostly overlapped with the next operation), network to remote adapter memory, and remote adapter memory to host. Finally, emulation of the VI doorbell scheme is difficult for more than a trivial number of VIs without consuming a significant amount of memory, controller CPU cycles, or PCI bus resources.

Design limitations. To support the stated project goals, the design focused on emulating the reference hardware design over optimizing performance for the Myrinet NIC. If, for example, the goal had been to design and implement the fastest VI-emulated system for the Myrinet NICs, different design choices might have yielded better performance.

Emulation overheads. The scheme used to emulate the doorbell functionality required an application-to-kernel transition. Doorbell support in hardware would reduce measured single-packet latencies by an estimated 7.5 µs and the CPU load by an estimated 9 µs.

Prototype conclusions. Emulating a VI-NIC in host software using a low-cost adapter is quite feasible, but a significant amount of performance is lost in both the communications space and in the application space. Emulating a VI-NIC using an intelligent NIC provides the minimum level of performance that VIA can provide.

Two additions to an intelligent NIC would provide significant performance improvements over the prototype results achieved in this project. A faster programmable controller would result in a more desirable bandwidth performance curve and would improve small-packet latency; a control CPU in the 200-MIPS range would have been a good match for today’s host CPUs. A hardware assist for implementing the doorbell registers would reduce small-packet latency and significantly reduce host CPU overhead per message sent.

Performance conclusions. The best network performance and the best price/performance will be available on network adapters that implement the core VI functionality in special-purpose silicon and eliminate all possible processing overheads in the send and receive paths.

Conservative estimates show that the elimination of kernel transitions and of intermediate copies of data at either the send or receive nodes will yield a software overhead of well under 300 user-level instructions, many of which are construction of descriptors. The use of completion queues and associated handler(s) should avoid most interrupts on receive for a busy system. With today’s processors, a gigabit network, and a modest hardware design, small message latencies well below 10 µs are achievable.

GIVING A USER PROCESS DIRECT ACCESS to a network resource is not a new concept. What is new about VIA is that the concepts, data structures, and semantics are specified and defined to the computing industry in an open fashion, independent of the operating system and the CPU. The goal of this approach is to encourage a vibrant and competitive environment that accelerates the acceptance of clustered computers. VIA removes the barrier of application code being written for one system with a specific network interface controller, operating system, and CPU, each of which has a limited lifetime. Establishing and adhering to VIA results in application code that achieves excellent interprocess communication using today’s technologies. More importantly, that application code will benefit from future technology improvements and lead to higher application performance without rewriting code.

Reuse of application code is a key to architecture acceptance. However, the performance must be there to warrant the initial investment. Prototyping efforts show that VIA does meet the three performance goals of reducing the latency and increasing the bandwidth between communicating processes, while significantly reducing the number of instructions the CPU needs to execute to accomplish the communication. Performance investigations show that VI-NICs integrated in silicon should easily achieve below-10-µs end-to-end latencies. The sustainable bandwidth between two communicating processes is limited by the NIC hardware and its fabrication, not the code used to send and receive the data. The reduction in latency and increase in bandwidth for interprocess communication are achieved with a great reduction in instructions executed by the CPU.


Related work

Academic and industrial research have described and validated many of VIA’s concepts. The works previously referenced and/or mentioned here influenced VIA.

The message-driven processor (MDP) project8 showed that fast context switches and offloading the host CPU from handling and buffering messages would achieve good message-passing performance. The progress of microprocessors has led to impressive increases in clock speed but not a proportional decrease in context switch times. Therefore VIA avoids context switches while off-loading the processor.

The Active Messages9 project at the University of California, Berkeley, and the Fast Messages10 project at the University of Illinois, Urbana-Champaign, point out the benefit of an asynchronous communication mechanism. The architectures are similar in that each uses a handler, which is executed on message arrival. In VIA, all message-passing operations are asynchronous. It does not specify the concept of a handler per message, in lieu of the use of protection tags, translation and protection tables, and registered memory.

Four examples of user-level networking architectures are Shrimp,11 U-Net,7 Hamlyn,6 and Memory Channel.12 In Princeton University’s Shrimp project, a virtual memory-mapped network interface maps the send buffer to main memory by a kernel call. This call checks for protection and stores memory-mapping information on the network interface. The network interface maps out each physical page of the send buffer to a physical page of the destination node. This information is set up in the page tables of network interfaces on both communicating nodes. After mapping the memory, the data is sent by performing writes to the send buffer.

Cornell University’s U-Net project demonstrated user-level access to a network using a simple queuing method. This user-level network architecture provides every process with an illusion of its own protected interface to the network. It associates a communication segment and a set of send, receive, and free message queues with each U-Net endpoint. Endpoints and communication channels allow U-Net to provide protection boundaries between multiple processes.

The Hamlyn architecture from Hewlett-Packard Labs provides applications with direct access to network hardware. The sender-based memory management used in Hamlyn avoids software-induced packet losses. An application’s virtually contiguous address space is used to send and receive data in a protected fashion without operating system intervention.

The Memory Channel interconnect developed at Digital Equipment Corporation uses memory mapping to provide low communication latency. Applications map a portion of a clusterwide address space into their virtual address space and then use load and store instructions to read or write data. Direct manipulation of shared-memory pages achieves message-passing operations. In the Memory Channel network, no calls to operating systems are made once a map is established.

References
1. R.P. Martin et al., “Effects of Communication Latency, Overhead and Bandwidth in a Cluster Architecture,” Computer Architecture News, May 1997, pp. 85-97.
2. R. Gusella, “A Measurement Study of Diskless Workstation Traffic on an Ethernet,” IEEE Trans. Communications, Vol. 38, No. 9, Sept. 1990, pp. 1557-1568.
3. J. Kay and J. Pasquale, “The Importance of Non-Data Touching Processing Overheads in TCP/IP,” Computer Communication Review, Vol. 23, No. 4, Oct. 1993, pp. 259-268.
4. D.D. Clark et al., “An Analysis of TCP Processing Overhead,” IEEE Communications, Vol. 27, No. 6, June 1989, pp. 23-29.
5. J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann Publishers, Inc., San Francisco, Calif., 1995.
6. G. Buzzard et al., “An Implementation of the Hamlyn Sender-Managed Interface Architecture,” Operating Systems Review, 1996, pp. 245-259.
7. T. von Eicken et al., “U-Net: A User-Level Network Interface for Parallel and Distributed Computing,” Operating Systems Review, Vol. 29, No. 5, Dec. 1995, pp. 40-53.
8. W.J. Dally et al., “Architecture of a Message-Driven Processor,” Proc. 14th Int’l Symp. Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., 1987, pp. 189-196.
9. T. von Eicken et al., “Active Messages: A Mechanism for Integrated Communications and Computation,” Computer Architecture News, Vol. 20, No. 2, May 1992, pp. 256-266.
10. S. Pakin, V. Karamcheti, and A. Chien, “Fast Messages (FM): Efficient, Portable Communication for Workstations and Massively-Parallel Processors,” IEEE Concurrency, Vol. 5, No. 2, Apr.-June 1997, pp. 60-72.
11. M.A. Blumrich et al., “Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer,” Proc. 21st Int’l Symp. Computer Architecture, Apr. 1994, pp. 142-153.
12. R. Gillett and R. Kaufmann, “Using the Memory Channel Network,” IEEE Micro, Vol. 17, No. 1, Jan.-Feb. 1997, pp. 19-25.


Dave Dunning is a principal engineer within the Intel Server Architecture Lab. His technical interests include high-speed networks and network interfaces. Dunning received a BA in physics from Grinnell College, a BS in electrical engineering from Washington University (St. Louis), and an MS in computer science from Portland State University.

Greg Regnier is a senior staff engineer within the Intel Server Architecture Lab. His technical interests include operating systems, distributed computing, message passing, and networking technology. Regnier has a BS in computer science from Saint Cloud State University in Minnesota.

Gary McAlpine is a system/component architect in Intel’s Enterprise Server Group. His technical interests include distributed computer systems, high-end computer system architectures, networking systems, graphics, signal and image processing, real-time computing, I/O, and neural networks.

Don Cameron is a senior staff architect in the Intel Enterprise Server Group’s Server Architecture Lab. His current interests are user-level I/O and network-attached storage. He is a member of the Storage Network Industry Association.

Bill Shubert is a software engineer in Intel Corporation’s Server Architecture Lab. He has done extensive work on networking and communications software for supercomputers and servers. Shubert graduated with a BS from Carnegie Mellon University in Pittsburgh.

Frank Berry is a staff software engineer in Intel’s Enterprise Server Group. His technical interests include operating systems, device drivers, and networking technology. Berry was the principal developer of very high performance Windows NT network drivers on Intel-based systems. He is currently the senior designer for the VI Proof of Concept work.

Anne Marie Merritt is a software engineer in Intel’s Enterprise Server Group. She has worked on the VI Architecture library and device driver under Windows NT. Her technical interests include Windows NT device drivers and network management. Merritt is finishing her master’s degree at California State University, Sacramento. Her final project details an SNMP MIB-II to manage VIA 1.0 clusters.

Ed Gronke is a staff software engineer in Intel’s Server Architecture Lab. As part of the VI Architecture team, he focuses on how database systems and database-like applications can make the best use of the capabilities of VI-enabled clusters. Gronke has a BA in physics from Reed College and did graduate work in stellar astrophysics at Wesleyan University.

Chris Dodd is a principal engineer in Intel’s Server Architecture Laboratory. His technical interests focus on scalable server architectures, especially high-performance message passing for multicomputers. Dodd received his BS and MS degrees in computer science from the University of Wisconsin, Madison.

Direct questions concerning this article to Dave Dunning, Server Architecture Laboratory, Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124; ddunning@co.intel.com.
