the network's physical layer is a good metric to assess communication performance. It must be coupled to the software overhead associated with the communication traffic as well as the traffic pattern. See Martin et al.1 for a more comprehensive
March/April 1998 67
68 IEEE Micro
the interface to the network. The construct used to create this illusion is a VI instance. Each VI consists of one send queue and one receive queue, and is owned and maintained by a single process.
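As a rough sketch (the struct and field names here are our own, not taken from the VIA specification), the pairing of queues in a VI might be modeled as:

```c
#include <stddef.h>

/* Hypothetical model of a VI instance; real VIA/VIPL types differ.
 * Each queue is a linked list of descriptors, described later. */
struct vi_descriptor;                 /* descriptor format covered below */

struct vi_queue {
    struct vi_descriptor *head;       /* completions are pulled here */
    struct vi_descriptor *tail;       /* new work is posted here     */
};

struct vi_instance {
    struct vi_queue send_queue;       /* exactly one send queue      */
    struct vi_queue recv_queue;       /* exactly one receive queue   */
    int             owner_pid;        /* a VI belongs to one process */
};
```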
A process can own many VIs, and many processes may own many VIs that all contain active work to be processed. The kernel can also own VIs. See Figure 2.

[Figure 2. VI queues.]

Each VI queue is formed by a linked list of variable-length descriptors. To add a descriptor to a queue, the user builds the descriptor and posts it onto the tail of the appropriate work queue. That same user pulls the completed descriptors off the head of the same work queue they were posted on.

[Figure: example showing three work queues and one completion queue. Descriptors are posted to the work queues; the NIC submits addresses of completed descriptors to the completion queue.]

Posting a descriptor includes 1) linking the descriptor being posted to the descriptor currently on the tail of the desired work queue, and 2) notifying the network interface controller (NIC) that work (an additional descriptor) has been added. Notification consists of writing to a memory-mapped register called a doorbell. Posting a descriptor transfers its ownership from the owning process to the NIC.
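The two posting steps can be sketched as follows. This is a hypothetical illustration, not the VIA provider library: the descriptor layout, the doorbell mapping, and the function names are all our own inventions.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of posting a descriptor to a work queue. */
struct vi_desc {
    struct vi_desc *next;   /* link to the next descriptor on the queue */
    uint32_t        status; /* null when posted; filled in by the NIC   */
    /* ... control, address, and data segments would follow ...        */
};

struct vi_work_queue {
    struct vi_desc    *head, *tail;
    volatile uint32_t *doorbell;  /* memory-mapped NIC register */
};

static void vi_post(struct vi_work_queue *q, struct vi_desc *d)
{
    d->next   = NULL;
    d->status = 0;             /* not yet completed */
    if (q->tail)
        q->tail->next = d;     /* 1) link onto the current queue tail */
    else
        q->head = d;
    q->tail = d;
    *q->doorbell = 1;          /* 2) ring the doorbell: ownership of the
                                  descriptor now passes to the NIC */
}
```

A real NIC would interpret the doorbell write as "one more descriptor available" on that queue; the linked list itself lives in registered host memory.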
[Figure 4. Descriptor formats. Control segment: control field, memory handle, next-descriptor virtual address; status, length, immediate data, VI#. Address segment: reserved field, memory handle, remote virtual address. Data segment(s): length, memory handle, buffer virtual address.]

ent completion queues; it is possible to associate only one work queue of a VI with a completion queue.

When a descriptor on a VI that is associated with a completion queue completes, the NIC marks the descriptor done and places a pointer to that descriptor on the tail of the associated completion queue. If a VI work queue is associated with a completion queue, synchronization of completed descriptors must occur either by polling or by waiting on the completion queue, not the work queue itself.

Descriptors. These constructs describe work to be done by the network interface. Send and receive descriptors contain one control segment and a variable number of data segments. Remote-DMA/write and remote-DMA/read descriptors contain one additional address segment following the control segment and preceding the data segment(s).

The control segment contains the descriptor type, the immediate data if present, a queue fence if present, the number of segments that follow the control segment, the virtual address of the next descriptor on the queue and its associated memory handle, the status of the descriptor (initially null when posted and filled in by the NIC upon completion), and the total length (in bytes) of the data to be moved by the descriptor.

The address segment contains the remote memory region's virtual address and associated memory handle.

Each data segment describes a memory region and contains the local memory region's virtual address, its associated memory handle, and the length (in bytes) of data associated with that region of memory.

For remote-DMA/write descriptors, the address/memory handle pair in the address segment is the starting location of the region where the data is placed. Though only one remote memory region address is supported per descriptor, a user can specify many local memory regions in each remote-DMA/write descriptor. Therefore it is possible to gather data but not possible to scatter data in one remote-DMA/write descriptor.

For remote-DMA/read descriptors, the address/memory handle pair in the address segment is the starting location of the region for the data. Again, although only one remote memory region address is supported per descriptor, a user can specify many local memory regions in each remote-DMA/read descriptor. Therefore it is possible to scatter data but not possible to gather data in one remote-DMA/read descriptor. Figure 4 illustrates the format of the descriptors.

Immediate data. VIA permits 32 bits of immediate data to be specified in each descriptor. When immediate data is present in a send descriptor, the NIC moves the field directly into the receive descriptor. The presence of immediate data in a remote-DMA/write descriptor causes a receive descriptor to be consumed at the remote VI, with the immediate data field transferred into that consumed receive descriptor. The presence of immediate data in a remote-DMA/read descriptor is benign; the 32-bit immediate data field can be written into and will remain unchanged when the operation is complete, with no descriptor being consumed at the remote VI. Potential uses for immediate data include synchronization of remote-DMA/writes at the remote VI as well as sequence numbers used by a software layer using that VI.

Work queue ordering. VIA maintains ordering and data consistency rules within one VI but not between different VIs. Descriptors posted on a VI are processed in FIFO order. VIA doesn't specify the order of completion for descriptors posted on different VIs; the order depends on the scheduling algorithm implemented.

This rule is easily maintained with sends and remote-DMA/writes because sends behave like remote-DMA/writes without a remote address segment. The receive descriptor on the head of the queue determines the remote address(es). Remote-DMA/reads make it more difficult to balance data consistency with keeping the data pipelines full. Remote-DMA/reads are round-trip transactions and are not complete until the requested data is returned from the remote node/endpoint to the initiating node/endpoint. The VI-NIC designer must decide on an implementation that trades off performance against complexity. Two possible implementations are described.

One implementation would stop processing descriptors on a send queue when a remote-DMA/read descriptor is encountered, until the read is complete. With many active VIs and a well-designed scheduler, the NIC can continue to make progress by multiplexing between VIs. With only a few active VIs, the performance to or from the network(s) may drop due to the inability to pipeline the execution of multiple descriptors from one VI.

An alternative implementation is to process the request portion of a remote-DMA/read descriptor and continue processing descriptors on that VI's send queue. The benefit of this approach is the ability to continue processing additional descriptors posted on that VI. However, either the NIC or a software layer (usually the application) must ensure that memory consistency is maintained. Specifically, a write to a remote address that follows a read from that same address cannot "pass" the read, or the read may return incorrect data. To allow and encourage pipelining within a VI, VIA defines a queue fence bit. The queue fence bit forces all descriptors on the queue to be completed before further descriptor processing can continue. This queue fence bit ensures that memory consistency is maintained even when the NIC does not guarantee it.

As described previously, if a NIC executes more than one
descriptor from a VI, it is possible for a send or remote-DMA/write descriptor to complete before a remote-DMA/read descriptor completes. The completion of the send and/or remote-DMA/write descriptor(s) is not visible until the descriptor on the head of the queue is also complete.

Work queue scheduling. There is no implicit ordering relationship between the execution of descriptors placed on different VIs. The scheduling of service for each active VI depends on the algorithm used in the NIC, the message sizes associated with the active descriptors, and the underlying transport used.

Memory protection. VIA provides memory protection for all VI operations to ensure that a user process cannot send out of, or receive into, memory that it does not own. A mechanism called protection tags, programmed by the kernel agent in a translation and protection table (TPT), provides the memory protection.

Protection tags are unique identifiers that are associated with VIs and memory regions. A user, prior to creating a new VI or registering a memory region, must obtain a protection tag(s). When a user requests the creation of a VI or requests that a region of memory be registered, a protection tag is an input parameter to the requesting call. The kernel agent checks to ensure that the user owns the protection tag specified in the calling request.

When a user creates a new VI, the user must specify a protection tag as an input parameter. The kernel agent checks to ensure that the requesting user is the owner of that protection tag. If that check fails, a new VI is not created.

When a user registers a memory region, the user must specify a protection tag associated with the region, whether remote-DMA/write is enabled, and whether remote-DMA/read is enabled. The kernel agent checks to ensure that the process registering the memory region owns the region and the specified protection tag. The kernel agent also checks to see if the memory region being registered is marked as read only. If the requesting user does not own the specified protection tag, the kernel agent rejects the entire registration request. If the memory region is marked as read only by the operating system, the agent rejects a request to enable remote-DMA/write. Otherwise, the memory region will be registered, and the kernel agent will program the appropriate entry(s) and associated bits into the translation and protection table, and return the memory handle for that region.

A process can own many VIs, each with the same or different protection tags. Each protection tag is unique; different processes are not allowed to share protection tags. VIs can only access memory regions registered with the same protection tag. Therefore, not all VIs owned by a process can necessarily access all memory regions owned by that process.

Virtual address translation. An equally important task performed by the kernel agent occurs when registering a memory region: giving the NIC a method of translating virtual addresses to physical addresses.

When a user requests registration of a memory region, the kernel agent performs ownership checks, pins the pages into physical memory, and probes the region for the virtual-to-physical address translations. The kernel agent enters the physical page addresses into the translation and protection table and returns a memory handle. The memory handle is an opaque, 32-bit data structure. The user can now refer to that memory region using the virtual address and memory handle pair without having to worry about crossing page boundaries.

The use of a memory handle/virtual address pair is a unique feature of VIA. In contrast to the Hamlyn6 and U-Net7 architectures, in which addresses are specified as a tag and offset, VIA allows the direct use of virtual addresses. This means that the application does not need to keep track of the mappings of virtual addresses to tags.

A VI-NIC is responsible for taking a memory handle/virtual address pair, determining the physical address, and verifying that the user has the appropriate rights for the requested operation. The designer of a VI-NIC is free to decide the format of a memory handle and the way the translation and protection information is stored and retrieved.

In our implementation, when a memory region of n pages is registered, the kernel agent locates n consecutive entries in the translation and protection table and assigns a memory handle such that the calculation [(VirtualAddress >> PageOffsetBits) − MemoryHandle] results in the index into the table of the first entry in the series.

The page offset bits refer to the number of bits in the virtual address that are used to store the page offset (12 for a 4,096-byte page). The >> symbol represents a logical shift-right operation. We use unsigned binary arithmetic. After the page offset bits in the virtual address have been shifted right (truncated), the calculation ignores the remaining upper bits that extend beyond the number of bits in the memory handle. The n entries in the table contain the physical addresses of the n pages. The page offset bits are not entered in the table because they do not differ from the virtual address page offset bits.

There are two additional consequences of registering memory regions as opposed to the more common use of registering pages. The user is not concerned about crossing page boundaries. Therefore, the network interface must be aware of the page size so that when the virtual address crosses over to another page, the controller receives a new translation. (The physical pages need not be contiguous.) Although the user may request that only a partial page be registered, that entire page is actually registered, and therefore the entire page is subject to the protection attributes for the portion that was registered.

Prototype projects
We split the VI prototype program into two separate projects. The first project was a proof of concept for the overall architectural concepts. The second project aimed to validate a proposed hardware implementation.

Architecture validation prototype. We built a prototype that used a standard commodity Ethernet NIC. This controller has no VI-specific hardware support or intelligent subsystem, so we emulated the VI-NIC functionality entirely in software running on the host.

Goals. We established three goals prior to starting the project:
[Figures 5 and 6. Application send latency (µs) and bandwidth (Mbps) vs. data payload size (bytes), comparing Pro100B UDP/IP with Pro100B VI emulation.]

• Implement all features of Version 0.9 of the VIA specification, the version current at that time.
• Demonstrate at least a twofold reduction in overhead, as compared to UDP on a similar platform. An important component of the architecture validation is to demonstrate that VI provides performance gains over legacy protocol stacks. We selected UDP because it is most similar to the VI unreliable service.
• Provide a stable VI development platform. We aimed to support at least four nodes interconnected through a network, with each node supporting 1,000 (emulated) VIs.

NIC description. We selected Intel Ethernet Pro 100B NICs and Ethernet switches because the controllers and switches are commodity items, hardware and tools such as network analyzers are readily available, and an NT-NDIS driver for UDP was available for performance comparisons.

Prototype implementation. The architecture validation prototype implemented the VI-NIC functionality in software. This approach provided the quickest route to validate the architectural concepts and an easily duplicated hardware platform for VI application experiments.

Performance measurements and results. The host nodes used for the performance tests were IA-32 server systems with dual 200-MHz Pentium Pro processors, the Intel 82450GX PCIset, 64 Mbytes of memory, and the Intel EtherExpress Pro/100B network adapter. We used Microsoft Windows NT 3.51 with the EtherExpress Pro/100B network adapter NDIS driver for the software environment for the UDP testing. The VIA performance tests were run on Microsoft Windows NT 4.0 with an Intel-supplied VI driver.

We measured the UDP/IP performance using a ttcp test that we modified only to give timing and round-trip information. We wrote a VI test program to measure functions equivalent to those of ttcp.

Application send latency. This test measures the time to copy the contents of an application's data buffer to another application's data buffer across an interconnect using the VI interface. Since VIA semantics require a preposted receive buffer, the test program includes the time to post a receive buffer.

The test program calculates application send latency by sending a packet to a remote node that then sends the packet back to the sender. Multiple round-trip operations provide an average time per round trip. One half of the average round-trip time is the single-packet application send latency. Figure 5 compares VI to UDP latency.

Application send bandwidth. These tests measure the rate at which large amounts of data can be sent from one application to another application across the network. The test program calculates application send bandwidth by initiating multiple concurrent sends, then continuing to issue sends as quickly as possible for the duration of the test time. The receiving node must have multiple receives outstanding throughout the test. It calculates bandwidth by simply dividing the total amount of data sent by the total test time. Figure 6 illustrates the comparison of VI and UDP bandwidth for various message sizes.

Hardware validation prototype. To further validate VIA concepts and verify performance goals, we built a prototype system using a NIC with an on-board RISC CPU. We moved most of the VIA emulation from the driver running on the host CPU to code running on the RISC CPU on the NIC.

Note that throughout the following description, we refer to a NIC coupled with the VIA emulation in software running on the host node as a VI-emulated NIC. We refer to a NIC coupled with VIA implemented in hardware and/or running as code on the NIC board as a VI-NIC.

Goals. We had three goals. First, we aimed to validate VIA concepts and provide a basis for experimentation with applications using VIA in place of traditional network interfaces. We required that prototypes support at least four nodes interconnected with a network, each node supporting at least 200 VIs and the mapping of 64 Mbytes of memory per NIC.

We also wanted to demonstrate performance gains available with a VI-NIC and compare them with a VI-emulated NIC. The performance advantages of a VI-NIC should be clearly visible as compared to emulating the VI interface using a low-cost NIC.

Finally, we hoped to expose potential problems and issues with the VI-NIC designs. Creation of a functional VI-NIC and the software to drive it allows discovery of any issues that were undiscovered in the original VIA proof of concept.
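The translation-table lookup described earlier under "Virtual address translation" reduces to simple unsigned arithmetic. In this sketch (our own illustration, assuming a 4,096-byte page and a 32-bit memory handle, with names we invented), the kernel agent picks the handle so that shifting the virtual address right by the page-offset bits and subtracting the handle yields the index of the region's first table entry:

```c
#include <stdint.h>

#define PAGE_OFFSET_BITS 12u   /* 4,096-byte pages */

/* Kernel agent: given the index of the first of n consecutive free TPT
 * entries for a region, choose the memory handle so the NIC's lookup
 * below lands on that entry. */
static uint32_t make_handle(uint64_t region_va, uint32_t first_tpt_index)
{
    return (uint32_t)(region_va >> PAGE_OFFSET_BITS) - first_tpt_index;
}

/* NIC: translate a (memory handle, virtual address) pair to a TPT index.
 * Unsigned arithmetic; bits beyond the 32-bit handle width are ignored. */
static uint32_t tpt_index(uint32_t handle, uint64_t va)
{
    return (uint32_t)(va >> PAGE_OFFSET_BITS) - handle;
}
```

An address one page into the region then indexes the next consecutive table entry, which is why the user need not worry about crossing page boundaries and why the physical pages behind those entries need not be contiguous.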
[Figure 7. Application-to-application latency test results: latency (µs) vs. data payload size (bytes).]
[Figure 8. Application-to-application bandwidth test results: bandwidth (Mbps) vs. data payload size (bytes), including Myrinet VI emulated in NIC.]

Prototype nodes. The host nodes used for the performance tests were Micron Millenia PRO2 systems with 64 Mbytes of DRAM, dual 200-MHz Pentium Pro processors each with 256 Kbytes of second-level cache, a 440FX chipset, and Windows NT 4.0 with Service Pack 3.

Hardware environment. We selected Myricom's Myrinet NICs and switches for several reasons. They provide a data rate of at least 1.28 Gbps. The programmable RISC CPU on the NIC allows for rapid development and modification of the prototype. Enough memory resources (1 Mbyte) are present to allow the control program, memory protection and translation table, and per-VI context information to reside on the NIC. Tools and documentation were available for developing the network interface's control program. Example source code was available as a starting point. The NICs were available as off-the-shelf products and compatible with the host's hardware (based on the PCI-32 I/O bus and Pentium Pro processors).

Prototype implementations. We created two prototype environments: a VI-emulated NIC and an emulation of a VI-NIC. For the VI-emulated NIC environment, we programmed the Myrinet NIC to emulate a low-cost gigabit NIC; the driver software supported the VI emulation. This only provided a means to place packets onto a network and to extract packets from the network. These types of controllers only understand physical addresses and deal with network data on a packet basis. This leaves the work of memory address translation and per-connection multiplexing and demultiplexing of the packet stream to the host CPU. We refer to this environment in the performance graphs as "VI emulated in kernel agent."

For the emulated VI-NIC environment, we programmed the Myrinet NIC to emulate an example VI-NIC hardware design. We modified the driver software to support this emulation. We refer to this environment in the performance graphs as "VI emulated in NIC."

VI-NIC characteristics. A VI-NIC works directly with the send and receive descriptors, using virtual addresses. The network adapter handles all the memory translation and protection functions, and the multiplexing and demultiplexing of packets to connections. The hardware and software map the doorbell registers directly into the application's memory space, allowing the posting of descriptors to avoid a kernel transition.

The emulation that generated the results we present did not emulate directly accessible doorbell registers, and thus did require a kernel transition to post a descriptor.

Performance measurements
We measured both application-to-application latency and bandwidth.

Latency. The application-to-application latency test is the same test we used to gather the VI Ethernet latency numbers described earlier. Figure 7 summarizes the results of this test.

Bandwidth. The application-to-application bandwidth test measures the rate at which large amounts of data can be copied from one application to another application across an interconnect using VIA. The test program calculates application-to-application bandwidth by initiating multiple round-trip operations (with an average of 50 outstanding) and obtaining an average time per round trip. Using one half of the average round-trip time and the data packet payload size, we calculated the application-to-application bandwidth for streamed data. Figure 8 summarizes the results.

CPU utilization. The application CPU load test measures the amount of CPU time that is unavailable to the application when passing messages. This test program measures the total cost of system activity related to message passing that affects the application. It includes the execution of message-passing code (library calls), kernel transitions (if any), interrupt handling, and CPU instruction-cache flushing (due to execution of instructions outside of the main application). It also includes cache snooping and memory bandwidth contention due to movement of data to and from the NIC.

This program calculates the overhead to pass a message by constructing a situation in which the test application can consume all of the CPU resources while performing its work. The program uses a vector sum for the workload. We ran the application twice, once without passing messages and once while passing messages. Each run continues for a fixed amount of time, and we recorded the number of vector sums completed.

The first run yields an average time to perform a vector sum. The second run yields the overhead of the message-passing operations performed during the test run. The second run
Related work
Academic and industrial research has described and validated many of VIA's concepts. The works previously referenced and/or mentioned here influenced VIA.

The message-driven processor (MDP) project8 showed that fast context switches and offloading the host CPU from handling and buffering messages would achieve good message-passing performance. The progress of microprocessors has led to impressive increases in clock speed but not a proportional decrease in context-switch times. Therefore VIA avoids context switches while offloading the processor.

The Active Messages9 project at the University of California, Berkeley, and the Fast Messages10 project at the University of Illinois, Urbana-Champaign, point out the benefit of an asynchronous communication mechanism. The architectures are similar in that each uses a handler, which is executed on message arrival. In VIA, all message-passing operations are asynchronous. VIA does not specify a per-message handler; instead it relies on protection tags, translation and protection tables, and registered memory.

Four examples of user-level networking architectures are Shrimp,11 U-Net,7 Hamlyn,6 and Memory Channel.12 In Princeton University's Shrimp project, a virtual memory-mapped network interface maps the send buffer to main memory by a kernel call. This call checks for protection and stores memory-mapping information on the network interface. The network interface maps each physical page of the send buffer to a physical page of the destination node. This information is set up in the page tables of network interfaces on both communicating nodes. After mapping the memory, the data is sent by performing writes to the send buffer.

Cornell University's U-Net project demonstrated user-level access to a network using a simple queuing method. This user-level network architecture provides every process with an illusion of its own protected interface to the network. It associates a communication segment and a set of send, receive, and free message queues with each U-Net endpoint. Endpoints and communication channels allow U-Net to provide protection boundaries between multiple processes.

The Hamlyn architecture from Hewlett-Packard Labs provides applications with direct access to network hardware. The sender-based memory management used in Hamlyn avoids software-induced packet losses. An application's virtually contiguous address space is used to send and receive data in a protected fashion without operating-system intervention.

The Memory Channel interconnect developed at Digital Equipment Corporation uses memory mapping to provide low communication latency. Applications map a portion of a clusterwide address space into their virtual address space and then use load and store instructions to read or write data. Direct manipulation of shared-memory pages achieves message-passing operations. In the Memory Channel network, no calls to operating systems are made once a map is established.
tance. However, the performance must be there to warrant the initial investment. Prototyping efforts show that VIA does meet the three performance goals: reducing the latency and increasing the bandwidth between communicating processes, while significantly reducing the number of instructions the CPU needs to execute to accomplish the communication. Performance investigations show that VI-NICs integrated in silicon should easily achieve below 10-µs end-to-end latencies. The sustainable bandwidth between two communicating processes is limited by the hardware NIC and fabric implementation, not by the code used to send and receive the data. The reduction in latency and increase in bandwidth for interprocess communication are achieved with a great reduction in the number of instructions executed by the CPU.

References
1. R.P. Martin et al., "Effects of Communication Latency, Overhead and Bandwidth in a Cluster Architecture," Computer Architecture News, May 1997, pp. 85-97.
2. R. Gusella, "A Measurement Study of Diskless Workstation Traffic on an Ethernet," IEEE Trans. Communications, Vol. 38, No. 9, Sept. 1990, pp. 1557-1568.
3. J. Kay and J. Pasquale, "The Importance of Non-Data Touching Processing Overheads in TCP/IP," Computer Communication Review, Vol. 23, No. 4, Oct. 1993, pp. 259-268.
4. D.D. Clark et al., "An Analysis of TCP Processing Overhead," IEEE Communications, Vol. 27, No. 6, June 1989, pp. 23-29.
5. J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann Publishers, San Francisco, Calif., 1995.
6. G. Buzzard et al., "An Implementation of the Hamlyn Sender-Managed Interface Architecture," Operating Systems Review, 1996, pp. 245-259.
7. T. von Eicken et al., "U-Net: A User-Level Network Interface for Parallel and Distributed Computing," Operating Systems Review, Vol. 29, No. 5, Dec. 1995, pp. 40-53.
8. W.J. Dally et al., "Architecture of a Message-Driven Processor," Proc. 14th Int'l Symp. Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., 1987, pp. 189-196.
9. T. von Eicken et al., "Active Messages: A Mechanism for Integrated Communications and Computation," Computer Architecture News, Vol. 20, No. 2, May 1992, pp. 256-266.
10. S. Pakin, V. Karamcheti, and A. Chien, "Fast Messages (FM): Efficient, Portable Communication for Workstations and Massively-Parallel Processors," IEEE Concurrency, Vol. 5, No. 2, Apr.-June 1997, pp. 60-72.
11. M.A. Blumrich et al., "Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer," Proc. 21st Int'l Symp. Computer Architecture, Apr. 1994, pp. 142-153.