
Multi-Threaded End-to-End Applications on Network Processors

A Thesis Presented to the Faculty of the California Polytechnic State University, San Luis Obispo

In Partial Fulfillment of the Requirements for the Degree Master of Science in Computer Science

by Michael S. Watts

June 2005

Multi-Threaded End-to-End Applications on Network Processors
Copyright © 2005 by Michael S. Watts

APPROVAL PAGE

TITLE: Multi-Threaded End-to-End Applications on Network Processors
AUTHOR: Michael S. Watts
DATE SUBMITTED: January 26, 2006

Professor Diana Franklin, Advisor or Committee Chair
Professor Hugh Smith, Committee Member
Professor Phil Nico, Committee Member

Abstract

Multi-Threaded End-to-End Applications on Network Processors
by Michael S. Watts

High speed networks put a heavy load on network processors, therefore optimization of applications for these devices is an important area of research. Many network processors provide multiple processing chips, and it is up to the application developer to utilize the available parallelism. To fully exploit this power, one must be able to parallelize full end-to-end applications that may be composed of several less complex application kernels. This thesis presents a multi-threaded end-to-end application benchmark suite and a generic network processor simulator modeled after the Intel IXP1200. Using our benchmark suite we evaluate the effectiveness of network processors to support end-to-end applications as well as the effectiveness of various parallelization techniques to take advantage of the network processor architecture. We show that kernel performance is an inaccurate indicator of end-to-end application performance and that relying on such data can lead to sub-optimal parallelization.

Contents

1 Introduction
2 Related Work
  2.1 Network Processors
    2.1.1 Intel IXP1200
  2.2 Network Processor Simulators
    2.2.1 SimpleScalar
    2.2.2 PacketBench
  2.3 Benchmarks
    2.3.1 MiBench
    2.3.2 CommBench
    2.3.3 NetBench
  2.4 Application Frameworks
    2.4.1 Click
    2.4.2 NP-Click
    2.4.3 NEPAL
    2.4.4 NetBind
3 The Simulator
  3.1 Processing Units
  3.2 Memory Structure
  3.3 Methods of Use
  3.4 Application Development
4 Benchmark Applications
  4.1 Message Digest
  4.2 URL-Based Switch
  4.3 Advanced Encryption Standard
5 Results
  5.1 Isolation Tests
    5.1.1 MD5
    5.1.2 URL
    5.1.3 AES
    5.1.4 Isolation Analysis
  5.2 Shared Tests
    5.2.1 MD5
    5.2.2 URL
    5.2.3 AES
    5.2.4 Shared Analysis
  5.3 Static Tests
  5.4 Dynamic Tests
  5.5 Analysis
6 Conclusion
7 Future Work
Bibliography
A Acronyms

Chapter 1

Introduction

As available processing power has increased, devices that traditionally used Application Specific Integrated Circuit (ASIC) chips are beginning to use programmable processors in order to take advantage of their flexibility. This increase in flexibility has traditionally been gained at the sacrifice of speed. Network processors aim to bridge the gap between speed and flexibility by taking advantage of the benefits of both ASICs and general purpose processors. There is no single unifying characteristic that allows all network processors to accomplish this goal. However, there are several major strategies employed to bridge the gap: parallel processing, special-purpose hardware, memory structure, communication mechanisms, and peripherals [24]. Network processors have made possible the deployment of complex applications into the network at nodes that previously acted only as routers and switches. High speed networks put a heavy load on network processors, therefore optimization of applications for these devices is an important area of research. It is up to the application developer to utilize the parallelism available in network processors.

Parallelization of kernels is often a trivial task compared to parallelization of end-to-end applications. In the context of this thesis, kernels are programs that carry out a single task. This task is of limited use in and of itself; however, multiple kernels can often be combined to provide a more useful solution. In the area of networking, kernels are programs such as MD5, URL-based switching, and AES, discussed in Chapter 4. These kernels can also be applicable outside the area of networking; however, since the context of this thesis is networking, these kernels focus on packet processing. An end-to-end application refers to a useful combination of kernels. The end-to-end application discussed in Chapter 5 makes use of the kernels in Chapter 4 by first calculating the MD5 signature of each packet, then determining its destination using URL-based switching, and finally encrypting it using AES. In the proposed scenario, the integrity of the packet could be verified and its payload decrypted at the destination node. Our first contribution is the creation of a simulator that emulates a generic network processor modeled on the Intel IXP1200. Our simulator fills a gap in existing academic research by supporting multiple processing units. In this way, interaction between the six microengines of the Intel IXP1200 can be simulated. We chose to emulate the IXP1200 because it is a member of the commonly used Intel IXP line of Network Processing Units (NPUs). Our second contribution is the construction of multi-threaded, end-to-end application benchmarks based on the NetBench [18] and MiBench [10] single-threaded kernels. Since network processors are capable of supporting complex applications, it is important to have benchmarks that fully utilize them. Existing benchmark suites make it difficult to research the properties of parallelized end-to-end applications since they are made up of single-threaded kernels. Our benchmarks have been designed to provide insight into the characteristics of end-to-end applications.

Our third contribution is an analysis of our multi-threaded, end-to-end application benchmarks on our network processor simulator. This analysis reveals characteristics of the kernels making up the end-to-end applications and the end-to-end applications themselves, as well as insight into the strengths and weaknesses of network processors. This paper is organized as follows. In the next chapter we provide background and related work. In Chapter 3 we present our simulator. The kernels that make up our end-to-end application benchmark are presented in Chapter 4. Chapter 5 describes our testing methodology and our evaluation of the effectiveness of network processors to support end-to-end applications as well as the effectiveness of various parallelization techniques to take advantage of the network processor architecture. Finally, our conclusion is presented in Chapter 6.

Chapter 2

Related Work

As the size and capacity of the Internet continue to grow, devices within the network and at the network edge are increasing in complexity in order to provide more services. Traditionally, these devices have made use of ASICs, which provide high performance and low flexibility. NPUs bridge the gap between speed and flexibility by taking advantage of the benefits of both ASICs and general purpose processors. There is no single unifying characteristic that allows all network processors to accomplish this goal. However, there are several major strategies employed to bridge the gap: parallel processing, special-purpose hardware, memory structure, communication mechanisms, and peripherals [24]. Network processors have made possible the deployment of complex applications into the network at nodes that previously acted only as routers and switches.

2.1 Network Processors

NPU is a general term used to describe any processor designed to process packets for network communication.

Another characteristic of NPUs is that their programmability allows applications deployed to them to access higher layers of the network stack than traditional routers and switches. The OSI reference model defines seven layers of network communication from the physical layer (layer 1) to the application layer (layer 7) [15]. NPUs are capable of supporting layer 7 applications, which have traditionally been reserved for desktop and server computers. There are over 30 different self-identified NPUs available today [24]. These NPUs can be classified into two categories based on their processing element configuration: pipelined and symmetric. A processing element (PE) is a processor able to decode an instruction stream [24]. Pipelined configurations dedicate each PE to a particular packet processing task, while in symmetric configurations each PE is capable of performing any task [24]. Both of these configurations are capable of taking advantage of the inherent parallelism in packet processing. Pipelined architectures include the Cisco PXF [25], EZChip NP1 [8], and Xelerator Network Processors [31]. Symmetric architectures include the Intel IXP [6] and IBM PowerNP [1]. High-speed networks place high demands on the performance of NPUs. In order to prevent network communication delays, NPUs must quickly and efficiently process packets. Parallel processing through the use of multiple PEs is only one strategy used in NPUs to improve performance. Another strategy is to use special-purpose hardware to offload tasks from the PEs. Special-purpose hardware includes co-processors and special functional units. Co-processors are more complex than functional units. They may be attached to several PEs, memories, and buses, and they may store state.

A co-processor can be advantageous to the programmer when implementing an application, but can also dictate that the programmer use a specific algorithm in order to take advantage of a particular co-processor. Special functional units are used to implement common networking operations that are hard to implement efficiently in software yet easy to implement in hardware [24]. Since memory access can potentially waste processing cycles, NPUs often use multi-threading to efficiently utilize processing power. Hardware is dedicated to multi-threading, such as separate register banks for different threads and hardware units to schedule and swap threads with no overhead. Special units also handle memory management and the copying of packets from network interfaces into shared memory [24].

2.1.1 Intel IXP1200

The IXP1200 was designed to support applications requiring fast memory access, low latency access to network interfaces, and strong processing of bit, byte, word, and longword operations. For processors, the IXP1200 provides a single StrongARM processor and six independent 32-bit RISC PEs called microengines. This boils down to a single powerful processor coupled with 6 very simple, weaker engines for highly parallel computation. In addition, each microengine provides four hardware supported threads with zero-overhead context switching. The StrongARM was designed to manage complex tasks and to offload specific tasks to individual microengines [6]. The StrongARM and microengines share 8 MBytes of SRAM for relatively fast accesses and 256 MBytes of SDRAM for larger memory space requirements (but slow accesses). There is also a scratch memory unit available to all processors consisting of 1 MByte SRAM.

The StrongARM has a 16 KByte instruction cache and 8 KByte data cache, providing it with fast accesses on a small amount of data. Each microengine has a 1 KByte data cache and a large number of transfer registers. The IXP1200 platform does not provide any built-in memory management, therefore the application developers are responsible for maintaining memory address space [6].

2.2 Network Processor Simulators

Simulators are often used to execute programs written to run on hardware platforms that are inconvenient or inaccessible to developers [28]. Simulators are also able to provide performance statistics such as cycle count, memory usage, bus bandwidth, and cache misses. These statistics enable developers to identify bottlenecks and tune applications to specific hardware configurations. Simulators are an important aspect of research in network processors due to the high cost and the wide variety of architecture found in current NPUs. High cost often makes cutting-edge NPUs inaccessible in academic research, although outdated NPUs are becoming more accessible. The wide variety of NPU architectures makes developing applications to run across multiple platforms difficult. Since simulators can potentially be configured to simulate multiple platforms, analysis of architectural differences can be performed.

2.2.1 SimpleScalar

SimpleScalar provides tools for developing cycle-accurate hardware simulation software that models real-world architecture [3]. We chose to use SimpleScalar because of its prevalence in architectural research.

SimpleScalar takes as input binaries compiled for the SimpleScalar architecture and simulates their execution [3]. The SimpleScalar architecture is similar to MIPS, which is commonly found in NPU platforms such as the Intel IXP. A modified version of GNU GCC allows binaries to be compiled from FORTRAN or C into SimpleScalar binaries [3].

2.2.2 PacketBench

PacketBench is a simulator developed at the University of Massachusetts to provide exploration and understanding of NPU workloads [22]. PacketBench makes use of SimpleScalar ARM for cycle-accurate simulation [22]. PacketBench also emulates some of the functionality of a NPU by providing a simple API for sending and receiving packets and for memory management [22]. In this way, the underlying details of specific NPU architectures are hidden from the application developer. Although PacketBench is useful in characterizing workload, it does not provide simulation support for multiprocessor environments. Since NPUs make extensive use of parallelization, we chose not to use this tool.

2.3 Benchmarks

Benchmarks are applications designed to assess the performance characteristics of computer hardware architectures [27]. One approach is to use a single benchmark suite to compare the performance of several different architectures. Another approach is to compare the performance of different applications on a specific architecture. Benchmarks designed to mimic a particular type of workload are called synthetic, while application benchmarks are real-world applications [27].

For the purposes of this paper, our interest is in application benchmarks, and more specifically, representative benchmarks for the domain of NPUs.

2.3.1 MiBench

MiBench is a benchmark suite providing representative applications for embedded microprocessors [10]. Due to the diversity of the embedded microprocessor domain, MiBench is composed of 35 applications divided into six categories: Automotive and Industry Control, Network, Security, Consumer Devices, Office Automation, and Telecommunications. The Network and Security categories include Rijndael encryption, Dijkstra, Patricia, Cyclic Redundancy Check (CRC), Secure Hash Algorithm (SHA), Blowfish, and Pretty Good Privacy (PGP) algorithms. The Telecommunications category consists of mostly signal processing algorithms, while the other categories are not relevant to this discussion. All MiBench applications are available in standard C source code, allowing them to be ported to any platform with compiler support.

2.3.2 CommBench

CommBench was designed to evaluate the performance of network devices based on eight typical network applications. The applications included in CommBench are categorized into header-processing and payload-processing. Header-processing applications include Radix-Tree Routing table lookup, FRAG packet fragmentation, Deficit Round Robin scheduling, and tcpdump traffic monitoring. Payload-processing applications include CAST block cipher encryption, ZIP data compression, Reed-Solomon Forward Error Correction (REED) redundancy checking, and JPEG lossy image compression [30].

2.3.3 NetBench

NetBench is a benchmarking suite consisting of a representative set of network applications likely to be found in the network processor domain. These applications are split into three categories: micro, IP, and application. The micro level includes the CRC-32 checksum calculation and the table lookup routing scheme. IP-level programs include IPv4 routing, Deficit Round Robin (DRR) scheduling, Network Address Translation (NAT), and the IPCHAINS firewall application. Finally, the application level includes URL-based switching, Diffie-Hellman (DH) encryption for VPN connections, and Message-Digest 5 (MD5) packet signing [18]. Although CommBench and NetBench offer good representations of typical network applications, they are both limited to single-threaded environments. Our work builds on the NetBench suite by parallelizing several NetBench applications.

2.4 Application Frameworks

Application framework is a widely used term referring to a set of libraries and a standard structure for implementing applications for a particular platform [26]. Application frameworks often promote code reuse and good design principles. Several frameworks for NPUs are available in academia, each offering various benefits to application developers. NPU vendors also provide frameworks specific to their architectures, such as the Intel IXA Software Development Kit [5]. One key advantage of academic frameworks is the possibility that they will be able to support multiple architectures, thus enabling developers to design and implement applications independent of a specific architecture. Unfortunately, of the NPU-specific frameworks surveyed in this paper, only NEPAL currently realizes cross-platform support.

The others are currently striving to meet this goal.

2.4.1 Click

Click is an application development environment designed to describe networking applications [13]. Applications implemented using Click are assembled by combining packet processing elements. Each element implements a simple autonomous function. The application is described by building a directed graph with processing elements at the nodes and packet flow described using edges. Click supports multi-threading but has not been extended to multiprocessor architectures. The modularity of Click applications gives insight into their inherent concurrency and allows alterations in parallelization to be made without changing functionality.

2.4.2 NP-Click

NP-Click is based upon Click and designed to enable application development on NPUs without requiring in-depth understanding of the details of the target architecture [19]. NP-Click offers a layer of abstraction between the developer and the hardware through the use of a programming model. The code produced using the NP-Click programming model has been shown to run within 10% of the performance of hand-coded solutions while significantly reducing development time [19]. The current implementation of NP-Click targets only the Intel IXP1200 network processor, although a goal of this project is to support other architectures.

2.4.3 NEPAL

The Network Processor Application Language (NEPAL) is a design environment for developing and executing module-based applications for network processors [17]. In a similar fashion to Click, application development takes place by defining a set of modules and a module tree that defines the flow of execution and communication between modules. The platform independence of NEPAL was verified using their own customized version of the SimpleScalar ARM simulator for multiprocessor architectures. They provide performance results for two simulated NPUs modeled after the IXP1200 [6] and Cisco Toaster [25].

2.4.4 NetBind

NetBind is a binding tool for dynamically constructing data paths in NPUs [4]. Data paths are made up of components performing simple operations on packet streams. NetBind modifies the machine code of executable components in order to combine them into a single executable at run-time. The current implementation of NetBind specifically targets the IXP1200 network processor, although it could be ported to other architectures in the future.

Chapter 3

The Simulator

The simulator developed for this work was built on the SimpleScalar tool set [3]. SimpleScalar provides tools for developing cycle-accurate hardware simulation software that models real-world architecture. This simulation tool set was chosen because of its prevalence in architectural research. For this work, we modified an existing simulator with support for multiple processors in order to create a generic network processor simulator modeled after the Intel IXP1200. We chose to model the IXP1200 because it is a member of the commonly used Intel IXP line of NPUs.

3.1 Processing Units

The simulator includes a single main processor and six auxiliary processors, each supporting up to four concurrent threads. This configuration corresponds to the StrongARM core processor and accompanying microengines on the IXP1200. The StrongARM core is represented by an out-of-order processor. The microengines are represented by single-issue in-order processors.

Table 3.1. Processor Parameters

  Parameter          StrongARM          Microengines
  Scheduling         Out-of-order       In-order
  Width              1 (single-issue)   1 (single-issue)
  L1 I Cache Size    16 KByte           SRAM (no miss penalty)
  L1 D Cache Size    8 KByte            1 KByte

Since each microengine must support four threads with zero-overhead context switching [6], the simulator creates one single-issue in-order processor for each microengine thread. When a single-issue in-order processor is created, it is given the number of threads allocated to its physical microengine so it knows to execute every n cycles, where n is the number of threads on the microengine. The total number of required threads is specified on the command line when the simulator is run, therefore unused threads are not created.

3.2 Memory Structure

The StrongARM and microengines share 8 MBytes of SRAM and 256 MBytes of SDRAM [6]. There is also a scratch memory unit consisting of 1 MByte SRAM. These memory units are represented in the simulator using a single DRAM unit. Separate DRAM caches back these memory units for the StrongARM and microengines. The StrongARM has a 16 KByte instruction cache and 8 KByte data cache that are backed by SRAM [6]. Each microengine has a 1 KByte data cache and unlimited instruction cache. The microengines are given unlimited instruction cache in order to mimic the behavior of the large number of transfer registers associated with each microengine on the IXP1200. Since the number of simulated registers cannot exceed the number of physical registers on the host architecture, we determined this to be the best option available.

Since the IXP1200 is capable of connecting with any number of network devices through its high speed 64-bit IX bus interface, the amount of delay incurred to fetch a packet could vary greatly. For the purposes of this simulator, network delay is not important and it is assumed that the next packet is available as soon as the application is ready to receive it. In order to imitate this behavior, a large chunk of the DRAM unit is allocated as network memory and is backed by a no-penalty cache object available to all processors. The simulator does not provide any built-in memory management, therefore the application developers are responsible for maintaining memory address space. The simulator assigns address ranges to each of the memory units. SRAM is dedicated to the call stack, and DRAM is broken up into ranges for text, global variables, and the heap.

3.3 Methods of Use

The simulator compiles on Linux using GCC 3.2.3 into an executable called sim3ixp1200. The simulator takes a list of arguments that modify architectural defaults and indicate the location of a SimpleScalar native executable and any arguments that should be passed to it. Its use can be expressed as:

  sim3ixp1200 [-h] [simargs] program [program-args]

The -h option lists available simulator arguments of the form -Parameter:value. These arguments can modify aspects of the simulation architecture including the number of PEs, threads, cache specifications, and memory unit specifications. Default values for each available parameter are based on the IXP1200 architecture.
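For example, a run that gives a benchmark 8 microengine threads via the Threads parameter described below might look like this (the benchmark name and trace file here are hypothetical, not part of the simulator distribution):

  sim3ixp1200 -Threads:8 md5app packets.trace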

The most important parameter for this work was Threads, which controls the number of microengine threads made available to the SimpleScalar application. Threads can be any value between 0 and 24 inclusive. Zero threads indicates the microengines will not be used and therefore the application will execute only on the StrongARM processor. When the number of threads is greater than zero, they are allotted to the 6 possible microengines using a round-robin scheme so that the threads are distributed as evenly as possible. For instance, if 8 threads are requested, then 4 microengines will run 1 thread each and 2 microengines will run 2 threads each.
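As a minimal sketch of this allotment (illustrative C, not taken from the simulator source; nthreads stands for the requested Threads value):

  /* Distribute nthreads (0..24) across the 6 microengines in
     round-robin order, as evenly as possible. */
  int per_engine[6] = {0, 0, 0, 0, 0, 0};
  int t;
  for (t = 0; t < nthreads; t++)
      per_engine[t % 6]++;
  /* nthreads = 8 yields {2, 2, 1, 1, 1, 1}: two microengines run
     2 threads each and four microengines run 1 thread each. */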

3.4 Application Development

The applications developed for this work were written in C and compiled using a GCC 2.7.2.3 cross-compiler. A cross-compiler translates code written on one computer architecture into code that is executable on another architecture. For this work, the host architecture was Linux/x86 and the target architecture was SimpleScalar PISA (a MIPS-like instruction set). Since the simulator does not support POSIX threads, developing multi-threaded applications follows a completely different path. Instead of the main process spawning child threads, the same application code is automatically executed in each simulator thread. In order for the application code to distinguish which thread it is running in, a function called getcpu() that returns an integer is made available by the simulator. This function, although misnamed, returns the thread identifier, not the CPU identifier. Code that is meant to run in a particular thread must be isolated in an if block that tests the return value from getcpu(). This function requires a penalty of one cycle, but it is typically called only once and its value stored in a local variable during the initialization of the application. A global variable called ncpus is automatically made available by the simulator and populated with the number of threads. It is often necessary in application development to require all threads to reach a particular point before any thread is allowed to proceed. This is accomplished using another function made available by the simulator called barrier(). A call to barrier() requires one cycle for the function call, but induces no penalty while a thread waits. The simulator reports statistics on the utilization of each hardware unit at the end of each execution. For each PE this includes cycle count, instruction count, and fetch stalls. For each memory unit this includes hits, misses, reads, and writes. In addition, the simulator provides a function called PrintCycleCount() that can be used at any time to print the cycle count of the current thread to standard error and standard output. This function is useful when an application has an initialization process that should not count towards the total cycle count. By making a call to PrintCycleCount() at the beginning and end of a block of code, the total cycle count for that block can be determined by analyzing the output. When the developer requires that the application make some calculation based on cycle count, the function GetCycles(), returning an integer, can be used. Both of these functions induce a penalty of one cycle for the call, but no penalty for their execution.
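The following sketch illustrates this programming model. The controller and worker functions are hypothetical application code, and it assumes, for illustration only, that the controller runs as thread 0:

  /* The same program runs in every simulator thread; getcpu()
     distinguishes them. */
  extern int getcpu(void);         /* thread identifier (one-cycle call) */
  extern void barrier(void);       /* wait for all threads to arrive     */
  extern void PrintCycleCount(void);
  extern int ncpus;                /* thread count, set by the simulator */

  extern void controller_loop(void);   /* hypothetical: distribute packets */
  extern void worker_loop(int id);     /* hypothetical: process packets    */

  int main(void)
  {
      int id = getcpu();           /* call once and cache the result */

      barrier();                   /* all threads ready              */
      PrintCycleCount();           /* exclude setup from the count   */

      if (id == 0)
          controller_loop();
      else
          worker_loop(id);

      PrintCycleCount();
      return 0;
  }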

Chapter 4

Benchmark Applications

Previous research in the area of network processors has focused on exploring their performance characteristics by running individual applications in isolation and in a single-threaded environment. Network processors are capable of supporting more complex applications that guide packets through a series of applications running in parallel. For the first stage of this work we ported three typical network applications to our simulator: MD5, URL-switching, and Advanced Encryption Standard (AES). This process involved modifying memory allocations to use appropriate simulator address space and reorganizing each application to take advantage of multiple threads. For the second stage of this work we combined these three applications into three types of end-to-end applications: shared, static, and dynamic. These distinctions refer to three different ways of utilizing the available threads.

4.1 Message Digest

The MD5 algorithm [23] creates a 128-bit signature of an arbitrary length input message. Until recently, it was believed to be infeasible to produce two messages with the same signature or to produce the original message given a signature. However, in March 2005, Arjen Lenstra, Xiaoyun Wang, and Benne de Weger demonstrated [14] that two valid X.509 certificates [11] could be created with identical MD5 signatures. Although more robust algorithms exist, MD5 is still extensively used in public-key cryptography and in verifying data integrity. Our implementation of MD5 was adapted from the NetBench suite of applications for network processors [18, 16]. The NetBench implementation was designed to process packets in a serial fashion utilizing a single thread. The multi-threaded, multiprocessor nature of NPUs is better utilized by processing packets in parallel. In order to analyze the performance characteristics in this environment, our implementation of MD5 offloads the processing of packets to available microengine threads. In this way, the number of packets processed in parallel is equal to the number of microengine threads. As shown in Figure 4.1, the StrongARM processor is responsible for accepting incoming packets and distributing them to idle microengine threads. Communication between the StrongARM and microengines is done through the use of semaphores. When the StrongARM finds an idle thread, it copies a pointer to the current packet and the length of the current packet to shared memory locations known by both the StrongARM and the thread. The StrongARM then sets a semaphore that triggers the thread to begin executing. When all packets have been processed, the StrongARM waits for each thread to become idle, then notifies them to exit before exiting itself.
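This handshake, including the microengine side shown in Figure 4.2, can be sketched as follows; the mailbox layout and function names are illustrative, not the benchmark source:

  enum { IDLE, WORK, EXIT };                 /* semaphore states */

  struct mailbox {
      volatile int   sem;                    /* IDLE, WORK, or EXIT       */
      unsigned char *pkt;                    /* pointer to current packet */
      int            len;                    /* length of current packet  */
  };
  struct mailbox box[24];                    /* one per microengine thread */

  extern int ncpus;                          /* thread count, from simulator */
  extern void md5_packet(unsigned char *pkt, int len);   /* assumed kernel */

  /* StrongARM side: hand a packet to the first idle thread. */
  void dispatch(unsigned char *pkt, int len)
  {
      int t;
      for (;;)
          for (t = 0; t < ncpus; t++)
              if (box[t].sem == IDLE) {
                  box[t].pkt = pkt;
                  box[t].len = len;
                  box[t].sem = WORK;         /* trigger the thread */
                  return;
              }
  }

  /* Microengine side: wait for work, copy the packet to the stack,
     compute the 128-bit signature, and return to idle. */
  void worker(int t)
  {
      while (box[t].sem != EXIT)
          if (box[t].sem == WORK) {
              md5_packet(box[t].pkt, box[t].len);
              box[t].sem = IDLE;
          }
  }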

Figure 4.1. MD5 - StrongARM Algorithm

Each microengine thread proceeds as shown in Figure 4.2. It waits until its semaphore has changed, then either exits or copies the current packet to its stack before processing it to generate a 128-bit signature. It then resets its semaphore and returns to waiting.

4.2 URL-Based Switch

URL-based switching directs network traffic based on the Uniform Resource Locator (URL) found in a packet. Other terms for URL-based switch include Layer 7 switch, content-switch, and web-switch. The purpose of switching based on Layer 7 content is to realize improved performance and reliability of web-based services.

Figure 4.2. MD5 - Microengine Algorithm

A Layer 4 switch located in front of a cluster of servers can control how each Transmission Control Protocol (TCP) connection is established on a per connection basis. How requests are directed within a connection is out of reach of a Layer 4 switch. Traffic can be managed per request, rather than per connection, by a URL-based switch [12]. In order to manage requests, a URL-switch acts as the end point of each TCP connection and establishes its own connections to the servers containing the content requested by the client. It then relays content to the client. In this way, the switch can perform load-balancing and fault detection and recovery. For instance, if one server is overloaded or unreachable, the switch can send its request to a different server. Our URL-based switching algorithm is based on the implementation found in NetBench [18, 16]. The algorithm searches the contents of a packet for a list of matching patterns. Each pattern has an associated destination that can be used to switch the packet or begin another process. The focus of our URL-based switch is the pattern matching algorithm.

Unlike our implementation of MD5, our URL-based switch does not utilize parallelism by processing multiple packets at once; instead it uses multiple threads to process each packet. The data structure used to store patterns is a list of lists. Each element of the primary list is made up of a secondary list and the largest common substring of the patterns in the secondary list. The algorithm proceeds as shown in Figures 4.3 and 4.4.

Figure 4.3. URL - StrongARM Algorithm

Each packet received by the StrongARM is copied to the stack and run through the Internet checksum algorithm to verify its integrity. For each element in the list of largest common substrings, the StrongARM copies the element's secondary list pointer to a shared memory location known by an idle microengine thread. A pointer to the current packet is also copied to shared memory and then the idle microengine's semaphore is set to notify it to begin executing.
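The pattern table described above can be sketched with the following structures (illustrative names, not the NetBench source):

  /* "List of lists": each primary entry pairs the largest common
     substring of a group of URL patterns with the secondary list of
     full patterns in that group. */
  struct pattern {
      const char     *text;        /* full URL pattern               */
      int             dest;        /* destination used for switching */
      struct pattern *next;
  };

  struct group {
      const char     *common;      /* largest common substring       */
      struct pattern *patterns;    /* secondary list                 */
      struct group   *next;        /* next entry in the primary list */
  };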

Figure 4.4. URL - Microengine Algorithm

The microengine thread first copies the packet to its stack and then uses a Boyer-Moore search function to determine whether the packet contains the largest common pattern. If this test is positive, then the thread proceeds to search for a matching pattern in the secondary list. Otherwise, the microengine resets its semaphore and returns to an idle state. If the thread finds a matching pattern, it sets its semaphore to reflect this before returning to an idle state. The StrongARM continues until it reaches the end of the primary list or until a thread finds a matching pattern; it then processes the next packet.

4.3 Advanced Encryption Standard

AES is an encryption standard adopted by the US government in 2001 [20]. The standard was proposed by Vincent Rijmen and Joan Daemen under the name Rijndael [7]. AES is a block cipher encryption algorithm based on 128, 192, or 256 bit keys. The algorithm is known to perform efficiently in both hardware and software. Our implementation of AES is based on the Rijndael algorithm found in the MiBench embedded benchmark suite [10, 21]. In much the same way that our MD5 algorithm processes packets in parallel, our AES algorithm offloads the encryption of packets to microengine threads. The encryption is performed using a 256-bit key that is loaded into each thread's stack during startup. This algorithm executes on the simulator in the same manner as MD5 above (Figures 4.1 and 4.2).

Chapter 5

Results

In order to evaluate the effectiveness of NPUs to support multi-threaded end-to-end applications and the effectiveness of various parallelization techniques to take advantage of the NPU architecture, we performed four types of tests: Isolation, Shared, Static, and Dynamic. The Isolation tests establish a baseline and explore application behavior on the multi-threading NPU architecture. The Shared tests explore how each application is affected by the concurrent execution of other applications. The Static tests reveal characteristics of an end-to-end application and how to best distribute threads. Finally, the Dynamic tests serve to compare an on-demand thread allocation algorithm to statically allocated threads.

5.1 Isolation Tests

The purpose of the Isolation tests is twofold: to establish a baseline for subsequent tests and to explore the effects of multi-threading on the NPU. The Isolation tests consisted of independent tests for each application.

For each independent test, the number of microengine threads available to the application was varied between 1 and 24, since the simulator supports up to 24 threads. A data point was also gathered for the serial version of each application, in which no microengine threads were used.

5.1.1 MD5

Figure 5.1. MD5 Isolated Speedup on 1000 Packets

Test results in Figure 5.1 show that parallelization of the MD5 algorithm offers significant speedup compared to its serial counterpart. The data point at zero threads represents the serial version of MD5 executed on the StrongARM processor. The data point at 1 thread represents the multi-threaded version making use of the StrongARM and a single microengine. This case is slower than the serial version because of the overhead involved in communication between the StrongARM and the microengine and because the microengine does not offer processing power as strong as the StrongARM's. As the number of threads increases, the combined processing power of the microengines outweighs the communication overhead.

The slope of the speedup graph in Figure 5.1 decreases suddenly at 7, 13, and 19 threads. These changes can be attributed to the fact that there are 6 microengines; therefore, up until 7 threads each microengine is responsible for a single thread. From 7-12 threads, each microengine is burdened with up to 2 threads. Similarly, as the number of threads increases to 24, each microengine is burdened with 3 and then 4 threads, causing the speedup to approach a flat line.

5.1.2 URL

Figure 5.2. URL Isolated Speedup on 100 Packets (non-polling)

Although test results for the parallelization of URL show improvements over the serial version, characteristics of the algorithm limited speedup. As stated in the previous chapter, the URL algorithm is parallelized in such a way that multiple threads work together to process each packet. Each thread is responsible for searching the packet for a particular set of patterns, and the first match preempts further execution. The drawback of this algorithm is that since only one thread will find a match, the other threads do work that in hindsight is unnecessary.

Figure 5.3. URL Isolated Speedup on 100 Packets (polling)

This in itself would not be detrimental to the application's performance except that all threads are vying for a limited number of shared resources. We developed two variations of the URL algorithm in an attempt to minimize the cycles spent searching false leads. The first version allows each thread to run to completion after a matching pattern is found. Once a thread reports to the StrongARM that a match has been found, the StrongARM stops spawning new threads and simply waits for the active threads to finish, although their processing is immaterial. In the alternative approach, when a match is found, the StrongARM sets a global flag that is constantly polled by each thread. When a thread detects that the flag has changed, it stops executing. Although it was expected that the polling version of URL would perform better, it actually performed slightly worse than the non-polling version. As shown in Figures 5.2 and 5.3, the highest speedup attained by non-polling was 1.75 and for polling 1.64. Analysis of the application's output shows that a matching pattern is found in only about 40% of the trace packets; thus polling is unable to preempt execution 60% of the time.
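A minimal sketch of the polling variant, reusing the group structure sketched in the previous chapter (names are illustrative; boyer_moore() and the MATCHED value are assumptions):

  #define MATCHED 2                /* assumed semaphore value: pattern found */

  extern int boyer_moore(unsigned char *pkt, int len, const char *pat);

  volatile int match_found = 0;    /* set by the StrongARM on any match */

  void search_secondary(struct group *g, unsigned char *pkt, int len,
                        volatile int *my_sem)
  {
      struct pattern *p;
      for (p = g->patterns; p != NULL; p = p->next) {
          if (match_found)         /* preempted: another thread matched */
              return;
          if (boyer_moore(pkt, len, p->text)) {
              *my_sem = MATCHED;   /* report via this thread's semaphore */
              return;
          }
      }
  }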

The difference in speedup is due to the fact that the polling version is doing unnecessary work 60% of the time and that polling itself wastes too many cycles. In both versions of URL, the speedup drops off after reaching a maximum between 4 and 6 threads. This indicates that contention for shared resources becomes a problem after this point.

5.1.3 AES

Figure 5.4. AES Isolated Speedup on 100 Packets

Speedup tests on AES show that this algorithm performs poorly when offloaded to the microengines. The AES encryption algorithm requires each packet to be read and processed 16 bytes at a time. State is maintained for the lifetime of each packet in an accumulator that is made up of the encryption key and state variables. In addition, a static lookup table of 8 KBytes is required. The L1 data cache for the StrongARM is 8 KBytes compared to 1 KByte for the microengines. Due to the limited size of the microengine caches, AES suffers from substantial cache misses.

Processing of each packet consumes roughly 1.36 million simulator cycles when encryption is performed on the StrongARM. The same process consumes roughly 11.4 million simulator cycles on a microengine thread when it is the only microengine thread running. This is an increase by a factor of 8.4. In contrast, MD5 consumes roughly 0.518 million cycles on the StrongARM and 0.922 million cycles on a single microengine thread. This results in an increase by a factor of 1.6. Thus, AES requires a substantially higher increase in cycles when moving from the StrongARM to a microengine thread. Figure 5.4 shows that although performance on the microengine threads is far worse than the serial version, it remains relatively constant as the number of threads increases. Therefore, the poor performance of AES on the microengine threads is primarily a result of processing power and cache size, not memory contention between threads, which would be the case if speedup tailed off.

5.1.4 Isolation Analysis

These tests reveal general characteristics of each kernel on both the StrongARM and microengines. MD5 has been shown to offer strong speedup on microengine threads using conventional parallelization. URL, using an alternative approach to multi-threading, has been shown to provide maximum speedup between 4 and 6 threads employing either a polling or non-polling scheme. Finally, AES reveals an algorithm with poor performance on the microengines that cannot be overcome by multi-threading.

5.2 Shared Tests

The purpose of the Shared tests is to determine how sensitive each kernel is to the concurrent execution of the other kernels. For these tests we ran all three kernels on the simulator at the same time. The StrongARM served as the controller, passing incoming packets to available microengine threads. We ran one test for each kernel, in which the number of threads available to the kernel under test was varied, while the threads available to the other kernels remained constant. Our baseline for each of these tests was 1 thread for MD5, 4 threads for URL, and 1 thread for AES. This baseline was chosen because running URL with fewer than 4 threads was found to cause a significant bottleneck. The number of threads available to the kernel under examination was increased for each subsequent run. Each kernel processed a separate packet stream until the kernel under test completed the desired number of packets, in this case 50. Figure 5.5 shows the speedup results from all three tests on the same graph, revealing the relative speedup of each kernel. Clearly, MD5 and AES have much greater speedup than URL, indicating they are less sensitive to the concurrent processing of other kernels. However, it is more interesting to compare the Shared speedup of each kernel with its Isolated speedup. This comparison is covered in the following subsections.

5.2.1 MD5

The speedup results of MD5 in the Isolation and Shared tests, shown in Figures 5.1 and 5.5 respectively, show few differences. The slope of each graph is approximately the same and both peak near a speedup of six. This indicates that MD5 is not substantially affected by the concurrent execution of URL and AES.

Figure 5.5. Shared Speedup on 50 Packets

The lightweight nature of MD5 with regard to memory is the most likely explanation for this behavior. Figure 5.6 compares the MD5 Isolation and Shared tests with regard to the number of cycles consumed by the StrongARM while 50 packets are processed, revealing that more cycles are required to process the same packet stream when MD5 is sharing the resources of the NPU. The horizontal axis corresponds to the number of MD5 threads employed to process the packets, while the vertical axis corresponds to the number of cycles spent processing the packet stream. Since in the Shared tests 4 threads are allocated to URL and 1 to AES, these threads cause contention for access to shared resources and therefore higher cycle counts than in the Isolation tests.

Figure 5.6. MD5 Isolated vs. Shared Cycles on 50 Packets

5.2.2 URL

Although the Shared speedup of URL shown in Figure 5.5 steadily increases, its maximum of 1.17 with 22 threads does not match the Isolation speedup shown in Figure 5.2, which peaks at 1.75 and degrades to 1.41 with 22 threads. This indicates that URL is affected by the concurrent execution of other applications due to its memory access requirements.

5.2.3 AES

The Shared speedup of AES shown in Figure 5.5 is an order of magnitude greater than the Isolation speedup shown in Figure 5.4. This high speedup is due to the fact that the baseline for this test performed extremely poorly. This can be attributed to two characteristics of the AES kernel. Firstly, as shown in the Isolation tests, AES performs poorly on the microengines due to their lack of processing power and the limited size of their cache.

Secondly, since the StrongARM is the controller for all three kernels, it continuously monitors all of the microengine threads and distributes incoming packets as necessary. In the baseline, the StrongARM has to monitor one thread for each kernel, thus only one-third of its time is spent monitoring the AES thread. Therefore, the AES thread occasionally finishes processing a packet and wastes idle cycles waiting for the StrongARM to send it another packet. As more threads are allocated to AES, the StrongARM spends a larger percentage of time monitoring AES threads, therefore increasing throughput.

5.2.4 Shared Analysis

The Shared tests reveal that MD5 and AES are relatively insensitive to the concurrent execution of the other kernels on a single NPU. URL, however, is sensitive, and its speedup suffers when it is run alongside the other kernels.

5.3 Static Tests

The Static tests were designed to reveal characteristics of the end-to-end application, such as the location of bottlenecks and the ideal thread configuration. The testing process was similar to that of the Shared tests, the difference being that instead of processing independent packet streams, the applications worked together to process a single packet stream. Each incoming packet was processed first by MD5, then by URL, and finally by AES. This scenario represents a possible end-to-end application running on a NPU, as shown in Figure 5.7. The purpose of this application is to distribute sensitive information from a trusted internal network through the Internet to a variety of hosts.

Each packet is received by the application from the internal network; the application calculates its MD5 signature, determines its destination based on a deep inspection of the packet, and then encrypts it. Finally, the encrypted packet along with its signature is sent to a host machine, although this step is not included in the simulated application. To complete this scenario, the host machine would decrypt the packet and verify that the contents were not modified in transit by comparing the included signature to a newly generated one. This is also not included in the simulation.

Figure 5.7. End-to-End Application Scenario

Figure 5.8. Optimization with Static Allocation of Threads

For these tests, the number of threads allocated to each stage of the end-to-end application is static for each run.

Once again, the baseline test is 1 thread for MD5, 4 threads for URL, and 1 thread for AES. Each subsequent test increases the number of threads by one and attempts to determine the optimal configuration. The optimal configuration is determined by giving the additional thread to each of the applications in turn, and observing which configuration yields the best speedup. This configuration is then used as a starting point for the subsequent test. Figure 5.8 shows the resulting optimal configurations for each number of available threads between 6 and 24. These configurations were found through test runs of 50 packets. MD5 never became a bottleneck point, and 1 thread remained sufficient throughout the tests. URL and AES almost evenly split the remaining threads, with the final configuration of 12 threads for AES, 11 for URL, and 1 for MD5. These results show that the demands of AES and URL are similar and parallelization offers increased performance for these applications, while the simplicity of MD5 makes parallelization of it in the context of this end-to-end application unnecessary. The above discovery reveals an interesting characteristic of this end-to-end application. Although MD5 provided the best speedup in the Isolation tests, parallelizing it in the Static tests resulted in less performance improvement than further parallelization of the other applications. This can be explained by Amdahl's Law [2], which states that the overall speedup achievable from the improvement of a proportion of the required computation is affected by the size of that proportion. If P is the proportion and S is the speedup of that proportion, Amdahl's Law states that the overall speedup will be:

  speedup = 1 / ((1 - P) + P / S)
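For illustration only (the proportions below are hypothetical, chosen to show the shape of the effect, not measured values): if MD5 accounted for P = 0.1 of the end-to-end computation and its threads achieved S = 6, the overall speedup would be 1 / (0.9 + 0.1/6) ≈ 1.09, whereas the same sixfold speedup applied to a stage with P = 0.5 would yield 1 / (0.5 + 0.5/6) ≈ 1.71. A small proportion caps the attainable benefit no matter how well the stage itself scales.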

Therefore, the computation required to perform MD5 in this end-to-end application is a small proportion of the overall computation. Subsequently, speedup benefits more through increased parallelization of URL and AES. It is also interesting to note that although AES did not benefit from additional microengines during the Isolation tests (Figure 5.4), in the high load context of this end-to-end application additional AES threads benefit overall performance. Figure 5.8 also shows that initially more threads were allocated to URL, and after 14 threads more threads were allocated to AES. Since URL is required to finish processing each packet before it can be sent to AES, URL caused more of a bottleneck when it had fewer than 10 threads. After that point, AES required 4 threads for every 1 for URL in order to keep pace.

5.4 Dynamic Tests

The Dynamic tests present an alternative approach to the Static tests. Where the Static tests represent ideal configurations, the Dynamic tests represent realistic configurations. Static allocation of microengine threads is also much less feasible since all possible configurations must be run in order to determine the best one for the given end-to-end application. This could become an extremely complex and lengthy process. The trade-off with a dynamic heuristic is increased complexity in the logic of the application. The purpose of these tests was to determine how an on-demand allocation of threads performs against a static approach.

The Dynamic tests consist of all three kernels processing the same packet stream in serial, as in the Static tests, but with threads dynamically allocated based on demand. Once again, the StrongARM serves as the controller and is responsible for allocating threads. Allocation is implemented through the use of queues for each stage of the end-to-end application. Each queue stores pointers to packets that are waiting to be processed by the next stage. The StrongARM detects when a queue has packets and creates threads to process them.

Figure 5.9. Dynamic Speedup on 50 Packets

Figure 5.9 shows the speedup of the Dynamic application using as a baseline the Static configuration consisting of 1 MD5, 4 URL, and 1 AES thread. The speedup increases from 4.29 with 6 threads to 4.39 with 24 threads, a substantial increase over the Static baseline. Figure 5.10 shows the difference between the number of cycles required for each of the applications to process the same number of packets. While the Static version spent in the neighborhood of 1.3 billion cycles per 50 packets, the Dynamic version spent closer to 300 million, a ratio of 4.3:1.
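The on-demand allocation can be sketched as follows (a simplified model; the queue bound and helper functions are assumptions, not the application source):

  enum stage { MD5_STAGE, URL_STAGE, AES_STAGE, NSTAGES };

  /* One queue per stage holds pointers to packets awaiting that stage;
     head and tail are free-running counters (empty when head == tail). */
  struct queue {
      unsigned char *pkt[64];      /* assumed bound on waiting packets */
      int head, tail;
  };
  struct queue q[NSTAGES];

  extern int  find_idle_thread(void);                    /* assumed helper */
  extern void assign(int t, int s, unsigned char *pkt);  /* assumed helper */

  /* Controller loop on the StrongARM: whenever a stage has waiting
     packets, put an idle thread to work on that stage. */
  void schedule(void)
  {
      int s, t;
      for (s = 0; s < NSTAGES; s++) {
          if (q[s].head == q[s].tail)
              continue;                        /* queue empty */
          t = find_idle_thread();
          if (t >= 0)
              assign(t, s, q[s].pkt[q[s].head++ % 64]);
      }
  }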

Figure 5.10. Static vs. Dynamic Cycles on 50 Packets

This discrepancy can be attributed to cycles wasted on idle threads. With the Static version, each thread is statically assigned to perform either MD5, URL, or AES. Since the URL kernel requires much longer to run than MD5, the queue of packets waiting for URL processing is quickly filled, forcing the MD5 thread to stop processing new packets until URL can reduce the queue. At the same time, when the URL threads were unable to process packets as quickly as the AES threads, some AES threads wasted idle cycles. The Dynamic version did not suffer from these bottleneck issues because idle threads were put to use by whichever kernel required them. Another benefit of the Dynamic version is that it is able to adjust to changes in load caused by varying packet sizes and payloads. Specifically, since URL performs a thorough string matching on the payload of each packet, the size of the packet has a large effect on the number of cycles required to process it. The Dynamic version is able to minimize bottlenecks in URL due to large packets by putting more threads to work on the bottleneck.

Figure 5.10 also shows that neither the Static nor the Dynamic versions of the end-to-end application benefit much from additional threads. The number of cycles remains relatively constant from 6 to 24 threads. The Isolation tests show that between 6 and 24 threads MD5 is the only kernel to experience significant performance improvement. The speedup of URL declines slightly and AES remains relatively constant. Therefore, with the exception of the MD5 kernel, the end-to-end applications experience performance characteristics similar to the Isolation tests. Once again, this can be explained by Amdahl's Law [2], because MD5 constitutes only a small percentage of the overall computation. Thus, the performance of the end-to-end application is driven by the performance of the URL and AES kernels.

5.5 Analysis

We performed four types of tests for our analysis: Isolation, Shared, Static, and Dynamic. The Isolation tests established a baseline and explored kernel behavior on the multi-threading NPU architecture. The Shared tests explored how each kernel was affected by the concurrent execution of other kernels. The Static tests revealed characteristics of an end-to-end application and how to best distribute threads. Finally, the Dynamic tests served to compare an on-demand thread allocation algorithm to statically allocated threads. The Isolation tests revealed general characteristics of each kernel on both the StrongARM and microengines. MD5 offered strong speedup on microengine threads using conventional parallelization. URL, using an alternative approach to multi-threading, provided maximum speedup between 4 and 6 threads employing either a polling or non-polling scheme. Finally, AES revealed an algorithm with poor performance on the microengines that could not be overcome by multi-threading.

The Shared tests revealed that MD5 and AES are relatively insensitive to the concurrent execution of the other applications on a single NPU. URL, however, was shown to be sensitive because its speedup suffered when it was run alongside the other kernels. The Static tests provided a baseline for the Dynamic tests and revealed the optimal thread configurations for running the end-to-end application. Results showed that the demands of AES and URL are similar and parallelization offered increased performance for these applications, while the simplicity of MD5 made parallelization of it in the context of this end-to-end application unnecessary. As an alternative to statically allocating threads, the Dynamic tests explored the benefits of dynamically allocating threads. Overall, the Dynamic tests required less than 25% as many cycles to process each 50 packet test as the Static counterpart.

Chapter 6

Conclusion

We have presented a network processor simulator, multi-threaded end-to-end benchmark applications, and an analysis of the characteristics of these applications on NPUs. Our first contribution was the creation of a simulator that emulates a generic network processor modeled on the Intel IXP1200. Our simulator fills a gap in existing academic research by supporting multiple processing units. Our second contribution was the construction of multi-threaded, end-to-end application benchmarks. These benchmarks extend the functionality of existing benchmarks based on single-threaded kernels. Our final contribution was an analysis of the characteristics of our benchmarks on our network processor simulator.

Our analysis in Chapter 5 found several interesting results. First, although the MD5 kernel scaled well in the Isolation and Shared tests, parallelizing it in an end-to-end application had little effect due to Amdahl's Law. Second, the Static and Dynamic tests found that the end-to-end application gained little performance from the addition of more than 6 threads. Finally, the Dynamic version of the end-to-end application required less than 25% as many cycles to process the same packet stream as the Static version.

In an attempt to bridge the gap between the speed of ASIC chips and the flexibility of general-purpose processors, NPUs employ parallel processing and special-purpose hardware and memory structures, among other techniques. While NPUs make it possible to deploy complex end-to-end applications into the network, high-speed networks place a heavy load on these devices, making application optimization an important area of research. The simulator presented in this paper made possible the development and analysis of two end-to-end application benchmarks, as well as of the kernels making up these applications. Through the development of these kernels and applications, we explored several parallelization techniques. Using our simulator and testing methodology, we unveiled the performance characteristics of these kernels and application benchmarks on a typical NPU.

Chapter 7

Future Work

The simulator developed in this work provides a tool that can be used in a variety of future projects. Thus far, the simulator has been used by Gridley in his Master's thesis on active network algorithm performance [9] and by Tsudama to test his denial-of-service detection algorithm as part of his Master's thesis [29]. As future work, several improvements could be made to the existing simulator, including support for dedicated processing chips, larger cycle-count capability, and the updates necessary to model the current generation of NPUs. Other future work could include testing the existing end-to-end applications on an updated simulator to determine whether the performance problems found in this work have been overcome by the current generation of NPUs. If the same performance problems remain, further investigation into methods of designing parallel applications that avoid bottlenecks on NPUs will be required. Additionally, the parameters of the NPU architecture could be adjusted to determine which changes lead to performance improvements. However, if performance bottlenecks are not found on current NPUs, then larger-scale end-to-end applications should be developed to push the performance limits of the architecture and reveal new bottlenecks.

The benchmark suite could be extended with additional kernels. The existing end-to-end applications could be extended to include these kernels, or new end-to-end applications could be developed to model other real-world scenarios. Optimization of current and future kernels and end-to-end applications will continue to be an open area of research.

Bibliography

[1] J. Allen, B. Bass, C. Basso, R. Boivie, J. Calvignac, G. Davis, L. Frelechoux, M. Heddes, A. Herkersdorf, A. Kind, J. Logan, M. Peyravian, M. Rinaldi, R. Sabhikhi, M. Siegel, and M. Waldvogel. IBM PowerNP network processor: Hardware, software, and applications. IBM Journal of Research and Development, 2003.

[2] Gene Amdahl. Validity of the single processor approach to achieving large-scale computing capabilities. In AFIPS Conference Proceedings, pages 483–485, Atlantic City, N.J., 1967.

[3] Douglas C. Burger and Todd M. Austin. The SimpleScalar tool set, version 2.0. Technical Report CS-TR-1997-1342, Computer Sciences Department, University of Wisconsin, June 1997.

[4] A. Campbell, S. Chou, M. Kounavis, V. Stachtos, and J. Vicente. NetBind: A binding tool for constructing data paths in network processor-based routers. In Proceedings of IEEE OPENARCH 2002, New York City, NY, June 2002.

[5] Intel Corporation. Intel Internet Exchange Architecture (IXA) software development kit. http://www.intel.com/design/network/products/npfamily/sdk_download.htm. Accessed June 3, 2005.

[6] Intel Corporation. IXP1200 network processor datasheet, September 2003.

[7] J. Daemen and V. Rijmen. AES proposal: Rijndael. First Advanced Encryption Standard (AES) Conference, August 1998.

[8] EZchip Technologies. Network processor designs for next generation networking equipment. White Paper, December 1999. http://www.ezchip.com/html/tech_nsppaper.html.

[9] Dave Gridley. Active network algorithm performance on a network processor: Adaptive metric-based routing and multicast. Master's thesis, California Polytechnic State University, San Luis Obispo, June 2004.

[10] M. Guthaus, J. Ringenberg, T. Austin, T. Mudge, and R. Brown. MiBench: A free, commercially representative embedded benchmark suite. In Proceedings of the IEEE 4th Annual Workshop on Workload Characterization, Austin, TX, December 2001.

[11] R. Housley. Internet X.509 public key infrastructure certificate and certificate revocation list (CRL) profile. RFC 3280, Internet Engineering Task Force, April 2002.

[12] PMC-Sierra Inc. URL based switching. White Paper PMC-2002232, February 2001.

[13] Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. The Click modular router. ACM Transactions on Computer Systems, 18(3):263–297, August 2000.

[14] Arjen Lenstra, Xiaoyun Wang, and Benne de Weger. Colliding X.509 certificates. Cryptology ePrint Archive, Report 2005/067, 2005. http://eprint.iacr.org/.

[15] Alberto Leon-Garcia and Indra Widjaja. Communication Networks: Fundamental Concepts and Key Architectures. McGraw-Hill School Education Group, 2000.

[16] Gokhan Memik. NetBench web site. http://cares.icsl.ucla.edu/NetBench/, 2002.

[17] Gokhan Memik and William H. Mangione-Smith. NEPAL: A framework for efficiently structuring applications for network processors. Second Workshop on Network Processors (NP-2), February 2003.

[18] G. Memik, W. H. Mangione-Smith, and W. Hu. NetBench: A benchmarking suite for network processors. In Proceedings of the IEEE International Conference on Computer-Aided Design, November 2001.

[19] N. Shah, W. Plishker, and K. Keutzer. NP-Click: A programming model for the Intel IXP1200. 2nd Workshop on Network Processors (NP-2) at HPCA-9, February 2003.

[20] National Institute of Standards and Technology, National Bureau of Standards, U.S. Department of Commerce. Advanced encryption standard. Federal Information Processing Standard (FIPS) 197, November 2001. http://csrc.nist.gov/publications/fips.

[21] University of Michigan at Ann Arbor. MiBench version 1. http://www.eecs.umich.edu/mibench/, 2002.

[22] R. Ramaswamy and T. Wolf. PacketBench: A tool for workload characterization of network processing. In Proceedings of the IEEE 6th Annual Workshop on Workload Characterization (WWC-6), pages 42–50, Austin, TX, October 2003.

[23] Ronald L. Rivest. The MD5 message-digest algorithm. RFC 1321, Internet Engineering Task Force, April 1992.

[24] N. Shah and K. Keutzer. Network processors: Origin of species. In Proceedings of ISCIS XVII, The Seventeenth International Symposium on Computer and Information Sciences, 2002.

[25] Cisco Systems. Parallel express forwarding. White Paper, 2002. http://www.cisco.com/en/US/products/hw/routers/ps133/products_white_paper09186a008008902a.shtml.

[26] Wikipedia, the free encyclopedia. Application framework. http://en.wikipedia.org/wiki/Application_framework, May 2005.

[27] Wikipedia, the free encyclopedia. Benchmark (computing). http://en.wikipedia.org/wiki/Benchmark_%28computing%29, June 2005.

[28] Wikipedia, the free encyclopedia. Simulator. http://en.wikipedia.org/wiki/Simulator#Simulation_in_computer_science, May 2005.

[29] Brett Tsudama. A novel distributed denial of service detection algorithm. Master's thesis, California Polytechnic State University, San Luis Obispo, June 2004.

[30] T. Wolf and M. A. Franklin. CommBench - a telecommunications benchmark for network processors. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, pages 154–162, Austin, TX, April 2000.

[31] Xelerated. Xelerator X10q network processor. Product Brief, 2004. http://www.xelerated.com/file.aspx?file_id=62.

Appendix A

Acronyms

AES Advanced Encryption Standard
ASIC Application Specific Integrated Circuit
CRC Cyclic Redundancy Check
DH Diffie-Hellman
DMM Dynamic Module Manager
DRR Deficit Round Robin
HTTP Hypertext Transfer Protocol
MD5 Message Digest 5
NAT Network Address Translation
NEPAL Network Processor Application Language
NPU Network Processing Unit
PE Processing Element

PGP Pretty Good Privacy
PISA Portable Instruction Set Architecture
REED Reed-Solomon Forward Error Correction
SHA Secure Hash Algorithm
TCP Transmission Control Protocol
URL Uniform Resource Locator
