
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 18, NO. 11, NOVEMBER 2010


Enhancing the Performance of Symmetric-Key Cryptography via Instruction Set Extensions


Sean O'Melia, Member, IEEE, and Adam J. Elbirt, Senior Member, IEEE
Abstract: In this paper, instruction set extensions for a reduced instruction set computer processor are presented to improve the software performance of the data encryption standard (DES), the triple DES, the international data encryption algorithm (IDEA), and the advanced encryption standard (AES) algorithms. The most computationally intensive operations of each algorithm are off-loaded to a set of newly defined instructions. The additional hardware required to support these instructions is integrated into the processor's data path. For each of the targeted algorithms, comparisons are presented between traditional software implementations and new implementations that take advantage of the extended instruction set architecture. Results show that the utilization of the proposed instructions significantly reduces program code size and improves encryption and decryption throughput. Moreover, the additional hardware resources required to support the instruction set extensions increase the total area of the processor by less than 65%. Finally, it is shown that the throughputs for triple DES, IDEA, and AES are approximately the same when accelerated via instruction set extensions. This allows for seamless and transparent algorithm agility, as one algorithm may be easily replaced by another with minimal performance degradation.

Index Terms: Cryptography, software, symmetric-key.

I. INTRODUCTION

WITH MORE than 188 million Americans connected to the Internet [1], information security has become a top priority. Many applications (electronic mail, electronic banking, medical databases, and electronic commerce) require the exchange of private information. When engaged in electronic commerce, customers provide credit card numbers when purchasing products. If the connection is not secure, an attacker can obtain these sensitive data. In order to implement a comprehensive security plan for a given network to guarantee the security of a connection, the following services must be provided [2].

1) Confidentiality: Information cannot be observed by an unauthorized party. This is accomplished via public-key and private-key encryption.

Manuscript received May 12, 2008; revised November 04, 2008. First published September 15, 2009; current version published October 27, 2010. S. O'Melia is with Massachusetts Institute of Technology (MIT) Lincoln Laboratory, Lexington, MA 02421-6499 USA. A. J. Elbirt is with Charles Stark Draper Laboratory, Inc., Cambridge, MA 02139-3563 USA. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2009.2025171

2) Data integrity: Transmitted data within a given communication cannot be altered in transit due to error or an unauthorized party. This is accomplished via the use of hash functions and message authentication codes.
3) Authentication: Parties within a given communication session must provide certifiable proof of their identity. This is accomplished via the use of digital signatures.
4) Nonrepudiation: Neither the sender nor the receiver of a message may deny transmission. This is accomplished via digital signatures and third party notary services.

Cryptographic algorithms used to ensure confidentiality fall within one of two categories: private key (also known as symmetric key) and public key. Symmetric-key algorithms use the same key for both encryption and decryption. Conversely, public-key algorithms use a public key for encryption and a private key for decryption. In a typical session, a public-key algorithm will be used for the exchange of a session key and to provide authenticity through digital signatures. The session key is then used in conjunction with a symmetric-key algorithm. Symmetric-key algorithms tend to be significantly faster than public-key algorithms, and as a result, are typically used in bulk data encryption [2].

The two types of symmetric-key algorithms are block ciphers and stream ciphers. Block ciphers operate on a block of data while stream ciphers encrypt individual bits. Block ciphers are typically used when performing bulk data encryption, and the data transfer rate of the connection directly follows the throughput of the implemented algorithm.

High-throughput encryption and decryption are becoming increasingly important in the area of high-speed networking. Many applications demand the creation of networks that are both private and secure while using public data transmission links.
These systems, known as virtual private networks (VPNs), can demand encryption throughputs at speeds exceeding asynchronous transfer mode (ATM) rates of 622 million bits per second (Mb/s).

Increasingly, security standards and applications are defined to be algorithm independent. Although context switching between algorithms can be easily realized via software implementations, the task is significantly more difficult when using hardware implementations. The advantages of a software implementation include ease of use, ease of upgrade, ease of design, portability, and flexibility. However, a software implementation offers only limited physical security, especially with respect to key storage [2], [3]. Conversely, cryptographic algorithms that are implemented in hardware are, by nature, more physically secure as they cannot be easily read or modified by an outside attacker when the key is stored in a special memory internal to the device [3]. As a result, the attacker does not have easy access to the key storage

1063-8210/$26.00 © 2009 IEEE


area, and cannot discover or alter its value in a straightforward manner [2].

When using a general-purpose processor, even the fastest software implementations of block ciphers cannot satisfy the required bulk data encryption data rates for high-end applications [4]-[7]. As a result, hardware implementations are necessary for block ciphers to achieve this required performance level. Although traditional hardware implementations lack flexibility, configurable hardware devices offer a promising alternative for the implementation of processors via the use of IP cores in application-specific IC (ASIC) and field-programmable gate array (FPGA) technologies. To illustrate, Altera Corporation offers IP core implementations of the Intel 8051 microcontroller and the Motorola 68000 processor in addition to their own Nios-II embedded processor [8]. Similarly, Xilinx, Inc., offers IP core implementations of the PowerPC processor in addition to their own MicroBlaze and PicoBlaze embedded processors [9]. The ASIC and FPGA technologies provide the opportunity to augment the existing data path of a processor implemented via an IP core to add acceleration modules supported through newly defined instruction set extensions, targeting performance-critical functions [10]-[12]. Many licensable and extendible processor cores are also available for the same purpose [13], [14].

One of the potential advantages of block ciphers implemented in configurable hardware is algorithm agility, the switching of cryptographic algorithms during operation. The majority of modern security protocols, such as secure sockets layer (SSL) or IPsec, allow for multiple encryption algorithms whose use is negotiated on a per-session basis. Whereas algorithm agility can be very costly with traditional hardware, algorithm agility through reconfigurable hardware appears to be an attractive possibility [15], [16].
It is also conceivable that fielded devices will be upgraded with a new encryption algorithm that did not exist (or was not standardized) at design time. Moreover, applications exist that require modification of a standardized algorithm, e.g., by using proprietary substitution boxes (S-Boxes) or permutations. Such modifications are easily made with reconfigurable hardware. Finally, while typically slower than ASIC implementations, reconfigurable implementations have the potential of running substantially faster than pure software implementations.

The use of instruction set extensions follows the hardware/software codesign paradigm to achieve the performance and physical security associated with hardware implementations while providing the portability and flexibility traditionally associated with software implementations [17]. Moreover, when considering alternative solutions, instruction set extensions result in significant performance improvements versus traditional software implementations with considerably reduced logic resource requirements versus hardware-only solutions such as coprocessors [18]-[22].

It is the goal of this research to demonstrate a set of instruction set extensions for a reduced instruction set computing (RISC) processor that enhance the performance of symmetric-key algorithms in software implementations. What follows is an overview of the various methods of speeding up symmetric-key algorithms in software. It will be shown that advances in technology have fueled trends toward

increased reconfigurability in embedded systems, resulting in instruction set extensions becoming an attractive option for performance-critical applications. A discussion of the target processor, the LEON2 RISC processor, will be followed by an analysis of the performance bottlenecks commonly encountered in software implementations of the targeted cryptographic algorithms. The proposed instructions will then be presented, followed by a description of the modifications to the LEON2 processor and its associated development tools. Finally, data on the logic utilization of the additional hardware as well as throughput data for the target algorithms achieved by using the instruction set extensions will be presented to demonstrate the effectiveness of this acceleration method.

II. PREVIOUS WORK

Most traditional methods for improving the throughput of pure software implementations of symmetric-key algorithms fall into one of two categories. One option is to construct memory-based look-up tables, where results of some of the basic operations of the algorithm have been precomputed and stored. The S-Boxes of the data encryption standard (DES) and advanced encryption standard (AES) algorithms are commonly stored in look-up tables in software implementations. Look-up tables may also be used to combine operations used in the DES and AES algorithms. An implementation of DES in [23] combines the S-Box look-up table with the subsequent 32-bit permutation, and uses tables as a part of the initial and final permutations. AES requires several complicated mathematical operations that are time-consuming on general-purpose processors. Therefore, in some implementations, large look-up tables, called T-tables, are employed, which combine several of these complex operations into a single table access [17]. A look-up table-based implementation is a viable option for systems with a large memory space and low memory access times.
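The T-table idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of [17]: the helper names (`gf_mul`, `sbox`, `T0`, `column_via_table`, `column_direct`) are hypothetical, the byte order inside each 32-bit word is an assumed convention, and the S-Box is computed from its mathematical definition rather than stored. One table entry packs the S-Box output already multiplied by the MixColumns coefficients, so an output column needs four table reads and XORs in place of separate substitution and multiplication steps.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def sbox(x):
    """AES S-Box: multiplicative inverse (0 maps to 0), then the affine map."""
    inv = 1
    for _ in range(254):  # x^254 is the inverse of x in GF(2^8)
        inv = gf_mul(inv, x)

    def rot(v, k):
        return ((v << k) | (v >> (8 - k))) & 0xFF

    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63


# T0[x] packs (02*S[x], S[x], S[x], 03*S[x]) into one 32-bit word (assumed layout).
T0 = []
for x in range(256):
    s = sbox(x)
    T0.append((gf_mul(s, 2) << 24) | (s << 16) | (s << 8) | gf_mul(s, 3))


def ror32(word, n):
    """Rotate a 32-bit word right by n bits."""
    return ((word >> n) | (word << (32 - n))) & 0xFFFFFFFF


def column_via_table(a0, a1, a2, a3):
    """One output column from four table reads; the other tables are rotations of T0."""
    return T0[a0] ^ ror32(T0[a1], 8) ^ ror32(T0[a2], 16) ^ ror32(T0[a3], 24)


def column_direct(a0, a1, a2, a3):
    """Reference: substitute each byte, then multiply by the MixColumns matrix."""
    s = [sbox(a) for a in (a0, a1, a2, a3)]
    matrix = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]
    out = 0
    for row in matrix:
        byte = 0
        for coeff, value in zip(row, s):
            byte ^= gf_mul(value, coeff)
        out = (out << 8) | byte
    return out
```

In a full cipher the four input bytes would be taken from different state columns according to ShiftRows; the sketch leaves that selection to the caller. The single table here occupies 1 kB as 32-bit words, which gives a sense of why cache size matters for this approach.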
However, area-constrained systems suffer large performance penalties using this methodology, and are generally not implemented in this manner [17], [22].

Another method for speeding up the software implementations of cryptographic algorithms involves taking advantage of mathematical or structural properties of the particular algorithm. The initial and final permutations of DES have regular structures that make it possible to execute a series of matrix transformations and XOR operations, as demonstrated in [24]. This translates into a sequence of instructions that is much smaller than the traditional sequence required to perform the initial and final permutations. In a previous work on improving the performance of AES on 32-bit systems, it has been shown that the transfer of a block of plaintext from a column-oriented matrix to a row-oriented matrix reduces the number of instructions required to complete the cipher due to more efficient implementation of the Galois field fixed field constant matrix multiplication operations [25].

In order to extend the cryptographic capabilities of an embedded system without modifying the main processor, a coprocessor solution can be adopted. When there are data that must be encrypted or decrypted through the chosen symmetric-key algorithm, the main processor sends the data and the key material to the coprocessor, and the coprocessor performs the algorithm,


sending the processed data back over the interface to the main processor. Most coprocessor solutions have tended to combine a number of different algorithms to provide a multifaceted security solution. The coprocessors have achieved high throughput values compared to traditional software implementations, and are much more capable of meeting demands for speed-critical network communications. However, this type of solution is generally associated with considerable overhead in terms of hardware area, data transfer latency, and complex interfaces to the main processor [19], [26]-[29].

Previous work on instruction set extensions for generalized permutations is useful for improving the performance of permutations used in the DES algorithm. Two new instructions for general and dynamically specified permutations are presented in [30]. The input and a string of configuration bits are specified in the source operands, and the result is stored in the destination register. Permutations of $n$ bits required multiple issues of the custom instructions as well as several loads of configuration bits into registers.

Hybrid architectures comprising a processor core combined with reconfigurable function blocks are typically used to accelerate the performance of a general-purpose processor for specific applications. Reconfigurable function blocks may support on-the-fly reconfiguration to provide more optimized implementations, and further improve the system performance. The mapping of complex functions to adaptable hardware reduces the instruction fetch and execute bottleneck common to a software implementation [31]. However, the delay associated with communication between the processor and the reconfigurable logic block often becomes the bottleneck within the system [32]. This overhead can be reduced by caching multiple configurations within the reconfigurable function blocks at the cost of more expensive and less flexible hardware [33]-[35].
Hybrid architectures have been targeted at accelerating applications such as symmetric-key cryptography, digital signal processing, data compression, image processing, video processing, multimedia, block matching, automated target recognition, and wireless communications. Symmetric-key implementations targeting the ConCISe, Garp, and MorphoSys architectures have all demonstrated significant performance improvements by off-loading inefficient operations from the processor to the reconfigurable logic blocks [32], [34], [36], [37]. The Garp architecture is of particular interest in that it combines a standard single-issue microprocessor without interlocked pipeline stages (MIPS) processor with a reconfigurable array used as a hardware accelerator that attaches to the MIPS processor as a coprocessor. The reconfigurable array is composed of a matrix of logic blocks whose configuration is controlled by the MIPS processor and accelerated by a configuration cache. The operation of the reconfigurable array is implemented via extensions to the MIPS instruction set. A theoretical implementation of DES in the Garp architecture operating at 133 MHz achieved a factor of 24 speedup versus an equivalent implementation on a 167-MHz Sun UltraSPARC 1/170. The Garp implementation was able to directly implement the S-Box look-up tables in parallel within the reconfigurable array. By avoiding references to external memory, the cycle count was greatly reduced [32], [37].

Numerous other coprocessors have been developed to accelerate cryptographic algorithm implementations. The CryptoManiac very-long instruction word (VLIW) coprocessor [26] was developed as a result of instruction set extensions designed to accelerate the performance of a number of AES candidate algorithms [19]. CryptoManiac features the execution of up to four instructions per cycle and the use of instructions with up to three operands to allow for the combination of short-latency instructions for single-cycle execution. Similarly, the Cryptonite coprocessor is also VLIW based, with two 64-bit data paths and special instructions combined with dedicated memories to support AES implementations [27]. Both coprocessors improve the performance of AES implementations versus implementations targeting general-purpose processors. The implementations in [28] and [29] couple an FPGA coprocessor with a LEON2 processor core. The coprocessors, connected to the LEON2 processor core either via a dedicated interface or as a memory-mapped peripheral, were able to significantly improve the performance of AES implementations [28], [29].

Examples of instruction set extensions designed to improve the performance of cryptographic algorithms include those implemented to perform arithmetic over the Galois field $GF(2^m)$, usually targeting elliptic curve cryptography (ECC) systems. Word-level polynomial multiplication was shown to be the time-critical operation when targeting an acorn RISC machine (ARM) processor in [38], and a special Galois field multiplication instruction resulted in significant performance improvement. Instruction set extensions targeting a scalable processor architecture (SPARC) V8 processor core were used to accelerate the multiplication of binary polynomials for arithmetic in $GF(2^m)$ in [39], resulting in almost double the performance for a specific Galois field $GF(2^m)$ and a fixed reduction polynomial.
Similar results were shown in [40] using the same instruction set extensions retargeted to a 16-bit RISC processor core. The implementation in [20] targets an MIPS32 architecture, and also attempts to accelerate word-level polynomial multiplication through the use of Comba's method of handling the inner loops of the multiplication operation, resulting in a performance improvement by a factor of 6.

Numerous generalized Galois field multipliers have also been proposed for use in elliptic curve cryptosystems [41]-[44]. These implementations focus on accelerating exponentiation and inversion in Galois fields $GF(2^m)$ where $m$ is large. Since they do not employ a Galois field fixed field constant matrix, these implementations are more generalized than is necessary when targeting block cipher implementations. Moreover, block ciphers employ Galois fields with small $m$, resulting in hardware that performs the multiplication at the bit level with no complex multiplication algorithms.

Instruction set extensions designed to minimize the number of memory accesses and accelerate the performance of AES implementations have been proposed for a wide range of processors [17], [45]-[47]. The extensions in [45] target a general-purpose RISC architecture with multimedia instructions. Strategies are presented to implement AES using multimedia instructions while specifically attempting to minimize the number of memory accesses. While the processor is data path scalable, the strategies in [45] do not map well to 32-bit


architectures. The extensions proposed in [46] are designed to combine the SubBytes and MixColumns AES functions into one T-table look-up operation to speed up algorithm execution. However, the functional unit in [46] requires a significant amount of hardware to implement, and cannot be used for either the final AES round (where the MixColumns function is not used) or the key expansion (where the SubBytes function is used without the MixColumns function). Moreover, T-table performance is heavily dependent upon the available cache size [17]. The extensions proposed in [47] target the Xtensa 32-bit processor and improve the performance of AES encryption, but worsen the performance of decryption. Unfortunately, functionality and area overhead information for the extensions is not provided. The implementation in [17] targets a LEON2 processor core, and combines the SubBytes and ShiftRows functions through the use of an instruction set extension termed sbox. Special instructions are also provided to efficiently compute the MixColumns function through the use of ECC instruction set extensions, as proposed in [21]. The combination of these extensions results in a performance improvement of up to 3.68× for encryption and 2.76× for decryption versus an AES implementation with no instruction set extensions.

Multiple implementations of Rijndael and Twofish, both AES candidate algorithm finalists, have been presented, targeting a wide range of hardware technologies. These implementations use specific Galois field fixed field constant multipliers based on the constant matrix of the associated algorithm, thus resulting in either logic equations or look-up tables being generated to perform the multiplication [48]-[54]. Implementations based on logic equations are optimized for area and require a moderate number of logic levels.
Implementations based on look-up tables are optimized for speed at the cost of additional logic resources, though the performance of these implementations, like the software implementations employing T-tables, is highly dependent on the memory system and on cache organization and size. The look-up tables may be replaced with logic equation implementations. In the case of Rijndael, replacing the look-up tables for the SubBytes and MixColumns transformations significantly reduces the hardware resource requirements. In the case of the SubBytes transformation, a reduction in gate count by as much as a factor of 4.66 has been realized by the use of logic equations in place of a look-up table. When performing 16 SubBytes transformations in parallel in a single round of Rijndael, this equates to a savings of over 38 000 gate equivalents. For a pipelined implementation of 128-bit Rijndael, these savings increase to over 380 000 gate equivalents [22]. However, in each case, the implementation must be completely regenerated when changing to a new algorithm with a new Galois field fixed field constant matrix. Depending on the implementation methodology, AES throughputs as high as 70 Gb/s [52] when operating in nonfeedback modes and 2.29 Gb/s [53] when operating in feedback modes have been reported.

Several software implementations of the International Data Encryption Algorithm (IDEA) take advantage of advanced processor architectures that employ instruction parallelism or functional units for multimedia support. A four-way parallel implementation on a 166-MHz Pentium multimedia extensions (MMX) processor [55] achieved a throughput of approximately

72 Mb/s. Throughput values ranging from 421 to 550 Mb/s have been achieved on the Itanium platform running at 733 MHz [56]. The performance evaluations reported in [57] include a comparison of IDEA software implementations on processors with various word sizes, clock frequencies, and cache sizes. Execution times for IDEA encryption ranged from 2555 on the 8-bit 4-MHz Atmega 103 to 9 on the 64-bit 440-MHz UltraSparc with instruction and data cache sizes of 16 kB (time units as reported in [57]). Fast multiplication capability was shown to be a major factor in the performance of the IDEA algorithm.

Implementations of IDEA on reconfigurable computing platforms and systems with coprocessors have shown improved performance. An implementation on an SRC-6E platform [58] achieved throughputs of approximately 590 Mb/s for end-to-end software time of bulk data processing. Comparisons have been made between the performance of IDEA on DSPs, cryptographic coprocessors, and hardware implementations on FPGAs in a hardware-software codesign system that makes use of encryption. Reported performance figures ranged from 32 Mb/s on the DEC SA-110 and 53.1 Mb/s on the TI TMX320C6x DSP chips, to 180 Mb/s using the VINCI cryptographic coprocessor and to 528 Mb/s with an FPGA-based implementation [59].

III. TARGET ALGORITHMS

A. DES and Triple DES

DES is a 16-round Feistel network block cipher. DES takes as input a 64-bit data block and a 64-bit key, where 8 of the 64 bits are used for parity and the other 56 bits comprise the actual key material. Encryption and decryption are functionally the same, except the key schedule for decryption is the reverse of that used for encryption.

The first part of the DES encryption is an initial permutation on the input block. The initial permutation rearranges the input, and the output is divided into a left half $L_0$ and a right half $R_0$, which become the input to the first round. For each round iteration $i$, $L_i = R_{i-1}$ and $R_i = L_{i-1} \oplus f(R_{i-1}, K_i)$.
The $f$-function is the core operation of each DES round and begins with the expansion $E$, which duplicates some of the bits of the 32-bit input to the $f$-function and outputs a 48-bit value. The result of $E(R_{i-1}) \oplus K_i$ is partitioned into eight 6-bit values. The S-Boxes output a 4-bit value based on their corresponding 6-bit input. The outputs of the S-Boxes are concatenated to form the input to the permutation $P$. The permutation rearranges the 32 S-Box output bits, and the result is XORed with $L_{i-1}$ to obtain $R_i$ for the current round. After the final round, $L_{16}$ and $R_{16}$ are swapped prior to the final permutation (the inverse of the initial permutation).

The key schedule for DES operates on the 64-bit master key to produce a series of 48-bit round keys, each used one at a time for the 16 rounds of the cipher. Initially, the bits of the master key are arranged by permuted choice 1 (PC-1) into two 28-bit vectors, $C_0$ and $D_0$ (note that every eighth bit is a parity bit and is discarded). For each round of the cipher, a bit rotation is performed separately on the $C$ and $D$ values. The rotation moves to the left for encryption, and to the right for decryption. The rotation amount depends on the round, and the rotations are carried out in reverse order for the right rotations of the decryption key schedule.
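The round structure and the 28-bit key rotations described above can be sketched as follows. This is an illustrative sketch, not DES itself: `toy_f` is a hypothetical stand-in for the real $f$-function (expansion, S-Boxes, and permutation are omitted), the initial and final permutations are left out, and the round keys are arbitrary values. The `SHIFTS` list is the standard DES rotation schedule, a detail added here that the text above does not spell out.

```python
MASK32 = 0xFFFFFFFF
MASK28 = 0x0FFFFFFF

# Standard DES left-rotation amounts per round (added detail; the text only
# says that the amount depends on the round).
SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]


def toy_f(right, round_key):
    """Hypothetical stand-in for the DES f-function (NOT the real E/S-Box/P logic)."""
    return ((right * 0x9E3779B1) ^ (right >> 7) ^ round_key) & MASK32


def feistel(block64, round_keys):
    """Apply L_i = R_(i-1), R_i = L_(i-1) xor f(R_(i-1), K_i) for each round key."""
    left, right = (block64 >> 32) & MASK32, block64 & MASK32
    for key in round_keys:
        left, right = right, left ^ toy_f(right, key)
    # The halves are swapped after the final round.
    return (right << 32) | left


def rotl28(value, amount):
    """Rotate a 28-bit key-schedule half (C or D) to the left."""
    return ((value << amount) | (value >> (28 - amount))) & MASK28


# Arbitrary toy round keys, one per round.
round_keys = [(0x0F1E2D3C * (i + 1)) & MASK32 for i in range(16)]
plaintext = 0x0123456789ABCDEF
ciphertext = feistel(plaintext, round_keys)
# Decryption reuses the same routine with the round keys in reverse order.
recovered = feistel(ciphertext, round_keys[::-1])
```

Because the round function is only ever XORed into the opposite half, it never needs to be inverted, which is why one routine serves both directions. The rotation amounts sum to 28, so the $C$ and $D$ halves return to their starting values after 16 rounds.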


The final round key is computed as the result of passing the current state of the $C$ and $D$ vectors through permuted choice 2 (PC-2).

The triple-DES algorithm has been suggested as a more secure alternative to DES [60]. As the name suggests, this cipher sequentially executes the DES algorithm three times with keys $K_1$, $K_2$, and $K_3$, where two or all three of these keys may be equivalent. For encryption and decryption, where $PT$ is the plaintext, $CT$ is the ciphertext, $E_K$ is a DES encryption using key $K$, and $D_K$ is a DES decryption using key $K$, $CT = E_{K_3}(D_{K_2}(E_{K_1}(PT)))$ and $PT = D_{K_1}(E_{K_2}(D_{K_3}(CT)))$. Since the output ciphertext from implementations 1 and 2 of DES is used as the input plaintext to implementations 2 and 3, respectively, and the DES initial and final permutations are inverse operations, the inner initial and final permutations may be removed.

Software implementations of DES tend to be significantly slower than hardware implementations. Bit-level manipulations such as those contained in the permutation, expansion, permuted choice, and cyclic left/right shift units do not map well to general-purpose processors. General-purpose processor instruction sets operate on multiple bits at a time based on the processor word size. Moreover, the DES S-Boxes do not use memory in an efficient manner. Software look-up tables would appear to be the obvious implementation choice for the DES S-Boxes. However, the DES S-Boxes have 6-bit addresses and 4-bit outputs, while most memories associated with general-purpose processors use byte addressing with either 8-bit or 32-bit output data. As a result, many software implementations of DES exhibit throughputs that are at least a full order of magnitude slower than hardware implementations. Even the best software implementations are only capable of throughputs in the range of 100-200 Mb/s.
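To make the S-Box memory mismatch concrete, the following sketch indexes the first DES S-Box (S1, as published in the DES standard) with a 6-bit value. The function names `sbox_lookup` and `flatten` are hypothetical, and the flat byte table is only one assumed way a software implementation might store the box.

```python
# DES S-Box S1: 4 rows of 16 four-bit entries.
S1 = [
    [14, 4, 13, 1, 2, 15, 11, 8, 3, 10, 6, 12, 5, 9, 0, 7],
    [0, 15, 7, 4, 14, 2, 13, 1, 10, 6, 12, 11, 9, 5, 3, 8],
    [4, 1, 14, 8, 13, 6, 2, 11, 15, 12, 9, 7, 3, 10, 5, 0],
    [15, 12, 8, 2, 4, 9, 1, 7, 5, 11, 3, 14, 10, 0, 6, 13],
]


def sbox_lookup(table, six_bits):
    """The outer two bits select the row; the middle four bits select the column."""
    row = ((six_bits >> 4) & 0b10) | (six_bits & 0b1)
    col = (six_bits >> 1) & 0xF
    return table[row][col]


def flatten(table):
    """Build a 64-entry byte table indexed directly by the 6-bit input."""
    return bytes(sbox_lookup(table, value) for value in range(64))


S1_FLAT = flatten(S1)
# Each entry holds only 4 useful bits but occupies a full byte of memory.
useful_bits = 4 * len(S1_FLAT)
stored_bits = 8 * len(S1_FLAT)
```

For the input `0b011011`, the outer bits `01` select row 1 and the middle bits `1101` select column 13, giving the output 5. In the flat table half of every stored byte is unused, which illustrates the inefficiency noted above for byte-addressed memories.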
Most of these implementations recommend storing the 32-bit left and right halves of the data stream as a 48-bit padded word within a 64-bit processor word, and implement the permutations and S-Boxes as precomputed look-up tables. Additionally, the look-up table implementation for the S-Boxes is most effective when the size of the look-up tables is minimized, guaranteeing that the data will fit entirely in an on-chip cache. The size minimization of the S-Box look-up tables is achieved by implementing each S-Box in its own look-up table. Finally, one key software optimization is the unrolling of software loops to increase performance. Even when software loops are too cumbersome to unroll, using loop counters that decrement to zero in place of loop counters that increment to a terminal count is shown to greatly increase the performance of software implementations of the DES algorithm. However, the unrolling of software loops must be done with great care such that the total data storage space does not exceed the size of the on-chip cache to avoid extreme performance degradation [61], [62].

B. IDEA

The computations involved in IDEA are based on operations from three different mathematical groups: 16-bit bitwise exclusive OR (denoted by $\oplus$), addition modulo $2^{16}$ (denoted by $\boxplus$), and multiplication modulo $2^{16}+1$ (denoted by $\odot$). For $\odot$, an input of 0x0000 represents the value $2^{16}$. This is because the operation is performed over the multiplicative group modulo $2^{16}+1$, where zero is not a member of the group, but $2^{16}$ is a member of the group. The value $2^{16}$ is denoted by 0x0000 so that only 16 bits are required to represent all possible input values to each operation.

Like DES, IDEA also operates on 64-bit data blocks. However, while DES requires a 56-bit key, IDEA requires a 128-bit key, accounting for the increased security of the cipher as compared to DES. The algorithm consists of eight rounds followed by a final transformation to obtain the output. Similar to DES, the procedure is the same for both encryption and decryption, but different key schedules are used. The input plaintext is represented by four 16-bit subblocks $X_1$, $X_2$, $X_3$, and $X_4$. These subblocks are combined with the six 16-bit subblocks of the round key for the current round $r$, labeled $Z_1^{(r)}, \ldots, Z_6^{(r)}$, using the mathematical operations noted before.

The IDEA key schedule for encryption is based on a series of left rotations of the 128-bit master key. The master key is first partitioned into eight 16-bit blocks; these are the first eight key subblocks: $Z_1^{(1)}$, $Z_2^{(1)}$, $Z_3^{(1)}$, $Z_4^{(1)}$, $Z_5^{(1)}$, $Z_6^{(1)}$, $Z_1^{(2)}$, and $Z_2^{(2)}$. The next eight key blocks are obtained by rotating the key to the left by 25 bits, then performing the partition again. This process is repeated until all 52 key blocks are generated (six blocks for each of the eight rounds and four blocks for the final transformation). The key schedule for decryption is based on the encryption key schedule and requires the computation of the multiplicative inverse modulo $2^{16}+1$ of $Z_1^{(r)}$ and $Z_4^{(r)}$, and the additive inverse modulo $2^{16}$ of $Z_2^{(r)}$ and $Z_3^{(r)}$.

In terms of the core operations of IDEA, bitwise XOR and addition are easily implemented with one instruction each in software. For reduction modulo $2^{16}$, a processor such as the LEON2 that only performs arithmetic on 32-bit register operands requires an additional logic instruction to mask out the bits that may overflow into the 16 most significant bits of the destination register.
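The IDEA group operations and the encryption key schedule described above can be sketched as follows. The function names are hypothetical, the multiplication uses a plain `%` reduction for clarity (not an optimized reduction), and the inverse helpers only illustrate the values needed by the decryption key schedule.

```python
MODULUS = 0x10001  # 2^16 + 1, a prime


def idea_mul(a, b):
    """Multiplication modulo 2^16 + 1, where 0x0000 stands for 2^16."""
    a = a or 0x10000
    b = b or 0x10000
    return (a * b % MODULUS) & 0xFFFF  # a result of 2^16 is stored as 0x0000


def idea_add(a, b):
    """Addition modulo 2^16: mask off any carry into the upper 16 bits."""
    return (a + b) & 0xFFFF


def idea_mul_inverse(a):
    """Multiplicative inverse modulo 2^16 + 1 (by Fermat's little theorem)."""
    return pow(a or 0x10000, MODULUS - 2, MODULUS) & 0xFFFF


def idea_add_inverse(a):
    """Additive inverse modulo 2^16."""
    return (-a) & 0xFFFF


def idea_encrypt_subkeys(key128):
    """Partition the key into eight 16-bit blocks, rotate left by 25 bits, repeat."""
    subkeys = []
    while len(subkeys) < 52:
        for i in range(8):
            subkeys.append((key128 >> (112 - 16 * i)) & 0xFFFF)
        key128 = ((key128 << 25) | (key128 >> 103)) & ((1 << 128) - 1)
    return subkeys[:52]  # 6 per round for 8 rounds, plus 4 for the final step


subkeys = idea_encrypt_subkeys(0x00010002000300040005000600070008)
```

With the example key above, the first eight subkeys are simply the eight 16-bit words of the key, and the ninth is taken from the key after its first 25-bit rotation. The mask in `idea_add` corresponds to the extra logic instruction mentioned for 32-bit processors.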
However, the major performance bottleneck for a software implementation of IDEA is multiplication modulo $2^{16}+1$. The multiplication may require several clock cycles to complete (especially on processors without hardware multipliers), and the modular reduction, which is commonly implemented using the low-high lemma [63], requires additional execution time.

C. AES

One of the most significant features of the AES algorithm is the extensive use of Galois field arithmetic. The particular field used in the AES algorithm is the Galois field $GF(2^8)$. Values are represented by polynomials of the form $b_7x^7 + b_6x^6 + \cdots + b_1x + b_0$, or $\{b_7 b_6 \ldots b_1 b_0\}$ in bit vector notation, where each $b_i$ is a coefficient in the Galois field $GF(2)$. Addition is done by computing the sum modulo 2 of coefficients in the same bit positions; this can be accomplished by applying the bitwise XOR operation to the coefficients. Multiplication works in much the same way as ordinary polynomial multiplication, but there is an additional step for modular reduction of the product by an irreducible polynomial so that the final product results in a polynomial in the Galois field $GF(2^8)$. For the AES algorithm, the irreducible polynomial is $m(x) = x^8 + x^4 + x^3 + x + 1$ [64].
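The two-step multiplication just described (polynomial product, then reduction) can be written out as a short sketch. The names `gf_add`, `gf_mul`, and `AES_POLY` are hypothetical; the constant 0x11B is the bit-vector form of the AES irreducible polynomial $x^8 + x^4 + x^3 + x + 1$.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1 as a bit vector


def gf_add(a, b):
    """Addition in GF(2^8): coefficient-wise sum modulo 2, i.e. bitwise XOR."""
    return a ^ b


def gf_mul(a, b):
    """Multiplication in GF(2^8): carry-less product, then modular reduction."""
    # Step 1: ordinary polynomial multiplication with coefficients in GF(2).
    product = 0
    for bit in range(8):
        if (b >> bit) & 1:
            product ^= a << bit
    # Step 2: reduce the (at most degree-14) product modulo the AES polynomial,
    # clearing one high-order term at a time from the top down.
    for bit in range(14, 7, -1):
        if (product >> bit) & 1:
            product ^= AES_POLY << (bit - 8)
    return product
```

As a check against the worked example in the AES standard, `gf_mul(0x57, 0x83)` evaluates to `0xC1`. The bit-serial loops mirror the bit-level multiplication that suits small fields such as $GF(2^8)$.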

AES operates on data block sizes of 128 bits and supports key sizes of 128, 192, or 256 bits. The number of rounds used in the cipher is dependent on the key size: ten rounds for a 128-bit key, 12 rounds for a 192-bit key, and 14 rounds for a 256-bit key. This research focuses on a 128-bit key implementation, but is easily extended for use in implementations with larger key sizes. The plaintext is arranged into a $4 \times 4$ matrix of 8-bit values called the state. After an initial key whitening stage, where each column of the state is combined by a bitwise XOR operation with a 32-bit subkey, encryption of one plaintext block in AES requires that the SubBytes, ShiftRows, MixColumns, and AddRoundKey operations be performed in sequence in each of the ten rounds (note that the MixColumns operation is omitted from the final round).

SubBytes substitutes each byte $a$ in the state with a new value by computing the multiplicative inverse of the input byte in the Galois field GF($2^8$), denoted as $a^{-1}$ (except for the value 0x00, which is mapped to itself), and then performing an affine transformation over the Galois field GF(2). The result is copied into the position of $a$ in the state. ShiftRows performs cyclic left shifts on each row in the state. The number of bytes by which to shift depends on the row: zero for the top row, one for the second row, two for the third row, and three for the bottom row. MixColumns treats each column of the state as a vector of four polynomials in the Galois field GF($2^8$). Each of the four columns is multiplied by a $4 \times 4$ constant matrix with coefficients in the Galois field GF($2^8$), and each product is then reduced modulo the irreducible polynomial $m(x) = x^8 + x^4 + x^3 + x + 1$. AddRoundKey operates on individual columns of the state. Each column is combined by a bitwise XOR operation with a 32-bit word obtained from the current round key. AES decryption incorporates the inverse operations of those used in encryption. Note that the AddRoundKey is its own inverse.
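The round operations described above can be illustrated with a short Python sketch. It is a bit-level reference model under stated assumptions rather than the software measured in this paper: the helper names (`gf_mul`, `gf_inv`, `sub_byte`, `shift_rows`, `mix_column`) are hypothetical, the state is assumed to be a list of four rows, and the inverse is computed by naive exponentiation for clarity instead of a look-up table.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reducing by AES_POLY at each step."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY        # modular reduction of the shifted operand
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse as a^254; the value 0 maps to itself."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def sub_byte(a: int) -> int:
    """SubBytes for one byte: inversion followed by the affine transformation."""
    x = gf_inv(a)
    y = x
    for shift in range(1, 5):
        y ^= ((x << shift) | (x >> (8 - shift))) & 0xFF
    return y ^ 0x63


def shift_rows(state: list) -> list:
    """Rotate row r of a 4x4 state (a list of four rows) left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def mix_column(col: list) -> list:
    """Multiply one state column by the fixed MixColumns matrix."""
    a0, a1, a2, a3 = col
    return [
        gf_mul(2, a0) ^ gf_mul(3, a1) ^ a2 ^ a3,
        a0 ^ gf_mul(2, a1) ^ gf_mul(3, a2) ^ a3,
        a0 ^ a1 ^ gf_mul(2, a2) ^ gf_mul(3, a3),
        gf_mul(3, a0) ^ a1 ^ a2 ^ gf_mul(2, a3),
    ]
```

Each call to `mix_column` performs eight nontrivial Galois field multiplications (the remaining matrix entries are 1), which gives a sense of why this operation maps poorly to a processor without dedicated support.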
For the 128-bit key size implementation of AES, the master key is expanded into a linear array of $11 \times 4$ 4-byte words through the use of the SubWord, RotWord, and Rcon functions [64]. SubWord applies a substitution to each of the four bytes in the input word using the same S-Box that is used in the SubBytes operation from AES encryption. RotWord performs a cyclic left rotation by one byte on the input word $[a_0, a_1, a_2, a_3]$ to produce an output of $[a_1, a_2, a_3, a_0]$. Rcon is the round constant array with a size of ten words. Rcon is populated as Rcon$[i] = [x^{i-1}, \{00\}, \{00\}, \{00\}]$, where $x = \{02\}$ and the powers are computed in the Galois field GF($2^8$).

AES software performance bottlenecks typically occur in the SubBytes and MixColumns transformations, one or both of which are usually implemented via 8-bit to 8-bit look-up tables. Often, most of the AES round transformations (SubBytes, ShiftRows, and MixColumns) are combined into large look-up tables termed T-tables. Such implementations require up to three T-tables whose size may be either 1 kB or 4 kB, where the smaller tables require performing an additional rotation operation. The goal of the T-tables is to avoid performing the MixColumns and InvMixColumns transformations as these operations perform Galois field fixed field constant multiplication, an operation which maps poorly to general-purpose

processors. However, the use of T-tables has a number of disadvantages. T-tables significantly increase the code size, their performance is dependent on the memory system architecture as well as cache size, and their use causes key expansion for AES decryption to become significantly more complex. As an alternative to the use of T-tables, it is also feasible to have the processor perform all of the AES round transformations. Row-based implementations have been demonstrated to allow for greater efficiency in the implementation of the MixColumns and InvMixColumns transformations versus column-based implementations. However, the SubBytes transformation still remains as a bottleneck, requiring separate 256-byte look-up tables for encryption and decryption [17], [22], [25], [65], [66].

IV. LEON2 PROCESSOR

The LEON2 processor is a RISC CPU produced by Gaisler Research that is implemented in VHSIC (very high speed ICs) hardware description language (VHDL), and is fully synthesizable. The model is highly configurable, allowing for adjustments to many features of the processor using a graphical configuration utility. The entire source code is freely available under the GNU general public license, which enables modifications and enhancements to the architecture. The LEON2 processor is based on the SPARC V8 architecture [67]. The LEON2 architecture provides support for on-chip peripherals such as a floating-point unit (FPU), peripheral component interconnect (PCI), and Ethernet; coprocessor support is also available in accordance with the SPARC model. The main focus of this research with regard to the proposed instruction set extensions is the pipelined integer unit (IU). The IU pipeline consists of five stages: fetch, decode, execute, memory, and write back. All SPARC V8 instructions are implemented in the LEON2 processor architecture. Instructions are grouped according to the values of the various fields in the instruction opcode. Arithmetic, logic, and memory operations have the Format 3 structure.
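To make the Format 3 structure concrete, the following Python sketch packs the fields of a SPARC V8 Format 3 instruction into a 32-bit word. The field layout (op in bits 31:30, rd in 29:25, op3 in 24:19, rs1 in 18:14, the i bit in 13, and either asi and rs2 or simm13 in the low 13 bits) follows the SPARC V8 architecture manual. The helper name `encode_format3` is hypothetical, and the example encodes the standard `add` instruction rather than any opcode from this paper.

```python
def encode_format3(op, op3, rd, rs1, rs2=0, asi=0, simm13=None):
    """Pack a SPARC V8 Format 3 instruction into a 32-bit word.

    If simm13 is None the register form (i = 0) is produced, using the
    asi and rs2 fields; otherwise the immediate form (i = 1) is produced.
    """
    for name, value, bits in (("op", op, 2), ("op3", op3, 6),
                              ("rd", rd, 5), ("rs1", rs1, 5)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    word = (op << 30) | (rd << 25) | (op3 << 19) | (rs1 << 14)
    if simm13 is None:
        if not (0 <= rs2 < 32 and 0 <= asi < 256):
            raise ValueError("rs2 or asi out of range")
        return word | (asi << 5) | rs2
    if not -4096 <= simm13 <= 4095:
        raise ValueError("simm13 must be a signed 13-bit value")
    return word | (1 << 13) | (simm13 & 0x1FFF)


# add %g1, %g2, %g3  (op = 2, op3 = 0x00)
print(hex(encode_format3(2, 0x00, rd=3, rs1=1, rs2=2)))
```

An instruction set extension that stays within this format only needs an unused op3 value, optionally combined with the asi or simm13 bits to select among variants of an operation.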
Most of the available features of the LEON2 processor can be enabled, disabled, or adjusted by using the graphical configuration utility. This research employed a basic configuration with no FPU, PCI, Ethernet, coprocessor interface, or hardware multiplier or divider. To extend the LEON2 architecture beyond the scope of the standard model, additional VHDL code is required. The specific files that must be modified depend on the functionality to be added, but if the instruction set is to be extended, the module containing the SPARC V8 opcode constants must be updated, and these instructions must follow the SPARC V8 architecture specification [68]. The LEON2 implementation can be targeted to any type of FPGA or ASIC technology. Functional verification and performance evaluation of programs built for the LEON2 architecture can be performed with the provided generic test bench. Software code is provided in a format readable by the test bench VHDL code. The software can then be read and executed by the test bench for both functional verification and performance evaluation.

V. PROPOSED INSTRUCTION SET EXTENSIONS

All of the proposed instruction set extensions comply with the SPARC V8 instruction model [68] using the Format 3 structure. All instructions that write to a register execute in one clock cycle

TABLE I INSTRUCTION SET EXTENSION ENCODINGS

except for the mmul16 instruction, which requires two clock cycles. For those instructions that store data directly into registers contained in the hardware added to the data path, the data are available at the [...]. All added hardware modules are coded in VHDL, and all inputs and outputs are read from and written to the LEON2 pipelined IU register file. None of the added logic circuits rely on the external memory for their functionality. A new module was included with the VHDL source for the LEON2 processor architecture to provide an easy way to select specific extensions to be included in the architecture. For the AES S-Box extensions, the available options are no S-Boxes, one S-Box, and four S-Boxes. For all other types of extensions, setting the configuration variables to true includes the extensions in the architecture, while a value of false excludes them from the architecture.

All of the added functional units that support the proposed instruction set extensions have been included in the IU. Component declarations were added for each of the added hardware units and instantiated as part of the arithmetic logic unit. The decode stage of the IU pipeline sets flags for the instruction set extensions, and generates source register and immediate data. On the next clock cycle, the execute stage passes the input operands to the appropriate functional unit based on the instruction flag set. The result is then read from the functional unit when the instruction specifies a destination register to receive an output. The SPARC assembly instruction opcodes are defined in the sparcv8.vhd module. The code added to this module includes the values for the op3 field of the new instructions. Provided in the LEON2 base package is a generic test bench with disassembly support. During functional simulation, assembly instructions are printed out to the simulation software's console window as they are executed. Table I details the encoding for the instruction set extensions.

A.
DES and Triple DES

To support the proposed instruction set extensions for DES and triple DES, permutation, key generator, and $f$-function units were integrated into the LEON2 data path. The permutation unit implements the initial permutation and the final permutation. The inputs are loaded from the source registers specified in the permutation instruction. The inputs are passed through two stages of 2-to-1 multiplexers. The first stage selects the rearranged bits of either the initial permutation or the final permutation output. The output of the selected permutation is represented by a pair of 32-bit vectors. The second stage of multiplexers sets the final output to either the left half or the right half of the output.

The key generator unit was designed to work in conjunction with the $f$-function unit. All registers are sensitive to the rising edge of the clock. A 64-bit master key is loaded into the key generator unit by issuing the deskey instruction. The 64 key bits, which are loaded from the source registers specified in the deskey instruction, are rearranged according to the DES PC-1 mapping. When this instruction is in the execute stage, the key bits are loaded into the $C$ and $D$ registers. When performing encryption, the $C$ and $D$ registers are loaded with the values of $C_0$ and $D_0$ rotated left by 1 bit, since the first values of $C$ and $D$ used in the encryption key schedule are $C_1$ and $D_1$. When performing decryption, the $C$ and $D$ registers are loaded with the exact values of $C_0$ and $D_0$ because the final $C$ and $D$ values used in the encryption key schedule are the first values used in the decryption key schedule.

The $f$-function unit's 32-bit input port for the right half of the round input is loaded from the source register operand specified in the desf instruction. The 48-bit input port for the current round key is driven by the key generator unit. Expansion and permutation blocks are implemented by rerouting the inputs, and the S-Boxes are defined as logic-based mappings. The output of the $f$-function unit is stored in the destination register specified in the rd field of the desf instruction.

The desipl and desipr instructions produce the left and right halves of the initial permutation, respectively. Similarly, the desfpl and desfpr instructions produce the left and right halves of the final permutation, respectively. The left half of the input block must be located in the rs1 register and the right half must be located in the rs2 register. The instruction syntax for the DES permutation instruction is of the form desXXY rs1, rs2, rd, where XX designates the initial or final permutation and Y designates the half. The specific instruction to be executed is determined by the value of selected bits of the asi field; the remaining bits of the asi field are ignored by all of the DES permutation instructions.
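The behavior of the four permutation instructions can be modeled in software as follows. This is a functional sketch only, assuming the standard DES initial permutation table and the operand convention stated above (left half in rs1, right half in rs2). The Python function names mirror the instruction mnemonics for readability, and the `permute` helper is hypothetical.

```python
# Standard DES initial permutation: output bit i takes input bit IP[i - 1],
# with bits numbered 1..64 from the most significant bit.
IP = [start - 8 * k for start in (58, 60, 62, 64, 57, 59, 61, 63)
      for k in range(8)]

# The final permutation is the inverse of the initial permutation.
FP = [0] * 64
for out_pos, src in enumerate(IP, start=1):
    FP[src - 1] = out_pos


def permute(block: int, table: list, width: int = 64) -> int:
    """Apply a bit permutation given as a table of 1-based source positions."""
    out = 0
    for src in table:
        out = (out << 1) | ((block >> (width - src)) & 1)
    return out


def desipl(rs1: int, rs2: int) -> int:
    """Left half of the initial permutation of the block rs1:rs2."""
    return permute((rs1 << 32) | rs2, IP) >> 32


def desipr(rs1: int, rs2: int) -> int:
    """Right half of the initial permutation of the block rs1:rs2."""
    return permute((rs1 << 32) | rs2, IP) & 0xFFFFFFFF


def desfpl(rs1: int, rs2: int) -> int:
    """Left half of the final permutation of the block rs1:rs2."""
    return permute((rs1 << 32) | rs2, FP) >> 32


def desfpr(rs1: int, rs2: int) -> int:
    """Right half of the final permutation of the block rs1:rs2."""
    return permute((rs1 << 32) | rs2, FP) & 0xFFFFFFFF
```

The bit-by-bit loop in `permute` is what a software-only implementation has to emulate with masks and shifts, whereas in hardware each permutation is a fixed rerouting of wires.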
Distinct asi values correspond to the desipl, desipr, desfpl, and desfpr instructions. The inclusion of these instructions allows the initial and final permutations for DES and triple DES to be completed in two instructions each. Traditional software implementations require a series of bit mask setup, shift, logical AND, and logical OR operations for each bit, for a total of 256 instructions [30]. The improved permutation algorithm used in [69] requires 44 instructions to complete on a SPARC V8 processor such as the LEON2, which is still significantly larger than the instruction count for the proposed instruction set extensions.

The desdir instruction sets up the key generator to output round keys in either encryption or decryption order. This instruction also resets the round counter of the key generator according to the chosen direction to ensure that the output of the round keys may be immediately carried out in the proper order. It is not necessary to reload the master key after this instruction is executed. The instruction syntax for the DES direction instruction is of the form desdir imm.

The desdir instruction is used in conjunction with the deskey and desf instructions. The deskey instruction loads the 64-bit master key. The left half of the master key must be contained in the rs1 register, and the right half in the rs2 register. The instruction syntax for the DES key load instruction is of the form deskey rs1, rs2.

The desf instruction takes the right half of a round output block stored in the rs1 register and stores the output of the $f$-function into the rd register. The round key is not specified here since the round key output of the key generator is hardwired to the $f$-function's round key input. After completion of this instruction, the key generator is signaled to generate the key for the next round. Due to the logic of the key generator, the desf instruction may not be followed by another desf instruction. However, this is not expected to cause a performance bottleneck because an instruction is required in between to complete the round function by XORing the left half with the output of the $f$-function. The instruction syntax for the DES $f$-function instruction is of the form desf rs1, rd.

Implementation of the desdir, deskey, and desf instructions removes the need for storage of the 16 round keys and S-Boxes in the memory. All round keys are generated on the fly in the hardware added to the data path. An implementation of the DES algorithm using these instructions requires two instructions for key scheduling and four instructions for each of the 16 rounds: one desf, one bitwise XOR, and two register data transfers for swapping the left and right halves of the round function output.

B.
IDEA

To support the proposed instruction set extension for IDEA, a modulo $2^{16}+1$ multiplier unit was integrated into the LEON2 data path. The modulo $2^{16}+1$ multiplier is designed based on the adder-based modular multiplier in [70]. The multiplier first generates partial products reduced modulo $2^{16}+1$, as described in Zimmermann's investigation of efficient architectures for arithmetic modulo $2^n+1$ [71]. Each partial product is determined by the formula

[Equation (1)]

where the vector in (1) contains zeros and ones, and its index range extends only downwards. In order to handle cases where either multiplier input is zero, a correction term is defined by a four-case expression over the two inputs:

[Equation (2)]

An intermediate sum $s$ is then computed. The final step is a reduction of $s$ modulo $2^{16}+1$. Defining $s_L$ to be the 16 least significant bits and $s_H$ to be the remaining high-order bits of $s$, so that $s = 2^{16} s_H + s_L$, and noting that $2^{16} \equiv -1 \pmod{2^{16}+1}$, the result of the multiplication is congruent to $s_L - s_H$. Using the low-high lemma [63], and assuming that $0 \le s_L, s_H < 2^{16}$,

$$s \bmod (2^{16}+1) = \begin{cases} s_L - s_H, & \text{if } s_L \ge s_H \\ s_L - s_H + 2^{16} + 1, & \text{if } s_L < s_H. \end{cases} \qquad (3)$$

Additions are implemented in the modulo $2^{16}+1$ multiplier with a carry-propagate adder tree. A generic model is specified in a VHDL source file that is separate from the multiplier source file. The generic component design allows for adjustable width of the inputs.

The mmul16 instruction computes rs1 $\times$ rs2 mod $2^{16}+1$ and stores the product in the rd register. Both source operands must be in the lower 16 bits of their respective registers. The 16-bit product is stored in the lower 16 bits of the rd register. The instruction syntax for the IDEA instruction is of the form mmul16 rs1, rs2, rd.

C. AES

To support the proposed instruction set extensions for AES, S-Box and Galois field fixed field constant multiplier units were integrated into the LEON2 data path. The SubBytes and InvSubBytes S-Boxes are implemented in the S-Box unit as logic-based mappings in hardware. The S-Box unit output is selected based on whether encryption or decryption is being performed.

The Galois field fixed field constant multiplier unit performs the MixColumns or InvMixColumns operation required by AES. The architecture of this multiplier is described in [72]. The MixColumns and InvMixColumns operations are a matrix multiplication over the Galois field GF($2^8$) on each column of the state by a $4 \times 4$ fixed field constant matrix. This means that a total of 16 multiplications in the Galois field GF($2^8$) must be performed to complete the entire operation. Each product must then be reduced modulo $m(x) = x^8 + x^4 + x^3 + x + 1$, the irreducible polynomial specified for AES. To accomplish the multiplication and modular reduction simultaneously, the operation can be represented as an $8 \times 8$ matrix multiplication over the Galois field GF(2). The constants in the inner matrix are determined by the constant factor in the multiplication and the polynomial $m(x)$. The core operation in the fixed field multiplication is an 8-bit inner product that must be performed 16 times, four per row. The four inner products of each row are then combined via a bitwise XOR operation to form the final output word.

For a known primitive polynomial $p(x)$, a known $k(x)$ (representing the 8-bit constant), and a generic input $a(x)$, the resultant polynomial equation takes the form $b(x) = a(x)k(x) \bmod p(x)$, where each coefficient of $b(x)$ is a function of $a(x)$. This results in an $8 \times 8$ matrix representing the coefficients of $b(x)$ in terms of $a(x)$ [73]. An $8 \times 8$ matrix must be generated for each $k(x)$, resulting in a total of 16 matrices. Note that this analysis holds true for Galois fields other than GF($2^8$) with corresponding adjustments to the mapping matrix used to compute $a(x)k(x) \bmod p(x)$.

The aessb, aessbs, aessb4, and aessb4s instructions perform the SubBytes and InvSubBytes operations on either one or four of the bytes in the rs1 register. The instruction operands determine whether the SubBytes operation or the InvSubBytes operation is to be performed. In the case of the single S-Box instructions, the operand also specifies which of the four bytes in the rs1 register

is to be operated upon. The aessbs instruction provides the additional functionality of allowing the user to specify the destination byte in the rd register, while the aessb4s instruction allows the user to specify the number of bytes to left shift the 4-byte result prior to storage in the destination register. The instruction syntax for the instructions that perform the SubBytes and InvSubBytes operations is of the form aesXXXX rs1, imm, rd, where aesXXXX designates one of aessb, aessbs, aessb4, and aessb4s. The value specified in the simm13 field determines the actual operation performed. The least significant bit is set to zero for SubBytes, or one for InvSubBytes. A further bit field of simm13 indicates the byte to be substituted for the aessb and aessbs instructions, with each of its values corresponding to one of the four bytes of rs1. These bits are not used by the aessb4 and aessb4s instructions. Certain other bits of simm13 are ignored by all of the S-Box instructions.

The gfmkld instruction is used to load one of the 16 constants into the $4 \times 4$ constant matrix of the Galois field fixed field constant matrix multiplier. The constants are loaded row by row, beginning with row zero and proceeding in order to row three. Each row is loaded beginning with the constant from column zero and proceeding in order to the constant in column three. Due to the logic that has been added to the multiplier for inclusion into the LEON2 processor data path, instances of the gfmkld instruction may not be issued consecutively. The gfmmul instruction performs the Galois field fixed field constant matrix multiplication on the input in the rs1 register and stores the result in the rd register. The instruction syntax for the AES Galois field matrix loading instruction is of the form gfmkld rs1, rs2. The instruction syntax for the AES Galois field matrix multiplication instruction is of the form gfmmul rs1, imm, rd.

VI.
ANALYSIS OF RESULTS

Functional verification was performed using 12 test vectors for each algorithm. Performance testing measured the execution cycles required to perform one iteration of the target algorithm. Each algorithm was tested in both nonfeedback (electronic code book) and feedback (cipher block chaining) modes of operation for both encryption and decryption. The LEON2 processor implementation was synthesized targeting the Xilinx Virtex-4 XC4VLX25 FPGA.

A. Code Size

The following tables present executable code sizes for implementations of the target algorithms with different combinations of instruction set extensions. For the DES and triple-DES algorithms, the permutation instructions decrease the total code size by up to a factor of 1.3, but have no effect on the key schedule as permutations are not used in the computation of the round keys. Instructions supporting the round key generation and round function have a much more pronounced impact on the code size. When these instructions are used, all of the lengthy permutation routines and memory-based S-Boxes are no longer needed. Encryption and decryption code size is reduced by up to a factor of 4.0 in the case of DES and a factor of 3.5 in the case of triple DES with the use of these instructions alone. Key

TABLE II DES AND TRIPLE-DES CODE SIZE (in BYTES)

TABLE III IDEA CODE SIZE (in BYTES)

scheduling is handled via two instructions, making the respective code size a small percentage of the remaining program code for encryption and decryption. When all of the instruction set extensions are implemented, the DES code size is reduced by factors of up to 31.2 in nonfeedback mode and 21.1 in feedback mode, and the triple-DES code size is reduced by factors of up to 15.9 in nonfeedback mode and 13.0 in feedback mode.

The decryption key schedule code size data for IDEA includes that of the encryption key schedule because the decryption keys are determined from the encryption keys. The key schedule for decryption shows a slight decrease in code size with the use of the mmul16 instruction because of the need to compute multiplicative inverses modulo $2^{16}+1$. The addition of the mmul16 instruction significantly decreases the code size of encryption and decryption by factors of up to 2.8 in nonfeedback mode and 2.4 in feedback mode. Due to the absence of a hardware multiplier in the LEON2 integer unit, the multiplication in the baseline implementation is performed by a library function.

The AES instruction set extensions aessbs and aessb4s yielded either equivalent or reduced code size for both column-oriented and row-oriented implementations versus the aessb and aessb4 instruction set extensions. The use of one S-Box led to the largest reduction in code size for column-oriented implementations, by up to a factor of 1.8 for encryption and 1.6 for decryption. In the case of the row-oriented implementations, the use of four S-Boxes led to the largest reduction in code size, by up to a factor of 2.6 for encryption and 2.2 for decryption. However, the use of one S-Box results in significantly reduced key scheduling code size for row-oriented implementations. When only the gfmmul instruction is incorporated, the code size decreases by up to a factor of 1.3 for encryption and 1.8 for decryption in the column-oriented implementations.
The original code used to implement the InvMixColumns operation for decryption requires many more operations than the MixColumns operation used by encryption. Only four instances of the gfmmul instruction are needed to perform both operations, one for each column of the AES state. The use of

TABLE IV AES CODE SIZE (in BYTES)

TABLE V DES AND TRIPLE-DES EXECUTION CYCLES

TABLE VI IDEA EXECUTION CYCLES

the gfmmul instruction results in up to a factor of 1.1 increase in code size for encryption and a factor of 1.1 decrease in code size for decryption in the row-oriented implementations. This occurs because the MixColumns and InvMixColumns operations operate on the columns of the AES state, requiring additional instructions to rearrange the bytes prior to being processed by the gfmmul instruction in the row-oriented implementations. The effect of the byte rearrangement is that column-oriented implementations require significantly smaller code space than the row-oriented implementations do when all of the instruction set extensions are implemented. For column-oriented implementations, the code size is reduced by up to a factor of 3.6 for encryption and 4.8 for decryption using one S-Box, while for row-oriented implementations, code size is reduced by up to a factor of 2.1 for encryption and 2.3 for decryption using four S-Boxes.

B. Execution Cycles

The following tables present the number of clock cycles required to complete a full iteration of each algorithm. For the DES and triple-DES algorithms, the permutation instructions alone have virtually no impact on the execution cycles. However, the instructions supporting round key generation and the round function have a significant impact on execution cycles, yielding speedups for DES by a factor of up to 17.3 in nonfeedback mode and 16.5 in feedback mode, and speedups for triple DES by a factor of up to 17.3 in

nonfeedback mode and 17.0 in feedback mode. Implementation of all of the instruction set extensions yields speedups for DES by a factor of up to 32.6 in nonfeedback mode and 30.2 in feedback mode, and speedups for triple DES by a factor of up to 32.8 in nonfeedback mode and 32.0 in feedback mode.

The use of the mmul16 instruction significantly decreases the IDEA execution cycle count. While the key scheduling for encryption is unaffected, the use of the mmul16 instruction results in a decryption key scheduling speedup by a factor of 7.8, and speedups of encryption and decryption by up to a factor of 7.7 in nonfeedback mode and 7.1 in feedback mode.

The AES instructions aessbs and aessb4s yielded either equivalent or reduced execution cycles for both column-oriented and row-oriented implementations versus the aessb and aessb4 extensions. The use of one S-Box led to the largest speedup for column-oriented implementations, by up to a factor of 1.4 for encryption and 1.2 for decryption. The use of four S-Boxes led to the largest speedup for row-oriented implementations, by up to a factor of 1.9 for encryption and 1.6 for decryption. However, the use of one S-Box reduced the key scheduling execution cycles for row-oriented implementations. Incorporating only the gfmmul instruction yields speedups of up to a factor of 1.8 for encryption and 3.0 for decryption in the column-oriented implementations. The original code used to implement the InvMixColumns operation for decryption requires many more operations, and thus cycles, than the MixColumns operation used by encryption. The use of the gfmmul instruction results in up to a factor of 1.1 decrease in performance for encryption and a factor of 1.04 speedup for decryption in the row-oriented implementations.
This occurs because the MixColumns and InvMixColumns operations operate on the columns of the AES state, thus requiring additional cycles to rearrange the bytes prior to being processed by the gfmmul instruction in the row-oriented implementations. The effect of the byte rearrangement is such that the speedups of the column-oriented implementations are significantly larger than the speedups of the row-oriented implementations when all of the instruction set extensions are implemented, and the row-oriented implementations perform better when the gfmmul instruction is combined with one S-Box instead of four S-Boxes. Column-oriented implementations yield speedups by a factor of 4.0 for encryption and 6.6 for decryption using one S-Box, while row-oriented implementations yield speedups by a factor of 1.4 for encryption and 1.7 for decryption.

Table VIII details execution cycle counts for some combinations of extensions as compared with results published by the instruction set extensions for cryptography (ISEC) project [22].

OMELIA AND ELBIRT: ENHANCING THE PERFORMANCE OF SYMMETRIC-KEY CRYPTOGRAPHY

1515

TABLE VII AES EXECUTION CYCLES

TABLE VIII AES EXECUTION CYCLES COMPARISON

Both sets of implementations are written in C and use in-line assembly to make use of the instruction set extensions. Their work includes the use of logic-based mappings as a choice of hardware implementation for the S-Boxes. The sbox and sbox4 instructions from the ISEC project are equivalent to the aessbs and aessb4 instructions presented in this research. The mixcol4 instruction uses an entire 32-bit word as input into an AES-specific MixColumns/InvMixColumns functional unit. In order to address the ShiftRows operation, the instructions sbox4s and mixcol4s combine an implicit ShiftRows operation with SubBytes and MixColumns, respectively. This functionality is similar, but not identical, to the combination of the aessb4s and gfmmul instructions. Note that the gfmmul instruction incurs only a small performance penalty versus the mixcol4 instruction while maintaining the flexibility associated with a functional unit that supports generalized Galois field fixed field constant matrix multiplication operations.

C. Hardware Resource Requirements

Table IX shows the component usage of the Xilinx Virtex-4 XC4VLX25 FPGA by each added functional unit and the total utilizations of the LEON2 processor with combinations of instruction set extensions for each targeted algorithm. Due to the number of storage bits needed for matrix configuration and the combinational logic used to compute the matrix product, the Galois field matrix multiplier is the largest of the added hardware units. The implementation with all of the proposed instruction set extensions leads to a total area increase over the baseline configuration of approximately 63%. The modulo $2^{16}+1$ multiplier for IDEA was the single largest contributor to decreases in the implementation's maximum operating frequency. A purely combinational design of the multiplier had large path delays due to the carry-propagate adder structure, yielding a maximum operating frequency of only 72 MHz. The multiplier was modified to create a two-cycle instruction implementation to decrease the

TABLE IX HARDWARE UTILIZATION: XILINX XC4VLX25 FPGA

effect on the processor clock frequency. Implementing all extensions resulted in a clock frequency of approximately 117 MHz, a 10% decrease compared to the baseline implementation.

D. Algorithm Throughput Comparisons

Table X presents throughput data for baseline implementations of the target algorithms versus implementations of the target algorithms using the proposed instruction set extensions when performing encryption in nonfeedback mode. The results are analyzed with respect to the hardware resources required for each implementation to determine the hardware cost associated with improving the execution time of the targeted algorithms using the proposed instruction set extensions. Throughput data are presented in terms of megabits per second, hardware usage is presented in terms of FPGA configurable logic block (CLB) slices, and throughput/area (T/A) ratios are presented in bits per second per slice. The data show that the instruction set extensions yielded the best improvement for DES in terms of T/A ratio, with the ratio increasing by a factor of 25.56. This increase is a result of nearly all of the DES functionality being off-loaded to the added hardware. As expected, T/A ratios for triple DES were

TABLE X THROUGHPUT TO AREA RATIOS

was also demonstrated that column-oriented implementations of AES are superior to row-oriented implementations of AES for all evaluation metrics when using the proposed instruction set extensions. Finally, it was shown that the throughputs for triple DES, IDEA, and AES were approximately the same when accelerated via instruction set extensions, allowing for algorithm agility that is transparent to the user as one algorithm may be easily replaced by another algorithm with minimal performance degradation.

REFERENCES
[1] P. Gil, How big is the internet?, 2005. [Online]. Available: http:// netforbeginners.about.com/cs/technoglossary/f/FAQ3.htm [2] B. Schneier, Applied Cryptography, 2nd ed. New York: Wiley, 1996. [3] R. Doud, Hardware crypto solutions boost VPN, Electron. Eng. Times, no. 1056, pp. 5764, Apr. 1999. [4] K. Aoki and H. Lipmaa, Fast implementations of AES candidates, in Proc. 3rd Adv. Encryption Stand. Candidate Conf., New York, Apr. 2000, pp. 106122. [5] L. Bassham, III, Efciency testing of ANSI C implementations of round 2 candidate algorithms for the advanced encryption standard, in Proc. 3rd Adv. Encryption Stand. Candidate Conf., New York, Apr. 2000, pp. 136148. [6] J. Dray, Nist performance analysis of the nal round java AES candidates, in Proc. 3rd Adv. Encryption Stand. Candidate Conf., New York, Apr. 2000, pp. 149160. [7] A. Sterbenz and P. Lipp, Performance of the AES candidate algorithms in java, in Proc. 3rd Adv. Encryption Stand. Candidate Conf., New York, Apr. 2000, pp. 161168. [8] Altera Corporation, Altera IP megastoreEmbedded processors, 2009. [Online]. Available: http://www.altera.com/products/ip/processors/ipm-index.jsp [9] Xilinx, San Jose, CA, Xilinx embedded processing technology solutions, 2009. [Online]. Available: http://www.xilinx.com/products/design_resources/proc_central/ [10] M. Gschwind, A. A. Jerraya, L. Lavagno, and F. Vahid, Eds., Instruction set selection for ASIP design, in Proc. 7th Int. Symp. Hardw./ Softw. Codes. (CODES), Rome, Italy, Mar. 1999, pp. 711. [11] K. Kkakar, A. A. Jerraya, L. Lavagno, and F. Vahid, Eds., An ASIP design methodology for embedded systems, in Proc. 7th Int. Symp. Hardw./Softw. Codes (CODES), Rome, Italy, Mar. 1999, pp. 1721. [12] A. Wang, E. Killian, D. E. Maydan, and C. Rowen, Hardware/software instruction set congurability for system-on-chip processors, in Proc. 38th Des. Autom. Conf. (DAC), Las Vegas, NV, Jun. 2001, pp. 184188. [13] P. Faraboschi, G. M. Brown, J. A. Fisher, G. Desoli, and M. O. 
Homewood, Lx: A technology platform for customizable VLIW embedded processing, in Proc. 27th Annu. Int. Symp. Comput. Arch. (ISCA), Vancouver, BC, Canada, Jun. 2000, pp. 203213. [14] R. E. Gonzalez, Xtensa: A congurable and extensible processor, IEEE Micro, vol. 20, no. 2, pp. 6070, Mar./Apr. 2000. [15] C. Patterson, . K. Ko and C. Paar, Eds., A dynamic implementation of the serpent block cipher, in Proc. Workshop Cryptographic Hardware Embedded Syst. (CHES 2000), Worcester, MA, Aug. 2000, vol. 1985, Lecture Notes in Computer Science, pp. 142155. [16] C. Patterson, K. L. Pocek and J. M. Arnold, Eds., High performance DES encryption in Virtex FPGAs using JBits, in Proc. 8th Annu. IEEE Symp. Field-Program. Custom Comput. Mach. (FCCM), Napa Valley, CA, Apr. 2000, pp. 113121. [17] S. Tillich, J. Grochdl, and A. Szekely, J. Dittmann, S. Katzenbeisser, and A. Uhl, Eds., An instruction set extension for fast and memory-efcient AES implementation, in Proc. 9th Int. Conf. Commun. Multimedia Security (CMS), Salzburg, Austria, Sep. 2005, vol. 3677, Lecture Notes in Computer Science, pp. 1121. [18] R. B. Lee, Z. Shi, and X. Yang, Efcient permutation instructions for fast software crytography, IEEE Micro, vol. 21, no. 6, pp. 5669, Nov./ Dec. 2001. [19] J. Burke, J. McDonald, and T. M. Austin, Architectural support for fast symmetric-key cryptography, in Proc. 9th Int. Conf. Arch. Support Programming Lang. Oper. Syst. (AS-PLOS), Cambridge, MA, Nov. 2000, pp. 178189.

approximately one-third of the corresponding values for DES because both algorithms require the same hardware in order to be implemented by using the instruction set extensions. The T/A ratio for triple DES increased by a factor of 25.81 when instruction set extensions are used, matching the increase evidenced by DES. In the case of IDEA, the instruction set extensions had a signicant impact upon the performance at a reasonable hardware cost, with the measured T/A ratio increasing by a factor of 5.86. For AES, the instruction set extensions used in conjunction with the column-oriented implementations yield larger increases in throughput versus the instruction set extensions used in conjunction with the row-oriented implementations. As a result, the column-oriented implementations yielded a greater increase in T/A ratio, which increased by a factor of 2.33, than the row-oriented implementations, which increased by a factor of 1.02. Clearly, the column-oriented implementations are superior to the row-oriented implementations when using the proposed instruction set extensions. It is interesting to note that the throughputs for triple DES, IDEA, and AES are approximately the same when accelerated via instruction set extensions20.07, 22.44, and 28.89 Mb/s, respectively. Therefore, when these algorithms, accelerated via the proposed instruction set extensions, are used as the underlying encryption algorithms for a given protocol, a true algorithm agility may be achieved, which is transparent to the user as one algorithm may be easily replaced by another algorithm with minimal performance degradation. VII. CONCLUSION Instruction set extensions for improving software implementations of symmetric-key algorithms have been proposed. Existing literature on the subject of enhancing the performance of symmetric-key algorithms was discussed, followed by detailed descriptions of the targeted processor and the targeted cryptographic algorithms. 
Descriptions of the custom instructions were given along with the functional units that implement the underlying logical and arithmetic operations. The results show that the proposed instructions have a signicant positive effect on the program code size for all of the targeted algorithms, shrinking the number of code bytes by up to a factor of 31.2. The execution time for all algorithms was also improved, with demonstrated speedups by factors of up to 32.8. The instruction set extensions required only a 63% increase in the logic utilization of the LEON2 processor on the chosen FPGA device while decreasing the maximum clock frequency by approximately 10%. All of the targeted algorithms evidenced an increase in T/A ratio, increasing by a factor of up to 25.81. It

OMELIA AND ELBIRT: ENHANCING THE PERFORMANCE OF SYMMETRIC-KEY CRYPTOGRAPHY

1517

[20] J. Grochdl and E. Savas, M. Joye and J. Quisquater, Eds., In( ) and struction set extensions for fast arithmetic in nite elds (2 ), in Proc. Workshop Cryptographic Hardw. Embedded Sys. (CHES), Cambridge, MA, Aug. 2004, vol. 3156, Lecture Notes in Computer Science, pp. 133147. [21] S. Tillich and J. Grochdl, O. Gervasi, M. L. Gavrilova, V. Kumar, A. Lagan, H. P. Lee, Y. Mun, D. Taniar, and C. J. K. Tan, Eds., Accelerating AES using instruction set extensions for elliptic curve cryptography, in Proc. Int. Conf. Comput. Sci. Its Appl. (ICCSA), Singapore, May 2005, vol. 3481, Lecture Notes in Computer Science, pp. 665675. [22] S. Tillich and J. Grochdl, L. Goubin and M. Matsui, Eds., Instruction set extensions for efcient AES implementation on 32-bit processors, in Proc. Workshop Cryptographic Hardw. Embedded Syst. (CHES), Yokohama, Japan, Oct. 2006, vol. 4249, Lecture Notes in Computer Science, pp. 270284. [23] D. C. Feldmeier, A high-speed software DES implementation, Comput. Commun. Res. Group of Bell Commun. Res., 1989. [24] D. A. Osvik, Efcient implementation of the data encryption standard, M.S. thesis, Dept. Informatics, Univ. Bergensis, Bergen, Norway, 2003. [25] G. Bertoni, L. Breveglieri, P. Fragneto, M. Macchetti, and S. Marchesin, B. S. Kaliski, . K. Ko, and C. Paar, Eds., Efcient software implementation of AES on 32-Bit platforms, in Proc. Workshop Cryptographic Hardware Embedded Syst. (CHES 2002), Redwood Shores, CA, Aug. 1315, 2002, vol. 2523, Lecture Notes in Computer Science, pp. 159171. [26] L. Wu, C. Weaver, and T. Austin, B. Werner, Ed., Cryptomaniac: A fast exible architecture for secure communication, in Proc. 28th Annu. Int. Symp. Comput. Arch. (ISCA), Goteborg, Sweden, Jun. 4, 2001, pp. 110119. [27] D. Oliva, R. Buchty, and N. Heintze, J. H. Moreno, P. K. Murthy, T. M. Conte, and P. Faraboschi, Eds., AES and the Cryptonite crypto processor, in Proc. 2003 Int. Conf. Compilers, Arch. Synth. Embedded Syst. (CASES), San Jose, CA, Oct. 
1, 2003, pp. 198209. [28] A. Hodjat and I. Verbauwhede, Interfacing a high speed crypto accelerator to an embedded CPU, in Proc. 38th Asilomar Conf. Signals, Syst., Comput,, Los Angeles, CA, Nov. 2004, vol. 1, pp. 488492. [29] P. Schaumont, K. Sakiyama, A. Hodjat, and I. Verbauwhede, Embedded software integration for coarse-grain recongurable systems, in Proc. 18th Int. Parallel Distrib. Process. Symp. (IPDPS), Santa Fe, NM, Apr. 2004, pp. 137142. [30] Z. Shi and R. B. Lee, Bit permutation instructions for accelerating software cryptography, in Proc. 11th IEEE Int. Conf. Appl.-Specic Syst., Arch. Processors (ASAP), 2000, pp. 138148. [31] K. Bondalapati and V. K. Prasanna, Recongurable computing: Architectures, models and algorithms, Current Sci., vol. 78, no. 7, pp. 828837, 2000. [32] K. K. Bondalapati, Modeling and mapping for dynamically recongurable hybrid architectures, Ph.D. dissertation, Dept. Comput. Eng., Univ. Southern California, Los Angeles, CA, 2001. [33] S. Hauck, T. W. Fry, M. M. Hosler, and J. P. Kao, The chimaera recongurable function unit, in Proc. 5th Annual IEEE Symp. Field-Program. Custom Comput. Mach. (FCCM), Napa Valley, CA, Apr. 1997, pp. 8796. [34] B. Kastrup, A. Bink, and J. Hoggerbrugge, Concise: A compiler-driven CPLD-based instruction set accelerator, in Proc. 7th Annu. IEEE Symp. Field-Program. Custom Comput. Mach. (FCCM), Napa Valley, CA, Apr. 1999, pp. 92101. [35] M. J. Wirthlin and B. L. Hutchings, A dynamic instruction set computer, in Proc. 3rd Annu. IEEE Symp. Field-Program. Custom Comput. Mach. (FCCM), Napa Valley, CA, USA, Apr. 1995, pp. 99107. [36] H. Singh, M. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. C. Filho, Morphosys: An integrated recongurable system for data-parallel and computation-intensive applications, IEEE Trans. Comput., vol. 49, no. 5, pp. 465481, May 2000. [37] T. J. Callahan, J. R. Hauser, and J. Wawrzynek, The Garp architecture and C compiler, Computer, vol. 33, no. 4, pp. 6269, Apr. 2000. [38] S. 
Bartolini, I. Branovic, R. Giorgi, and E. Martinelli, A performance evaluation of ARM ISA extension for elliptic curve cryptography over binary nite elds, in Proc. 16th Symp. Comput. Arch. High Performance Comput. (SBC-PAD), Foz do Iguau, Brazil, Oct. 2004, pp. 238245.

GF

GF p

[39] S. Tillich and J. Grochdl, A simple architectural enhancement for fast and exible elliptic curve cryptography over binary nite elds (2 ), in Proc. 9th Asia-Pacic Conf. Adv. Comput. Syst. Arch. (ACSAC), Beijing, China, Sep. 2004, vol. 3189, Lecture Notes in Computer Science, pp. 282295. [40] J. Grochdl and G.-A. Kamendje, Instruction set extension for fast elliptic curve cryptography over binary nite elds (2 ), in Proc. 14th IEEE Int. Conf. Appl.-Specic Syst., Arch. Processors (AsSAP), The Hague, The Netherlands, Jun. 2003, pp. 455468. [41] A. Daly, W. Marnane, T. Kerins, and E. Popovici, An FPGA implementation of a ( ) ALU for encryption processors, Microprocessors Microsyst., vol. 28, no. 5/6, pp. 253260, Aug. 2004. [42] P. Kitsos, G. Theodoridis, and O. Koufopavlou, An efcient recongurable multiplier architecture for Galois eld (2 ), Microelectron. J., vol. 34, no. 10, pp. 975980, Oct. 2003. [43] M. A. G. Martinez, G. M. Luna, and F. R. Henriquez, E. Chvez, J. Favela, M. Meja, and A. Oliart, Eds., Hardware implementation of the binary method for exponentiation in (2 ), in Proc. 4th Mexican Int. Conf. Comput. Sci., Tlaxcala, Mexico, Sep. 2003, pp. 131134. [44] E. M. Popovici and P. Fitzpatrick, Algorithm and architecture for a Galois eld multiplicative arithmetic processor, IEEE Trans. Inf. Theory, vol. 49, no. 12, pp. 33033307, Dec. 2003. [45] J. Irwin and D. Page, Using media processors for low-memory AES implementation, in Proc. 14th IEEE Int. Conf. Appl.-Specic Syst., Arch. Processors (ASAP), The Hague, The Netherlands, Jun. 2003, pp. 144154. [46] K. Nadehara, M. Ikekawa, and I. Kuroda, Extended instructions for the AES cryptography and their efcient implementation, in Proc. 18th IEEE Workshop Signal Process. Syst. (SIPS), Austin, TX, Oct. 2004, pp. 152157. [47] S. Ravi, A. Raghunathan, N. Potlapally, and M. Sankaradass, System design methodologies for a wireless security processing platform, in Proc. 2002 Des. Autom. Conf. (DAC), New Orleans, LA, Jun. 
2002, pp. 777782. [48] A. J. Elbirt, W. Yip, B. Chetwynd, and C. Paar, An FPGA-based performance evaluation of the AES block cipher candidate algorithm nalists, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 4, pp. 545557, Aug. 2001. [49] K. Jrvinen, M. Tommiska, and J. Skytt, A fully pipelined memoryless 17.8 Gbps AES-128 encryptor, in Proc. ACM/SIGDA Int. Symp. Field Programmable Gate Arrays (FPGA), Monterey, CA, Feb. 2003, pp. 207215. [50] F. X. Standaert, G. Rouvroy, J. J. Quisquater, and J. D. Legat, Efcient implementation of Rijndael encryption in recongurable hardware: Improvements and design tradeoffs, in Proc. Workshop Cryptogr. Hardw. Embedded Syst. (CHES), Cologne, Germany, Sep. 710, 2003, vol. 2778, Lecture Notes in Computer Science, pp. 334350. [51] A. Hodjat and I. Verbauwhede, A 21.54 Gbit/s fully pipelined AES processor on FPGA, in Proc. 12th Annu. IEEE Symp. FieldProgram. Custom Comput. Mach. (FCCM), Napa, CA, Apr. 2004, pp. 308309. [52] A. Hodjat and I. Verbauwhede, Minimum area cost for a 30 to 70 Gbits/s AES processor, in Proc. IEEE Comput. Soc. Annu. Symp. VLSI Emerging Trends VLSI Syst. Des. (ISVLSI), Lafayette, LA, Feb. 2004, pp. 8388. [53] H. Kuo, I. Verbauwhede, and P. Schaumont, A 2.29 Gbits/section LVI mW non-pipelned Rijndael AES encryption IC in a 1.8 V 0.18 m CMOS technology, in Proc. IEEE Custom Integr. Circuits Conf., Orlando, FL, May 2002, pp. 147150. [54] K. Stevens and O. A. Mohamed, Single-chip FPGA implementation of a pipelined, memory-based AES Rijndael encryption design, in Proc. 8th Annu. Can. Conf. Electr. Comput. Eng. (CCECE), Saskatoon, SK, Canada, May 2005, pp. 12961299. [55] H. Lipmaa, S. Tavares and H. Meijer, Eds., Idea: A cipher for multimedia architectures?, in Proc. 5th Annu. Workshop Sel. Areas Cryptogr., Kingston, ON, Canada, Aug. 1998, vol. 1556, Lecture Notes in Computer Science. [56] J.-O. Haenni, Architecture epic et jeux dinstructions multimdias pour applications cryptographiques, Ph.D. 
dissertation, Dpt. dInformatique, Swiss Federal Inst. Technol., Lausanne, Switzerland, 2002. [57] P. Ganesan, R. Venugopalan, P. Peddabachagari, A. Dean, F. Mueller, and M. Sichitiu, Analyzing and modeling encryption overhead for sensor network nodes, in Proc. 2nd ACM Int. Conf. Wirel. Sens. Netw. Appl. (WSNA), San Diego, CA, Sep. 2003, pp. 151159.

GF

GF

GF p

GF

GF

1518

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 18, NO. 11, NOVEMBER 2010

[58] A. Michalski, K. Gaj, and D. A. Buell, T. Rissa, S. J. E. Wilton, and P. H. W. Leong, Eds., High-throughput recongurable computing: A design study of an idea encryption cryptosystem on the SRC-6E recongurable computer, in Proc. Int. Conf. Field Program. Logic Appl. (FPL), Tampere, Finland, Aug. 2005, pp. 681686. [59] O. Mencer, M. Morf, and M. J. Flynn, Hardware software tri-design of encryption for mobile communication units, in Proc. Int. Conf. Acoust., Speech, Signal Process., Seattle, WA, May 1998, vol. 5, pp. 30453048. [60] Federal Inf. Process. Standards, Nat. Bureau Standards, U.S. Dept. Com., Springeld, VA, Data Encryption Standard NIST FIPS PUB 46-3, 1977. [61] E. Biham, A fast new DES implementation in software, in Proc. 4th Int. Workshop Fast Softw. Encryption, E. Biham, Ed., Haifa, Israel, Jan. 1997, vol. 1267, LNCS, pp. 260272. [62] A. Ptzmann and R. Assman, More efcient software implementations of (generalized) DES, Comput. Security, vol. 12, no. 5, pp. 477500, 1993. [63] X. Lai and J. Massey, A proposal for a new block encryption standard, in Advances in CryptologyEUROCRYPT, I. B. Damgrd, Ed. Berlin, Germany: Springer-Verlag, 1990, vol. 473, Lecture Notes in Computer Science, pp. 389404. [64] Federal Inf. Process. Standards, Nat. Bureau Standards, U.S. Dept. Com., Springeld, VA, Specication for the Advanced Encryption Standard (AES), NIST FIPS PUB 197, 2001. [65] J. Daemen and V. Rijmen, The Design of Rijndael. New York: Springer-Verlag, 2002. [66] V. Rijmen, Rijndael reference code in ANSI C v2.2, 2007. [Online]. Available: http://homes.esat.kuleuven.be/rijmen/rijndaelref.zip [67] Gaisler Research, LEON2 SPARC V8 Compliance Certication, 2003. [Online]. Available: http://www.gaisler.com/images/leoncert.gif [68] Sun Microsystems, Campbell, CA, The SPARC architecture manual, v8, 1992. [69] P. Karn, DES software implementation, 2002. [Online]. Available: http://http://www.citi.umich.edu/projects/apv/ [70] J. 
Beuchat, Modular multiplication for FPGA implementation of the idea block cipher, in Proc. 14th IEEE Int. Conf. Appl.-Specic Syst., Arch. Process. (ASAP), The Hague, The Netherlands, Jun. 2003, pp. 412422.

[71] R. Zimmermann. (1999), Efcient VLSI implementation of modulo (2 1) addition and multiplication, Swiss Federal Inst. Technol. (ETH), Zurich, Switzerland, 1999. [Online]. Available: http://www. stud.ee.ethz.ch/zimmi/publications/ modulo_arith.ps.gz [72] A. J. Elbirt, Efcient implementation of Galois eld xed eld constant multiplication, in Proc. Int. Conf. Inf. Technol., New Gen. (ITNG), Las Vegas, NV, Apr. 2006, pp. 172177. [73] A. J. Elbirt, Recongurable computing for symmetric-Kkey algorithms, Ph.D. dissertation, Dept. Electr. Comput. Eng., Worcester Polytech. Inst., Worcester, MA, 2002.
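As an illustration only, the throughput/area (T/A) metric used in the throughput comparisons (throughput in Mb/s divided by FPGA CLB slices, expressed in bits per second per slice) can be sketched in Python. The function names `ta_ratio`, `improvement_factor`, and `throughput_spread` are hypothetical, and the baseline throughput and slice counts in the usage lines are placeholders, not values from Table X. Only the three accelerated throughputs (20.07, 22.44, and 28.89 Mb/s) are taken from the text.

```python
def ta_ratio(throughput_mbps: float, slices: int) -> float:
    """Throughput/area ratio in bits per second per CLB slice."""
    if slices <= 0:
        raise ValueError("slices must be positive")
    return throughput_mbps * 1e6 / slices


def improvement_factor(baseline: float, extended: float) -> float:
    """Factor by which a metric grows from the baseline to the extended design."""
    return extended / baseline


# Accelerated throughputs quoted in the text, in Mb/s.
ACCELERATED_MBPS = {"triple DES": 20.07, "IDEA": 22.44, "AES": 28.89}


def throughput_spread(throughputs: dict) -> float:
    """Ratio of the fastest to the slowest throughput; 1.0 means identical."""
    values = list(throughputs.values())
    return max(values) / min(values)


if __name__ == "__main__":
    # Placeholder inputs, NOT measurements from the paper.
    base = ta_ratio(throughput_mbps=1.0, slices=5000)
    ext = ta_ratio(throughput_mbps=30.0, slices=6000)
    print(f"T/A improvement: {improvement_factor(base, ext):.2f}x")
    print(f"Accelerated throughput spread: {throughput_spread(ACCELERATED_MBPS):.2f}x")
```

With the throughputs quoted in the text, the fastest of the three accelerated algorithms (AES) is roughly 1.44 times faster than the slowest (triple DES), which gives a sense of what "approximately the same" means for algorithm agility.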

Sean O'Melia (M'93) received the M.S. degree in computer engineering from the University of Massachusetts at Lowell, Lowell, in 2007. He is currently an Associate Technical Staff Member at Massachusetts Institute of Technology (MIT) Lincoln Laboratory, Lexington, where he is engaged in the areas of dynamic-networked environments, applied cryptography, and computer network attack and defense.

Adam J. Elbirt (SM'07) received the Ph.D. degree in electrical engineering from Worcester Polytechnic Institute, Worcester, MA, in 2002. He is currently a Senior Member of the Technical Staff at Charles Stark Draper Laboratory, Inc., Cambridge, MA.
