
2008 International Conference on Signals, Circuits and Systems

VLSI Hardware Evaluation of the Stream Ciphers Salsa20 and ChaCha, and the Compression Function Rumba
L. Henzen, F. Carbognani, N. Felber, and W. Fichtner
Integrated Systems Laboratory, ETH Zurich, Switzerland. E-mail: {henzen, carbo, felber, fw}@iis.ee.ethz.ch

Abstract: Salsa20 is a stream cipher candidate in the software-oriented profile of the eSTREAM project. ChaCha is a successor stream cipher with improved per-round diffusion and, conjecturally, increased resistance to cryptanalysis. Based on the combination of four Salsa20 instances, Rumba is a compression function for hashing schemes. This paper presents the evaluation of five VLSI circuits for Salsa20. Synthesis results for a 0.18 µm CMOS technology show that the fastest implementation achieves a throughput of 6.4 Gbps, while the smallest design requires an area of only 10 k gate equivalents (GE) at 16 Mbps. This work also presents the first hardware implementations of ChaCha and Rumba. The fastest ChaCha design achieves 6.8 Gbps, and the smallest requires an area of 9.1 kGE at 16 Mbps. Furthermore, two Rumba implementations achieve either 17.9 Gbps or a compact area of 16.8 kGE at 12 Mbps.

Index Terms: Stream cipher, hash function, VLSI implementation, encryption and authentication process.

I. INTRODUCTION

In 2004, the European Network of Excellence for Cryptography (ECRYPT) started the eSTREAM project with the intent of identifying a promising set of new stream ciphers. In the end, eight candidates were selected for the final eSTREAM portfolio. The primary decision criterion was the security analysis. Nonetheless, a further evaluation criterion was requested in order to determine the flexibility of the stream ciphers and their suitability for software and hardware implementations. To this end, the candidates were classified as software-oriented (Profile 1) or hardware-oriented (Profile 2) functions. Several works investigated the hardware performance of the candidates, analyzing area, throughput, power consumption, and scalability [1]-[3].

Although the Salsa20 stream cipher [4] was selected for the software portfolio, it had originally been submitted for both profiles. Only in the last phase of the competition was it excluded from Profile 2, since its structure was unlikely to be adaptable to resource-constrained hardware applications. Moreover, Salsa20 and ChaCha, a successor algorithm that modifies the core function of Salsa20 [5] to increase diffusion without degrading performance, rely on the structure of hash functions and use similar internal operations. Hence, Salsa20 and ChaCha turn out to be two valuable cryptographic algorithms that can be integrated in the core of modern hashing schemes. In the context of an investigation on generalized cryptographic attacks applied to incremental hashing, the Rumba algorithm has been presented as a compression function for iterated hashing algorithms [6], [7]. The core of Rumba is based on Salsa20. It is, therefore, conceivable to reuse the improvements applied to the implementations of Salsa20 to speed up Rumba.

Contribution: An analysis of hardware implementations of the stream ciphers Salsa20 and ChaCha is presented. The area and throughput of five VLSI designs of each algorithm have been explored. In addition, two architectures of the Rumba compression function are described. To the best of the authors' knowledge, this is the first paper that investigates hardware implementations of ChaCha and Rumba. The choice has been motivated by the upcoming NIST hash competition, which aims at the definition of a novel hashing standard [8]. The computational core of Salsa20 and ChaCha could indeed be applied to novel hash function candidates, as in the case of Rumba.

Outline: The remainder of this paper is organized as follows. Salsa20, ChaCha, and Rumba are defined in Sec. II. The methodologies used in the VLSI implementation of these algorithms are the topic of Sec. III. Sec. IV gives the performance of the circuits and a comparison of their hardware efficiency. Finally, Sec. V draws the conclusions.

Notation: The algorithms process all variables as sequences of 32-bit words. The i-th word of the variable a is denoted by $a_i$. A tuple corresponds to an array of four words. The $+$ operator denotes wordwise integer addition modulo $2^{32}$, which is depicted as ⊞ in block diagrams. The symbol $\oplus$ indicates the exclusive-OR (XOR) logical operation, while $\lll k$ is a k-bit rotation towards the more significant bits.
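The three word operations of the notation can be sketched in a few lines of Python. This is an illustrative sketch only; the names `MASK32`, `add32`, and `rotl32` are assumptions introduced here and do not appear in the paper.

```python
MASK32 = 0xFFFFFFFF  # every variable is handled as a 32-bit word


def add32(a: int, b: int) -> int:
    """Wordwise integer addition modulo 2**32 (the '+' operator)."""
    return (a + b) & MASK32


def rotl32(x: int, k: int) -> int:
    """k-bit rotation towards the more significant bits (x <<< k)."""
    k %= 32
    return ((x << k) | (x >> (32 - k))) & MASK32


# XOR needs no helper: Python's ^ operator already acts bitwise on words.
```

In hardware, `rotl32` with a fixed `k` costs no logic, since it is only a rewiring of the 32 bit lines.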

978-1-4244-2628-7/08/$25.00 © 2008 IEEE


II. ALGORITHM SPECIFICATION

The stream ciphers Salsa20 and ChaCha rest upon the same structure. Rumba is a compression function based on Salsa20. For a comprehensive specification of the three algorithms, refer to [4]-[6].

A. Salsa20

Salsa20 is a stream cipher that works in counter mode. As depicted in Fig. 1, it generates a sequence of keystream blocks Z, which are then XORed with the input message (plaintext) to produce the encrypted message (ciphertext). The internal keystream generation function of Salsa20 takes as input a 256-bit secret key $k = (k_0, k_1, \ldots, k_7)$ and a 64-bit nonce $v = (v_0, v_1)$, i.e., a unique message number, to produce a sequence of 512-bit keystream blocks. The inputs are arranged as a 4 × 4 matrix of 32-bit words

$$X = \begin{pmatrix} x_0 & x_1 & x_2 & x_3 \\ x_4 & x_5 & x_6 & x_7 \\ x_8 & x_9 & x_{10} & x_{11} \\ x_{12} & x_{13} & x_{14} & x_{15} \end{pmatrix} = \begin{pmatrix} \sigma_0 & k_0 & k_1 & k_2 \\ k_3 & \sigma_1 & v_0 & v_1 \\ t_0 & t_1 & \sigma_2 & k_4 \\ k_5 & k_6 & k_7 & \sigma_3 \end{pmatrix}, \tag{1}$$

where the 64-bit counter $t = (t_0, t_1)$ corresponds to the message block index. The $\sigma_i$ are predefined constants. The keystream block Z is then defined as

$$Z = X + DR^r(X). \tag{2}$$

Fig. 1. Encryption process of a stream cipher working in counter mode. The inputs of the keystream generator are the secret key k, the nonce v, and the counter t.

The double-round function DR() consists of two successive applications of four quarter-round functions QR(), over the rotated columns and rows of X. DR() is divided into the column step, which applies four QR() functions to the columns of X, and the row step, which applies them to the rows of X:

$$\begin{matrix} QR(x_0, x_4, x_8, x_{12}) \\ QR(x_5, x_9, x_{13}, x_1) \\ QR(x_{10}, x_{14}, x_2, x_6) \\ QR(x_{15}, x_3, x_7, x_{11}) \end{matrix} \;;\qquad \begin{matrix} QR(x_0, x_1, x_2, x_3) \\ QR(x_5, x_6, x_7, x_4) \\ QR(x_{10}, x_{11}, x_8, x_9) \\ QR(x_{15}, x_{12}, x_{13}, x_{14}) \end{matrix} \;. \tag{3}$$

The QR(a, b, c, d) transformation updates four 32-bit words of the matrix X. It sequentially computes, line by line,

$$\begin{aligned} b &= b \oplus [(a + d) \lll 7], \\ c &= c \oplus [(b + a) \lll 9], \\ d &= d \oplus [(c + b) \lll 13], \\ a &= a \oplus [(d + c) \lll 18] \end{aligned} \tag{4}$$

over the tuple (a, b, c, d). Considering the internal function (2), r double-rounds are executed over the input matrix X. Finally, the updated matrix is added to the original input matrix. Salsa20 has been presented as a stream cipher with r = 10 double-rounds.

B. ChaCha

ChaCha is similar to Salsa20. It follows the same functional computation of the internal keystream generator described by (2). The matrix initialization differs as follows:

$$X = \begin{pmatrix} x_0 & x_1 & x_2 & x_3 \\ x_4 & x_5 & x_6 & x_7 \\ x_8 & x_9 & x_{10} & x_{11} \\ x_{12} & x_{13} & x_{14} & x_{15} \end{pmatrix} = \begin{pmatrix} \sigma_0 & \sigma_1 & \sigma_2 & \sigma_3 \\ k_0 & k_1 & k_2 & k_3 \\ k_4 & k_5 & k_6 & k_7 \\ t_0 & t_1 & v_0 & v_1 \end{pmatrix}. \tag{5}$$

The second distinction concerns the double-round function. DR() is defined over the original column step and a successive step that acts on the diagonals of X:

$$\begin{matrix} QR(x_0, x_4, x_8, x_{12}) \\ QR(x_1, x_5, x_9, x_{13}) \\ QR(x_2, x_6, x_{10}, x_{14}) \\ QR(x_3, x_7, x_{11}, x_{15}) \end{matrix} \;;\qquad \begin{matrix} QR(x_0, x_5, x_{10}, x_{15}) \\ QR(x_1, x_6, x_{11}, x_{12}) \\ QR(x_2, x_7, x_8, x_{13}) \\ QR(x_3, x_4, x_9, x_{14}) \end{matrix} \;. \tag{6}$$

The last modification refers to the structure of the quarter-round function. QR(a, b, c, d) is defined over the sequence

$$\begin{aligned} a &= a + b, & d &= (d \oplus a) \lll 16, \\ c &= c + d, & b &= (b \oplus c) \lll 12, \\ a &= a + b, & d &= (d \oplus a) \lll 8, \\ c &= c + d, & b &= (b \oplus c) \lll 7. \end{aligned} \tag{7}$$

The number of operations computed in QR() remains the same, but the four 32-bit words are updated twice, giving each word a chance to affect the other words. This translates into an increase in the amount of diffusion per round. Although the author of [5] suggests a round-reduced version of ChaCha, this work uses r = 10 double-rounds, as for Salsa20.

C. Rumba

Rumba is a compression function for hashing algorithms, built on the Salsa20 keystream generator of (2). It maps a 1536-bit message M to a 512-bit (intermediate) hash value.
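The quarter-round definitions (4) and (7), the step schedules (3) and (6), and the block function (2) can be sketched in Python as follows. This is a minimal software model written for illustration, not the authors' hardware description; all identifiers (`rotl`, `qr_salsa20`, `qr_chacha`, `SALSA20_DR`, `CHACHA_DR`, `keystream_block`) are assumptions introduced here, and the caller is expected to supply the 16-word state X already filled according to (1) or (5).

```python
MASK32 = 0xFFFFFFFF


def rotl(x, k):
    """Rotate the 32-bit word x by k bits towards the more significant bits."""
    return ((x << k) | (x >> (32 - k))) & MASK32


def qr_salsa20(a, b, c, d):
    """Salsa20 quarter-round, following (4): each word is updated once."""
    b ^= rotl((a + d) & MASK32, 7)
    c ^= rotl((b + a) & MASK32, 9)
    d ^= rotl((c + b) & MASK32, 13)
    a ^= rotl((d + c) & MASK32, 18)
    return a, b, c, d


def qr_chacha(a, b, c, d):
    """ChaCha quarter-round, following (7): each word is updated twice."""
    a = (a + b) & MASK32
    d = rotl(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl(b ^ c, 7)
    return a, b, c, d


# Word indices fed to each QR() call of one double-round.
SALSA20_DR = [(0, 4, 8, 12), (5, 9, 13, 1), (10, 14, 2, 6), (15, 3, 7, 11),   # column step (3)
              (0, 1, 2, 3), (5, 6, 7, 4), (10, 11, 8, 9), (15, 12, 13, 14)]   # row step (3)
CHACHA_DR = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),    # column step (6)
             (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]    # diagonal step (6)


def keystream_block(x, qr, schedule, r=10):
    """Compute Z = X + DR^r(X) of (2) for a 16-word state x."""
    state = list(x)
    for _ in range(r):
        for i, j, k, l in schedule:
            state[i], state[j], state[k], state[l] = qr(
                state[i], state[j], state[k], state[l])
    # Final wordwise addition of the updated state and the initial state.
    return [(s + x0) & MASK32 for s, x0 in zip(state, x)]
```

Within one step the four quarter-rounds touch disjoint words, so the sequential loop above gives the same result as four parallel units. An all-zero state maps to an all-zero block under either cipher, which is a convenient sanity check for a model like this one.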


The message M is divided into four 384-bit blocks $M_0, M_1, M_2, M_3$, which are then compressed by the Rumba function as follows:

$$\begin{aligned} \mathrm{Rumba}(M) &= f_0(M_0) \oplus f_1(M_1) \oplus f_2(M_2) \oplus f_3(M_3) \\ &= (X_0 + DR^r(X_0)) \oplus (X_1 + DR^r(X_1)) \oplus (X_2 + DR^r(X_2)) \oplus (X_3 + DR^r(X_3)). \end{aligned} \tag{8}$$

The function $f_i()$ combines the input block with distinct diagonal constants to complete the initial 512-bit state matrix $X_i$ (see [6]). Afterwards, r = 10 double-rounds of the Salsa20 function are computed.

III. HARDWARE IMPLEMENTATIONS

Since Rumba applies the internal round transformation of Salsa20, its hardware implementation depends on the structure of the Salsa20 circuit that processes the 384-bit message blocks and combines the final keystreams to generate the output. The performance of the compression function thus relies on the design of the stream cipher. This section therefore focuses on the implementation of five architectures of the eSTREAM candidate Salsa20 and its successor ChaCha.

In the encryption process, the underlying stream ciphers generate a sequence of secure bits called the keystream, which is then combined with the plaintext to generate the ciphertext. Conversely, in the decryption process, the plaintext is recovered simply by combining the ciphertext with the same keystream sequence. This specific structure, defined as the counter mode of operation [9], does not require a dedicated inverse cipher.

Although the structure of the counter mode supports the application of loop-unrolling strategies, the iterative nature of the keystream generator leads to a basic architecture in which one round is implemented in combinational logic, with the addition of a 512-bit register to store the state matrix X. Assuming the initial state (key, nonce, and counter) to be constant during the iterations, Fig. 2 represents the basic structure used to implement the stream ciphers in hardware. The presence of the final Word Adder unit inhibits the implementation of pipelined architectures. The application of pipeline stages requires the insertion of parallel registers for the initial state path of X as well. The resulting structure leads to large, hardware-inefficient circuits, as evaluated in [10]. Consequently, this work focuses on the investigation of circuits based on the model of Fig. 2. In addition, the control operations for the computation of the rounds are executed by a control unit, implemented as a finite state machine.

Fig. 2. Overview of the basic iterative keystream architecture of the stream ciphers Salsa20 and ChaCha. The Word Adder module performs the addition of the initial state with the updated state matrix X. The datapaths are 512 bits wide.

The design space has been explored through the implementation of several variants of the Round Module. The DR() and QR() functions, hosted in this unit, present two fundamental peculiarities:

• Parallelism: many operations executed in the round process can be computed at the same time, reducing the processing delay;

• Scalability: the round function can be broken into similar computational steps, allowing different performance trade-offs between area and speed.

These two dimensions of the design space have therefore been analyzed as follows.

A. Double-Round Optimization

The basic implementation of the Round Module instantiates a complete double-round function. It consists altogether of eight QR() modules; the first four calls to QR() are processed in parallel, because each of them updates a distinct column of the matrix X. The last four calls update distinct rows in Salsa20 (row step) or distinct diagonals in ChaCha (diagonal step). This isomorphic architecture, referred to as 8xQR(), needs ten clock cycles to compute the r = 10 double-rounds.

An iterative decomposition of DR() allows the same four QR() modules to be used for both step transformations. In the 4xQR() design, the number of QR() blocks is thus reduced to four. The resulting circuit performs one DR() in two cycles, so the keystream Z is generated after 2r cycles. Routing the elements $x_i$ into and out of the QR() units for the two step transformations requires additional multiplexers and demultiplexers.

Further decomposition leads to the 1xQR() architecture, where the Round Module consists of a single QR() unit and multiplexer/demultiplexer logic to route the input and output words. The complete DR() procedure is computed by iterating over eight cycles, so the generation of Z takes 8r clock cycles. Area reduction has been the main goal of this implementation.

B. Quarter-Round Optimization

This investigation deals with the internal analysis of the QR() function. As described in (4) and (7), Salsa20 and ChaCha use four wordwise integer additions modulo $2^{32}$, four XOR operations, and four bit rotations to update the 4-word tuple (see Fig. 3). Although ChaCha applies the operations in a different order and updates each word twice during each call, the structure of QR() can be divided into four operational stages, as for Salsa20. Every stage computes addition, XOR, and rotation in the same order. By applying an internal iterative decomposition of QR(), the size of the circuit can be further reduced by decreasing the number of implemented stages. With a single stage, the QR() transformation is executed in four cycles.

Fig. 3. Operation overview of the QR() function of Salsa20 (A) and ChaCha (B). The tuple to the right contains the updated values of the input tuple (a, b, c, d). The datapaths are 32-bit words.

Fig. 4 shows the block diagram of the circuit for a single-stage implementation of the QR() function. The correct routing of the $x_i$ elements into and out of the stage is managed by multiplexers and demultiplexers, which control the inputs $in_i$ and the outputs $out_i$ of the S-QR() module. The four rotation operators are instantiated separately, with control logic that selects the required rotation. In hardware, these wordwise rotations correspond to a straightforward rerouting of the bits and thus do not affect the propagation delay of the circuits.

Fig. 4. Block diagram of the Salsa20 and ChaCha single-stage modules S-QR().

The 4xS-QR() architecture includes four of the S-QR() units of Fig. 4, which update the elements of the matrix X in parallel. This solution takes 8r cycles to generate the keystream, since eight cycles are required to perform the DR() function. The last evaluated design is a lightweight implementation of Salsa20 and ChaCha, where the Round Module uses a single S-QR() stage. Every parallel process is computed iteratively through the same hardware logic. Since the latency of the keystream generation increases drastically, this architecture is suitable for low-area and low-speed applications. The overall cycle count of the keystream generation is 32r. The 1xS-QR() design concentrates the optimization effort on achieving a compact circuit for the stream cipher.

IV. RESULTS AND COMPARISON

The architectures presented in the previous section have been synthesized with the Synopsys Design Compiler using a 0.18 µm CMOS technology. Tab. I gives the throughput and the resulting area of the circuits designed for maximal speed. In addition, the column "hardware efficiency" gives the ratio between throughput and size, allowing a better comparison of the investigated designs. The comparable performance of Salsa20 and ChaCha is due to the similar structure of the double-round and quarter-round functions. The highest throughput is obtained with the 8xQR() and 4xQR() versions. Throughput values of up to 6.8 Gbps (Tab. I) demonstrate that Salsa20 and ChaCha can be compared with specific hardware-oriented eSTREAM candidates. In [2] and [11], the VLSI implementation of several stream ciphers has been investigated with different CMOS technologies. Among the final candidates of Profile 2, Trivium [12] is the only algorithm that shows evident advantages in hardware in terms of speed. Furthermore, the 4xQR() implementations of Salsa20 and ChaCha turn out to be the circuits with the highest hardware efficiency.
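The throughput figures follow from the cycle counts stated in Sec. III: one 512-bit keystream block is produced every r, 2r, 8r, 8r, or 32r cycles, depending on the Round Module. The short Python sketch below makes this relation explicit. It assumes that no additional cycles are spent on loading the state or on the final addition, which is an inference from the reported numbers rather than a statement of the paper; the names `CYCLES_PER_DR`, `keystream_cycles`, and `throughput_bps` are hypothetical.

```python
BLOCK_BITS = 512  # size of one keystream block Z

# Clock cycles needed for one double-round DR() in each architecture.
CYCLES_PER_DR = {
    "8xQR": 1,     # full double-round in combinational logic
    "4xQR": 2,     # column step, then row/diagonal step
    "4xS-QR": 8,   # four single-stage units, 4 cycles per QR(), 2 steps
    "1xQR": 8,     # one QR() unit, eight calls per DR()
    "1xS-QR": 32,  # one single-stage unit, 8 calls x 4 cycles
}


def keystream_cycles(arch, r=10):
    """Clock cycles to generate one keystream block with r double-rounds."""
    return CYCLES_PER_DR[arch] * r


def throughput_bps(arch, freq_hz, r=10):
    """Keystream throughput in bit/s at clock frequency freq_hz.

    Assumption: state loading and the final word addition cost no extra cycles.
    """
    return BLOCK_BITS * freq_hz / keystream_cycles(arch, r)
```

For instance, `throughput_bps("8xQR", 125e6)` evaluates to 6.4e9 bit/s, the value reported for the fastest Salsa20 design.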


TABLE I
VLSI SYNTHESIS RESULTS OF THE SALSA20, CHACHA, AND RUMBA IMPLEMENTED ARCHITECTURES.

Ref.       Function  Area [kGE]  Freq. [MHz]  Throughput [Gbps]  HW-eff. [Kbps/GE]
8xQR()     Salsa20   39.64       125           6.400             161.45
8xQR()     ChaCha    39.54       132           6.782             171.51
4xQR()     Salsa20   24.06       241           6.169             256.39
4xQR()     ChaCha    28.11       215           5.505             195.84
4xS-QR()   Salsa20   22.81       365           2.336             102.40
4xS-QR()   ChaCha    22.44       366           2.344             104.45
1xQR()     Salsa20   17.06       209           1.339              78.47
1xQR()     ChaCha    16.69       196           1.252              75.03
1xS-QR()   Salsa20   14.89       362           0.580              38.93
1xS-QR()   ChaCha    14.28       389           0.623              43.59
4xQR()     Rumba     95.51       233          17.903             187.44
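The hardware-efficiency column of Tab. I is the throughput divided by the area, with a unit conversion from Gbps and kGE to Kbps and GE. A minimal Python sketch of that ratio follows; the function name `hw_efficiency_kbps_per_ge` is an assumption introduced here.

```python
def hw_efficiency_kbps_per_ge(throughput_gbps, area_kge):
    """Throughput-to-area ratio in Kbps per gate equivalent."""
    throughput_kbps = throughput_gbps * 1e6  # 1 Gbps = 10**6 Kbps
    area_ge = area_kge * 1e3                 # 1 kGE  = 10**3 GE
    return throughput_kbps / area_ge
```

Applied to the tabulated throughput and area values, this reproduces the efficiency column up to small rounding differences, since the tabulated inputs are themselves rounded.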

TABLE II
IMPLEMENTATION RESULTS OF THE LOW-AREA ARCHITECTURES OF SALSA20, CHACHA, AND RUMBA.

Ref.        Function  Area [kGE]  Frequency [MHz]  Throughput [Mbps]
1xQR()      Salsa20   10.39       10               64
1xQR()      ChaCha     9.77       10               64
1xS-QR()    Salsa20    9.97       10               16
1xS-QR()    ChaCha     9.11       10               16
41xS-QR()   Rumba     16.78       10               12
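The Rumba rows of Tab. I and Tab. II can be related to the stream-cipher cycle counts in the same way. The sketch below assumes that the throughput is counted over the 1536 input message bits and that combining the four keystreams costs no extra cycles; both points are inferences from the tabulated values, not statements of the paper, and the name `rumba_throughput_bps` is hypothetical.

```python
MESSAGE_BITS = 4 * 384  # one Rumba input: four 384-bit blocks (1536 bits)


def rumba_throughput_bps(freq_hz, cycles_per_dr, sequential_passes, r=10):
    """Rumba input throughput in bit/s.

    cycles_per_dr      clock cycles per double-round of the Salsa20 core
    sequential_passes  1 if four cores run in parallel, 4 if one core
                       processes the four blocks one after another
    Assumption: the XOR combination of the keystreams adds no cycles.
    """
    cycles = cycles_per_dr * r * sequential_passes
    return MESSAGE_BITS * freq_hz / cycles


# Fast design: four parallel 4xQR() cores (2 cycles per double-round).
fast = rumba_throughput_bps(233e6, cycles_per_dr=2, sequential_passes=1)

# Compact design: one 1xS-QR() core (32 cycles per double-round), four passes.
compact = rumba_throughput_bps(10e6, cycles_per_dr=32, sequential_passes=4)
```

Under these assumptions the compact design yields 12 Mbps at 10 MHz, and the fast design comes out within a fraction of a percent of the tabulated 17.903 Gbps; the small gap is consistent with the clock frequency in Tab. I being rounded to an integer number of MHz.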

The 1xQR() and 1xS-QR() implementations of Salsa20 and ChaCha have also been synthesized with relaxed timing constraints to force minimal chip area (see results in Tab. II). With the double-round optimization procedure, an area of about 10 kGE¹ is achieved in 1xQR(). Adopting the single-stage solution, a size reduction of up to 5 % for Salsa20 and 9 % for ChaCha is reached. The resulting throughput is 16 Mbps. This rather modest area reduction is due to the increased control logic outside the S-QR() module. Nonetheless, these designs point out the adaptability of the Salsa20 and ChaCha stream ciphers to applications in radio frequency identification (RFID) systems, where a compact chip size is the main constraint.

In Tab. I and Tab. II, two implementations of the Rumba compression function are presented. The faster design, reaching 17.9 Gbps, is obtained by instantiating four Salsa20 4xQR() modules, which process the 384-bit message blocks $M_i$ in parallel. After 2r clock cycles, the final keystreams of the $f_i()$ functions are combined to generate the 512-bit output value. The second architecture of Rumba relies on a single lightweight 1xS-QR() module of Salsa20, which processes the four message blocks sequentially. It has only been synthesized with area optimizations at the minimal frequency of 10 MHz. To store the intermediate keystreams, a 512-bit register-based memory has additionally been included.

¹ One gate equivalent (GE) corresponds, in a 0.18 µm CMOS technology, to the area of a two-input drive-one NAND gate of size 9.7 µm².

V. CONCLUSIONS

In this paper, the VLSI characterization of the stream cipher Salsa20, its successor ChaCha, and the compression function Rumba is presented. In total, twelve implementations are investigated and evaluated in a 0.18 µm CMOS standard cell library. High-speed hardware architectures with throughputs exceeding 6 Gbps, and compact implementations with a circuit complexity of less than 10 kGE, demonstrate the flexibility of Salsa20 and ChaCha for implementations in modern digital cryptographic applications. Furthermore, the investigation of two architectures that reuse large components of Salsa20 supports the application of Rumba in novel hashing algorithms and provides a first touchstone for further VLSI implementations.

REFERENCES

[1] K. Gaj, G. Southern, and R. Bachimanchi, "Comparison of hardware performance of selected Phase II eSTREAM candidates," eSTREAM, ECRYPT Stream Cipher Project, Report 2007/026 (SASC 2007), 2007, http://www.ecrypt.eu.org/stream.
[2] F. K. Gürkaynak, P. Luethi, N. Bernold, R. Blattmann, V. Goode, M. Marghitola, H. Kaeslin, N. Felber, and W. Fichtner, "Hardware evaluation of eSTREAM candidates: Achterbahn, Grain, MICKEY, MOSQUITO, SFINKS, Trivium, VEST, ZK-Crypt," eSTREAM, ECRYPT Stream Cipher Project, Report 2006/015, 2006, http://www.ecrypt.eu.org/stream.
[3] M. Rogawski, "Hardware evaluation of eSTREAM candidates: Grain, LEX, MICKEY128, Salsa20 and Trivium," eSTREAM, ECRYPT Stream Cipher Project, Report 2007/025 (SASC 2007), 2007, http://www.ecrypt.eu.org/stream.
[4] D. J. Bernstein, "The Salsa20 family of stream ciphers," eSTREAM, ECRYPT Stream Cipher Project, Report 2005/025, 2005, http://www.ecrypt.eu.org/stream.
[5] D. J. Bernstein, "ChaCha, a variant of Salsa20," Jan. 2008, http://cr.yp.to/chacha.html.
[6] D. J. Bernstein, "What output size resists collisions in a XOR of independent expansions?" in Proceedings of the Workshop on Hash Functions ECRYPT, 2007, http://cr.yp.to/rumba20.html.
[7] D. J. Bernstein, "Better price-performance ratios for generalized birthday attacks," in Proceedings of the Workshop on Special-purpose Hardware for Attacking Cryptographic Systems SHARCS, 2007, http://cr.yp.to/rumba20.html.
[8] NIST, "Call for a new cryptographic hash algorithm (SHA-3) family," Federal Register, Vol. 72, No. 212, 2007, http://www.nist.gov/hashcompetition.
[9] M. Dworkin, "Recommendation for block cipher modes of operation," Dec. 2001, NIST Special Publication 800-38A.
[10] J. Yan and H. M. Heys, "Hardware implementation of the Salsa20 and Phelix stream ciphers," in Proceedings of the Canadian Conf. on Electrical and Computer Engineering, Vancouver, BC, Apr. 2007, pp. 1125-1128.
[11] T. Good and M. Benaissa, "Hardware results for selected stream cipher candidates," eSTREAM, ECRYPT Stream Cipher Project, Report 2007/023 (SASC 2007), 2007, http://www.ecrypt.eu.org/stream.
[12] C. De Cannière and B. Preneel, "Trivium - a stream cipher construction inspired by block cipher design principles," eSTREAM, ECRYPT Stream Cipher Project, Report 2005/030, 2005, http://www.ecrypt.eu.org/stream.
