Vous êtes sur la page 1sur 11

The Computer Journal Advance Access published June 2, 2007

# The Author 2007. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved.
For Permissions, please email: journals.permissions@oxfordjournals.org
doi:10.1093/comjnl/bxm023

Analysis and Detection Of Errors


In Implementation Of SHA-512
Algorithms On FPGAs
I MTIAZ AHMAD* AND A. SHOBA DAS
Department of Computer Engineering, Kuwait University, PO Box 5969, Safat 13060, Kuwait
*Corresponding author: imtiaz@eng.kuniv.edu.kw

The Secure Hash Algorithm SHA-512 is a dedicated cryptographic hash function widely considered
for use in data integrity assurance and data origin authentication security services. Reconfigurable
hardware devices such as Field Programmable Gate Arrays (FPGAs) offer a flexible and easily
upgradeable platform for implementation of cryptographic hash functions. Owing to the iterative
structure of SHA-512, even a single transient error at any stage of the hash value computation will
result in large number of errors in the final hash value. Hence, detection of errors becomes a key
design issue. In this paper, we present a detailed analysis of the propagation of errors to the
output in the hardware implementation of SHA-512. Included in this analysis are single, transient
as well as permanent faults that may appear at any stage of the hash value computation. We then
propose an error detection scheme based on parity codes and hardware redundancy. We report the
performance metrics such as area, memory, and throughput for the implementation of SHA-512
with error detection capability on an FPGA of ALTERA. We achieved 100% fault coverage in
the case of single faults with an area overhead of 21% and with a reduced throughput of 11.6%
with the error detection circuit.

Keywords: Hash algorithm; FPGSs; error analysis; error detection


Received 6 August 2006; revised 23 April 2007

1. INTRODUCTION
matching the security of advanced encryption standard
Cryptographic hash functions are recommended by several (AES) with a key size of 256 [4]. SHA-512 also has complex-
Internet engineering task force requests for comments ity of the best birthday attack of 2256.
(RFCs) in applications such as digital signature schemes in In a cryptographic hash function, a message of arbitrary
public-key cryptosystems, password storage and verification. length is first padded and broken into blocks and then con-
Hash functions are also the building blocks of secret-key verted into a fixed-length output (hash value). The hash
message authentication codes used in two popular security values of individual blocks are used iteratively by a com-
protocols, namely: secure sockets layer (SSL) recommended pression function to find the final hash value, referred to as
in RFC 2246 [1] and IPSecurity specified in RFC 2404 [2]. message digest. A common sequence of operations is called
The implementation of cryptographic hash functions using a digest round and the compression function produces a
reconfigurable hardware devices such as field programmable hash value by subjecting a block of message to many digest
gate arrays (FPGAs) has higher performance than software rounds.
implementation in terms of speed and FPGAs are also flexible There are different types of faults and methods of fault
and easily upgradeable. Hash functions, which have a ‘dedi- injection in public-key encryption algorithms. The faults can
cated’ design, are fast and have considerable advantage over be transient or permanent in nature. Transient faults, such as
other algorithms, which are based on block cipher [3]. Dedi- single-event upsets, which are typically a flipping of a
cated hash functions suitable for both software and hardware logical state or multiple-event faults caused by several
implementation have been proposed. One of the widely con- single-event faults can be induced. Some of the permanent
sidered for use dedicated hash functions in real applications faults such as total dose-rate faults due to exposure to
is the Secure Hash Algorithm SHA-512 with the security harmful environment can cause defects. Several transient

THE COMPUTER JOURNAL, 2007


Page 2 of 11 AHMAD AND DAS

and permanent faults and methods of fault injection such as scheme. Yen and Wu [11] proposed a scalable and symmetri-
varying supply voltage, external clocks, temperature or indu- cal error detection scheme based on cyclic redundancy check.
cing faults using white light, laser and X-rays methods of A technique described in Siewiorek and Swarz [12] known as
fault injection are discussed in detail in Bar-EL et al. [5]. If duplication with comparison (DWC) performs CED by dupli-
an attacker deliberately generates a glitch attack, causing a cating the circuit and comparing the two results. Johnson [13]
flip-flop state to change or corrupt data values when they are described triple modular redundancy, a hardware
transferred from one digest operation to another, even a redundancy-based error correction technique with at least
single fault can result in multiple errors in the hash value com- 200% hardware overhead. A method similar to the dupli-
puted. The severity of the problem necessitates detection of cation of the circuit, but with partial replication based on pre-
errors, a key design issue. Moreover, as a digest round consists diction logic functions was used by Drineas and Makris [14]
of several operations, errors can creep in at any of these oper- for concurrent fault detection in random combinational logic
ations and can affect one or several bits at any of the oper- thereby reducing the hardware overhead. Patel and Fung [15]
ations in a digest round. As SHA-512 is considered for use developed a time redundancy-based technique for permanent
in essential security services, concurrent error detection faults by re-computing with shifted operands. Time redun-
(CED) is very important. This necessitates an analysis on dancy techniques using alternating data are proposed by Rey-
the propagation of error from the point of origin to the nolds and Metze [16]. A register transfer level CED
output when implemented in hardware. technique based on time redundancy was proposed by Karri
Even though CED is very desirable, it has certain associated and Wu [17]. A combination of hardware and time redun-
penalties such as hardware cost and the performance degra- dancy techniques referred to as quadruple time redundancy
dation due to interaction between the circuit and the detection is proposed by Townsend et al. [18] for concurrent error cor-
logic, which need to be considered while designing the error recting adder design. For performing arithmetic operations,
detection circuit. The design goal of the CED is to achieve Parhami [19] suggested a parity code-based system using
100% error detection with minimal penalty. CED techniques even-parity redundant operands. Several fault-secure
involve redundancy (extra logic) in one form or other such designs for arithmetic operators based on parity prediction
as hardware, time or information. A CED circuit based on are reported by Nicolaidis et al. [20]. Several information
hardware redundancy duplicates the complete circuit. Dupli- redundancy-based CED techniques use codes such as
cation targets unrestricted error models, but this approach Bose-Lin codes [21], Berger codes [22]. A method that com-
will result in 100% hardware overhead. Time redundancy bines parity check codes and duplication is reported by
techniques re-compute the output at different times with the Almukhaizim and Makriset al. [23] for fault tolerant design
same circuit thereby increasing the performance overhead by of combinational and sequential logic. A non-intrusive CED
100%. Moreover, only transient faults can be detected by technique based on compaction of circuit outputs, prediction
these CED circuits as the original faulty circuit is used for of the compacted responses and comparison of restricted
re-computation. In information redundancy technique, data error models is suggested by Almukhaizim et al. [24]. The
are appended with additional bits and a coding scheme is idea of multiple parity bits was introduced by Sogomonyan
used to detect errors. Coding techniques marginally increase [25].
the hardware as well as performance overhead and are suitable A single-chip FPGA solution for two of the dedicated hash
for simple arithmetic operations with restricted error models. functions SHA-384 and SHA-512 was proposed by McLoone
Combinations of the above techniques are also employed to and McCanny [26]. A comparative analysis of the hardware
minimize the overhead for CED. implementation of SHA-1 and SHA-512 on Xilinix Virtex
Many techniques have been reported [6– 25] for CED in FPGA was done by Grembowski et al. [27]. An elegant appli-
different application domains. Karri et al. [6] investigated a cation specific integrated circuit (ASIC) implementation of
low-cost, low-latency CED for Rijndael symmetric encryp- SHA-512 by making use of delay balancing and pipelining
tion algorithm. A low-cost method based on a modification is recently reported by Dadda and Macchetti [28]. None of
of parity bit of input into the parity bit of the output for the above-mentioned hardware implementations of dedicated
AES encryption algorithm was implemented on a Xilinx hash functions considered the issue of fault tolerance. To the
Virtex 1000 FPGA by Wu et al. [7]. A complete scheme best of our knowledge, an analysis on the propagation of
for parity-based fault detection in a hardware implementation errors and proposing error detection schemes for hardware
of the AES was presented by Bertoni et al. [8]. Wu and Karri implementation of SHA-512 has not been done.
[9] proposed a CED technique using idle cycles in a data path In this paper, we analyze the propagation of single and mul-
to do the re-computation with RC6 encryption as a case tiple errors occurring at different operations of a digest round
study. Bertoni et al. [10] have done a detailed study on the in the hardware implementation of SHA-512. The presence of
propagation of errors on a hardware implementation of Rijn- both transient and permanent faults that may occur at different
dael AES and proposed two fault detection schemes, namely: stages of the hash value computation are considered in
a redundancy-based scheme and an error detection coding the analysis. As the hardware implementation on a FPGA is

THE COMPUTER JOURNAL, 2007


IMPLEMENTATION OF SHA-512 ALGORITHMS ON FPGAS Page 3 of 11

too complex for an exhaustive analysis, we implemented TABLE 1: Initial hash values of SHA-512
SHA-512 in software only for carrying out the analysis. We
propose an error detection scheme based on information as H00 ! a 6a09e667 f3bcc908 H04 ! e 510e527f ade682d1
well as hardware redundancy schemes. Errors in most of the H01 ! b bb67ae85 84caa73b H05 ! f 9b05688c 2b3e6c1f
operations in a digest round are detected by simple parity cir- H02 ! c 3c6ef372 fe94f82b H06 ! g 1f83d9ab fb41bd6b
cuits and errors in a few of the remaining operations are H03 !d a54ff53a 5f1d36f1 H07 ! h 5be0cd19 137e2179
detected by the simplest DWC technique wherein the hard-
ware pertaining to the operations are duplicated and a com-
parison is made between the original and the duplicated Step 1: The hash values, Hi21
0 to Hi21
7 , are assigned to vari-
circuits. As DWC technique results in minimum additional ables a, b, c, d, e, f, g and h. The eight initial hash values,
delay with 100% hardware overhead, we used DWC technique which are 64 bits wide, are shown in Table 1.
only for non-linear operations of SHA-512. Performance Steps 2 and 3 are performed 80 times on a 1024-bit block.
metrics such as area, memory and throughput are also com- Step 2:
puted for the implementation of SHA-512 with error detection Message schedule
capability on a FPGA of ALTERA.

The remaining paper is organized in the following manner. Bit ; 0  t  15
In Section 2, we discuss the SHA-512 algorithm. In Section 3, Wt ¼
s1 ðWt2 þ Wt7 þ s0 ðWt15 Þ þ Wt16 ; 16  t  80;
an analysis of error propagation for single errors is discussed.
Section 4 explains the error detection techniques proposed in where,
this paper. Results are discussed in Section 5, and Section 6
concludes the paper. s1 ¼ ROTR 19 ðxÞ  ROTR 61 ðxÞ  SHR 6 ðxÞ;
s0 ¼ ROTR 1 ðxÞ  ROTR 8 ðxÞ  SHR 7 ðxÞ

2. SHA-512 ALGORITHM where ROTR n(x) is a circular rotation of a variable x by n


In this section, the SHA-512 algorithm is discussed in detail. positions to the right and SHR n(x) is shifting of a variable x
When a message of any length ,2128 bits is input, the by n positions to the right.
SHA-512 computes a 512 bits long condensed representation Step 3:
of message, referred to as message digest. The procedure for † A sequence of 80 constant 64-bit words, Kt are used by
the message digest computation consists of two stages, the processing unit.
namely: preprocessing and hash computation. In the prepro- P
† The Pprocessing unit uses four functions, Ch and Maj, 0
cessing stage, the message is padded, parsed into m-bit and 1.
blocks and initialization values to be used in the hash compu-
Chðx; y; zÞ ¼ ðx ^ yÞ  ð: x ^ zÞ;
tation are set. A message scheduler divides each m-bit block
into 16 words and prepares a message schedule by passing Majðx; y; zÞ ¼ ðx ^ yÞ  ðx ^ zÞ  ðy ^ zÞ;
one word at a time. A series of hash values are generated itera- X
0
ðxÞ ¼ ROTR 28 ðxÞ  ROTR 34 ðxÞ  ROTR 39 ðxÞ;
tively from functions, constants and word operations and the X
final hash value is called the message digest. The operations ðxÞ ¼ ROTR 14 ðxÞ  ROTR 18 ðxÞ  ROTR 41 ðxÞ;
1
performed during the two stages are listed below: X
Preprocessing: T1 ¼ h þ 1
ðeÞ þ Chðe; f ; gÞ þ Kt þ Wt ;
X
(1) Padding the message into multiples of 1024 bits. T2 ¼ ðaÞ þ Maj ða; b; cÞ;
0
(2) Parsing the padded message into N message blocks
B 0, B 1,..,B N, where the block size is 1024 bits. h ¼ g;
Hash computation: g ¼ f;
f ¼ e;
(1) Each message block B i is processed in order. A word
(64 bits wide) of a message block B i is referred to as e ¼ d þ T1 ;
B it and in a block there are 16 such words. d ¼ c;
(2) For each message block i in the range 1– N, after assign- c ¼ b;
ing initial hash values (Step 1), following steps (1 – 4),
b ¼ a;
referred to as a digest round, are carried out to
compute hash values Hi0 to Hi7 for the ith block. a ¼ T1 þ T2 :

THE COMPUTER JOURNAL, 2007


Page 4 of 11 AHMAD AND DAS

The message digest computation in an SHA-512 involves a


Step 4: The ith intermediate hash values Hi0 to Hi7 are com- sequence of operations (digest round) performed on a word B it
puted by modulo-64 bit adders after the iterations. of a message block B i. The digest round consists of arithmetic
and logical operations listed in Steps 2 and 3 of the hash value
computation. These two steps are repeated 80 times to
H0i ¼ a þ H0i1 H1i ¼ b þ H1i1 H2i ¼ c þ H2i1
compute the hash value H i for a block. The second stage is
H3i ¼ d þ H3i1 H4i ¼ e þ H4i1 H5i ¼ f þ H5i1 the block round, which repeats digest round operations on
different blocks of the message to compute the final hash
H6i ¼ g þ H6i1 H7i ¼ h þ H7i1 value H. The digest and block round operations are depicted
in Fig. 2.
The basic operations of an SHA-512 are listed below:
† The message digest is computed by HN0 kHN1 kHN2 kHN3 kHN4
kHN5 kHN6 kHN7 after processing all N blocks in the (a) Memory operation:
message. Constants Kt and initial H values are read from memory.
The computation of message schedule Wt ¼ s1 (Wt22) þ (b) Register transfer:
Wt27 þ s0 (Wt215) þ Wt216 requires four operand modulo- There are 24 registers, W0 2W15 and a2h, and the oper-
ands are transferred from one register to another.
P The intermediate value a ¼ T1 þ T2Pwhere,
64 additions.
T1 ¼ h þ 1(e) þ Ch(e,f, g) þ Kt þ Wt and T2 ¼ 0(a) þ The memory and register transfer operations are straightfor-
Maj(a, b, c) and hence computation of a requires seven ward transfers of operands without any modification to input.
operand additions. The ith intermediate hash values Hi0 to Hi7 (c) Addition using FAA:
are computed by modulo-64 bit adders after the iterations In total, there are eight FAAs, which use separate cir-
and hence, an additional eight modulo-64 adders are required cuits for pseudo-sum and pseudo-carry computations. If
to find the final hash value for the block. Carry save adders X, Y and Z are the operands that are added in a FAA,
(CSAs) are used for addition as they are faster and have a the pseudo-sum ¼ X  Y  Z and pseudo-carry ¼ XY þ
smaller area than carry look ahead (CLA) adders. In a CSA, XZ þ YZ.
a full array adders (FAA) performs addition of three binary (d) Logic operations:
vectors without propagating the carries and generate two Logic functions Ch and Maj.
binary vectors, pseudo-sum and pseudo-carry. Reduction by (e) Rotate/shift operations: P P
rows technique is used to perform the multi operand addition Four functions s1, s0, and 0 and 1 involving
wherein an array of full adders is used with a CLA in the last rotation/shift operation.
stage to add the final pseudo-sum and pseudo-carry. The (f) Addition using CLA:
implementation of an SHA-512 with CSAs reported by Final addition of pseudo-sum and pseudo carry using
Dadda and Macchetti [28] is shown in Fig. 1. three CLAs.

FIGURE 1: SHA-512 digest round implementation using CSAs

THE COMPUTER JOURNAL, 2007


IMPLEMENTATION OF SHA-512 ALGORITHMS ON FPGAS Page 5 of 11

that at any time only one bit will be in error and the error
detection circuits are also designed based on this model.
Included in this study are transient as well as permanent
faults. In Section 5, we discuss in detail the quantum of experi-
ments conducted.

3.1 Error analysis of individual operations in a


digest round
In this section, we discuss the effect of a single error on every
operation of a digest round individually. Even though the
digest round of SHA-512 consists of operations performed
in sequence as shown in Fig. 1, we discuss in this section
the effect of an error in the basic operations listed in Section 2.
Experiments were conducted by injecting a single error in
one of the inputs of an operation block and obtaining the
number of erroneous bits at the output of the same block. As
Steps 2 and 3, shown in Fig. 2, are repeated 80 times, hereafter
referred to as simply rounds, the errors were introduced at
FIGURE 2: Digest and block rounds in SHA-512 different bits randomly in every round and the number of
bits that were in error was computed. The distribution of
number of erroneous bits was plotted by computing the fre-
All operations of a digest round except rotation/shift can be quency of number of errors, where, frequency of errors ¼
performed on bytes. Hence, the operands, which are 64 bits (number of times specific number of bits in error/80). As
wide, are partitioned into 8 bytes to facilitate better error memory and register transfer operations merely transfer the
detection capability. input to output, the erroneous bit in the input is also transferred
to the output at the same position. Similarly, in the addition
operation performed by FAA, an erroneous bit in any one of
3. ERROR ANALYSIS
operands results in single-bit error in the pseudo-sum output
In this section, the propagation of error in an SHA-512 is dis- in the same position. The situation is different in the case of
cussed. Error analysis is carried out to understand the effect of pseudo-carry output of FAA. The error is shifted to the left
an error injected into the hash computation circuit. Error injec- by 1 bit position since every bit of carry produced has a
tion points are the following: (i) One of the inputs of an indi- higher positional weight than the pseudo-sum bits and error
vidual operation of a digest round, (ii) One of the inputs in a masking also takes place. In Table 2, the propagation of an
digest round and (iii) One of the inputs at the beginning of a error in a FAA is presented. The shaded cells in Table 2
block round, i.e. first digest round of the first block in a multi- depict the error masking in pseudo-carry output. The error
block message. A restricted error model is used in our analysis masking takes place in Maj function as well, since it has an
as an SHA-512 involves several loops. This model assumes identical logic function as the pseudo-carry output of FAA.

TABLE 2: Propagation of error in FAA


Correct outputs Erroneous outputs (error in Z)

Pseudo-sum Pseudo-carry Pseudo-sum Pseudo-carry


X Y Z XYZ XY þ XZ þ YZ XYZ XY þ XZ þ YZ
0 0 0 0 0 1 0
0 0 1 1 0 0 0
0 1 0 1 0 0 1
0 1 1 0 1 1 0
1 0 0 1 0 0 1
1 0 1 0 1 1 0
1 1 0 0 1 1 1
1 1 1 1 1 0 1

THE COMPUTER JOURNAL, 2007


Page 6 of 11 AHMAD AND DAS

FIGURE 3: Effect of single bit error


(a) Inserted at Wt22 on s1
(b) Inserted at e on Ch

P P
The functions 0 and 1 are basically rotate functions and Fig. 3b, the effect of a single-bit error injected at the input e
an erroneous bit in the input produces a single error in the of Ch block is shown. It is clear from graph of Ch that error
output, but the position of the erroneous bit in the output is masking taking place in 55% of cases.
changed
P due to rotate operation. As there are three rotations Propagation of errors by CLA is markedly different from
in 0, uniformly three erroneous bits were obtained inP the that of FAAs. An error in any one of the inputs is propagated
output in all the 80 rounds. Same is the case with 1. by a CLA to several bits in the output. The propagation is done
Figure 3a shows the effect of a single-bit error injected at through the bit carried over to the subsequent cells, which in
the only input Wt22 of s1 block. In 75% of rounds, three turn propagates errors to sum bits and carry bits and so on.
bits of output s1 were in error. As s1 has shift function The trend line is exponential in nature and every bit is in
[SHR 6(x)], and the errors are injected at random in each error at one time or the other. In 50% of the rounds, the
round, the effect of errors in the six least significant of error was present in only one output of the CLA, and the
inputs are ignored in the output. The result is zero error in worst-case error of seven erroneous bits in the output occurred
20% of the rounds. A similar set of output error patterns was only in one round.
obtained in the case of s0. From the foregoing analysis it is clear that memory, register
In the case of Ch, one of the three inputs had to be chosen and pseudo-sum outputs of FAA are linear operations and rest
for the injection of an error. The function Ch(e, f, g) ¼ of the operations are non-linear in nature. A study of this
(e^f)  (: e^g) is such that the error in the output could nature helps us in choosing a suitable error detection
not be detected due to error masking in four out of eight scheme, which is discussed in detail in Section 4.
cases similar to pseudo-carry output of FAA shown in It can be seen from Fig. 4 that 50% of bits (256) of message
Table 2 if a single error was injected into f or g inputs and digest are in error in almost 90% of cases and the number of
in three out of eight cases if the choice of input was e. In erroneous bits decreases drastically in the last 10% of the

FIGURE 4: Errors in message digest due to an error in Wt

THE COMPUTER JOURNAL, 2007


IMPLEMENTATION OF SHA-512 ALGORITHMS ON FPGAS Page 7 of 11

cases. The number of erroneous bits in message digest was 4. ERROR DETECTION SCHEMES
only three when a fault was inserted in the 80th round. The
Fault detection is achieved in a circuit only by including
reason for this behavior is the iterative nature of the algorithm.
redundancy in one form or the other. The selection of a suit-
able scheme depends on the type of errors to be covered,
error models and the performance degradation acceptable in
3.2. Error analysis of a digest round terms of hardware overhead and delay. As discussed in
Section 1, the schemes based only on time redundancy are
Experiments were conducted by injecting a single fault in the not capable of handling permanent faults and the simplest
computed value of Wt. A message consisting of a data block of form of error detection scheme which can detect transient as
1024 bits is used and an error is introduced in only 1 bit pos- well as permanent faults with hardware redundancy is the
ition of Wt, choosing the position randomly in every round in DWC scheme. Even though the hardware overhead in this
order to simulate the transient fault condition. The number of case is little .100% with very little additional delay, these
erroneous bits in the message digest is computed for this can handle unrestricted error models.
block. The results of simulation are shown in Fig. 4. Coding schemes can address the problem of detecting tran-
sient as well as permanent faults, with lesser overhead on hard-
ware and delay, but these cannot match DWC scheme in terms
of error detection capability. The earliest error detecting
3.3. Error analysis of a block round
coding scheme used is the parity bit scheme. It is a well-known
In a message, which consists of several blocks of data, the fact that parity codes can detect all single bit errors and all
intermediate hash value Hi of a block i computed in digest errors with multiple odd erroneous bits. A single parity bit
rounds is used in the subsequent block as initial H values. In scheme is suitable if the group of bits handled by the parity
order to simulate the effect of a single fault in a block bit is small in size. In SHA-512, as the operands are 64 bits
round, a fault was injected in the computed hash value of a long, multiple parity bits are required to improve the error
block and the block round was unrolled to estimate the erro- detection capability.
neous bits in final hash value. A message consisting of 10 We use a multiple parity bit scheme comprised of eight parity
data blocks was used and it was found that out of 5120 bits bits with each parity bit handling a byte of an operand. The eight
of intermediate hash values, 1600 bits were in error. This is parity bits of an operand so generated will be referred to as parity
more than 30%. vector of the operand in the rest of the paper. In our scheme, the

FIGURE 5: Parity coding in SHA-512 digest round

THE COMPUTER JOURNAL, 2007


Page 8 of 11 AHMAD AND DAS

single parity bit of a vector can certainly detect not only all bits for pseudo-sum are predicted by XORing
single bit errors in a byte but also all odd number of erroneous the parity bits of respective bytes of the operands.
bits in the same group. Moreover, multiple even errors in the For example, if X ¼ 1011, Y ¼ 1100, Z ¼ 1001, then
64-bit operands may also be detected if the errors are spread X  Y  Z ¼ 1110 and p(X  Y  Z ) ¼ 1. Computing
in such a way that there are only odd number of erroneous bits parity of individual operands, pX ¼ 1, pY ¼ 0, pZ ¼ 0.
in a byte. As the parity bits are handling only a byte at a time, As, pX  pY  pZ ¼ 1, parity of pseudo-sum can be
the hardware complexity is also reduced. The parity coding of predicted by XORing the parity of individual operands.
an SHA-512 digest round is shown in Fig. 5. Every parity A parity vector generator computes the parity vector of
block is 8 bits wide and FAAs have two sets of parity blocks, pseudo-sum. The predicted parity vector is then com-
one for pseudo-sum and another for pseudo-carry outputs. In pared with the generated parity vector of pseudo-sum.
the following section, we discuss the method of predicting and Parity prediction and checking procedure is shown in
checking parity bits for each operation. Fig. 6a for pseudo-sum output of FAA, wherein X, Y
and Z are 8-bit operands. Implementation of this
scheme would require a parity generator block for the
4.1. Parity prediction and checking pseudo-sum, which is constructed with an 8-bit XOR
for every byte of pseudo-sum, which in turn generates
The parity prediction scheme depends on the operation per-
1 bit of the parity vector.
formed. Parity checking basically compares the parity genera-
(3) Pseudo-carry output of FAA, logic and rotate/shift
ted from the output of an operation with that of the predicted
operations: For the pseudo-carry output of FAA,
parity.
logic operations
P PCh and Maj and rotate/shift operations
(1) Memory and register transfer operations: The initial s1, s0, 0 and 1 no parity codes can be generated as
constant H is stored in 8  64 ¼ 512 bits and constant such since these operations are non-linear in nature, and
Kt in 80  64 ¼ 5120 bits of memory. As these are con- any attempt to create a coding scheme results in marked
stants, even parity bits are computed beforehand for increase in hardware overhead. Moreover, rotate/shift
every byte of the operand and stored in memory as a operations cannot be performed on bytes and this
vector along with the operands. Memory capacity is adds to the increased complexity of coding scheme.
increased to 8  72 ¼ 576 bits for H and 80  72 ¼ Hence, parity prediction is done in these operations
5760 bits for K to store the constants along with the by duplicating the operation block and generating the
parity bits. During memory read or register transfer parity vectors of original block and the duplicated
operations, the parity vectors are transferred along blocks separately. The parity vectors so generated are
with the operands to the destination. A total of 704 then compared to detect errors. Two schemes were
extra bits are required to store the parity bits generated. attempted which are shown in Fig. 6b and c. The
(2) Pseudo-sum output of FAAs: In FAAs, the sum output is scheme chosen for implementation is Fig. 6b as
generated by full adders and is given by pseudo-sum ¼ the hardware overhead was lesser in this case than the
X  Y  Z where X, Y and Z are operands. If pX, pY and scheme in Fig. 6c. For example, s0 function when
pZ are parity vectors of X, Y and Z, respectively, as implemented using scheme in Fig. 6b, 44 logic
parity is a linear operator, p(X  Y  Z ) ¼ pX  pY elements were used whereas scheme in Fig. 6c required
 pZ. In the FAAs of SHA-512 circuit, eight parity 83 logic elements.

FIGURE 6: Parity prediction and checking

THE COMPUTER JOURNAL, 2007


IMPLEMENTATION OF SHA-512 ALGORITHMS ON FPGAS Page 9 of 11

5. EXPERIMENTAL RESULTS
The parity scheme for the SHA-512 algorithm proposed in this
paper was initially implemented in software in order to
perform exhaustive analysis as the hardware implementation
on FPGA is too complex. The algorithm was later designed
and tested using comprehensive design software, the Altera
Quartus II, version 5.0. The designs were synthesized using
Verilog HDL and VHDL, placed and routed in Altera device
EP1S20F780C5 of Stratix family FPGA. The design was
tested for a large number of test cases encompassing the differ-
ent types of expected errors. The test cases can be classified
into three categories on the basis of position of error
FIGURE 7: Parity prediction and checking in a CLA introduced.
In the first category, a message of one data block size was
(4) Addition in CLA: Propagation of errors by a CLA was chosen and a single fault was injected as discussed in
discussed in Section 3.1. As the carries are propagated Section 3.2. In each of the 80 rounds, the error position was
to subsequent cells, any error in carry will spill over to randomly chosen among the 64 bits of Wt in order to simulate
other bits. A scheme for predicting the parity bit is the transient errors. Ten such blocks were tested in the same
shown in Fig. 7. If A and B are the pseudo-sum and manner amounting to 10  80 ¼ 800 test cases. These tests
pseudo-carry from FAA and if pA and pB are their were used to detect errors in digest rounds. Next a message
respective parity bits, C0 is the carry-in, pS is the of multiple blocks with varying number of blocks was used
parity of the final sum output and RCn21 is the XOR and 10 such messages were tested when a single fault was
of all the carries generated in all the cells excluding injected in each of the 80 rounds giving rise to same number
the last cell, then predicted carry ¼ pA  pB  RCn21 of test cases as above. This tests errors in the block loop.
 C0. Parity check then involves comparing pS with The third category test cases were used to detect permanent
the predicted parity. For example, let A ¼ 1001, B ¼ errors. A message of one data block size was chosen and a
0111; then pA ¼ 0 and pB ¼ 1; pS ¼ 0 as the Sum ¼ single fault was injected in every bit of Wt in each of the 80
0000. The carry-in C0 ¼ 0 and carries generated are rounds. Five such blocks were tested in the same manner
1110 where C4 ¼ 1. The XOR of all the carries gener- amounting to test cases 5  64  80 ¼ 25 600. In all cases,
ated in all the cells excluding the last cell, i.e. C4, the errors were injected in the bits of Wt and 100% error detec-
RC3 ¼ 1  1  1 ¼ 1. The predicted carry as computed tion was achieved.
by the expression, pA  pB  RC3  C0 ¼ 0  1  1 The performance metrics such as the area (a) and memory
 0 ¼ 0, which is same as pS. In the implementation (m) were obtained from the tool and throughput (d) was com-
of the adder, the required modifications are carried puted for a digest round of the SHA-512 algorithm with and
out to handle the parity generation. without error detection capability and these performance
metrics are presented in Table 3. The resources used in
terms of the number of logic elements for the implementation
The prediction scheme for a digest round of an SHA-512 is of algorithm are referred to as the area. A memory segment
constructed by cascading the individual prediction schemes consists of a bit-slice of a memory that is implemented in a
discussed in this section. single embedded cell. Each embedded cell implements one
output of the memory and multiple memory segments may

4.2. Parity checking points


The parity bits are checked at the output of all operational TABLE 3: Synthesis results of SHA-512 on EP1S20F780C5
blocks except at the register outputs where only the transfer Percentage
of parity bits is done. The total number of checkpoints is 25, Area of chip
which are marked in Fig. 5. Experiments were conducted by Digest round (LEs) area used Memory Throughput
injecting a single fault at each of these check points and we of SHA-512 (a) (%) (bits) (m) (Mbits/s)
found that all 25 check points are required as each block Without error 4158 22 9600 415
assumes that the incoming parities are error free. Moreover, detection
checking parity at the output of every operation prevents trans- With error 5038 27 10 704 372
mission of errors to subsequent operational blocks and pro- detection
vides the shortest detection latency.

THE COMPUTER JOURNAL, 2007


Page 10 of 11 AHMAD AND DAS

be needed to create a single memory block. The throughput (d) Computer Science, Vol. 809, 71 –83. Springer-Verlag,
is computed as, d ¼ Message block size/(Clock period  London, pp. 71–83.
Latency) where Latency is the number of rounds in a loop ¼ [4] Sklavos, N. and Koufopavlou, O. (2003) On the Hardware
80 for SHA-512 and the minimum operating clock obtained Implementation of the SHA-2 (256, 384, 512) Hash
from the tool is referred to as clock period. Functions. Proc. IEEE Int. Symp. on Circuits and Systems
It can be seen from the performance characteristics in ISCAS’03, Bangkok, Thailand, May 25– 28, pp. 153–156.
Table 3, the number of additional logic elements required IEEE Computer Society, Washington, DC, USA.
for error detection circuits ¼ (5038 2 4158) ¼ 880 which [5] Bar-El, H., et al. (2006) The Sorcerer’s Apprentice Guide to
amounts to a hardware overhead of 21%, i.e. (880/4158)  Fault Attacks. Proc. IEEE, 94, 370–382.
100. The delay overhead computed from reciprocal of fre- [6] Karri, R., et al. (2001) Fault-based Side-Channel Cryptanalysis
quency of both the circuits is 11.6%. These parameters show Tolerant Rijndael Symmetric Block Cipher architecture. Proc.
that our scheme is superior to complete the hardware redun- IEEE Int. Symp. on Defect and Fault Tolerance in VLSI
dancy technique in terms of hardware overhead and has Systems DFT’01, San Francisco, CA, USA, October 24– 26,
much less delay overhead than time redundancy techniques. pp. 427–435. IEEE Computer Society, Washington, DC, USA.
[7] Wu, K., et al. (2004) Low Cost Concurrent error Detection for
the Advanced Encryption Standard. Proc. Int. Test Conf. 2004
ITC 2004, Charlotte Convention Center, October 26– 28,
6. CONCLUSIONS pp. 1242–1248. IEEE Computer Society, Washington, DC,
In this paper, a detailed analysis of the propagation of errors in USA.
a hardware implementation of SHA-512 is studied. This analy- [8] Bertoni, G., et al. (2004) An Efficient Hardware-Based Fault
sis included single, transient as well as permanent faults Diagnosis Scheme for AES: Performances and Cost. Proc.
injected at all stages of hash value computation. It is found IEEE Int. Symp. on Defect and Fault Tolerance in VLSI
that even a single error injected resulted in half the bits of Systems DFT’04, Cannes, France, October 11– 13,
hash value being in error and the errors are spread across the pp. 130–138. IEEE Computer Society, Washington, DC, USA.
computed hash value. In-depth analysis of the individual oper- [9] Wu, K. and Karri, R. (2001) Idle Cycle-Based Concurrent Error
ations of SHA-512 guided us in proposing suitable parity pre- Detection of RC6 Encryption. Proc. IEEE Int. Symp. on Defect
diction schemes. We proposed an error detection scheme and Fault Tolerance in VLSI Systems (DFT’01), San Francisco,
based on parity codes and hardware redundancy and our CA, USA, October 24– 26, pp. 200–205. IEEE Computer
experiments conducted on a large number of test cases show Society, Washington, DC, USA.
that our scheme has 100% fault coverage in the case of [10] Bertoni, G., et al. (2003) Error analysis and detection
single errors. The computation of performance metrics, such procedures for a hardware implementation of the advanced
as area and throughput indicate that our scheme has only encryption standard. IEEE Trans. Comput., 52, 492– 505.
limited hardware overhead and short delay overhead. Future [11] Yen, C. and Wu, B. (2006) Simple error detection methods for
work will be directed toward developing error detection hardware implementation of advanced encryption standard.
schemes with pipelined architecture and developing error cor- IEEE Trans. Comput., 55, 720–731.
rection codes for SHA-512 algorithm. [12] Siewiorek, D.P. and Swarz, R.S. (1998) Reliable Computer
Systems Design and Evaluation (3rd edn). A. K. Peters,
Natick, MA.
[13] Johnson, B.W. (1989) Design and Analysis of Fault-Tolerant
ACKNOWLEDGEMENTS Digital Systems. Addison-Wesley, Reading, MA.
The authors would like to acknowledge the financial support [14] Drineas, P. and Makris, Y. (2003) Concurrent Fault Detection in
provided for this work by Kuwait University grant EO 04/ Random Logic. Proc. 4th Int. Symp. on Quality Electronic
03. Thanks are also due to anonymous referees for their con- Design ISQED’03, San Jose, CA, USA, March 24– 26,
structive comments to improve the quality of this manuscript. pp. 425–430. IEEE Computer Society, Washington, DC, USA.
[15] Patel, J.H. and Fung, L.Y. (1982) Concurrent error detection in
alus by recomputing with shifted operands. IEEE Trans.
Comput., C-31, 589– 595.
REFERENCES
[16] Reynolds, D.A. and Metze, G. (1978) Fault detection
[1] Dierks, T. and Allen, C. (1999) The TLS Protocol—Version 1.0. capabilities of alternating logic. IEEE Trans. Comput., C-27,
RFC 2246. 1093–1098.
[2] Madson, C. and Glenn, R. (1998) The Use of HMAC-SHA-1-96 [17] Karri, R. and Wu, K. (2002) Algorithm level re-computing
within ESP and AH. RFC 2404. using implementation diversity: a register transfer level
[3] Preneel, B. (1993) Design Principles for Dedicated Hash concurrent error detection technique. IEEE Trans. Very Large
Functions. Fast Software Encryption, Lecture Notes in Scale Integr. Syst., 10, 864–875.

THE COMPUTER JOURNAL, 2007


IMPLEMENTATION OF SHA-512 ALGORITHMS ON FPGAS Page 11 of 11

[18] Townsend, W.J., Abraham, J.A. and Swartzlander, E.E. Jr. [24] Almukhaizim, S., Drineas, P. and Makris, Y. (2004) Concurrent
(2003) Quadruple Time Redundancy Adders. Proc. 18th IEEE Error Detection for Combinational and Sequential Logic via
Int. Symp. on Defect and Fault Tolerance in VLSI Output Compaction. Proc. 5th Int. Symp. on Quality
Systems DFT’03, Cambridge, MA, USA, November 3–5, Electronic Design (ISQED’04), Santa Clara, CA, March
pp. 250 –256. IEEE Computer Society, Washington, DC, USA. 22 –24, pp. 459–464. IEEE Computer Society, Washington,
[19] Parhami, B. (2001) Approach to the Design of Parity-Checked DC, USA.
Arithmetic Circuits. Proc. 36th Asilomar Conf. Signals, [25] Sogomonyan, E. (1974) Design of built-in self-checking
Systems, and Computers Asilomar 2002, Pacific Grove, CA, monitoring circuits for combinational devices. Autom. Remote
USA, November 4–7, pp. 1–5. IEEE Computer Society, Control, 35, pp. 280–289.
Washington, DC. [26] McLoone, M. and McCanny, J.V. (2002) Efficient Single-Chip
[20] Nicolaidis, M., et al. (1997) Fault-secure parity prediction Implementation of SHA-384 & SHA-512. Proc. IEEE Int. Conf.
arithmetic operators. IEEE Des. Test Comput., 14, 60 –71. on Field-Programmable Technology FPT, Hong Kong,
[21] Das, D. and Touba, N.A. (1998) Synthesis of Circuits with December 16–18, pp. 311–314. IEEE Computer Society,
Low-Cost Concurrent Error Detection based on Bose-Lin Washington, DC, USA.
Codes. Proc. IEEE VLSI Test Symposium, Princeton, NJ, [27] Grembowski, T., et al. (2002) Comparative Analysis of the
USA, 28 April–1 May, pp. 309 –315. IEEE Computer Hardware Implementation of Hash Functions SHA-1 and
Society, Washington, DC, USA. SHA-512. Proc. 5th Int. Conf. on Information Security
[22] Lo, J.C., et al. (1992) An SFS Berger check prediction ALU and ISC’2002, Sao Paulo, Brazil, 30 September–2 October, pp.
its application to self-checking processor designs. IEEE Trans. 75 –89, Springer-Verlag.
Comput.-Aided Des. Integr. Circuits Syst., 11, 525– 540. [28] Dadda, L. and Macchetti, M. (2004) The Design of a High
[23] Almukhaizim, S. and Makris, Y. (2003) Fault Tolerant Design of Speed ASIC Unit for the Hash Functions SHA-256 (384,
Combinational and Sequential Logic based on Parity Check 512). Proc. Design, Automation and Test in Europe
Code. Proc. 18th IEEE Int. Symp. on Defect and Fault Tolerance Conference DATE’04, CNIT La Defese, Paris, France,
in VLSI Systems DFT’03, Cambridge, MA, USA, November 3– February 16–20, pp. 70–75. IEEE Computer Society,
5, pp. 563. IEEE Computer Society, Washington, DC, USA. Washington, DC, USA.

THE COMPUTER JOURNAL, 2007

Vous aimerez peut-être aussi