Scalable Elliptic Curve Cryptosystem FPGA Processor For NIST Prime Curves

This article has been accepted for inclusion in a future issue of this journal.
Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
Kung Chi Cinnati Loi and Seok-Bum Ko
Abstract: The architecture and the implementation of a
high-performance scalable elliptic curve cryptography processor (ECP)
are presented. The proposed ECP is able to support all five prime field
elliptic curves recommended by the National Institute of Standards and
Technology (NIST). The design takes advantage of the high-performance
capabilities of the DSP48E slices available in Xilinx field-programmable
gate arrays (FPGAs) to achieve high speed and low hardware resource
utilization. The proposed design parallelizes the underlying prime
field operations to reduce the latency of the elliptic curve point
multiplication (ECPM) operation. Prime field inversion is performed
efficiently using the same arithmetic blocks as the ones used for prime
field multiplication and addition/subtraction. To the best of the authors'
knowledge, the proposed scalable ECP is the fastest and smallest ECP
that can support all five NIST recommended prime curves without the
need to reconfigure the hardware. It can compute the ECPM between
1.709 and 28.04 ms using a Xilinx Virtex-5 FPGA.
Index Terms: Elliptic curve cryptography (ECC), field-programmable
gate array (FPGA), finite field arithmetic, National Institute of Standards
and Technology (NIST) prime, scalable ECC processor (ECP).
I. INTRODUCTION
Elliptic curve cryptography (ECC) has gained an increased
amount of attention in the past few years as more efficient public-key
cryptography algorithms are needed for the growing amount
of secure transactions over the network. Originally proposed by
Miller [1] and Koblitz [2], ECC implementations have been shown to be
more efficient than other public-key cryptography algorithms, such
as Rivest-Shamir-Adleman (RSA) [3].
The most important operation in ECC protocols is the elliptic curve
point multiplication (ECPM). Due to the complexity of the ECPM,
many applications opt to offload the ECPM to a hardware accelerator,
also referred to as an ECC processor (ECP). The United States
National Security Agency (NSA) [4] indicated that classified
and unclassified information in the NSA will move toward using ECC
over prime fields of key sizes 256, 384, and 521 bits. Thus, it is
important for modern ECP implementations to support large prime
fields efficiently. For server-side applications, it is also important for
the ECP to support a variety of elliptic curves to be compatible with
devices with different security needs. For example, for the elliptic curve
digital signature algorithm, the ECP can verify various signature
requests with different settings using the same hardware. Thus, this
brief presents an ECP implementation that supports all five prime
field curves recommended by the National Institute of Standards and
Technology (NIST) [5].
Other researchers have also implemented ECPs over prime
fields. Guneysu and Paar [6] optimized the architecture of the
ECP using high-performance DSP slices on field-programmable
gate arrays (FPGAs). Ghosh et al. [7] developed a side-channel
attack-resistant ECP using the double-and-add-always algorithm.
Ananyi et al. [8] proposed an ECP that supports all five NIST
recommended prime fields and computes the ECPM for the five prime
field curves in between 4.8 and 45.6 ms. In the abovementioned research
works, many ECPs have very high hardware resource utilization
due to the wide datapath and, as a result, have low maximum clock
frequencies.
The main contribution of this brief is the novel hardware
architecture of the finite field arithmetic units that take advantage of
the DSP48E slices [9] in the Xilinx FPGAs to improve the efficiency
of the ECPM operation. These DSP48E slices are hardwired
arithmetic units that can operate at much higher frequencies than
the FPGA fabric. Thus, by optimizing the arithmetic operations to
utilize the DSP48E slices, the proposed design is more efficient
than the designs in the current literature. The proposed ECP is also
able to implement the prime field inversion algorithm efficiently
using the same arithmetic units. Since the resultant scalable ECP
can support ECPM for prime fields from 192 to 521 bits, it can
satisfy the demand for higher security levels in the future, while
being compatible with the smaller key sizes presently used. The use of
FPGAs also allows for quick modifications to include additional
curves should they be required in the future.
Manuscript received May 30, 2014; revised September 8, 2014 and
October 22, 2014; accepted November 23, 2014. This work was supported by
the Natural Sciences and Engineering Research Council of Canada through
the Alexander Graham Bell Canada Graduate Scholarship.
The authors are with the Department of Electrical and Computer
Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A8, Canada
(e-mail: c.loi@usask.ca; seokbum.ko@usask.ca).
Digital Object Identifier 10.1109/TVLSI.2014.2375640
II. ELLIPTIC CURVE CRYPTOGRAPHY OVER PRIME FIELDS
A. Scalable Prime Field Operations

The implementation of prime field operations can be accomplished
using any integer operation followed by a reduction step. In this brief,
the prime field operations that are implemented by the hardware are
multiplication, addition, and subtraction. Inversion is accomplished
using a series of additions and shifting as will be discussed later.
Multiplication in this brief uses the Comba algorithm [10], which
is a digit-wise multiplication algorithm. Its advantage is that digits of
the product are produced from the least to the most significant, which
is ideal for the proposed design where the number of digits differs
among the prime fields. Addition and subtraction are also performed
digit by digit.
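The digit-wise multiplication described above can be sketched in software as follows. This is a minimal illustration of Comba (product-scanning) multiplication with 17-bit digits, not the hardware implementation; the helper names to_digits, from_digits, and comba_multiply are hypothetical and chosen only for this example.

```python
DIGIT_BITS = 17                      # digit size used in this brief
MASK = (1 << DIGIT_BITS) - 1

def to_digits(x, n):
    """Split a non-negative integer into n little-endian 17-bit digits."""
    return [(x >> (DIGIT_BITS * i)) & MASK for i in range(n)]

def from_digits(digits):
    """Recombine little-endian 17-bit digits into an integer."""
    return sum(d << (DIGIT_BITS * i) for i, d in enumerate(digits))

def comba_multiply(a, b):
    """Product-scanning multiplication of two n-digit operands.

    Product digits are produced from least to most significant: each
    column accumulates all partial products a[i] * b[col - i], emits the
    low 17 bits, and shifts the accumulator right by 17 bits.
    """
    n = len(a)
    assert len(b) == n
    product = []
    acc = 0
    for col in range(2 * n - 1):
        lo = max(0, col - n + 1)
        hi = min(col, n - 1)
        for i in range(lo, hi + 1):
            acc += a[i] * b[col - i]     # multiply and accumulate
        product.append(acc & MASK)       # emit one product digit
        acc >>= DIGIT_BITS               # right shift by one digit
    product.append(acc)                  # most significant digit
    return product
```

Because the same loop works for any digit count n, one routine covers all five field sizes simply by changing the number of digits, which is the property the text relies on.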
The reduction (or modulo) operation is required for multiplication,
addition, and subtraction. The five prime numbers recommended
by NIST [5] have been selected to make reduction by these prime
numbers more efficient. Using the NIST recommended primes, the
reduction operation becomes a series of additions and subtractions
followed by a final modulo operation. The reduction algorithms can
be found in [11].
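As an illustration of why the NIST primes make reduction cheap, the following sketch reduces a product modulo p192 = 2^192 − 2^64 − 1 using only additions of rearranged 64-bit words, followed by a final conditional subtraction. It follows the well-known fast reduction for this prime as given in [11]; the function name reduce_p192 and the use of 64-bit words (rather than the digit sizes used by the hardware in this brief) are assumptions made for this example.

```python
P192 = 2**192 - 2**64 - 1
M64 = (1 << 64) - 1

def reduce_p192(c):
    """Reduce 0 <= c < 2^384 modulo p192 using word rearrangement.

    With c = (c5, c4, c3, c2, c1, c0) in 64-bit words, the congruence
    2^192 = 2^64 + 1 (mod p192) gives
        c = s1 + s2 + s3 + s4 (mod p192)
    where each s_i is built from the words of c.
    """
    assert 0 <= c < (1 << 384)
    c0, c1, c2, c3, c4, c5 = [(c >> (64 * i)) & M64 for i in range(6)]
    s1 = (c2 << 128) | (c1 << 64) | c0
    s2 = (c3 << 64) | c3
    s3 = (c4 << 128) | (c4 << 64)
    s4 = (c5 << 128) | (c5 << 64) | c5
    r = s1 + s2 + s3 + s4
    while r >= P192:          # final modulo step: a few subtractions at most
        r -= P192
    return r
```

For p192 only additions are needed before the final step; the reductions for the other four NIST primes in [11] mix additions and subtractions in the same manner.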
B. Elliptic Curve Point Multiplication Over Prime Fields
ECC is based on point operations performed on points on the
elliptic curve, E. The most important point operation is called the
ECPM. The ECPM operation can be computed using a sequence
of point additions (PADDs) and point doublings (PDBLs). The
efficiency of the ECPM operation depends on the selection of the
point multiplication algorithm, the coordinates of the points used,
and the efficiency of the underlying finite field operations. In this
brief, the double-and-add algorithm is selected using mixed Jacobian
and affine coordinates for PADD and Jacobian coordinates for PDBL.
The complete sequence of prime field operations for PADD and
PDBL can be found in [11]. The main advantage of using this
coordinate system is that PADD requires eight multiplications and
three squarings, and PDBL requires four multiplications and four
squarings. In this brief, there is no hardware dedicated to implementing the squaring operation, so squaring is evaluated using a
multiplier. Thus, in this brief, PADD requires 11 multiplications and
PDBL requires eight multiplications.
1063-8210 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
TABLE I
OPERATION SEQUENCE FOR THE ADDITION/SUBTRACTION/REDUCTION (AR) BLOCK FOR p192
Fig. 1. Block diagram of the multiplier (MULT) block.
Fig. 2. Block diagram of the addition/subtraction/reduction (AR) block.
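The operation counts for PADD and PDBL stated in Section II-B can be checked with a small software model. The sketch below uses the standard Jacobian doubling (for curves with a = −3) and mixed Jacobian-affine addition formulas from [11], counts every field multiplication (squarings included, since no dedicated squarer is assumed), and runs left-to-right double-and-add. The prime 10007, the coefficient B = 5, and all function names are hypothetical toy values chosen for illustration; they are not NIST parameters and not the authors' implementation.

```python
P_MOD = 10007        # toy prime with P_MOD % 4 == 3 (assumption, not a NIST prime)
B = 5                # toy curve y^2 = x^3 - 3x + B (a = -3, as for the NIST curves)
MULS = [0]           # running count of field multiplications

def fmul(a, b):
    MULS[0] += 1
    return a * b % P_MOD

def half(a):
    """Divide by 2 modulo the odd prime P_MOD."""
    return (a if a % 2 == 0 else a + P_MOD) // 2

def pdbl(X1, Y1, Z1):
    """Jacobian point doubling for a = -3: four multiplications, four squarings."""
    if Z1 == 0 or Y1 == 0:
        return (1, 1, 0)                       # point at infinity
    t1 = fmul(Z1, Z1)
    t2 = 3 * fmul((X1 - t1) % P_MOD, (X1 + t1) % P_MOD) % P_MOD
    y3 = 2 * Y1 % P_MOD
    Z3 = fmul(y3, Z1)
    y3 = fmul(y3, y3)
    t3 = fmul(y3, X1)
    y3 = half(fmul(y3, y3))
    X3 = (fmul(t2, t2) - 2 * t3) % P_MOD
    Y3 = (fmul((t3 - X3) % P_MOD, t2) - y3) % P_MOD
    return (X3, Y3, Z3)

def padd(X1, Y1, Z1, x2, y2):
    """Mixed Jacobian-affine addition: eight multiplications, three squarings."""
    if Z1 == 0:
        return (x2, y2, 1)
    t1 = fmul(Z1, Z1)
    t2 = fmul(t1, Z1)
    t1 = (fmul(t1, x2) - X1) % P_MOD
    t2 = (fmul(t2, y2) - Y1) % P_MOD
    if t1 == 0:                                # same x-coordinate
        return pdbl(x2, y2, 1) if t2 == 0 else (1, 1, 0)
    Z3 = fmul(Z1, t1)
    t3 = fmul(t1, t1)
    t4 = fmul(t3, t1)
    t3 = fmul(t3, X1)
    X3 = (fmul(t2, t2) - 2 * t3 - t4) % P_MOD
    Y3 = (fmul((t3 - X3) % P_MOD, t2) - fmul(t4, Y1)) % P_MOD
    return (X3, Y3, Z3)

def ecpm(k, x, y):
    """Left-to-right double-and-add for k >= 1; returns an affine point or None."""
    Q = (x, y, 1)
    for bit in bin(k)[3:]:                     # bits after the leading 1
        Q = pdbl(*Q)
        if bit == "1":
            Q = padd(*Q, x, y)
    X, Y, Z = Q
    if Z == 0:
        return None
    zi = pow(Z, -1, P_MOD)                     # one inversion at the very end
    return (X * zi * zi % P_MOD, Y * zi * zi * zi % P_MOD)

def affine_add(P, Q):
    """Plain affine addition, used only as a reference for checking."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 - 3) * pow(2 * y1 % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def find_point():
    """Find some affine point on the toy curve (square root valid as P_MOD % 4 == 3)."""
    for x in range(1, P_MOD):
        rhs = (x ** 3 - 3 * x + B) % P_MOD
        y = pow(rhs, (P_MOD + 1) // 4, P_MOD)
        if y and y * y % P_MOD == rhs:
            return (x, y)

G = find_point()
```

Calling pdbl or padd once and inspecting MULS shows eight and 11 counted multiplications, respectively, matching the counts above. Only one field inversion is needed, at the end, to return to affine coordinates.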
III. DESIGN AND ARCHITECTURE OF ECC PROCESSOR
A. Prime Field Arithmetic Blocks
The proposed ECP has two prime field arithmetic blocks. One computes integer multiplication (MULT block), and the other computes
addition/subtraction/reduction (AR block).
1) Multiplier (MULT) Block: The MULT block is shown in Fig. 1.
The inputs of the multiplier are A and B. These 17-bit digits are stored
in two 31 × 17-bit RAMs. Since the ECP supports all five NIST prime
fields, the RAM must accommodate up to ⌈521/17⌉ = 31 words.
A controller selects the appropriate digit to be input into the DSP48E
slice. In the MULT block, a DSP48E slice is set up for multiply
and accumulate [9], with the ability to shift the input to the adder
by 17 bits. The digit size of 17 bits is chosen due to the built-in
17-bit shift of the DSP48E slice to avoid using additional logic to
implement the right-shift operation used by the Comba algorithm. The
first-in-first-out (FIFO) buffer is used to store the least significant digits of
the result, and the shift register is used to store the most significant
digits (MSDs) of the result. The shift register combines the MSDs of
the product and outputs them in parallel to the AR block.
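The sizing argument for the MULT block can be reproduced with a few lines of arithmetic. The sketch below computes the number of 17-bit digits needed for each NIST prime field and the worst-case width of a multiply-accumulate column when the accumulator is shifted right by 17 bits after each column. The function names are hypothetical; the comparison against a 48-bit accumulator reflects the DSP48E accumulator width and is included only as a plausibility check, not as a statement about the authors' timing or control logic.

```python
DIGIT_BITS = 17
FIELD_BITS = (192, 224, 256, 384, 521)

def num_digits(bits, w=DIGIT_BITS):
    """Number of w-bit digits needed to hold a value of the given bit length."""
    return -(-bits // w)              # ceiling division

def worst_case_acc_bits(n, w=DIGIT_BITS):
    """Largest accumulator bit length seen when multiplying two n-digit
    operands whose digits are all at their maximum value."""
    dmax = (1 << w) - 1
    acc = 0
    worst = 0
    for col in range(2 * n - 1):
        terms = min(col, n - 1) - max(0, col - n + 1) + 1
        acc += terms * dmax * dmax    # all partial products of this column
        worst = max(worst, acc)
        acc >>= w                     # 17-bit right shift between columns
    return worst.bit_length()

digits_per_field = {bits: num_digits(bits) for bits in FIELD_BITS}
```

Here digits_per_field evaluates to 12, 14, 16, 23, and 31 digits for the five fields, and the worst-case column sum for 31 digits stays below 48 bits, so a single accumulator of that width does not overflow in this model.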
2) Addition/Subtraction/Reduction (AR) Block: The block diagram
of the AR block is shown in Fig. 2. Using A_msd to input the MSDs
of the result of the MULT block, port A can input data with the same
number of clock cycles as A_add and A_sub, eliminating the need
for the AR block to wait for the MSDs to input digit by digit. The
input registers A Reg, A_add Reg, and A_sub Reg collect the input
digits to form the complete operands, and the block computes A Reg + A_add
Reg − A_sub Reg (mod p). To better fit the architecture of the
AR block to the reduction algorithms of the five NIST recommended
primes, the internal operation of the AR block uses a 32-bit datapath.
To reduce the size of the multiplexers, the input registers may shift
by 32 bits after the digit is processed. Consider the case of operating
in the 192-bit mode. The sequence of digits selected for each input
is shown in Table I.
Fig. 3. Block diagram of the scalable ECC processor.
During the zeroth pass of the zeroth digit, the zeroth digit of
A_add Reg (add0) is selected for C0, the zeroth digit of A_sub Reg
(sub0) is selected for A1:B1, 0 is input into C1, the zeroth digit of
A Reg (a0) is selected for A2:B2, and the sixth digit (a6) for C2. Once the
zeroth and first passes are complete for digit 0, the input registers are
shifted by 32 bits, such that add1 becomes add0, sub1 becomes sub0,
a1 becomes a0, and so on. By doing so, when operating for digit 1,
the multiplexer selects add0 and sub0 again for C0 and A1:B1,
respectively. Using this method, the size of the multiplexers becomes
much smaller.
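The register-shifting idea can be modeled in a few lines. In the sketch below, the add and subtract operands are consumed through a fixed tap at digit position 0, and both registers are shifted right by 32 bits after each digit, so no digit-indexed selection is needed. This models only the shifting of A_add Reg and A_sub Reg with carry and borrow propagation; it does not reproduce the operation sequence of Table I, and the function name digit_serial_add_sub is hypothetical.

```python
MASK32 = (1 << 32) - 1

def digit_serial_add_sub(a_add, a_sub, n_digits):
    """Compute a_add - a_sub over n_digits 32-bit digits.

    Returns (result, carry), where result holds the low 32 * n_digits bits
    and carry is 0 or -1 (a borrow out of the top digit), so that
    result + carry * 2^(32 * n_digits) == a_add - a_sub.
    """
    result = 0
    carry = 0
    for i in range(n_digits):
        # Fixed tap: always read digit position 0 of each register.
        t = (a_add & MASK32) - (a_sub & MASK32) + carry
        result |= (t & MASK32) << (32 * i)
        carry = t >> 32               # 0, or -1 when a borrow occurred
        # Shift the registers so the next digit moves into position 0.
        a_add >>= 32
        a_sub >>= 32
    return result, carry
```

The trade-off is one shift per digit in exchange for a selection network that only ever looks at a small, fixed set of register positions.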
The result of the sum and reduction is stored in Z Reg to be used by
the second half of the AR block, which performs the final (mod p)
of the reduction algorithm and converts the result back to 17-bit digits.
At the completion of the series of additions and subtractions at the
first half of the AR block, at most one extra 17-bit digit can be
produced (stored in Z Carry). Thus, the second half of the AR block
performs the reduction of the one extra digit. Since the digit size
is 17 bits, the shift input is a one-hot value that is multiplied by
Z Carry to shift the extra digit accordingly for reduction. The top-right
DSP48E unit in Fig. 2 is used to handle carry-out bits from each
digit to be carried to the next digit. The result of the AR block is
output through port C as 17-bit digits.
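The role of the one-hot shift value can be illustrated for p192. The sketch below assumes, purely for illustration, that the extra digit has weight 2^192; since 2^192 ≡ 2^64 + 1 (mod p192) and 64 = 3 × 17 + 13, the term carry × 2^64 lands in 17-bit digit 3 with a 13-bit offset, which a multiplier can realize by multiplying with the one-hot value 2^13. The weight of the extra digit, the function names, and the alignment shown are assumptions, not details taken from the hardware description.

```python
P192 = 2**192 - 2**64 - 1
DIGIT_BITS = 17

def one_hot(shift):
    """One-hot constant whose product with x equals x shifted left by `shift`."""
    return 1 << shift

def fold_carry(z_low, z_carry):
    """Reduce z = z_carry * 2^192 + z_low modulo p192, with z_low < 2^192
    and z_carry at most one 17-bit digit (both are assumptions of this sketch)."""
    assert 0 <= z_low < (1 << 192) and 0 <= z_carry < (1 << DIGIT_BITS)
    digit_index, bit_offset = divmod(64, DIGIT_BITS)   # (3, 13)
    shifted = z_carry * one_hot(bit_offset)            # multiply instead of shift
    r = z_low + (shifted << (DIGIT_BITS * digit_index)) + z_carry
    while r >= P192:                                   # final conditional subtraction
        r -= P192
    return r
```

Using a multiplication by a one-hot constant lets the same multiplier hardware act as a programmable shifter, so the alignment can change with the selected prime without a dedicated barrel shifter.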
B. Scalable ECC Processor (ECP)
1) Hardware Architecture: The block diagram of the proposed
scalable ECP is shown in Fig. 3. The inputs of the ECP are x1 and y1,
the affine coordinates of the point, and k, the scalar multiplier. The
RAM stores x1, y1, and six temporary variables. Thus, it needs
⌈521/17⌉ = 31 words, each 8 × 18 = 144 bits wide. The
reason for using 18-bit variables, instead of 17, is that the RAM
is configured for byte writing, which allows it to write only selected
bytes at a time, eliminating the need to feed back the output of the
RAM to the input.
TABLE II
RESULTS COMPARISON
Fig. 4. Finite-state machine of the scalable ECC processor.
In the proposed architecture, the MULT and AR block operations
are performed in parallel. In addition, the latency of the MULT
operation is long enough that multiple AR operations can be executed
during each MULT operation. Since the C_msd output of MULT
connects directly to the A_msd input of AR, the RAM does not
need to store the product of an integer multiplication, which would
require the RAM to be double in size. However, since the result of
the MULT block is not stored, it must be immediately followed by
an AR operation to reduce the product.
2) Finite-State Machine and ECPM Operation: The finite-state
machine (FSM) of the design is stored in the controller in Fig. 3
and is shown in Fig. 4. The machine resets to the IDLE state, where
it waits for the load signal to initiate the ECPM. Once triggered,
the FSM moves to the LOAD state for initialization and goes into
the PDBL state. The FSM then moves to the PADD state only if the
current bit of k (cur_k) is 1. Otherwise, it will restart the PDBL state.
The FSM stays in the PDBL and PADD states until the double-and-add algorithm is complete (k_count = 0), then it exits to the
INVS state, which sets up the inversion operation.
Prime field inversion is evaluated using the binary algorithm for
inversion shown in [11], but modified to fit the proposed architecture.
There are two other major modifications in the algorithm. In the
original algorithm, if x1 is even then x1 = x1/2; otherwise,
x1 = (x1 + p)/2. In this brief, the operation is more hardware
friendly: it always performs the right shift ⌊x1/2⌋ and adds (p + 1)/2 only
if x1 is odd. These operations are equivalent because if x1 is odd,
$\lfloor x_1/2 \rfloor + (p+1)/2 = (x_1 - 1)/2 + (p+1)/2 = (x_1 + p)/2$. The
second modification is that reduction is always executed because the
AR block always evaluates the prime modulo. Thus, the values are
always reduced by p and reduction is not necessary at the end of the
algorithm.
Further analyzing the inversion algorithm, one can also notice that
since v is initialized to p, which is an odd number, the INVS state
will move into the UEVEN state if u is even or SUBT state if u is
odd. When entering the SUBT state, both u and v must be odd, so at
the end of the SUBT state, only one of u or v is odd. Therefore, either
the UEVEN or the VEVEN state is entered. If the UEVEN state is
entered, u must be even and v must be odd. During the UEVEN state,
v is not modified, so when u is no longer even at the completion of
the UEVEN state, the FSM moves to the SUBT state.
When u = 1 or v = 1, the binary inversion algorithm is completed
and the FSM moves to the FINAL state to convert the result to affine
coordinates ready to be output.
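The modified inversion described above can be sketched as follows. This is a software model of the binary inversion algorithm from [11] with the two modifications discussed in the text: halving is done as a right shift plus a conditional addition of (p + 1)/2, and every subtraction is reduced modulo p. The comments name the FSM states the steps correspond to; the function names halve_mod and binary_inverse are hypothetical, and the model does not capture the hardware's cycle-level behavior.

```python
def halve_mod(x, p):
    """Compute x / 2 mod p for odd p and 0 <= x < p:
    always shift right, and add (p + 1) / 2 only when x is odd."""
    return (x >> 1) + (((p + 1) >> 1) if x & 1 else 0)

def binary_inverse(a, p):
    """Return a^-1 mod p for an odd prime p and 0 < a < p."""
    assert 0 < a < p and p % 2 == 1
    u, v = a, p
    x1, x2 = 1, 0                     # invariants: a*x1 = u, a*x2 = v (mod p)
    while u != 1 and v != 1:
        while u % 2 == 0:             # UEVEN: halve u and x1
            u >>= 1
            x1 = halve_mod(x1, p)
        while v % 2 == 0:             # VEVEN: halve v and x2
            v >>= 1
            x2 = halve_mod(x2, p)
        if u >= v:                    # SUBT: both u and v are odd here
            u -= v
            x1 = (x1 - x2) % p        # always reduced, as in the AR block
        else:
            v -= u
            x2 = (x2 - x1) % p
    return x1 if u == 1 else x2       # already in [0, p): no final reduction
```

Because x1 and x2 stay in the range [0, p) throughout, the result needs no reduction at the end, which mirrors the second modification in the text.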
The WAIT state allows the AR block to complete its operation and
the result is output on the x3 and y3 ports of the ECP. At this point,
if the load signal is detected, the FSM moves immediately back to
the LOAD state to commence the next ECPM calculation, otherwise

it returns to the IDLE state to wait for the next trigger.
IV. IMPLEMENTATION RESULTS AND ANALYSIS
A. FPGA Implementation Results
The proposed design has been implemented for a Xilinx Virtex-5
XC5VLX110T FPGA and the post-place-and-route results are shown
in Table II. The proposed scalable ECP requires 3567 registers,
6115 lookup tables (LUTs), 1980 slices, two block RAMs (BRAM),
and seven DSP48E slices on the Virtex-5 FPGA. It can run at
a maximum frequency of 251.3 MHz and the critical path is in
the MULT block between a control signal that is not shown and
Shift Reg in Fig. 1. For comparison purposes, the implementation
result for Virtex-4 XC4VFX100 FPGA is also included. The
Virtex-4 implementation requires 3545 registers, 12 435 LUTs,
7020 slices, four BRAMs, and eight DSP48 slices. It can run at
182.0 MHz and the critical path is in the AR block between A reg
and the input of a DSP slice.
To compare the designs from the literature shown in Table II,
an efficiency metric is used. In this
brief, efficiency is evaluated as 1/(latency × number of slices).
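As a worked example of the metric, the sketch below evaluates 1/(latency × number of slices) for the Virtex-5 implementation, using the slice count and the five ECPM latencies reported in this brief. Expressing latency in seconds is an assumption made here; the units used for Table II are not stated in the text, so only ratios between designs should be read from these values. The names efficiency, VIRTEX5_SLICES, and LATENCY_MS are hypothetical.

```python
VIRTEX5_SLICES = 1980                     # slice count reported for Virtex-5
LATENCY_MS = {                            # ECPM latencies reported for Virtex-5
    "P-192": 1.709,
    "P-224": 2.652,
    "P-256": 3.951,
    "P-384": 11.81,
    "P-521": 28.04,
}

def efficiency(latency_ms, slices):
    """Efficiency = 1 / (latency in seconds * number of slices)."""
    return 1.0 / (latency_ms * 1e-3 * slices)

virtex5_efficiency = {
    curve: efficiency(ms, VIRTEX5_SLICES) for curve, ms in LATENCY_MS.items()
}
```

A higher value is better: halving either the latency or the slice count doubles the metric, so it rewards designs that are both fast and small.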
The ECP presented in [6] is one of the fastest ECPs known to the
authors with the ability to evaluate the ECPM for 224- and 256-bit
key sizes in 0.452 and 0.620 ms. However, the design in [6] can
only support one prime field at a time and it uses 26 or 32 DSP
slices, whereas our proposed design implements five prime curves
using only eight DSP slices (in Virtex-4). In addition, the design
in [6] does not implement the inversion operation, which would be
required to convert the projective coordinates back to affine. In our
design, the inversion algorithm is implemented with little overhead
since it uses the MULT and AR blocks with minimal modifications.
Even though the efficiency metric shows a lower performance for the
proposed design, in order for the design in [6] to support multiple
prime fields in the same hardware, it would require additional DSP
slices and much wider multiplexers in their modular multiplier and
reduction circuits.
Ghosh et al. [7] present two ECP architectures that are resistant
against side channel attacks using the double-and-add-always
algorithm for ECPM. The results presented in Table II are from their
design using two multipliers and two divider units, which has lower
latency. However, due to the use of additional hardware, the slice
count is higher and results in a lower maximum clock frequency.
The architecture in [7] is also one that only supports a single prime
field at a time. In comparison, the design in [7] has a
higher latency and a higher hardware resource utilization, which results
in a much lower efficiency compared with the proposed design.
The design in [8] most resembles the one proposed in this brief.
The authors propose an ECP that supports all five prime field curves
recommended by NIST, which are the same five curves supported
by the proposed design. The design in [8] uses a 265-bit datapath
to implement a modular adder/subtractor, an integer multiplier with a
521-bit datapath reductor, and a modular inverter. Due to its wide
datapath, the maximum clock frequency is only 60 MHz and it
uses 20 793 slices. In comparison, our proposed design can run at
a maximum frequency of 182.0 MHz on the same Virtex-4 FPGA
and only requires 7020 slices because of the optimized use of the
DSP slices. It also only requires eight DSP48 slices instead of 32.
The proposed design uses fewer hardware resources and has a
lower latency when evaluating the ECPM. Using the efficiency metric as
comparison, the proposed design is 3.56 times more efficient than
the design in [8].
V. CONCLUSION
In this brief, the architecture of a highly efficient and scalable
ECP has been presented. The proposed design takes advantage of
the high-performance DSP48E slices available on Xilinx FPGAs
to increase its performance. It parallelizes the PADD and PDBL
operations by separating the integer multiplication step from the
reduction step in prime field multiplication. By doing so, the addition,
subtraction, and reduction operations run in parallel with the integer
multiplication. The prime field inversion operation uses the binary
inversion algorithm, which is implemented with minimal modifications to the arithmetic blocks used for other operations.
The Virtex-5 implementation of the proposed ECP requires
1980 slices and seven DSP48E slices and runs at a maximum clock frequency of 251.3 MHz. It supports all five prime fields recommended
by NIST without the need to reconfigure the hardware. The proposed

ECP can evaluate the ECPM for P-192, P-224, P-256, P-384, and
P-521 in 1.709, 2.652, 3.951, 11.81, and 28.04 ms, respectively.
To the best of the authors' knowledge, the proposed scalable ECP is the
fastest and smallest ECP that supports all five NIST recommended
prime fields without the need to reconfigure the hardware.
REFERENCES
[1] V. Miller, "Use of elliptic curves in cryptography," in Proc. Adv.
Cryptol. (CRYPTO), 1986, pp. 417–426.
[2] N. Koblitz, "Elliptic curve cryptosystems," Math. Comput., vol. 48,
no. 177, pp. 203–209, 1987.
[3] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital
signatures and public-key cryptosystems," Commun. ACM, vol. 21, no. 2,
pp. 120–126, Feb. 1978.
[4] National Security Agency. (Jan. 2009). The Case for Elliptic Curve Cryptography. [Online]. Available:
http://www.nsa.gov/business/programs/elliptic_curve.shtml
[5] Recommended Elliptic Curves for Federal Government Use, National
Institute of Standards and Technology, Gaithersburg, MD, USA,
Jul. 1999.
[6] T. Guneysu and C. Paar, "Ultra high performance ECC over NIST
primes on commercial FPGAs," in Cryptographic Hardware and
Embedded Systems (Lecture Notes in Computer Science), vol. 5154,
E. Oswald and P. Rohatgi, Eds. Berlin, Germany: Springer-Verlag, 2008,
pp. 62–78.
[7] S. Ghosh, M. Alam, D. R. Chowdhury, and I. S. Gupta, "Parallel crypto-devices
for GF(p) elliptic curve multiplication resistant against side
channel attacks," Comput. Elect. Eng., vol. 35, no. 2, pp. 329–338, 2008.
[8] K. Ananyi, H. Alrimeih, and D. Rakhmatov, "Flexible hardware processor
for elliptic curve cryptography over NIST prime fields," IEEE Trans.
Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 8, pp. 1099–1112,
Aug. 2009.
[9] Xilinx. (Jan. 2012). Virtex-5 FPGA XtremeDSP Design Considerations: User Guide. [Online]. Available:
http://www.xilinx.com/support/documentation/user_guides/ug193.pdf
[10] J. Groschadl, S. Tillich, P. Ienne, L. Pozzi, and A. K. Verma, "When
instruction set extensions change algorithm design: A study in elliptic
curve cryptography," in Proc. 4th Workshop Appl. Specific Process.,
2005, pp. 2–9.
[11] D. Hankerson, A. Menezes, and S. Vanstone, Guide to Elliptic Curve
Cryptography. New York, NY, USA: Springer-Verlag, 2004.
