Ananyi et al. [8] proposed an ECP that supports all five NIST-recommended
prime fields and computes the ECPM for the five prime
field curves in between 4.8 and 45.6 ms. Many of the ECPs in the
abovementioned works have very high hardware resource utilization
because of their wide datapaths and, as a result, low maximum clock
frequencies.
The main contribution of this brief is the novel hardware
architecture of the finite field arithmetic units that take advantage of
the DSP48E slices [9] in the Xilinx FPGAs to improve the efficiency
of the ECPM operation. These DSP48E slices are hardwired
arithmetic units that can operate at much higher frequencies than
the FPGA fabric. Thus, by optimizing the arithmetic operations to
utilize the DSP48E slices, the proposed design is more efficient
than the designs in the current literature. The proposed ECP is also
able to implement the prime field inversion algorithm efficiently
using the same arithmetic units. Since the resultant scalable ECP
can support ECPM for prime fields from 192 to 521 bits, it can
satisfy the demand for higher security levels in the future, while
being compatible with lower key sizes presently used. The use of
FPGAs also allows for quick modifications to include additional
curves should they be required in the future.
II. ELLIPTIC CURVE CRYPTOGRAPHY OVER PRIME FIELDS
I. INTRODUCTION
Elliptic curve cryptography (ECC) has gained increasing
attention in the past few years as more efficient public-key
cryptography algorithms are needed for the growing number
of secure transactions over the network. Originally proposed by
Miller [1] and Koblitz [2], ECC implementations have been shown to be
more efficient than other public-key cryptography algorithms, such
as Rivest-Shamir-Adleman (RSA) [3].
The most important operation in ECC protocols is the elliptic curve
point multiplication (ECPM). Due to the complexity of the ECPM,
many applications opt to offload the ECPM to a hardware accelerator,
also referred to as an ECC processor (ECP). The United States
National Security Agency (NSA) [4] indicated that classified
and unclassified information in the NSA will move toward using ECC
over prime fields of key sizes 256, 384, and 521 bits. Thus, it is
important for modern ECP implementations to support large prime
fields efficiently. For server-side applications, it is also important for
the ECP to support a variety of elliptic curves to be compatible with
devices with different security needs. For example, for the elliptic curve
digital signature algorithm (ECDSA), the ECP can verify various signature
requests with different settings using the same hardware. Thus, this
brief presents an ECP implementation that supports all five prime
field curves recommended by the National Institute of Standards and
Technology (NIST) [5].
Other researchers have also implemented ECPs over prime
fields. Guneysu and Paar [6] optimized the architecture of the
ECP using high-performance DSP slices on field-programmable
gate arrays (FPGAs). Ghosh et al. [7] developed a side-channel
attack-resistant ECP using the double-and-add-always algorithm.
Manuscript received May 30, 2014; revised September 8, 2014 and
October 22, 2014; accepted November 23, 2014. This work was supported by
the National Science and Engineering Research Council of Canada through
the Alexander Graham Bell Canada Graduate Scholarship.
The authors are with the Department of Electrical and Computer
Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A8, Canada
(e-mail: c.loi@usask.ca; seokbum.ko@usask.ca).
Digital Object Identifier 10.1109/TVLSI.2014.2375640
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
TABLE I
OPERATION SEQUENCE FOR THE ADDITION/SUBTRACTION/REDUCTION (AR) BLOCK FOR p192
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
TABLE II
RESULTS COMPARISON
bytes at the time, eliminating the need to feed back the output of the
RAM to the input.
In the proposed architecture, the MULT and AR block operations
are performed in parallel. In addition, the latency of the MULT
operation is long enough that multiple AR operations can be executed
during each MULT operation. Since the C_msd output of MULT
connects directly to the A_msd input of AR, the RAM does not
need to store the product of an integer multiplication, which would
require the RAM to be double in size. However, since the result of
the MULT block is not stored, it must be immediately followed by
an AR operation to reduce the product.
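
As an illustration of the reduction step that must follow each integer multiplication, the following is a minimal software sketch for p192 only. It uses the standard NIST fast reduction for p192 = 2^192 - 2^64 - 1 as described in [11], which needs only additions of 64-bit word combinations and a few subtractions of p. The function name and word layout are assumptions made for this example; it does not reproduce the exact operation sequence of the AR block.

```python
P192 = 2**192 - 2**64 - 1  # NIST P-192 prime

def reduce_p192(c: int) -> int:
    """Reduce an integer product c < P192**2 modulo P192 (illustrative sketch)."""
    mask = (1 << 64) - 1
    # Split the (up to) 384-bit product into six 64-bit words c0..c5.
    w = [(c >> (64 * i)) & mask for i in range(6)]

    def join(hi: int, mid: int, lo: int) -> int:
        # Assemble a 192-bit value from three 64-bit words.
        return (hi << 128) | (mid << 64) | lo

    # Because 2**192 = 2**64 + 1 (mod P192), the upper words fold down as:
    s1 = join(w[2], w[1], w[0])
    s2 = join(0, w[3], w[3])
    s3 = join(w[4], w[4], 0)
    s4 = join(w[5], w[5], w[5])

    r = s1 + s2 + s3 + s4
    # The sum is below 4 * 2**192, so only a few subtractions are needed.
    while r >= P192:
        r -= P192
    return r
```

In software the integer product `a * b` would be passed directly to `reduce_p192`, which mirrors the way the unreduced MULT output is handed straight to a reduction step instead of being stored.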
2) Finite-State Machine and ECPM Operation: The finite-state
machine (FSM) of the design is stored in the controller in Fig. 3
and is shown in Fig. 4. The machine resets to the IDLE state, where
it waits for the load signal to initiate the ECPM. Once triggered,
the FSM moves to the LOAD state for initialization and goes into
the PDBL state. The FSM then moves to the PADD state only if the
current bit of k (cur_k) is 1. Otherwise, it will restart the PDBL state.
The FSM stays in the PDBL and PADD states until the double-and-add
algorithm is complete (k_count = 0); it then exits to the
INVS state, which sets up the inversion operation.
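
The control flow described above can be sketched in software as a left-to-right double-and-add loop that records the state names it visits. This is only a behavioral model under stated assumptions: the function name is hypothetical, the leading 1 bit of k is assumed to be consumed in the LOAD state, and the point operations are passed in as callables so that the sketch stays independent of any coordinate system.

```python
def ecpm_state_trace(k, dbl, add, P):
    """Model the IDLE/LOAD/PDBL/PADD/INVS sequence for a scalar k > 0.

    dbl(Q) and add(Q, P) are caller-supplied point doubling and addition.
    Returns the final accumulator and the list of visited state names.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    trace = ["IDLE", "LOAD"]
    Q = P                          # assumption: LOAD initializes Q with P
    k_count = k.bit_length() - 1   # bits of k still to be processed
    while k_count > 0:
        k_count -= 1
        trace.append("PDBL")       # a doubling is performed for every bit
        Q = dbl(Q)
        cur_k = (k >> k_count) & 1
        if cur_k:                  # PADD is entered only when cur_k is 1
            trace.append("PADD")
            Q = add(Q, P)
    trace.append("INVS")           # k_count = 0: set up the inversion
    return Q, trace


# Toy check: model the "point" n*P by the integer n, so the result equals k.
result, states = ecpm_state_trace(0b1011, lambda q: 2 * q, lambda q, p: q + p, 1)
```

With the integer stand-in for points, the accumulator ends at 11 for k = 0b1011, and the trace contains one PDBL per remaining bit and one PADD per remaining 1 bit.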
Prime field inversion is evaluated using the binary algorithm for
inversion shown in [11], but modified to fit the proposed architecture.
There are two other major modifications in the algorithm. In the
original algorithm, if $x_1$ is even then $x_1 = x_1/2$; otherwise,
$x_1 = (x_1 + p)/2$. In this brief, the operation is made more hardware
friendly: it always performs $\lfloor x_1/2 \rfloor$ and adds $(p + 1)/2$ only
if $x_1$ is odd. These operations are equivalent because, if $x_1$ is odd,
$\lfloor x_1/2 \rfloor + (p + 1)/2 = (x_1 - 1)/2 + (p + 1)/2 = (x_1 + p)/2$. The
second modification is that reduction is always executed because the
AR block always reduces its result modulo the prime. Thus, the values are
always reduced modulo p and no reduction is necessary at the end of the
algorithm.
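
A minimal software sketch of the binary inversion algorithm of [11] with the two modifications above is given here for illustration. The function names are hypothetical, p is assumed to be an odd prime, and the modular subtraction stands in for the always-reducing AR block; the sketch is not a description of the hardware datapath.

```python
def halve_mod(x: int, p: int) -> int:
    """Compute x/2 mod p for odd p: shift right, add (p + 1)/2 only if x is odd."""
    h = x >> 1                     # floor(x / 2)
    if x & 1:
        h += (p + 1) >> 1          # correction term for odd x
    return h % p                   # result is always kept reduced modulo p

def binary_inverse(a: int, p: int) -> int:
    """Return a^(-1) mod p for an odd prime p and 0 < a < p."""
    if not 0 < a < p:
        raise ValueError("a must satisfy 0 < a < p")
    u, v = a, p
    x1, x2 = 1, 0
    while u != 1 and v != 1:
        while u % 2 == 0:          # strip factors of two from u
            u >>= 1
            x1 = halve_mod(x1, p)
        while v % 2 == 0:          # strip factors of two from v
            v >>= 1
            x2 = halve_mod(x2, p)
        if u >= v:
            u -= v
            x1 = (x1 - x2) % p     # every intermediate value stays in [0, p)
        else:
            v -= u
            x2 = (x2 - x1) % p
    # x1 and x2 are already reduced, so no final reduction step is needed.
    return x1 if u == 1 else x2
```

For example, `binary_inverse(a, p)` multiplied by `a` and reduced modulo `p` should give 1 for any valid `a`.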
Further analyzing the inversion algorithm, one can also notice that
since v is initialized to p, which is an odd number, the INVS state
will move into the UEVEN state if u is even or SUBT state if u is
odd. When entering the SUBT state, both u and v must be odd, so at
the end of the SUBT state, only one of u or v is odd. Therefore, either
the UEVEN or the VEVEN state is entered. If the UEVEN state is
entered, u must be even and v must be odd. During the UEVEN state,
v is not modified, so when u is no longer even at the completion of
the UEVEN state, the FSM moves to the SUBT state.
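
The parity argument above can be checked with a small behavioral model that tracks only u and v and records the state names. This is a sketch under assumptions: the function name is hypothetical, the updates of the other inversion variables are omitted because they do not affect the control flow, p is assumed to be an odd prime, and the completion test is assumed to happen before each state is entered.

```python
def inversion_states(a: int, p: int) -> list:
    """Trace INVS/UEVEN/VEVEN/SUBT/FINAL for the inversion of a modulo odd prime p."""
    if not 0 < a < p:
        raise ValueError("a must satisfy 0 < a < p")
    u, v = a, p                    # v starts at p, which is odd
    trace = ["INVS"]
    state = "UEVEN" if u % 2 == 0 else "SUBT"
    while u != 1 and v != 1:
        trace.append(state)
        if state == "SUBT":
            # On entry to SUBT both u and v are odd.
            assert u % 2 == 1 and v % 2 == 1
            if u >= v:
                u -= v
            else:
                v -= u
            # After the subtraction exactly one of u and v is even.
            assert (u % 2 == 0) != (v % 2 == 0)
            state = "UEVEN" if u % 2 == 0 else "VEVEN"
        elif state == "UEVEN":
            while u % 2 == 0:      # v is left untouched in this state
                u >>= 1
            state = "SUBT"
        else:                      # VEVEN
            while v % 2 == 0:      # u is left untouched in this state
                v >>= 1
            state = "SUBT"
    trace.append("FINAL")
    return trace
```

For a = 6 and p = 7 the model visits INVS, UEVEN, SUBT, VEVEN, and FINAL, and the internal assertions confirm the parity claims along the way.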
When u = 1 or v = 1, the binary inversion algorithm is completed
and the FSM moves to the FINAL state to convert the result to affine
coordinates ready to be output.
The WAIT state allows the AR block to complete its operation and
the result is output on the x3 and y3 ports of the ECP. At this point,
if the load signal is detected, the FSM moves immediately back to
design using two multipliers and two divider units, which has lower
latency. However, due to the use of additional hardware, the slice
count is higher and results in a lower maximum clock frequency.
The architecture in [7] is also one that only supports a single prime
field at a time. In comparison, the latency of the design in [7] is
higher and has a higher hardware resource utilization, which results
in much lower efficiencies compared with the proposed design.
The design in [8] most resembles the one proposed in this brief.
The authors propose an ECP that supports all five prime field curves
recommended by NIST, which are the same five curves supported
by the proposed design. The design in [8] uses a 265-bit datapath
to implement a modular adder/subtractor, an integer multiplier with a
521-bit datapath reductor, and a modular inverter. Due to its wide
datapath, the maximum clock frequency is only 60 MHz and it
uses 20 793 slices. In comparison, our proposed design can run at
a maximum frequency of 182.0 MHz on the same Virtex-4 FPGA
and only requires 7020 slices because of the optimized use of the
DSP slices. It also only requires eight DSP48 slices instead of 32.
The proposed design uses fewer hardware resources and has a
lower latency when evaluating the ECPM. Using the efficiency metric for
comparison, the proposed design is 3.56 times more efficient than
the design in [8].
V. CONCLUSION
In this brief, the architecture of a highly efficient and scalable
ECP has been presented. The proposed design takes advantage of
the high-performance DSP48E slices available on Xilinx FPGAs
to increase its performance. It parallelizes the PADD and PDBL
operations by separating the integer multiplication step from the
reduction step in prime field multiplication. By doing so, the addition,
subtraction, and reduction operations run in parallel with the integer
multiplication. The prime field inversion operation uses the binary
inversion algorithm, which is implemented with minimal modifications to the arithmetic blocks used for other operations.
The Virtex-5 implementation of the proposed ECP requires
1980 slices and DSP48E slices and runs at a maximum clock frequency of 251.3 MHz. It supports all five prime fields recommended
REFERENCES
[1] V. Miller, "Use of elliptic curves in cryptography," in Proc. Adv.
Cryptol. (CRYPTO), 1986, pp. 417–426.
[2] N. Koblitz, "Elliptic curve cryptosystems," Math. Comput., vol. 48,
no. 177, pp. 203–209, 1987.
[3] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital
signatures and public-key cryptosystems," Commun. ACM, vol. 21, no. 2,
pp. 120–126, Feb. 1978.
[4] National Security Agency. (Jan. 2009). The Case for Elliptic Curve
Cryptography. [Online]. Available:
http://www.nsa.gov/business/programs/elliptic_curve.shtml
[5] Recommended Elliptic Curves for Federal Government Use, National
Institute of Standards and Technology, Gaithersburg, MD, USA,
Jul. 1999.
[6] T. Guneysu and C. Paar, "Ultra high performance ECC over NIST
primes on commercial FPGAs," in Cryptographic Hardware and
Embedded Systems (Lecture Notes in Computer Science), vol. 5154,
E. Oswald and P. Rohatgi, Eds. Berlin, Germany: Springer-Verlag, 2008,
pp. 62–78.
[7] S. Ghosh, M. Alam, D. R. Chowdhury, and I. S. Gupta, "Parallel
crypto-devices for GF(p) elliptic curve multiplication resistant against side
channel attacks," Comput. Elect. Eng., vol. 35, no. 2, pp. 329–338, 2008.
[8] K. Ananyi, H. Alrimeih, and D. Rakhmatov, "Flexible hardware
processor for elliptic curve cryptography over NIST prime fields," IEEE Trans.
Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 8, pp. 1099–1112,
Aug. 2009.
[9] Xilinx. (Jan. 2012). Virtex-5 FPGA XtremeDSP Design Considerations:
User Guide. [Online]. Available:
http://www.xilinx.com/support/documentation/user_guides/ug193.pdf
[10] J. Groschadl, S. Tillich, P. Ienne, L. Pozzi, and A. K. Verma, "When
instruction set extensions change algorithm design: A study in elliptic
curve cryptography," in Proc. 4th Workshop Appl. Specific Process.,
2005, pp. 2–9.
[11] D. Hankerson, A. Menezes, and S. Vanstone, Guide to Elliptic Curve
Cryptography. New York, NY, USA: Springer-Verlag, 2004.