Efficient Implementation of Rijndael Encryption in Reconfigurable Hardware: Improvements and Design Tradeoffs
OUTLINE
Performance evaluation of the AES algorithm
Effective FPGA implementation
Heuristics to evaluate hardware efficiency
Arrive at the optimum throughput/area efficiency
Optimum: throughput = 18.5 Gbps, area = 542 slices, 10 RAM blocks
Hardware Description
Xilinx Virtex-E
32448 slices
64896 LUTs and flip-flops (2 of each per slice)
208 RAM blocks
Synthesis: Synopsys
Circuit modeling: VHDL
Hardware Description
2 slices per CLB
Slice = 2 logic cells (LC)
LC = one 4-input LUT + storage element + additional logic
Storage element: latch or edge-triggered D flip-flop
Additional logic: multiplexers F5, F6
Arithmetic logic: carry (CY) logic + XOR + AND
Evaluation Parameters
In terms of resources
Throughput: bits processed per second
Area: slices
The throughput/area ratio is an evaluation parameter
Number of LUTs
Number of registers
Their ratio is an evaluation parameter
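The throughput/area parameter can be sketched as a one-line calculation. This is a minimal illustration, not the authors' tooling: the function name `throughput_per_area` is hypothetical, the unit (Mbps per slice) is an assumption, and RAM blocks are ignored in the area term. The inputs are the optimum figures quoted in the outline.

```python
def throughput_per_area(throughput_mbps: float, area_slices: int) -> float:
    """Return throughput per unit area, assumed here to be Mbps per slice."""
    if area_slices <= 0:
        raise ValueError("area must be a positive number of slices")
    return throughput_mbps / area_slices


# Figures from the outline: 18.5 Gbps in 542 slices (the 10 RAM blocks
# are not counted in this simple ratio, which is an assumption).
outline_ratio = throughput_per_area(18.5 * 1000, 542)
print(f"{outline_ratio:.2f} Mbps per slice")
```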
Encryption Block
Input: 128-bit blocks
State transformed:
$s[r, c] = \mathit{in}[r + 4c]$
$\mathit{out}[r + 4c] = s[r, c]$
$0 \le r < 4$, $0 \le c < Nb$ ($Nb = 4$)
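The byte-to-state mapping can be illustrated with a short sketch. The function names `bytes_to_state` and `state_to_bytes` are hypothetical; the indexing follows the relations on this slide, with the state held as 4 rows by Nb = 4 columns.

```python
NB = 4  # number of state columns for a 128-bit block


def bytes_to_state(block: bytes) -> list:
    """Load a 16-byte block column by column: s[r][c] = in[r + 4c]."""
    if len(block) != 4 * NB:
        raise ValueError("AES operates on 16-byte blocks")
    return [[block[r + 4 * c] for c in range(NB)] for r in range(4)]


def state_to_bytes(state: list) -> bytes:
    """Unload the state in the same order: out[r + 4c] = s[r][c]."""
    out = bytearray(4 * NB)
    for r in range(4):
        for c in range(NB):
            out[r + 4 * c] = state[r][c]
    return bytes(out)


block = bytes(range(16))
state = bytes_to_state(block)
print(state[1][2])  # input byte at index 1 + 4*2
```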
Implementation
MixColumns
Architectural
Sbox Table
Mux Model
Realization on FPGA
LUT based
4-input, 4-output lookup
Four 4-input, 1-output LUTs coupled to a 4:1 mux
4:1 mux realized through three 2:1 muxes
[Figure: 4:1 multiplexer with data inputs a, b, c, d and select lines s0, s1]
1-bit output
Repeated 16 times and looped 16 times
Critical path: LUT4 + MUXF5 + MUXF6
2-level pipelining
12 clock pulses
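The realization of a 4:1 mux from three 2:1 muxes can be modeled in a few lines. This is a behavioral sketch only, not the VHDL used in the design; the function names and the convention that s0 selects within a pair while s1 selects between pairs are assumptions.

```python
def mux2(a: int, b: int, sel: int) -> int:
    """2:1 multiplexer: output a when sel = 0, b when sel = 1."""
    return b if sel else a


def mux4(a: int, b: int, c: int, d: int, s0: int, s1: int) -> int:
    """4:1 multiplexer built from three 2:1 multiplexers."""
    low_pair = mux2(a, b, s0)    # first-level mux on (a, b)
    high_pair = mux2(c, d, s0)   # first-level mux on (c, d)
    return mux2(low_pair, high_pair, s1)  # second-level mux picks the pair


# Exhaustive check against direct indexing of the four inputs.
inputs = (10, 11, 12, 13)
selected = [mux4(*inputs, s0, s1) for s1 in (0, 1) for s0 in (0, 1)]
print(selected)
```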
Implementation
Multiplexer Model
RAM based
MixColumns
Architectural
Lookup type
BRAM: two single-port 256 x 8-bit RAMs
Write enable of the RAM held low
Input held low
ROM implemented
1 clock
Design
S-box = 16 x 16 x 8 = 2048 bits = 2 Kbit
16 S-boxes for each state
1 BRAM = two 2 Kbit RAMs
Hence 8 BRAMs required
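The block RAM budget above is plain arithmetic and can be checked as follows. The constant names are hypothetical; the values are the ones stated on this slide.

```python
SBOX_ROWS = 16
SBOX_COLS = 16
SBOX_WIDTH_BITS = 8
SBOXES_PER_STATE = 16   # one S-box per state byte
SBOXES_PER_BRAM = 2     # one BRAM used as two 2 Kbit RAMs

sbox_bits = SBOX_ROWS * SBOX_COLS * SBOX_WIDTH_BITS
brams_required = SBOXES_PER_STATE // SBOXES_PER_BRAM

print(sbox_bits, "bits per S-box,", brams_required, "BRAMs per state")
```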
Implementation
Multiplexer Model
RAM based
MixColumns
Composite field
MixColumns transform Mixadd transform
Architectural
Byte representation in the Galois field $GF(2^8)$
E.g. 01100011 is $x^6 + x^5 + x + 1$
Addition: modulo-2 arithmetic (no subtraction)
Multiplication: polynomial multiplication modulo an irreducible polynomial (degree 8)
$m(x) = x^8 + x^4 + x^3 + x + 1$
Multiplicative inverse
$b(x)a(x) + m(x)c(x) = 1$
$b^{-1}(x) = a(x) \bmod m(x)$, because $a(x) \cdot b(x) \bmod m(x) = 1$
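The field arithmetic above can be sketched in software. Bytes are treated as polynomials over GF(2) and reduced modulo m(x), written as 0x11B. The function names are hypothetical, and the inverse is computed by exponentiation (the 254th power) rather than by the extended Euclidean relation on the slide; both give the same result for nonzero bytes.

```python
AES_MODULUS = 0x11B  # m(x) = x^8 + x^4 + x^3 + x + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes as polynomials over GF(2), modulo m(x)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        a <<= 1                  # multiply a by x
        if a & 0x100:
            a ^= AES_MODULUS     # reduce when degree reaches 8
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse via a^254, since a^255 = 1 for nonzero a."""
    if a == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


product = gf_mul(0x57, 0x83)
print(hex(product), hex(gf_inv(0x63)))
```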
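$GF(2^8) = GF((2^4)^2)$
An element is written $a_1 x + a_0$, with $a_0, a_1 \in GF(2^4)$
$x$ is a root of $x^2 + x + \lambda = 0$, where $\lambda$ is a constant in $GF(2^4)$
The inverse $b_1 x + b_0$ is given by
$b_0 = (a_0 + a_1)\Delta^{-1}$
$b_1 = a_1 \Delta^{-1}$
$\Delta = a_0 (a_0 + a_1) + \lambda a_1^2$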
Affine transformation = linear transformation + translation
Linear transformation = rotations, scaling, shear
Translation = shift
In AES
Implementation
Multiplexer Model
RAM based Composite field
MixColumns
MixColumns transform
Mixadd transform
Architectural
Four-term polynomials
Coefficients are bytes
$M(x) = x^4 + 1$
Product defined as $a(x) \otimes b(x) = d(x)$
Solution
Multiplication of a $GF(2^8)$ polynomial by $x$ = multiplication by 02 = left shift plus conditional XOR (based on the MSB)
To implement: $03 \cdot a_1 = (02 + 01) \cdot a_1 = 02 \cdot a_1 + a_1$
Hence we have:
2 multiplications by $x$ (of $a_0$ and $a_1$)
XOR addition of 5 terms: the two products above + $a_1 + a_2 + a_3$
2-level pipelined
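The decomposition above can be sketched for one state column. Multiplication by 02 is a left shift with a conditional XOR of 0x1B when the MSB was set, and 03·a is formed as 02·a XOR a, so each output byte is an XOR of five terms. The function names `xtime` and `mix_column` are hypothetical; the check uses the first column of the MixColumns example in FIPS-197 Appendix B.

```python
def xtime(a: int) -> int:
    """Multiply a byte by x (i.e. by 02) in GF(2^8)."""
    a <<= 1
    if a & 0x100:        # MSB was set before the shift
        a ^= 0x11B       # reduce modulo x^8 + x^4 + x^3 + x + 1
    return a & 0xFF


def mix_column(col: list) -> list:
    """MixColumns on one column using only xtime and XOR."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ xtime(a1) ^ a1 ^ a2 ^ a3,   # 02 a0 + 03 a1 + a2 + a3
        a0 ^ xtime(a1) ^ xtime(a2) ^ a2 ^ a3,   # a0 + 02 a1 + 03 a2 + a3
        a0 ^ a1 ^ xtime(a2) ^ xtime(a3) ^ a3,   # a0 + a1 + 02 a2 + 03 a3
        xtime(a0) ^ a0 ^ a1 ^ a2 ^ xtime(a3),   # 03 a0 + a1 + a2 + 02 a3
    ]


mixed = mix_column([0xD4, 0xBF, 0x5D, 0x30])
print([hex(b) for b in mixed])
```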
Implementation
Multiplexer Model
RAM based Composite field
MixColumns
MixColumns transform
Mixadd transform
Architectural
Inside X(a0) or X(a1): mostly a shift operator
In both bytes the XOR is applied to only 3 bits
So these three bits are added separately
Now pipelined
Combined with key addition
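The observation that X(a) is mostly wiring can be made concrete at the bit level. In the sketch below, multiplication by x moves every bit up one position; the old MSB is fed back into bit 0 directly and is XORed into bits 1, 3 and 4 only, which matches the three XORed bits mentioned above. The function names are hypothetical, and the bit positions follow from the AES reduction constant 0x1B rather than from the slides.

```python
def xtime_bitwise(a: int) -> int:
    """Multiply by x using explicit per-bit wiring and three XOR gates."""
    bits = [(a >> i) & 1 for i in range(8)]
    msb = bits[7]
    out = [
        msb,            # bit 0: shifted-in zero replaced by the MSB (no gate)
        bits[0] ^ msb,  # bit 1: XOR
        bits[1],        # bit 2: pure shift
        bits[2] ^ msb,  # bit 3: XOR
        bits[3] ^ msb,  # bit 4: XOR
        bits[4],        # bit 5: pure shift
        bits[5],        # bit 6: pure shift
        bits[6],        # bit 7: pure shift
    ]
    return sum(bit << i for i, bit in enumerate(out))


def xtime_reference(a: int) -> int:
    """Reference: left shift, then conditional XOR with 0x1B."""
    shifted = (a << 1) & 0xFF
    return shifted ^ 0x1B if a & 0x80 else shifted


mismatches = [a for a in range(256) if xtime_bitwise(a) != xtime_reference(a)]
print(len(mismatches))
```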
Implementation
Multiplexer Model
RAM based Composite field
MixColumns
Pipelining Sub-Pipelining
Unrolled Architecture
Implementation
Multiplexer Model
RAM based Composite field
MixColumns
Architectural
Loop unrolling
Pipelining
Sub-Pipelining
Pipelined Architecture - I
Only one round at a time
Hardware reduced
Throughput reduced
Area reduced
Pipelined Architecture - II
All 10 rounds taken inside the loop
Loss of the Mixadd combination
Additional mux
Good choice in ASIC
Heuristic optimization
Results
Pipelined -I architecture
Unrolled Architecture
Results Contd
Comparison
RAM/unrolled RAM/pipelined
Mux/pipelined
composite/pipelined
Summary
http://www.cs.bc.edu/~straubin/cs38105/blockciphers/rijndael_ingles2004.swf
Conclusion
Algorithmic and architectural design tradeoffs were evaluated
Optimum design principle found through heuristics
Throughput = 1563 Mbps
Performance (throughput/area) = 0.69
Phase 2 preview