FPGA Implementation of the Advanced Encryption Standard

Srihari Sridharan October 22nd 2007

Efficient Implementation of Rijndael Encryption in Reconfigurable Hardware: Improvements and Design Tradeoffs

François-Xavier Standaert, Gaël Rouvroy, Jean-Jacques Quisquater, and Jean-Didier Legat

CHES Springer-Verlag Berlin Heidelberg 2003

OUTLINE

Performance evaluation of the AES algorithm
Effective FPGA implementation
Heuristics to evaluate hardware efficiency
Arriving at the optimum throughput/area efficiency
Best results: throughput up to 18.5 Gbps; area as low as 542 slices and 10 RAM blocks

Hardware Description

Xilinx Virtex-E
32448 slices
64896 LUTs and flip-flops (two of each per slice)
208 RAM blocks
Synthesis: Synopsys
Circuit modeling: VHDL

CLB: 2 slices
Slice: 2 logic cells (LC)
LC: one 4-input LUT + storage element + additional logic
Storage element: latch or edge-triggered D flip-flop
Additional logic: multiplexers F5, F6
Arithmetic logic: carry (CY) logic + XOR + AND

Evaluation Parameters

Two types of evaluation parameters

In terms of performance
Throughput: bits processed per second
Area: slices
The throughput/area ratio is the evaluation parameter

In terms of resources
Number of LUTs
Number of registers
The LUT/register ratio is the evaluation parameter
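
The two ratios above can be computed directly from a design report. The following is a minimal sketch; the function names and the sample numbers are hypothetical and are not results from the paper.

```python
def throughput_per_slice(throughput_mbps, slices):
    """Performance metric: throughput divided by area (Mbps per slice)."""
    return throughput_mbps / slices


def lut_register_ratio(n_luts, n_registers):
    """Resource metric: number of LUTs divided by number of registers."""
    return n_luts / n_registers


# Hypothetical design point, for illustration only.
perf = throughput_per_slice(1280.0, 640)
balance = lut_register_ratio(900, 1000)
print(perf, balance)
```

A LUT/register ratio near 1 suggests that the LUT and the flip-flop in each logic cell are both being used, which is one way to read the resource metric.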

Encryption Block

Plain Text - Block Ciphers


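Input: 128-bit blocks
The input is copied into the State array, which the cipher transforms: $s[r, c] = in[r + 4c]$
The State is copied to the output: $out[r + 4c] = s[r, c]$
for $0 \le r < 4$, $0 \le c < Nb$ ($Nb = 4$)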
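
The index mapping between the 16 input bytes and the 4 x 4 State can be sketched as follows. This is an illustration of the column-major layout; the helper names `bytes_to_state` and `state_to_bytes` are assumptions, not names from the slides.

```python
NB = 4  # number of 32-bit columns in the State (fixed to 4 in AES)


def bytes_to_state(block):
    """Copy a 16-byte block into the State: state[r][c] = block[r + 4*c]."""
    if len(block) != 4 * NB:
        raise ValueError("AES operates on 128-bit (16-byte) blocks")
    return [[block[r + 4 * c] for c in range(NB)] for r in range(4)]


def state_to_bytes(state):
    """Copy the State back out: out[r + 4*c] = state[r][c]."""
    out = bytearray(4 * NB)
    for r in range(4):
        for c in range(NB):
            out[r + 4 * c] = state[r][c]
    return bytes(out)


block = bytes(range(16))
state = bytes_to_state(block)
print(state[0])  # first row holds input bytes 0, 4, 8, 12
```

Consecutive input bytes fill a column, not a row, which is why the first row collects every fourth byte.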

Implementation

Two types of optimization

Algorithmic
SBox: multiplexer model, RAM based, composite field
MixColumns: MixColumns transform, Mixadd transform

Architectural
Loop unrolling, pipelining, sub-pipelining

SBOX - Mux Model

Sbox Table

Mux Model - Background

An N-input Boolean function G(x) is represented by

In AES Which is bit representation Implemented as

Mux Model

Realization on FPGA

LUT based
4-input, 4-output lookup: four 4-input, 1-output LUTs
Coupled with a 4:1 mux
Realizing the 4:1 mux through three 2:1 muxes
[Figure: 4:1 multiplexer with data inputs a, b, c, d and select lines s0, s1]
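
Building a 4:1 multiplexer out of three 2:1 multiplexers can be sketched behaviourally as below. This is a software model for illustration, not the VHDL used in the design; the function names are assumptions.

```python
def mux2(d0, d1, s):
    """2:1 multiplexer: returns d0 when s == 0, d1 when s == 1."""
    return d1 if s else d0


def mux4(a, b, c, d, s0, s1):
    """4:1 multiplexer built from three 2:1 multiplexers."""
    low = mux2(a, b, s0)    # first-level mux selects between a and b
    high = mux2(c, d, s0)   # first-level mux selects between c and d
    return mux2(low, high, s1)  # second-level mux picks one pair


print(mux4(0, 1, 0, 0, 1, 0))  # select input b
```

The select line s0 is shared by both first-level multiplexers, and s1 drives the final one.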

Mux Model - Implementation

Mux Model - Analysis


1-bit output
Repeated 16 times and looped 16 times
Critical path: LUT4 + MUXF5 + MUXF6
2-level pipelining
12 clock pulses
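
One output bit of an 8-input lookup can be modelled as sixteen 4-input LUTs whose outputs are selected by a tree of 2:1 multiplexers. The sketch below is a generic behavioural model under that assumption; it does not reproduce the exact mapping or pipeline cut of the design, and the truth table used is an arbitrary stand-in, not an S-box bit.

```python
def lut4(table16, x4):
    """4-input LUT: a 16-entry truth table indexed by 4 address bits."""
    return table16[x4]


def mux_tree(values, select_bits):
    """Reduce 2^k values through k levels of 2:1 muxes (select_bits[0] is the LSB)."""
    level = list(values)
    for s in select_bits:
        level = [level[i + 1] if s else level[i] for i in range(0, len(level), 2)]
    return level[0]


def eval8(truth256, x):
    """Evaluate an 8-input Boolean function given as a 256-entry truth table."""
    low, high = x & 0xF, x >> 4
    # Each LUT4 holds the slice of the truth table for one value of the high nibble.
    lut_outputs = [lut4(truth256[16 * k:16 * k + 16], low) for k in range(16)]
    select = [(high >> i) & 1 for i in range(4)]
    return mux_tree(lut_outputs, select)


# Arbitrary stand-in truth table (hypothetical, for illustration only).
truth = [bin((x * 167 + 13) & 0xFF).count("1") & 1 for x in range(256)]
print(all(eval8(truth, x) == truth[x] for x in range(256)))
```

In a Virtex-class slice the first two multiplexer levels would correspond to the dedicated F5 and F6 multiplexers named in the critical path above.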

SBOX RAM Based

Lookup type
BRAM: two single-port 256 x 8-bit RAMs
Write enable of the RAM held low
Input held low
ROM implemented
One-clock design

S-box = 16 x 16 x 8 = 2048 bits = 2 Kbit
16 S-boxes for each state
1 BRAM = two 2 Kbit RAMs
Hence 8 BRAMs required
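
The block RAM count follows from simple sizing arithmetic, sketched here with the figures given on the slide. The variable names are illustrative.

```python
import math

SBOX_BITS = 16 * 16 * 8   # 256 entries of 8 bits each
SBOXES_PER_STATE = 16     # one S-box lookup per State byte
SBOXES_PER_BRAM = 2       # one BRAM used as two single-port 256 x 8 RAMs

brams_per_round = math.ceil(SBOXES_PER_STATE / SBOXES_PER_BRAM)
print(SBOX_BITS, brams_per_round)
```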

Composite field - Math Basics

Byte representation in the Galois field $GF(2^8)$
E.g. 01100011 is $x^6 + x^5 + x + 1$
Addition: modulo-2 arithmetic, i.e. bitwise XOR (subtraction is the same operation)
Multiplication: polynomial multiplication modulo the irreducible polynomial (degree 8) $m(x) = x^8 + x^4 + x^3 + x + 1$
Multiplicative inverse

$b(x)a(x) + m(x)c(x) = 1$, hence $a(x)b(x) \bmod m(x) = 1$ and $b^{-1}(x) = a(x) \bmod m(x)$

E.g. $3m \equiv 1 \pmod{11}$ gives $3^{-1} \equiv m \pmod{11}$, here $m = 4$
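
The field operations above can be sketched in software as follows. This is a bit-serial reference model for checking values, not the hardware structure of the design; the function names are assumptions, and the inverse is computed by exponentiation for brevity.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a          # addition in GF(2^8) is XOR
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B      # reduce modulo m(x)
    return r


def gf_inv(a):
    """Multiplicative inverse as a^254; zero is mapped to zero."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


print(hex(gf_mul(0x57, 0x83)))
print(pow(3, -1, 11))  # integer analogue of the inverse (Python 3.8+)
```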

Composite model equations Multiplicative Inverse

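$GF(2^8) \cong GF((2^4)^2)$
Element: $a = a_1 x + a_0$ with $a_0, a_1 \in GF(2^4)$
$x$ is a root of $x^2 + x + \lambda = 0$
Inverse $b = b_1 x + b_0$ given by
$b_0 = (a_0 + a_1)\Delta^{-1}$
$b_1 = a_1 \Delta^{-1}$
$\Delta = a_0(a_0 + a_1) + \lambda a_1^2$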
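
The composite-field inverse formula can be checked numerically. The sketch below assumes $GF(2^4)$ is built with the polynomial $x^4 + x + 1$ and picks the first constant for which $x^2 + x + \lambda$ is irreducible; both choices are illustrative assumptions and may differ from the ones used in the design. It does not include the basis change to and from the AES field.

```python
def gf16_mul(a, b):
    """Multiply in GF(2^4) modulo x^4 + x + 1 (assumed field polynomial)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r


def gf16_inv(a):
    """Inverse in GF(2^4) as a^14; zero is mapped to zero."""
    r = 1
    for _ in range(14):
        r = gf16_mul(r, a)
    return r


# First lambda for which x^2 + x + lambda has no root in GF(2^4) (illustrative choice).
LAMBDA = next(
    lam for lam in range(1, 16)
    if all(gf16_mul(t, t) ^ t ^ lam for t in range(16))
)


def composite_mul(a, b):
    """Multiply (a1*x + a0) by (b1*x + b0) using x^2 = x + lambda."""
    (a1, a0), (b1, b0) = a, b
    hi = gf16_mul(a1, b1)
    x_coeff = hi ^ gf16_mul(a1, b0) ^ gf16_mul(a0, b1)
    const = gf16_mul(a0, b0) ^ gf16_mul(LAMBDA, hi)
    return (x_coeff, const)


def composite_inv(a):
    """Inverse of a1*x + a0: b1 = a1/delta, b0 = (a0 + a1)/delta."""
    a1, a0 = a
    delta = gf16_mul(a0, a0 ^ a1) ^ gf16_mul(LAMBDA, gf16_mul(a1, a1))
    d_inv = gf16_inv(delta)
    return (gf16_mul(a1, d_inv), gf16_mul(a0 ^ a1, d_inv))


ok = all(
    composite_mul((a1, a0), composite_inv((a1, a0))) == (0, 1)
    for a1 in range(16) for a0 in range(16) if (a1, a0) != (0, 0)
)
print(ok)
```

The appeal of this decomposition is that the only inversion left is in the 16-element field, which is small enough for a single level of 4-input LUTs.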

Composite field - Affine Transformation

Affine transformation = linear transformation + translation
Linear transformation = rotations, scaling, shear
Translation = shift
In AES
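
In AES the S-box is the field inverse followed by an affine transformation over GF(2). The sketch below uses the bit-level form given in FIPS-197 (each output bit is the XOR of five input bits and one bit of the constant 0x63); the function names are assumptions, and the inverse is computed by brute-force exponentiation, not by the composite-field circuit.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def gf_inv(a):
    """Inverse as a^254; zero is mapped to zero."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def affine(b):
    """AES affine map: b'_i = b_i ^ b_(i+4) ^ b_(i+5) ^ b_(i+6) ^ b_(i+7) ^ c_i, c = 0x63."""
    out = 0
    for i in range(8):
        bit = (b >> i) & 1
        for k in (4, 5, 6, 7):
            bit ^= (b >> ((i + k) % 8)) & 1   # linear part
        bit ^= (0x63 >> i) & 1                # translation by the constant
        out |= bit << i
    return out


def sbox(x):
    """S-box value: affine transformation of the multiplicative inverse."""
    return affine(gf_inv(x))


print(hex(sbox(0x00)), hex(sbox(0x53)))
```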

Composite field - implementation

Mixcolumns transform - Background


Four-term polynomials
Coefficients are bytes
Reduction polynomial: $M(x) = x^4 + 1$
Product defined as $a(x) \otimes b(x) = d(x)$

Mixcolumns transform - Equations

Solution

Multiplication of a $GF(2^8)$ polynomial by $x$ = multiplication by 02 = left shift plus a conditional XOR with 1B (based on the MSB)
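
Multiplication by 02 can be sketched in a few lines. The name `xtime` is the one commonly used for this operation in AES literature and is not taken from the slides.

```python
def xtime(a):
    """Multiply a byte by x (i.e. by 02) in GF(2^8)."""
    shifted = (a << 1) & 0xFF          # left shift, dropping the old MSB
    if a & 0x80:                       # MSB was set: reduce modulo m(x)
        shifted ^= 0x1B
    return shifted


print(hex(xtime(0x57)), hex(xtime(0xAE)))
```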

Mixcolumns transform - Implementation

To implement $03 \cdot a_1 = (02 + 01) \cdot a_1 = 02 \cdot a_1 + a_1$
Hence we have
2 multiplications by $x$: $X(a_0)$, $X(a_1)$
XOR addition of 5 terms: the two above + $a_1 + a_2 + a_3$
2-level pipelined
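
The decomposition above gives each output byte as an XOR of five terms, two of which are multiplications by $x$. A minimal software sketch, assuming the standard AES MixColumns coefficients (02, 03, 01, 01) and hypothetical function names:

```python
def xtime(a):
    """Multiply a byte by 02 in GF(2^8): left shift plus conditional XOR with 0x1B."""
    shifted = (a << 1) & 0xFF
    return shifted ^ 0x1B if a & 0x80 else shifted


def mix_column(col):
    """Apply MixColumns to one 4-byte column using only xtime and XOR."""
    out = []
    for i in range(4):
        a0, a1, a2, a3 = (col[(i + k) % 4] for k in range(4))
        # 02*a0 + 03*a1 + a2 + a3 = X(a0) + X(a1) + a1 + a2 + a3
        out.append(xtime(a0) ^ xtime(a1) ^ a1 ^ a2 ^ a3)
    return out


print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

Each output byte uses the same circuit with the column rotated by one position, so four copies cover a full column.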

Mixadd transform - Principle

Inside $X(a_0)$ or $X(a_1)$: mostly a shift operator
In both bytes the XOR is applied to only 3 bits (bits 1, 3 and 4)
So these three bits are added separately
Now pipelined
Combined with the key addition
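
The observation that multiplication by $x$ is mostly wiring can be checked in software: it equals a one-bit rotation, with only three bit positions needing an XOR with the old MSB. The sketch below verifies this against the shift-and-reduce definition; the function names are assumptions, and the merge with the key addition is not modelled.

```python
def xtime_reference(a):
    """Multiply by 02 in GF(2^8): left shift plus conditional XOR with 0x1B."""
    shifted = (a << 1) & 0xFF
    return shifted ^ 0x1B if a & 0x80 else shifted


def xtime_rotate(a):
    """Same result as a rotation plus an XOR on bits 1, 3 and 4 only."""
    msb = a >> 7
    rotated = ((a << 1) & 0xFF) | msb      # pure wiring: rotate left by one
    return rotated ^ (0x1A if msb else 0)  # 0x1A has bits 1, 3 and 4 set


print(all(xtime_rotate(a) == xtime_reference(a) for a in range(256)))
```

Bit 0 of the result is simply the old MSB, so it needs no gate; only the three remaining bits of the constant 0x1B cost an XOR.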

Mixadd transform Implementation

Unrolled Architecture

10 AES rounds unrolled
Lots of hardware
Area is increased
Throughput is increased

Pipelined Architecture - I

Only one round at a time
Hardware reduced
Throughput reduced
Area reduced

Pipelined Architecture - II

All 10 rounds taken inside the loop
Loss of the mixadd combination
Additional mux
Good choice in ASIC
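
The throughput gap between the unrolled and loop architectures can be illustrated with a simple model. It assumes one round per clock cycle, a full pipeline in the unrolled case, and the same clock frequency for both, none of which is stated on the slides; the clock value is hypothetical.

```python
def cycles_per_block(architecture, rounds=10):
    """Clock cycles between successive output blocks in steady state."""
    if architecture == "unrolled":
        return 1          # all rounds in hardware, one block completes per clock
    if architecture == "loop":
        return rounds     # a single round unit is reused for every round
    raise ValueError("unknown architecture: " + architecture)


def throughput_mbps(architecture, clock_mhz, block_bits=128, rounds=10):
    """Throughput in Mbps for a given architecture and clock frequency."""
    return block_bits * clock_mhz / cycles_per_block(architecture, rounds)


for arch in ("unrolled", "loop"):
    print(arch, throughput_mbps(arch, 100.0))
```

In practice the two designs would not reach the same clock frequency, so the model only shows the factor contributed by the architecture itself.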

Heuristic optimization

Results

Pipelined - I architecture

Unrolled Architecture

Results (contd.)

Comparison

RAM/unrolled
RAM/pipelined
Mux/pipelined
Composite/pipelined

Summary

http://www.cs.bc.edu/~straubin/cs38105/blockciphers/rijndael_ingles2004.swf

Conclusion

Algorithmic and architectural design tradeoffs were evaluated
Optimum design principle found through heuristics
Throughput = 1563 Mbps
Performance (throughput/area) = 0.69

Phase 2 preview

Implement SBOX: RAM based
Implement MixColumn: MixColumn transform
Implement AddKey: direct XOR
Implement ShiftRow: simple cyclic shift
