
Kris Gaj

Electrical and Computer Engineering


George Mason University

Towards secure cryptographic transformations efficient in both software and hardware:
A case for synergy among math, computing, and engineering

http://ece.gmu.edu/crypto-text.htm

Motivation

Criteria used to evaluate cryptographic transformations
Security
Hardware efficiency
Software efficiency
Flexibility

Flexibility
Additional key sizes and block sizes
Ability to function efficiently and securely in a wide variety of platforms and applications:
low-end smartcards, wireless: small memory requirements
IPSec, ATM: short key setup time in hardware
B-ISDN, satellite communication: high encryption speed

Advanced Encryption Standard (AES) Contest, 1997-2001
June 1998: 15 candidates from USA, Canada, Belgium, France, Germany, Norway, UK, Israel, Korea, Japan, Australia, Costa Rica
Round 1 criteria: security, software efficiency, flexibility
August 1999: 5 final candidates: Mars, RC6, Rijndael, Serpent, Twofish
Round 2 criteria: security, hardware efficiency
October 2000: 1 winner: Rijndael (Belgium)

Europe

NESSIE Project
New European Schemes for Signatures,
Integrity, and Encryption
2000-2002

Japan

CRYPTREC Project
2000-2002

NESSIE, CRYPTREC
Multiple types of transformations:
Symmetric-key block ciphers
Stream ciphers
Hash functions
MACs
Asymmetric encryption schemes
Asymmetric digital signature schemes
Asymmetric identification schemes

Development of a methodology for the fair evaluation and comparison of algorithms belonging to the same class, including software and hardware efficiency

Speed of the final AES candidates in hardware
K. Gaj, P. Chodowiec, AES3, April 2000
[Bar chart: speed in Mbit/s, scale 0 to 500, for Serpent, Rijndael, Twofish, RC6, and Mars]

Survey filled by 167 participants of the Third AES Conference, April 2000
[Bar chart: number of votes, scale 0 to 100, for Rijndael, Serpent, Twofish, RC6, and Mars]

Results of the NSA group
Hardware speed [Mbit/s]

Cipher     NSA ASIC   GMU FPGA
Rijndael   606        414
Serpent    202        431
Twofish    105        177
RC6        103        143
Mars       57         61

Efficiency in software: NIST-specified platform
200 MHz Pentium Pro, Borland C++
[Bar chart: speed in Mbit/s, scale 0 to 30, for 128-bit, 192-bit, and 256-bit keys; ciphers shown: Rijndael, RC6, Twofish, Mars, Serpent]

NIST Report: Security
[Chart: security margin (Adequate to High) versus complexity (Simple to Complex), positioning Serpent, MARS, Twofish, Rijndael, and RC6]

Security: Theoretical attacks better than exhaustive key search
# of rounds in the attack / total # of rounds

Cipher     Attacked   Not attacked   Total
Serpent    9          23             32
Twofish    6          10             16
Mars       11         5              16 (without 16 mixing rounds)
Rijndael   7          3              10
RC6        15         5              20

Security: Theoretical attacks better than exhaustive key search
# of rounds in the attack / total # of rounds, as a percentage

Cipher     Attacked   Not attacked
Serpent    28%        72%
Twofish    38%        62%
Mars       69%        31%
Rijndael   70%        30%
RC6        75%        25%

Security and hardware speed for hash functions
GMU team, May 2002
[Bar chart: speed in hardware in Mbit/s, scale 0 to 700, for SHA-1 and SHA-512; bar labels 610 and 359]
Complexity of the best attack:
SHA-1: $2^{80}$, the same as Skipjack
SHA-512: $2^{256}$, the same as AES-256

What's more important: software or hardware?

Historical view
[Timeline from 1970 to 2000]
Secret-key ciphers:
DES optimized for hardware
Fast Software Encryption: ciphers optimized for software, e.g., RC5, Blowfish, RC4
AES optimized for software and hardware
Hash functions:
DES-based hash functions optimized for hardware
MD4-family optimized primarily for software

Software or hardware?
HARDWARE

SOFTWARE
security of data
during transmission

low cost
flexibility
(new cryptoalgorithms,
protection against new attacks)

speed
random key
generation
access control
to keys

tamper resistance
(viruses, internal attacks)

Efficiency indicators

Primary efficiency indicators
Software: Speed, Memory
Hardware: Speed, Area, Power consumption

Efficiency parameters
[Diagram: message blocks Mi, Mi+1, Mi+2 entering an encryption/decryption unit and leaving as ciphertext blocks Ci, Ci+1, Ci+2]
Latency: time to encrypt/decrypt a single block of data
Throughput (= Speed): number of bits encrypted/decrypted in a unit of time
$$\text{Throughput} = \frac{\text{Block\_size} \times \text{Number\_of\_blocks\_processed\_simultaneously}}{\text{Latency}}$$
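The throughput formula can be tried out with a small helper. This is only a sketch: the function name and the sample figures used below (a 128-bit block and a 320 ns latency, i.e., 16 rounds of 20 ns each) are illustrative assumptions, not measurements from the talk.

```c
#include <assert.h>

/* Throughput in Mbit/s from
   Throughput = Block_size * Number_of_blocks_processed_simultaneously / Latency.
   With the latency in nanoseconds, bits/ns equals Gbit/s, hence the factor 1000. */
double throughput_mbit_s(unsigned block_size_bits,
                         unsigned blocks_in_flight,
                         double latency_ns)
{
    assert(latency_ns > 0.0);
    return 1000.0 * block_size_bits * blocks_in_flight / latency_ns;
}
```

Under these assumptions a single block in flight gives 400 Mbit/s, while 16 blocks processed simultaneously give 6400 Mbit/s with the same latency.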

What's more important: speed or area?

Non-Feedback Cipher Modes


ECB, counter

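Comparison for non-feedback cipher modes, e.g., Counter Mode - CTR
[Diagram: counter values IV, IV+1, IV+2, ..., IV+N-1, IV+N are encrypted independently of one another; each result is combined with the corresponding message block M0, M1, M2, ..., MN-1, MN to give the ciphertext blocks up to CN-1, CN]
$C_i = M_i \oplus E(IV + i)$ for $i = 0..N$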
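The counter-mode relation C_i = M_i XOR E(IV+i) can be sketched as follows. The 64-bit `toy_encrypt` is a hypothetical stand-in for a real block cipher (it is not secure and not taken from the slides); it is there only to make the data flow runnable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placeholder for E(): NOT a secure cipher, illustration only. */
static uint64_t toy_encrypt(uint64_t block, uint64_t key)
{
    uint64_t x = block ^ key;
    x *= 0x9E3779B97F4A7C15ULL;
    x ^= x >> 29;
    return x;
}

/* Counter mode: out[i] = in[i] XOR E(iv + i).
   No iteration depends on another one, so the iterations can run in
   parallel or be pipelined. The same routine encrypts and decrypts. */
void ctr_crypt(const uint64_t *in, uint64_t *out, size_t n_blocks,
               uint64_t iv, uint64_t key)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n_blocks; i++)
        out[i] = in[i] ^ toy_encrypt(iv + (uint64_t)i, key);
}
```

Because no block depends on a previous ciphertext block, this mode is the one that benefits from the parallel and pipelined architectures.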

Increasing speed by parallel processing
[Diagram: several identical encryption/decryption units operating in parallel]

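Increasing speed using pipelining
[Diagram: the rounds of Cipher 1 and Cipher 2 (round 1, round 2, ..., up to round 10 and round 16) divided into pipeline stages, each fitting within one clock cycle of the target clock period, e.g., 20 ns]
$$\text{Speed} = \frac{\text{block size}}{\text{target clock period}}$$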
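As a worked instance of the pipelined speed formula, assuming a 128-bit block (the AES block size) and the 20 ns target clock period given as the example:

$$\text{Speed} = \frac{128\ \text{bit}}{20\ \text{ns}} = 6.4\ \text{Gbit/s}$$

A 20 ns clock period corresponds to a clock frequency of 50 MHz.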

Pipelined operation of the encryption unit
[Timing diagram: clock cycles 1 to 16; blocks B1 to B16 enter the pipeline one per clock cycle, and each block advances by one stage per cycle]

Encryption in non-feedback modes (ECB, counter), decryption in all modes
[Chart: speed in Mbit/s (0 to 7000) versus area in CLB slices (0 to 60000) for Rijndael, Serpent, RC6, Twofish, and Mars; marked speed level: 6.4 Gbit/s, assuming a clock frequency of 50 MHz]

Our Results: Full mixed pipelining
Throughput [Gbit/s], Virtex FPGA

Cipher     Throughput
Serpent    16.8
Twofish    15.2
RC6        13.1
Rijndael   12.2

Our Results: Full mixed pipelining
Area [CLB slices]

Cipher     Area
Serpent    19,700
Twofish    21,000
RC6        46,900
Rijndael   12,600 + 80 RAMs (dedicated memory blocks)

NIST Report + GMU Report: Hardware Efficiency
Non-feedback cipher modes: ECB, CTR
[Chart: speed (Low, Medium, High) versus area (Small, Medium, Large), positioning Rijndael, Serpent, Twofish, RC6, and Mars]

Feedback cipher modes


CBC, CFB, OFB

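Feedback cipher modes - CBC
[Diagram: message blocks M1, M2, M3, ..., MN-1, MN; each block is combined with the previous ciphertext block (the IV for the first block) before encryption, giving C1, C2, C3, ..., CN-1, CN]
$C_1 = E(M_1 \oplus IV)$
$C_i = E(M_i \oplus C_{i-1})$ for $i = 2..N$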
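The CBC recurrence can be sketched in the same spirit. The 64-bit `toy_block_encrypt` below is a hypothetical, insecure stand-in for a real block cipher; the point of the sketch is the loop-carried dependency on the previous ciphertext block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placeholder for E(): NOT a secure cipher, illustration only. */
static uint64_t toy_block_encrypt(uint64_t block, uint64_t key)
{
    uint64_t x = block ^ key;
    x *= 0x9E3779B97F4A7C15ULL;
    x ^= x >> 29;
    return x;
}

/* CBC encryption: c[0] = E(m[0] XOR iv), c[i] = E(m[i] XOR c[i-1]).
   Block i cannot start before block i-1 has finished, so the blocks of
   one message cannot be encrypted in parallel or pipelined. */
void cbc_encrypt(const uint64_t *m, uint64_t *c, size_t n_blocks,
                 uint64_t iv, uint64_t key)
{
    assert(m != NULL && c != NULL);
    uint64_t prev = iv;
    for (size_t i = 0; i < n_blocks; i++) {
        c[i] = toy_block_encrypt(m[i] ^ prev, key);
        prev = c[i];
    }
}
```

This serial dependency is why speed in feedback modes is limited by the latency of a single block rather than by the number of units available.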

Typical Flow Diagram of a Secret-Key Block Cipher
1. Initial transformation, using Round Key[0]; i := 1
2. Cipher Round, using Round Key[i]
3. If i < #rounds: i := i+1 and go back to step 2 (the cipher round is executed #rounds times)
4. Final transformation, using Round Key[#rounds+1]
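The flow diagram maps onto a simple loop. In this sketch the round function, the round count, and the XOR-based initial and final transformations are hypothetical choices made only to keep the example short; they do not correspond to any of the ciphers discussed.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_ROUNDS 4   /* hypothetical round count */

/* Hypothetical round: key mixing, fixed rotation, multiplication by an odd constant. */
static uint32_t toy_round(uint32_t state, uint32_t round_key)
{
    state ^= round_key;
    state = (state << 7) | (state >> 25);
    return state * 0x9E3779B1u;
}

/* round_keys has NUM_ROUNDS + 2 entries: [0] for the initial transformation,
   [1..NUM_ROUNDS] for the rounds, [NUM_ROUNDS + 1] for the final transformation. */
uint32_t toy_cipher(uint32_t block, const uint32_t round_keys[NUM_ROUNDS + 2])
{
    assert(round_keys != 0);
    uint32_t state = block ^ round_keys[0];         /* initial transformation */
    for (int i = 1; i <= NUM_ROUNDS; i++)           /* cipher round, #rounds times */
        state = toy_round(state, round_keys[i]);
    return state ^ round_keys[NUM_ROUNDS + 1];      /* final transformation */
}
```

In hardware, the loop body corresponds to the single round of combinational logic that the basic iterative architecture reuses in every clock cycle.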

Basic iterative architecture

multiplexer

register
one round

combinational
logic

Increasing speed in cipher feedback modes
[Chart: speed versus area for the basic architecture and for loop unrolling with k = 2, 3, 4, 5 rounds]

GMU Results: Encryption in cipher feedback modes (CBC, CFB, OFB) - Virtex FPGA
[Chart: throughput in Mbit/s (100 to 500) versus area in CLB slices (1000 to 5000) for Serpent I8, Rijndael, Twofish, Serpent I1, RC6, and Mars]

NSA Results: Encryption in cipher feedback modes (CBC, CFB, OFB) - ASIC, 0.5 μm CMOS
[Chart: throughput in Mbit/s (0 to 700) versus area (axis labeled "Area [CLB slices]", ticks 0 to 40) for Rijndael, Serpent I1, RC6, Mars, and Twofish]

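Decreasing area by resource sharing
[Diagram: before, two copies of unit F process inputs D0 and D1 separately; after, a single copy of F is shared between D0 and D1 by means of a multiplexer and registers]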

Resource sharing: Speed vs. Area
[Chart: throughput versus area for the basic architecture and for resource sharing]

NIST Report + GMU Report: Hardware Efficiency
Feedback cipher modes: CBC, CFB
[Chart: speed (Low, Medium, High) versus area (Small, Medium, Large), positioning Rijndael, Serpent, Twofish, RC6, and MARS]

Aren't software and hardware optimizations equivalent?

Efficiency in software: NIST-specified platform
200 MHz Pentium Pro, Borland C++
[Bar chart: speed in Mbit/s, scale 0 to 30, for 128-bit, 192-bit, and 256-bit keys; ciphers shown: Rijndael, RC6, Twofish, Mars, Serpent]

Our Results: Basic architecture - Speed
[Bar chart: throughput in Mbit/s, scale 0 to 500, for Serpent, Rijndael, Twofish, RC6, and Mars]

Basic atomic operations of secret-key ciphers and hash functions

Atomic operations used in 41 most popular secret-key ciphers (1)
B. Chetwynd, MS Thesis, WPI

Considered ciphers:
Blowfish, CAST, CAST-128, CAST-256, CRYPTON,
CS-Cipher, DEAL, DES, DFC, E2,
FEAL, FROG, GOST, Hasty Pudding, ICE,
IDEA, Khafre, Khufu, LOKI91, LOKI97,
Lucifer, MacGuffin, MAGENTA, MARS, MISTY1,
MISTY2, MMB, RC2, RC5, RC6,
Rijndael, SAFER K, SAFER+, Serpent, SQUARE,
SHARK, Skipjack, TEA, Twofish, WAKE,
WiderWake

Major atomic operations used in 41 most popular secret-key ciphers (2)
B. Chetwynd, MS Thesis, WPI
[Bar chart: number of ciphers, scale 0 to 40, using each operation: S-box, variable rotation, modular multiplication, GF(2^n) multiplication, modular inversion; bar labels 30, 10, 7, 1]
Auxiliary atomic operations used in 41 most popular secret-key ciphers (3)
B. Chetwynd, MS Thesis, WPI
[Bar chart: number of ciphers, scale 0 to 40, using each operation: Boolean (XOR, AND, OR, etc.), fixed rotation, modular addition and subtraction, permutation; bar labels 40, 25, 20, ?]

Major cipher operations (1) - S-box
S-box n x m: n-bit input, m-bit output
Software
C: WORD S[1<<n] = { 0x23, 0x34, 0x56, ... };   (2^n words)
ASM: S DW 23H, 34H, 56H, ...
Hardware
ROM with an n-bit address and an m-bit output (2^n x m bits), or direct logic with inputs x1, x2, ..., xn and outputs y1, y2, ..., ym

S-box: Memory in hardware
32 S-boxes 4x4 (32 x 4 = 128 bits): Memory = $32 \cdot 2^4 \cdot 4$ bits = 2 kbit
16 S-boxes 8x8 (16 x 8 = 128 bits): Memory = $16 \cdot 2^8 \cdot 8$ bits = 32 kbit = $16 \cdot 2$ kbit

S-box: Memory in software
32 S-boxes 4x4 (32 x 4 = 128 bits), one shared table: Memory = $2^4 \cdot 4$ bits = 64 bit
16 S-boxes 8x8 (16 x 8 = 128 bits), one shared table: Memory = $2^8 \cdot 8$ bits = 2 kbit = $32 \cdot 64$ bits
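The memory figures for S-box tables in hardware and software follow from one product, sketched below. The function name and parameters are illustrative; `copies` is the number of physical tables (one per S-box position in hardware, a single shared table in software).

```c
#include <assert.h>

/* Table storage in bits for `copies` look-up tables of an n x m S-box:
   each table has 2^n entries of m bits. */
unsigned long sbox_table_bits(unsigned copies, unsigned n, unsigned m)
{
    assert(n < 16);   /* keeps 2^n well within the range of unsigned long */
    return (unsigned long)copies * (1UL << n) * m;
}
```

For example, `sbox_table_bits(32, 4, 4)` gives 2048 bits (2 kbit) and `sbox_table_bits(16, 8, 8)` gives 32768 bits (32 kbit) for hardware, while a single shared table costs 64 bits for a 4x4 S-box and 2048 bits for an 8x8 S-box.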

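Major cipher operations (2) - Variable Rotation
A <<< B (32-bit variable rotation, ROL32)
Software
C: C = (A << B) | (A >> ((32 - B) & 31));   /* B in 0..31; the mask avoids an undefined shift by 32 when B = 0 */
ASM: ROL A, B
Hardware
Mux-based shifter: five multiplexer stages controlled by B[4], B[3], B[2], B[1], B[0]; the first stage selects between A<<<0 and A<<<16, and the later stages rotate by 8, 4, 2, and 1 positions
Alternative with a high-speed (fast) clock CLK: rotation by one position per cycle, taking min(B, 32-B) CLK cycles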
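The mux-based shifter can be modeled in software to show how the select lines B[4]..B[0] compose a variable rotation out of fixed ones. This is a behavioral sketch with assumed function names, not a hardware description.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left by a fixed amount n, 1 <= n <= 31 (pure wiring in hardware). */
static uint32_t rotl_fixed(uint32_t a, unsigned n)
{
    return (a << n) | (a >> (32u - n));
}

/* Variable rotation from five 2-to-1 multiplexer stages.
   Stage k passes its input through when bit k of b is 0,
   and selects the input rotated by 2^k when bit k of b is 1. */
uint32_t rotl_mux(uint32_t a, unsigned b)
{
    assert(b < 32u);
    for (int k = 4; k >= 0; k--) {
        unsigned step = 1u << k;          /* 16, 8, 4, 2, 1 */
        if ((b >> k) & 1u)                /* select line B[k] */
            a = rotl_fixed(a, step);
    }
    return a;
}
```

The delay of such a shifter is five multiplexer levels regardless of B, whereas the single-position alternative needs up to 16 cycles of the fast clock.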

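Major cipher operations (3) - Modular Multiplication
C = A·B mod 2^n, n = 32, 16
Software
C: unsigned long A, B, C; C = A*B;   /* mod 2^32 when unsigned long is 32 bits wide */
ASM: MUL
Hardware
Half multiplier: only the n least significant bits of the product are computed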

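Major cipher operations (4) - Multiplication in the Galois Field GF(2^m)
MUL GF(2^8): 8-bit input X, constant C, 8-bit output
Software
C: sequence of <<, ^, |, & or table look-up alog[(log[X] + log[C]) % 255]
ASM: ROL, XOR, OR, AND or tables ALOG DW 3H, 5H, ... and LOG DW 7H, 9H, ...
Hardware
Each output bit y0, ..., y7 is an XOR of selected input bits, e.g., x0 ⊕ x3 ⊕ x4 ⊕ x7 or x0 ⊕ x3 ⊕ x7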
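The shift-and-XOR variant of GF(2^8) multiplication can be sketched as below. The reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) is the one used by Rijndael; choosing it here is an assumption of this example, since other ciphers use other polynomials.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply x by c in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1,
   using only shifts, XORs, and ANDs. */
uint8_t gf256_mul(uint8_t x, uint8_t c)
{
    uint8_t y = 0;
    for (int bit = 0; bit < 8; bit++) {
        if (c & 1u)
            y ^= x;                    /* add the current multiple of x */
        uint8_t carry = x & 0x80u;
        x = (uint8_t)(x << 1);         /* multiply x by the polynomial "x" */
        if (carry)
            x ^= 0x1Bu;                /* reduce modulo 0x11B */
        c >>= 1;
    }
    return y;
}

/* Check against the worked example in the AES specification: {57} x {83} = {c1}. */
void gf256_self_check(void)
{
    assert(gf256_mul(0x57, 0x83) == 0xC1);
}
```

When C is a constant, the loop unrolls into a fixed set of XORs of input bits, which is exactly the XOR network used in hardware.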

Auxiliary cipher operations (1) - Permutation
Permutation P: n-bit input, n-bit output
Software
C: complex sequence of instructions <<, |, &
ASM: complex sequence of instructions ROL, OR, AND
Hardware
Order of wires: inputs x1, x2, x3, ..., xn-1, xn are connected to outputs y1, y2, y3, ..., yn-1, yn in permuted order

Auxiliary cipher operations (2) - Fixed rotation
A <<< n (32-bit fixed rotation, ROL32)
Software
C: C = (A << n) | (A >> (32-n));
ASM: ROL A, n
Hardware
Order of wires: inputs x1, x2, x3, ..., xn-1, xn are connected to outputs y1, y2, y3, ..., yn-1, yn in rotated order

Auxiliary cipher operations (3) - Boolean operations
XOR, AND, OR on n-bit words A and B, result Y
Software
C: A ^ B, A & B, A | B
ASM: XOR A, B; AND A, B; OR A, B
Hardware
One gate per bit position: inputs a0, ..., an-1 and b0, ..., bn-1, outputs y0, ..., yn-1

Auxiliary cipher operations (4) - Addition/subtraction
C = A+B mod 2^n, n = 32, 16
Software
C: unsigned long A, B, C; C = A+B;   /* mod 2^32 when unsigned long is 32 bits wide */
ASM: ADD
Hardware
Adder/subtractor

Multiple designs for hardware adders


Delay

Ripple carry adder (RC)

Carry-Skip adder (CS)


Carry-LookAhead adder (CLA)
Carry-Select adder
Parallel-Prefix Network adder
(Kogge-Stone, Brent-Kung)

Area

Basic operations: Delay and area in HARDWARE
[Chart: delay versus area for modular multiplication, addition (RC), GF(2^n) multiplication, Boolean, permutation, fixed rotation, modular inverse, variable rotation, addition (CLA), S-box 4x4, S-box 8x8, and S-box 9x32]

Basic operations: Delay and area in SOFTWARE
[Chart: delay versus memory for modular inverse, permutation, GF(2^n) multiplication, variable rotation, fixed rotation, multiplication, addition, Boolean, S-box 4x4, S-box 8x8, and S-box 9x32]

Major operations of AES finalists
[Table: rows S-boxes, multiplication in GF(2^m), variable rotation, integer multiplication; columns Serpent, Twofish, Rijndael, RC6, Mars]

Auxiliary operations of AES finalists
[Table: rows Boolean, fixed rotation, addition/subtraction, permutation; columns Serpent, Twofish, Rijndael, RC6, Mars]

MARS (IBM team): Delay and area in HARDWARE
[Chart: delay versus area for the basic operations (modular multiplication, addition (RC), GF(2^n) multiplication, Boolean, permutation, fixed rotation, modular inverse, variable rotation, addition (CLA), S-box 4x4, S-box 8x8, S-box 9x32), marking those used in MARS]

Serpent (R. Anderson, E. Biham, L. Knudsen): Delay and area in HARDWARE
[Chart: delay versus area for the basic operations (modular multiplication, addition (RC), GF(2^n) multiplication, variable rotation, addition (CLA), S-box 4x4, permutation, fixed rotation, Boolean, modular inverse, S-box 8x8, S-box 9x32), marking those used in Serpent]

Rijndael (V. Rijmen, J. Daemen): Delay and area in HARDWARE
[Chart: delay versus area for the basic operations (modular multiplication, addition (RC), GF(2^n) multiplication, variable rotation, addition (CLA), S-box 4x4, permutation, fixed rotation, Boolean, modular inverse, S-box 8x8, S-box 9x32), marking those used in Rijndael]

MARS (IBM team): Delay and area in SOFTWARE
[Chart: delay versus memory for the basic operations (modular inverse, permutation, GF(2^n) multiplication, variable rotation, fixed rotation, multiplication, addition, Boolean, S-box 4x4, S-box 8x8, S-box 9x32), marking those used in MARS]

Operations efficient in both software and hardware
Summary
[Chart: software efficiency (fast & compact, slow or big, slow & big) versus hardware efficiency (fast & compact, slow or big, slow & big) for permutation, GF(2^n) multiply, S-box, modular inverse, variable rotation, Boolean, fixed rotation, addition, and multiplication]

Types of ciphers

AES: Types of candidate algorithms
Feistel Networks: Twofish, E2, DFC, Deal, LOKI97, Magenta
Substitution-Linear Transformation Networks: Rijndael, Serpent, Safer+, Crypton
Modified Feistel Network: RC6, MARS, CAST-256
Others: Frog, HPC

Feistel Network: Single Round of Twofish
[Diagram: data words D[3], D[2], D[1], D[0]; F-function; round keys K2r+8 and K2r+9; rotations <<< 1 and >>> 1; the legend marks the units shared between encryption and decryption]

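Modified Feistel Network: Single Round of MARS
[Diagram: data words D[3], D[2], D[1], D[0]; a keyed function with input in, outputs out1, out2, out3, and keys k = K[4+2i], k = K[5+2i], where i is the round number; rotation <<< 13; the legend marks the units shared between encryption and decryption]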

Substitution-Linear Transformation Network: Single Round of Serpent
[Diagram: 128-bit data path with S-boxes, linear transformation, and round key K[i]; the legend marks the units shared between encryption and decryption]

Substitution-Linear Transformation Network: Serpent in Hardware
[Diagram: 128-bit data path from the initial permutation through separate encryption and decryption blocks to the final permutation; round keys K0, ..., K7, K32 and, in reverse order, K32, ..., K7, K0]

Substitution-Linear Transformation Network: Rijndael in Hardware
[Diagram: encryption path with affine transformation, ShiftRow, MixColumn, and subkey; decryption path with inverse affine transformation, InvShiftRow, InvMixColumn, and subkey; inversion in GF(2^8); the legend marks the units shared between encryption and decryption]

Number and complexity of rounds

Number vs. complexity of a round
[Chart: number of rounds (scale 10 to 50) versus complexity of a round for Triple DES, Serpent, Mars, RC6, DES, Twofish, and Rijndael]

Complexity of the cipher round in hardware
K. Gaj, P. Chodowiec, April 2000
[Bar chart: time in hardware in ns, scale 0 to 100, broken down into the components of a regular round]
Serpent: S-box 4x4, XOR7, MUX2
Rijndael: S-box 8x8, XOR6, XOR5, XOR4, 2 MUX2
Twofish: 2 ADD32, 6 S-boxes 4x4, 9 XOR2, XOR5, XOR4, 2 MUX2
RC6: SQR32, 2 ADD32, ROT32, 4 MUX2
Mars: ADD32, MUL32, ROT32, ADD32, 2 XOR2, 4 MUX2

Security margin: Theoretical attacks better than exhaustive key search
# of rounds in the attack / total # of rounds

Cipher     Attacked   Not attacked   Total
Serpent    9          23             32
Twofish    6          10             16
Mars       11         5              16 (without 16 mixing rounds)
Rijndael   7          3              10
RC6        15         5              20

Making all rounds identical

Serpent: Hardware Architecture I8
[Diagram: 128-bit register feeding eight unrolled rounds, from round 0 (key K0, 32 x S-box 0, linear transformation) to round 7 (key K7, 32 x S-box 7, linear transformation); key K32 at the 128-bit output]
One implementation round of Serpent = 8 regular cipher rounds

Serpent: Hardware Architecture I1
[Diagram: 128-bit register; one regular Serpent round with key Ki, eight parallel S-box layers (32 x S-box 0, 32 x S-box 1, ..., 32 x S-box 7) selected by an 8-to-1 128-bit multiplexer, followed by the linear transformation; key K32 at the 128-bit output]

GMU Results: Encryption in cipher feedback modes (CBC, CFB, OFB) - Virtex FPGA
[Chart: throughput in Mbit/s (100 to 500) versus area in CLB slices (1000 to 5000) for Serpent with all S-boxes identical, Rijndael, Serpent I8, Twofish, Serpent I1, RC6, and Mars]

Parallelism

Parallelism in SHA-1
[Diagram: two consecutive SHA-1 steps on the 32-bit words A, B, C with ROTL5, ROTL30, the function ft, the constant Kt, and the message word Wt]
Operations from two different steps that can be performed in parallel
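A single SHA-1 step, written out in C, makes the independent operations visible. The step function and constants follow the SHA-1 standard; the struct and function names are assumptions of this sketch, and the message schedule (which produces w_t) is left out.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t a, b, c, d, e; } sha1_state;

static uint32_t rotl32(uint32_t x, unsigned n)   /* n in 1..31 */
{
    return (x << n) | (x >> (32u - n));
}

/* One SHA-1 step t (0..79) with message-schedule word w_t. */
sha1_state sha1_step(sha1_state s, unsigned t, uint32_t w_t)
{
    static const uint32_t K[4] = {
        0x5A827999u, 0x6ED9EBA1u, 0x8F1BBCDCu, 0xCA62C1D6u
    };
    assert(t < 80u);

    uint32_t f;
    if (t < 20u)      f = (s.b & s.c) | (~s.b & s.d);
    else if (t < 40u) f = s.b ^ s.c ^ s.d;
    else if (t < 60u) f = (s.b & s.c) | (s.b & s.d) | (s.c & s.d);
    else              f = s.b ^ s.c ^ s.d;

    /* f and the next three values do not depend on one another,
       so a superscalar processor or a circuit can compute them at the same time. */
    uint32_t rot_a = rotl32(s.a, 5);
    uint32_t rot_b = rotl32(s.b, 30);
    uint32_t ekw   = s.e + K[t / 20u] + w_t;

    sha1_state next = { rot_a + f + ekw, s.a, rot_b, s.c, s.d };
    return next;
}
```

Only the new A depends on a chain of additions; the new B, D, and E are plain copies, which is what allows work from consecutive steps to overlap.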

Executing SHA-1 on a 7-way superscalar processor
A. Bosselaers, R. Govaerts, J. Vandewalle, 1997
[Schedule: operations ROL1, ROL5, and ROL30 of steps n, n+1, n+2, n+3, n+4 overlapped in time]

Number of operations that can be executed in parallel for various hash functions
A. Bosselaers, R. Govaerts, J. Vandewalle, 1997
[Bar chart: number of operations, scale 0 to 8, for SHA-1, RIPEMD, RIPEMD-128, RIPEMD-160, MD5, and MD4]

Optimization tricks

Rijndael round: Table-lookup implementation
[Diagram: state bytes a0,0 to a3,3; one byte from each row indexes the tables T0, T1, T2, T3, and the four table outputs are combined with the round-key word k2 to give the output column b2 = (x0,2, x1,2, x2,2, x3,2) of the output b0, b1, b2, b3]
Speed-up in software: ~100 times
Speed-up in hardware: ~20%

Serpent: Bit-slice implementation
32 x 4 = 128 bits: 32 S-box instances, numbered (0) to (31); instance k has input bits $x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, x_4^{(k)}$ and output bit $y_1^{(k)}$
Bit-slice layout: the word $x_j$ holds $x_j^{(31)} x_j^{(30)} \ldots x_j^{(1)} x_j^{(0)}$ for $j = 1..4$, and the word $y_1$ holds $y_1^{(31)} y_1^{(30)} \ldots y_1^{(1)} y_1^{(0)}$
e.g. $y_1^{(k)} = f(x_1^{(k)}, x_2^{(k)}, x_3^{(k)}, x_4^{(k)}) = (x_1^{(k)} \text{ AND } x_2^{(k)}) \text{ XOR } (x_3^{(k)} \text{ OR } x_4^{(k)})$
$u_1 = x_1 \text{ AND } x_2$
$v_1 = x_3 \text{ OR } x_4$
$y_1 = u_1 \text{ XOR } v_1$
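The bit-slice idea can be checked with a few lines of C, using the example function y1 = (x1 AND x2) XOR (x3 OR x4). Both function names are illustrative; the second routine exists only as a one-instance-at-a-time reference.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-sliced evaluation: bit k of every word belongs to S-box instance k,
   so three word-wide instructions evaluate the function for 32 instances. */
uint32_t f_bitsliced(uint32_t x1, uint32_t x2, uint32_t x3, uint32_t x4)
{
    uint32_t u1 = x1 & x2;
    uint32_t v1 = x3 | x4;
    return u1 ^ v1;
}

/* Reference: the same function evaluated for one instance k at a time. */
uint32_t f_one_by_one(uint32_t x1, uint32_t x2, uint32_t x3, uint32_t x4)
{
    uint32_t y1 = 0;
    for (unsigned k = 0; k < 32; k++) {
        unsigned b1 = (x1 >> k) & 1u, b2 = (x2 >> k) & 1u;
        unsigned b3 = (x3 >> k) & 1u, b4 = (x4 >> k) & 1u;
        unsigned y = (b1 & b2) ^ (b3 | b4);
        assert(y <= 1u);
        y1 |= (uint32_t)y << k;
    }
    return y1;
}
```

The bit-sliced version needs no table look-ups at all, which is why this trick suits Serpent's small 4x4 S-boxes on 32-bit processors.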

The proposed approach

Cipher design methodology (1)
1. Choose one or at most two major operations efficient in both software and hardware
   best choice: S-box 4x4, GF(2^n) multiplication
2. Choose one or at most two auxiliary operations efficient in both software and hardware
   best choice: Boolean, fixed rotation
3. Choose a cipher type that enables maximum sharing between encryption and decryption
   best choice: Feistel network, modified Feistel network

Cipher design methodology (2)
4. Design a round taking into account the trade-off between
   round complexity
   number of rounds necessary to guarantee a sufficient security margin
5. Make all rounds identical, if possible
   negative examples: Serpent, Mars
6. Look for parallelism within a round and among consecutive rounds
   positive example: SHA-1
7. Look for optimization tricks
   positive examples: table look-up in Rijndael, bit-slice implementation in Serpent

[Diagram: mathematicians, computer scientists, and computer engineers jointly addressing security, flexibility, software efficiency, and hardware efficiency]

$A100 Challenges
For mathematicians:
Prove or disprove that Serpent with
   all S-boxes identical
   16 rounds
is at least as secure as Rijndael
For computer scientists:
Is there a way of using instruction-level parallelism to speed up a software implementation of [modified] Serpent to make it as fast as Rijndael?

$A50 Challenge
For mathematicians:
Is there a way of changing Serpent into a modified Feistel network cipher without losing its security properties?
For computer scientists:
What is the level of parallelism present in SHA-256, SHA-384, SHA-512?
