
Securing Asynchronous Transfer Mode Networks

by Gregory M. Haskins

A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE in partial fulfillment of the requirements for the Degree of Master of Science in Electrical Engineering

May, 1997

Approved:
Prof. Christof Paar, ECE Department, Thesis Advisor
Scott Lane, GTE Government Systems, Thesis Committee
Prof. Fred J. Looft, ECE Department, Thesis Committee
Prof. John M. Rulnick, ECE Department, Thesis Committee
Prof. John Orr, ECE Department Head

Abstract
Data security plays an increasingly important role in today's information technology. Potential data rates in the gigabit range, such as those offered by ATM networks, put many constraints on the design of a secure, but usable, network. In addition, the cell structure of ATM makes bulk data encryption as well as public-key security services challenging tasks. In this work, two major areas of ATM security are addressed. First, the special aspects and problems associated with overall security for ATM networks, such as potential threats, services, design considerations, and topology, are explored. The second part deals with the agility of cryptographic algorithms, that is, the capability of an encryption device to change its algorithm. This feature appears to be very desirable for high speed networks because it facilitates design flexibility and future protocol additions and changes. We propose the use of reconfigurable hardware since it appears to be naturally suited for the task. The use of reconfigurables in cryptographic applications, to our knowledge, has not been systematically analyzed before and appears to be a highly interesting area within high speed network security.

The result of this thesis is a design for a secure ATM network, and a detailed analysis of the feasibility of using reconfigurable hardware to implement algorithm agility. The analysis includes information regarding an actual implementation and its price vs. performance in two popular architectures. One of the more interesting results is that DES can be realized without loop unrolling at data rates beyond 60 Mb/s on standard reconfigurable hardware.


Preface
I would like to thank the many people who contributed to this work. First, my advisor Christof Paar for his advice and support throughout this entire project. Without him, I may have inadvertently designed some insecure devices and embarrassed myself at RSA '97. Next, I would like to thank Kate Sullivan for her help with LaTeX. Together we gained valuable insight as to how to get tables to work correctly. She usually came to me for help, only to answer her own question and teach me something new in the process. Martin Rosner and Mike Roberts worked together with me on the component synthesis and testing. Without them I may not have been able to finish all of the experiments in time. Scott Lane and Dave King from GTE Government Systems were kind enough to meet with Dr. Paar and myself early in the project to discuss various architectures, designs, etc. I would like to thank them for giving us that initial start, which helped us complete the ATM design work for Lockheed. Lastly, I would like to thank the thesis committee for taking the time to read this project in the midst of busy schedules.

Thanks everyone!
-Greg


Contents
I Introduction
1 Motivation
2 Thesis Outline
3 ATM Overview
  3.1 ISDN
  3.2 ISDN and B-ISDN
  3.3 The ATM Layers
    3.3.1 ATM Adaptation Layer (AAL)
    3.3.2 ATM Layer
    3.3.3 HEC
    3.3.4 Physical Layer

II ATM Security Issues
4 Introduction to Security Issues
  4.1 Potential Threats
  4.2 Security Services
5 Design Considerations
  5.1 Encryption Hardware
    5.1.1 ASICs
    5.1.2 Reconfigurables
  5.2 Symmetric Algorithms
    5.2.1 Approved Algorithms
  5.3 Mode of Operation
  5.4 Synchronization
  5.5 Interleaving
  5.6 Key Storage
    5.6.1 Session Keys
    5.6.2 Public Key Encryption and Signature Keys
  5.7 Authentication
  5.8 Numeric Computation
  5.9 Key Agility
    5.9.1 Overall Layout
    5.9.2 Architecture Description
    5.9.3 Design Considerations
  5.10 Algorithm Agility
6 Security Topology
  6.1 Where to Place the Services
    6.1.1 Privacy
    6.1.2 Authentication
    6.1.3 Integrity
    6.1.4 Access Control
    6.1.5 Replay Prevention
    6.1.6 Non-Repudiation
  6.2 Hardware Location
    6.2.1 Network Placement
    6.2.2 Should Services Be Built In?
  6.3 Cryptographic Signaling
    6.3.1 Location
    6.3.2 Secure Call Establishment
  6.4 Key Management and Distribution

III Achieving Algorithm Agility
7 Introduction
  7.1 Using ASICs
  7.2 Using Reconfigurable Hardware
8 Introduction to Reconfigurable Hardware
  8.1 Simple Reconfigurable Hardware
  8.2 Device Technology
    8.2.1 Interconnection Technology
    8.2.2 Logic Technology
    8.2.3 Segment Technology
    8.2.4 Internal Architectures
    8.2.5 Field Programmable Gate Arrays (FPGA)
    8.2.6 Complex Programmable Logic Devices (CPLD)
9 Decomposition of Cryptographic Algorithms
  9.1 Introduction
  9.2 Symmetric Block Cipher Algorithms
    9.2.1 Block Cipher Architecture
  9.3 Methodology
    9.3.1 Component Breakdown
    9.3.2 Implementation
  9.4 Component Description
    9.4.1 Permutation Boxes
    9.4.2 Logical Functions - XOR, AND, OR, NOT
    9.4.3 Substitution Boxes
    9.4.4 Shift/Rotate Registers
    9.4.5 Adders
    9.4.6 The Hidden Components
    9.4.7 Component Conclusion
10 Designing for High Performance
  10.1 The Data Encryption Standard
    10.1.1 Introduction
    10.1.2 Design

IV Results and Conclusions
11 Results
  11.1 Comparing the Results in Reconfigurable Hardware
    11.1.1 Methodology
    11.1.2 Same Relative Cost Comparison
    11.1.3 Same Relative Size Comparison
  11.2 Analysis of DES Implementation
    11.2.1 Component Comparison
    11.2.2 DES Comparison
    11.2.3 Same Cost Comparison of DES
    11.2.4 Same Size Comparison of DES
12 Conclusions
  12.1 Design Recommendations for ATM
  12.2 Reconfigurable Hardware and ATM
  12.3 Future Work
  12.4 Summary

A DES

V References

List of Tables
3.1 ISDN Q.931 Messages [2]
3.2 AAL Classes
3.3 Additional ATM Connection Control Messages
3.4 Functions supported by the UNI
4.1 Network Threats
5.1 Crypto algorithms suitable for ATM
5.2 Approved protocols for use with ATM
6.1 ATM Channel Definition
6.2 Recommended service location
9.1 The available algorithms and their component breakdown
9.2 Component Description
9.3 PAD delays in experimental hardware
9.4 32 bit XOR box (source: xormod.vhd)
9.5 Substitution box implementation in XC4000E technology with synthesized combinatorial logic (source: sbox1.vhd)
9.6 Substitution box in XC4000E technology with ROM (source: sbox1.mem)
9.7 Substitution box in FLEX10K technology with synthesis (source: sbox1.vhd)
9.8 Substitution box in FLEX10K technology with ROM (source: sbox1.mif)
9.9 32bit rotation box (source: rot.vhd, lmrot.vhd, larot.vhd)
9.10 Adder in various hardware
9.11 256x32 RAM buffer in various hardware
9.12 32bit 2x1 MUX in various hardware
11.1 LE Weighted (LeW) Cost Analysis of Various Devices
11.2 RE Weighted (ReW) Cost Analysis of Various Devices
11.3 Similar cost comparison
11.4 Components Evaluated with Cost Comparison
11.5 Same Relative Size Comparison
11.6 Components Evaluated with Size Comparison
11.7 DES Performance
11.8 Similar Cost Device Comparison of DES
11.9 Similar Size Device Comparison of DES

List of Figures
3.1 ATM and the B-ISDN model
3.2 The AAL 3/4 PDU
3.3 ATM 5 byte Header
3.4 Segmentation of a 65535 user payload into 53 byte cells
5.1 The Key Agile Architecture
5.2 The General Security Architecture
5.3 Module Layout
6.1 Comparison of encryption in various levels of the ATM stack
6.2 The Modified Security Model for NIC implementations
6.3 The Modified Security Model for Network Device implementations
6.4 The operating system model of an ATM host
6.5 The operating system model with the A/B plane
7.1 The Cipher Array with ASICs
8.1 Classes of Reconfigurable Hardware
8.2 Channeled array (side view)
8.3 The SRAM FPGA
8.4 Programmed Interconnects
8.5 The Programmable Array Logic Architecture
8.6 The Complex Programmable Logic Device Architecture
9.1 The Feistel Network
9.2 A 3x4 Substitution Box
9.3 The Permutation Box
10.1 Schematic map of DES algorithm
10.2 Schematic map of key schedule logic
10.3 State Diagram of Control Unit
10.4 DES Signal Map
11.1 Similar Cost Comparison
11.2 Similar Cost Device Comparison of DES
11.3 Similar Size Device Comparison of DES
A.1 Simulation of DES Unit
A.2 Floorplan of DES Unit in XC4020EPG223-3

Part I Introduction

Chapter 1 Motivation
Asynchronous Transfer Mode, or ATM, is a newly emerging technology for the transmission of voice, video, and data information in one common network. The security of this information has become an issue of great importance among many groups with the advent of electronic commerce and related technologies. In the past, security solutions have been devised after a communications technology has been declared a standard. The results of such solutions are often awkward and clumsy to use because work-arounds must be implemented where the network lacks support. However, for ATM, security issues are being worked out in parallel with the standard, so many opportunities exist for a security designer to advance the state of the art in secure ATM networks.

This work is a result of two research grants received by the Cryptography and Information Security Group at Worcester Polytechnic Institute. The first dealt with a study of the state of the art in ATM security. This research sparked our interest in the low-level issues inherent in adding security to ATM, such as encryptor design and placement within the networking environment. The result of the research was a set of ideas about how the most efficient link encryptor could (or should) operate and how to achieve true system agility. However, it was also unclear whether technology existed to support our ideas regarding system agility without further research.

We proposed that reconfigurable hardware offered the greatest advantage for algorithm agility because of its flexibility and upgradable features. We began investigating the state of the art of reconfigurables and assessing the viability of cryptographic applications in them. There have been considerable efforts in the past investigating the optimal design of reconfigurable hardware and routing software. However, the application of cryptography is new to these devices.

Chapter 2 Thesis Outline


Part I of this thesis includes Chapters 1-3 and serves as an outline and general structure for the rest of this thesis. Chapter 1 provides a brief introduction to this thesis and describes the motivations both for performing this research and for our decisions to move the research in certain directions. Chapter 3 gives the reader a general background on Asynchronous Transfer Mode networks. Topics such as its history and physical parameters are discussed at a fairly high level. It is intended for those who may not be very familiar with the inner workings of ATM, but it also serves to set the stage for Part II of this thesis, which covers security issues.

Part II is a three chapter segment covering all of the topics associated with securing ATM networks. Chapter 4 starts Part II with an introduction to threats against ATM networks and services used to prevent such attacks. Chapter 5 describes various parameters that must be considered when designing a secure ATM device. Together with Chapter 6, this chapter is intended to inform the reader about all of the issues regarding ATM security in general. We also introduce the concept of agility, which is covered in greater detail in Part III. Chapter 6 defines the locations that we have decided make the optimal design for the services covered in Chapter 4. Through this chapter, the reader should gain knowledge about the actual structure of a secure ATM network.

Part III is a four chapter segment covering algorithm agility. It begins with Chapter 7 and a discussion of the motivations behind adding algorithm agility and why we feel reconfigurable hardware is the best solution. Chapter 8 introduces reconfigurable technology and the various architectures that are available. The reader should gain enough knowledge here to understand the motivations for Chapter 9, which is a study of cryptographic applications in reconfigurable hardware. We assert that cryptography implemented on reconfigurable hardware is relatively unexplored. We begin by studying some algorithms and how certain portions of the algorithms behave when actually mapped to the hardware. The data collected here is helpful in the low-level design phase presented in Chapter 10. Chapter 10 is a description of the experiences we had while designing a popular cryptographic application in reconfigurable hardware. The main issues were both to fit the design into an available device and to make it as fast as possible, since the application is bound for a high speed network.

Finally, Part IV presents the reader with the results and conclusions that we obtained from our reconfigurable computing research. The results section contains an analysis of the various hardware configurations that we have tested, while the conclusion section contains descriptions of recommended design configurations and work left to be done by a future research group. Immediately following the thesis summary are floorplans and simulation data from our resulting hardware designs.

Chapter 3 ATM Overview


The International Telecommunications Union (ITU) (formerly the CCITT) defines ATM in the following manner [14]:

"A transfer mode in which information is organized into cells; it is asynchronous in the sense that the reoccurrence of cells containing information from an individual user is not necessarily periodic."

The original design of ATM was drafted with telecommunications applications in mind. Thus, many of the attributes of ATM systems reflect desirable features of a voice network. However, the computer industry gained interest in the development of ATM soon after its conception, and its participation is reflected in the standards as they are today.

3.1 ISDN

Integrated Services Digital Network (ISDN) [6] was introduced by the telecommunications industry in the 1970s. It was originally designed as an evolutionary upgrade from previous technologies, allowing digital connections between a user and a network. It was intended to carry voice and image traffic, but has been extended to carry a wide variety of information like facsimile, television, data, etc.

The design of ISDN was influenced by existing technologies such as the T1/E1 standards. T1 is a 1.536 Mb/s medium that multiplexes 24 channels of 64 kb/s each using Time Division Multiplexing (TDM). Like T1, ISDN has 64 kb/s channels, similar transmission codes, and identical physical connections. The ISDN Basic Rate Interface (BRI), however, has only two 64 kb/s "B" channels used for normal traffic, and an additional 16 kb/s "D" channel for signaling, yielding a total of 128 kb/s throughput for user data. A second configuration, the Primary Rate Interface (PRI), has 23-31 B + D channels, yielding approximately 1.5 Mb/s.

By separating the signaling from the user traffic, ISDN uses a type of out-of-band protocol. The advantage of this type of design is that user and signaling packets are never confused, because they each have their own channel. Also, no additional overhead is wasted trying to differentiate between a signaling packet and a user packet. However, during periods when no signaling information is needed, the 16 kb/s of bandwidth is wasted.

The User Network Interface of ISDN is very similar to that of X.25 networks. Control is maintained through the use of the Q.931 protocol, which defines a set of messages used to manage ISDN connections. Table 3.1 lists the Q.931 messages.
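The channel arithmetic quoted above can be checked with a few lines of Python. This is only a worked illustration of the figures in the text; the variable and function names are ours, and the share of BRI capacity reserved for signaling is a derived number that the text does not state.

```python
# Channel rates quoted in the text, in kb/s.
B_CHANNEL_KBPS = 64       # one bearer ("B") channel
BRI_D_CHANNEL_KBPS = 16   # BRI signaling ("D") channel


def aggregate_kbps(channels: int, rate_kbps: int = B_CHANNEL_KBPS) -> int:
    """Total rate of `channels` time-division multiplexed channels."""
    return channels * rate_kbps


t1_payload_kbps = aggregate_kbps(24)                  # 24 x 64 kb/s = 1.536 Mb/s
bri_user_kbps = aggregate_kbps(2)                     # 2B: user data only
bri_total_kbps = bri_user_kbps + BRI_D_CHANNEL_KBPS   # 2B + D

# Fraction of BRI capacity that sits idle whenever no signaling is needed.
d_channel_share = BRI_D_CHANNEL_KBPS / bri_total_kbps

print(t1_payload_kbps, bri_user_kbps, bri_total_kbps, round(d_channel_share, 3))
```

Under these figures, one ninth of the BRI capacity is reserved for signaling, which is the cost of the out-of-band design described above.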
3.2 ISDN and B-ISDN

Designers soon realized after the deployment of ISDN in the 1980s that the BRI/PRI interface was too slow. Development began on a new technology called Broadband-ISDN (B-ISDN), which was meant to be an extension of the existing ISDN technology. It became clear that a new functional approach must be used to achieve the gains required. Conceptually, B-ISDN is in fact an extension of ISDN, but the two technologies are not compatible with each other. Figure 3.1 shows the layers of the B-ISDN model. One major difference between the B-ISDN and ISDN models is that B-ISDN does not specify a physical layer. The ITU-T, however, recommends the use of ATM over Synchronous Optical NETwork (SONET) or Synchronous Digital Hierarchy (SDH).

It was the goal of the designers of B-ISDN to provide the following features, which had not yet been implemented in any existing network:

1. Bandwidth-on-demand
2. Guaranteed cell sequence priority
3. Low overhead
4. Low delay
5. Constant delay
6. Multiple virtual connections over a single path
7. Bit rates higher than 150 Mb/s
8. Constant and variable bit rates
9. Quality of service
10. Hardware controlled

Call Establishment: ALERTING, CALL PROCEEDING, CONNECT, CONNECT ACKNOWLEDGE, SETUP, SETUP ACKNOWLEDGE
Call Disestablishment: DETACH, DETACH ACKNOWLEDGE, DISCONNECT, RELEASE, RELEASE COMPLETE
Call Information: RESUME, RESUME ACKNOWLEDGE, RESUME REJECT, SUSPEND, SUSPEND ACKNOWLEDGE, SUSPEND REJECT, USER INFORMATION
Misc.: CANCEL, CANCEL ACKNOWLEDGE, CANCEL REJECT, CONGESTION CONTROL, FACILITY, FACILITY ACKNOWLEDGE, FACILITY REJECT, INFORMATION, REGISTER, REGISTER ACKNOWLEDGE, REGISTER REJECT, STATUS, STATUS ENQUIRY

Table 3.1: ISDN Q.931 Messages [2]
[Figure 3.1 (diagram): the Control Plane and User Plane sit above the Higher Layers, the ATM Adaptation Layer, the ATM Layer, and the Physical Layer, with Plane Management and Layer Management alongside.]

Figure 3.1: ATM and the B-ISDN model

There are three primary planes in the B-ISDN model: the User plane (UPLANE), the Control plane (CPLANE), and the Management plane (MPLANE). The UPLANE provides services for flow control, recovery options, and user data transfer. The CPLANE manages connections, and is also responsible for setup and release of the connections. The MPLANE is responsible for maintaining the other layers and planes. ATM has been accepted by the standards bodies as the transport for B-ISDN networks.


3.3 The ATM Layers


Understanding that ATM is the underlying transport structure for B-ISDN, we now take a look at ATM itself (see Figure 3.1). ATM uses fixed-length cells of 53 octets (48 octets of payload, 5 octets of header). The header in each cell contains a Virtual Path Identifier/Virtual Channel Identifier (VPI/VCI) pair. These two identifiers are used to route a cell through the network. Together they form what is known as a virtual connection (VC). Each of the three main layers (the ATM Adaptation Layer, the ATM Layer, and the Physical Layer) plays an important role in allowing an application to use a VC to communicate with another host. The following is a more detailed description of what occurs in each layer.

3.3.1 ATM Adaptation Layer (AAL)


The AAL layer creates classes of traffic which use the lower layers. It defines parameters such as stream types (constant or variable), connection oriented vs. connectionless, level of error correction, acceptable cell loss, etc. Essentially, the AAL layer provides an interface from the user application to the ATM layer. It was designed to allow different types of applications to take advantage of the services that ATM provides. The AAL layer has two main sublayers, the Segmentation and Re-assembly (SAR) sublayer and the Convergence Sublayer (CS).

SAR
The SAR sublayer must take a user size payload from the Application Layer and segment it into 48 byte payloads for the ATM layer. Conversely, the SAR layer must reassemble 48 byte payloads from the ATM layer into user size payloads for the Application Layer.
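The two directions of the SAR sublayer can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the behavior of any particular AAL type: zero padding of the last segment and passing the original length out of band are simplifications of ours, and the function names are hypothetical.

```python
CELL_PAYLOAD = 48  # octets handed to the ATM layer per cell


def segment(payload: bytes, size: int = CELL_PAYLOAD) -> list[bytes]:
    """Split a user-size payload into fixed-size segments.

    Assumption: the last segment is zero-padded to the full size.
    """
    chunks = [payload[i:i + size] for i in range(0, len(payload), size)]
    if chunks and len(chunks[-1]) < size:
        chunks[-1] = chunks[-1].ljust(size, b"\x00")
    return chunks


def reassemble(chunks: list[bytes], original_length: int) -> bytes:
    """Rebuild the user payload from received segments.

    Assumption: the receiver learns the original length by other means.
    """
    return b"".join(chunks)[:original_length]


data = bytes(range(100))
cells = segment(data)
print(len(cells), [len(c) for c in cells])
print(reassemble(cells, len(data)) == data)
```

A 100-octet payload therefore occupies three 48-octet segments, the last of which carries 4 octets of data and 44 octets of padding.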


CS
The CS sublayer performs different operations depending on the AAL class of data. In general, there are 5 classes of data (see Table 3.2). Depending on the AAL Type of operation, the CS layer will add extra information into a cell so that the remote host's AAL layer will be able to reassemble the user payload. It is beyond the scope of this report to explain each AAL Type and the fields that are present. However, for clarity, the AAL Type 3/4 will be explained below. Further information can be found in [2].
Class    Bit rate                   Connection mode        Timing relationship, source to destination
A        Constant Bit Rate (CBR)    Connection oriented    Required
B        Variable Bit Rate (VBR)    Connection oriented    Required
C        Variable Bit Rate (VBR)    Connection oriented    Not required
D        Variable Bit Rate (VBR)    Connection-less        Not required
X        Traffic and timing determined by user

Table 3.2: AAL Classes

AAL 3/4
The AAL 3/4 Type (see Figure 3.2) operation supports VBR applications operating in either message or stream mode. Message mode indicates that a user payload has been segmented into multiple cells, while stream mode indicates that the message is a stream in nature or is as small as one octet. Each 48-octet AAL 3/4 unit carries a 44 octet payload. The other 4 octets are split into the following information elements:



1. 2 bit segment type (ST) [BOM = 10, COM = 00, EOM = 01, SSM = 11]
2. 4 bit sequence number
3. 10 bit message ID
4. 6 bit length indicator
5. 10 bit CRC checksum
[Figure 3.2 (diagram): a 48-byte unit laid out as 2b ST, 4b SN, 10b MID, 44B PAYLOAD, 6b LI, 10b CRC. ST = Segment Type, SN = Sequence Number, MID = Message ID, LI = Length Indicator, CRC = Cyclic Redundancy Check.]

Figure 3.2: The AAL 3/4 PDU

Since the user payload is often larger than a single cell, the 3/4 type cell splits the message with the following notations: BOM = Beginning of Message, COM = Continuation of Message, EOM = End of Message, SSM = Single Segment Message (for when the payload does fit into one cell).
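The field layout of Figure 3.2 and the BOM/COM/EOM/SSM marking can be sketched in Python as below. This is a hedged illustration only: the function names are ours, the fields are assumed to be packed most significant bit first in the order the figure lists them, the sequence number is assumed to start at zero and wrap modulo 16, and the 10-bit CRC is left as a caller-supplied placeholder (default 0) rather than computed.

```python
import struct

SEGMENT_TYPES = {"BOM": 0b10, "COM": 0b00, "EOM": 0b01, "SSM": 0b11}
ST_NAMES = {code: name for name, code in SEGMENT_TYPES.items()}
PAYLOAD_LEN = 44  # octets of user data per 48-octet unit


def pack_sar_pdu(st: str, sn: int, mid: int, payload: bytes, crc: int = 0) -> bytes:
    """Build one 48-octet unit: ST(2) SN(4) MID(10) | payload(44 octets) | LI(6) CRC(10)."""
    if not (0 <= sn < 16 and 0 <= mid < 1024 and 0 <= crc < 1024):
        raise ValueError("field out of range")
    if len(payload) > PAYLOAD_LEN:
        raise ValueError("payload longer than 44 octets")
    header = (SEGMENT_TYPES[st] << 14) | (sn << 10) | mid
    trailer = (len(payload) << 10) | crc  # CRC is a placeholder, not computed here
    body = payload.ljust(PAYLOAD_LEN, b"\x00")
    return struct.pack(">H", header) + body + struct.pack(">H", trailer)


def unpack_sar_pdu(pdu: bytes):
    """Return (segment type, sequence number, message ID, payload, crc)."""
    (header,) = struct.unpack(">H", pdu[:2])
    (trailer,) = struct.unpack(">H", pdu[-2:])
    length = trailer >> 10
    return (ST_NAMES[header >> 14], (header >> 10) & 0xF, header & 0x3FF,
            pdu[2:2 + length], trailer & 0x3FF)


def segment_message(message: bytes, mid: int) -> list[bytes]:
    """Split a message into units and mark them SSM, or BOM / COM... / EOM."""
    pieces = [message[i:i + PAYLOAD_LEN]
              for i in range(0, len(message), PAYLOAD_LEN)] or [b""]
    if len(pieces) == 1:
        types = ["SSM"]
    else:
        types = ["BOM"] + ["COM"] * (len(pieces) - 2) + ["EOM"]
    return [pack_sar_pdu(t, i % 16, mid, p)
            for i, (t, p) in enumerate(zip(types, pieces))]


units = segment_message(b"x" * 100, mid=5)
print([unpack_sar_pdu(u)[0] for u in units])
```

The length indicator lets the receiver discard the padding in the final unit, so the original message can be recovered by concatenating the unpacked payloads in sequence order.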

3.3.2 ATM Layer


The ATM layer's main function is to add 5 byte headers to the 48 bytes of data received from the AAL layer and pass the result to the physical layer. Conversely, it also receives 53 byte cells from the physical layer, processes the header, and passes the 48 bytes up to the AAL layer. There are six fields in the ATM header: GFC, VPI, VCI, PT, CLP, and HEC (see Figure 3.3).


GFC
The Generic Flow Control (GFC) field is used at each local site to assess flow control. The value is not carried end to end and may change at each switch point.

VPI/VCI
The Virtual Path Identifier/Virtual Channel Identifier pair is used to identify the Virtual Connection (VC) that a cell belongs to.

PT
Payload Type (PT) is used to designate whether the cell contains user or control information. It can also signal whether congestion has been experienced.

CLP
Cell Loss Priority (CLP) is a Boolean flag indicating the level of priority of a cell.

3.3.3 HEC
Header Error Control (HEC) is used by the physical layer to detect errors in the header.

Each of these header fields plays a primary role in implementing the various planes existing in the B-ISDN model. Table 3.4 shows the roles of the header fields in various functions. The CPLANE functions are implemented through the Q.2931 protocol, which is an extension of the Q.931 protocol (see Table 3.3). Many find the adaptation of the Q.931 protocol a disappointment since it is rather outdated. It is believed that the full potential of ATM cannot be utilized with a mere addition of a few command sets to the existing ISDN standard. ATM is a new protocol, with new features and new options, and therefore requires a new approach.

RESTART
RESTART ACKNOWLEDGE
ADD PARTY
ADD PARTY ACKNOWLEDGE
ADD PARTY REJECT
DROP PARTY
DROP PARTY ACKNOWLEDGE

Table 3.3: Additional ATM Connection Control Messages
UPLANE Functions                                           UPLANE Parameters
Multiplexing among different ATM connections               VPI/VCI
Cell rate decoupling (unassigned)                          Pre-assigned header values
Cell discrimination based on predefined header values      Pre-assigned header values
Payload type discrimination                                PT field
Loss priority indication and selective cell discarding     CLP field, network congestion state
Traffic Shaping                                            Traffic descriptor

MPLANE Functions                                           MPLANE Parameters
Alarm Surveillance (VP)                                    OAM Cells
Connectivity Verification                                  OAM Cells
Invalid VPI/VCI detection                                  VPI/VCI

Table 3.4: Functions supported by the UNI


[Figure 3.3 diagram: bits 8 to 1 across, octets 1 to 53 down; fields shown are GFC, VPI, VCI, PT, CLP, HEC, and payload.]

Figure 3.3: ATM 5 byte Header
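The header fields described above can be illustrated with a short sketch that packs and unpacks a 5 byte UNI header. The field widths (GFC 4 bits, VPI 8, VCI 16, PT 3, CLP 1) and the HEC rule (CRC-8 with generator x^8 + x^2 + x + 1 over the first four octets, XORed with 0x55) are taken from the ITU-T recommendations rather than from this text, and the names `UniHeader`, `pack`, `unpack`, and `hec` are hypothetical.

```python
from typing import NamedTuple


class UniHeader(NamedTuple):
    gfc: int  # Generic Flow Control, 4 bits
    vpi: int  # Virtual Path Identifier, 8 bits at the UNI
    vci: int  # Virtual Channel Identifier, 16 bits
    pt: int   # Payload Type, 3 bits
    clp: int  # Cell Loss Priority, 1 bit


def hec(first4: bytes) -> int:
    """CRC-8 (polynomial 0x07) over the first four octets, XORed with 0x55."""
    crc = 0
    for byte in first4:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ 0x07) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc ^ 0x55


def pack(h: UniHeader) -> bytes:
    """Build the 5 header octets: 4 octets of fields followed by the HEC."""
    word = (h.gfc << 28) | (h.vpi << 20) | (h.vci << 4) | (h.pt << 1) | h.clp
    first4 = word.to_bytes(4, "big")
    return first4 + bytes([hec(first4)])


def unpack(header: bytes) -> UniHeader:
    """Check the HEC, then split the first four octets back into fields."""
    if len(header) != 5 or hec(header[:4]) != header[4]:
        raise ValueError("bad header length or HEC mismatch")
    word = int.from_bytes(header[:4], "big")
    return UniHeader(word >> 28, (word >> 20) & 0xFF, (word >> 4) & 0xFFFF,
                     (word >> 1) & 0x7, word & 0x1)


example = UniHeader(gfc=0, vpi=1, vci=32, pt=0, clp=0)
print(pack(example).hex(), unpack(pack(example)))
```

The sketch only detects header errors; single-bit correction, which the HEC also permits, is left out.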


3.3.4 Physical Layer


The role of the ATM physical layer, like any physical layer, is to convert a data abstraction like a cell into electrical or optical impulses on the physical medium. It may also be the role of the physical layer to check for errors and perform other media dependent functions such as frame generation/recovery, line coding, etc. There are many aspects to the physical layer, such as cell delineation, cell scrambling, and HEC generation/verification, but they are beyond the scope of this paper. For further information see [2] or [6].
[Figure 3.4 diagram: a 65535 byte PDU at the AAL layer is segmented into 48 byte PDUs (1 BOM, 1448 COM, 1 EOM), which the ATM layer carries as 53 byte cells down to the physical layer and physical medium. EOM = End of Message, COM = Continuation, BOM = Beginning.]

Figure 3.4: Segmentation of a 65535 user payload into 53 byte cells

Part II ATM Security Issues


Chapter 4 Introduction to Security Issues


Adding security through cryptography to any system is almost never a trivial task. One must carefully weigh the issues when deciding how much or how little security will be provided. Obtaining security services through strong cryptography may require so much bandwidth and processing overhead, or such a high production cost, that the system would become infeasible for real-world applications. On the other hand, providing a lower level of services may have better performance, but the security may be too weak. The ideal is to find a middle ground which satisfies security needs with a reasonable throughput/cost ratio.

Today's most relevant networking topics are high speed networks and wireless technology, both of which have separate issues with cryptography. On one hand, we need encryption rates fast enough to sustain high speed data streams. On the other hand, we need low bandwidth/low power consumption solutions for wireless operations. ATM falls into the high speed category.

It is difficult to design a universal solution to the ATM security problem, since so much of the design depends on the throughput requirements of the network. ATM by nature does not specify the physical layer which will be used to transfer ATM cells. We therefore need some kind of scalable architecture that can be implemented in today's ATM technology, but can adapt to tomorrow's with little effort or change in protocols. The mainstream developments with ATM support 155 Mb/s and 622 Mb/s. The question is: what cryptographic element is needed to support security services at these speeds? Unfortunately, software is too slow. A software implementation of a common private-key block cipher (these are relatively fast encryptors) allows throughputs of 1-10 Mb/s [26]. Hardware versions typically run 3-4 orders of magnitude faster.

4.1 Potential Threats


The first step to take when developing a secure system is to identify the potential threats against the system. After the threat analysis is complete, the security services used to thwart the attack can be applied more effectively. Lane [16] describes a list of threats in an ATM Network. They range from passive listening to actively destroying network service. The complete list can be found in Table 4.1.
Threat                    Example of Threat                          Object
disclosure                passive listening                          information privacy
denial of service         flooding network                           resource availability
insertion, removal,       modifying ATM headers to alter             integrity
or modification of data   channel definition (misrouting)
intrusion                 unauthorized login                         attached workstations
fraud                     using false identity to obtain resources   network or application provider revenue

Table 4.1: Network Threats

4.2 Security Services


There are many security services that the use of cryptography provides. Each one attempts to thwart one or more attacks from Table 4.1. This section serves as an outline to some features desirable in a secure ATM link.

1. Privacy
Definition - The ability to send information in a manner so that only the intended recipients have the ability to "see" the data.
Solution - Use an encryption algorithm. Issues of key management will need to be resolved.

2. Integrity
Definition - The process of verifying that a payload was not tampered with in transit.
Solution - Include a cryptographic checksum or Message Authentication Code (MAC) with the payload, or use digital signatures.

3. Authentication
Definition - The process by which one node establishes the true identity of a remote node and verifies that the payload has integrity.
Solution - Use digital signatures and/or MACs.

4. Access Control
Definition - The ability to control access to objects and resources based upon the identity or the current level of access granted to an entity.
Solution - For simpler discretionary access control, a password accessed system can be used. For mandatory access control, higher level labeling and compartmenting should be used. Access Control Lists (ACL) are usually kept to govern the access privileges of entities and objects.

5. Replay Prevention
Definition - Preventing an opponent from resending a once valid payload at a later time.
Solution - Include a (secured) timestamp with the payload. If the packet arrives after a time interval greater than the local security policy allows, discard the cell and/or update the audit trail.

6. Non-repudiation
Definition - The ability to prove beyond dispute which party originated a payload.
Solution - Use public-key signatures. Private-key signatures require that the two (or more) parties involved share a secret. If party A sends a private-key signed packet to B, A can later cheat and claim that B sent the packet. However, public-key signatures would allow B to prove that only A knows how to generate a given signature.
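As an illustration of the integrity and replay prevention services listed above, the following minimal sketch attaches a MAC and a secured timestamp to a payload. HMAC-SHA256, the 5 second acceptance window, and the function names `protect` and `verify` are assumptions made for the example; they are not taken from this text or from any ATM specification.

```python
import hashlib
import hmac
import struct
import time

MAX_AGE_S = 5.0  # hypothetical local security policy: accept messages up to 5 s old
TAG_LEN = 32     # HMAC-SHA256 tag length in bytes


def protect(key: bytes, payload: bytes, now: float = None) -> bytes:
    """Prefix an 8 byte timestamp and append a MAC covering timestamp and payload."""
    ts = struct.pack(">d", time.time() if now is None else now)
    tag = hmac.new(key, ts + payload, hashlib.sha256).digest()
    return ts + payload + tag


def verify(key: bytes, message: bytes, now: float = None) -> bytes:
    """Return the payload if the MAC is valid and the timestamp is fresh."""
    ts, payload, tag = message[:8], message[8:-TAG_LEN], message[-TAG_LEN:]
    expected = hmac.new(key, ts + payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed")
    now = time.time() if now is None else now
    if abs(now - struct.unpack(">d", ts)[0]) > MAX_AGE_S:
        raise ValueError("stale timestamp: possible replay")
    return payload


key = b"shared secret for the demo"
msg = protect(key, b"user payload", now=1000.0)
print(verify(key, msg, now=1001.0))
```

Because the MAC key is shared by both parties, this sketch provides integrity and replay detection but not non-repudiation, which is the distinction drawn in item 6.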

Chapter 5 Design Considerations


After building a threat model and deciding on which services need to be implemented, the design and implementation phase begins. Real world issues come into play at this stage and many decisions have to be made. This chapter serves to introduce the issues that were uncovered as research progressed towards a solution to the secure ATM problem.

5.1 Encryption Hardware


What encryption hardware can support the >100 Mb/s throughput rates available in ATM links? There are currently two major forms of hardware suitable for cryptography: custom Application Specific Integrated Circuits (ASICs), and Reconfigurable Hardware, i.e. Erasable Programmable Logic Devices/Field Programmable Gate Arrays (EPLD/FPGA; see footnote 1).

Footnote 1: EPLD is used in this document interchangeably with EPLDs or FPGAs.

5.1.1 ASICs
ASICs have some advantages over Reconfigurable (RC) hardware. First, they are usually faster, since they were designed specifically for the problem, whereas RCs are generic logic devices that are programmed using variable switching matrices and configurable logic elements. These switching matrices add variable delay due to increased parasitic capacitance and resistance of each switch, and the logic elements may sometimes exhibit poorer performance when compared to the gate implemented in custom silicon. Second, ASICs are usually smaller and consume less power because RCs have overhead logic for maintaining the reprogrammable circuitry. However, ASICs cannot offer the flexibility of a reconfigurable device.

5.1.2 Reconfigurables

Aside from the points made above, Reconfigurables (RC) have several major advantages over a custom ASIC design:

1. Algorithm Agility - Because the device can be reconfigured, it allows a designer to simply reprogram the device when a new algorithm is needed. With ASICs, a separate device must stand by until it is needed.
2. Shorter design time and verification - The development time of RC hardware is significantly shorter than that of a full custom solution because the designs can be verified quickly without waiting for manufacturing delays.
3. Design changes are easily accommodated - If an algorithm changes in the future, the binary image can be distributed among the devices in the network, allowing "hardware" upgrades to be done without actually changing any hardware.

Reconfigurables would appear to be the ideal choice for implementing cryptographic elements. However, there are some major problems. Aside from the points above regarding the inherent delay/power-consumption/size problems, it is unclear whether RC hardware can accommodate cryptographic applications. Up until recently, these devices have been extremely small and could only replace a few thousand equivalent gates. A modern link encryptor may consume tens (or hundreds) of thousands of gates. In addition, the I/O resources required to support ATM cells and encryption are fairly significant and may have problems mapping to a device. The speed problem could become an issue as the link speed of the ATM network increases as well.

How many RC devices are needed? An array of RCs may exhibit enough throughput to sustain an OC-12 (approx. 622 Mb/s) rate, but how many chips must run in parallel to achieve this? The most likely location for encryption will be in the ATM layer itself, before the data leaves a node. This means that all the encryption hardware must be on the local Network Interface Card (NIC). The card itself has size constraints and power consumption/cooling requirements. Will laptops using PCMCIA cards ever be able to enjoy secure ATM? If they do, it will most likely be under an ASIC control or an external device. These topics need further evaluation. For more information, see Part III.

5.2 Symmetric Algorithms


Lane and Cohen [17] acknowledge that there are many symmetric algorithms that can be used effectively with ATM. A general set of criteria can be established to test an algorithm for eligibility:

1. The algorithm block size should be able to divide evenly into 384 bits (the ATM payload size). This allows for greater efficiency.
2. The block size should be relatively large (>= 64 bits) so that small patterns in the plaintext do not generate ciphertext that is vulnerable to "ciphertext-substitution" attacks.
3. The combination of (1) and (2) limits the block sizes to 64, 96, 128, 192, or 384 bits.
4. The algorithm should have at least the strength of DES [31], but preferably higher in order to provide long term security. This includes both the key length and the level of immunity to linear (see [19], [20], [21]) and differential (see [4], [5]) cryptanalysis.
5. The algorithm should be easily implemented in hardware with either direct support of ATM speeds (45-622 Mb/s) or provisions for parallel execution to sustain these rates.
6. It should include provisions for key agility on a per-cell basis.
7. Details of the algorithm should be publicly available.

Table 5.1 is a list of algorithms which meet the criteria. All items have been derived from [17], [26], and [29].
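The block sizes permitted by criteria (1) and (2) can be checked by enumerating the divisors of the 384 bit payload that are at least 64 bits wide. This is a small worked example; the constant names are hypothetical.

```python
PAYLOAD_BITS = 384    # 48 byte ATM cell payload
MIN_BLOCK_BITS = 64   # lower bound from criterion (2)

# A block size is eligible if it divides the payload evenly.
eligible = [b for b in range(MIN_BLOCK_BITS, PAYLOAD_BITS + 1)
            if PAYLOAD_BITS % b == 0]
print(eligible)  # [64, 96, 128, 192, 384]
```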
Algorithm         Block Size   Key Length   Security   Speed
DES               64           56           baseline   baseline
Triple DES [22]   64           112          >>DES      1/3 DES
DESX [25]         64           56 + 64      >DES       =DES
RC2 [24]          64           variable     variable   >DES
RC5 [23]          variable     variable     variable   variable
IDEA [15]         64           128          >>DES      >DES
CA-1.1 [13]       384          64 + 1024    unknown
CAST [1]          64           64           >=DES      unknown
SAFER [18]        64           64           unknown    >DES
LOKI [7]          64           64           >=DES      unknown
3-Way [10]        96           96           unknown    >DES

Table 5.1: Crypto algorithms suitable for ATM

Currently the ATM Forum is discussing which algorithm will become the standard. One problem holding the decision back is the US Government's restriction on exporting cryptography. The current law considers cryptography a munition and limits exportable encryption devices to 40 bit keys. This restriction is a matter of intense controversy. There are several policy proposals pending which would either increase the allowed bit length, drop the current restriction altogether, or call for key escrow/recovery mechanisms.

5.2.1 Approved Algorithms


The ATM Forum has approved the following protocols and/or algorithms (to date) for use with Secure ATM (see Table 5.2).
Encryption   Mode      Hash   Authentication    Exchange
DES          ECB       MD5    RSA               RSA
3DES         CBC       SHA    DSS               Diffie-Hellman
FEAL-32      EDE              Elliptic Curves
             Counter

Note - Columns do not correlate

Table 5.2: Approved protocols for use with ATM

5.3 Mode of Operation


A cryptographic algorithm usually provides only the core functionality of data encryption. Several methods, or modes of operation, exist to allow the customization needed for a particular application. For instance, one mode of operation, called Electronic Code Book (ECB), may be used to simply encrypt/decrypt in straight blocks without any feedback or additional operations. This creates a one-to-one mapping between plaintext and ciphertext. Other times, the previous input or output will actually change the next output. These are referred to as feedback modes. Feedback modes serve to randomize the output (thus producing a more non-deterministic output), and to make it harder to modify any given block of ciphertext. The following is an explanation of some commonly used modes of operation [26]:

ECB
The Electronic Code Book mode of operation is the simplest to implement. There is no Initialization Vector (IV) or feedback. The input is simply processed with the current key and the output forms the next block in the ciphertext. There is no correlation between previous output or input bits in the output.

CBC
The Cipher Block Chaining mode simply feeds the previous operation's ciphertext into an XOR with the next plaintext before it produces the next cipher block in the sequence.

CFB
The Cipher Feedback mode XORs the current plaintext block with an encrypted version of the previous ciphertext block.

OFB
The Output Feedback mode XORs the current plaintext block with an encrypted version of the previous IV.

Counter
The Counter mode XORs the current plaintext block with an encrypted version of the previous counter [16].

Bypass
To allow maximum interoperability with the rest of the world, a special mode called "Controlled Bypass" should be allowed. This mode, as the name implies, allows connections to be formed insecurely by bypassing the encryption steps. This design would allow users to select whether security was really necessary for their connection, as well as allowing secure devices to communicate with unsecure devices. Care must be taken when designing such a mode so that the security devices cannot be disabled by an error or a malicious user.
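The difference between ECB and the feedback and counter modes described in this section can be made concrete with a toy example. The 64 bit block cipher below is a hypothetical 4 round Feistel network built on SHA-256, chosen only because it is invertible and needs nothing beyond the standard library; it is not DES and offers no real security. All function names are assumptions made for the sketch.

```python
import hashlib

BLOCK = 8  # bytes; 64 bit blocks, the same size DES uses


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _round(half: bytes, key: bytes, i: int) -> bytes:
    # Toy round function: truncated SHA-256 of key, round index, and half block.
    return hashlib.sha256(key + bytes([i]) + half).digest()[:4]


def encrypt_block(block: bytes, key: bytes, rounds: int = 4) -> bytes:
    left, right = block[:4], block[4:]
    for i in range(rounds):
        left, right = right, _xor(left, _round(right, key, i))
    return left + right


def decrypt_block(block: bytes, key: bytes, rounds: int = 4) -> bytes:
    left, right = block[:4], block[4:]
    for i in reversed(range(rounds)):
        left, right = _xor(right, _round(left, key, i)), left
    return left + right


def blocks(data: bytes) -> list:
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def ecb_encrypt(pt: bytes, key: bytes) -> bytes:
    # No IV and no feedback: each block is processed independently.
    return b"".join(encrypt_block(b, key) for b in blocks(pt))


def cbc_encrypt(pt: bytes, key: bytes, iv: bytes) -> bytes:
    # The previous ciphertext block is XORed into the next plaintext block.
    out, prev = [], iv
    for b in blocks(pt):
        prev = encrypt_block(_xor(b, prev), key)
        out.append(prev)
    return b"".join(out)


def cbc_decrypt(ct: bytes, key: bytes, iv: bytes) -> bytes:
    out, prev = [], iv
    for c in blocks(ct):
        out.append(_xor(decrypt_block(c, key), prev))
        prev = c
    return b"".join(out)


def ctr_crypt(data: bytes, key: bytes, counter0: int) -> bytes:
    # Encrypting and decrypting are the same operation in counter mode.
    out = []
    for n, b in enumerate(blocks(data)):
        ctr = ((counter0 + n) % 2**64).to_bytes(BLOCK, "big")
        out.append(_xor(b, encrypt_block(ctr, key)))
    return b"".join(out)


key, iv = b"toy-key", b"\x01" * BLOCK
payload = b"ABCDEFGH" * 6  # one 48 byte cell payload made of six identical blocks
print("distinct ECB blocks:", len(set(blocks(ecb_encrypt(payload, key)))))
print("distinct CBC blocks:", len(set(blocks(cbc_encrypt(payload, key, iv)))))
```

With six identical plaintext blocks, ECB produces six identical ciphertext blocks, which is the one-to-one mapping noted above, while the chained mode hides the repetition.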

5.4 Synchronization
Typically, security devices need to stay in synchronization with one another due to the mathematical dependencies of crypto algorithms. Some modes of operation (see Section 5.3) have self synchronizing properties, while others require some form of communication or mutual agreement between the devices to maintain lock-step. Modes of operation that use feedback usually require "manual" synchronization.

ATM presents a unique problem to cryptographic algorithms because it allows cells to be discarded in a stream without any notification to either side. It is the task of the AAL layer, or higher layers, to determine when a cell or cells have been lost. Modes of operation which use feedback will result in a partially or completely corrupt stream if as little as one cell is discarded. Therefore, mechanisms must be in place to ensure that resynchronization can occur. This is typically accomplished through the combined use of AAL PDU markers and OAM cells. The AAL layer usually adds markers for "Beginning of Message" (BOM) and "End of Message" (EOM). The designers of a system may opt to allow resynchronization upon the receipt of an EOM cell. In some instances, an Operation, Administration, and Maintenance (OAM) cell may be used when the EOM cell itself has been discarded by the network.
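The effect of a discarded cell on a mode that keeps running state, and the recovery at an EOM boundary, can be illustrated with a toy model. The keystream construction (a SHA-256 chain reset per message) and the class name `ToyStreamEndpoint` are assumptions made purely for illustration; this is not a real cipher mode and not the resynchronization scheme of any ATM specification.

```python
import hashlib


def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()


class ToyStreamEndpoint:
    """OFB-like toy: the per-cell keystream comes from a chained state."""

    def __init__(self, key: bytes):
        self.key = key
        self.msg_no = 0
        self._reset()

    def _reset(self) -> None:
        # Fresh starting state per message, derived from key and message number.
        self.state = h(self.key, self.msg_no.to_bytes(4, "big"))

    def process(self, payload: bytes, is_eom: bool) -> bytes:
        # XOR the 48 byte payload with keystream derived from the current state.
        stream = h(self.state, b"0") + h(self.state, b"1")
        out = bytes(p ^ s for p, s in zip(payload, stream))
        self.state = h(self.state)  # advance the chain by one cell
        if is_eom:                  # resynchronize at the message boundary
            self.msg_no += 1
            self._reset()
        return out


def make_message(tag: bytes, n_cells: int) -> list:
    return [(tag + bytes([i])).ljust(48, b".") for i in range(n_cells)]


key = b"demo session key"
tx, rx = ToyStreamEndpoint(key), ToyStreamEndpoint(key)
messages = [make_message(b"msg0-", 4), make_message(b"msg1-", 4)]

ok = {}
for m, cells in enumerate(messages):
    for i, cell in enumerate(cells):
        eom = i == len(cells) - 1
        ct = tx.process(cell, eom)
        if (m, i) == (0, 1):
            continue  # the network silently discards this cell
        ok[(m, i)] = rx.process(ct, eom) == cell

print(ok)
```

In this run the cells of the first message that follow the discarded cell decrypt incorrectly, and the second message decrypts correctly because both ends reset at the EOM cell. If the EOM cell itself were lost, the receiver would stay out of step, which is the case the OAM cell is meant to cover.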

5.5 Interleaving
When the available hardware crypto chips are not fast enough to sustain the desired rate (say 622 Mb/s), they must be interleaved (or run in parallel). For instance, if a given chip runs with 64 bit blocks at 100 Mb/s, and we want to encrypt at the ATM layer, we would need

\[ \left\lceil \frac{(48/53) \times 622\ \mathrm{Mb/s}}{100\ \mathrm{Mb/s}} \right\rceil = \lceil 5.63 \rceil = 6 \]

chips to sustain an OC-12 line. Note that the ratio 48/53 in the formula compensates for payload/cell-size differences. There are implications involved when using chips in parallel because it is necessary to generate new IVs for each chip in some modes of operation.
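The chip count above can be reproduced for other link and chip rates with a few lines of code. The function name `chips_needed` and its parameters are hypothetical.

```python
import math


def chips_needed(link_mbps: float, chip_mbps: float,
                 payload_bytes: int = 48, cell_bytes: int = 53) -> int:
    """Number of parallel chips needed to encrypt the payload portion of a link."""
    payload_rate = link_mbps * payload_bytes / cell_bytes  # only payloads are encrypted
    return math.ceil(payload_rate / chip_mbps)


print(chips_needed(622, 100))  # 6
print(chips_needed(155, 100))  # 2
```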

5.6 Key Storage


A typical modern day cryptographic system uses two types of cryptographic functions to establish a secure connection: public-key and private-key.

Public-key encryption allows users to publish a key, which any node can obtain in order to send encrypted information to the issuing party. Only the publisher of the key is able to decrypt the encrypted message, using a private key which is mathematically linked to the published key. Conversely, public-key signatures allow a user to publish a verification key with which any node can verify the signature of a block, but only the publisher can produce a valid signature. The advantage of public-key is that information (such as a private key) does not need to be shared among users. The disadvantage is that the algorithms are inherently slow compared to private-key algorithms, and the keys need to be very large (768-1024 bits are common).

Private-key algorithms require users to share a common key between the two (or more) communicating nodes. The advantages of private-key include fast encryption and shorter key lengths. The disadvantages are that users must share a secret, and keys must be transferred over a secure channel.

By using a combination of the two, a designer can create a secure channel using public-key cryptography, negotiate a "session key" with the remote party, and continue the rest of the communications with the speed and agility of private-key algorithms. Also, security services such as data integrity and sender authentication are often achieved through public-key digital signatures. This is what is known as a hybrid scheme, because it uses the advantages of multiple types of algorithms to produce a robust security system with good performance.
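The session key negotiation step of a hybrid scheme can be sketched with a Diffie-Hellman exchange, one of the public-key techniques available for this purpose. The modulus (the Mersenne prime 2^127 - 1), the generator 3, the SHA-256 key derivation, and the function names are all assumptions chosen to keep the example short; the parameters are far too small for real use, and the exchange is unauthenticated unless the public values are also signed.

```python
import hashlib
import secrets

P = 2**127 - 1  # toy prime modulus, NOT a secure size
G = 3           # toy generator (assumption)


def keypair() -> tuple:
    """Return (private exponent, public value) for one party."""
    priv = secrets.randbelow(P - 3) + 2  # in the range [2, P - 2]
    return priv, pow(G, priv, P)


def session_key(priv: int, peer_pub: int, bits: int = 56) -> bytes:
    """Derive a short private-key session key from the shared secret."""
    shared = pow(peer_pub, priv, P)
    digest = hashlib.sha256(shared.to_bytes(16, "big")).digest()
    return digest[: bits // 8]


a_priv, a_pub = keypair()
b_priv, b_pub = keypair()
# Each side combines its own private value with the peer's public value.
print(session_key(a_priv, b_pub) == session_key(b_priv, a_pub))
```

Only the public values cross the network; the bulk traffic would then be encrypted with the derived session key using a fast private-key algorithm.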

5.6.1 Session Keys


Session keys refer to the temporary keys that are valid for the length of one session (vs. more permanent keys like the public encryption keys). This abstraction works well with the ATM model because of its connection-oriented setup. In some systems, it is unclear how long a session key should remain valid. With ATM, it makes sense to associate a session key with a virtual connection and to invalidate that key when the connection is closed. This, of course, means that a session key must be stored in memory for each virtual connection supported by the secure ATM device.

Based upon all discussions above we consider the following example: a 622 Mb/s (OC-12 rate) link with six 100 Mb/s DES chips in parallel. Each chip has a 64 bit block size and a 64 bit IV, and the connection uses one 56 bit session key, thus requiring 6 × 64 + 56 = 440 bits = 55 bytes of storage per VC supported. If an example implementation supports 1024 connections, it would require about 55 KB (56,320 bytes) of RAM to store these keys and IVs. Of course, these calculations are highly dependent on the private-key algorithm used and the mode of operation chosen.
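The storage estimate in the example works out to 55 bytes per VC and 56,320 bytes for 1024 connections, and it can be recomputed for other configurations. The function name and parameters are hypothetical.

```python
def session_state_bytes(chips: int = 6, iv_bits: int = 64, key_bits: int = 56) -> int:
    """Bytes of per-VC state: one IV per parallel chip plus one session key."""
    return (chips * iv_bits + key_bits) // 8


per_vc = session_state_bytes()
total = per_vc * 1024  # 1024 supported connections
print(per_vc, total)   # 55 56320
```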

5.6.2 Public Key Encryption and Signature Keys


Public-key cryptography should be used in order to authenticate a remote host and to negotiate session keys. In order to do so, the public and private-key pairs of both the encryption and signature must be stored in the device. A typical key size may be on the order of 768-2048 bits. To allow for higher security, the higher value will be used. The memory requirements for a typical RSA type implementation would be on the order of 4 kbits + 4 kbits for both the encryption and the signatures.

5.7 Authentication
Authentication and integrity services may demand additional storage in hardware for the generation of Message Authentication Codes (MAC). For a more detailed description of the services, see Sections 6.1.2 and 6.1.3. Generating a MAC usually requires the calculation of a cryptographically secure checksum over the entire payload. These checksums must be stored temporarily until the entire payload has been received. Therefore, depending on which layer the authentication/integrity services are implemented in, there could be a need for a memory buffer equal to the size of one checksum multiplied by the number of secure virtual connections supported by the system. For instance, if a hash such as MD5 (which produces a 128 bit, i.e. 16 byte, digest) were used to generate the MAC, and 1024 connections are supported, then about 16 KB would be needed to support the service.

5.8 Numeric Computation


Computation of public-key signatures and private-key encryption will need to be done in hardware in order to sustain the ATM link rates. One option is that the crypto functions are implemented in external chips (with regard to the ATM processing chips). Most calculations regarding the private-key encryption should be done on local silicon with the ATM hardware because of the high throughput requirements. The computation of public-key algorithms can be done externally; however, it may not be convenient to do so. The Arithmetic Logic Unit (ALU) responsible for these computations will most likely need to perform the following operations on large numbers:

Multiplication modulo an integer
Exponentiation modulo an integer

One must take into consideration that an ALU providing arithmetic for very long operands can consume major silicon real estate in a typical ASIC controlling the ATM layer portion of the stack (in other words, between the SAR and PHY interfaces).
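The two operations listed above are closely related: exponentiation modulo an integer is normally built from repeated modular multiplications. The following sketch shows the standard square-and-multiply method; it illustrates the arithmetic only and says nothing about how a hardware ALU would be organized. The function name `mod_exp` is hypothetical.

```python
def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Right-to-left square-and-multiply: base**exponent mod modulus."""
    if modulus <= 0 or exponent < 0:
        raise ValueError("modulus must be positive and exponent non-negative")
    result = 1 % modulus
    base %= modulus
    while exponent:
        if exponent & 1:                     # multiply step for each set bit
            result = (result * base) % modulus
        base = (base * base) % modulus       # squaring step for every bit
        exponent >>= 1
    return result


# A 1024 bit exponent needs at most about 2 * 1024 modular multiplications.
print(mod_exp(4, 13, 497))  # 445
```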

5.9 Key Agility


Key agility defines the ability of a system to switch cryptographic keys. A security device that uses a single key for all communications would be considered non-key-agile. A system that can switch keys on a per connection basis would be considered highly agile. In almost all scenarios, it is desirable to enable a separate cryptographically isolated channel on each virtual circuit. In a worst case scenario for a key agile system, every cell arriving would originate from a unique VC. An ATM system that incorporates a technology with data rate x in the physical layer must be able to handle one cell every (424 bits/cell)/x seconds if it is to support a new key for every incoming cell. For instance, if x is derived from OC-12 SONET, the transmission speed is 622 Mb/s, yielding:

\[ \frac{424\ \mathrm{bits}}{622 \times 10^{6}\ \mathrm{bits/s}} \approx 681\ \mathrm{ns} \]

which means that a new key (and initialization vector, if required) must be referenced and loaded in less than 681 ns (minus the decryption time). This can have a significant impact on the overall design of the crypto unit. Agility can be realized by limiting the number of secure channels (thereby reducing the memory requirements) and by pipelining the crypto unit so that keys may be loaded before they are scheduled for decryption/encryption. For instance, GTE limits the secure connections on their GTE FastLane ATM encryptor to 4096. The design in [27] uses an address hash table and limits the secure connections to 2^16.
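The per-cell time budget scales inversely with the link rate, which a short calculation makes explicit. The function name `cell_time_ns` is hypothetical; the two rates are the 155 Mb/s and 622 Mb/s figures used in this document.

```python
CELL_BITS = 424  # 53 bytes per cell


def cell_time_ns(rate_bits_per_s: float) -> float:
    """Time available per cell, in nanoseconds, at the given link rate."""
    return CELL_BITS / rate_bits_per_s * 1e9


for label, rate in (("155 Mb/s", 155e6), ("622 Mb/s", 622e6)):
    print(f"{label}: {cell_time_ns(rate):.1f} ns per cell")
```

At 622 Mb/s this gives roughly 681.7 ns, so the key and IV lookup must complete within that window in the worst case of a new VC on every cell.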


Our architecture takes into consideration all of the issues presented above. The general layout of the components is given in Figure 5.1. The idea is to use enough encryption hardware in parallel to sustain the link speed.
[Figure 5.1 block diagram: Input (424 bits), Cell Register, Scheduler, Key Buffer (key + IV), parallel Cell Encryptors, Controlled Bypass, Reassembly, Output (424 bits).]

Figure 5.1: The Key Agile Architecture

5.9.1 Overall Layout


The module described in Figure 5.1 is the heart of the general architecture presented in Figure 5.2, which demonstrates the placement of the hardware with respect to the UTOPIA bus. This design allows traffic flowing through a bi-directional bus to use a single bank of security hardware, rather than having separate devices. Often the hardware will not be local to one device, but rather spread out into separate modules. This allows larger memory and faster encryption hardware to be integrated. The layout may resemble Figure 5.3.

[Figure 5.2 block diagram: the Key Agile Architecture and Control Logic sit between two UTOPIA interfaces, one to/from the SAR and one to/from the PHY, connected by 424 bit paths.]

Figure 5.2: The General Security Architecture

5.9.2 Architecture Description


The Input Buffer
The input buffer's job is essentially to queue one single cell (424 bits + control information) while the memory unit is accessed. In the event that the memory unit is fast enough, this step is unnecessary. However, as we pointed out in earlier sections, as the speed of the network increases, this stage will become more and more important. If the network speed increases so much that the single buffer does not provide enough delay to accommodate the latency of the memory unit, additional units can be added as long as the memory unit has additional I/O ports. This is a topic for further study.

The Cipher Array

The cipher array is a simple parallel configuration of two main components: the encryption block and a controlled bypass unit. Each block accepts one cell plus control data and will process that cell in a fixed time unit. The processing that occurs is essentially encrypting (or decrypting) the user payload of the cell (in the case of the encryption block), or simply passing the data through after the fixed amount of time has expired (in the case of the controlled bypass unit).

[Figure 5.3 block diagram: an ASIC holding the schedule and control logic sits between the SAR and the PHY on the UTOPIA bus, with an external key RAM and a bank of FPGAs serving as the encryption hardware.]

Figure 5.3: Module Layout

The Scheduler
The scheduler's main task is to organize the traffic and route it through the proper block in the cipher array. Certain channels, such as those designated as control or management plane, always pass in the clear and therefore are routed into the controlled bypass unit. Other channels may be designated as clear channels dynamically (at call setup time); therefore the scheduler has provisions for storing VPI/VCI pairs. Whenever a cell arrives at the unit, it is checked against both the static and dynamic tables for a match. If a match is found, the cell is routed through the controlled bypass unit. Otherwise, the traffic is routed through an available cipher unit for processing.
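The routing decision described for the scheduler can be sketched as a lookup against a static and a dynamic table of clear channels. The class name `Scheduler`, the round-robin choice of cipher unit, and the example static entries are assumptions made for illustration and are not part of the architecture as specified here.

```python
class Scheduler:
    """Route cells to the controlled bypass unit or to a cipher unit by VPI/VCI."""

    def __init__(self, static_clear, n_cipher_units: int):
        self.static_clear = frozenset(static_clear)  # channels that always pass in the clear
        self.dynamic_clear = set()                   # clear channels designated at call setup
        self.n_cipher_units = n_cipher_units
        self._next_unit = 0

    def add_clear_channel(self, vpi: int, vci: int) -> None:
        self.dynamic_clear.add((vpi, vci))

    def remove_clear_channel(self, vpi: int, vci: int) -> None:
        self.dynamic_clear.discard((vpi, vci))

    def route(self, vpi: int, vci: int) -> str:
        channel = (vpi, vci)
        if channel in self.static_clear or channel in self.dynamic_clear:
            return "bypass"
        # Assumption: cipher units are handed out round-robin.
        unit = self._next_unit
        self._next_unit = (unit + 1) % self.n_cipher_units
        return f"cipher-{unit}"


# Example static entries: a signaling channel and a management channel.
sched = Scheduler(static_clear={(0, 5), (0, 16)}, n_cipher_units=6)
print(sched.route(0, 5), sched.route(1, 100), sched.route(1, 101))
```

A hardware implementation would replace the Python sets with a content addressable memory or hash table, but the decision logic would have the same shape.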


5.9.3 Design Considerations


There are several variations that may alter the overall layout of the modules, such as exploiting parallelism on a different basis from the obvious. Normally the parallelism is exploited directly in-stream, where all hardware is dedicated to encrypting/decrypting one cell at a time. This leads to very low latency. The drawback, however, is that each device needs a new Initialization Vector (IV) (see footnote 2), which will increase the memory requirements by the number of chips in parallel.

Possibilities exist to reduce the memory by multiplexing the channels over multiple devices. However, several real world issues may prevent the design from ever being used. For instance, since the IV is needed for each channel and the IV changes with each encryption, a single channel would require a read/encrypt/write-back cycle to complete before the next cell in that channel could be handled. This changes the manner in which the traffic could be handled by the unit and would require that each individual channel be limited by the bandwidth of a single device. It should be noted that in most circumstances this would be acceptable; however, it complicates the traffic shaping and may be more trouble than it is worth.

5.10 Algorithm Agility


Algorithm agility is the ability to switch cryptographic algorithms. An implementation that has one algorithm in hardware would be non-agile. An implementation that could switch crypto routines on a per-cell basis would be highly agile. Algorithm agility is easy to accomplish if encryption is performed in software. However, ATM speeds dictate that hardware approaches must be used. Hardware algorithm agility can be realized either by providing all algorithms of interest on an ASIC, or by using reprogrammable logic (FPGAs, EPLDs). One problem with the latter approach is that achievable data rates might be too slow for ATM. With today's technology, it is not possible to reprogram the chip on a per-cell basis. In fact, it takes orders of magnitude more time to program the chip than the interval between cell arrivals. The FPGA method is good for allowing flexibility in the protocol design, without allowing algorithm agility per cell. If the algorithms fit onto a feasible number of ASIC chips, the ASIC approach offers a good throughput to size ratio and may offer better overall performance in the ATM network.

The ATM Forum is designing the system to allow algorithm type/version information to be exchanged at secure call setup time. This allows the two remote hosts to guarantee that both parties are using the same security devices.

Another form of algorithm agility is the ability to change modes of operation. This is easier to accomplish because it involves the use of a single hardware algorithm, only requiring the usage of the algorithm to be changed. For a more detailed discussion on modes of operation, see Section 5.3. Currently, the ATM Forum is considering using DES ECB mode by default, with CBC and possibly counter mode as alternatives negotiated at call setup time [16]. Part III covers algorithm agility issues at a much deeper level.

Footnote 2: If it is running in a feedback mode.

Chapter 6 Security Topology


6.1 Where to Place the Services

There are various security services (see Section 4.2) available to protect ATM traffic. One major issue to solve is how to protect different types of traffic, and where to place these services in the ATM model. The types of traffic can be categorized into four major groups:

User
Control
Management
OAM

User cells come from the user plane. OAM cells come from both the control and user planes. The control and management cells come from the control and management planes, respectively. Each type of traffic uses a different pre-defined channel. In general, it is recommended to provide the services according to Table 6.2 to protect each plane from attacks. As the table notes, some features are desirable but are better left for higher layers to handle due to their nature. For instance, some applications, like a secure telnet program, may not care about non-repudiation, while others, such as electronic commerce, may wish to prove that a customer ordered a product when they claim they have not.

Use                                     VPI   VCI   PT    PL
Unassigned Channel                      0     0     ns    ns
Meta-signaling (default)                0     1     0a0   C
Meta-signaling                          nz    1     0a0   C
General Broadcast signaling (default)   0     2     0aa   C
General Broadcast signaling             nz    2     0aa   C
Segment OAM F4 Flow                     ns    3     0a0   CU
End-to-End OAM F4 Flow                  ns    4     0a0   CU
Point-to-Point signaling (default)      0     5     0aa   C
Point-to-Point signaling                nz    5     0aa   C
ILMI Transmission                       0     16    aaa   M
User data cell                          ns    n     0aa   U
Segment OAM F5 Flow                     ns    n     100   CU
End-to-End OAM F5 Flow                  ns    n     101   CU
Reserved                                ns    n     11a   ns

"a" = bit is available for use by other layers
"n" = None of the above
"ns" = Not specified; depends on the current connection's assignment
"nz" = Non-zero
"U" = User plane
"C" = Control plane
"M" = Management plane
"PT" = Payload Type
"PL" = Plane Origination

Table 6.1: ATM Channel Definition
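The channel definitions of Table 6.1 amount to a classification rule on the VPI, VCI, and PT fields, which the following sketch expresses directly. The function name `classify` and the returned labels are hypothetical, and as a simplification the PT bits are ignored for the channels identified by a reserved VCI.

```python
SIGNALING_VCIS = {
    1: "meta-signaling",
    2: "general broadcast signaling",
    5: "point-to-point signaling",
}


def classify(vpi: int, vci: int, pt: int):
    """Return (use, plane) for a cell; plane is 'C', 'U', 'M', 'CU', or None."""
    if vpi == 0 and vci == 0:
        return ("unassigned", None)
    if vci in SIGNALING_VCIS:
        return (SIGNALING_VCIS[vci], "C")
    if vci == 3:
        return ("segment OAM F4", "CU")
    if vci == 4:
        return ("end-to-end OAM F4", "CU")
    if vpi == 0 and vci == 16:
        return ("ILMI", "M")
    # Any other VCI: the 3 bit payload type decides.
    if pt == 0b100:
        return ("segment OAM F5", "CU")
    if pt == 0b101:
        return ("end-to-end OAM F5", "CU")
    if pt & 0b110 == 0b110:
        return ("reserved", None)
    return ("user data", "U")


print(classify(0, 5, 0b000), classify(1, 100, 0b000), classify(1, 100, 0b101))
```

A security device could use such a rule to decide which cells belong to the user plane and are therefore candidates for encryption.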
Service             UPLANE   CPLANE   MPLANE
Privacy             M        -        -
Authentication      M        M        M
Integrity           R,O      M        M
Access Control      M        -        -
Replay Prevention   H,O      M        M
Non-Repudiation     H        -        -

M: Mandatory   R: Recommended   O: Optional   H: possibly in higher layers   -: no entry

Table 6.2: Recommended service location


6.1.1 Privacy
Privacy, the most widely understood security service, is the protection of data from unintentional disclosure. The cycle of cryptographic encryption and decryption makes this service possible. Table 6.2 lists privacy as mandatory in the user plane and non-existent in the others. The reason behind this design choice is simple: the user plane carries any data of interest to a user application, and the main goal of the system architecture regarding ATM security is to provide services to the user plane. The control and management planes, while playing a significant role in the system operation, exist only to serve the user plane. This is not to imply that the control and management planes do not need security, only that they do not need privacy. In fact, it is important that data from the control and management planes stay in cleartext, because they are often interpreted by the network while in transit.

It is important to note some of the implications of this design choice. The most relevant side effect is that traffic analysis is still possible, because the control and management planes play a central role in traffic management. This may not be a major concern in commercial applications, but can be an issue on the military side.

While it has been pointed out that the privacy service should only be implemented in the user plane, it has not yet been described where in the ATM model the encryption/decryption cycle should occur. Essentially, there are three choices, and all have their pros and cons.

At or Above the AAL Layer: Implementing privacy services in the AAL layer would depend on the AAL type, because the size of the user Protocol Data Unit (PDU) is entirely up to the designers of the AAL classes. A typical PDU may be 65535 bytes. The idea behind this approach is very simple. The PDU is block encrypted (or decrypted, depending on the direction of travel) before any of the Convergence Sublayer (CS) or Segmentation And Reassembly (SAR) operations are performed. As the now encrypted payload is broken down into ATM payloads, additional overhead such as checksums and sequence numbers is added to the 48 byte cell payload. For instance, consider Figure 3.2, which specifies an AAL3/4 class cell. The 44 octets of user payload would be encrypted, but the other fields would not. Figure 3.4 shows the convergence of a 65535 byte user payload into cells as it passes through the layers. If all 65535 bytes were encrypted as a block, then the 44 bytes in an AAL3/4 cell would be the only encrypted portion.

There are several advantages to this approach. First, this represents the minimal amount of data to encrypt (as opposed to including the header and AAL fields). Also, resynchronizing the crypto algorithm is easy because there are well-defined data boundaries (the beginning and end of the user PDU). The completed PDU, after being processed by the CS and SAR layers, would be ready for decryption. If an error occurred in the bit stream, the PDU would be discarded before any decryption cycle was started, thus moving the resynchronization problem from the crypto system to the already existing AAL infrastructure. (See the AAL section of Figure 6.1 for a diagram of an ATM cell encrypted in this fashion.)
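The encrypt-then-segment order described above can be illustrated with a short Python sketch. It is only a simplified model under stated assumptions: `toy_cipher` is a hypothetical, insecure stand-in for a real block cipher, the CS header and trailer are ignored, and only the 44 octet AAL3/4 user payload and the BOM/COM/EOM labels come from the text.

```python
import hashlib

SAR_PAYLOAD = 44  # user octets per AAL3/4 cell, as stated in the text


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Stand-in cipher, NOT secure: XOR with a SHA-256 counter keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


def segment(pdu: bytes) -> list:
    """Split an already encrypted PDU into (label, 44-octet payload) pairs."""
    chunks = [pdu[i:i + SAR_PAYLOAD] for i in range(0, len(pdu), SAR_PAYLOAD)]
    if len(chunks) <= 1:
        return [("SSM", c) for c in chunks]  # single-segment message
    labels = ["BOM"] + ["COM"] * (len(chunks) - 2) + ["EOM"]
    return list(zip(labels, chunks))


key = b"illustrative key"                    # assumption: any shared secret
pdu = bytes(range(256)) * 255 + bytes(255)   # 65535 octets of sample data
cells = segment(toy_cipher(key, pdu))        # encrypt first, then segment
print(len(cells), cells[0][0], cells[-1][0], len(cells[-1][1]))
```

Because the whole PDU is encrypted before segmentation, the receiver can reassemble the payloads, discard a damaged PDU, and only then run one decryption pass over the complete unit.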

At or Below the Physical Layer: Adding privacy services in the physical layer has several implications. First, all 53 bytes of the cell are encrypted, thus providing the highest amount of protection against eavesdropping and traffic analysis. However, encrypting all 53 bytes means that the header must be decrypted at each and every switch that the cell passes through. This can create an extreme security breach, since the switches may be in an uncontrolled environment such as a public network. The added control over traffic analysis can be a major advantage: this is the only method that completely eliminates the ability to monitor the traffic from any point in the connection. Nevertheless, the security problem in the switches makes the physical layer a poor choice.

An alternate solution may be to encrypt the header and payload separately, but this increases the complexity significantly. If, in this case, every switch in the virtual channel must negotiate keys with one another for decrypting the headers, complete chaos could result. Setup latency could increase beyond the limits of usability. Multiple security negotiations could weaken the overall security of the system, leaving multiple points for attack. Last but not least, simple design flaws in the crypto protocols could result in multiple switches suddenly becoming obsolete.

At or Above the ATM Layer: Adding the privacy services to the ATM layer offers many advantages. It allows the maximum amount of privacy and protection against traffic analysis without requiring switches to decrypt the 5 byte header. This increases the overall security and speed of the system. Figure 6.1 shows the level of confidentiality achieved when the security is implemented in the various layers. Encryption in the ATM layer prevents an eavesdropper from obtaining information about the operation of the AAL layer. For instance, if an opponent were able to distinguish a Segment Type of BOM and the crypto algorithm were deterministic, a known plaintext attack would become very easy to execute, because the opponent would know the synchronization pattern of the messages. In general, it makes sense to have the highest amount of privacy possible without affecting the operation of the system.

With both the Physical and ATM Layer choices, cryptographic resynchronization becomes more of an issue (when compared to the AAL solution). Since the AAL layer has direct control over the PDU framing, resynchronization boundaries are easy to pinpoint there. In the lower layers, more work must be done to acquire synchronization boundaries. Fortunately, the AAL information is easy to interpret, and its format is standardized. For AAL types 2 and 3/4, the PDU boundaries are clearly marked by BOM and EOM labels in the cell fields. For types 1 and 5, an alternate approach must be used: an interval of bytes (or cells) should be agreed upon, which signals when resynchronization is needed.
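The two resynchronization rules above can be sketched as a single predicate. This is a hedged illustration rather than a specified mechanism: the function names and the default interval of 64 cells are hypothetical, and the Segment Type encoding (BOM=10, COM=00, EOM=01, SSM=11 in the top two bits of the first SAR octet) reflects the usual AAL3/4 layout and should be checked against the standard before use.

```python
# Assumed 2-bit Segment Type codes for AAL3/4 (verify against the AAL standard).
ST_BOM, ST_COM, ST_EOM, ST_SSM = 0b10, 0b00, 0b01, 0b11


def segment_type(sar_pdu: bytes) -> int:
    """Return the Segment Type from the top two bits of the first octet."""
    return sar_pdu[0] >> 6


def is_resync_point(sar_pdu: bytes, aal_type: str,
                    cell_index: int, interval: int = 64) -> bool:
    """Decide whether the cipher should resynchronize before this cell."""
    if aal_type == "3/4":
        # A new PDU starts at a BOM cell or at a single-segment message.
        return segment_type(sar_pdu) in (ST_BOM, ST_SSM)
    # AAL 1 and 5 carry no BOM label, so fall back to an agreed cell interval.
    return cell_index % interval == 0


bom_cell = bytes([0b10000000]) + bytes(47)
com_cell = bytes([0b00000000]) + bytes(47)
print(is_resync_point(bom_cell, "3/4", 5), is_resync_point(com_cell, "3/4", 5))
```

Both endpoints would have to agree on the interval during call setup for the fallback rule to work.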

[Figure 6.1: Comparison of encryption in various levels of the ATM stack. Labels: AAL Layer, ATM Layer, Physical Layer; 48 bytes, 53 bytes; legend: ciphertext, plaintext.]

These same issues will surface again when authentication/integrity is discussed later in the text.

6.1.2 Authentication
Authentication services were not widespread until the advent of public-key cryptography. Before then, authentication was provided as a beneficial side effect of symmetric cryptography. All communications were performed using secret keys, so all communications were assumed "authentic" if both parties knew the key. This is the nature of private-key cryptography. Public-key algorithms changed everything. They created ways to encrypt data without sharing the secret. In fact, as the name suggests, they allow encrypting data using "public" information. That means anyone can transmit data over a network to a remote host with a certain level of assurance that only the intended recipient, the holder of the private/public-key pair, can decode the message. This solves many problems, but it creates a major one: it can no longer be assumed that correctly encoded data signifies an authentic host. Another mechanism must be installed to allow for this secondary check. These mechanisms provide the service of "Authentication" and, as described below, have implications when included in the secure ATM model.


Once again, refer to the authentication portion of Table 6.2. It lists authentication as being mandatory in all planes. While this is true, the protocols and algorithms vary greatly from plane to plane, as described below.

Control Plane
Control plane authentication is an important issue because it involves the operation of the whole ATM system. Control messages are sent by all types of devices on both sides of the ATM UNI to control tasks such as Call Setup, Call Disconnect, and Parameter Setting. A malicious user could insert an invalid Call Disconnect message into an existing stream and cause the ATM entity to drop the call unintentionally. Authentication services at this level, however, would restrict valid control messages to those originating from parties involved in a connection (meaning the remote nodes and the switching parties along the way). There are many different methods of implementing this. Although nothing has been agreed upon by the ATM Forum, the most obvious method of adding authentication to the control plane is to add one or more Information Elements to the messages. At first glance, it may not be obvious what the requirements of the system are, but further research would reveal two classes of control messages: those which could cause damage to a network, and those which are more passive in nature. Recall Tables 3.1 and 3.3, which briefly describe the possible messages from an ATM entity.
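As a rough illustration of the Information Element idea, the sketch below appends an authentication tag to a control message and checks it only for the "damaging" class. Everything specific here is an assumption made for the example: the IE identifier `0xF0`, the one octet length field, the membership of the `DAMAGING` set, and the use of a shared-key HMAC-SHA-256 in place of a public-key signature.

```python
import hashlib
import hmac

# Hypothetical split of control messages into the two classes named above.
DAMAGING = {"CALL DISCONNECT", "RESTART", "SETUP"}
AUTH_IE_ID = 0xF0   # made-up identifier for an authentication IE
TAG_LEN = 32        # HMAC-SHA-256 output size in octets


def add_auth_ie(msg_type: str, body: bytes, key: bytes) -> bytes:
    """Append a (hypothetical) authentication IE covering type and body."""
    tag = hmac.new(key, msg_type.encode() + body, hashlib.sha256).digest()
    return body + bytes([AUTH_IE_ID, TAG_LEN]) + tag


def accept(msg_type: str, wire: bytes, key: bytes) -> bool:
    """Accept passive messages as-is; require a valid tag on damaging ones."""
    if msg_type not in DAMAGING:
        return True
    if len(wire) < TAG_LEN + 2 or wire[-TAG_LEN - 2] != AUTH_IE_ID:
        return False
    body, tag = wire[:-TAG_LEN - 2], wire[-TAG_LEN:]
    expected = hmac.new(key, msg_type.encode() + body, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)


key = b"shared control-plane key"
wire = add_auth_ie("RESTART", b"\x01\x02", key)
print(accept("RESTART", wire, key), accept("RESTART", b"\x01\x02", key))
```

A forged Call Disconnect without a valid tag would be rejected in this model, while passive messages pass without the extra cost of verification.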

Management Plane
The Management plane, currently controlled by the ILMI specification, uses normal AAL class traffic over a predefined channel. Authentication services will need to be handled either in the management plane itself, or by a protocol which detects ILMI traffic in the ATM layer (VPI/VCI=0/16) and performs a lower layer authentication protocol with the use of OAM or other out-of-band sources.

User Plane
The User plane does not need direct support for authentication if the control plane performs its job correctly. If the User plane can rely on the control plane to authenticate all call setup procedures, then the resulting session can be assumed to be authentic, and integrity services can be used to authenticate all traffic because of the symmetric algorithms used for normal data flow (see above description).

6.1.3 Integrity
Integrity services are listed in Table 6.2 as optional in the user plane and mandatory in the control and management planes. Integrity services should be implemented with Message Authentication Codes (MACs), but there are two different requirements based on the types of traffic to handle.

Control and Management traffic will generally have integrity services provided as a direct result of the authentication services, which typically provide integrity as a beneficial side effect. If the signature system which protects the C and M planes does not provide integrity, then a form of keyed one-way hash will work to provide tamper-proof MACs. However, the keyed variants of the hashes require a key, which can cause extra setup overhead.

Integrity in the user plane, if provided, can use a simple algorithm like SHA, which produces a 160 bit output. By appending the hash output to the plaintext and then encrypting the whole packet, integrity can be checked at the remote end by recomputing the hash on the data and comparing the received hash value with the computed hash value. This is very similar to the method used with checksums.

In a previous discussion it was mentioned that a memory buffer may be needed for each VC supported in the system. The use of that buffer will be described here. There are essentially two choices for the location of the MAC generator: the AAL layer or the ATM layer.
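The hash-then-encrypt check for the user plane can be sketched as follows. This is a minimal model under stated assumptions: SHA-1 is used because it gives the 160 bit output mentioned above, and `toy_cipher` is a hypothetical, insecure stand-in for the real encryption algorithm.

```python
import hashlib

HASH_LEN = 20  # SHA-1 yields 160 bits, matching the figure in the text


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Stand-in cipher, NOT secure: XOR with a SHA-256 counter keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


def protect(key: bytes, plaintext: bytes) -> bytes:
    """Append the hash to the plaintext, then encrypt the whole packet."""
    return toy_cipher(key, plaintext + hashlib.sha1(plaintext).digest())


def verify(key: bytes, packet: bytes):
    """Decrypt, recompute the hash, and return the data or None on mismatch."""
    clear = toy_cipher(key, packet)
    data, received = clear[:-HASH_LEN], clear[-HASH_LEN:]
    if hashlib.sha1(data).digest() != received:
        return None
    return data


key = b"session key"
packet = protect(key, b"user plane data")
print(verify(key, packet))
```

The receiver needs no extra key for the hash in this arrangement, because the hash value is protected by the encryption of the packet itself.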


AAL Implementation
Adding integrity to the user plane can be taken care of in the AAL layer. The PDU, while still contained as a single unit (say 65535 octets), can have the 128 bit MAC generated across the whole buffer and appended to the end, much like a standard checksum calculation. In this manner, the PDU+MAC is treated as a new single unit PDU and passed through the segmentation and convergence sublayers and on to the ATM layer. Conversely, once the complete PDU+MAC has been received, the MAC can be recomputed and compared.

ATM Implementation
Using the 16KB buffer in the ATM layer may give a cleaner implementation by decoupling the MAC from the user stream. The main disadvantage of inserting the MAC inline is that a device that does not comply with the security protocol of the sending host may incorrectly interpret the MAC as standard PDU data, causing both stream corruption and loss of PDU framing. The design is as follows. A memory buffer is set aside for each secure VC in the system. As a new PDU begins (designated by the BOM or synthetic marker; see footnote 1), the buffer corresponding to the VC is filled with the first sweep through the hash algorithm, which stores its temporary result in the buffer. This continues until the EOM (or synthetic) marker arrives, at which time the buffer is built into a cell and encrypted with the rest of the stream. The OAM cell is sent to the remote party, which should be able to extract the MAC information and compute its own result on the previously received PDU. It may be advantageous to use OAM cells to send the MAC out-of-band (OOB) to differentiate between user and crypto streams. However, OAM cells are on plaintext channels, and therefore a keyed one-way hash would have to be used, as opposed to a plain hash (or equivalent) algorithm.

Footnote 1: A synthetic marker is a marker designated by the protocol designers for use with AAL 1 and 5, which have no BOM marker.
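The per-VC buffer described above can be modelled as a table of running keyed hashes. The sketch below is illustrative only: the class name `VcMacTracker`, the string markers, and the choice of HMAC-SHA-256 as the keyed one-way hash are assumptions, not part of the design in the text.

```python
import hashlib
import hmac


class VcMacTracker:
    """Keeps one in-progress keyed hash per secure VC (the 'buffer')."""

    def __init__(self, key: bytes):
        self.key = key
        self.running = {}  # (vpi, vci) -> in-progress HMAC object

    def on_cell(self, vc, marker: str, payload: bytes):
        """Feed one cell; return the finished MAC at the end of a PDU."""
        if marker in ("BOM", "SSM"):
            # Start of a PDU: begin a fresh hash for this VC.
            self.running[vc] = hmac.new(self.key, digestmod=hashlib.sha256)
        state = self.running.get(vc)
        if state is None:
            return None  # cell seen without a PDU start; ignored in this sketch
        state.update(payload)
        if marker in ("EOM", "SSM"):
            del self.running[vc]
            return state.digest()  # would be placed in an OAM cell
        return None


key = b"per-connection MAC key"
tracker = VcMacTracker(key)
vc = (0, 42)
tracker.on_cell(vc, "BOM", b"a" * 44)
tracker.on_cell(vc, "COM", b"b" * 44)
mac = tracker.on_cell(vc, "EOM", b"c" * 44)
print(mac.hex())
```

Only the intermediate hash state is stored per VC, so the memory cost does not grow with the PDU size.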

6.1.4 Access Control


Access Control services are strictly in the User Plane domain. Access control is the technique that governs which objects a host object is allowed to access. In network technologies, access control is typically implemented as a series of checks that governs which remote host or network, or more specifically which remote process, the local host can gain access to. This is directly relevant to the interests of MLS compliant systems. Access control can be realized with a protocol very similar to that of Section 6.1.2 regarding authentication. Most access control can be negotiated at call setup time and assumed to be valid while the connection exists. This might seem to imply that access control is in the Control Plane as well, but this is not the case. The Control Plane simply negotiates the access control parameters, and is not itself regulated by the access control rules that apply to normal user plane data. Once again, additional IEs in the setup message could include provisions for security labels, which would ultimately help to establish a connection at a certain security level. An example transaction would proceed as follows:

1. Host A builds a setup request for Host B with level "Top Secret"
2. Host A sends the request
3. Host B receives the request, and sends a "Call Proceeding" message back
4. Host B signals a trusted authority (TA) out-of-band from the original contact channel
5. The TA returns a Certificate of Host A and its highest level of clearance (at least equal to "Top Secret")
6. Additional key negotiations transpire, and the "Connect" event is sent back to Host A

On an even higher resolution system, the actual user, or user process level, could be checked at each end to govern access control. The disadvantage of access control is that it will most definitely require modifications to the signaling protocol, which could take some time within the standards bodies. The ATM Forum is basing the security label format on the IETF working group's recommendation from the Common IP Security Option (CIPSO).
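The label check in the example transaction can be sketched in a few lines. The level names other than "Top Secret", their ordering, the `TA_DIRECTORY` table standing in for the trusted authority, and the "RELEASE" reply are all assumptions made for illustration.

```python
# Assumed ordering of security labels, lowest to highest.
LEVELS = ["Unclassified", "Confidential", "Secret", "Top Secret"]

# Stand-in for the trusted authority: host -> highest clearance.
TA_DIRECTORY = {"HostA": "Top Secret", "HostC": "Confidential"}


def rank(level: str) -> int:
    """Map a label to its position in the assumed ordering."""
    return LEVELS.index(level)


def handle_setup(caller: str, requested_level: str) -> str:
    """Host B's decision after asking the TA about the caller (steps 4-6)."""
    clearance = TA_DIRECTORY.get(caller)
    if clearance is None or rank(clearance) < rank(requested_level):
        return "RELEASE"  # hypothetical rejection message
    return "CONNECT"


print(handle_setup("HostA", "Top Secret"), handle_setup("HostC", "Secret"))
```

The decision is made once at call setup and is then assumed to hold for the lifetime of the connection, as described above.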

6.1.5 Replay Prevention


Replay Prevention is the service which prevents an attacker from storing a valid message (from a valid host) and later retransmitting that message to repeat a previous operation. Replay poses a serious threat for messages such as the "Restart" event, or a bank transaction which says "Transfer 1000 dollars from A to B". Replay can be prevented with several different techniques, such as time stamps, counters, and nonces (see footnote 2) in challenge-and-response protocols [26]. Each has advantages and disadvantages. Time stamps are easy to understand but hard to implement because clocks on different systems must be synchronized. Nonce challenging involves sending a random number back and forth a few times to gain assurance as to the originator of a message. Such services are only necessary in the Control and Management planes, so that ATM control cannot be tampered with. If a user application wishes to implement replay prevention, it should be performed in higher layers.

Footnote 2: "Nonce" stands for "Number Once", a random number that is used once.
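Two of the techniques named above, counters and nonce challenges, can be sketched briefly. This is an illustrative model only: the names `CounterGuard`, `make_challenge`, `respond`, and `check` are hypothetical, and HMAC-SHA-256 over a shared key is an assumed way of binding the response to the nonce.

```python
import hashlib
import hmac
import secrets


class CounterGuard:
    """Accept a message only if its counter exceeds every earlier one."""

    def __init__(self):
        self.last = -1

    def accept(self, counter: int) -> bool:
        if counter <= self.last:
            return False  # old or repeated counter: treat as a replay
        self.last = counter
        return True


def make_challenge() -> bytes:
    """Verifier side: pick a fresh random nonce for this exchange."""
    return secrets.token_bytes(16)


def respond(key: bytes, nonce: bytes, message: bytes) -> bytes:
    """Sender side: bind the message to the verifier's nonce."""
    return hmac.new(key, nonce + message, hashlib.sha256).digest()


def check(key: bytes, nonce: bytes, message: bytes, response: bytes) -> bool:
    """Verifier side: a response recorded under an old nonce will not match."""
    return hmac.compare_digest(respond(key, nonce, message), response)


key = b"control plane key"
nonce = make_challenge()
reply = respond(key, nonce, b"RESTART")
print(check(key, nonce, b"RESTART", reply))
```

A recorded "Restart" response is useless to an attacker here, because the verifier issues a different nonce for each exchange.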


6.1.6 Non-Repudiation
Non-repudiation is the service which prevents "cheating" parties from sending a valid message and then claiming they did not. In general, this service should be implemented in higher layers.

6.2 Hardware Location


In addition to the issue of which layer in the ATM model should control security, there is also the issue of placement in the network. Three major designs are presented: End-to-End, Edge-to-Edge, and End-to-Edge.

6.2.1 Network Placement


End-to-End: End-to-End security refers to a system in which ciphertext is transmitted from one End Station to the remote End Station. All intermediate steps in the network (such as switches and bridges) only process ciphertext. This configuration offers the highest level of security to a network, but also requires the greatest amount of overhead.

Edge-to-Edge: Edge-to-Edge security refers to a system in which plaintext is used in a local environment (such as a LAN) and is only converted to ciphertext at the "Edge" of the network where the data leaves the local area. Such a system would be deployed in an area where the LAN would be considered safe from attack (such as a single building in a business) and all data leaving the LAN would be considered vulnerable, thus requiring the security. The remote destination would have an equivalent security device at the "Edge" of its network as well, which would convert the ciphertext back to plaintext once the data has entered the remote "safe domain".

End-to-Edge: End-to-Edge security refers to the use of a security device inside a local environment (such as a LAN) which encrypts all data which remains in the LAN but decrypts data on its way to the outside world. At first glance, this may seem like a useless security measure, but it does have a valid application. Consider a military system with multi level data being transmitted on the same network. It is imperative to have all data sent securely so that no lower level process can access a higher level process's data. However, to gain interoperability with the outside world, it is necessary to allow some means of access. Since all of the internal traffic is encrypted in a manner the outside world cannot understand, there must be a way to intelligently decide which traffic is allowed to pass. The answer is: any data with a low level is allowed to be decrypted and sent through the "Edge" guard and vice-versa. Any data arriving at the guard is considered low level and is encrypted and labeled this way. The guard forms a crypto firewall.

There is no definitive solution for ATM. The needs of the application will dictate which method is implemented in any given situation.
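The pass rule of the End-to-Edge guard can be written down as two small functions. This is a sketch under assumptions: the `Cell` record, the numeric label `LOW`, and the `encrypt`/`decrypt` callables are hypothetical placeholders for the guard's real label format and cipher.

```python
from dataclasses import dataclass

LOW = 0  # hypothetical numeric label for the lowest sensitivity level


@dataclass
class Cell:
    level: int       # security label carried with the data
    payload: bytes
    encrypted: bool


def outbound(cell: Cell, decrypt):
    """LAN to outside world: only low level data is decrypted and released."""
    if cell.level != LOW:
        return None  # higher level data never leaves through the guard
    return Cell(LOW, decrypt(cell.payload), encrypted=False)


def inbound(payload: bytes, encrypt) -> Cell:
    """Outside world to LAN: arriving data is labeled low and encrypted."""
    return Cell(LOW, encrypt(payload), encrypted=True)


def xor(data: bytes) -> bytes:
    """Toy reversible transform standing in for the LAN cipher."""
    return bytes(b ^ 0x5A for b in data)


print(outbound(Cell(2, b"secret", True), xor), inbound(b"hi", xor).level)
```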

6.2.2 Should Services Be Built In?


Yet another issue in security placement is whether the services should be built into the Network Interface Card (NIC) or whether a "Black Box" type Network Device (ND) should be used. There are several advantages to each.

Network Interface Card


In cryptography in general, the fewer access points to the "Red" (plaintext) side of the network, the better. Building the security services into each and every NIC would eliminate all but the most subtle Red points in the design. In order for an attacker to gain access to Red side data (ignoring cryptanalysis), they would need internal access to the host machine (while it is running) to acquire anything. This may be an acceptable level of physical security in all but the most demanding situations. It also promotes the use of security because it eliminates the need for a user to carry and maintain an external peripheral.

Network Device
External NDs offer the flexibility of replacing security devices as they become outdated or are upgraded. They also allow users to opt for no security (in the NIC) when it is not needed, instead of paying for an option that will never be used. NDs can also come in handy for providing Edge-to-Edge services, as they can be placed inline at the WAN access point. The major downfall of an ND implementation is that it has a much larger Red side than its internal NIC counterpart. Should an opponent have physical access to the Red side (such as the fiber connection between the ND and the workstation), all security regarding data to and from that ES would be compromised.

6.3 Cryptographic Signaling


Cryptographic protocols require communications to establish synchronization between the two endpoints. Parameters such as the type of crypto algorithms, public keys, mode of operation, session keys, and signatures must have a means of propagating to all parties involved. As was pointed out in the previous sections, different planes have different service needs, and therefore a mixed selection may be the best answer. Questions that arise are "What will negotiate protocols and keys?" or "What will authenticate my connections?". The answer is not straightforward, for as usual there are a few choices to make. The secure ATM community is currently debating which plane (or layer) should be used to control the security. The options are to use one of the following: the Application Layer/User Plane, the Control Plane, or a new plane specifically for security. The implications of using each are described below.

6.3.1 Location
Application Layer/User Plane
Adding security to ATM by using the Application Layer as the crypto platform has some benefits. First, it is conceptually easy to implement because the design is very straightforward. All synchronization is transported in user level cells and interpreted at the other end. Second, since it is not included in the signaling standard, it is very easy to change the protocol without disrupting other major standards. Unfortunately, crypto in the UPLANE has major disadvantages. For one, synchronization between two nodes cannot start until a user level channel has been established. This means that the two end stations must wait until CONNECT has been received. The ATM specification allows up to 14 seconds before a CONNECT state is timed out, meaning synchronization could be delayed up to 14 seconds before it is allowed to start. Such high connection latency would more than likely be rejected by the community. The authors of [27] suggest that the UPLANE offers the optimal configuration when the security is implemented as a Network Device (ND) instead of being built into the End System (ES). They argue that Operation, Administration, and Maintenance (OAM) cells (from the CPLANE) generated by the ND might be confused with the OAM cells generated by the ES, and therefore the CPLANE is not a good choice for security control.


CPLANE
Adding security to the CPLANE has several advantages over the UPLANE approach. CPLANE messages are sent over one or multiple cells which contain Information Elements (IEs). Each IE contains some piece of information pertaining to that message or type of message. Some IEs are mandatory per message, while some are optional. If new IE types are added to existing Q.2931 protocol messages, and new messages are defined, the CPLANE can be used to add security. For instance, if there were a security channel designation, it could be used to send a request to a key server at the same time as a SETUP message is sent to an ES. To protect the Q.2931 protocol itself from attack, a signature IE could be added to every message that would allow the ES to verify the authenticity of the message. The downside to such an approach is that it requires additions to the already monolithic Q.2931 protocol.

SecurePlane
Creating a new model, with new layers, may be the best approach, because it allows isolation from the previously defined layers and gives the greatest amount of flexibility (see Figure 6.2). With this approach, the congested CPLANE protocol does not need to be overburdened with yet another task, and the downfalls of the UPLANE implementation are not inherited. It also frees the other layers from having to do the extra work necessary to maintain secure connections. Key management and negotiations can be taken care of in one clean module, instead of dispersing the tasks throughout the protocol stack.

[Figure 6.2: The Modified Security Model for NIC implementations]

Referring to Figure 6.2, we can see the addition of the A, B, and Encryption planes. Above the A/B planes we have the Application, Control, and Management planes. These planes are usually implemented as a user or kernel level process. The interface to the AAL layer is usually implemented at the kernel level. The device driver is most likely the best location for the A/B entities. However, this couples the security services to a certain brand or type of card. A more generic service layer that works with the device driver would be better suited to the problem.

As the diagram points out, communication between the A/B entities and the Encryption plane is necessary. This is simple in the case of a Network Interface Card that is compliant with the advanced security model presented here: the device driver can provide an interface to the encryption hardware. However, if the system designers opt for an external Network Device, such as in Figure 6.3, another control path must be established. It would be unwise to require any entity other than the device driver (which is specific to a certain model of board) to know whether the Encryption layer lies within the NIC or off board in the ND. By defining an API between the device driver and kernel, we can maintain a single version of the A/B entity.

[Figure 6.3: The Modified Security Model for Network Device implementations]

Figure 6.4 shows the relationship in terms of a simple Operating System running an ATM network. Each process, whether it be the control plane process or a user application, must access the device through the kernel and, ultimately, through the device driver. Figure 6.5 shows the addition of the A/B entity to the model, and its interface to the device driver. If the device driver provides a communication interface to the Encryption plane, the A/B unit does not have to worry whether the encryption plane is local or remote.

The A/B planes provide all security services except for privacy. The A plane's major function is to provide integrity to the user plane. The B plane's major function is to authenticate messages sent by the control plane, and to perform access control operations. By authenticating the control messages, we provide authentication to the user plane and protect the control plane from a malicious attacker. The downside to this approach is that, once again, the community must agree on the specification, which can take some time.
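The driver-level API argued for above can be illustrated with a small interface sketch. All names here (`EncryptionPlane`, `NicEncryptionPlane`, `RemoteNdEncryptionPlane`, `ABEntity`, the `LOAD_KEY` message) are hypothetical; the only point taken from the text is that the A/B entity talks to one interface and does not know whether the encryption hardware is on the NIC or in an external ND.

```python
from abc import ABC, abstractmethod


class EncryptionPlane(ABC):
    """What the device driver exposes to the A/B entity."""

    @abstractmethod
    def load_key(self, vc: tuple, key: bytes) -> None:
        """Install a session key for one virtual channel."""


class NicEncryptionPlane(EncryptionPlane):
    """Encryption hardware on the card: the driver writes key registers."""

    def __init__(self):
        self.registers = {}

    def load_key(self, vc, key):
        self.registers[vc] = key


class RemoteNdEncryptionPlane(EncryptionPlane):
    """External Network Device: the driver forwards a control message."""

    def __init__(self, send):
        self.send = send  # callable that delivers a message to the ND

    def load_key(self, vc, key):
        self.send(("LOAD_KEY", vc, key))


class ABEntity:
    """Single version of the A/B logic, unaware of where encryption lives."""

    def __init__(self, plane: EncryptionPlane):
        self.plane = plane

    def secure_call(self, vc, session_key: bytes) -> None:
        self.plane.load_key(vc, session_key)


nic = NicEncryptionPlane()
ABEntity(nic).secure_call((0, 42), b"k")
sent = []
ABEntity(RemoteNdEncryptionPlane(sent.append)).secure_call((0, 42), b"k")
print(nic.registers, sent)
```

Swapping a NIC for an ND then changes only the object handed to `ABEntity`, which mirrors the goal of maintaining a single version of the A/B entity.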

[Figure 6.4: The operating system model of an ATM host]

[Figure 6.5: The operating system model with the A/B plane]

6.3.2 Secure Call Establishment

The ATM Forum is discussing four different techniques for establishing a secure call.

Method 1: The Q.2931 signaling protocol is modified to include authentication and labeling elements (see above description). This appears to be the best approach for the long term, but it may take too long to pass in the standards bodies. For the interim, one of the other methods must be used.

Method 2: The OAM cells can be used to carry the security information. Unfortunately, this method would require changes to the AAL standard to include a new type of multicell OAM transmission. Since most manufacturers have already built circuits supporting the other classes, it is unlikely that they would want to change this specification. In addition, the OAM cells would be restricted by the user's network parameters such as QOS.

Method 3: The third choice is to use standard call procedures and have the security devices at each end hold off finishing the connection to the user until negotiations take place. After a session has been established, the connections are finalized with the user and the security device decouples itself from the connection (it still provides services).

Method 4: The fourth option is to use an auxiliary channel to negotiate security information, then close that connection and allow the original to start. This removes the QOS limitations and makes a cleaner interface. This option seems to be preferred in the community for the short term.

6.4 Key Management and Distribution


One of the final topics of discussion with regard to ATM security is key management and distribution. Two protocols have been accepted by the community for use with ATM: RSA key distribution [3] and Diffie-Hellman key establishment [11]. Through the use of techniques described above, such as the security channel allocation and additional IEs in control messages, these protocols can be implemented. One problem to solve is certificate management. The caching of certificates can have a large memory requirement. It may be advisable to use an external memory device to store these certificates. Since call setup is a rather rare occurrence (as compared to switching to a new VC on an inbound cell), external storage may work just fine. The idea would be to cache as many of the certificates as possible to eliminate the certificate negotiation overhead. The ATM Forum has selected the ISO/IEC 9594-8 authentication and key exchange protocol to base its operation on.
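The certificate caching idea can be sketched as a bounded cache in front of a slow lookup. The least-recently-used eviction policy, the class name `CertificateCache`, and the `fetch` callable are assumptions made for the example; the text only says that as many certificates as possible should be cached.

```python
from collections import OrderedDict


class CertificateCache:
    """Bounded cache that avoids repeating certificate negotiation."""

    def __init__(self, capacity: int, fetch):
        self.capacity = capacity
        self.fetch = fetch            # slow path: negotiation or external store
        self.entries = OrderedDict()  # host -> certificate, oldest first

    def get(self, host: str):
        if host in self.entries:
            self.entries.move_to_end(host)  # mark as recently used
            return self.entries[host]
        cert = self.fetch(host)
        self.entries[host] = cert
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return cert


fetched = []


def slow_fetch(host: str) -> str:
    """Hypothetical negotiation that records how often it is invoked."""
    fetched.append(host)
    return "cert-" + host


cache = CertificateCache(2, slow_fetch)
for host in ["A", "B", "A", "C", "B"]:
    cache.get(host)
print(fetched)
```

Since call setup is comparatively rare, even a slow external store behind `fetch` should be tolerable, with the cache absorbing repeated calls to the same hosts.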

Part III Achieving Algorithm Agility


Chapter 7 Introduction
For a basic introduction to agility, see Section 5.10. Designing a hardware architecture that can support agility must be done carefully. As was pointed out in earlier sections, the bulk data encryption takes place between the ATM and Physical layers (on the data bus). Our design must therefore incorporate several key components. The first is an interface to the Utopia bus so that the device can be inserted between the SAR and PHY devices. The second is a high speed memory architecture that can store and provide keys at the same rate that cells arrive. The third is a bank of encryption hardware that can support the given throughput of the network. All of these components must be arranged for maximum efficiency in the smallest space possible.

In this section, we take a deeper look into the issues of implementing agile crypto hardware and some possible design solutions. We propose the use of reconfigurable (RC) hardware for the purpose of algorithm agility. Reconfigurable hardware is not commonly used for cryptographic applications and therefore needs further study. In Section 8 we introduce the various forms of RC hardware available and their theoretical differences. In Section 9 we analyze cryptographic algorithms for their component decomposition in order to derive general conclusions. The data gathered is used to analyze the efficiency of crypto algorithms in reconfigurable hardware. This analysis forms the formal laboratory portion of this research.

Algorithm agility imposes a new problem: the ability to switch algorithms. This can be done either through the use of reconfigurable hardware or through shadow ASICs which occupy the same cell block location but implement a different algorithm. Both have advantages and disadvantages, which are pointed out below. For a better understanding of the benefits of each technology, see Section 5.1.
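To give a feel for the requirement that the key memory keep pace with arriving cells, the following sketch computes the time available per cell. The two line rates are assumptions chosen for illustration (they are not taken from this chapter), and the raw line rate is used without subtracting any physical layer framing overhead.

```python
CELL_BITS = 53 * 8  # one ATM cell is 53 octets


def cell_budget(line_rate_bps: float):
    """Return (cells per second, nanoseconds available per cell)."""
    cells_per_second = line_rate_bps / CELL_BITS
    return cells_per_second, 1e9 / cells_per_second


# Assumed example rates; substitute the rate of the actual link.
for name, rate in [("155.52 Mb/s", 155.52e6), ("622.08 Mb/s", 622.08e6)]:
    cps, ns = cell_budget(rate)
    print(f"{name}: {cps:,.0f} cells/s, {ns:,.0f} ns per key lookup")
```

In the worst case every cell belongs to a different VC, so the key store must complete one lookup within the per-cell time computed here.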

7.1 Using ASICs


Application Specific Integrated Circuits (ASICs) can be used to implement algorithm agility at the expense of static configuration and redundant hardware. The model is a simple one which we have called "shadow" blocks. Each block encryptor from the cipher array is designed to encrypt or decrypt one block of data. This block typically only operates with one algorithm. This solution involves either designing more than one cipher core into the ASIC, or using multiple ASICs as a "shadow" algorithm (see Figure 7.1). However, it is still very hard to modify algorithms or add a new algorithm once the system is installed. With this setup, each block could select which algorithm to run by selecting one of its many chips. This would allow for a high speed implementation with a flexible algorithm scheme, but would unfortunately leave a significant portion of the crypto hardware idle for most of the operation of the device. This generally is not a good situation to be in, especially if space and/or power consumption is an issue.

7.2 Using Reconfigurable Hardware


An alternate solution is to use reconfigurable hardware (thereby eliminating the need for the shadow blocks) and to simply reconfigure the device when a new algorithm is needed. Unfortunately, to our knowledge, there is no systematic treatment of cryptography applications in RC hardware available in the literature. That means that both their behavior and performance are unknown for this type of application. If we were to decide to implement this design, we would first need to test for performance benchmarks in various cryptographic applications and then implement the desired algorithms.

For this task we first investigated reconfigurables in general. We classed them based on architectural differences and then ran tests based on cryptographic primitives. Section 8 is a comprehensive report on the state of the art in reconfigurable technology. Based on these findings, we proceeded to the work in Section 9.4 where we ran experiments to determine whether reconfigurables were suitable for ATM use.

Figure 7.1: The Cipher Array with ASICs (figure labels: Block 0, Block 1, Block 2, Block 3, Redundant Hardware)

Chapter 8 Introduction to Reconfigurable Hardware


The term Reconfigurable Hardware (RC Hardware) refers to a device which is manufactured in such a way as to allow the end user (usually the application developer) to "program" the device so that the chip behaves exactly the same (when placed in-circuit) as if it had been custom designed. There are many types of devices that fit this description. Types of Read Only Memory (ROM), Programmable Array Logic (PAL), and Field Programmable Gate Arrays (FPGA) are common, just to name a few.

There are various benefits to using reconfigurable hardware for many roles in today's development/product cycle. For instance, reconfigurable hardware can be used to develop fast prototypes. This can be attributed to two major factors. First, fabrication of a design often takes weeks, while reconfigurable hardware can be programmed "on the fly". Second, fabrication facilities will often have prohibitively high costs for small volumes. This is often unacceptable for a product in a development stage, since there are bound to be mistakes that have to be fixed. A second advantage to using reconfigurable hardware occurs when the RC devices are actually used in the final product. This allows the reconfigurable nature to be exploited for applications such as: hardware updates, firmware updates, dynamic functionality (the device acts as a video accelerator at one moment, and an ATM interface in the next, etc.), and for other areas where reconfiguration would be beneficial.

However, reconfigurable hardware is almost always significantly more expensive than custom silicon on a per-device basis¹. A typical ASIC may cost on the order of $3.00 to $100.00 per chip (in large quantities), while an FPGA may cost $5.00 to $1200.00 per chip [30]. These costs are usually disregarded during the prototype stage of a product, but a vendor will definitely hesitate to absorb such an expense for mass production. In fact, there are studies that show most companies will impose strong resistance to designs with RC hardware if the chip's cost-per-unit is greater than $100.00 [30, p. 2].

Another problem with RC hardware is that it typically has smaller gate counts and slower performance than custom hardware. In a custom design, the gate layout is tailored and optimized to the application's best interest. On the other hand, a reprogrammable device tries to utilize generic structures to implement the same functionality. Often the interconnection matrices and the placement of logic modules have the greatest effect on the critical path delays. However, recent developments in the reconfigurable industry have pushed RC devices into the domain of small to medium gate arrays with the introduction of 100k (or more) gates and over 200MHz clock speeds. The reconfigurables have become so advanced that the design methodologies that are typically used for ASIC development are now being applied to applications targeted at RC hardware.

¹ Independent studies have shown that this cost is significantly reduced when costs of reengineering and time-to-market are considered.


8.1 Simple Reconfigurable Hardware


Reconfigurable hardware is not a new topic. Research has been conducted on the subject for years. Before advanced reconfigurable hardware was developed (such as FPGAs and CPLDs), there were simpler devices such as Programmable Read Only Memories (PROM) that are used in computer BIOS, and Programmable Logic Arrays (PLA) which were used to implement small logic functions (typically less than 1000 equivalent gates in a masked gate array). The original PLAs were planes of programmable AND arrays, and programmable OR arrays. Through the combination of the two gates, any Boolean equation could be realized. Unfortunately, the devices were very expensive to manufacture and offered poor speed performance (mostly due to the two levels of programmable logic). The solution was to remove one level of programmability by "fixing" the OR array [9]. This means that all outputs of the AND array are fed into the OR gate no matter what. This design is referred to as Programmable Array Logic and is the foundation for modern day CPLDs.
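The AND-plane/OR-plane structure described above can be modeled in a few lines of software. The following Python sketch is only an illustration and is not part of the experiments in this work; the names AND_PLANE, OR_PLANE, and evaluate_pla are hypothetical, and the programmed function (a half adder) is chosen purely as an example.

```python
from itertools import product

# Hypothetical PLA model. Each product term maps an input index to the
# value it requires (True = plain literal, False = inverted literal).
# Inputs that do not appear in a term are "don't care".
AND_PLANE = [
    {0: True, 1: False},   # a AND (NOT b)
    {0: False, 1: True},   # (NOT a) AND b
    {0: True, 1: True},    # a AND b
]

# OR plane: for each output, the product terms that feed its OR gate.
OR_PLANE = {
    "sum": [0, 1],    # a XOR b, written as a sum of two products
    "carry": [2],     # a AND b
}


def evaluate_pla(inputs):
    """Evaluate every output as a sum of products over the inputs."""
    terms = [all(inputs[i] == want for i, want in term.items())
             for term in AND_PLANE]
    return {name: any(terms[t] for t in feeds)
            for name, feeds in OR_PLANE.items()}


for a, b in product([False, True], repeat=2):
    print(a, b, evaluate_pla((a, b)))
```

In this model, a PLA corresponds to editing both AND_PLANE and OR_PLANE, whereas a PAL would fix the OR_PLANE wiring and leave only AND_PLANE programmable.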

8.2 Device Technology


8.2.1 Interconnection Technology

The main task of interconnects is the efficient routing of signals between core logic blocks. In the original reconfigurables, these paths were mask programmable, meaning the logic blocks themselves were programmable, but the interconnects were programmed during the masking process at a fabrication facility. This technology was called MPGA (Mask Programmable Gate Array). In the mid eighties, several vendors introduced the concept of the programmable interconnect which gave birth to the FPGA. The programmable interconnect allowed the paths to be programmed (at run-time) in the same manner as the logic blocks. The following is a list of current interconnection technology:

- Static Random Access Memory (SRAM)
- Anti-Fuse
- EPROM and EEPROM
- Flash
- Hybrid (EEPROM and SRAM)

8.2.2 Logic Technology


Fundamentally, the reconfigurable hardware must have resources which allow general logic gates to be implemented. The question that arises is: How is this done? It is well known that there are two basic types of circuits: combinatorial and sequential. Both occur very frequently in most designs. Naturally, the reconfigurable hardware should have provisions for both kinds and be as efficient as possible.

Combinatorial Logic
Product-Term Architecture: Product-term architecture is generally derived directly from PAL technology. It is based on the concept of programmable AND arrays being fed into OR arrays. It computes the sum-of-products based on the inputs and can be configured for any Boolean realization.

Lookup Table Architecture (LUT): It is also well known that any logic combination can be represented with a simple table lookup where m inputs map into n outputs. There have been many studies conducted into the optimal size for m and n for the best utilization of hardware. It has been shown that this number is generally (m, n) = (4, 1) [8]. This configuration appears in many commercially available devices (the Xilinx XC4000E and Altera FLEX series, to name a couple). [8] also shows that 40 to 60% of all propagation delays can be attributed to routing resources and that cascading several LUTs together can yield higher performance. This has been done in the Xilinx XC4000E where two 4-input LUTs and a 3-input LUT are cascaded together.
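To make the (m, n) = (4, 1) lookup table concrete, the following Python sketch stores a 4-input, 1-output function as a 16-bit truth table and evaluates it by addressing. It is a software illustration only and does not model any vendor's LUT; the helper names make_lut4 and lut4_read are hypothetical.

```python
def make_lut4(func):
    """Precompute the 16-entry truth table of a 4-input, 1-output function.

    The table is packed into an integer: bit number `addr` holds the
    output for the input combination whose binary encoding is `addr`.
    """
    table = 0
    for addr in range(16):
        bits = [(addr >> i) & 1 for i in range(4)]
        if func(*bits):
            table |= 1 << addr
    return table


def lut4_read(table, a, b, c, d):
    """Look up one output bit; the four inputs form the table address."""
    addr = a | (b << 1) | (c << 2) | (d << 3)
    return (table >> addr) & 1


# Two example "configurations" of the same 16-bit storage element.
parity = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
majority = make_lut4(lambda a, b, c, d: (a + b + c + d) >= 3)

print(hex(parity), lut4_read(parity, 1, 0, 0, 0))
print(hex(majority), lut4_read(majority, 1, 1, 1, 0))
```

The point of the sketch is that changing the function only changes the 16 stored bits, while the lookup logic stays the same.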

Sequential Logic
To add the provisions for sequential logic to RC hardware, most vendors place a configurable flip-flop at the output of the combinatorial logic mentioned above. This flip-flop can be configured in various ways, such as bypassed, D, JK, SR, etc.

Putting it All Together


The combination of a small amount of combinatorial and sequential logic constitutes what is usually called a logic element. There are usually many of these logic elements in a single design. In addition to the basics mentioned above, vendors may implement special features such as special carry or feedback logic which can be used to implement larger designs more efficiently. There are also other devices such as tri-state buffers, inverters, XORs, etc.

8.2.3 Segment Technology


Segments are the wires that run between logic elements. There are various methodologies behind their optimal length, and different vendors have taken different approaches. The following list is a compilation of the various types:

- Single Width Segment
- Full Width Segment
- Multi Width Segment

8.2.4 Internal Architectures


The methods of organizing the core logic functions and the routing of signals between the logic blocks vary from chip to chip. However, each architecture can be classed into one of the categories discussed in this chapter. Figure 8.1 displays a brief overview of the various architectures from a high level.

Figure 8.1: Classes of Reconfigurable Hardware (panels: Channelless array, Symmetrical Array, Channeled array, Sea-of-Gates, CPLD; figure labels: Logic Blocks, Interconnects, I/O Cells, PLD, PIA)

Channeled Array
Channeled arrays, or row-based architectures, offer a linear array of logic blocks interconnected by busses running parallel and perpendicular over the logic blocks. One advantage to this type of approach is that only two layer CMOS processes are needed because the interconnect channel lies in the same plane as the logic blocks (see Figure 8.1). Another advantage is that speed critical connections can be established through a maximum of two programmable nodes. Since each node introduces a resistive path, fewer nodes translate directly into higher clock speeds.

Figure 8.2: Channeled array (side view) (figure labels: Horizontal Metal, Vertical Metal, Wafer, Logic Blocks)

Channelless Array (Sea-of-Gates)


Channelless array technology (also commonly referred to as Sea-of-Gates) employs a layout of logic blocks covering the entire face of the silicon. This differs from the architecture of channeled arrays, which have distinct interconnection channels. There are two derivations from this technology: the symmetric array and the standard sea-of-gates.

Symmetrical Array: The symmetrical array variation of this technology uses a two layer metal CMOS process to deploy the horizontal and vertical interconnection lines. The logic blocks sit in-between the interconnects. The advantage to this architecture is that it is cheaper to produce two layer processes when compared to three layers. It is also necessary to use this architecture with SRAM based FPGAs because the SRAM circuitry in the interconnect matrix is implemented in the same level as the logic blocks. Therefore it is impossible to have an interconnect directly above a logic block.

Sea-of-Gates: Standard Sea-of-Gates technology is used with anti-fuse FPGAs when space is a critical factor. If a three layer metal CMOS process is used, the interconnection matrix can actually be deployed over the logic blocks, thus saving a substantial amount of silicon real-estate. It should be noted, however, that the three layer process takes an extra step over two layer, and therefore can increase cost.

Hierarchical-PLD: This design is found in CPLD devices. In this design, two or more PAL blocks are connected in a hierarchical fashion so as to allow integration into a complex device.

Cell-Based Array
Cell Based Array (CBA) is a technique where functionality is modularized and placed into blocks. This technology is most commonly found in ASIC design, where cells are designed in-house (or bought from a third party) and integrated together to form the complete design. The methodology is based on the concept of reusable components. If a certain function is generic enough (such as the PCI interface on a chip) the module can be designed into a "cell" and stored in a data base. When the current design calls for a PCI interface, the cell is incorporated into the design. CBA architectures have been recently introduced to the RC world where certain functionality has proven to be useful to RC designers. The cells are often called cores or megacells, and they work in the same way as cells in an ASIC do. Note that the core cells are NOT reprogrammable. They are merely a special type of logic cell that can be interfaced to and used to perform a certain function (once again, the PCI interface is a good example). One important thing to mention about CBA with RC hardware is that it is independent of the RC architectures noted above. This means that a device may have both a CBA and Symmetric Array architecture.

8.2.5 Field Programmable Gate Arrays (FPGA)


Field Programmable Gate Array (FPGA) technology is usually based on SRAM or anti-fuse and is most noted for its flexibility and large number of flip-flops [32]. One of the biggest architectural differences between FPGAs and CPLDs is that FPGAs have an array of many small logic blocks with vast interconnection networks, while CPLDs have a few large logic blocks (based on PALs), with smaller interconnection networks. Generally speaking, FPGAs are noted for their performance in datapath oriented applications, while CPLDs are noted for their superior overall performance and predictability with state-machine operation. Proponents of FPGAs say that FPGAs also outperform CPLDs in larger designs when the gate count exceeds 2K gates. This restriction may play a key factor in the overall performance of certain applications. The SRAM and anti-fuse technologies are used for the interconnection resources. Each one has a different advantage as will be pointed out below.

FPGA Components
As briefly mentioned before, FPGAs are composed of arrays of small logic blocks. There are three main components in an FPGA. Each vendor may call the components slightly different names, but they are essentially the same. The components are described below (refer to Figure 8.3):

Configurable Logic Blocks: The Configurable Logic Blocks (CLBs) are the core logic element in an FPGA. The CLBs are usually small (4 to 32 inputs) and plentiful. They usually contain some or all of the following components: RAM (8 to 64 bits) for lookup tables (LUT), flip-flops, latches, tri-state gates, standard logic gates, etc. In the cases where lookup tables are used, the RAM that implements the LUT can often be configured as standard RAM for use in registers, etc.

Interconnects: The interconnects form the connections between the CLBs, I/O blocks, and the other interconnects themselves. By programming the interconnect matrix in a particular fashion, any combination of connections can be established (see Figure 8.4). One of the criticisms of FPGAs is that the interconnection matrix causes variable and unpredictable delays because of the multiple paths available. This design allows the FPGA to be the most flexible, but it makes it difficult to analyze because the delays may be unknown until late stages in the design and synthesis of the circuit.

Input/Output Blocks: The I/O blocks in an FPGA are very similar to the I/O pads in an ASIC. They act as buffers to the outside world and provide an interface for the interconnection network and CLBs to communicate with other devices. The design of the I/O module is identical to any other design which allows the selection of read-in or write-out through the control OE, or output enable.

Figure 8.3: The SRAM FPGA (figure labels: Interconnection Matrix, Configurable Logic Block, I/O Blocks)

Figure 8.4: Programmed Interconnects (figure labels: CLBs, Programmed Interconnects)

8.2.6 Complex Programmable Logic Devices (CPLD)


Complex Programmable Logic Devices differ from the FPGAs mentioned above in the way that logic and routing resources are organized. CPLDs are usually a hierarchical interconnection of PAL units, so understanding how a PAL is organized is imperative to understanding the CPLD. Proponents of CPLDs claim that they offer superior performance over FPGAs for designs that are not oriented at large datapaths and/or register driven. This is due to the CPLD's inherently weaker interconnection network. However, CPLDs reportedly yield very high performance for logic designs under 2k gates. Recently CPLD products have been released which do target datapath type applications utilizing a matrix routing array, as opposed to the traditional hierarchy.

CPLD technology is often based on Programmable Read Only Memory (PROM), Erasable Programmable Read Only Memory (EPROM), or Electrically Erasable Programmable Read Only Memory (EEPROM) technology, and as such exhibits properties similar to its ROM counterparts. With the introduction of EEPROM to the EPLD devices, EPLDs which typically are bound to out-of-circuit architectures can now exhibit properties of an in-circuit chip. SRAM has also been introduced into the CPLD technology, allowing the reconfigurable flexibility that has been exploited in FPGA architectures.

CPLD Components
Just as in an FPGA, CPLDs have several core components which make up the overall architecture. The PAL units implement the core logic, while the Programmable Interconnection Array connects the units together and to the I/O blocks.

Programmable Array Logic Blocks: Programmable Array Logic (PAL) has been around for quite a while. It was used long before the complex reconfigurable devices were thought of. The general architecture of a PAL is described in Figure 8.5. The PAL's core component is called a MacroCell. The MacroCell is very similar to a CLB in an FPGA. The PAL block itself has many MacroCells (8 to 64) aligned in an array. Each PAL has a small configurable internal interconnection matrix which allows signals arriving at the PAL to be routed in any order to any of the macrocells. After the data has flowed through the macrocell, it can exit the device at the other side. Inside a CPLD, the signal will be connected to the PIA and possibly routed to another PAL or outside to an I/O block.

Programmable Interconnection Array: The PIA is a global bus that interconnects all PALs and I/O Blocks. This entity allows all blocks and modules to be available to other blocks throughout the entire device. The exact layout of a CPLD is available in Figure 8.6. In some of the larger devices, the PIA is actually a matrix of rows and columns and the PALs are arranged in an array fashion. This allows a higher level of flexibility and brings the CPLD into the application domain of an FPGA.

Figure 8.5: The Programmable Array Logic Architecture (figure labels: I/O, Programmable Interconnect Array, MacroCell)

Input/Output Blocks: I/O Blocks are common to both CPLDs and FPGAs. See Section 8.2.5 for more information about I/O blocks.

Figure 8.6: The Complex Programmable Logic Device Architecture (figure labels: MacroCells, Programmable Interconnect Matrix, PIA, I/O Blocks, PAL Blocks)

Chapter 9 Decomposition of Cryptographic Algorithms


9.1 Introduction

The next step in our investigation is to determine which device architecture works best with which type of cryptographic algorithms. We need both speed and efficiency for a wide variety of algorithm types. Some algorithms will use a very wide data path, while others will need many registers and flip-flops. It is unclear which device architecture will perform adequately in an ATM type environment, so we must design a methodology that will allow us to predict which one will. Assessing the performance of all cryptographic algorithms in all RC hardware is an extremely complex task. To overcome this problem, we decided to analyze the algorithms for their components (such as XOR, ADD, SHIFT, etc.) and then run extensive tests on those components based on certain classes of hardware. Through this research:

1. We can derive general statements about cryptography on RC hardware.
2. Our findings can be extended to future algorithms, as they are proposed.
3. We could gain some insight into which architecture might work acceptably well in an ATM environment.

The first step is to study the types of algorithms that already exist. There are many different branches of cryptographic algorithms, such as private-key ciphers, public-key ciphers, and hash functions. For the purpose of ATM, we are only concerned with symmetric block ciphers, since they seem more promising for use in the encryption unit. We start by explaining how symmetric encryption is accomplished.

9.2 Symmetric Block Cipher Algorithms


Symmetric block cipher algorithms use private-key technology to provide security services. There are two basic classes of operations which can be applied to data to achieve an acceptable level of obscurity: confusion and diffusion.

9.2.1 Block Cipher Architecture


Block ciphers are ciphers that operate on large blocks of data at a time (as opposed to single bits). The design of such ciphers has been studied for many years and several dominant design theories have emerged.

Feistel Networks
Many block ciphers use a Feistel network architecture, named after the inventor, Horst Feistel, who made major contributions to the field in the 1960's and 70's while working on ciphers for IBM Research. The Feistel network is simple in concept. One of its most attractive features is that it allows a particular algorithm to be used for both encryption and decryption due to its inversion properties. The basic architecture is as follows: The datapath is split into two halves, the right and left sides. The right half is operated on by a function f which incorporates elements of confusion, diffusion and key material and produces an output of the same size. The output from the f-function is XORed with the left half, and the result is stored into the right half. A copy of the original right half is swapped into the left half, thus completing the cycle (see Figure 9.1). It should be noted that only the left part L[i-1] is encrypted in one round, whereas R[i-1] is passed through in the clear.
Figure 9.1: The Feistel Network (figure labels: L[i-1], R[i-1], Ki, L[i], R[i])

This cycle is usually repeated many times and is referred to as a round.
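The round structure described above can be sketched in software as follows. This is a minimal Python illustration, not a real cipher: the round function toy_f, the subkey values, and the final half swap (a common convention that lets the same routine decrypt with reversed subkeys) are assumptions made for the example.

```python
MASK32 = 0xFFFFFFFF


def toy_f(right, subkey):
    """Hypothetical round function (key mixing plus a rotate).

    It is NOT the f-function of any real cipher and offers no security.
    """
    x = (right ^ subkey) & MASK32
    return ((x << 3) | (x >> 29)) & MASK32  # rotate left by 3 bits


def feistel_round(left, right, subkey):
    """One round: L[i] = R[i-1], R[i] = L[i-1] XOR f(R[i-1], K[i])."""
    return right, left ^ toy_f(right, subkey)


def feistel(left, right, subkeys):
    """Apply all rounds, then swap the halves once at the end."""
    for k in subkeys:
        left, right = feistel_round(left, right, k)
    return right, left


# Hypothetical subkeys and plaintext halves, for demonstration only.
SUBKEYS = [0x0F1E2D3C, 0x4B5A6978, 0x8796A5B4, 0xC3D2E1F0]
plain = (0x01234567, 0x89ABCDEF)

cipher = feistel(plain[0], plain[1], SUBKEYS)
# Decryption reuses the same routine with the subkeys in reverse order.
recovered = feistel(cipher[0], cipher[1], SUBKEYS[::-1])
print(recovered == plain)
```

Note that toy_f is never inverted: the inversion property comes from the XOR structure alone, which is why the same hardware can serve for encryption and decryption.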

Substitution-Permutation Networks
Substitution-Permutation networks, or SP networks, are elements that add hybrid confusion/diffusion to the input data. The f-functions in Feistel networks are often based on them. Substitution elements take an m-bit input and provide an n-bit output (see Figure 9.2). The elements can be implemented as look-up tables or as combinatorial logic, but both are rather expensive, so to minimize the cost, m and n are often kept small.

Figure 9.2: A 3x4 Substitution Box

Permutation elements remap input bits into output bits and add the diffusion property to the data. The construct is quite simple and runs extremely well in hardware, but runs very poorly in software (Figure 9.3 shows the relationship between input and output). Ciphers that use these elements in combination are often called product ciphers.

Figure 9.3: The Permutation Box
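As a rough software model of the two element types, the Python sketch below passes a 6-bit value through two 3x4 substitution boxes and then through an 8-bit permutation. The table contents SBOX_3X4 and PBOX are hypothetical values invented for the example (they are not taken from Figure 9.2 or Figure 9.3), and the function names are likewise illustrative.

```python
# Hypothetical 3x4 S-box: a 3-bit input selects one of 8 four-bit outputs.
SBOX_3X4 = [0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8]

# Hypothetical 8-bit P-box: output bit `dst` is wired to input bit PBOX[dst].
PBOX = [3, 0, 6, 1, 7, 4, 2, 5]


def substitute(x):
    """Run a 6-bit input through two 3x4 S-boxes, giving 8 output bits."""
    hi, lo = (x >> 3) & 0x7, x & 0x7
    return (SBOX_3X4[hi] << 4) | SBOX_3X4[lo]


def permute(x, pbox):
    """Move bits one at a time; this loop is why P-boxes are slow in software."""
    out = 0
    for dst, src in enumerate(pbox):
        out |= ((x >> src) & 1) << dst
    return out


def invert(pbox):
    """Build the wiring that undoes `pbox`."""
    inv = [0] * len(pbox)
    for dst, src in enumerate(pbox):
        inv[src] = dst
    return inv


def sp_layer(x):
    """Substitution followed by permutation (one small product-cipher layer)."""
    return permute(substitute(x), PBOX)


print(hex(substitute(0b001010)), bin(sp_layer(0b001010)))
```

In hardware the permute step is pure wiring, whereas the software model needs one shift and mask per bit.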


Iteration Ciphers
Iteration ciphers are simply ciphers that incorporate loops to achieve the desired level of confusion/diffusion. Each iteration of the loop further scrambles the data. Almost all block ciphers in use are based on multiple iterations or "rounds".

9.3 Methodology
By studying existing algorithms (see Table 9.1), we have determined that there is a small finite group of components that is common to all algorithms. Some of these components may have been discussed above regarding design theory, etc., but this section serves to include a detailed description of each component and its analysis.

Algorithm   Parameters                     Components
DES         64bit I/O, 56bit key           P-BOX, XOR, SBOX, ROT
MADRYGA     var I/O, var key               XOR, SFT, ROT
NewDES      64bit I/O, 120bit key          XOR, N/A f box
FEAL        64bit I/O, 64bit key           XOR, ROT
REDOC II    80bit I/O, 160bit key          P-BOX, SBOX, XOR
REDOC III   var key up to 20kbits          XOR, STOR
LOKI        64bit I/O, 64bit key           XOR, SBOX, P-BOX
KHUFU       64bit I/O, 512bit key          XOR, Dyn-SBOX
KHAFRE      64bit I/O, 64-128bit key       XOR, SBOX
IDEA        64bit I/O, 128bit key          XOR, ADDER, MULT
MMB         128bit I/O, 128bit key         XOR, MULT, STOR
GOST        64bit I/O, 256bit key          SBOX, ROT, ADDER
CAST        64bit I/O, 64bit key           XOR, SBOX
BlowFish    64bit I/O, 0-448bit key        XOR, SBOX, ADDER
SAFER       64bit I/O, 64bit key           XOR, ADDER, ROT, GFMULT
3-Way       96bit I/O, 96bit key           XOR, P-BOX, ROT
CRAB        1024bytes I/O, 128bit key      P-BOX, XOR, AND, OR, NOT

Table 9.1: The available algorithms and their component breakdown

9.3.1 Component Breakdown


Table 9.2 summarizes what we found in the analysis. Each component must be mapped into hardware, but we need a method that will yield accurate results across all platforms.

Component         Type
XOR/AND/OR/NOT    Boolean Logic
SBOX              Substitution Box
SFT/ROT           Shift/Rotate Element
STOR              Storage Element
ADDER             Modulo Addition
MULT              Modulo Multiplication
PBOX              Permutation Box
GFMULT            GF(2^n) MULT¹

Table 9.2: Component Description

9.3.2 Implementation
The next step is to build behavioral models using a hardware description language (HDL). Using these models we can map the given components into various architectures and record the results. HDLs were chosen for their portability across platforms. The advantage is that the differences in entry style can be factored out of the overall equations, because the resulting design is derived from the same source code. After the models were built, they were synthesized and mapped using the tools appropriate for the device. Various optimizations were selected over multiple runs to average the results. The most interesting results are the % resources consumed and the critical path delay.

Device Selection
In order to determine the final results of our experiments, two stages of processing were needed. The first is the synthesis stage, which compiles the HDL into device specific logic maps or netlists. The second is a place and route stage where the netlists are mapped into actual hardware entities. Because of the availability of these tools, we were only able to work with two vendors, namely Xilinx Corporation and Altera Corporation. Fortunately, these two vendors are the market leaders and also provide some of the largest devices for us to work with. Unless otherwise noted, all implementations of the algorithms were performed on speed grade 3.

Devices are selected based on their availability in the market place and their relative size/speed/price/package selection. For example, Xilinx has a multitude of devices in various families. However, only the XC4000 family is large enough to accommodate the large scale design of a cryptographic algorithm (typically 20K-60K gates²), so it is the only one used here. The same holds for Altera, where only the FLEX10K devices can support the needs of our application. Initial research has revealed an average PAD delay for each device tested, which is always subtracted from the critical path delay in cases where clocking was not used to determine component speed. The results of this calculation are in Table 9.3.

Device            Input delay (ns)   Output delay (ns)   Total (ns)
EPF10K70RC240-3   5.6                5.3                 10.9
XC4020EPG223-3    2.5                8.5                 11.0

Table 9.3: PAD delays in experimental hardware
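The pad delay correction amounts to a simple subtraction, sketched below in Python. The device names and pad delays are the ones listed in Table 9.3; the helper names and the 25.0 ns pin-to-pin measurement are hypothetical and serve only to show the arithmetic.

```python
# Input and output pad delays in nanoseconds, as listed in Table 9.3.
PAD_DELAY_NS = {
    "EPF10K70RC240-3": (5.6, 5.3),
    "XC4020EPG223-3": (2.5, 8.5),
}


def total_pad_delay(device):
    """Sum of input and output pad delay, rounded to 0.1 ns."""
    d_in, d_out = PAD_DELAY_NS[device]
    return round(d_in + d_out, 1)


def component_delay(device, measured_ns):
    """Remove the pad contribution from an unclocked pin-to-pin delay."""
    return round(measured_ns - total_pad_delay(device), 1)


for name in PAD_DELAY_NS:
    print(name, total_pad_delay(name))

# Hypothetical measurement, not a result from this work.
print(component_delay("XC4020EPG223-3", 25.0))
```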

² This is based on our own research in this area.

9.4 Component Description


9.4.1 Permutation Boxes
Permutation boxes (see Section 9.2.1) are diffusion elements that are easily implemented in hardware. Because the operation is essentially a remapping of input pins to output pins, the synthesis of such a circuit utilizes very few resources in RC hardware if sufficient routing resources are available. In cases where drivers are not needed, the realization of such a circuit only changes the pin mapping of the components connected to it. The following is a VHDL description of a permutation component found in the Data Encryption Standard (DES) algorithm:

library ieee;
use ieee.std_logic_1164.all;

-- 32-bit permutation box: pure wiring, no logic gates
entity pbox is
  port (
    PI: in  std_logic_vector (32 downto 1);
    PO: out std_logic_vector (32 downto 1)
  );
end pbox;

architecture behave of pbox is
begin
  -- each output bit is driven by exactly one input bit
  PO(16) <= PI(1);   PO(7)  <= PI(2);   PO(20) <= PI(3);   PO(21) <= PI(4);
  PO(29) <= PI(5);   PO(12) <= PI(6);   PO(28) <= PI(7);   PO(17) <= PI(8);
  PO(1)  <= PI(9);   PO(15) <= PI(10);  PO(23) <= PI(11);  PO(26) <= PI(12);
  PO(5)  <= PI(13);  PO(18) <= PI(14);  PO(31) <= PI(15);  PO(10) <= PI(16);
  PO(2)  <= PI(17);  PO(8)  <= PI(18);  PO(24) <= PI(19);  PO(14) <= PI(20);
  PO(32) <= PI(21);  PO(27) <= PI(22);  PO(3)  <= PI(23);  PO(9)  <= PI(24);
  PO(19) <= PI(25);  PO(13) <= PI(26);  PO(30) <= PI(27);  PO(6)  <= PI(28);
  PO(22) <= PI(29);  PO(11) <= PI(30);  PO(4)  <= PI(31);  PO(25) <= PI(32);
end behave;

Analysis
Since Permutation boxes essentially require zero hardware elements, they will not be analyzed here. These components can be added to a hardware architecture for a negligible penalty in both time and space measurements.

9.4.2 Logical Functions - XOR, AND, OR, NOT


Logical functions are composed of basic gates and are easily integrated into designs if they are applied bitwise. The efficiency of such components is highly dependent on the underlying hardware, but in most cases will have a critical path of one gate. Since even the operation on multiple bits (such as 32) will not hamper performance, these components work well in a variety of devices.

Analysis
A VHDL model of a 32bit XOR component was built and then processed through synthesis tools to allow for a comparison. Table 9.4 shows the results obtained from this synthesis.
Device    Compiler           Optimization   LE Utilization   Max Delay
FLEX10K   MaxPlus II 7.0     Speed          32               12.0ns
                             Area           32               12.0ns
XC4000E   WVO 7.2/XA 6.0.1   Area=LOW       16               10.4ns
                             Area=HIGH      16               10.4ns
                             Speed=HIGH     16               10.4ns

Table 9.4: 32 bit XOR box (source: xormod.vhd)

Observe that the results are similar between the two devices. Note that the resource consumptions do not correlate because of the differences in architectures.

9.4.3 Substitution Boxes


For a detailed description of substitution boxes, or sboxes, see Section 9.2.1. Implementing these components can be rather expensive in reconfigurable logic because such devices are not typically tuned to a look-up table type architecture. In fact, our studies have shown that sboxes are often the largest component in a synthesized design, requiring tens or hundreds of logic elements to implement even small tables, such as a 6x4 sbox, which has 64 four-bit values (256 bits in total). Hardware architectures that support RAM and ROM components fared extremely well in these tests, as reported below. It should be noted that in both examples that use a ROM architecture, special design parameters were needed to make use of the special hardware.
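To make the table size concrete, the following Python sketch models one 6x4 substitution box as a lookup table and flattens it into a 64-entry ROM image. It uses the published S1 table of DES as example data; the function names sbox_lookup and as_rom are hypothetical, and the row/column addressing follows the usual DES convention (outer two bits select the row, inner four bits select the column).

```python
# DES S-box S1: 4 rows x 16 columns of 4-bit values.
S1 = [
    [14, 4, 13, 1, 2, 15, 11, 8, 3, 10, 6, 12, 5, 9, 0, 7],
    [0, 15, 7, 4, 14, 2, 13, 1, 10, 6, 12, 11, 9, 5, 3, 8],
    [4, 1, 14, 8, 13, 6, 2, 11, 15, 12, 9, 7, 3, 10, 5, 0],
    [15, 12, 8, 2, 4, 9, 1, 7, 5, 11, 3, 14, 10, 0, 6, 13],
]


def sbox_lookup(x: int) -> int:
    """Map a 6-bit input to a 4-bit output."""
    row = ((x >> 4) & 0b10) | (x & 0b01)   # first and last input bit
    col = (x >> 1) & 0b1111                # middle four input bits
    return S1[row][col]


def as_rom() -> list:
    """Flatten the table into a ROM image addressed by the 6-bit input."""
    return [sbox_lookup(addr) for addr in range(64)]


def rom_bits() -> int:
    """Storage needed for the ROM image: 64 words of 4 bits."""
    return len(as_rom()) * 4
```

A ROM-mapped implementation stores exactly these 64 words, whereas a synthesized implementation has to express the same mapping as four 6-input Boolean functions, which is where the large logic element counts come from.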


Analysis
A VHDL model of a 6x4 SBOX was built from the data provided in the Data Encryption Standard [12]. In addition, ROM tables were built for technologies that support ROM mapping.

Test 1 - Xilinx: The model was synthesized using WorkView Office v7.2 targeting XILINX XC4000E technology. The results are posted in Table 9.5. Note that synthesis results in combinatorial circuits.

Optimization      CLB Utilization   Max Delay
Un-Optimized      89                36.3ns
Collapsing=HIGH   89                36.3ns
Area=HIGH         18                51.1ns
Speed=HIGH        Failed            N/A

Table 9.5: Substitution box implementation in XC4000E technology with synthesized combinatorial logic (source: sbox1.vhd)

Test 2 - Xilinx: XC4000E also supports RAM/ROM through the use of its 32x1/16x2 LUT implementation. This means that a group of CLBs can be configured as a large lookup table and implement the SBOX more efficiently than through the synthesized design. The results are in Table 9.6.

Optimization   CLB Utilization   Max Delay
ROM Mapped     10                15.8ns

Table 9.6: Substitution box in XC4000E technology with ROM (source: sbox1.mem)

Test 3 - Altera: The model was then synthesized using MaxPlusII from Altera and analyzed for efficiency in a FLEX10K device. The results are posted in Table 9.7. Note that this design also results in combinatorial circuits.

Optimization   LE Utilization   Max Delay
Normal, Area   113              49.6ns
Fast, Speed    119              31.0ns

Table 9.7: Substitution box in FLEX10K technology with synthesis (source: sbox1.vhd)

Test 4 - Altera: The FLEX10K devices support RAM/ROM implementations through the use of specialized hardware called Embedded Array Blocks (EAB). The EABs are additional to the other logic elements and therefore can be utilized without sacrificing any of the logic resources. However, as will be pointed out later, the EABs are in a fixed location and therefore may not be optimally placed in the design. In addition, designs that do not need to use EABs waste the real estate. These are minor architectural differences. The results of this test are in Table 9.8.

Optimization   EAB Utilization   Max Delay
ROM Mapped     8                 18.3ns

Table 9.8: Substitution box in FLEX10K technology with ROM (source: sbox1.mif)

Based on the results of Tests 1-4, we can assert that SBOX performance is greatly accelerated by the use of the RAM facilities present in our testing hardware. Since SBOXes are critical components of symmetric ciphers, we can conclude that architectures with RAM will greatly enhance the overall throughput.

9.4.4 Shift/Rotate Registers


Shift and rotate registers are commonly found in cryptographic applications, most often in the key scheduling logic, and there are various architectures which work with different types of ciphers. Combinatorial shifters (the term shifter is used synonymously with rotator from this point on) are the simplest to map into hardware because they are essentially a permutation. As was pointed out earlier, permutations take hardly any hardware resources. The more advanced shifters, such as decisive and sequential shifters, require more logic and will therefore be analyzed below.

Decisive shifters
Decisive shifters are components that take two inputs. The first is the data word to be shifted, and the other is a binary value that allows the shifter to decide whether to shift or not. All processing is done combinatorially, and therefore does not require clocking or registered output. However, the decision circuitry requires a multiplexer, so this unit is more than a simple permutation.

Sequential Shifters
Sequential shifters are components that register the input and shift based on a clock edge. They require the most hardware resources, but offer the advantage of a register and a combinatorial shift in one component. For some designs this may be the ideal element for key scheduling or round iteration processing.
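The three shifter styles described above can be contrasted with a short behavioral model. This is a Python sketch under stated assumptions, not the thesis VHDL: a 32-bit word, a rotate-left by one position, and a hypothetical load input on the sequential variant. The names rotl, decisive_shift and SequentialShifter are illustrative.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def rotl(word: int, n: int = 1) -> int:
    """Combinatorial rotate: a fixed re-wiring of the bits."""
    n %= WIDTH
    return ((word << n) | (word >> (WIDTH - n))) & MASK


def decisive_shift(word: int, enable: int) -> int:
    """Decisive shifter: a 2x1 multiplexer selects rotated or original data."""
    return rotl(word) if enable else word


class SequentialShifter:
    """Sequential shifter: a register whose content rotates on each clock."""

    def __init__(self) -> None:
        self.q = 0

    def clock(self, d: int = 0, load: bool = False) -> int:
        # On a clock edge either capture new data or rotate the stored word.
        self.q = (d & MASK) if load else rotl(self.q)
        return self.q
```

The model makes the cost ordering visible: rotl is wiring only, decisive_shift adds one multiplexer per bit, and SequentialShifter adds one flip-flop per bit on top of the multiplexer.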

Analysis
A VHDL model of all three shifters was built and analyzed on the specified hardware targets.

Design          Device    Compiler          Optimization   LE Utilization   Max Delay
Sequential      FLEX10K   MaxPlus II 7.0    Speed          64               10.9ns
                                            Area           64               10.4ns
                XC4000E   WVO7.2/XA6.0.1    Area           32               10.4ns
Decisive        FLEX10K   MaxPlus II 7.1    Speed          32               5.2ns
                                            Area           32               5.2ns
                XC4000E   WVO7.2/XA6.0.1    Area           16               19.3ns
Combinatorial   N/A       N/A               N/A            N/A              N/A

Table 9.9: 32bit rotation box (source: rot.vhd, lmrot.vhd, larot.vhd)

Observe that the performance was nearly equal across the two devices for the sequential shifter implementation, but dropped off significantly in the FPGA for the multiplexer-based design.

9.4.5 Adders
There are various types of standard adder architectures which have various speed/area properties and are well suited to different types of hardware architecture. Rather than try to model and analyze each one in each different piece of hardware, we used the built-in functions provided by the hardware vendor. For Altera, we used the LPM module LPM_ADD_SUB, and for Xilinx, we used the X-BLOX module ADD_SUB. The results are in Table 9.10.

Device            Size     Optimization   Utilization   Max Delay
EPF10K10TC144-3   32 bit   Area           63 LEs        102.4ns
                           Speed          110 LEs       43.1ns
EPF10K30BC356-3   64 bit   Area           127 LEs       196.0ns
                           Speed          240 LEs       73.2ns
XC4020EPG223-3    32 bit   -              17 CLBs       24.1ns
                  64 bit   -              33 CLBs       45.7ns

Table 9.10: Adder in various hardware

Observations
Here we note that the performance in the FPGA architecture far exceeded that of the CPLD. This can be attributed to the fast carry logic available between adjacent logic elements in the FPGA. In the CPLD, once the design is larger than one PLD, the logic must be placed in a neighboring PLD. Since the carry logic is implemented to skip every other PLD, the propagation delay is greater due to the longer distances. In addition, the effects of utilizing the global routing resources may play a role in the slowdown that we observed.
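The role of the carry path can be illustrated with a generic ripple-carry model. The Python sketch below is an assumption made for illustration only; it does not describe the internal structure of the vendor adder macros. It adds two words bit by bit and reports the longest run of stages through which a carry had to propagate.

```python
def ripple_carry_add(a: int, b: int, width: int):
    """Bitwise addition; returns (sum, carry_out, longest_carry_chain)."""
    carry = 0
    result = 0
    chain = 0      # current run of consecutive stages producing a carry
    longest = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry
        carry = (x & y) | (carry & (x ^ y))
        result |= s << i
        chain = chain + 1 if carry else 0
        longest = max(longest, chain)
    return result, carry, longest
```

In the worst case (all ones plus one) the carry crosses every stage, so the chain length equals the word width. Under this simple model, doubling the width doubles the worst-case carry path, which is consistent with the 64-bit delays in Table 9.10 being roughly twice the 32-bit delays.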


9.4.6 The Hidden Components


The components listed above in this chapter are all examples of components found in actual cryptographic algorithms. They represent the mathematical/logical operations that provide the actual security. However, in a real world implementation, there are components that need to be used that are not necessarily part of the specified algorithm. These include especially registers (to hold temporary results), data buffers (to hold user data while processing), and multiplexers (to route data through the device for processing). Registers themselves do not require individual study, for they simply require one flip-flop per bit and have been studied as part of other components (such as the sequential shifter). The other two, however, deserve a little more attention.

Data Buffers
Algorithms such as the Secure Hash Algorithm (SHA) process a 512 bit buffer of user data. This data is read in 32 bit portions, but the ordering is such that most of the buffer must be available at any given moment. This is because 4 words are needed for every calculation and the words are spread out over the 512 bit array. This means that a 512 bit register would not work well (which is just as well, because a register of that size is very inefficient), but some kind of RAM with a 32 bit word size would be perfect. Luckily, modern reconfigurable hardware such as the XC4000E and FLEX10K supports RAM/ROM allocation and will work well with the SHA algorithm. For this test a 256x32 RAM component was built using the built-in RAM macrofunctions provided by each vendor. For Altera, lpm_ram_dq was chosen, and MemGen was used for Xilinx. The results are posted in Table 9.11.

Device               Utilization            Max Delay
EPF10K100GC503-3DX   8192 bits (4 EABs)     9.5ns
XC4020EPG223-3       8192 bits (360 CLBs)   60.6ns

Table 9.11: 256x32 RAM buffer in various hardware
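The access pattern that motivates a word-addressed RAM can be sketched in a few lines. The Python example below assumes the SHA-1 form of the message schedule (four earlier 32-bit words are XORed and rotated left by one bit) and compares a straightforward 80-word expansion with a 16-word circular buffer, which is the software analogue of a small RAM with a 32 bit word size. The function names are illustrative and not taken from the thesis sources.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & MASK32


def schedule_full(block_words):
    """Reference expansion that keeps all 80 schedule words."""
    w = list(block_words)
    for t in range(16, 80):
        w.append(rotl32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    return w


def schedule_ram(block_words):
    """Same expansion using only a 16 x 32-bit circular buffer."""
    ram = list(block_words)            # 512 bits of user data
    out = list(ram)
    for t in range(16, 80):
        s = t & 15                     # write address, wraps at 16
        # Four reads at scattered addresses, one write per step.
        ram[s] = rotl32(
            ram[(s + 13) & 15] ^ ram[(s + 8) & 15] ^ ram[(s + 2) & 15] ^ ram[s], 1
        )
        out.append(ram[s])
    return out
```

Each step reads four words at scattered addresses and writes one, so a single wide register offers no advantage over a RAM organized in 32-bit words.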


Multiplexers
As mentioned above, multiplexers are components commonly found in any real world implementation, and are therefore important to analyze. For this test we took a 32-bit 2x1 multiplexer model and synthesized it into our test chips. The results are in Table 9.12.

Device            Utilization   Max Delay
EPF10K10TC144-3   32 LEs        5.2ns
XC4020EPG223-3    16 LEs        19.3ns

Table 9.12: 32bit 2x1 MUX in various hardware

9.4.7 Component Conclusion


This section has presented a description of some of the components found in cryptographic algorithms, together with the results of our analysis of each. It was shown that neither architecture analyzed has a distinct advantage over the other. One may excel in one area, while the other will excel in an unrelated area. The choice of a particular vendor depends on the specific cryptographic algorithm and on the cost of each chip versus the efficiency of the component placement. The algorithm in question should be checked against Table 9.1 in Section 9.3. If the algorithm is new, or not listed, the designer can simply decompose the algorithm in the same manner as presented here and then perform the analysis.

Chapter 10 Designing for High Performance


10.1 The Data Encryption Standard
10.1.1 Introduction

The Data Encryption Standard (DES) is probably the most commonly used algorithm in the world for symmetric encryption of data. It makes a good example for implementation because it contains many of the components and constructs that have been introduced in the previous chapter. Especially important is the fact that the algorithm has been approved for use in ATM by the ATM Forum. For an explanation of DES, see [12], [28, page 70]. As was explained earlier, there are many algorithms that will work well with ATM, but by using the data from Section 9.4 together with this design example, we should be able to assess whether RC hardware is acceptable in principle for use in high speed secure networks.
10.1.2 Design

DES uses a Feistel network architecture with 16 rounds. Each round can be implemented as separate hardware with pipeline stages between each one for high-throughput applications. However, this consumes major silicon real estate and generally will not work in reconfigurable hardware because it is too large. For designs with fewer than 16 rounds of fixed hardware, some kind of feedback loop must be established. This is where the need for registers and multiplexers comes in. When we set out to implement DES in reconfigurable logic for high speed networks, there was a set of design criteria that we wanted to meet:

1. It must be targeted for high performance (as opposed to smallest size).
2. It must complete an operation (such as encrypt or decrypt) in the fewest possible cycles (which is 16 for a simple design).
3. It must fit into a commercially available chip (as opposed to one that is only in beta test).
4. It must provide for loop unrolling for future speed improvements.

To meet (1), we designed the I/O for full bus width with separate input and output buses. We also implemented a full width key bus. To meet (2), we used strategic placement of the bus registers to grab the data at key locations to allow for state machine operation that was dependent only on the number of iterations, not I/O latencies. To meet (4), we carefully designed the control, key scheduler, and Feistel network to allow for the addition of arrayed components.

Figure 10.1 shows the layout of the components from a schematic point of view. Note the use of two 64-bit registers: one on the inputs and another in the feedback loop. This design allows us to "steal" an extra clock cycle at the expense of 64 flip-flops and one gate more of complexity through the Feistel network (through the MUX). Without the secondary register at the data feed, we would be required to go to the INIT state after the sixteenth round completed so that the outputs could stabilize to the correct data (DONE is asserted) before new data is read in. With this register in place, we can successfully read a new data set in at the conclusion of round 16, thus producing the cycle chain INIT, R1, R2, ..., R16, R1, R2, ... etc. Without it, the state machine must return to INIT every time before starting the next set. This addition saves a single clock cycle of latency per block (16 cycles instead of 17) and increases system throughput by 6.25%.

Figure 10.1: Schematic map of DES algorithm (blocks shown: Control with Start, Busy and Done; Key Sched with 64-bit Key In; IP; four REG32 registers; Feistel Net, marked as the arrayable portion for loop unrolling; IPINV; 64-bit Data In and Data Out)

In addition to designing the main data path for high speed, the key scheduler must also be designed to deliver data at the same rate and with correct framing with respect to the data path. In order to do this, we placed a single register to sample data coming from a multiplexer. The multiplexer is fed by both the feedback and the outside key input. The output from the registers feeds a chain of scheduler units, one for each unrolled loop in the path (in this case study, no unrolling is performed, so only keyunit0 is installed). Each keyunit schedules the keys for one round, receiving control information from its respective command line (stkeyunit0, stkeyunit1, stkeyunit2, etc.). The output from each keyunit is fed to its respective Feistel network and to the next unit in the chain. The last unit in the chain feeds its Feistel network and then loops the output back into the master key scheduler for storage in the registers. This operation is displayed in Figure 10.2. The results of our implementation are posted in Section 11.2.

Figure 10.2: Schematic map of key schedule logic (master key scheduler with 64-bit key in, PC-1, source 2x1 MUX and two REG28 registers; key units 0 to 3, each with C and D shifters and PC-2, feeding FN0 to FN3; feedback from the last unit to the master)
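The feedback path of the key scheduler can be illustrated with a model of the two 28-bit halves, C and D, that the scheduler rotates each round. The Python sketch below uses the standard DES encryption shift amounts; PC-1 and PC-2 are deliberately left out, and the names SHIFTS, rotl28 and cd_states are illustrative rather than taken from the thesis sources.

```python
# Left-rotation amounts for the 16 DES encryption rounds.
SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]
MASK28 = (1 << 28) - 1


def rotl28(x: int, n: int) -> int:
    """Rotate a 28-bit half-key left by n positions."""
    return ((x << n) | (x >> (28 - n))) & MASK28


def cd_states(c0: int, d0: int):
    """Return the (C, D) register contents after each of the 16 rounds."""
    c, d = c0, d0
    states = []
    for n in SHIFTS:
        # One key unit: rotate both halves, then feed them back.
        c, d = rotl28(c, n), rotl28(d, n)
        states.append((c, d))
    return states
```

The shift amounts sum to 28, so after round 16 the value fed back to the registers equals the original C and D. One reading of this, offered here as an observation rather than a statement from the source, is that the fed-back registers already hold the starting key state when the next block begins.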


[State diagram with states INIT, R1, R2, ..., R15, R16. Transitions are labelled with the start input and the outputs busy, st, source and done. INIT holds while start = 0 (busy = 0) and moves to R1 when start = 1 (busy = 1, st = 0, source = 0, done = 0). From R16 the machine moves to R1 when start = 1 and to INIT when start = 0.]

Figure 10.3: State Diagram of Control Unit
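The effect of letting the control unit move from R16 directly to R1 can be checked with a small cycle-count model. The Python sketch below is a simplification of the control unit: it models only the state sequence, not the busy, st, source or done outputs, and the function name and the has_feed_register flag are hypothetical.

```python
def cycles_for_blocks(n_blocks: int, has_feed_register: bool) -> int:
    """Count clock cycles needed to process n_blocks back to back."""
    state = "INIT"
    cycles = 0
    done = 0
    while done < n_blocks:
        cycles += 1                      # one clock spent in the current state
        if state == "INIT":
            state = "R1"
        elif state == "R16":
            done += 1
            # With the second 64-bit register new data is taken in R16,
            # so the next block can start in R1 without visiting INIT.
            state = "R1" if has_feed_register else "INIT"
        else:
            state = "R" + str(int(state[1:]) + 1)
    return cycles


def steady_state_gain() -> float:
    """Relative throughput gain of 16 cycles per block over 17."""
    return 17 / 16 - 1
```

For a long run of blocks the model needs 16 cycles per block with the register and 17 without it, which gives the 6.25% throughput difference.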

Signal     Direction and width
Data_IN    IN[63:0]
Data_OUT   OUT[63:0]
Key_IN     IN[63:0]
E/D        IN[1]
Start      IN[1]
Busy       OUT[1]
Done       OUT[1]
Clk        IN[1]
Reset      IN[1]

Figure 10.4: DES Signal Map

Part IV Results and Conclusions


Chapter 11 Results
11.1 Comparing the Results in Reconfigurable Hardware
In our experiments we compared two high end device families (manufactured by Xilinx and Altera) in a series of tests intended to show the relative performance of cryptographic algorithms. It may be unclear which device actually performs better, and in actuality, they were very close. It is difficult to estimate the actual resources consumed in a device since there are so many discrepancies in the way information is provided by the companies. Oftentimes the conversion between logic elements (LEs) and typical gate counts is overestimated and can confuse the user. We have assembled data from the two market leaders, Xilinx and Altera, based on a series of devices from each family, and their cost to the end user. All prices are assumed to be for 100 unit quantities.


11.1.1 Methodology
Because the task of assessing a dollar value for each individual resource in a device is much too complex, we will use two simple models. The first takes only logic elements into consideration, and we will refer to it as Logic Element Weighting (LeW). This method simply divides the total cost of the device by the number of LEs available. The second method considers only RAM as a resource, and we will call it RAM Element Weighting (ReW). For simplicity, we will define one RE as 32 bits of RAM.
Family   No.   Device             Logic Elements   Device($)   LE($)
Xilinx   1     XC4003EPG120C-3    100              65.10       0.65
         2     XC4008EPG191C-3    324              163.00      0.51
         3     XC4010EPG191C-3    400              227.00      0.57
         4     XC4013EPG223C-3    576              382.00      0.66
         5     XC4020EPG223C-3    784              401.00      0.51
         6     XC4036EXPG411C-3   1296             814.00      0.63
Altera   1     EPF10K10LC84-3     576              32.50       0.06
         2     EPF10K20TC144-3    1152             65.50       0.06
         3     EPF10K30RC208-3    1728             132.00      0.08
         4     EPF10K40RC208-3    2304             175.00      0.08
         5     EPF10K50VRC-3      2880             274.00      0.10
         6     EPF10K100CG503-3   4992             892.00      0.18

Table 11.1: LE Weighted (LeW) Cost Analysis of Various Devices
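As a worked example of the two weightings, the Python sketch below divides a device price by its element count and rounds to cents. The function name weight is illustrative; the prices and element counts are the ones quoted in this section for the XC4013EPG223C-3 and the EPF10K40RC208-3.

```python
def weight(device_cost: float, elements: int) -> float:
    """Dollars per element: the LeW when given LEs, the ReW when given REs."""
    return round(device_cost / elements, 2)


# LeW: price divided by the number of logic elements.
lew_xc4013e = weight(382.00, 576)
lew_epf10k40 = weight(175.00, 2304)

# ReW: price divided by the number of 32-bit RAM elements.
rew_epf10k40 = weight(175.00, 512)
```

Because the model charges the full device price to a single resource type, the LeW and ReW figures of one device should not be added together.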


Family   No.   Device             RAM Elements   Device($)   RE($)
Xilinx   1     XC4003EPG120C-3    100            65.10       0.65
         2     XC4008EPG191C-3    324            163.00      0.51
         3     XC4010EPG191C-3    400            227.00      0.57
         4     XC4013EPG223C-3    576            382.00      0.66
         5     XC4020EPG223C-3    784            401.00      0.51
         6     XC4036EXPG411C-3   1296           814.00      0.63
Altera   1     EPF10K10LC84-3     192            32.50       0.17
         2     EPF10K20TC144-3    384            65.50       0.17
         3     EPF10K30RC208-3    384            132.00      0.34
         4     EPF10K40RC208-3    512            175.00      0.34
         5     EPF10K50VRC-3      640            274.00      0.43
         6     EPF10K100CG503-3   768            892.00      1.16

Table 11.2: RE Weighted (ReW) Cost Analysis of Various Devices

Our next step is to establish the relative cost of each of the cryptographic components in the different architectures. One way of comparing the devices is to use two chips from the same price range. This way, we can compare consumption of resources versus available resources. Another way is to assign a price per resource and list each component's relative price/performance. Because of the differences in architecture, it will be necessary to take into consideration all of the subtle details in order to get accurate results. For instance, Altera FLEX10K devices have separate RAM components, while Xilinx XC4000E devices trade logic for RAM and vice versa. So if a component takes 32 logic elements, this equals 32 LE resources in Altera, and 32 LE + 32 RE in Xilinx, since each logic element is also (potentially) 32 bits of RAM.

11.1.2 Same Relative Cost Comparison


In this comparison, we take two devices of the same relative price and compare them. The EPF10K40RC208-3 currently costs $175.00 while the XC4010EPG191C-3 costs $227.00, which is relatively close. In addition to these two devices, the EPF10K50VRC-3 costs $315.00 (for 500-999 unit quantities) and the XC4013EPG223C-3 costs $382.00. Table 11.3 lists the resources available in each chip. Note that Altera will be denoted as Type A, and Xilinx will be Type B.

Group   Device             LEs     Flip-Flops   REs     I/Os   Total Cost($)   LeW($)   ReW($)
1       EPF10K40RC208-3    2,304   2,576        512     189    175             0.08     0.34
        XC4010EPG191C-3    400     1,120        400     160    227             0.57     0.57
2       EPF10K50VRC-3      2,880   3,184        640     310    315             0.10     0.43
        XC4013EPG223C-3    576     1,536        576     192    382             0.66     0.66
3       EPF10K100CG503-3   4,992   5,392        768     406    892             0.18     1.16
        XC4036EXPG411C-3   1,296   3,168        1,296   288    814             0.63     0.63

Table 11.3: Similar cost comparison


Component          Group   Type   LEs   REs   LeW($)   ReW($)   % Resources   Speed(ns)
XOR                1       A      32    0     2.56     0        1.39          12.0
                           B      16    16    9.12     9.12     4.00          10.4
                   2       A      32    0     3.20     0        1.11
                           B      16    16    10.56    10.56    2.78
                   3       A      32    0     5.76     0        0.64
                           B      16    16    10.08    10.08    1.23
8 SBOX(ROM)        2       A      0     512   0        220.16   80.00         18.3
                           B      80    64    52.80    52.80    13.89         15.8
1 SBOX(AREA)       2       A      113   0     11.30    0        3.92          49.6
                           B      18    18    23.76    23.76    3.13          51.1
1 SBOX(SPEED)      2       A      119   0     11.9     0        4.13          31.0
                           B      89    89    58.74    58.74    15.4          36.3
Shifter(DecArea)   2       A      32    0     3.20     0        1.11          5.2
                           B      16    16    21.12    21.12    2.78          19.3
Adder(32bitArea)   2       A      63    0     6.30     0        2.18          102.4
                           B      17    17    11.22    11.22    2.95          24.1
Adder(64bitArea)   2       A      127   0     12.70    0        4.41          196.0
                           B      33    33    43.56    43.56    5.73          45.7
Buffer(256x32)     2       A      0     256   0        110.08   40.0          9.5
                           B      360   256   237.6    168.96   62.5          60.6

Table 11.4: Components Evaluated with Cost Comparison
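The way a component row is derived from the device weightings can be shown for the Group 1 XOR entries. The Python sketch below multiplies the element counts by the per-element weights and expresses the LE usage as a share of the device; the function name component_cost and the dictionary keys are illustrative, and only the XOR rows of Group 1 are reproduced here.

```python
def component_cost(les: int, res: int, lew: float, rew: float, device_les: int):
    """Cost and utilisation of one component under the LeW/ReW model."""
    return {
        "lew_cost": round(les * lew, 2),             # dollars of logic consumed
        "rew_cost": round(res * rew, 2),             # dollars of RAM consumed
        "pct": round(100.0 * les / device_les, 2),   # share of the device's LEs
    }


# Group 1, Type A (EPF10K40RC208-3): 32 LEs, no REs, 2,304 LEs on the device.
xor_a = component_cost(32, 0, 0.08, 0.34, 2304)

# Group 1, Type B (XC4010EPG191C-3): 16 LEs that are also counted as 16 REs.
xor_b = component_cost(16, 16, 0.57, 0.57, 400)
```

For components that use only RAM in the Type A device, the percentage would have to be taken over REs rather than LEs; the sketch does not cover that case.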

11.1.3 Same Relative Size Comparison


In this comparison, we take the estimated typical gate count from the manufacturers' specifications and compare two devices of about the same size. The first is the EPF10K20TC144-3 with an estimated gate range of 15K-63K. The second is the XC4020EPG223C-3 with an estimated gate range of 13K-40K. Table 11.5 lists the resources available in each chip.

Device            Gate Range   LEs     Flip-Flops   REs   I/Os   Total Cost($)   LeW($)   ReW($)
EPF10K20TC144-3   15K-63K      1,152   1,344        384   189    65.50           0.06     0.17
XC4020EPG223-3    13K-40K      784     2,016        784   224    401             0.51     0.51

Table 11.5: Same Relative Size Comparison


Figure 11.1: Similar Cost Comparison

11.2 Analysis of DES Implementation


11.2.1 Component Comparison
Table 11.4 lists several components and their relative costs in both dollars and resources consumed. Under this model, the Altera devices are cheaper when compared side by side with the Xilinx devices. Table 11.6 lists several components and their resources in devices where the chips are rated to be the same size. As mentioned above, these results are reflective of our methodology, which is a rather simple model. A more complex model is needed to accurately describe the differences and cost/performance ratios. However, we can note that a fair indicator may be the % resources consumed.

Component          Type   LEs   REs   LeW($)   ReW($)   % Resources   Speed(ns)
XOR                A      32    0     1.92     0        2.78          12.0
                   B      16    16    8.16     8.16     2.04          10.4
8 SBOX(ROM)        A      0     512   0        87.04    133.33        18.3
                   B      80    64    40.80    40.80    10.20         15.8
1 SBOX(AREA)       A      113   0     6.78     0        9.81          49.6
                   B      18    18    9.18     9.18     2.29          51.1
1 SBOX(SPEED)      A      119   0     7.14     0        10.22         31.0
                   B      89    89    45.39    45.39    11.35         36.3
Shifter(DecArea)   A      32    0     1.92     0        2.78          5.2
                   B      16    16    8.16     8.16     2.04          19.3
Adder(32bitArea)   A      63    0     3.78     0        5.47          102.4
                   B      17    17    6.67     6.67     2.17          24.1
Adder(64bitArea)   A      127   0     7.62     0        11.02         196.0
                   B      33    33    16.83    16.83    4.21          45.7
Buffer(256x32)     A      0     256   0        43.52    66.67         9.5
                   B      360   256   186.6    130.56   45.92         60.6

Table 11.6: Components Evaluated with Size Comparison

11.2.2 DES Comparison


After completing the design using VHDL modeling and synthesis tools, we realized an entire DES implementation. The design mapped easily into a variety of devices. Figure A.2 in the appendix is an example of the logic floorplan in an XC4020EPG223-3 device. We also determined the performance ratings posted in Table 11.7. For maximum throughput, we measured 62.4Mb/s from the Xilinx device without a single loop unrolled. Close behind it was the Altera device, which maxed out at 57.60Mb/s. One important difference, however, is that the Altera design cannot be unrolled any further because the memory EABs have been exhausted. From this point on, we will therefore consider ROM-mapped designs only in the Xilinx devices because of the limitations in the Altera chips. This yields, comparatively, 62.4Mb/s for Xilinx and 39.96Mb/s for Altera. Both of these designs support loop unrolling. We will now analyze these designs in the same manner as the individual components.

Device            Opt.                                    Floorplan      Resources   Delay(ns)   Clock(MHz)   Tput.(Mb/s)
XC4020EPG223-3    area=low, collapse=off,                 Manual         448         154.9       7.00         27.99
                  redundancy=off, P/R = 2,2
                  same as above, P/R = 4,4                Automatic      646         115.0       9.10         36.40
                  SBOX=RAM                                Manual         359         145.5       6.80         27.20
                  SBOX=RAM, P/R = 4,4                     Automatic      549         76.0        14.10        56.40
                  SBOX=RAM, P/R = 4,4                     Auto w/o Bus   545         70.5        15.60        62.40
EPF10K30RC240-3   Norm/Speed/Area                         Automatic      1319        112.7       9.99         39.96
                  SBOX=RAM                                Automatic      403+8EAB    69.4        14.4         57.60

Table 11.7: DES Performance
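The throughput column of Table 11.7 follows from the clock rate, since one 64-bit block is completed every 16 clock cycles in the design without loop unrolling. The Python sketch below expresses that relation; the function name and its default parameters are illustrative.

```python
def des_throughput_mbps(clock_mhz: float, block_bits: int = 64,
                        cycles_per_block: int = 16) -> float:
    """Throughput in Mb/s for one block every cycles_per_block clocks."""
    return block_bits * clock_mhz / cycles_per_block


xilinx_rom = des_throughput_mbps(15.60)    # XC4020E, SBOX in RAM
altera_rom = des_throughput_mbps(14.4)     # FLEX10K, SBOX in EABs
altera_logic = des_throughput_mbps(9.99)   # FLEX10K, synthesized SBOX
```

With these parameters the throughput is simply four times the clock frequency in MHz. The same function can be used, with a smaller cycles_per_block, to estimate the effect of loop unrolling under the optimistic assumption that the clock rate stays unchanged.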

11.2.3 Same Cost Comparison of DES


In this comparison, we use devices from Group 2 to check how DES performs in terms of cost and resource consumption. Table 11.8 lists the findings. Note that the resource consumption for the Altera device is significantly lower than that of the FPGA. However, the throughput of the FPGA was higher.

Type   LEs    REs   LeW Total   ReW Total   % Resources   Throughput
A      1319   0     131.9       0           45.80         39.96
B      545    500   359.7       330         94.62         62.4

Table 11.8: Similar Cost Device Comparison of DES

11.2.4 Same Size Comparison of DES


In this comparison, we use the two devices of the same size (as claimed by their manufacturers) to determine the overall resources consumed. The results are in Table 11.9. As can be seen, DES does not even fit into the Altera device, even though the two devices are supposed to be the same size. In this comparison, the resource consumption test yields better results for the FPGA architecture.

Figure 11.2: Similar Cost Device Comparison of DES


Type   LEs    REs   LeW Total   ReW Total   % Resources   Throughput
A      1319   0     79.14       0           114.50        39.96
B      545    500   277.95      255         69.52         62.4

Table 11.9: Similar Size Device Comparison of DES

Figure 11.3: Similar Size Device Comparison of DES
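The "% Resources" figures for the complete DES design can be reproduced by dividing the logic elements consumed by the logic elements available, which also shows directly whether the design fits. The Python sketch below uses the LE counts quoted in this chapter; the function name utilisation is illustrative.

```python
def utilisation(required_les: int, device_les: int):
    """Return (percentage of LEs used, whether the design fits)."""
    pct = round(100.0 * required_les / device_les, 2)
    return pct, pct <= 100.0


# Same-size comparison: EPF10K20 (1,152 LEs) and XC4020E (784 LEs).
altera_same_size = utilisation(1319, 1152)
xilinx_same_size = utilisation(545, 784)

# Same-cost comparison: EPF10K50 (2,880 LEs) and XC4013E (576 LEs).
altera_same_cost = utilisation(1319, 2880)
xilinx_same_cost = utilisation(545, 576)
```

A value above 100% means that the design needs more logic elements than the device provides, which is the case for the synthesized DES in the smaller Altera part.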

Chapter 12 Conclusions
12.1 Design Recommendations for ATM
There are several parameters that should be considered when designing secure ATM devices. This section summarizes our findings to give the reader a clear picture of which specifications are the most important in the system.

Key agility - The system should cryptographically isolate every channel, even if operating in VPC mode.
Call Mode - Should support both PVC and SVC operation.
Throughput - Should support the throughput requirements of the desired application (e.g., 155Mb/s for OC-3, 622Mb/s for OC-12c).
Latency - Security services require computation, and the resulting latency in the data stream will reflect this. Make sure the maximum latency meets the needs of the application. Typical ranges are 5ms through 7ms.
Maximum Call Capacity - A key agile system requires more memory for every VC supported. Make sure the supported call capacity meets the system requirements. Typical values are on the order of 1024-65,535 connections for ND link encryptors and 256-1024 for NICs.
Algorithms - Support for strong public and private key algorithms is a very important issue. Refer to Section 5.2 for further details.
Hardware accelerators - Authentication and other public key operations can be greatly improved through the use of hardware accelerators.
Key Management - A system that supports automatic negotiation of keys is more desirable than some previous implementations that required out-of-band negotiations.

12.2 Reconfigurable Hardware and ATM

We conclude with the following regarding RC hardware:

1. High speed encryptors seem feasible for RC hardware.
2. Speeds in excess of 60Mb/s were achieved when taking advantage of special RC hardware.
3. We believe that through further study we can achieve 100Mb/s.
4. Substitution boxes are a critical component in RC hardware. Carefully executed designs can achieve high throughputs.
5. Once the price per device is lowered, the advantages of using RC hardware for bulk data encryption will far outweigh those of ASIC solutions.

Overall, Altera offered a better cost/efficiency ratio, but Xilinx outperformed Altera. Since ATM is targeted at high speed applications and the devices are typically high end, the additional costs in hardware would most likely be worth it to the system designer. For this reason we recommend Xilinx, or an FPGA-like architecture, as opposed to CPLDs. The main reasons are:

Faster overall performance - Our DES implementation performed better in the Xilinx FPGA.
Scales better - In both size and cost. The Altera devices exhibited non-linear scaling factors for changes in bit widths and for device changes.
Provides for loop unrolling - Loop unrolling should provide a higher throughput in the device. This is not possible with the FLEX10K because each sbox takes one EAB. Even though an EAB can hold a substantial amount of memory (2048 bits), it can only be used for one component. Therefore, one SBOX takes one EAB and we quickly exhaust the supply. However, we can instantiate any number of 10-CLB SBOXes in the XC4000E until we run out of logic elements. Since our one-loop implementation only consumed about half of the available resources in our test device (XC4020EPG223C-3), we can theoretically unroll the loop 2-4 times, and potentially even more in a bigger device.

12.3 Future Work


This work has touched upon many different disciplines and areas of communications engineering. Had there been more time, we would have liked to explore the behavior of cryptographic algorithms in reconfigurable hardware more carefully. The results that we did find show that this technology is viable for use in high speed networking and data security. In addition, the work involving the design of the link encryptor in Section 5.9.3 needs to be studied more closely to determine the feasibility of expanding parallelism on a per channel basis (as opposed to a per cell basis). Another area left unexplored is the benefit of loop unrolling. We believe that this technique will allow a developer to further improve the performance of the RC device, above and beyond the results we published here. Other symmetric ciphers, such as IDEA, are based on algebraic operations (AO). Since the work done in this thesis is based on SP ciphers, we cannot make accurate predictions as to how AO ciphers will behave. Future work should include these ciphers as well.

12.4 Summary
This thesis has aimed to give the reader some insight into ATM networks, the issues with providing security over those networks, and an introduction to the issues of using reconfigurable logic for the main encryption hardware. We also provided data regarding the implementation of cryptographic algorithms in reconfigurable hardware in general, such as the cost vs. speed, and how to assess an algorithm for its size and delay characteristics before any design work begins. Although there has been substantial work done in the area of reconfigurable architectures, very little has been done in terms of cryptographic algorithms. We hope that this work will alert both potential developers of security devices and reconfigurable hardware vendors to the viability of cryptographic applications and the need for further study. Oftentimes crypto algorithms exhibit patterns in their architecture that may be exploited by new hardware designs. With the recent interest in cryptographic technologies by mass market companies, the use of new hardware technologies will be of value. In closing, there has been a lot of work done in the last few years regarding ATM security. We believe that the concept of using reconfigurables for this technology is a promising and interesting addition to the growing interest in the field of high speed secure networks. It is hoped that, through the experiments performed in this study, any designer can make an intelligent decision as to which hardware will meet the needs of their application.

Part V References


Appendix A DES


[Simulation waveform, 0 to 4 microseconds. Traces: KEY_IN, KU0_PC1CD, KU0_PC1KS, KU0_REGC, KU0_REGD, KU0_SC, KU0_SD, KU0_PC2IN, KU0_IN, DATA_IN, DATA_OUT, STAGEA_OUT, STAGEC_OUT, STAGEC2_OUT, STAGED_OUT, CLK, RESET, START, BUSY, DONE, CNTR_KSEL, CNTR_DSEL, CNTR_KU0.]

Figure A.1: Simulation of DES Unit


Figure A.2: Floorplan of DES Unit in XC4020EPG223-3

Bibliography
[1] C.M. Adams and S.E. Tavares. Designing S-Boxes for ciphers resistant to differential cryptanalysis. Proceedings of the 3rd Symposium on State and Progress of Research in Cryptography, pages 181-190, Feb 1993.
[2] ATM Forum. ATM User Network Interface (UNI) Specification v3.1. Prentice Hall, Upper Saddle River, NJ, 1995.
[3] S. M. Bellovin and M. Merritt. Encrypted key exchange: Password-based protocols secure against dictionary attacks. In Proceedings of the 1992 IEEE Computer Society Conference on Research in Security and Privacy, pages 72-84, 1992.
[4] E. Biham and A. Shamir. Differential cryptanalysis of DES-like cryptosystems. In Advances in Cryptology - CRYPTO '90 Proceedings, pages 2-21. Springer-Verlag, 1991.
[5] E. Biham and A. Shamir. Differential cryptanalysis of DES-like cryptosystems. In Journal of Cryptology, volume 4, pages 2-21, 1991.
[6] Uyless Black. ATM: Foundation for Broadband Networks. Prentice Hall, Englewood Cliffs, NJ, 1995.
[7] L. Brown, J. Pieprzyk, and J. Seberry. LOKI: a cryptographic primitive for authentication and secrecy applications. In Advances in Cryptology - AUSCRYPT '90 Proceedings, pages 229-236. Springer-Verlag, 1990.


[8] Stephen Brown. FPGA architectural research: A survey. In Design and Test of Computers, volume 13, pages 9-15. IEEE, 1996.
[9] Stephen Brown and Jonathan Rose. Architecture of FPGAs and CPLDs: A tutorial. Technical report, Department of Electrical and Computer Engineering, University of Toronto, 1996.
[10] J. Daemen, R. Govaerts, and J. Vandewalle. A new approach to block cipher design. In Fast Software Encryption, pages 18-32. Cambridge Security Workshop Proceedings, Springer-Verlag, 1994.
[11] W. Diffie and M. E. Hellman. New directions in cryptography. IEEE Transactions on Information Theory, IT-22:644-654, Nov 1976.
[12] W.F. Ehrsam, C.H.W. Meyer, R.L. Powers, J.L. Smith, and W.L. Tuchman. Product block ciphers for data security. U.S. Patent Number 3,962,539, June 1976.
[13] H. Gutowitz. Cryptography with dynamical systems. Cellular Automata and Cooperative Phenomenon, 1993.
[14] ITU-T. Vocabulary of terms for broadband aspects of ISDN. Recommendation I.113 Section 2.2, ITU-T, November 1993.
[15] X. Lai and J. Massey. A proposal for a new block encryption standard. In Advances in Cryptology - EUROCRYPT '90 Proceedings, pages 389-404. Springer-Verlag, 1991.
[16] S. Lane. Security issues in moving from private to public ATM service. In ITU Americas Telecom 96, Technology Summit, Rio de Janeiro, Brazil, June 1996. ITU.


[17] S. Lane and G. Cohen. Security in ATM networks. Proceedings of the Technical Conference on Telecommunications RD in Massachusetts, pages 23-32, 1995.
[18] J.L. Massey. SAFER K-64: a byte oriented block-ciphering algorithm. In Fast Software Encryption, pages 1-17. Cambridge Security Workshop Proceedings, Springer-Verlag, 1994.
[19] M. Matsui. Linear cryptanalysis method for DES cipher. In Advances in Cryptology - EUROCRYPT '93 Proceedings, pages 386-397. Springer-Verlag, 1993.
[20] M. Matsui. Linear cryptanalysis of DES cipher (I). In Proceedings of the 1993 Symposium on Cryptography and Information Security (SCIS 93), pages 3C.1-14, Shuzenji, Japan, Jan 1993. (In Japanese).
[21] M. Matsui. Linear cryptanalysis method for DES cipher (III). In Proceedings of the 1994 Symposium on Cryptography and Information Security (SCIS 94), pages 4A.1-11, Lake Biwa, Japan, 27-29 Jan 1994. (In Japanese).
[22] R.C. Merkle and M. Hellman. On the security of multiple encryption. Communications of the ACM, 24:465-467, 1981.
[23] R.L. Rivest. The RC5 encryption algorithm. Dr. Dobb's Journal, 20:146-148, Jan 1995.
[24] M.J.B. Robshaw. Block ciphers. Technical report, RSA Laboratories, Jul 1994.
[25] M.J.B. Robshaw. Personal communication, 1995.
[26] Bruce Schneier. Applied Cryptography: Protocols, Algorithms, and Source Code in C. Wiley, 2nd edition, 1996.
[27] Daniel Stevenson, Nathan Hillery, and Greg Byrd. Secure communications in ATM networks. Technical report, MCNC, 1995.

[28] Doug Stinson. Cryptography: Theory and Practice. CRC Press, 1995.
[29] Douglas R. Stinson. Cryptography: Theory and Practice. CRC Press, 1st edition, 1995.
[30] Larry Waller. Focus report: Programmable logic. Technical report, ISD Archives, 1996.
[31] ANSI X3.92. American national standard for data encryption algorithm (DEA). American National Standards Institute, 1981.
[32] Xilinx Corporation. Data Book, 1996.
