by Gregory M. Haskins

A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE in partial fulfillment of the requirements for the Degree of Master of Science in Electrical Engineering

May, 1997

Approved:
Prof. Christof Paar, ECE Department, Thesis Advisor
Scott Lane, GTE Government Systems, Thesis Committee
Prof. Fred J. Looft, ECE Department, Thesis Committee
Prof. John M. Rulnick, ECE Department, Thesis Committee
Abstract
Data security plays an increasingly important role in today's information technology. Potential data rates in the gigabit range, such as those offered by ATM networks, put many constraints on the design of a secure, but usable, network. In addition, the cell structure of ATM makes bulk data encryption as well as public-key security services challenging tasks.

In this work, two major areas of ATM security are addressed. First, the special aspects and problems associated with overall security for ATM networks, such as potential threats, services, design considerations, and topology, are explored. The second part deals with agility of cryptographic algorithms, that is, the capability of an encryption device to change its algorithm. This feature appears to be very desirable for high speed networks because it facilitates design flexibility and future protocol additions and changes. We propose the use of reconfigurable hardware since it appears to be naturally suited for the task. The use of reconfigurables in cryptographic applications, to our knowledge, has not been systematically analyzed before and appears to be a highly interesting area within high speed network security.

The result of this thesis is a design for a secure ATM network, and a detailed analysis of the feasibility of using reconfigurable hardware to implement algorithm agility. The analysis includes information regarding an actual implementation and its price vs. performance in two popular architectures. One of the more interesting results is that DES can be realized without loop unrolling with data rates beyond 60Mb/sec on standard reconfigurable hardware.
Preface
I would like to thank the many people who contributed to this work. First, my advisor Christof Paar for his advice and support throughout this entire project. Without him, I may have inadvertently designed some insecure devices and embarrassed myself at RSA '97. Next, I would like to thank Kate Sullivan for her help with LaTeX. Together we gained valuable insight as to how to get tables to work correctly. She usually came to me for help, only to answer her own question and teach me something new in the process. Martin Rosner and Mike Roberts worked together with me on the component synthesis and testing. Without them I may not have been able to finish all of the experiments in time. Scott Lane and Dave King from GTE Government Systems were kind enough to meet with Dr. Paar and myself early in the project to discuss various architectures, designs, etc. I would like to thank them for giving us that initial start, which helped us complete the ATM design work for Lockheed. Lastly, I would like to thank the thesis committee for taking the time to read this project in the midst of busy schedules. Thanks everyone! -Greg
Contents

I Introduction
1 Motivation
2 Thesis Outline
3 ATM Overview
    3.1 ISDN
    3.2 ISDN and B-ISDN
    3.3 The ATM Layers
        3.3.1 ATM Adaptation Layer (AAL)
        3.3.2 ATM Layer
        3.3.3 HEC
        3.3.4 Physical Layer
    4.1 Potential Threats
    4.2 Security Services
    5.1 Encryption Hardware
        5.1.1 ASICs
        5.1.2 Reconfigurables
    5.2 Symmetric Algorithms
        5.2.1 Approved Algorithms
    5.3 Mode of Operation
    5.4 Synchronization
    5.5 Interleaving
    5.6 Key Storage
        5.6.1 Session Keys
        5.6.2 Public Key Encryption and Signature Keys
    5.7 Authentication
    5.8 Numeric Computation
    5.9 Key Agility
        5.9.1 Overall Layout
        5.9.2 Architecture Description
        5.9.3 Design Considerations
    5.10 Algorithm Agility
6 Security Topology
    6.1 Where to Place the Services
        6.1.1 Privacy
        6.1.2 Authentication
        6.1.3 Integrity
        6.1.4 Access Control
        6.1.5 Replay Prevention
        6.1.6 Non-Repudiation
    6.2 Hardware Location
        6.2.1 Network Placement
        6.2.2 Should Services Be Built In?
    6.3 Cryptographic Signaling
        6.3.1 Location
        6.3.2 Secure Call Establishment
    6.4 Key Management and Distribution
    8.1 Simple Reconfigurable Hardware
    8.2 Device Technology
        8.2.1 Interconnection Technology
        8.2.2 Logic Technology
        8.2.3 Segment Technology
        8.2.4 Internal Architectures
        8.2.5 Field Programmable Gate Arrays (FPGA)
        8.2.6 Complex Programmable Logic Devices (CPLD)
    9.1 Introduction
    9.2 Symmetric Block Cipher Algorithms
        9.2.1 Block Cipher Architecture
    9.3 Methodology
        9.3.1 Component Breakdown
        9.3.2 Implementation
    9.4 Component Description
        9.4.1 Permutation Boxes
        9.4.2 Logical Functions - XOR, AND, OR, NOT
        9.4.3 Substitution Boxes
        9.4.4 Shift/Rotate Registers
        9.4.5 Adders
        9.4.6 The Hidden Components
        9.4.7 Component Conclusion
12 Conclusions
A DES
V References
List of Tables

3.1 ISDN Q.931 Messages [2]
3.2 AAL Classes
3.3 Additional ATM Connection Control Messages
3.4 Functions supported by the UNI
4.1 Network Threats
5.1 Crypto algorithms suitable for ATM
5.2 Approved protocols for use with ATM
6.1 ATM Channel Definition
6.2 Recommended service location
9.1 The available algorithms and their component breakdown
9.2 Component Description
9.3 PAD delays in experimental hardware
9.4 32 bit XOR box (source: xormod.vhd)
9.5 Substitution box implementation in XC4000E technology with synthesized combinatorial logic (source: sbox1.vhd)
9.6 Substitution box in XC4000E technology with ROM (source: sbox1.mem)
9.7 Substitution box in FLEX10K technology with synthesis (source: sbox1.vhd)
9.8 Substitution box in FLEX10K technology with ROM (source: sbox1.mif)
9.9 32bit rotation box (source: rot.vhd, lmrot.vhd, larot.vhd)
9.10 Adder in various hardware
9.11 256x32 RAM buffer in various hardware
9.12 32bit 2x1 MUX in various hardware
11.1 LE Weighted (LeW) Cost Analysis of Various Devices
11.2 RE Weighted (ReW) Cost Analysis of Various Devices
11.3 Similar cost comparison
11.4 Components Evaluated with Cost Comparison
11.5 Same Relative Size Comparison
11.6 Components Evaluated with Size Comparison
11.7 DES Performance
11.8 Similar Cost Device Comparison of DES
List of Figures

3.1 ATM and the B-ISDN model
3.2 The AAL 3/4 PDU
3.3 ATM 5 byte Header
3.4 Segmentation of a 65535 user payload into 53 byte cells
5.1 The Key Agile Architecture
5.2 The General Security Architecture
5.3 Module Layout
6.1 Comparison of encryption in various levels of the ATM stack
6.2 The Modified Security Model for NIC implementations
6.3 The Modified Security Model for Network Device implementations
6.4 The operating system model of an ATM host
6.5 The operating system model with the A/B plane
8.1 Classes of Reconfigurable Hardware
8.2 Channeled array (side view)
8.3 The SRAM FPGA
8.4 Programmed Interconnects
8.5 The Programmable Array Logic Architecture
8.6 The Complex Programmable Logic Device Architecture
9.1 The Feistel Network
9.2 A 3x4 Substitution Box
9.3 The Permutation Box
10.1 Schematic map of DES algorithm
10.2 Schematic map of key schedule logic
10.3 State Diagram of Control Unit
10.4 DES Signal Map
11.1 Similar Cost Comparison
11.2 Similar Cost Device Comparison of DES
11.3 Similar Size Device Comparison of DES
A.1 Simulation of DES Unit
A.2 Floorplan of DES Unit in XC4020EPG223-3
Part I Introduction
Chapter 1 Motivation
Asynchronous Transfer Mode, or ATM, is a newly emerging technology for the transmission of voice, video, and data information in one common network. The security of this information has become an issue of great importance among many groups with the advent of electronic commerce and related technologies. In the past, security solutions have been devised after a communications technology has been declared a standard. The results of such solutions are often awkward and clumsy to use because work-arounds must be implemented where the network lacks support. However, for ATM, security issues are being worked out in parallel with the standard, so security designers have many opportunities to advance the state of the art in secure ATM networks.

This work is a result of two research grants received by the Cryptography and Information Security Group at Worcester Polytechnic Institute. The first dealt with a study of the state of the art in ATM security. This research sparked our interest in the low level issues inherent in adding security to ATM, such as encryptor design and placement within the networking environment. The result of the research was a set of ideas about how the most efficient link encryptor could (or should) operate and how to achieve true system agility. However, it was also unclear whether technology existed to support our ideas regarding system agility without further research. We proposed that reconfigurable hardware offered the greatest advantage for algorithm agility because of its flexibility and upgradable features. We began investigating the state of the art of reconfigurables and assessing the viability of cryptographic applications in them. There have been considerable efforts in the past investigating the optimal design of the reconfigurable hardware and the routing software. However, the application of cryptography is new to these devices.
services covered in Chapter 4. Through this chapter, the reader should gain knowledge about the actual structure of a secure ATM network.

Part III is a four chapter segment covering algorithm agility. It begins with Chapter 7 and a discussion of the motivations behind adding algorithm agility and why we feel reconfigurable hardware is the best solution. Chapter 8 introduces reconfigurable technology and the various architectures that are available. The reader should gain enough knowledge here to understand the motivations for Chapter 9.

Chapter 9 is a study of cryptographic applications in reconfigurable hardware. We assert that little is generally known about cryptography implemented on reconfigurable hardware. We begin by studying some algorithms and how certain portions of the algorithms behave when actually mapped to the hardware. The data collected here is helpful in the low level design phase presented in Chapter 10.

Chapter 10 is a description of the experiences we had while designing a popular cryptographic application in reconfigurable hardware. The main issues were both to fit the design into an available device and to make it as fast as possible, since the application is bound for a high speed network.

Finally, Part IV presents the reader with the results and conclusions that we obtained from our reconfigurable computing research. The results section contains an analysis of the various hardware configurations that we have tested, while the conclusion section contains descriptions of recommended design configurations and work left to be done by a future research group. Immediately following the thesis summary are floorplans and simulation data from our resulting hardware designs.
Chapter 3 ATM Overview

3.1 ISDN

Integrated Services Digital Network (ISDN) [6] was introduced by the telecommunications industry in the 1970s. It was originally designed as an evolutionary upgrade from previous technologies, allowing digital connections between a user and a network. It was intended to carry voice and image traffic, but has been extended to carry a wide variety of information like facsimile, television, data, etc.

The design of ISDN was influenced by existing technologies such as the T1/E1 standards. T1 is a 1.536Mb/s medium that multiplexes 24 64kb/s channels using Time Division Multiplexing (TDM). Like T1, ISDN has 64kb/s channels, similar transmission codes and identical physical connections. ISDN Basic Rate Interface (BRI), however, only has two 64kb/s "B" channels used for normal traffic, and an additional 16kb/s "D" channel for signaling, yielding a total of 128kb/s throughput for user data. A second configuration, the Primary Rate Interface (PRI), has 23-31 B + D channels, yielding approximately 1.5Mb/s.

By separating the signaling from the user traffic, ISDN has a type of Out-of-Band protocol. The advantage of this type of design is that the user and signaling packets are never confused because they each have their own channel. Also, no additional overhead is wasted trying to differentiate between a signal and user packet. However, during periods when no signaling information is needed, the 16kb/s of bandwidth is wasted.

The User Network Interface of ISDN is very similar to that of X.25 networks. Control is maintained through the use of the Q.931 protocol, which defines a set of messages used to manage ISDN connections. Table 3.1 lists the Q.931 messages.
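The channel arithmetic quoted above can be checked with a few lines of Python. This is only an illustrative sketch; the constant and function names are our own and do not come from any ISDN specification.

```python
from typing import Tuple

# Illustrative check of the channel rates quoted in the text (all in kb/s).
B_CHANNEL_KBPS = 64   # one bearer ("B") channel
BRI_D_KBPS = 16       # the BRI signaling ("D") channel


def t1_payload_kbps(channels: int = 24) -> int:
    """Aggregate rate of the TDM-multiplexed 64 kb/s channels on a T1."""
    return channels * B_CHANNEL_KBPS


def bri_rates_kbps() -> Tuple[int, int]:
    """Return (user data rate, rate including the D channel) for BRI."""
    user = 2 * B_CHANNEL_KBPS
    return user, user + BRI_D_KBPS


if __name__ == "__main__":
    print(t1_payload_kbps())  # 24 * 64 = 1536 kb/s, i.e. 1.536 Mb/s
    print(bri_rates_kbps())   # (128, 144)
```

The 16kb/s difference between the two BRI figures is the signaling bandwidth that sits idle when no signaling is in progress.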
3.2 ISDN and B-ISDN
Designers soon realized after the deployment of ISDN in the 1980s that the BRI/PRI interface was too slow. Development began on a new technology called Broadband-ISDN (B-ISDN), which was meant to be an extension of the existing ISDN technology. It became clear that a new functional approach must be used to achieve the gains required. Conceptually, B-ISDN is in fact an extension of ISDN, but the two technologies are not compatible with each other.

Table 3.1: ISDN Q.931 Messages [2]

Figure 3.1 shows the layers of the B-ISDN model. One major difference between the B-ISDN and ISDN models is that B-ISDN does not specify a physical layer. The ITU-T, however, recommends the use of ATM over Synchronous Optical NETwork (SONET) or Synchronous Digital Hierarchy (SDH). It was the goal of the designers of B-ISDN to provide the following features, which had not yet been implemented in any existing network:

1. Bandwidth-on-demand
2. Guaranteed cell sequence priority
3. Low overhead
4. Low delay
5. Constant delay
[Figure 3.1: ATM and the B-ISDN model. Labels: Higher Layers, ATM Layer, Physical Layer, Plane Management, Layer Management]

There are three primary planes in the B-ISDN model: the User plane (UPLANE), the Control plane (CPLANE), and the Management plane (MPLANE). The UPLANE provides services for flow control, recovery options, and user data transfer. The CPLANE manages connections, and is also responsible for setup and release of the connections. The MPLANE is responsible for maintaining the other layers and planes. ATM has been accepted by the standards bodies as the transport for B-ISDN networks.
SAR
The SAR sublayer must take a user size payload from the Application Layer and segment it into 48 byte payloads for the ATM layer. Conversely, the SAR layer must reassemble 48 byte payloads from the ATM layer into user size payloads for the Application Layer.
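The segmentation and reassembly described above can be sketched as follows. This is a minimal illustration, not the thesis design: the function names are hypothetical, the last chunk is zero-padded, and the original length is passed to the receiver directly, whereas a real AAL carries it in a length field.

```python
from typing import List

CELL_PAYLOAD = 48  # bytes of payload handed to the ATM layer per cell


def segment(payload: bytes, size: int = CELL_PAYLOAD) -> List[bytes]:
    """Split a user payload into fixed-size chunks, zero-padding the last."""
    chunks = [payload[i:i + size] for i in range(0, len(payload), size)]
    if chunks and len(chunks[-1]) < size:
        chunks[-1] = chunks[-1].ljust(size, b"\x00")
    return chunks


def reassemble(chunks: List[bytes], length: int) -> bytes:
    """Concatenate chunks and strip the padding.

    Assumption: the original length is known to the receiver; here it is
    simply passed as an argument.
    """
    return b"".join(chunks)[:length]


if __name__ == "__main__":
    message = bytes(range(100))
    cells = segment(message)
    print(len(cells), [len(c) for c in cells])
    print(reassemble(cells, len(message)) == message)
```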
CS
The CS sublayer performs different operations depending on the AAL class of data. In general, there are 5 classes of data (see Table 3.2). Depending on the AAL Type of operation, the CS layer will add extra information into a cell so that the remote host's AAL layer will be able to reassemble the user payload. It is beyond the scope of this report to explain each AAL Type and the fields that are present. However, for clarity, the AAL Type 3/4 will be explained below. Further information can be found in [2].
Class   Bit Rate                   Connection Mode       Timing relationship, Source to Dest
A       Constant Bit Rate (CBR)    Connection Oriented   Required
B       Variable Bit Rate (VBR)    Connection Oriented   Required
C       Variable Bit Rate (VBR)    Connection Oriented   Not Required
D       Variable Bit Rate (VBR)    Connection-less       Not Required
X       Traffic and timing determined by user

Table 3.2: AAL Classes
AAL 3/4
The AAL 3/4 Type (see Figure 3.2) operation supports VBR applications operating in either message or stream mode. Message mode indicates that a user payload has been segmented into multiple cells, while stream mode indicates the message is a stream in nature or is as small as one octet. It carries a 44 octet payload. The other 4 octets are split into the following information elements:
| 2b ST | 4b SN | 10b MID | 44B PAYLOAD | 6b LI | 10b CRC |

ST = Segment Type, SN = Sequence Number, MID = Message ID, LI = Length Indicator, CRC = Cyclic Redundancy Check

Figure 3.2: The AAL 3/4 PDU

Since the user payload is often larger than a single cell, the 3/4 type cell marks each segment of the message with one of the following notations: BOM = Beginning of Message, COM = Continuation of Message, EOM = End of Message, SSM = Single Segment Message (for when the payload does fit into one cell).
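The field layout of Figure 3.2 can be illustrated by packing and unpacking a 48 octet PDU. The sketch below rests on several assumptions that the text does not state: the numeric segment type codes (taken from our recollection of ITU-T I.363), big-endian bit ordering, zero padding of short payloads, and a CRC that is supplied by the caller rather than computed. All function names are hypothetical.

```python
import struct
from typing import List, Tuple

# Assumed segment type codes; the text names the types but not their values.
BOM, COM, EOM, SSM = 0b10, 0b00, 0b01, 0b11
PAYLOAD_LEN = 44  # octets of user data per AAL 3/4 PDU


def pack_sar_pdu(st: int, sn: int, mid: int, payload: bytes, crc: int = 0) -> bytes:
    """Build a 48 octet PDU: 2b ST, 4b SN, 10b MID, 44B payload, 6b LI, 10b CRC."""
    if len(payload) > PAYLOAD_LEN:
        raise ValueError("payload exceeds 44 octets")
    header = ((st & 0x3) << 14) | ((sn & 0xF) << 10) | (mid & 0x3FF)
    trailer = (len(payload) << 10) | (crc & 0x3FF)  # LI, then CRC (not computed here)
    body = payload.ljust(PAYLOAD_LEN, b"\x00")
    return struct.pack(">H", header) + body + struct.pack(">H", trailer)


def unpack_sar_pdu(pdu: bytes) -> Tuple[int, int, int, bytes, int]:
    """Return (ST, SN, MID, payload without padding, CRC)."""
    (header,) = struct.unpack(">H", pdu[:2])
    (trailer,) = struct.unpack(">H", pdu[46:48])
    li = trailer >> 10
    return header >> 14, (header >> 10) & 0xF, header & 0x3FF, pdu[2:2 + li], trailer & 0x3FF


def segment_message(message: bytes, mid: int) -> List[bytes]:
    """Split a message into PDUs marked BOM/COM/EOM, or SSM if it fits in one."""
    chunks = [message[i:i + PAYLOAD_LEN] for i in range(0, len(message), PAYLOAD_LEN)] or [b""]
    if len(chunks) == 1:
        return [pack_sar_pdu(SSM, 0, mid, chunks[0])]
    pdus = []
    for n, chunk in enumerate(chunks):
        st = BOM if n == 0 else EOM if n == len(chunks) - 1 else COM
        pdus.append(pack_sar_pdu(st, n % 16, mid, chunk))
    return pdus
```

A 100 octet message, for example, yields three PDUs marked BOM, COM, and EOM, with the last one carrying a Length Indicator of 12.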
GFC
The Generic Flow Control (GFC) field is used at each local site to assess flow control. The value is not carried end to end and may change at each switch point.
VPI/VCI
The Virtual Path Identifier/Virtual Channel Identifier (VPI/VCI) is used to identify the Virtual Connection (VC) that a cell belongs to.
PT
Payload Type (PT) is used to designate whether the cell contains user or control information. It can also signal whether congestion has been experienced.
CLP
Cell Loss Priority (CLP) is a boolean flag indicating the level of priority of a cell.
3.3.3 HEC
Header Error Control (HEC) is used by the physical layer to detect errors in the header.
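To make the header fields concrete, the sketch below packs a 5 byte header and computes its HEC. The field widths (GFC 4 bits, VPI 8, VCI 16, PT 3, CLP 1) are the standard UNI layout and are an assumption here, since the text names the fields but not their sizes. The HEC is computed as a CRC-8 with generator x^8 + x^2 + x + 1 over the first four bytes, XORed with 0x55, following ITU-T I.432. The function names are hypothetical.

```python
def hec(first4: bytes) -> int:
    """CRC-8 (x^8 + x^2 + x + 1) over four header bytes, XORed with 0x55."""
    crc = 0
    for byte in first4:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ 0x07) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc ^ 0x55


def pack_header(gfc: int, vpi: int, vci: int, pt: int, clp: int) -> bytes:
    """Pack the assumed UNI fields into four bytes and append the HEC byte."""
    word = ((gfc & 0xF) << 28) | ((vpi & 0xFF) << 20) | ((vci & 0xFFFF) << 4) \
        | ((pt & 0x7) << 1) | (clp & 0x1)
    first4 = word.to_bytes(4, "big")
    return first4 + bytes([hec(first4)])


def header_ok(header: bytes) -> bool:
    """Receiver-side check: recompute the HEC and compare with byte 5."""
    return len(header) == 5 and hec(header[:4]) == header[4]


if __name__ == "__main__":
    h = pack_header(gfc=0, vpi=1, vci=32, pt=0, clp=0)
    print(h.hex(), header_ok(h))
```

Because the check covers only the header, the physical layer can validate a cell without examining its 48 byte payload.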
RESTART
RESTART ACKNOWLEDGE
ADD PARTY
ADD PARTY ACKNOWLEDGE
ADD PARTY REJECT
DROP PARTY
DROP PARTY ACKNOWLEDGE

Table 3.3: Additional ATM Connection Control Messages

Each of these header fields plays a primary role in implementing the various planes existing in the B-ISDN model. Table 3.4 shows the roles of the header fields in various functions. The CPLANE functions are implemented through the Q.2931 protocol, which is an extension of the Q.931 protocol (see Table 3.3). Many find the adaptation of the Q.931 protocol a disappointment since it is rather outdated. It is believed that the full potential of ATM cannot be utilized with a mere addition of a few command sets to the existing ISDN standard. ATM is a new protocol, with new features and new options, and therefore requires a new approach.
UPLANE Functions                                          UPLANE Parameters
Multiplexing among different ATM connections              VPI/VCI
Cell rate decoupling (unassigned)                         Pre assigned header values
Cell discrimination based on pre defined header values    Pre assigned header values
Payload type discrimination                               PT field
Loss priority indication and selective cell discarding    CLP field, network congestion state
Traffic Shaping                                           Traffic descriptor

MPLANE Functions                                          MPLANE Parameters
Alarm Surveillance (VP)                                   OAM Cells
Connectivity Verification                                 OAM Cells
Invalid VPI/VCI detection                                 VPI/VCI

Table 3.4: Functions supported by the UNI
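[Figure 3.3: ATM 5 byte Header. Surviving labels: GFC, VPI]

[Figure 3.4: Segmentation of a 65535 user payload into 53 byte cells. Labels: AAL Layer, 48 Byte PDUs, 53 Byte cells, Physical Layer, Physical Medium]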
architecture that can be implemented in today's ATM technology, but can adapt to tomorrow's with little effort or change in protocols. The mainstream developments with ATM support 155Mb/s and 622Mb/s. The question is what cryptographic element is needed to support security services at these speeds. Unfortunately, software is too slow. A software implementation of a common private-key block cipher (a relatively fast class of encryptors) allows throughputs of 1-10Mb/s [26]. Hardware versions typically run 3-4 orders of magnitude faster.
Definition - The ability to send information in a manner so that only the intended recipients have the ability to "see" the data.
Solution - Use an encryption algorithm. Issues of key management will need to be resolved.
2. Integrity
Definition - The process of verifying that a payload was not tampered with in transit.
Solution - Include a cryptographic checksum or Message Authentication Code (MAC) with the payload or use digital signatures.
3. Authentication
Definition - Authentication is the process by which one node determines the true identity of a remote node and verifies that the payload has integrity.
Solution - Use digital signatures and/or MACs.
4. Access Control
Definition - The ability to control access to objects and resources based upon the identity or the current level of access granted to an entity.
Solution - For simpler discretionary access control, a password accessed system can be used. For mandatory access control, higher level labeling and compartmenting should be used. Access Control Lists (ACL) are usually kept to govern the access privileges of entities and objects.
5. Replay Prevention
Definition - Preventing an opponent from resending a once valid payload at a later time.
Solution - Include a (secured) timestamp with the payload. If the packet arrives after a delay greater than the local security policy allows, discard the cell and/or update the audit trail.
6. Non-repudiation
Definition - The ability to prove the absolute identity of the sender of a payload.
Solution - Use public-key signatures. Private-key signatures require that the two (or more) parties involved share a secret. If party A sends a private-key signed packet to B, A can later cheat and claim that B sent the packet. However, public-key signatures would allow B to prove that only A knows how to generate a given signature.
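The integrity and replay prevention solutions listed above can be combined in a short sketch: a MAC computed over a timestamp and the payload. This is an illustration under stated assumptions, not the mechanism proposed in this thesis. HMAC-SHA-256 from the Python standard library stands in for whichever MAC a real design would use, and the function names, the 8 byte timestamp encoding, and the `max_age` policy value are all hypothetical.

```python
import hashlib
import hmac
import struct
from typing import Optional

TAG_LEN = 32  # length of an HMAC-SHA-256 tag in bytes


def protect(key: bytes, payload: bytes, timestamp: int) -> bytes:
    """Prefix the payload with a timestamp and append a MAC over both."""
    stamped = struct.pack(">Q", timestamp) + payload
    tag = hmac.new(key, stamped, hashlib.sha256).digest()
    return stamped + tag


def verify(key: bytes, message: bytes, now: int, max_age: int = 5) -> Optional[bytes]:
    """Return the payload if the MAC is valid and the timestamp is fresh."""
    if len(message) < 8 + TAG_LEN:
        return None
    stamped, tag = message[:-TAG_LEN], message[-TAG_LEN:]
    expected = hmac.new(key, stamped, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None  # integrity or authentication failure
    (timestamp,) = struct.unpack(">Q", stamped[:8])
    if abs(now - timestamp) > max_age:
        return None  # too old (or too far in the future): possible replay
    return stamped[8:]
```

Because the timestamp is covered by the MAC, an opponent cannot refresh it on a captured payload without knowing the key. As the non-repudiation item notes, a shared-key MAC of this kind cannot settle which of the two key holders produced a message.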
5.1.1 ASICs
ASICs have some advantages over Reconfigurable (RC) hardware. First, they are usually faster, since they were designed specifically for the problem, whereas RCs are generic logic devices that are programmed using variable switching matrices and configurable logic elements. These switching matrices add variable delay due to increased parasitic capacitance and resistance of each switch, and the logic elements may sometimes exhibit poorer performance when compared to the gate implemented in custom silicon. Second, ASICs are usually smaller and consume less power because RCs have overhead logic for maintaining the reprogrammable circuitry. However, ASICs cannot offer the flexibility of a reconfigurable device.
above regarding the inherent delay/power-consumption/size problems, it is unclear whether RC hardware can accommodate cryptographic applications. Until recently, these devices have been extremely small and could only replace a few thousand equivalent gates. A modern link encryptor may consume tens (or hundreds) of thousands of gates. In addition, the I/O resources required to support ATM cells and encryption are fairly significant and may be difficult to map to a device. Speed could become an issue as the link rate of the ATM network increases as well. How many RC devices are needed? An array of RCs may exhibit enough throughput to sustain an OC-12 (approx 622Mb/s) rate, but how many chips must run in parallel to achieve this? The most likely location for encryption will be in the ATM layer itself, before the data leaves a node. This means that all the encryption hardware must be on the local Network Interface Card (NIC). The card itself has size constraints and power consumption/cooling requirements. Will laptops using PCMCIA cards ever be able to enjoy secure ATM? If they do, it will most likely be under ASIC control or through an external device. These topics need further evaluation. For more information, see Part III.
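A first-order answer to the question of how many devices must run in parallel is a simple ceiling division. The sketch below assumes ideal scaling, ignoring cell header overhead, scheduling, and I/O limits, so it gives a lower bound only. The 60Mb/s per-device figure is the DES rate quoted in the abstract; the function name is hypothetical.

```python
import math


def devices_needed(link_mbps: float, device_mbps: float) -> int:
    """Lower bound on parallel devices, assuming throughput adds up ideally."""
    if device_mbps <= 0:
        raise ValueError("device throughput must be positive")
    return math.ceil(link_mbps / device_mbps)


if __name__ == "__main__":
    for link in (155, 622):
        print(link, devices_needed(link, 60))
```

At 60Mb/s per device this gives 3 devices for a 155Mb/s link and 11 for a 622Mb/s link, which illustrates why board area and power on a NIC become a concern.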
3. The combination of (1) and (2) limits the block sizes to 64, 96, 128, 192, or 384 bits.
4. The algorithm should have at least the strength of DES [31], but preferably higher in order to provide long term security. This includes both the key length and the level of immunity to linear (see [19], [20], [21]) and differential (see [4], [5]) cryptanalysis.
5. The algorithm should be easily implemented in hardware with either direct support of ATM speeds (45-622Mb/s) or provisions for parallel execution to sustain these rates.
6. It should include provisions for key agility on a per-cell basis.
7. Details of the algorithm should be publicly available.

Table 5.1 is a list of algorithms which meet the criteria. All items have been derived from [17], [26], and [29].
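The block sizes in item 3 can be reproduced with a short enumeration. Criteria (1) and (2) are not restated here, so the sketch assumes they amount to a minimum block size of 64 bits and a block size that evenly divides the 384 bit (48 byte) cell payload. Both constraints and the function name are assumptions for illustration.

```python
from typing import List

PAYLOAD_BITS = 48 * 8  # one ATM cell payload: 384 bits


def candidate_block_sizes(payload_bits: int = PAYLOAD_BITS, minimum: int = 64) -> List[int]:
    """Block sizes of at least `minimum` bits that divide the payload evenly."""
    return [bits for bits in range(minimum, payload_bits + 1)
            if payload_bits % bits == 0]


if __name__ == "__main__":
    print(candidate_block_sizes())  # [64, 96, 128, 192, 384]
```

Under these assumptions a cell payload always holds a whole number of cipher blocks, so no block ever straddles two cells.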
Algorithm         Block Size   Key Length   Security   Speed
DES               64           56           baseline   baseline
Triple DES [22]   64           112          >>DES      1/3 DES
DESX [25]         64           56 + 64      >DES       =DES
RC2 [24]          64           variable     variable   >DES
RC5 [23]          variable     variable     variable   variable
IDEA [15]         64           128          >>DES      >DES
CA-1.1 [13]       384          64 + 1024    unknown
CAST [1]          64           64           >=DES      unknown
SAFER [18]        64           64           unknown    >DES
LOKI [7]          64           64           >=DES      unknown
3-Way [10]        96           96           unknown    >DES

Table 5.1: Crypto algorithms suitable for ATM

Currently the ATM Forum is discussing which algorithm will become the standard. One problem holding the decision back is the US Government's restriction on exporting cryptography. The current law considers cryptography a munition and limits exportable encryption devices to 40 bit keys. This restriction is a matter of intense controversy. There are several policy proposals pending which would either increase the allowed bit length, drop the current restriction altogether, or call for key escrow/recovery mechanisms.
26
actually change the next output. These are referred to as feedback modes. Feedback modes serve to randomize the output (thus producing a more non-deterministic output), and to make it harder to modify any given block of cipher text. The following is an explanation of some commonly used modes of operation [26]:

ECB: The Electronic Code Book mode of operation is the simplest to implement. There is no Initialization Vector (IV) or feedback. The input is simply processed with the current key and the output forms the next block in the cipher text. There is no correlation between previous output or input bits in the output.

CBC: The Cipher Block Chaining mode simply feeds the previous operation's ciphertext into an XOR with the next plaintext before it produces the next cipher block in the sequence.

CFB: The Cipher Feedback mode XORs the current plaintext block with an encrypted version of the previous ciphertext block.

OFB: The Output Feedback mode XORs the current plaintext block with an encrypted version of the previous IV.

Counter: The Counter mode XORs the current plaintext block with an encrypted version of the previous counter [16].
Bypass: To allow maximum interoperability with the rest of the world, a special mode called "Controlled Bypass" should be allowed. This mode, as the name implies, allows connections to be formed insecurely by bypassing the encryption steps. This design would allow users to select whether security is really necessary for their connection, as well as allowing secure devices to communicate with insecure devices. Care must be taken when designing such a mode so that the security devices cannot be disabled by an error or by a malicious user.
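The modes described above can be illustrated with a short sketch. It is not an implementation from this work: the block cipher is a toy four-round Feistel network built on SHA-256, standing in for DES so that the example needs only the Python standard library. All function names are hypothetical, and the input length is assumed to be a multiple of the 64-bit block size (as a 48-byte ATM payload is).

```python
import hashlib

BLOCK = 8  # bytes; a 64-bit block, the same size DES uses


def xor(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def _f(half, key, rnd):
    # Toy round function (an assumption, not DES): hash of key, round, half-block.
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:BLOCK // 2]


def encrypt_block(block, key, rounds=4):
    left, right = block[:4], block[4:]
    for rnd in range(rounds):
        left, right = right, xor(left, _f(right, key, rnd))
    return left + right


def decrypt_block(block, key, rounds=4):
    left, right = block[:4], block[4:]
    for rnd in reversed(range(rounds)):
        left, right = xor(right, _f(left, key, rnd)), left
    return left + right


def split(data):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def ecb_encrypt(pt, key):
    # No IV and no feedback: each block depends only on the key.
    return b"".join(encrypt_block(b, key) for b in split(pt))


def cbc_encrypt(pt, key, iv):
    out, prev = [], iv
    for b in split(pt):
        prev = encrypt_block(xor(b, prev), key)  # previous ciphertext fed forward
        out.append(prev)
    return b"".join(out)


def cbc_decrypt(ct, key, iv):
    out, prev = [], iv
    for c in split(ct):
        out.append(xor(decrypt_block(c, key), prev))
        prev = c
    return b"".join(out)


def cfb_encrypt(pt, key, iv):
    out, prev = [], iv
    for b in split(pt):
        prev = xor(b, encrypt_block(prev, key))  # encrypt previous ciphertext
        out.append(prev)
    return b"".join(out)


def ofb_crypt(data, key, iv):
    # Encryption and decryption are the same operation in OFB.
    out, state = [], iv
    for b in split(data):
        state = encrypt_block(state, key)  # encrypt previous state, starting at IV
        out.append(xor(b, state))
    return b"".join(out)


def ctr_crypt(data, key, counter=0):
    # Encryption and decryption are the same operation in counter mode.
    out = []
    for i, b in enumerate(split(data)):
        ks = encrypt_block(((counter + i) % 2**64).to_bytes(BLOCK, "big"), key)
        out.append(xor(b, ks))
    return b"".join(out)
```

With a payload made of repeated identical blocks, ECB produces repeated ciphertext blocks while the feedback modes do not, which is the randomizing effect described above.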
5.4 Synchronization
Typically, security devices need to stay in synchronization with one another due to the mathematical dependencies of crypto algorithms. Some modes of operation (see Section 5.3) have self-synchronizing properties, while others require some form of communication or mutual agreement between the devices to maintain lock-step. Modes of operation that use feedback usually require "manual" synchronization. ATM presents a unique problem to cryptographic algorithms because it allows cells to be discarded in a stream without any notification to either side. It is the task of the AAL layer, or higher layers, to determine when a cell or cells have been lost. Modes of operation which use feedback will result in a partially or completely corrupt stream if as little as one cell is discarded. Therefore, mechanisms must be in place to ensure that resynchronization can occur. This is typically accomplished through the combined use of AAL PDU markers and OAM cells. The AAL layer usually adds markers for "Beginning of Message" (BOM) and "End of Message" (EOM). The designers of a system may opt to allow resynchronization upon the receipt of an EOM cell. In some instances, an Operation, Administration, and Maintenance (OAM) cell may be used when the EOM cell itself has been discarded by the network.
5.5 Interleaving
When the available hardware crypto chips are not fast enough to sustain the desired rate (say 622Mb/s) they must be interleaved (or run in parallel). For instance, if a given chip runs with 64 bit blocks at 100Mb/s, and we want to encrypt at the ATM layer, we need

$$\left\lceil \frac{(48/53) \cdot 622\ \mathrm{Mb/s}}{100\ \mathrm{Mb/s}} \right\rceil = \lceil 5.63 \rceil = 6$$

chips to sustain an OC-12 line. Note the ratio 48/53 in the formula compensates for payload/cell-size differences. There are implications involved when using chips in parallel because it is necessary to generate new IVs for each chip in some modes of operation.
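The chip count above is a ceiling of the payload rate divided by the per-chip rate. A minimal sketch of that arithmetic follows; the function and parameter names are hypothetical.

```python
import math


def chips_needed(line_rate_mbps, chip_rate_mbps, payload_bytes=48, cell_bytes=53):
    """Number of parallel crypto chips needed to encrypt only the cell payloads."""
    # Only 48 of every 53 bytes on the line are payload that must be encrypted.
    payload_rate = line_rate_mbps * payload_bytes / cell_bytes
    return math.ceil(payload_rate / chip_rate_mbps)


print(chips_needed(622, 100))  # OC-12 line with 100 Mb/s chips
```

The same function can be used to compare other line rates or chip speeds, for example `chips_needed(155, 100)` for an OC-3 line.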
algorithms. Also, security services such as data integrity and sender authentication are often achieved through public-key digital signatures. This is what is known as a hybrid scheme, because it uses the advantages of multiple types of algorithms to produce a robust security system with good performance.
used. The memory requirements for a typical RSA type implementation would be in the order of 4kbits + 4kbits for both the encryption and the signatures.
5.7 Authentication
Authentication and integrity services may demand additional storage in hardware for the generation of Message Authentication Codes (MAC). For a more detailed description of the services, see Sections 6.1.2 and 6.1.3. Generating a MAC usually requires the calculation of a cryptographically secure checksum over the entire payload. These checksums must be stored temporarily until the entire payload has been received. Therefore, depending on which layer the authentication/integrity services are implemented in, there could be a need for a memory buffer equal to the size of one checksum multiplied by the number of secure virtual connections supported by the system. For instance, if a hash such as MD5 were used to generate the MAC, and 1024 connections are supported, then about 16KB would be needed to support the service.
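The 16KB figure follows from one 128-bit (16-byte) MD5 digest per connection. A small sketch of the calculation, with a hypothetical function name:

```python
import hashlib


def mac_buffer_bytes(num_connections, digest_bytes=hashlib.md5().digest_size):
    """Memory needed to hold one running checksum per secure virtual connection."""
    return num_connections * digest_bytes


total = mac_buffer_bytes(1024)
print(total, "bytes =", total // 1024, "KB")
```

A longer digest scales the requirement linearly; a 160-bit hash would need 20 bytes per connection.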
One must take into consideration that an ALU providing arithmetic for very long operands can consume major silicon real estate in a typical ASIC controlling the ATM layer portion of the stack (in other words, between the SAR and PHY interfaces).
$$t_{cell} = \frac{53 \cdot 8\ \mathrm{bits}}{622\ \mathrm{Mb/s}} \approx 681\ \mathrm{ns}$$
which means that a new key (and initialization vector, if required) must be referenced and loaded in less than 681ns (minus the decryption time). This can have a significant impact on the overall design of the crypto unit. Agility can be realized by limiting the number of secure channels (thereby reducing the memory requirements) and by pipelining the crypto unit so that keys may be loaded before they are scheduled for decryption/encryption. For instance, GTE limits the secure connections on their GTE FastLane ATM encryptor to 4096. [27] uses an address hash table and limits the secure connections to 216.
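The time budget and the channel limit can be sketched as follows. The 622 Mb/s rate and the 4096-channel limit come from the text; the `KeyTable` class and its layout are hypothetical and only illustrate a bounded per-connection key store.

```python
CELL_BITS = 53 * 8        # 424 bits per ATM cell
LINE_RATE_BPS = 622e6     # OC-12 rate used in the text


def cell_time_ns(line_rate_bps=LINE_RATE_BPS):
    """Time between cell arrivals, i.e. the budget for loading a new key."""
    return CELL_BITS / line_rate_bps * 1e9


class KeyTable:
    """Bounded key store indexed by (VPI, VCI); a hypothetical layout."""

    def __init__(self, max_channels=4096):
        self.max_channels = max_channels
        self._keys = {}

    def install(self, vpi, vci, key, iv=None):
        is_new = (vpi, vci) not in self._keys
        if is_new and len(self._keys) >= self.max_channels:
            raise RuntimeError("secure channel limit reached")
        self._keys[(vpi, vci)] = (key, iv)

    def lookup(self, vpi, vci):
        # Returns (key, iv), or None when the channel has no key installed.
        return self._keys.get((vpi, vci))


print(int(cell_time_ns()), "ns per cell")
```

Limiting the table size keeps the memory small enough for a lookup to complete within one cell time, which is the design trade-off described above.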
Our architecture takes into consideration all of the issues presented above. The general layout of the components is given in Figure 5.1. The idea is to use enough encryption hardware in parallel to sustain the link speed.
[Figure 5.1: block diagram. Labels: Input, Scheduler, Key Buffer (key + IV), four Cell Encryptors, Controlled Bypass, Cell Register, Reassembly, Output; data paths marked 424 and 24.]

[Figure: KeyAgile Architecture. Labels: two Utopia Interfaces, TO/FROM PHY; data paths marked 424.]

Figure 5.3: Module Layout [Labels: SAR, Utopia, ASIC, FPGA, PHY, Encryption Hardware.]

case of the encryption block), or simply passing the data through after the fixed amount of time has expired (in the case of the controlled bypass unit).
The Scheduler
The Scheduler's main task is to organize the traffic and route it through the proper block in the cipher array. Certain channels, such as those designated as control or management plane, always pass in the clear and therefore are routed into the controlled bypass unit. Other channels may be designated as clear channels dynamically (at call setup time); therefore the scheduler has provisions for storing VPI/VCI pairs. Whenever a cell arrives at the unit, it is checked against both the static and dynamic tables for a match. If a match is found, the cell is routed to the controlled bypass unit. Otherwise, the traffic is routed through an available cipher unit for processing.
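The routing decision described above can be sketched in a few lines. This is an illustration under stated assumptions, not the hardware design: the static table holds two example clear channels (VPI/VCI 0/5 for signaling and 0/16 for ILMI), "an available cipher unit" is approximated by round-robin selection, and all names are hypothetical.

```python
from itertools import cycle

# Assumed examples of channels that always pass in the clear.
STATIC_CLEAR = {(0, 5), (0, 16)}


class Scheduler:
    def __init__(self, num_cipher_units=4):
        self.dynamic_clear = set()                      # filled at call setup time
        self._next_unit = cycle(range(num_cipher_units))

    def add_clear_channel(self, vpi, vci):
        """Designate a channel as clear dynamically."""
        self.dynamic_clear.add((vpi, vci))

    def route(self, vpi, vci):
        """Return the block a cell on this channel is routed through."""
        channel = (vpi, vci)
        if channel in STATIC_CLEAR or channel in self.dynamic_clear:
            return "bypass"
        # Round-robin stands in for picking whichever cipher unit is free.
        return f"cipher-{next(self._next_unit)}"


sched = Scheduler()
print(sched.route(0, 5), sched.route(1, 100))
```

In hardware the two table lookups would run in parallel within a single cell time; the sets here only model the match logic.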
is that achievable data rates might be too slow for ATM. With today's technology, it is not possible to reprogram the chip on a per cell basis. In fact, it takes orders of magnitude more time to program the chip than the cell arrival time allows. The FPGA method is good for allowing flexibility in the protocol design, without allowing algorithm agility per cell. If the algorithms fit onto a feasible number of ASIC chips, the ASIC approach offers a high throughput to size ratio and may offer better overall performance in the ATM network. The ATM Forum is designing the system to allow algorithm type/version information to be exchanged at secure call setup time. This allows the two remote hosts to guarantee that both parties are using the same security devices. Another form of algorithm agility is the ability to change modes of operation. This is easier to accomplish because it involves the use of a single hardware algorithm, only requiring the usage of the algorithm to be changed. For a more detailed discussion on Modes of Operation, see Section 5.3. Currently, the ATM Forum is considering using DES ECB mode by default, with CBC and possibly counter mode as alternatives negotiated at call setup time [16]. Section III covers algorithm agility issues at a much deeper level.
There are various security services (see Section 4.2) available to protect ATM traffic. One major issue to solve is how to protect different types of traffic, and where to place these services in the ATM model. The types of traffic can be categorized into four major groups:

User
Control
Management
OAM

User cells come from the user plane. OAM cells come from both the control and user planes. The control and management cells come from the control and management planes, respectively. Each type of traffic uses a different pre-defined channel.

VPI   VCI   PT    PL
0     0     ns    ns
0     1     0a0   C
nz    1     0a0   C
0     2     0aa   C
nz    2     0aa   C
ns    3     0a0   CU
ns    4     0a0   CU
0     5     0aa   C
nz    5     0aa   C
0     16    aaa   M
ns    n     0aa   U
ns    n     100   CU
ns    n     101   CU
ns    n     11a   ns

"a" = bit is available for use by other layers
"n" = None of the above
"ns" = Not Specified; depends on the current connection's assignment
"nz" = Non Zero
"U" = User plane
"C" = Control plane
"M" = Management plane
"PT" = Payload Type
"PL" = Plane Origination

Table 6.1: ATM Channel Definition

In general, it is recommended to provide the services according to Table 6.2 to protect each plane from attacks. As the table notes, some features are desirable but are better left for higher layers to handle due to their nature. For instance, some applications, like a secure telnet program, may not care about non-repudiation, while others, such as electronic commerce, may wish to prove that a customer ordered a product when the customer claims not to have done so.
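The channel assignments of Table 6.1 can be expressed as a small classifier. The sketch below assumes the row-by-row reading of the table in which VCIs 1, 2 and 5 belong to the control plane, VCIs 3 and 4 carry OAM traffic, VPI/VCI 0/16 is the management channel, and all other VCIs are classified by the 3-bit payload type. The function name and return values are hypothetical.

```python
def classify_cell(vpi, vci, pt):
    """Return the plane origination for a cell: 'U', 'C', 'M', 'CU' or None.

    pt is the 3-bit Payload Type field as an integer (0..7).
    None corresponds to the 'ns' (not specified) entries of the table.
    """
    if vci == 0:
        return None                 # plane not specified
    if vci in (1, 2, 5):
        return "C"                  # control plane channels
    if vci in (3, 4):
        return "CU"                 # OAM, originating in control or user plane
    if vpi == 0 and vci == 16:
        return "M"                  # management channel
    # Any other VCI ("n" in the table): decide from the payload type.
    if pt & 0b100 == 0:
        return "U"                  # PT = 0aa, user data
    if pt in (0b100, 0b101):
        return "CU"                 # OAM cells inside a user connection
    return None                     # PT = 11a, not specified


print(classify_cell(0, 5, 0b000), classify_cell(1, 100, 0b000))
```

A security device could use such a classification to decide which services from Table 6.2 apply to an arriving cell.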
[Table 6.2: security services per plane. Services listed: Privacy, Authentication, Integrity, Access Control, Replay Prevention, Non-Repudiation.]
6.1.1 Privacy
Privacy, the most widely understood security service, is the protection of data from unintentional disclosure. It is the cycle of cryptographic encryption/decryption that makes this service possible. Table 6.2 lists privacy as being mandatory in the user plane, and non-existent in the others. The reason behind this design choice is simple. The user plane is what carries any data of interest to a user application. The whole system architecture's main goal regarding ATM security is to provide services to the user plane. The control and management planes, while playing a significant role in the system operation, only stand to serve the user plane itself. This is not to imply that the control and management planes do not need security, only that they do not need privacy. In fact, it is important to make sure that data from the control and management planes stay in cleartext, because they are often interpreted by the network while in transit.

It is important to note some of the implications of this design choice. The most relevant side effect is that traffic analysis is still possible because the control and management planes play a central role in traffic management. This may not be a major concern in commercial applications, but can be an issue on the military side.

While it has been pointed out that the privacy service should only be implemented in the user plane, it has not been described where in the ATM model the encryption/decryption cycle should occur. Essentially, there are three choices, and all have their pros and cons.
At or Above the AAL Layer: Privacy services implementation in the AAL layer would depend on the AAL type because the size of the user Protocol Data Unit (PDU) is entirely up to the designers of the AAL classes. A typical PDU may be 65535 bytes. The idea behind this approach is very simple. The PDU is block encrypted (or decrypted, depending on the direction of travel) before any of the Convergence Sublayer (CS) or Segmentation And Reassembly (SAR) operations are performed. As the now encrypted payload is broken down into ATM payloads, additional overhead such as checksums and sequence numbers is added to the 48 byte cell. For instance, consider Figure 3.2, which specifies an AAL3/4 class cell. The 44 octets of user payload would be encrypted, but the other fields would not. Figure 3.4 shows the convergence of a 65535 byte user payload into cells as it passes through the layers. If all 65535 bytes were encrypted as a block then the 44 bytes in an AAL3/4 cell would be the only encrypted portion.

There are several advantages to this approach. First, this represents the minimal amount of data to encrypt (as opposed to including the header and AAL fields). Also, resynchronizing the crypto algorithm is easy because there are well defined data boundaries (beginning and ending of the user PDU). The completed PDU, after being processed by the CS and SAR layers, would be ready for decryption. If an error occurred in the bit stream, the PDU would be discarded before any decryption cycle was started, thus moving the resynchronization problem from the crypto system's point of view to the already existing AAL infrastructure. (See the AAL section of Figure 6.1 for a diagram of an ATM cell encrypted in this fashion.)
At or Below the Physical Layer: Adding privacy services in the physical layer has several implications. First, all 53 bytes of the cell are encrypted, thus providing the highest amount of protection against eavesdropping or traffic analysis. However, encrypting all 53 bytes means that the header must be decrypted at each and every switch that the cell passes through. This can create an extreme security breach, since the switches may be in an uncontrolled environment such as a public network. The ability to thwart traffic analysis can be a major advantage. This is the only method of completely eliminating the ability to monitor the traffic from any point in the connection, but the security problem in the switches makes the physical layer a poor choice. An alternate solution may be to encrypt the header and payload separately, but this increases the complexity significantly. If, in this case, every switch in the virtual channel must negotiate keys with one another for decrypting the headers, complete chaos could result. Setup latency could increase beyond the limits of usability. Multiple security negotiations could weaken the overall security of the system, leaving multiple points for attack. Last but not least, simple design flaws in the crypto protocols could result in multiple switches suddenly becoming obsolete.
At or Above the ATM Layer: Adding the privacy services to the ATM layer offers many advantages. It allows the maximum amount of privacy and protection against traffic analysis without requiring switches to decrypt the 5 byte header. This increases the overall security and speed of the system. Figure 6.1 shows the level of confidentiality achieved when the security is implemented in the various layers. Encryption in the ATM layer prevents an eavesdropper from obtaining information about the operation of the AAL layer. For instance, if an opponent were able to distinguish a Segment Type of type BOM and the crypto algorithm was deterministic, a known plaintext attack would become very easy to execute because the opponent now knows the synchronization pattern of the messages. In general, it makes sense to have the highest amount of privacy possible without affecting the operation of the system.

With both the Physical and ATM Layer choices, cryptographic resynchronization becomes more of an issue (when compared to the AAL solution). Since the AAL layer has direct control over the PDU framing, resynchronization boundaries become easy to pin-point. In the lower layers, more work must be done to acquire synchronization boundaries. Fortunately, the AAL information is easy to interpret, and its format is standardized. For AAL types 2 and 3/4, the PDU boundaries are clearly marked by BOM and EOM labels in the cell fields. For types 1 and 5, an alternate approach must be used. A given interval of bytes (or cells) should be agreed upon which will signal when resynchronization is needed.

Figure 6.1: Comparison of encryption in various levels of the ATM stack

These same issues will surface again when authentication/integrity is discussed later in the text.
6.1.2 Authentication
Authentication services were not widespread until the advent of public-key cryptography. Before then, authentication was provided as a beneficial side effect of symmetric cryptography, as it had been since the earliest days of cryptography. All communications were performed using secret keys, thus all communications were assumed "authentic" if both parties knew the key. This is the nature of private-key cryptography. Public-key algorithms changed everything. They created ways to encrypt data without sharing the secret. In fact, as the name suggests, they allow encrypting data using "public" information. That means anyone can transmit data over a network to a remote host with a certain level of assurance that only the intended recipient, or holder of the private/public-key pair, can decode the message. This solves many problems, but it creates a major one. It can no longer be assumed that correctly encoded data signifies an authentic host. Another mechanism must be installed to allow for this secondary check. These mechanisms provide the service of "Authentication" and, as described below, have implications when included in the secure ATM model.
Once again, refer to the authentication portion of Table 6.2. It lists authentication as being mandatory in all planes. While this is true, the protocols and algorithms vary greatly from plane to plane, as described below.
Control Plane
Control plane authentication is an important issue to discuss because it involves the operation of the whole ATM system. Control messages are sent by all types of devices on both sides of the ATM UNI to control tasks such as Call Setup, Call Disconnect, Parameter Setting, etc. A malicious user could insert an invalid Call Disconnect message into an existing stream and cause the ATM entity to disassociate from the call unintentionally. However, authentication services at this level would ensure that valid control messages originate only from parties involved in a connection (meaning the remote nodes and switching parties along the way). There are many different methods of implementing this. Although nothing has been agreed upon by the ATM Forum, the most obvious method of adding authentication to the control plane is by adding one or more Information Elements to the messages. At first glance, it may not be obvious what the requirements of the system are, but further research would reveal two classes of control messages: those which could cause damage to a network, and those which are more passive in nature. Recall Tables 3.1 and 3.3. They briefly describe the possible messages from an ATM entity.
Management Plane
The Management plane, currently controlled by the ILMI specification, uses normal AAL class traffic over a predefined channel. The implementation of authentication services will either need to be taken care of in the management plane, or by a protocol which detects the ILMI protocol in the ATM layer (VPI/VCI=0/16) and performs a lower layer authentication protocol with the use of OAM or other out-of-band sources.
User Plane
The User plane does not need direct support for authentication if the control plane performs its job correctly. If the User plane can rely on the control plane to authenticate all call setup procedures, then the resulting session can be assumed to be authentic, and integrity services can be used to authenticate all traffic because of the symmetric algorithms used for normal data flow (see above description).
6.1.3 Integrity
Integrity services are listed in Table 6.2 as being optional in the user plane and mandatory in the control and management planes. Integrity services should be implemented with Message Authentication Codes (MAC), but there are two different requirements based on the types of traffic to handle. Control and Management traffic will generally have integrity services provided as a direct result of the authentication services, which typically provide integrity as a beneficial side effect. If the signature system which protects the C and M planes does not provide integrity, then a form of keyed one-way hash will work to provide tamper proof MACs. However, the keyed variants of the hashes require a key, which can cause extra setup overhead. Integrity in the user plane, if provided, can use a simple algorithm like SHA, which produces a 160 bit output. By appending the hash output to the plaintext, and then encrypting the whole packet, integrity can be checked at the remote end by reproducing the hash on the data, and then comparing the received hash value with the computed hash value. This is very similar to the method used with checksums. In a previous discussion it was mentioned that a memory buffer may be needed for each VC supported in the system. The use of that buffer will be described here. There are essentially two choices for the location of the MAC generator: the AAL
AAL Implementation
Adding integrity to the user plane can be taken care of in the AAL layer. The PDU, while still contained as a single unit (say 65535 octets), can have the 128 bit MAC generated across the whole buffer and appended to the end, much like a standard checksum calculation. In this manner, the PDU+MAC are treated as a new single unit PDU and passed through the segmentation and convergence sublayers and on to the ATM layer. Conversely, once the complete PDU+MAC has been received, the checksum can be calculated and compared.
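The append-and-compare procedure can be sketched as follows. MD5 is used only because it produces a 128 bit output matching the MAC size above; that choice, the function names, and the omission of the encryption step that would wrap the PDU+MAC are assumptions of this sketch.

```python
import hashlib

MAC_BYTES = 16  # 128-bit digest, matching the MAC size in the text


def append_mac(pdu: bytes) -> bytes:
    """Sender side: hash the whole PDU and append the digest."""
    return pdu + hashlib.md5(pdu).digest()


def check_mac(pdu_with_mac: bytes) -> bytes:
    """Receiver side: recompute the hash and compare with the received one."""
    pdu, received = pdu_with_mac[:-MAC_BYTES], pdu_with_mac[-MAC_BYTES:]
    if hashlib.md5(pdu).digest() != received:
        raise ValueError("integrity check failed")
    return pdu


framed = append_mac(b"user data " * 100)
print(len(framed), check_mac(framed)[:9])
```

In the scheme described above, the framed PDU would be encrypted before segmentation, so an attacker cannot simply recompute the unkeyed hash after modifying the data.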
ATM Implementation
Using the 16KB buffer in the ATM layer may have a cleaner implementation by decoupling the MAC from the user stream. The main disadvantage to inserting the MAC inline is that a device that does not comply with the security protocol of the sending host may incorrectly interpret the MAC as standard PDU data, causing both stream corruption and loss of PDU framing. The design is as follows: A memory buffer is set aside for each Secure VC in the system. As a new PDU arrives (designated by the BOM or synthetic¹ markers), the buffer corresponding to the VC is filled with the first sweep through the hash algorithm, which will store its temporary result in the buffer. This continues until the EOM (or synthetic) marker arrives, at which time the buffer is built into a cell and encrypted with the rest of the stream. The OAM cell is sent to the remote party, which should be able to extract the MAC information and compute its own result on the previously received PDU. It may be advantageous to use OAM cells to send the MAC out-of-band (OOB) to differentiate between user and crypto streams. However, OAM cells are on plaintext channels, and therefore a keyed one-way hash would have to be used, as opposed to the plain hash (or equivalent) algorithm.

¹ A marker designated by the protocol designers for use with AAL 1 and 5, which have no BOM marker.
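The per-VC buffer design can be sketched with an incremental keyed hash. HMAC-MD5 stands in for the keyed one-way hash, which the text does not name; the class, its method names, and the use of `first`/`last` flags to represent the BOM/EOM (or synthetic) markers are all hypothetical.

```python
import hashlib
import hmac


class VcMacBuffers:
    """One running keyed hash per secure VC, updated cell by cell."""

    def __init__(self, key: bytes):
        self._key = key
        self._running = {}  # maps a VC identifier to its in-progress hash state

    def on_cell(self, vc, payload: bytes, first=False, last=False):
        """Process one cell payload; return the MAC when the PDU is complete."""
        if first or vc not in self._running:
            # BOM (or synthetic) marker: start a fresh sweep for this VC.
            self._running[vc] = hmac.new(self._key, digestmod=hashlib.md5)
        self._running[vc].update(payload)
        if last:
            # EOM (or synthetic) marker: finalize and release the buffer.
            return self._running.pop(vc).digest()
        return None


buffers = VcMacBuffers(b"shared key")
buffers.on_cell((0, 32), b"A" * 48, first=True)
mac = buffers.on_cell((0, 32), b"B" * 48, last=True)
print(mac.hex())
```

Because each VC keeps only its running state, cells from different connections can be interleaved freely, and the resulting MAC could then be carried in an OAM cell as described above.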
6. Additional key negotiations transpire, and the "Connect" event is sent back to Host A

On an even higher resolution system, the actual user, or user process level, could be checked at each end to govern access control. The disadvantage to access control is that it will most definitely require modifications to the signaling protocol, which could take some time within the standards bodies. The ATM Forum is basing the security label format on the IETF working group's recommendation, the Common IP Security Option (CIPSO).

"NOnce" stands for "Number Once", or a random number that is used once
6.1.6 Non-Repudiation
Non-repudiation is the service which prevents a "cheating" party from sending a valid message and then claiming not to have sent it. In general, this service should be implemented in higher layers.
in a local environment (such as a LAN) and is only converted to ciphertext at the "Edge" of the network where the data leaves the local area. Such a system would be deployed in an area where the LAN would be considered safe from attack (such as a single building in a business) and all data leaving the LAN would be considered vulnerable, thus requiring the security. The remote destination would have an equivalent security device at the "Edge" of their network as well, which would convert the ciphertext back to plaintext once the data has entered the remote "safe domain".
End-to-Edge: End-to-Edge security refers to the use of a security device inside a local environment (such as a LAN) which encrypts all data which remains in the LAN but decrypts data on its way to the outside world. At first glance, this may seem like a useless security measure, but it does have valid applications. Consider a military system with multi-level data being transmitted on the same network. It is imperative to have all data sent securely so that no lower level process can access a higher level process's data. However, to gain interoperability with the outside world, it is necessary to allow some means of access. Since all of the internal traffic is encrypted in a manner the outside world cannot understand, there must be a way to intelligently decide which traffic is allowed to pass. The answer is: any data with a low level is allowed to be decrypted and sent through the "Edge" guard and vice-versa. Any data arriving at the guard is considered low level and is encrypted and labeled this way. The guard forms a crypto firewall.

There is no definitive solution for ATM. The needs of the application will dictate which method is implemented in any given situation.
acceptable level of physical security in all but the most demanding situations. It also promotes the use of security because it eliminates the need for a user to carry and maintain an external peripheral.
Network Device
External NDs offer the flexibility of changing security devices as they are outdated and upgraded. They also allow users to opt for no security (in the NIC) when it is not needed, instead of paying for an option that will never be used. NDs can also come in handy for providing Edge-to-Edge services, as they can be placed inline at the WAN access point. The major downfall of an ND implementation is that it has a much larger Red side than its internal NIC counterpart. Should an opponent have physical access to the Red side (such as the fiber connection between the ND and the workstation), all security regarding data to and from that ES would be compromised.
6.3.1 Location
Application Layer/User Plane
Adding security to ATM by using the Application Layer as the crypto platform has some benefits. First, it is conceptually easy to implement because the design is very straightforward. All synchronization is transported in user level cells and interpreted at the other end. Second, since it is not included in the signaling standard, it is very easy to change the protocol without disrupting other major standards. Unfortunately, crypto in the UPLANE has major disadvantages. For one, synchronization between two nodes cannot start until a user level channel has been established. This means that the two end stations must wait until CONNECT has been received. The ATM specification allows up to 14 seconds before a CONNECT state is timed out, meaning synchronization could be delayed up to 14 seconds before it is allowed to start. High connection latency would more than likely be rejected by the community. [27] suggest that the UPLANE offers the optimal configuration when the security is implemented as a Network Device (ND) instead of built into the End System (ES). They argue that the use of Operation, Administration, and Maintenance (OAM) cells (from the CPLANE) generated by the ND might be confused with the OAM cells generated by the ES, and therefore, the CPLANE is not a good choice for security control.
CPLANE
Adding security to the CPLANE has several advantages over the UPLANE approach. CPLANE messages are sent over one or multiple cells which contain Information Elements (IE). Each IE contains some piece of information pertaining to that message or type of message. Some IEs are mandatory per message, while some are optional. If new IE types are added to existing Q.2931 protocol messages, and new messages are defined, the CPLANE can be used to add security. For instance, if there was a security channel designation, it could be used to send a request to a key server at the same time as a SETUP message is sent to an ES. To protect the Q.2931 protocol itself from attack, a signature IE could be added to every message that would allow the ES to verify the authenticity of the message. The downside to such an approach is that it requires additions to the already monolithic Q.2931 protocol.
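The idea of a signature IE appended to every message can be sketched as follows. This is not the Q.2931 encoding: the IE layout (one identifier byte, a two byte length, then the value) is simplified, the identifier `0x7F` is hypothetical, and HMAC-SHA1 stands in for whatever signature scheme would actually be chosen.

```python
import hashlib
import hmac
import struct

AUTH_IE_ID = 0x7F  # hypothetical identifier for the added signature IE
TAG_BYTES = hashlib.sha1().digest_size


def encode_ie(ie_id: int, value: bytes) -> bytes:
    """Simplified IE encoding: identifier (1 byte), length (2 bytes), value."""
    return struct.pack("!BH", ie_id, len(value)) + value


def sign_message(ies, key: bytes) -> bytes:
    """Serialize the IEs and append an authentication IE covering all of them."""
    body = b"".join(encode_ie(ie_id, value) for ie_id, value in ies)
    tag = hmac.new(key, body, hashlib.sha1).digest()
    return body + encode_ie(AUTH_IE_ID, tag)


def verify_message(message: bytes, key: bytes) -> bool:
    """Recompute the tag over everything before the trailing authentication IE."""
    split = len(message) - (3 + TAG_BYTES)
    body, auth_ie = message[:split], message[split:]
    expected = encode_ie(AUTH_IE_ID, hmac.new(key, body, hashlib.sha1).digest())
    return hmac.compare_digest(auth_ie, expected)


msg = sign_message([(0x01, b"called party"), (0x02, b"qos")], b"control key")
print(verify_message(msg, b"control key"))
```

An end system receiving a message that fails verification could discard it, which addresses the forged Call Disconnect scenario described for the control plane.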
SecurePlane
Creating a new model, with new layers, may be the best approach, because it allows isolation from the previously defined layers, and gives the greatest amount of flexibility (see Figure 6.2). With this approach, the congested CPLANE protocol does not need to be overburdened with yet another task, and the downfalls of the UPLANE implementation are not inherited. It also frees the other layers from having to do the extra work necessary to maintain secure connections. Key management and negotiations can be taken care of in one clean module, instead of dispersing the tasks throughout the protocol stack. Referencing Figure 6.2, we can see that there is the addition of the A, B, and Encryption planes. Above the A/B planes we have the Application, Control, and Management planes. These planes are usually implemented as a user or kernel level process. The interface to the AAL layer is usually done at the kernel level. The device driver is most likely the best location for the A/B entities. However, this couples the security services with a certain brand or type of card. A more generic service layer that works with the device driver would be better suited to the problem.

Figure 6.2: The Modified Security Model for NIC implementations

As the diagram points out, communication between the A/B entities and the Encryption planes is necessary. This is simple in the case of a Network Interface card that is compliant with the advanced security model presented here. The device driver can provide an interface to the encryption hardware. However, if the system designers opt for an External Network Device, such as in Figure 6.3, another control path must be established. It would be unwise to require any entity other than the device driver (which is specific to a certain model of a board) to know whether the Encryption layer lies within the NIC or off board in the ND. By defining an API between the device driver and kernel, we can maintain a single version of an A/B entity. Figure 6.4 shows the relationship in terms of a simple Operating System running an ATM network. Each process, whether it be the control plane process or a user application, must access the device through the kernel and, ultimately, through the device driver. Figure 6.5 shows the addition of the A/B entity in the model, and its interface to the device driver. If the device driver provides a communication interface to the Encryption plane, the A/B unit does not have to worry whether the encryption plane is local or remote.

Figure 6.3: The Modified Security Model for Network Device implementations

The A/B planes provide all security services, except for privacy. The A plane's major function is to provide integrity to the user plane. The B plane's major function is to authenticate messages sent by the control plane, and to perform access control operations. By authenticating the control messages, we provide authentication to the user plane, and protect the control plane from a malicious attacker. The downside to this approach is that, once again, the community must agree on the specification, which can take some time.
Figure 6.4: The operating system model of an ATM host

labeling elements (see above description). This appears to be the best approach for the long term, but it may take too long to pass in the standards bodies. For the interim, one of the other methods must be used.
Method 2: The OAM cells can be used to carry the security information. Unfortunately, this method would require changes to the AAL standard to include a new type of multicell OAM transmission. Since most manufacturers have already built circuits supporting the other classes, it is unlikely that they would want to change this specification. In addition, the OAM cells would be restricted by the user's network parameters such as QOS.
Method 3: The third choice is to use standard call procedures and have the security devices at each end hold off

Figure 6.5: The operating system model with the A/B plane

take place. After a session has been established, the connections are finalized with the user, and the security device decouples itself from the connection (it still provides services)
info and then close the connection and allow the original to start. This removes the QOS limitations and makes a cleaner interface. This option seems to be preferred in the community for the short term.
the use of techniques described above, such as the security channel allocation and additional IEs to control messages, these protocols can be implemented. One problem to solve is certificate management. The caching of certificates can have a large memory requirement. It may be advisable at this point to use an external memory device to store these certificates. Since call setup is a rather rare occurrence (as compared to switching to a new VC on an inbound cell), external storage may work just fine. The idea would be to cache as many of the certificates as possible to eliminate the certificate negotiation overhead. The ATM Forum has selected the ISO/IEC 9594-8 authentication and key exchange protocol to base its operation on.
Chapter 7 Introduction
For a basic introduction to agility, see Section 5.10. Designing a hardware architecture that can support agility must be done carefully. As was pointed out in earlier sections, the bulk data encryption takes place between the ATM and Physical layers (on the data bus). Our design must therefore incorporate several key components. The first is an interface to the Utopia bus so that the device can be inserted between the SAR and PHY devices. The second is a high speed memory architecture that can store and provide keys at the same rate that our cells arrive, and the third is a bank of encryption hardware that can support the given throughput of the network. All of these components must be arranged for maximum efficiency in the smallest space possible. In this section, we take a deeper look into the issues of implementing agile crypto hardware and some possible design solutions. We propose the use of reconfigurable (RC) hardware for the purpose of algorithm agility. Reconfigurable hardware is not commonly used for cryptographic applications and therefore needs further study. In Section 8 we introduce the various forms of RC hardware available and their theoretical differences. In Section 9 we analyze cryptographic algorithms for their component decomposition in order to derive general conclusions. The data gathered is used to analyze the efficiency of crypto algorithms in reconfigurable hardware. This analysis forms the formal laboratory portion of this research. Algorithm agility imposes a new problem: the ability to switch algorithms. This can be done either through the use of reconfigurable hardware or through shadow ASICs which occupy the same cell block location but implement a different algorithm. Both have advantages and disadvantages, which will be pointed out below. For a better understanding of the benefits of each technology, see Section 5.1.
Figure 7.1: The Cipher Array with ASICs (figure labels: Block 0, Block 1, Block 2, Block 3, Redundant Hardware)

needed. Unfortunately, to our knowledge, there is no systematic treatment of cryptography applications in RC hardware available in the literature. That means that both their behavior and their performance are unknown for this type of application. If we were to decide to implement this design, we would first need to test for performance benchmarks in various cryptographic applications and then implement the desired algorithms. For this task we first investigated reconfigurables in general. We classed them based on architectural differences and then ran tests based on cryptographic primitives. Section 8 is a comprehensive report on the state of the art in reconfigurable technology. Based on these findings, we proceeded to the work in Section 9.4, where we ran experiments to determine whether reconfigurables were suitable for ATM use.
final product. This allows the reconfigurable nature to be exploited for applications such as hardware updates, firmware updates, dynamic functionality (the device acts as a video accelerator at one moment and an ATM interface the next, etc.), and other areas where reconfiguration would be beneficial. However, reconfigurable hardware is almost always significantly more expensive than custom silicon on a per-device basis (see footnote 1). A typical ASIC may cost on the order of $3.00 to $100.00 per chip (in large quantities), while an FPGA may cost $5.00 to $1200.00 per chip [30]. These costs are usually disregarded during the prototype stage of a product, but a vendor will definitely hesitate to absorb such an expense for mass production. In fact, there are studies that show most companies will strongly resist designs with RC hardware if the chip's cost-per-unit is greater than $100.00 [30, p. 2]. Another problem with RC hardware is that it typically has smaller gate counts and slower performance than custom hardware. In a custom design, the gate layout is tailored and optimized to the application's best interest. A reprogrammable device, on the other hand, tries to utilize generic structures to implement the same functionality. Often the interconnection matrices and the placement of logic modules have the greatest effect on the critical path delays. However, recent developments in the reconfigurable industry have pushed RC devices into the domain of small to medium gate arrays with the introduction of 100k (or more) gates and clock speeds over 200MHz. The reconfigurables have become so advanced that the design methodologies typically used for ASIC development are now being applied to applications targeted at RC hardware.
Footnote 1: Independent studies have shown that this cost is significantly reduced when costs of reengineering and time-to-market are considered.
The main task of interconnects is the efficient routing of signals between core logic blocks. In original reconfigurables, these paths were mask programmable, meaning the logic blocks themselves were programmable, but the interconnects were programmed during the masking process at a fabrication facility. This technology was called MPGA (Mask Programmable Gate Array). In the mid eighties, several vendors introduced the concept of the programmable interconnect, which gave birth to the FPGA. The programmable interconnect allowed the paths to be programmed (at run-time) in the same manner as the logic blocks. The following is a list of current interconnection
Combinatorial Logic
rectly from PAL technology. It is based on the concept of programmable AND arrays being fed into OR arrays. It computes the sum-of-products based on the inputs and can be configured for any Boolean realization.
Lookup Table Architecture (LUT) It is also well known that any logic combination can be represented with a simple table lookup where m inputs map into n outputs. There have been many studies into the optimal size of m and n for the best utilization of hardware. It has been shown that this number is generally (m, n) = (4, 1) [8]. This configuration appears in many commercially available devices (the Xilinx XC4000E and Altera FLEX series, to name a couple). [8] also shows that 40 to 60% of all propagation delays can be attributed to routing resources and that cascading several LUTs together can yield higher performance. This has been done in the Xilinx XC4000E, where two 4-input LUTs and a 3-input LUT are cascaded together.
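The table-lookup idea can be illustrated with a short software model. The following Python sketch is only an illustration under stated assumptions: the helper names make_lut and lut_eval are hypothetical and do not come from any vendor tool. It shows that a 4-input, 1-output Boolean function is fully described by 16 stored bits.

```python
def make_lut(func, m=4):
    """Precompute the 2**m-entry truth table of an m-input, 1-output function.

    Hypothetical helper: models how a LUT stores one output bit per input pattern.
    """
    table = []
    for index in range(2 ** m):
        # Bit k of the table index is the value of input k.
        bits = [(index >> k) & 1 for k in range(m)]
        table.append(func(*bits) & 1)
    return table


def lut_eval(table, *bits):
    """Evaluate the LUT: the inputs form an address into the stored table."""
    index = 0
    for k, bit in enumerate(bits):
        index |= (bit & 1) << k
    return table[index]


# Example: a 4-input parity (XOR) function mapped into a (4, 1) LUT.
parity_lut = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)
# Example: an arbitrary function, (a AND b) OR (c AND NOT d), uses the same 16 bits.
mixed_lut = make_lut(lambda a, b, c, d: (a & b) | (c & (1 - d)))
```

Any other 4-input function only changes the 16 stored bits, not the structure, which is the property that makes the LUT a generic building block.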
Sequential Logic
To add the provisions for sequential logic to RC hardware, most vendors place a configurable flip-flop at the output of the combinatorial logic mentioned above. This flip-flop can be configured in various ways, such as bypassed, D, JK, SR, etc.
Channeled Array
Channeled arrays, or row-based architectures, offer a linear array of logic blocks interconnected by busses running parallel and perpendicular over the logic blocks. One advantage of this type of approach is that only two layer CMOS processes are needed because the interconnect channel lies in the same plane as the logic blocks (see Figure 8.1). Another advantage is that speed critical connections can be established through a maximum of two programmable nodes. Since each node introduces a resistive path, fewer nodes directly relate to higher clock speeds.
Symmetrical Array The symmetrical array variation of this technology uses a two layer metal CMOS process to deploy the horizontal and vertical interconnection lines. The logic blocks sit in-between the interconnects. The advantage of this architecture is that it is cheaper to produce two layer processes when compared to three layers. It is also necessary to use this architecture with SRAM based FPGAs because the SRAM circuitry in the interconnect matrix is implemented in the same level as the logic blocks. Therefore it is impossible to have an interconnect directly above a logic block.
more PAL blocks are connected in a hierarchical fashion as to allow integration into a complex device.
Cell-Based Array
Cell Based Array (CBA) is a technique where functionality is modularized and placed into blocks. This technology is most commonly found in ASIC design, where cells are designed in-house (or bought from a third party) and integrated together to form the complete design. The methodology is based on the concept of reusable components. If a certain function is generic enough (such as the PCI interface on a chip), the module can be designed into a "cell" and stored in a database. When the current design calls for a PCI interface, the cell is incorporated into the design. CBA architectures have recently been introduced to the RC world, where certain functionality has proven to be useful to RC designers. The cells are often called cores or megacells, and they work in the same way as cells in an ASIC do. Note that the core cells are NOT reprogrammable. They are merely a special type of logic cell that can be interfaced to and used to perform a certain function (once again, the PCI interface is a good example). One important thing to mention about CBA with RC hardware is that it is independent of the RC architectures noted above. This means that a device may have both a CBA and a Symmetric Array architecture.
FPGA Components
As briefly mentioned before, FPGAs are composed of arrays of small logic blocks. There are three main components in an FPGA. Each vendor may call the components slightly different names, but they are essentially the same. The components are described below (refer to Figure 8.3):
Configurable Logic Blocks The Configurable Logic Blocks (CLBs) are the core logic element in an FPGA. The CLBs are usually small (4 to 32 inputs) and plentiful. They usually contain some or all of the following components: RAM (8 to 64 bits) for lookup tables (LUT), flip-flops, latches, tri-state gates, standard logic gates, etc. In the cases where lookup tables are used, the RAM that implements the LUT can often be configured as standard RAM for use in registers, etc.
Figure 8.3: The SRAM FPGA

Interconnects The interconnects form the connections between the CLBs, I/O blocks, and the other interconnects themselves. By programming the interconnect matrix in a particular fashion, any combination of connections can be established (see Figure 8.4). One of the criticisms of FPGAs is that the interconnection matrix causes variable and unpredictable delays because of the multiple paths available. This design allows the FPGA to be the most flexible, but it makes it difficult to analyze because the delays are unknown until late stages in the design and synthesis of the circuit.
Input/Output Blocks The I/O blocks in an FPGA are very similar to the I/O pads in an ASIC. They act as buffers to the outside world and provide an interface for the interconnection network and CLBs to communicate with other devices. The design of the I/O module is identical to any other design which allows the selection of read-in or write-out through the control OE, or output enable.
CPLD technology allowing the reconfigurable flexibility that has been exploited in FPGA architectures.
CPLD Components
Just as in an FPGA, CPLDs have several core components which make up the overall architecture. The PAL units implement the core logic, while the Programmable Interconnection Array connects the units together and to the I/O blocks.

Programmable Array Logic Blocks Programmable Array Logic (PAL) has been around for quite a while. It was used long before the complex reconfigurable devices were thought of. The general architecture of a PAL is described in Figure 8.5. The PAL's core component is called a MacroCell. The MacroCell is very similar to a CLB in an FPGA. The PAL block itself has many MacroCells (8 to 64) aligned in an array. Each PAL has a small configurable internal interconnection matrix which allows signals arriving at the PAL to be routed in any order to any of the macrocells. After the data has flowed through the macrocell, it can exit the device at the other side. Inside a CPLD, the signal will be connected to the PIA and possibly routed to another PAL or outside to an I/O block.

Programmable Interconnection Array The Programmable Interconnection Array (PIA) connects all PALs and I/O Blocks. This entity allows all blocks and modules to be available to other blocks throughout the entire device. The exact layout of a CPLD is available in Figure 8.6. In some of the larger devices, the PIA is actually a matrix of rows and columns and the PALs are arranged in an array fashion. This allows a higher level of flexibility and brings the CPLD into the application domain of an FPGA.
Input/Output Blocks I/O Blocks are common to both CPLDs and FPGAs. See Section 8.2.5 for more information about I/O blocks.
The next step in our investigation is to determine which device architecture works best with which type of cryptographic algorithm. We need both speed and efficiency for a wide variety of algorithm types. Some algorithms will use a very wide data path, while others will need many registers and flip-flops. It is unclear which device architecture will perform adequately in an ATM type environment, so we must design a methodology that will allow us to predict which one will. Assessing the performance of all cryptographic algorithms in all RC hardware is an extremely complex task. To overcome this problem, we decided to analyze the algorithms for their components (such as XOR, ADD, SHIFT, etc.) and then run extensive tests on those components based on certain classes of hardware. Through this research:
1. We can derive general statements about cryptography on RC hardware.
3. We could gain some insight into which architecture might work acceptably well in an ATM environment.
The first step is to study the types of algorithms that already exist. There are many different branches of cryptographic algorithms, such as private-key ciphers, public-key ciphers, and hash functions. For the purpose of ATM, we are only concerned with symmetric block ciphers, since they seem more promising for use in the encryption unit. We start by explaining how symmetric encryption is accomplished.
Feistel Networks
Many block ciphers use a Feistel network architecture, named after the inventor, Horst Feistel, who made major contributions to the field in the 1960's and 70's while working on ciphers for IBM Research. The Feistel network is simple in concept. One of its most attractive features is that it allows a particular algorithm to be used for both encryption and decryption due to its inversion properties. The basic architecture is as follows: The datapath is split into two halves, the right and left sides. The right half is operated on by a function f which incorporates elements of confusion, diffusion and key material and produces an output of the same size. The output from the f-function is XORed with the left half, and the result is stored into the right half. A copy of the original right half is swapped into the left half, thus completing the cycle (see Figure 9.1). It should be noted that only the left part L[i-1] is encrypted in one round, whereas R[i-1] is passed through in the clear.
Figure 9.1: The Feistel Network (figure labels: inputs L[i-1] and R[i-1], round key Ki, outputs L[i] and R[i])

This cycle is usually repeated many times; each repetition is referred to as a round.
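The round structure and its inversion property can be sketched in a few lines of Python. This is a minimal illustration and not a real cipher: the function toy_f and the round keys used with it are hypothetical stand-ins for an actual f-function and key schedule.

```python
def feistel_round(left, right, round_key, f):
    """One round: L[i] = R[i-1], R[i] = L[i-1] XOR f(R[i-1], K[i])."""
    return right, left ^ f(right, round_key)


def feistel_encrypt(left, right, round_keys, f):
    """Apply all rounds, then undo the last swap so the same routine decrypts."""
    for key in round_keys:
        left, right = feistel_round(left, right, key, f)
    return right, left


def feistel_decrypt(left, right, round_keys, f):
    """Decryption is the same network with the round keys in reverse order."""
    return feistel_encrypt(left, right, list(reversed(round_keys)), f)


def toy_f(half, key):
    """Hypothetical 32-bit f-function; it does not need to be invertible."""
    return ((half * 0x9E3779B1) ^ key) & 0xFFFFFFFF
```

Note that f itself is never inverted during decryption. Only the XOR is undone, which is why the same hardware can serve both directions.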
Substitution-Permutation Networks
Substitution-Permutation networks, or SP networks, are elements that add hybrid confusion/diffusion to the input data. The f-functions in Feistel networks are often based on them. Substitution elements take an m-bit input and provide an n-bit output (see Figure 9.2). The elements can be implemented as look-up tables or as combinatorial logic, but both are rather expensive, so m and n are often kept small to minimize the cost.
Figure 9.2: A 3x4 Substitution Box

Permutation elements remap input bits into output bits and add the diffusion property to the data. The construct is quite simple and runs extremely well in hardware, but runs very poorly in software (Figure 9.3 shows the relationship between input and output). Ciphers that use these elements in combination are often called product ciphers.
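A small software model makes the contrast between the two element types concrete. In the Python sketch below, the table contents and the wiring list are hypothetical values chosen only for illustration; they are not taken from any cipher.

```python
# Hypothetical 3x4 substitution box: 3 input bits select one of 8 four-bit outputs.
SBOX_3X4 = [0x9, 0x4, 0xA, 0xB, 0xD, 0x1, 0x8, 0x5]


def substitute(value, sbox):
    """Substitution is a table lookup: cost grows as 2**m entries of n bits."""
    return sbox[value]


def permute(value, wiring):
    """Permutation: wiring[i] names the input bit that drives output bit i.

    In hardware this is only wiring. In software every bit needs its own
    shift/mask/OR step, which is why permutations run poorly there.
    """
    out = 0
    for out_bit, in_bit in enumerate(wiring):
        out |= ((value >> in_bit) & 1) << out_bit
    return out


# A product-cipher style step: substitute, then permute the 4-bit result.
def sp_step(value):
    return permute(substitute(value, SBOX_3X4), [3, 0, 1, 2])
```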
Iteration Ciphers
Iteration ciphers are simply ciphers that incorporate loops to achieve the desired level of confusion/diffusion. Each iteration of the loop further scrambles the data. Almost all block ciphers in use are based on multiple iterations or "rounds".
9.3 Methodology
By studying existing algorithms (see Table 9.1), we have determined that there is a small finite group of components that is common to all algorithms. Some of these components may have been discussed above regarding design theory, etc., but this section serves to include a detailed description of each component and its analysis.
Algorithm    Parameters                       Components
DES          64bit I/O, 56bit key             P-BOX, XOR, SBOX, ROT
MADRYGA      var I/O, var key                 XOR, SFT, ROT
NewDES       64bit I/O, 120bit key            XOR, N/A f box
FEAL         64bit I/O, 64bit key             XOR, ROT
REDOC II     80bit I/O, 160bit key            P-BOX, SBOX, XOR
REDOC III    var key up to 20kbits            XOR, STOR
LOKI         64bit I/O, 64bit key             XOR, SBOX, P-BOX
KHUFU        64bit I/O, 512bit key            XOR, Dyn-SBOX
KHAFRE       64bit I/O, 64-128bit key         XOR, SBOX
IDEA         64bit I/O, 128bit key            XOR, ADDER, MULT
MMB          128bit I/O, 128bit key           XOR, MULT, STOR
GOST         64bit I/O, 256bit key            SBOX, ROT, ADDER
CAST         64bit I/O, 64bit key             XOR, SBOX
BlowFish     64bit I/O, 0-448bit key          XOR, SBOX, ADDER
SAFER        64bit I/O, 64bit key             XOR, ADDER, ROT, GFMULT
3-Way        96bit I/O, 96bit key             XOR, P-BOX, ROT
CRAB         1024bytes I/O, 128bit key        P-BOX, XOR, AND, OR, NOT
9.3.2 Implementation
The next step is to build behavioral models using a hardware description language (HDL). Using these models we can map the given components into various architectures and record the results. HDLs were chosen for their portability across platforms. The advantage is that the differences in entry style can be factored out of the overall equations, because the resulting design is derived from the same source code. After the models were built, they were synthesized and mapped using the tools appropriate for the device. Various optimizations were selected over multiple runs to average the results. The most interesting results are the percentage of resources consumed and the critical path delay.
Device Selection
In order to determine the final results of our experiments, two stages of processing were needed. The first is the synthesis stage, which compiles the HDL into device specific logic maps or netlists. The second is a place and route stage where the netlists are mapped into actual hardware entities. Because of the availability of these tools, we were only able to work with two vendors, namely Xilinx Corporation and Altera Corporation. Fortunately, these two vendors are the market leaders and also provide some of the largest devices for us to work with. Unless otherwise noted, all implementations of the algorithms were performed on speed grade 3 devices. Devices are selected based on their availability in the market place and their relative size/speed/price/package selection. For example, XILINX has a multitude of devices in various families. However, only the XC4000 family is large enough to accommodate the large scale design of a cryptographic algorithm (typically 20K-60K gates; see footnote 2), so it is the only one used here. The same holds for Altera, where only the FLEX10K devices can support the needs of our application. Initial research has revealed an average PAD delay for each device tested, which is always subtracted from the critical path delay in cases where clocking was not used to determine component speed. The results of this calculation are in Table 9.3.
Device             Input delay (ns)   Output delay (ns)   Total
EPF10K70RC240-3    5.6                5.3                 10.9
XC4020EPG223-3     2.5                8.5                 11.0
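The PAD delay correction described above is a simple subtraction, sketched here in Python. The pad delays are the values listed in Table 9.3; the helper name core_delay_ns and the measured path value in the usage note are hypothetical and serve only as an example.

```python
# Input and output pad delays in ns, taken from Table 9.3.
PAD_DELAY_NS = {
    "EPF10K70RC240-3": (5.6, 5.3),
    "XC4020EPG223-3": (2.5, 8.5),
}


def total_pad_delay_ns(device):
    """Sum of the input and output pad delays for one device."""
    input_ns, output_ns = PAD_DELAY_NS[device]
    return input_ns + output_ns


def core_delay_ns(device, measured_path_ns):
    """Remove the pad delays from a pin-to-pin critical path measurement.

    Only meaningful for unclocked (purely combinatorial) measurements.
    """
    return measured_path_ns - total_pad_delay_ns(device)
```

For instance, a hypothetical pin-to-pin measurement of 21.4ns on the XC4020EPG223-3 would correspond to roughly 10.4ns of delay inside the component itself.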
connected to it. The following is a VHDL description of a permutation component found in the Data Encryption Standard (DES) algorithm:
library ieee;
use ieee.std_logic_1164.all;

entity pbox is
  port (
    -- The port types were lost in the source text. The 1 to 32 ranges
    -- below are an assumption based on the indices used in the body.
    PI: in  std_logic_vector(1 to 32);
    PO: out std_logic_vector(1 to 32)
  );
end pbox;

architecture behave of pbox is
begin
  -- Each input bit is wired to one fixed output position; no logic is needed.
  PO(16) <= PI(1);   PO(7)  <= PI(2);   PO(20) <= PI(3);   PO(21) <= PI(4);
  PO(29) <= PI(5);   PO(12) <= PI(6);   PO(28) <= PI(7);   PO(17) <= PI(8);
  PO(1)  <= PI(9);   PO(15) <= PI(10);  PO(23) <= PI(11);  PO(26) <= PI(12);
  PO(5)  <= PI(13);  PO(18) <= PI(14);  PO(31) <= PI(15);  PO(10) <= PI(16);
  PO(2)  <= PI(17);  PO(8)  <= PI(18);  PO(24) <= PI(19);  PO(14) <= PI(20);
  PO(32) <= PI(21);  PO(27) <= PI(22);  PO(3)  <= PI(23);  PO(9)  <= PI(24);
  PO(19) <= PI(25);  PO(13) <= PI(26);  PO(30) <= PI(27);  PO(6)  <= PI(28);
  PO(22) <= PI(29);  PO(11) <= PI(30);  PO(4)  <= PI(31);  PO(25) <= PI(32);
end behave;
Analysis
Since Permutation boxes essentially require zero hardware elements, they will not be analyzed here. These components can be added to a hardware architecture for a negligible penalty in both time and space measurements.
Analysis
A VHDL model of a 32bit XOR component was built and then processed through synthesis tools to allow for a comparison. Table 9.4 shows the results obtained from this synthesis.
Device     Compiler            Optimization   LE Utilization   Max Delay
FLEX10K    MaxPlus II 7.0      Speed          32               12.0ns
                               Area           32               12.0ns
XC4000E    WVO 7.2/XA 6.0.1    Area=LOW       16               10.4ns
                               Area=HIGH      16               10.4ns
                               Speed=HIGH     16               10.4ns

Table 9.4: 32 bit XOR box (source: xormod.vhd)

Observe that the results are similar between the two devices. Note that the resource consumptions do not correlate because of the differences in architectures.
Analysis
A VHDL model of a 6x4 SBOX was built from the data provided in the Data Encryption Standard [12]. In addition, ROM tables were built for technologies that support ROM mapping.

Test 1 - Xilinx The model was synthesized using WorkView Office v7.2 targeting XILINX XC4000E technology. The results are posted in Table 9.5. Note that synthesis results in combinatorial circuits.

Optimization      CLB Utilization   Max Delay
Un-Optimized      89                36.3ns
Collapsing=HIGH   89                36.3ns
Area=HIGH         18                51.1ns
Speed=HIGH        Failed            N/A

Table 9.5: Substitution box implementation in XC4000E technology with synthesized combinatorial logic (source: sbox1.vhd)
Test 2 - Xilinx XC4000E also supports RAM/ROM through the use of its 32x1/16x2 LUT implementation. This means that a group of CLBs can be configured as a large lookup table and implement the SBOX more efficiently than through the synthesized design. The results are in Table 9.6.

Optimization   CLB Utilization   Max Delay
ROM Mapped     10                15.8ns

Table 9.6: Substitution box in XC4000E technology with ROM (source: sbox1.mem)
Test 3 - Altera The model was then synthesized using MaxPlusII from Altera and analyzed for efficiency in a FLEX10K device. The results are posted in Table 9.7. Note that this design also results in combinatorial circuits.

Table 9.7: Substitution box in FLEX10K technology with synthesis (source: sbox1.vhd)
Optimization   EAB Utilization   Max Delay
ROM Mapped     8                 18.3ns

Table 9.8: Substitution box in FLEX10K technology with ROM (source: sbox1.mif)

Based on the results of Tests 1-4 in the SBOX test, we can assert that SBOX performance is greatly accelerated by the use of RAM facilities present in our testing hardware. Since SBOXes are critical components of symmetric ciphers, we can conclude that architectures with RAM will greatly enhance the overall throughput.
to map into hardware because they are essentially a permutation. As was pointed out earlier, these permutations take hardly any hardware resources. The more advanced shifters, such as decisive and sequential shifters require more logic and will therefore be analyzed below.
Decisive shifters
Decisive shifters are components that take in two inputs. The first is the data word to be shifted, and the other is a binary value that allows the shifter to decide whether to shift or not. All processing is done combinatorially and therefore doesn't require clocking or registered output. However, the decision circuitry requires a multiplexer, so this unit is more than a simple permutation.
Sequential Shifters
Sequential shifters are components that register the input and shift based on a clock edge. They require the most hardware resources, but offer the advantage of a register and a combinatorial shift in one component. For some designs this may offer the perfect element for key scheduling or round iteration processing.
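The difference between the two shifter types can be modeled behaviorally. The Python sketch below is an illustration only: it assumes a rotate (as opposed to a shift that discards bits) and a fixed shift amount of one position, neither of which is specified above, and all names are hypothetical.

```python
def rotate_left(word, amount, width=32):
    """Fixed rotation: in hardware this is pure wiring (a permutation)."""
    amount %= width
    mask = (1 << width) - 1
    return ((word << amount) | (word >> (width - amount))) & mask


def decisive_shift(word, enable, width=32):
    """Decisive shifter: a multiplexer selects the rotated or unrotated word."""
    mask = (1 << width) - 1
    return rotate_left(word, 1, width) if enable else word & mask


class SequentialShifter:
    """Sequential shifter: a register whose contents rotate on each clock edge."""

    def __init__(self, width=32):
        self.width = width
        self.state = 0

    def load(self, word):
        self.state = word & ((1 << self.width) - 1)

    def clock(self):
        # Models one rising clock edge: register and shift in one component.
        self.state = rotate_left(self.state, 1, self.width)
        return self.state
```

The decisive shifter has no state and resolves within one combinatorial delay, while the sequential shifter holds its value between clock edges.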
Analysis
A VHDL model of all three shifters was built and analyzed on the specified hardware targets.

Design          Device    Compiler          Optimization   LE Utilization   Max Delay
Sequential      FLEX10K   MaxPlus II 7.0    Speed          64               10.9ns
                                            Area           64               10.4ns
                XC4000E   WVO7.2/XA6.0.1    Area           32               10.4ns
Decisive        FLEX10K   MaxPlus II 7.1    Speed          32               5.2ns
                                            Area           32               5.2ns
                XC4000E   WVO7.2/XA6.0.1    Area           16               19.3ns
Combinatorial   N/A       N/A               N/A            N/A              N/A

Observe that the performance was equal across the two devices for the sequential shifter implementation, but dropped off significantly in the FPGA for the multiplexer based design.
9.4.5 Adders
There are various types of standard adder architectures which have various speed/area properties and are well suited to di erent types of hardware architecture. Rather than try to model and analyze each one in each di erent piece of hardware, we just used the built-in functions provided with the hardware vendor. For Altera, we used the LPM Module LPMADDSUB and for Xilinx, we used the XBLOX module ADDSUB. The results are in Table 9.10.
Device            Size     Optimization   Utilization   Max Delay
EPF10K10TC144-3   32 bit   Area           63 LEs        102.4ns
                           Speed          110 LEs       43.1ns
EPF10K30BC356-3   64 bit   Area           127 LEs       196.0ns
                           Speed          240 LEs       73.2ns
XC4020EPG223-3    32 bit   -              17 CLBs       24.1ns
                  64 bit   -              33 CLBs       45.7ns
Observations
Here we note that the performance in the FPGA architecture far exceeded the CPLD. This can be attributed to the fast carry logic available between adjacent logic elements in the FPGA. In the CPLD, once the design is larger than one PLD, the logic must be placed in a neighboring PLD. Since the carry logic is implemented to skip every other PLD, the propagation delay is greater due to the longer distances. In addition, the effects of utilizing the global routing resources may play a role in the slowdown that we observed.
Data Buffers
Algorithms such as the Secure Hash Algorithm (SHA) process a 512 bit buffer of user data. This data is read in 32 bit portions at a time, but the ordering is such that most of the buffer must be available at any given moment. This is because 4 words are needed for every calculation and the words are spread out over the 512 bit array. This means that a 512 bit register would not work well (which is just as well because a register of that size is very inefficient), but some kind of RAM with a 32 bit word size would be perfect. Luckily, modern reconfigurable hardware such as the XC4000E and FLEX10K supports RAM/ROM allocation and will work well with the SHA algorithm. For this test a 256x32 RAM component was built using the built-in RAM macrofunctions provided by each vendor. For Altera, the lpmramdq was chosen, and MemGen was used for Xilinx. The results are posted in Table 9.11.

Device               Utilization              Max Delay
EPF10K100GC503-3DX   8192 bits (4 EABs)       9.5ns
XC4020EPG223-3       8192 bits (360 CLBs)     60.6ns
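The access pattern that motivates a word-addressed RAM can be sketched as follows. The Python model below assumes the SHA-1 form of the message schedule, in which each new word is built from the words 3, 8, 14 and 16 positions back and rotated left by one bit; the function names are hypothetical. It shows that a 16-entry by 32-bit memory, addressed modulo 16, is sufficient.

```python
def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def schedule_with_ram(block_words):
    """Produce the 80 schedule words using only a 16 x 32-bit circular buffer.

    Each step reads four scattered words and overwrites the oldest one,
    which maps naturally onto a small RAM with a 32-bit word size.
    """
    ram = list(block_words)  # the 512-bit block as sixteen 32-bit words
    out = []
    for t in range(80):
        if t >= 16:
            ram[t % 16] = rotl32(
                ram[(t - 3) % 16] ^ ram[(t - 8) % 16]
                ^ ram[(t - 14) % 16] ^ ram[(t - 16) % 16],
                1,
            )
        out.append(ram[t % 16])
    return out


def schedule_reference(block_words):
    """Straightforward version that keeps all 80 words, for comparison."""
    w = list(block_words)
    for t in range(16, 80):
        w.append(rotl32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    return w
```

Both functions produce the same sequence, so the buffer never needs to grow beyond the original 512 bits.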
Multiplexers
As mentioned above, multiplexers are components commonly found in any real world implementation, and are therefore important to analyze. For this test we took a 32bit 2x1 multiplexer model and synthesized it into our test chips. The results are in Table 9.12
Device            Utilization   Max Delay
EPF10K10TC144-3   32 LEs        5.2ns
XC4020EPG223-3    16 LEs        19.3ns
The Data Encryption Standard (DES) is probably the most commonly used algorithm in the world for symmetric encryption of data. It makes a good example for implementation because it contains many of the components and constructs that have been introduced in the previous chapter. Especially important is the fact that the algorithm has been approved for use in ATM by the ATM Forum. For an explanation of DES, see [12], [28, page 70]. As was explained earlier, there are many algorithms that will work well with ATM, but by using the data from Section 9.4 together with this design example, we should be able to assess whether RC hardware is acceptable in principle for use in high speed secure networks.
10.1.2 Design
DES uses a Feistel Network architecture with 16 rounds. Each round can be implemented as separate hardware with pipeline stages between each one for high throughput applications. However, this consumes major silicon real-estate and generally will not work in reconfigurable hardware because it is too large. For designs with fewer than 16 rounds of fixed hardware, some kind of feedback loop must be established. This is where the need for registers and multiplexers comes in. When we set out to implement DES in reconfigurable logic for high speed networks, there was a set of design criteria that we wanted to meet:
1. It must be targeted for high performance (as opposed to smallest size).
2. It must complete an operation (such as encrypt or decrypt) in the fewest possible cycles (which is 16 for a simple design).
3. It must fit into a commercially available chip (as opposed to one that is only in beta test).
4. It must provide for loop unrolling for future speed improvements.
To meet (1), we designed the I/O for full bus width with separate input and output buses. We also implemented a full width key bus. To meet (2), we used strategic placement of the bus registers to grab the data at key locations to allow for state machine operation that was dependent only on the number of iterations, not I/O latencies. To meet (4), we carefully designed the control, key scheduler, and Feistel network to allow for the addition of arrayed components. Figure 10.1 shows the layout of the components from a schematic point of view. Note the use of two 64bit registers: one on the inputs and another in the feedback loop. This design allows us to "steal" an extra clock cycle at the expense of 64 flip-flops and one gate more of complexity through the Feistel network (through the MUX). Without the secondary register at the data feed, we would be required to go to the INIT state after the sixteenth round completed so that the outputs could stabilize to the correct data (DONE is asserted) before new data is read in. With this register in place, we can successfully read a new data set in at the conclusion of round 16, thus producing the cycle chain INIT, R1, R2, ..., R16, R1, R2, ... etc. Without it, the state transition diagram must return to INIT every time before starting the next set. This addition saves a single clock cycle of latency (16 cycles per block instead of 17) and increases system throughput by 6.25%.

Figure 10.1: Schematic map of DES algorithm (figure labels: Control, Data In, IP, four REG32 registers, 32 and 64 bit buses)

In addition to designing the main data path for high speed, the key scheduler must also be designed to deliver data at the same rate and with correct framing with respect to the data path. In order to do this, we placed a single register to sample data coming from a multiplexer. The multiplexer is fed by both the feedback and the outside key input. The output from the registers feeds a unit chain of schedulers, one for each unrolled loop in the path (see footnote 1). Each keyunit schedules the keys for one
94
key in
Master Keysched feedback 56 PC-1 56
Master
REG28 28
REG28 28
st_u0
Unit 0
st_u1
C out D out
Unit 1
C in 28
D in 28
st_u2
Unit 2
Shift Shift
Shift Shift
st_u3
Unit 3
st in 56 next
To FN0
FN1
FN2
FN3
Figure 10.2: Schematic map of key schedule logic round, receiving control information from its respective command line (stkeyunit0, stkeyunit1, stkeyunit2, etc.). The output from each keyunit is fed to its respective Feistel network and the next unit in the chain. The last unit in the chain feeds its Feistel network and then loops the output back into the master keyscheduler for storage in the registers. This operation is displayed in Figure 10.2. The results of our implementation is posted in 11.2.
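The chained key units described above can be modelled in a few lines. The following Python sketch is an illustration under stated assumptions, not the thesis hardware: it covers only the encryption direction, omits the PC-1 and PC-2 permutations, and the names `rotl28` and `run_key_chain` are hypothetical. The rotation amounts are those of the DES standard.

```python
# Left-rotation amounts for the 16 DES rounds (DES standard key schedule).
SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]
MASK28 = (1 << 28) - 1


def rotl28(value, count):
    """Rotate a 28-bit value left by `count` bits."""
    return ((value << count) | (value >> (28 - count))) & MASK28


def run_key_chain(c, d, units=1):
    """Model the master register feeding a chain of `units` key units.

    `c` and `d` are the 28-bit halves held in the master register
    (hypothetical inputs; PC-1 is assumed to have been applied already).
    On each clock, every unit rotates the halves it receives from its
    predecessor, and the last unit's output is stored back in the
    master register. Returns the (C, D) pair produced for each round
    and the final register contents.
    """
    if 16 % units != 0:
        raise ValueError("units must divide 16")
    per_round = []
    for clock in range(16 // units):
        for unit in range(units):
            shift = SHIFTS[clock * units + unit]
            c = rotl28(c, shift)
            d = rotl28(d, shift)
            per_round.append((c, d))
        # End of clock: (c, d) is written back to the master register.
    return per_round, (c, d)
```

Because the sixteen rotations sum to 28 bits, the C and D halves return to their starting values after the last round, so the value looped back into the master register is ready for the next block without reloading the key.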
[State transition diagram: states R1, R2, R15, and R16 shown in a cycle; transition labels "start = 0 / busy = 0" and "start = 0"]
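The cycle saving obtained from the secondary input register can be checked with a short calculation. The sketch below is illustrative only: the function names and the 10 MHz clock used in the usage note are assumptions, not values from the implementation.

```python
def cycles_per_block(rounds=16, returns_to_init=False):
    """Clock cycles spent per 64-bit block in steady state.

    Without the secondary input register the controller revisits INIT
    after every block (INIT, R1..R16); with it, R16 is followed
    directly by R1.
    """
    return rounds + (1 if returns_to_init else 0)


def throughput_mbps(clock_mhz, cycles, block_bits=64):
    """Throughput in Mb/s when one block completes every `cycles` clocks."""
    return clock_mhz * block_bits / cycles


def state_sequence(blocks, returns_to_init=False):
    """List the controller states visited while processing `blocks` blocks."""
    states = ["INIT"]
    for block in range(blocks):
        if returns_to_init and block > 0:
            states.append("INIT")
        states.extend(f"R{r}" for r in range(1, 17))
    return states
```

For an assumed 10 MHz clock, `throughput_mbps(10.0, 16)` divided by `throughput_mbps(10.0, 17)` gives a ratio of 17/16, which is the 6.25% improvement quoted in the text.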
Chapter 11 Results
11.1 Comparing the Results in Reconfigurable Hardware
In our experiments we compared two high-end devices (manufactured by Xilinx and Altera) in a series of tests that we hope show the relative performance of cryptographic algorithms. It may be unclear which device actually performs better; in practice, they were very close. It is difficult to estimate the actual resources consumed in a device since there are so many discrepancies in the way information is provided by the companies. Often the conversion between logic elements (LEs) and typical gate counts is overestimated and can confuse the user. We have assembled data from the two market leaders, Xilinx and Altera, based on a series of devices from each family and their cost to the end user. All prices are assumed to be for 100-unit quantities.
11.1.1 Methodology
Because the task of assigning a dollar value to each individual resource in a device is much too complex, we will use two simple models. The first takes only logic elements into consideration; we will refer to it as Logic Element Weighting (LeW). This method simply divides the total cost of the device by the number of LEs available. The second method considers only RAM as a resource; we will call it RAM Element Weighting (ReW). For simplicity, we will define one RE as 32 bits of RAM.
Device Family  Device              Logic Elements  Device($)  LE($)
1              XC4003EPG120C-3                100      65.10   0.65
2              XC4008EPG191C-3                324     163.00   0.51
3              XC4010EPG191C-3                400     227.00   0.57
4              XC4013EPG223C-3                576     382.00   0.66
5              XC4020EPG223C-3                784     401.00   0.51
6              XC4036EXPG411C-3              1296     814.00   0.63
1              EPF10K10LC84-3                 576      32.50   0.06
2              EPF10K20TC144-3               1152      65.50   0.06
3              EPF10K30RC208-3               1728     132.00   0.08
4              EPF10K40RC208-3               2304     175.00   0.08
5              EPF10K50VRC -3                2880     274.00   0.10
6              EPF10K100CG503-3              4992     892.00   0.18
Table 11.2: RE Weighted (ReW) Cost Analysis of Various Devices
Our next step is to establish the relative cost of each of the cryptographic components in the different architectures. One way of comparing the devices is to use two chips from the same price range. This way, we can compare consumption of resources versus available resources. Another way is to try to assign a price per resource and list each component's relative price/performance. Because of the differences in architecture, it will be necessary to take into consideration all of the subtle details in order to get accurate results. For instance, Altera FLEX10K devices have separate RAM components, while Xilinx XC4000E devices trade logic for RAM and vice versa. So if a component takes 32 logic elements, this equals 32 LE resources in Altera, and 32 LE + 32 RE in Xilinx, since each logic element is also (potentially) 32 bits of RAM.
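The two weighting models can be written down directly. The Python sketch below is a minimal illustration: the function names are hypothetical, ReW is assumed to be computed by analogy with LeW (device cost divided by the number of REs), and a component is assumed to use no RAM elements on Altera unless it is explicitly placed in RAM.

```python
def le_weight(device_cost, logic_elements):
    """LeW: device cost in dollars per available logic element."""
    return device_cost / logic_elements


def re_weight(device_cost, ram_bits, bits_per_re=32):
    """ReW: device cost in dollars per RAM element (one RE = 32 bits).

    Assumed by analogy with LeW; the text defines only the RE unit.
    """
    return device_cost / (ram_bits / bits_per_re)


def resources_used(logic_elements, family):
    """Return the (LE, RE) pair charged to a logic-only component.

    On Xilinx XC4000E every logic element could also serve as 32 bits
    of RAM, so it is charged as one LE plus one RE. On Altera FLEX10K
    the RAM is separate, so only LEs are charged.
    """
    if family == "xilinx":
        return (logic_elements, logic_elements)
    if family == "altera":
        return (logic_elements, 0)
    raise ValueError(f"unknown family: {family}")
```

Applied to the cost table, `le_weight(65.10, 100)` rounds to 0.65 for the XC4003E and `le_weight(892.00, 4992)` rounds to 0.18 for the EPF10K100.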
     (unlabeled)  I/Os  Total Cost($)  LeW($)  ReW($)
1            512   189            175
             400   160            227
2            640   310            315
             576   192            382
3            768   406            892
           1,296   288            814
Speed(ns) 12.0 10.4
18.3 15.8 49.6 51.1 31.0 36.3 5.2 19.3 102.4 24.1 196.0 45.7 9.5 60.6
Manual Automatic Manual Automatic Automatic Automatic 448 646 359 549
Delay Clock Tput. (ns) (MHz) (Mb/s) 154.9 7.00 27.99
XC4020EPG223-3
area=low collapse=off redundancy=off P/R = 2,2 same as above P/R = 4,4 SBOX=RAM SBOX=RAM P/R = 4,4 SBOX=RAM P/R = 4,4 EPF10K30RC240-3 Norm/Speed/Area SBOX=RAM
39.96
57.60
Chapter 12 Conclusions
12.1 Design Recommendations for ATM
There are several parameters that should be considered when designing secure ATM devices. This section summarizes our findings to give the reader a clear picture of which specifications are the most important in the system.
Key agility - The system should cryptographically isolate every channel, even if operating in VPC mode.
Call Mode - Should support both PVC and SVC operation.
Throughput - Should support the throughput requirements of the desired application (e.g., 155Mb/s for OC-3, 622Mb/s for OC-12c).
Latency - Security services require computation, and the resulting latency in the data stream will reflect this. Make sure the maximum latency meets the needs of the application. Typical ranges are 5ms through 7ms.
Maximum Call Capacity - A key agile system requires more memory for every VC supported. Make sure the supported call capacity meets the system requirements. Typical values are in the order of 1024-65,535 connections for ND link encryptors and 256-1024 for NICs.
Algorithms - Support for strong public and private key algorithms is a very important issue. Refer to Section 5.2 for further details.
Hardware accelerators - Authentication and other public key operations can be greatly improved through the use of hardware accelerators.
Key Management - A system that supports automatic negotiation of keys is more desirable than some previous implementations that required out-of-band negotiations.
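The throughput and call-capacity items above lend themselves to a quick budget calculation. The Python sketch below is a rough illustration under stated assumptions: it uses the standard 53-byte ATM cell with a 48-byte payload, ignores physical-layer framing overhead, and the per-connection context size passed to `key_table_bytes` is a hypothetical figure, not one taken from this work.

```python
ATM_CELL_BYTES = 53      # 5-byte header plus 48-byte payload
ATM_PAYLOAD_BYTES = 48
DES_BLOCK_BYTES = 8      # DES operates on 64-bit blocks


def cell_time_us(line_rate_mbps):
    """Time in microseconds available per cell at the given line rate."""
    return ATM_CELL_BYTES * 8 / line_rate_mbps


def des_blocks_per_cell():
    """Number of DES blocks needed to encrypt one cell payload."""
    return ATM_PAYLOAD_BYTES // DES_BLOCK_BYTES


def key_table_bytes(connections, context_bytes):
    """Memory for per-VC key context in a key agile device.

    `context_bytes` is an assumed per-connection size (for example a
    key plus chaining state); the real figure depends on the design.
    """
    return connections * context_bytes
```

At 155 Mb/s a cell arrives roughly every 2.7 microseconds and carries six DES blocks; with an assumed 16 bytes of context per connection, 65,535 connections need about 1 MB of key memory.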
designer. For this reason we recommend Xilinx, or an FPGA-like architecture, as opposed to CPLDs. The main reasons are:
Faster overall performance - Our DES implementation performed better in the Xilinx FPGA.
Scales better - In both size and cost. The Altera devices exhibited non-linear scaling factors for changes in bit widths and for device changes.
Provides for loop unrolling - Loop unrolling should provide a higher throughput in the device. This is not possible with the FLEX10K because each SBOX takes one EAB. Even though an EAB can hold a substantial amount of memory (2048 bits), it can only be used for one component. Therefore, one SBOX takes one EAB and we quickly exhaust the supply. However, we can instantiate any number of 10-CLB SBOXes in the XC4000E until we run out of logic elements. Since our one-loop implementation consumed only half of the available resources in our test device (XC4020EPG223C-3), we can theoretically unroll the loop 2-4 times, and potentially even more in a bigger device.
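The S-box storage argument above can be turned into a simple upper bound on loop unrolling. The Python sketch below is only an illustration: the function names are hypothetical, the device resource counts are supplied by the caller, and the bounds consider S-box storage alone, so the remaining logic of a round (permutations, XORs, registers, key units) will reduce the achievable figure further.

```python
SBOXES_PER_ROUND = 8  # DES uses eight S-boxes in each round


def max_rounds_by_eab(eabs_available):
    """FLEX10K-style bound: each S-box occupies one whole EAB."""
    return eabs_available // SBOXES_PER_ROUND


def max_rounds_by_clb(clbs_available, clbs_per_sbox=10):
    """XC4000E-style bound from S-box logic alone (10 CLBs per S-box)."""
    return clbs_available // (SBOXES_PER_ROUND * clbs_per_sbox)


def cycles_per_block(unrolled_rounds, total_rounds=16):
    """Clock cycles per block when `unrolled_rounds` rounds run per clock."""
    if total_rounds % unrolled_rounds != 0:
        raise ValueError("unrolled_rounds must divide total_rounds")
    return total_rounds // unrolled_rounds
```

For example, a device with 784 CLBs could hold the S-boxes of up to nine rounds by this bound, whereas a device with fewer than 16 EABs cannot hold the S-boxes of a second round in RAM.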
that this technique will allow a developer to improve the performance of the RC device even further, beyond the results published here. Other symmetric ciphers are based on algebraic operations (AO) (such as IDEA). Since the work done in this thesis is based on SP ciphers, we cannot make accurate predictions as to how AO ciphers will behave. Future work should include these ciphers as well.
12.4 Summary
This thesis has aimed to give the reader some insight into ATM networks, the issues involved in providing security over those networks, and an introduction to the issues in using reconfigurable logic for the main encryption hardware. We also provided data regarding the implementation of cryptographic algorithms in reconfigurable hardware in general, such as cost vs. speed, and how to assess an algorithm for its size and delay characteristics before any design work begins. Although there has been substantial work done in the area of reconfigurable architectures, very little has been done in terms of cryptographic algorithms. We hope that this work will alert both potential developers of security devices and reconfigurable hardware vendors to the viability of cryptographic applications and the need for further study. Crypto algorithms often exhibit patterns in their architecture that may be exploited by new hardware designs. With the recent interest in cryptographic technologies by mass market companies, the use of new hardware technologies will be of value. In closing, there has been a lot of work done in the last few years regarding ATM security. We believe that the concept of using reconfigurables for this technology is a promising and interesting addition to the growing interest in the field of high speed secure networks. It is hoped that through the experiments performed in this study, any designer can make an intelligent decision as to which hardware will meet the needs of their application.
Part V References
Appendix A DES
[Simulation waveform. Signals: KEY_IN, KU0_PC1CD, KU0_PC1KS, KU0_REGC, KU0_REGD, KU0_SC, KU0_SD, KU0_PC2IN, KU0_IN, DATA_IN, DATA_OUT, STAGEA_OUT, STAGEC_OUT, STAGEC2_OUT, STAGED_OUT, CLK, RESET, START, BUSY, DONE, CNTR_KSEL, CNTR_DSEL, CNTR_KU0. Horizontal axis: Time (Seconds), 1u to 4u, with a marker at 3.7009u. Values shown include DATA_OUT = A1A26510C55AC5E9\H, STAGEC_OUT = STAGEC2_OUT = 6F47F290F42854D5\H, and STAGED_OUT = F42854D5D387A022\H.]