Stream Control Transmission Protocol (SCTP) is a new reliable transport protocol for IP networks. SCTP avoids a very simple attack that affects TCP, the so-called SYN attack. It can take advantage of a multihomed host using all the IP addresses the host owns.
Supervisor: Professor Raimo Kantola
Instructor: John Loughney, M. Sc.
Author: Iván Arias Rodríguez
Helsinki University of Technology Electrical and Communications Engineering Department Networking Laboratory
Espoo, 12th of February, 2002
Abstract of Master's Thesis i
HELSINKI UNIVERSITY OF TECHNOLOGY
ABSTRACT OF MASTER'S THESIS
Author: Iván Arias Rodríguez
Title: Stream Control Transmission Protocol. The design of a new reliable transport protocol for IP networks
Date: February the 12th, 2002
Number of pages: 159
Department: Electrical and Communications Engineering
Laboratory: Networking
Supervisor: Raimo Kantola
Instructor: John Loughney, M. Sc.
There is an increasing need for internetworking between telephone and computer networks. Applications such as Voice over IP (VoIP) and the deployment of the 3rd Generation mobile telephony networks make this integration a necessity. The Signaling Transport (SIGTRAN) working group of the Internet Engineering Task Force (IETF) is in charge of designing the standards needed to make this internetworking possible. The primary purpose of this working group is to address the transport of packet-based Public Switched Telephone Network (PSTN) signaling over IP networks, taking into account the functional and performance requirements of PSTN signaling.
Among the multiple standards that have been defined by SIGTRAN there is one new reliable transport protocol, the Stream Control Transmission Protocol (SCTP). SCTP is the evolution of a previous transport protocol, the Multi-Network Datagram Transmission Protocol (MDTP), which was largely based on TCP. SCTP has several new features that make it more suitable for PSTN signaling transport than TCP. SCTP can take advantage of a multihomed host by using all the IP addresses the host owns. SCTP avoids a very simple attack that affects TCP, the so-called SYN attack. The new protocol also provides a mechanism, streams, that protects an application using SCTP from so-called Head-Of-Line (HOL) blocking. Moreover, many features that are optional in TCP have been included in the basic specification of SCTP, such as Selective Acknowledgements, the ability to report the receipt of duplicate datagrams, and support for Explicit Congestion Notification (ECN).
This Master's Thesis discusses the evolution of the design of SCTP. We will try to explain why the different aspects of SCTP were designed the way they were. When possible, we will explain how the characteristics of SCTP evolved from those of the initial MDTP, and we will show how SCTP and TCP behave in similar situations.
Keywords: Internet, Internet Protocol (IP), reliable transport protocol, Stream Control Transmission Protocol (SCTP), Signaling System #7 (SS7), Signaling Transport (SIGTRAN), Transmission Control Protocol (TCP).
Summary of the Final Degree Project ii
HELSINKI UNIVERSITY OF TECHNOLOGY
SUMMARY OF THE FINAL DEGREE PROJECT
Author: Iván Arias Rodríguez
Title: Stream Control Transmission Protocol. The design of a new reliable transport protocol for IP networks
Date: 12 February 2002
Number of pages: 159
Department: Electrical and Communications Engineering
Laboratory: Computer Networks
Supervisor: Raimo Kantola
Instructor: John Loughney, M. Sc.
Coordinators: Ángel Álvarez Rodríguez (ETSIT), Anita Bisi (HUT)
There is a growing need for integration between telephone networks and computer networks. New applications such as Voice over IP (VoIP) and the deployment of 3rd Generation mobile telephony make this integration increasingly necessary. The Signaling Transport (SIGTRAN) working group of the Internet Engineering Task Force (IETF) is in charge of producing the standards needed to make the integration of these networks possible. The main purpose of this working group is to handle the transport of packet-based Public Switched Telephone Network (PSTN) signaling over IP networks, taking into account the functions and performance required for transporting that signaling.
One of the new standards that emerged from the joint work of many engineers in SIGTRAN is the Stream Control Transmission Protocol (SCTP). SCTP is a new reliable transport protocol. Its initial goal was the transport of SS7 signaling packets over IP networks. The design of SCTP began in the summer of 1998, when Randall R. Stewart and Qiaobing Xie started designing a protocol they named the Multi-Network Datagram Transmission Protocol (MDTP).
In its initial design it was largely based on the Transmission Control Protocol (TCP), the reliable transport protocol par excellence in IP networks. In fact, the design of this protocol began before SIGTRAN even existed, and its original goal was to remedy some of the problems found when using TCP. Later, when SIGTRAN was created and started looking for the ideal transport protocol for its purposes, the group concluded that MDTP was the closest thing to what it was looking for. From that moment on, interest in MDTP rose, and its design began to be discussed on the mailing list that SIGTRAN had opened for that purpose. During the design phase, the initial structure of MDTP changed a great deal. It had to be adapted to the specific needs of SIGTRAN: the transport of telephone network signaling, above all that of the SS7 network. The final design of SCTP was published in Request For Comments (RFC) 2960 at the end of October 2000.
SCTP includes many improvements over TCP that make it more suitable than TCP for signaling transport, and it can even compete with TCP as a general-purpose reliable transport protocol on the Internet. SCTP has a mechanism for establishing associations (the equivalent of TCP connections) that makes it immune to the attack that floods a host with datagrams that have the SYN flag set. SCTP uses a four-step mechanism instead of the three steps that TCP uses. This lets servers verify the source IP address of the datagram that requests the association (the equivalent of a TCP datagram with the SYN flag set) before reserving any resources, which makes the attack impossible.
In TCP, a connection can only be established from one IP address to another IP address. A TCP connection is identified by the IP address and port of both the client and the server. Thus, if a machine has several network cards, each with its own IP address, it can use only one of them to establish a TCP connection with another machine. In SCTP, an association is identified by a set of IP addresses and a port on the client, and the set of IP addresses of the server and its port. In this way, if one of the IP addresses stops working, any of the others can still be used.
Another innovation with respect to TCP is that SCTP can avoid head-of-line blocking through the use of streams. This blocking occurs in TCP when several independent messages, split into datagrams, are sent over a single connection.
In this situation, even if a message has arrived completely at the receiver, it cannot be passed to the user until all earlier messages have also arrived completely. SCTP allows the use of streams, which are subconnections within an SCTP association, so that datagrams addressed to different streams are handled independently. In addition, SCTP can delimit individual messages within the byte stream, so the user does not need to insert its own markers. Messages can even be sent in such a way that the receiver passes them to the user as soon as they are received, without preserving the order in which they were sent.
SCTP uses several IP addresses (multihoming) on both the client and the server. However, only one of them, the primary address, is used to send data. The rest are kept in reserve and are used only if the primary address fails. To know the state of these backup IP addresses, SCTP therefore has the so-called heartbeat mechanism. It consists of sending messages to the IP addresses that are not being used to send data. These messages, or heartbeats, must be answered, so that receiving the response shows that those addresses are still active.
One of the main problems of TCP is that it is very hard to extend. When a new feature is to be added to TCP, the limited space that was reserved for future use when TCP was designed often makes this impossible. SCTP is a very open protocol that has been designed to be extensible by nature. SCTP contains a set of basic functions, and it has been designed so that any additional feature wanted in the future can be included very easily. In addition, a machine that has an SCTP association with another can send it error messages, so that certain errors at the transport protocol level can be resolved without affecting the user. These error messages also serve to negotiate the use of optional functions: an older version of SCTP that does not support a new function can say so by sending the appropriate error message.
TCP has been the reliable transport protocol par excellence for the last two decades. That is why many of the features of SCTP have been taken directly from TCP. Most of the extensions that have been written for TCP have been included in the basic version of SCTP. Among them are the use of selective acknowledgements, the ability to report the receipt of duplicate datagrams, and the support for Explicit Congestion Notification (ECN). In addition, SCTP uses the same congestion avoidance algorithms as TCP. In this way, when applications that use either SCTP or TCP as their transport protocol coexist, an SCTP association and a TCP connection are allotted the same bandwidth. To keep the move from TCP to SCTP from being drastic, a sockets interface has been defined that is as similar as possible to that of TCP. Thus, the changes needed to make an application use SCTP instead of TCP are minimal.
Although SCTP is a new protocol, and a new SCTP specification is even expected to be issued in the future because of the flaws found in RFC 2960, there are already numerous public implementations on the Internet.
This will give application programmers easy access to SCTP and let them start using it as soon as possible. Although SCTP is not a simple protocol, there are implementations that take up less than 100 Kbytes, which means that they can be used even in small devices.
The author of this Final Degree Project has been working with the engineers who designed SCTP and its later extensions since September 1999, taking an active part in the design. He has also written an implementation of SCTP and has attended several interoperability sessions between different implementations. During this time the author has acquired a broad view of the process of designing a reliable transport protocol. For this reason, this Final Degree Project discusses the evolution of the design of SCTP. It tries to explain the reasons behind the different aspects of SCTP. Moreover, since SCTP is an evolution of MDTP, it tries to trace some aspects of SCTP back to their initial design in MDTP. And since TCP is the transport protocol with which SCTP will have to compete, the behavior of SCTP and TCP in similar situations is compared. The author hopes that this compilation of data, gathered over two and a half years, can be useful to future designers of similar protocols.
Keywords: Internet, Internet Protocol (IP), reliable transport protocol, Stream Control Transmission Protocol (SCTP), Signaling System #7 (SS7), Signaling Transport (SIGTRAN), Transmission Control Protocol (TCP).
Preface v
PREFACE
The work of this Master's Thesis was carried out at the Communication Systems Laboratory of the Nokia Research Centre located in Helsinki. It was supervised, however, by the Networking Laboratory of the Department of Electrical and Communications Engineering at the Helsinki University of Technology.
If somebody had told me that I would still be living in Helsinki almost two and a half years after I arrived here, I would have simply laughed. I came to Helsinki in September 1999 with an Erasmus grant, initially for six months. I had already extended my stay in this supposedly cold city by three more months before Christmas. After that came another year, and then another one. I do not know whether I have done good work writing this Master's Thesis, but I am sure that I did my best, as I have never been in a hurry to finish it. Who would be, living in this wonderful city, Helsinki, and working in this wonderful company, Nokia?
I would like to use this page to thank all the people who helped me become an engineer. There are so many that I do not know where to start. Let us start with Ramón Francisco Alfonso Pujante, the great guy who not only influenced my choice of career, but also helped me so much to continue studying in Madrid while living in Barcelona. Thank you very much for sending me all the notes you took during the lessons I had to miss. Thank you for all the formalities you had to handle for me at the university. And thanks for being my friend. Without your help, I would probably never have finished my studies.
I would also like to thank all the people from the Barcelona branch of the National Bank of Spain. They did so much to make it easier for me to study while working hard there, and they constantly encouraged me to continue. Thanks, I spent really good years living in Barcelona, the enemy of Madrid.
Thanks to all the people I met in Helsinki. I have had such a lovely time here that when I think about it and remember all those nice moments, it is hard to believe. So many people... Marghe, Gusi, Martina, Pepelu, Albert, Willy, Óscar, Paulito, Santi, Gaizka... I could continue for several pages if I tried to mention them all. However, I would especially like to thank Javichu, who has been my flatmate, workmate, wonderful friend (and cook!) for almost two years. Thanks to all of you. You are the main reason why I have never been in a hurry to finish this work.
I would also like to thank the people in charge of Socrates/Erasmus matters, especially Ángel Álvarez, who really takes care of all of us. He is, however, planning to resign soon after many years as the Socrates/Erasmus Coordinator. It is really a pity; the new generations will lose a great and competent guy.
Thanks also to the people here at the Nokia Research Centre who make it a lovely place to work. There is a guy here at the NRC called John Loughney. They say he is my boss, but I do not think so. In reality, he is my problem solver, and he always seems happy with what I do, which really makes me feel good. Thanks to him, coming to work is one of the enjoyable things I do every day. If I am ever somebody's boss, I will try to do it as you do. Thanks a lot!
I would also like to thank Raimo Kantola, my supervisor at the Helsinki University of Technology. This busy guy really knows how to do his work. He really cares about what he does, so if he tells you that something would be better done another way, believe him; I would. Thanks for your advice, Raimo.
And last but not least, I would like to thank my parents and siblings for everything they have done during these last 25 years. Hei people, I am becoming an engineer!
Helsinki, February the 12th, 2002
Iván Arias Rodríguez
Contents vii
CONTENTS
ABSTRACT OF MASTER'S THESIS ..... I
SUMMARY OF THE FINAL DEGREE PROJECT (RESUMEN DEL PROYECTO DE FIN DE CARRERA) ..... II
PREFACE ..... V
CONTENTS ..... VII
LIST OF FIGURES ..... IX
LIST OF TABLES ..... X
LIST OF ACRONYMS AND ABBREVIATIONS ..... XI
1. INTRODUCTION ..... 1
2. BACKGROUND ..... 3
  2.1 TELEPHONY SIGNALING: A LITTLE BIT OF HISTORY ..... 3
  2.2 THE SS7 NETWORK: WHAT IS THAT? ..... 5
    2.2.1 Functional Architecture of SS7 ..... 7
      2.2.1.1 The Service Switching Point (SSP) ..... 9
      2.2.1.2 The Signal Transfer Point (STP) ..... 9
      2.2.1.3 The Service Control Point (SCP) ..... 11
      2.2.1.4 The Signaling Links ..... 11
    2.2.2 Protocol Architecture of SS7 ..... 13
      2.2.2.1 The Message Transfer Part (MTP) ..... 14
      2.2.2.2 The Signaling Connection Control Part (SCCP) ..... 16
      2.2.2.3 The Transaction Capabilities Application Part (TCAP) ..... 16
      2.2.2.4 The ISDN User Part (ISUP) ..... 17
  2.3 THE LARGEST COMPUTER NETWORK: THE INTERNET ..... 17
    2.3.1 A quick history of the Internet: From military use to worldwide business tool ..... 18
    2.3.2 The basis of the Internet: The internals of the Internet Protocol (IP) ..... 21
  2.4 A MARRIAGE OF CONVENIENCE: REASONS FOR SS7 AND IP NETWORKS INTEGRATION ..... 25
    2.4.1 Voice over IP ..... 26
    2.4.2 The 3rd Generation Mobile Telephony ..... 29
  2.5 THIS IS WHAT WE WERE LOOKING FOR ..... 31
    2.5.1 The need of a new transport protocol ..... 32
    2.5.2 A proposal that IETF could not refuse ..... 34
3. THE DESIGN OF SCTP: DATAGRAM STRUCTURE ..... 37
  3.1 SHAPE OF SCTP DATAGRAMS: AN EVOLUTION FROM MDTP ..... 37
    3.1.1 Common header and internal structure of MDTP ..... 37
    3.1.2 Common header and internal structure of SCTP ..... 40
  3.2 SCTP ASSOCIATION MANAGEMENT: THE STATE DIAGRAM ..... 48
4. AN ASSOCIATION'S BIRTH: FROM A TWO-WAY TO A FOUR-WAY HANDSHAKE ..... 51
  4.1 THE EVOLUTION OF THE ESTABLISHMENT PHASE ..... 51
  4.2 COOKIES AGAINST THE ATTACKERS ..... 52
  4.3 THE FIRST TWO LEGS: THE INIT AND THE INIT ACK CHUNKS ..... 54
    4.3.1 The parameters ..... 59
      4.3.1.1 What is your address? ..... 59
      4.3.1.2 The king of the parameters: The State Cookie ..... 63
      4.3.1.3 Other parameters ..... 64
  4.4 THE LAST TWO LEGS: THE COOKIE ECHO AND COOKIE ACK CHUNKS ..... 65
5. DOING THE HARD WORK: TRANSMISSION OF DATA ..... 67
  5.1 BASIC DATA TRANSMISSION ..... 67
  5.2 SOME SOLUTIONS TO AVOID CONGESTION ..... 70
  5.3 SEVERAL CONNECTIONS INSIDE A SINGLE ASSOCIATION: THE USE OF STREAMS ..... 76
  5.4 SIZE MATTERS: MTU DISCOVERY ..... 80
  5.5 I WILL WAIT FOR YOU: RTO CALCULATION ..... 85
  5.6 THE IDEAS LEFT ON THE WAY ..... 87
6. IT IS NOT ALL PLAIN DATA ..... 89
  6.1 ARE YOU ALIVE? THE PATH HEARTBEAT MECHANISM ..... 89
  6.2 YOU ARE WRONG: THE OPERATIONAL ERROR CHUNK ..... 92
7. THIS IS THE END: THE SHUTDOWN AND ABORT ALGORITHMS ..... 95
  7.1 TERMINATING ASSOCIATIONS IN MDTP ..... 95
  7.2 A HARD END FOR AN ASSOCIATION'S LIFE: ABORTING AN ASSOCIATION IN SCTP ..... 96
  7.3 I AM DONE, COULD YOU FINISH AS WELL? THE SHUTDOWN PROCEDURE ..... 97
8. AND NOW? SCTP EXTENSIONS AND SCTP USERS ..... 102
  8.1 THE SCTP EXTENSIONS ..... 103
    8.1.1 This is my new address: Adding and deleting addresses, and per stream flow control ..... 103
    8.1.2 Can I trust you? Reliable and unreliable streams ..... 105
    8.1.3 Be ready to adapt to your environment: The adaptive Fast Retransmit algorithm ..... 107
  8.2 IS ANYBODY USING SCTP? SOME APPLICATIONS THAT USE SCTP ..... 108
9. CHANGES TO BE MADE IN RFC 2960 ..... 111
  9.1 THE CHECKSUM DILEMMA ..... 111
    9.1.1 The good old days: Letting others protect the data integrity ..... 111
    9.1.2 The quest for a stronger scheme: The Cyclic Redundancy Check ..... 112
    9.1.3 From a 16-bit to a 32-bit checksum ..... 115
    9.1.4 The Adler-32 Checksum: We have a problem ..... 116
    9.1.5 Going back to the roots: Using the CRC-32 as the checksum ..... 117
  9.2 ERRATA: THE IMPLEMENTORS GUIDE ..... 119
10. CONCLUSIONS ..... 122
APPENDIX A: CONTENTS OF THE CD-ROM ..... 125
APPENDIX B: OTHER SOURCES OF INFORMATION ABOUT SCTP ..... 127
BIBLIOGRAPHY ..... 128
INDEX ..... 140
List of Figures ix
LIST OF FIGURES
FIGURE 2-1: EVOLUTION OF TELEPHONE NETWORK ..... 3
FIGURE 2-2: FUNCTIONAL ARCHITECTURE OF SS7 ..... 8
FIGURE 2-3: SS7 PROTOCOL ARCHITECTURE ..... 13
FIGURE 2-4: INTERNET'S GROWTH (1981-2001) ..... 20
FIGURE 2-5: WORLDWIDE INTERNET POPULATION (AUGUST 2001) ..... 20
FIGURE 2-6: THE IP HEADER ..... 22
FIGURE 2-7: SOME MEMBERS OF THE INTERNET PROTOCOL SUITE ..... 25
FIGURE 2-8: SIGTRAN FUNCTIONAL MODEL ..... 32
FIGURE 3-1: MDTP DATAGRAM STRUCTURE IN ITS FIRST VERSION ..... 38
FIGURE 3-2: STRUCTURE OF SCTP DATAGRAMS ..... 41
FIGURE 3-3: SCTP CONNECTION MANAGEMENT FINITE STATE MACHINE ..... 49
FIGURE 4-1: ESTABLISHMENT PROCEDURE IN MDTP ..... 51
FIGURE 4-2: SYN ATTACK IN TCP ..... 53
FIGURE 4-3: ESTABLISHMENT PHASE IN SCTP (FIRST TWO LEGS) ..... 55
FIGURE 4-4: TRANSMISSION OF 64 KILOBYTES FROM MADRID TO HELSINKI ..... 57
FIGURE 4-5: BASIC NAT OPERATION ..... 61
FIGURE 4-6: ESTABLISHMENT PHASE IN SCTP (LAST TWO LEGS) ..... 65
FIGURE 5-1: BASIC DATA TRANSMISSION ..... 69
FIGURE 5-2: TWO CAUSES OF CONGESTION ..... 71
FIGURE 5-3: EVOLUTION OF CWND WITH AND WITHOUT PACKET LOSSES ..... 74
FIGURE 5-4: USE OF FAST RETRANSMIT IN [STE2000] AND [STE2002B] ..... 75
FIGURE 5-5: HEAD OF LINE BLOCKING ..... 78
FIGURE 5-6: IP FRAGMENTATION ..... 83
FIGURE 5-7: PROBABILITY DENSITY OF ACKNOWLEDGEMENT ARRIVAL TIMES ..... 86
FIGURE 6-1: THE PATH HEARTBEAT MECHANISM IN SCTP ..... 91
FIGURE 6-2: THE ERROR CHUNK IN SCTP ..... 92
FIGURE 7-1: THE ABORT PROCEDURE IN SCTP ..... 97
FIGURE 7-2: THE SHUTDOWN PROCEDURE IN SCTP ..... 99
FIGURE 7-3: THE TWO-ARMY PROBLEM ..... 100
FIGURE 8-1: EVOLUTION OF THE ADDIP DRAFT ..... 104
FIGURE 8-2: SS7-IP ADAPTATION LAYERS ..... 109
FIGURE 9-1: HARDWARE IMPLEMENTATION OF CRC-CCITT ..... 114
LIST OF TABLES
TABLE 2-1: DIFFERENCES BETWEEN THE TELEPHONE AND IP NETWORKS ... 27
TABLE 5-1: SOME MTUS FOUND IN THE INTERNET ... 82
TABLE 9-1: ERROR-DETECTION CAPABILITIES OF SEVERAL CHECKSUMS ... 117
TABLE 9-3: CALCULATION TIME CONSUMED BY SEVERAL CHECKSUMS ... 118
LIST OF ACRONYMS AND ABBREVIATIONS
Numerics
1G 1st Generation Mobile Telephony
2G 2nd Generation Mobile Telephony
3G 3rd Generation Mobile Telephony
3GPP 3rd Generation Partnership Project
A
A Access links
ABORT Abort
ACM Association for Computing Machinery
AH Authentication Header
AIMD Additive Increase Multiplicative Decrease
ALG Application Level Gateway
AMPS Advanced Mobile Phone System
AMR Adaptive Multi Rate
ANS Advanced Network and Services
ANSI American National Standards Institute
API Application Programming Interface
ARPA Advanced Research Projects Agency
ARIB Association of Radio Industries and Businesses
ASE Application Service Element
AT&T American Telephone and Telegraph
ATM Asynchronous Transfer Mode
B
B Bridge links (SS7)
B Beginning fragment flag (SCTP)
BISDN Broadband ISDN
BISUP Broadband ISDN User Part
BSD Business Services Database
C
C Cross links
CCITT Comité Consultatif International Télégraphique et Téléphonique
CCITT International Telegraphy and Telephony Consultative Committee
CCS Common Channel Signaling
CDMA Code Division Multiple Access
CERN Conseil Européen pour la Recherche Nucléaire
CERN European Council for Nuclear Research
CERT Computer Emergency Response Team
CMSDB Call Management Services Database
COOKIE ACK Cookie Acknowledgement
COOKIE ECHO State Cookie
COPS Common Open Policy Service
CRC Cyclic Redundancy Check
CRC-16 Cyclic Redundancy Check of 16 bits
CRC-32 Cyclic Redundancy Check of 32 bits
CRC-32c Cyclic Redundancy Check of 32 bits studied by Castagnoli
CRC-CCITT Cyclic Redundancy Check standardized by the CCITT
CSIP Connectionless SCCP over IP Adaptation Layer
CSL Component Sublayer
CTP Common Transport Protocol
cwnd Congestion Window
CWR Congestion Window Reduced
CWTS Chinese Wireless Telecommunication Standard
D
D Diagonal links (SS7)
D Delay flag (TCP)
DATA Payload Data
DF Don't Fragment flag
DiffServ Differentiated Services
DNS Domain Name System
DoD Department of Defense
DPC Destination Point Code
DSCP Differentiated Services Codepoint
DTMF Dual Tone Multi-Frequency
DUP Data User Part
E
E Extended links (SS7)
E Ending fragment flag (SCTP)
E2E End-to-End
ECN Explicit Congestion Notification
ECNE Explicit Congestion Notification Echo
EDGE Enhanced Data rates for GSM Evolution
ERROR Operation Error
ESP Encapsulating Security Payload
ETSI European Telecommunications Standards Institute
F
F Fully associated links
FDDI Fiber Distributed Data Interface
FTP File Transfer Protocol
G
GNU GNU's Not Unix
GPRS General Packet Radio Service
GSM Global System for Mobile Communications
H
HEARTBEAT Heartbeat Request
HEARTBEAT ACK Heartbeat Acknowledgement
HLR Home Location Register
HMAC Keyed-Hashing algorithm for Message Authentication
HOL Head-of-line
HSCSD High Speed Circuit Switched Data
HTML Hypertext Markup Language
HTTP Hypertext Transfer Protocol
I
IANA Internet Assigned Numbers Authority
ICMP Internet Control Message Protocol
ICMPv6 Internet Control Message Protocol for IPv6
ICV Integrity Check Value
IEEE Institute of Electrical and Electronics Engineers
IETF Internet Engineering Task Force
IHL Internet Header Length
IKE Internet Key Exchange
IMT-2000 International Mobile Telecommunications 2000
IN Intelligent Network
INIT Initiation
INIT ACK Initiation Acknowledgement
IP Internet Protocol
IPsec IP Security Protocol
IPv4 Internet Protocol version 4
IPv6 Internet Protocol version 6
ISC Internet Software Consortium
ISDN Integrated Services Digital Network
ISO International Organization for Standardization
ITSP Internet Telephony Service Provider
ISUP ISDN User Part
ITU International Telecommunication Union
ITU-D ITU Development Sector
IUA ISDN Q.921-User Adaptation Layer
L
LAN Local Area Network
LAPD Link Access Procedures on the D-channel
LIDB Line Information Database
LNP Local Number Portability
M
M2PA MTP2-User Peer-to-Peer Adaptation Layer
M2UA MTP2-User Adaptation Layer
M3UA MTP3-User Adaptation Layer
MAC Message Authentication Code
MAC Medium Access Control
MD5 Message Digest 5
MDTP Multi-Network Datagram Transmission Protocol
MF Multi-Frequency
MF More Fragments flag (TCP)
MG Media Gateway
MGC Media Gateway Controller
MIB Management Information Base
MMUSIC Multiparty Multimedia Session Control
MPLS Multiprotocol Label Switching Architecture
MSS Maximum Segment Size
MTP Message Transfer Part
MTP1 MTP Level 1
MTP2 MTP Level 2
MTP3 MTP Level 3
MTU Maximum Transmission Unit
N
NCSA National Center for Supercomputing Applications
NFS Network File System
NIF Nodal Interworking Function
NMT Nordic Mobile Telephone
NREN National Research and Education Network
NSF National Science Foundation
NSP Network Service Part
NUP National User Part
O
OOTB Out Of The Blue
OPC Origination Point Code
OSI Open Systems Interconnection
OSPF Open Shortest Path First
P
PCR Preventive Cyclic Retransmission
PDC Personal Digital Cellular
POTS Plain Old Telephone Service
PSTN Public Switched Telephone Network
Q
QoS Quality of Service
R
R Reliability flag
RAP Resource Allocation Protocol
RFC Request For Comments
RSVP Resource Reservation Protocol
RTCP RTP Control Protocol
RTO Retransmission Time-Out
RTP Real-time Transport Protocol
RTSP Real Time Streaming Protocol
RTT Round Trip Time
RTTVAR Round Trip Time Variation
RUDP Reliable UDP
S
SACK Selective Acknowledgement
SAP Session Announcement Protocol
SCCP Signaling Connection Control Part
SCN Switched Circuit Network
SCP Service Control Point
SCTP Stream Control Transmission Protocol
SDP Session Description Protocol
SF Single Frequency
SG Signaling Gateway
SHA-1 Secure Hash Algorithm 1
SHUTDOWN Shutdown
SHUTDOWN ACK Shutdown Acknowledgement
SHUTDOWN COMPLETE Shutdown Complete
SIGTRAN Signaling Transport
SIO Service Indicator Octet
SIP Session Initiation Protocol
SMTP Simple Mail Transfer Protocol
SNMP Simple Network Management Protocol
SP Signaling Points
SRTT Smoothed Round Trip Time
SS7 Signaling System #7
SSCOP Service Specific Connection-Oriented Protocol
SSN Subsystem Number
SSN Stream Sequence Number
SSP Service Switching Point
ssthresh Slow Start Threshold
SSTP Simple SCCP Tunneling Protocol
STP Signal Transfer Point
SUA SCCP-User Adaptation Layer
T
T TCB Missing flag (SCTP)
T Throughput flag (TCP)
T1 Standardization Committee T1-Telecommunications
TACS Total Access Communication System
T/UDP UDP for TCAP
TCAP Transaction Capabilities Application Part
TCB Transmission Control Block
TCP Transmission Control Protocol
TDM Time Division Multiplexing
TDMA Time Division Multiple Access
TFTP Trivial File Transfer Protocol
TLS Transport Layer Security
TLV Type-Length-Value
TOS Type of Service
TSL Transaction Sublayer
TSN Transmission Sequence Number
TSVWG Transport Area Working Group
TTA Telecommunications Technology Association
TTC Telecommunication Technology Committee
TUP Telephone User Part
U
U Unordered flag
UDP User Datagram Protocol
UMTS Universal Mobile Telecommunications System
URI Uniform Resource Identifier
URL Uniform Resource Locator
V
VLR Visitor Location Register
VoIP Voice over IP
W
WATS Wide Area Telephone Service
WWW World Wide Web
1. INTRODUCTION
Our society is quite used to telephones. One simply picks up any of the available telephone sets, dials a number using a rotary dial or a keypad, or simply says the name of the person to be called, and within a few seconds is speaking with the desired person. That person could have a mobile phone and be anywhere in the world, but it does not matter. He may be speaking with someone else at that moment, but he can be notified of our call and answer us, or we can even join the other conversation if he wishes. He may not be available at the moment, but then we can leave a voice mail message, or our call can be redirected to the location where he currently is. All this seems such a simple thing to do, but beneath it lie the joint work of many people and the constant evolution of the technologies used to carry voice from one point to another.
What makes all this possible is telephony signaling. This term refers to the information transferred inside the telephone network that is used to establish, monitor and terminate a telephone call. In a broader sense, it is applied to any data flow related to the management of any of the internal elements or databases of the telephone network. This is what makes possible services such as billing, roaming of mobile phones, toll-free numbers, televoting or calling card validation.
Telephony signaling has existed since the very beginning of the history of the telephone, and it is as important as voice transport itself, if not more, as the whole operation of the telephone network relies on it. It has evolved continuously, especially during the last 25 years, when the marriage between telephone and computer became effective and the differences between computer and telephone networks started to disappear.
The computer is another tool that is becoming more common.
It is not as widespread as the telephone because it is a newer invention and is more expensive than a telephone (or at least it was until recently). Many people work with a computer, and it is harder and harder to imagine a computer that is not connected to some computer network. People are getting used to sending emails or surfing the web in the same way they write letters or read the newspaper. But again, those applications that simply work are the fruit of many years of study and constant evolution.
This Master's Thesis deals with the effort made (and still to be made) to join the telephone and computer networks, and with the steps recently taken (and those that should be taken in the future) to achieve that objective.
In the next chapter we will give some background on telephony signaling. We will speak not only about telephony signaling networks but also about computer networks and the advantages of joining both types of networks.
The subsequent chapters of this Master's Thesis are devoted to one of the key protocols that will make that merger possible, the Stream Control Transmission Protocol (SCTP), which is the main topic of this paper. We will try to give a historical perspective of the design of SCTP, discussing how it evolved from a TCP-like protocol to its final specification.
In chapter 3 we will show the structure of SCTP datagrams and the finite state machine that models its behavior.
In chapter 4 we will look in detail at the establishment procedure of SCTP. We will show how it differs from the way TCP sets up a connection, and the main advantages of SCTP's scheme over TCP's.
Chapter 5 is one of the most important ones. It discusses how SCTP performs its main task, the transmission of data. Apart from the basic data transmission, we will speak about the mechanism SCTP uses to avoid congestion in the network, and about how SCTP calculates the Maximum Transmission Unit (MTU) and the Retransmission Time-Out (RTO). We will also explain what streams are and how they are used. Finally, we will comment on some ideas that were discarded during the design phase.
Chapter 6 is dedicated to the information transferred between two hosts that is not user data but internal SCTP messages that help in the management of an SCTP association 1 .
We will show in chapter 7 the different ways to tear down an association.
In chapter 8 we will show the SCTP extensions defined so far and we will briefly speak about applications that use SCTP as their transport protocol.
Chapter 9 summarizes the changes that are going to be made to the SCTP specifications. A new version of the SCTP specifications including these changes will be released within the next few months.
Finally, in chapter 10 we will present our conclusions about SCTP, and what we think about its future.
1 The term association identifies, in SCTP, one transport session between two peers. It is equivalent to the term connection used in TCP.
2. BACKGROUND
In this chapter we will first quickly review the history of telephony signaling. We will then explain the main characteristics of the biggest telephony signaling network nowadays, Signaling System #7 (SS7). We will also review its equivalent in the computer network world, the Internet: a mixture of heterogeneous computer networks held together by a common protocol that acts as the glue between them, the Internet Protocol (IP). At the end of the chapter, we will discuss the reasons for and benefits of merging both networks into a single one, and what is needed to do so.
2.1 Telephony signaling: A little bit of history
Alexander Graham Bell patented the telephone in 1876, and immediately there was a huge demand for the new invention. Initially, phone usage was very simple. There was no such thing as a telephone company; instead, telephone sets were sold in pairs (much like present-day walkie-talkies), and the happy owner was in charge of physically establishing the line by stringing a single wire between them (the earth acted as the return path, so just one wire was needed). The telephones did not even have a ringer, and a call was set up by simply shouting into the microphone and hoping that the other party was close enough to his phone to hear. This only gave the telephone owner the possibility of speaking with one other customer: one needed as many telephone sets as there were people one wanted to speak with. Figure 2-1 (a) shows this situation for 9 people who all want to be connected to each other.
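The scaling problem of this model can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the original thesis: the function names are illustrative, and the one-wire-per-subscriber count assumes the central switching office that the following paragraphs describe.

```python
def full_mesh_wires(subscribers: int) -> int:
    """Wires needed when every pair of subscribers has a dedicated line."""
    return subscribers * (subscribers - 1) // 2


def sets_per_subscriber(subscribers: int) -> int:
    """Telephone sets each subscriber needs in the pairwise model."""
    return subscribers - 1


def central_office_wires(subscribers: int) -> int:
    """Wires needed when each subscriber has one line to a switching office."""
    return subscribers


if __name__ == "__main__":
    for n in (9, 100, 1000):
        print(n, full_mesh_wires(n), central_office_wires(n))
```

For the 9 people of Figure 2-1 (a), the pairwise model needs 36 wires and 8 sets per person, while a single office needs only 9 wires. The quadratic growth of the first figure is what made the model unworkable.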
Figure 2-1: Evolution of telephone network: (a) Fully interconnected network, (b) Centralized switch, (c) Two level hierarchy
Within one year the cities were covered with wires passing over houses and trees in a wild jumble, and it became obvious that this model of connection was not going to work. Taking advantage of this, Bell created the Bell Telephone Company and opened the first switching office in 1878. The company ran a wire to each customer's house or office. When they wanted to use the telephone, they had to lift the receiver, allowing DC current
to flow through the telephone and back through the return of the circuit, turning on a lamp on the operator's switchboard. Usually the subscriber had to crank the phone to make a ringing sound in the telephone company office to attract the attention of the operator, who connected him to the callee using a jumper cable. This way, a customer could speak with all the other customers connected to the same switching office with just a single telephone set (now equipped with a ringer) and a single line (now a balanced, insulated twisted pair). This model is illustrated in Figure 2-1 (b).
As the telephone became increasingly popular, people wanted to make long-distance calls between cities, so the switching offices were interconnected. But then the same problem of interconnecting all the offices arose again, and a second level of offices was created, as shown in Figure 2-1 (c). Eventually the hierarchy grew to five levels.
The DC current flow and the ringer were the first type of telephony signaling ever used to establish and terminate phone calls, although most of the work was done manually by the operator. Signaling has evolved since then: today it carries much more information than this early method could, and human intervention has been reduced to a minimum.
Telephony signaling was initially limited by the fact that the same circuit was used to carry both the voice and the signaling, a method called in-band signaling. Moreover, telephony signaling was analog and had a small number of possible states, so little information could be handled and operator intervention was necessary most of the time. To make things worse, with the in-band signaling approach the circuit used for the telephone call was busy from the very moment the caller started dialing until the caller went on-hook.
Thus, telephone companies were quickly running out of circuits to meet the demand they had, as customers started to be counted in millions and created an enormous amount of traffic. On the one hand, telephone companies needed a new way of managing calls that would save the substantial investments required to add new facilities 2 . On the other hand, they needed methods to support the new services that subscribers were demanding.
In the early sixties, the European telephone companies started to digitize their networks. One of the first steps taken was to stop using the voice network for signaling and to use instead a separate network dedicated solely to that purpose, a practice known as Common Channel Signaling (CCS). This new approach immediately brought some benefits. For example, the setup and teardown procedures could be done more quickly and were less error-prone. Digitization of phone lines not only improved the quality of calls (especially long-distance ones) but also made equipment cheaper.
CCS is in wide use today, SS7 being the protocol and architecture presently used in this relatively new network. Nevertheless, in the history of telephony signaling, many other methods have been used:
DC signaling: This was the first type of signaling used. When a subscriber went off-hook, DC current flowed from the central office through the telephone and back to the office. A DC current detector provided a dial tone, and the subscriber dialed the number using a rotary dial, which used a relay to interrupt the current, creating pulse bursts (10 pulses per second). The central office determined the number dialed and established a circuit to the callee. The callee was alerted by the ringer of his telephone, and the caller meanwhile received a calling tone. When the distant party answered, the tone was interrupted and the circuit then carried the voice. The circuit was released when either party hung up. The limitations of this system are obvious: the signaling is limited to seizing circuits, call supervision and disconnection.
In-band signaling: This way of signaling relies on the use of tones at certain frequencies instead of DC current. The tones are transmitted over the same circuit as the voice, and thus they must be within the voice band (0 to 4 kHz). The tones are designed to minimize the possibility of voice frequencies duplicating the signaling tones, but the scheme is not 100% foolproof. The tones sent can be Single Frequency (SF) tones, still used in some parts of the telephone network for interoffice trunks; or Multi-Frequency (MF) or Dual Tone Multi-Frequency (DTMF) tones, mostly used to send dialed digits through the telephone network to the destination end office. Apart from the possibility of speech being misinterpreted as signaling tones, this method requires expensive tone detectors and is still limited in the number of different values it can handle.
Out-of-band signaling: It is much the same as in-band signaling, except that the analog voice carried in the circuit is limited to 3.5 kHz and the band between that frequency and 4 kHz is left for signaling tones. It has the same problems as in-band signaling, except that there is no risk of false signaling.
Digital signaling: One of the techniques used for signaling once the telephone network was digitized was to use certain bits in the voice trunk for signaling (a bit was robbed from certain frames). This practice did not hurt the quality of the digitized speech, or at least not enough to be detected by the human ear. It is more cost-effective than the other methods discussed so far, but it is still limited in the type of signaling it can provide, as it is not message-based.
Common channel signaling: It is digital as well, but its main property is that it places the signaling information in a time slot or channel separate from the voice and data it relates to, so the voice or data trunks are used only to carry speech or user data. This method is presently used in SS7 and in the Integrated Services Digital Network (ISDN). It is capable of sending and receiving messages that can take an unlimited range of values and thus can be extended to support new functionality. Moreover, it can not only control the state of telephone calls but also make queries and fetch data from remote databases to support special services.
2 Part of those facilities was the human operator. A story is told that the American Telephone and Telegraph (AT&T), in the early 1930s, predicted that by the mid-1950s every woman of working age in the USA would be employed by them as an operator, due to the expected increase in call volume and the available technology.
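To illustrate how little information the earliest methods in this list can carry, here is a minimal sketch of dial-pulse and DTMF digit encoding. It is an illustration under stated assumptions, not taken from the thesis: the convention that digit 0 is sent as ten pulses is the common one but was not universal, inter-digit pauses are ignored, the DTMF row and column frequencies are the standard keypad values, and all function names are hypothetical.

```python
PULSES_PER_SECOND = 10  # pulse rate quoted in the text for rotary dials

# Standard DTMF keypad: one low-group tone per row, one high-group tone per column.
LOW_GROUP_HZ = (697, 770, 852, 941)
HIGH_GROUP_HZ = (1209, 1336, 1477, 1633)
KEYPAD_ROWS = ("123A", "456B", "789C", "*0#D")


def pulse_count(digit: int) -> int:
    """Pulses sent for one dialed digit (assumption: 0 is sent as 10 pulses)."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be between 0 and 9")
    return 10 if digit == 0 else digit


def pulse_time_seconds(number: str) -> float:
    """Time spent sending pulses for a number, ignoring inter-digit pauses."""
    return sum(pulse_count(int(d)) for d in number) / PULSES_PER_SECOND


def dtmf_pair(key: str) -> tuple[int, int]:
    """Return the (low, high) tone pair in Hz that encodes one keypad key."""
    if len(key) != 1:
        raise ValueError("key must be a single character")
    for row, keys in enumerate(KEYPAD_ROWS):
        column = keys.find(key)
        if column != -1:
            return LOW_GROUP_HZ[row], HIGH_GROUP_HZ[column]
    raise ValueError(f"not a DTMF key: {key!r}")


if __name__ == "__main__":
    print(pulse_time_seconds("105"))
    print(dtmf_pair("5"))
```

Every DTMF tone lies below 4 kHz, so it fits in the voice band, which is exactly why speech can occasionally be mistaken for signaling. Both schemes can express only one symbol out of a small fixed alphabet at a time, in contrast with the message-based approach of common channel signaling.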
Among all these signaling methods, the most important one is the last one, CCS, and it is the only one that will be discussed further. In the next section we give an overview of SS7 and briefly discuss its functional and protocol architectures.
2.2 The SS7 network: What is that?
CCS is more flexible and powerful than in-channel signaling and is well suited to support the requirements of integrated digital networks. The culmination of the transition of network control signaling from an in-channel to a common-channel approach is SS7, first issued by the Comité Consultatif International Télégraphique et Téléphonique (CCITT) 3 in 1980, with revisions every four years. SS7 is designed to be an open-ended common-channel signaling standard that can be used over a variety of digital circuit-switched networks. The overall purpose of SS7 is to provide an internationally standardized general-purpose common-channel signaling system with the following primary characteristics:
Optimized for use in digital telecommunication networks in conjunction with digital stored program-control exchanges, using 64 kbps digital channels. However, it is also suitable for operation over analog channels and at speeds below 64 kbps.
Designed to meet present and future information transfer requirements for call control, remote control, management, and maintenance.
Provides a reliable means for the transfer of in-sequence information without loss or duplication.
Suitable for use on point-to-point terrestrial and satellite links.
The scope of SS7 is large, since it must cover all aspects of control signaling for complex digital networks. The fact that the SS7 specifications consist of 53 ITU-T Recommendations in the Q.7XX series gives an idea of how complex the standard is.
However, the first use of SS7 was not call setup and teardown, but access to remote databases. In the 1980s, some telephone companies started to offer a new service called Wide Area Telephone Service (WATS) that used a common 800 area code regardless of the destination of the call. But all the telephone-switching equipment of the time relied on the area code to make routing decisions through the Public Switched Telephone Network (PSTN). That problem was solved by assigning to every 800 number a second, normal number, which had a real area code and thus could be used for routing. However, the quantity of 800 numbers grew rapidly and it became necessary to store all of them in a central database that could be accessed by all the central offices. Therefore, the SS7 network started to be used to fetch routing and billing information from that central database by sending queries inside message packets. Later, the SS7 network was expanded to include other services, including call setup and teardown.
Local Number Portability (LNP) is another feature of the telephone network achieved thanks to SS7. It allows customers to change their telephone company while keeping the number they previously used. LNP also avoids a number change when upgrading the service from Plain Old Telephone Service (POTS) to ISDN. This service requires the use of a database that is much the same as the one used for 800 numbers.
SS7 can provide much more than routing and billing information. It provides the means for switching equipment to communicate with other switching equipment at remote sites. As an example, if the called number is busy the caller can use a feature known as automatic callback.
Then, when the callee's number becomes available, the network will ring the caller's telephone. As soon as the caller answers, the called party's telephone will be rung. This feature relies on the capability of SS7 to send messages from one switch to another switch, allowing the two systems to invoke features within each switch without setting up a circuit between the two systems.
Seamless roaming is a service of the cellular network that relies on the SS7 protocol. Cellular providers store their customers' information in databases called Home Location Registers (HLR), and they share that information with other cellular providers with whom they have signed agreements. This way, the customer no longer has to register with other service providers when traveling abroad: the visited network is selected automatically.
Today, SS7 is deployed by almost all independent telephone companies and interexchange carriers. All those subnetworks, owned by telephone companies, cellular service providers and long-distance carriers, are linked together thanks to the SS7 protocol. This makes SS7 the world's largest data communications network.
In the next subsections we are going to make a quick review of the internal structure of this network. The interested reader should take a look at the books from which most of the information about SS7 included in this paper has been taken: [Rus1998], chapter 10 of [Kes1998] and chapter 10 of [Sta1995]. Among the many documents on the Internet containing information about SS7, [PT2000] and [Mod1992] are worth a special mention.
3 The CCITT (the International Telegraphy and Telephony Consultative Committee in English) was renamed in 1993. Nowadays it is part of the International Telecommunication Union (ITU), as its Telecommunication Standardization Sector (ITU-T), and deals with telephone and data communication systems.
2.2.1 Functional Architecture of SS7
With common-channel signaling, control messages are routed through the network to perform call management (setup, maintenance, and termination) and network management functions. Those control messages are short packets that must be routed through the network to their final destination. Thus, even if the network being controlled is a circuit-switched network (the voice trunks), the control signaling is implemented using packet-switching 4 technology. In effect, a packet-switched network is overlaid on a circuit-switched network in order to operate and control the circuit-switched network.
SS7 defines the functions that are performed in the packet-switched network but does not dictate any particular hardware implementation. For example, all of the SS7 functions could be implemented in the circuit switching nodes as additional functions; this approach is the so-called associated signaling mode. Alternatively, there can be separate switching points that carry only the control packets and are not used for carrying circuits, the nonassociated signaling mode. Even in the second case, the circuit-switching nodes would need to implement portions of SS7 so that they could receive the control signals.
Today, the telephone switches used in many exchange offices perform signaling functions. This is usually done by using an adjunct computer that is connected to the network through a digital link. Those computers are called Signaling Points (SP). They are in charge of switching messages through the network, using transfer points to route those messages from one end office to another, and they also provide access to databases.
All nodes in the SS7 network are called signaling points. A signaling point has the ability to perform message discrimination (reading the address and determining whether the message is for that node), as well as to route SS7 messages to another SP. When SS7 is used to support the Intelligent Network (IN) service, we can find three different types of SPs:
Service Switching Point (SSP).
Signal Transfer Point (STP).
Service Control Point (SCP).
4 A circuit-switched network is one in which a circuit is reserved and uniquely dedicated for transferring data between two endpoints. Once reserved, that circuit cannot be used by any other endpoint, even though it remains idle. In a packet-switched network, the resources are shared and used by all the endpoints with no dedicated circuits.
SPs provide access to the SS7 network, provide access to databases used by switches inside and outside of the SS7 network, and transfer SS7 messages to other SPs within the network. They are all interconnected by signaling links that provide the speed necessary for SS7 message delivery 5 . This functional architecture is shown in Figure 2-2.
Figure 2-2: Functional architecture of SS7
Legend of Figure 2-2. Nodes: SSP (Service Switching Point), STP (Signal Transfer Point), SCP (Service Control Point). Links: Access links, Bridge links, Cross links, Diagonal links, Extended links, Fully associated links.
5 ITU-T specifies a bit rate of 64 kbps, used almost everywhere in the world. The U.S. and Japan are exceptions to this model, using 56 kbps and 4.8 kbps respectively. The 64 and 56 kbps links are usually single DS0 channels of the digital signal hierarchy. Future broadband networks might use T1 facilities at 1.536 Mbps.
Both SPs and signaling links are always deployed in pairs for redundancy and diversity. SS7 keeps the network operational by providing alternate paths in the event of failures, so that messages can still reach their destinations.
The network is deployed at two distinct levels, or planes. There is an international plane, using the ITU-T standard of the SS7 protocol, and there is the national plane. The national plane uses whatever standard exists within the country in which it is deployed. For example, in the United States the American National Standards Institute (ANSI) standard is used for the national plane, and this is the version of SS7 that will be discussed in this paper. In other nations, there may be one or several different versions of national protocols for SS7 and, while similar, they have fundamental differences. Yet all countries are capable of communicating with one another through gateways that convert the national version of the SS7 protocol to the international version of the SS7 protocol. This ensures that all nations can interwork with the rest, while still addressing the requirements of their own distinct networks.
In the next subsections we will take a closer look at the different SPs and signaling links.
2.2.1.1 The Service Switching Point (SSP)
The SSP is the local exchange in the telephone network. An SSP can be a combination of a voice switch and an SS7 switch, or an adjunct computer connected to the local exchange's voice switch. The SSP must convert signaling from the voice switch into SS7 signaling messages, which can then be sent to other exchanges through the SS7 network. The exchange will typically send messages related to its voice circuits to the exchanges with a direct connection to it. In the case of database access, the SSP will send database queries through the SS7 network to computer systems located centrally in the network.
The function of the SSP is to use the information provided by the calling party (such as the dialed digits) to determine how to connect the call using its routing tables. It will send an SS7 message to the appropriate adjacent exchange requesting a circuit connection. The adjacent exchange acknowledges the request, granting permission to connect this trunk. The same procedure is repeated, connecting trunks between several adjacent exchanges until the final destination is reached.
Many SSP functions are accomplished by adding a computer adjunct to existing switches. This computer receives signals from the voice switch that are used to trigger the transmission of specific SS7 messages. Using adjuncts allows telephone companies to upgrade their SS7 SPs without replacing expensive switches, providing a modular approach to networking. Upgrades are typically limited to software loads.
An SSP must be able to send messages using the ISDN User Part (ISUP) protocol and the Transaction Capabilities Application Part (TCAP) protocol (see section 2.2.2).
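The hop-by-hop trunk setup described above can be sketched as a walk over per-exchange routing tables. This is a minimal sketch under stated assumptions: exchange names and tables are hypothetical, every circuit request is assumed to be acknowledged, and no actual ISUP messages are modeled.

```python
def set_up_call(origin: str, destination: str,
                routing_tables: dict[str, dict[str, str]],
                max_hops: int = 16) -> list[str]:
    """Return the exchanges whose trunks are connected, in order.

    Each exchange looks up the destination in its own routing table and
    requests a circuit from the adjacent exchange it finds there.
    """
    path = [origin]
    while path[-1] != destination:
        if len(path) > max_hops:
            raise RuntimeError("too many hops, routing loop suspected")
        current = path[-1]
        try:
            next_exchange = routing_tables[current][destination]
        except KeyError:
            raise LookupError(
                f"no route from {current} to {destination}") from None
        # Assumption: the adjacent exchange always acknowledges the
        # request, so the trunk towards it is connected immediately.
        path.append(next_exchange)
    return path


if __name__ == "__main__":
    tables = {
        "A": {"D": "B"},  # exchange A reaches D through adjacent exchange B
        "B": {"D": "C"},
        "C": {"D": "D"},  # C is directly connected to D
    }
    print(set_up_call("A", "D", tables))
```

No exchange needs to know the whole path. Each one only picks the next adjacent exchange, which is why the procedure must be repeated until the final destination is reached.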
2.2.1.2 The Signal Transfer Point (STP)
All SS7 packets travel from one SSP to another through at least one STP. The STP acts as a router in the SS7 network and does not usually originate or terminate messages. An STP is typically an adjunct to a voice switch, and only rarely a stand-alone system built for the sole purpose of STP functionality. There are three levels of STPs:
A national STP exists within a national network and is capable of transferring messages using the same national standard protocol. Messages may be passed to another level of STP, but the national STP has no capability of converting messages into another version or format.
An international STP works the same as the national STP, but it operates in the international network. The international network provides interconnectivity between worldwide networks using the ITU-T standards. All nodes connecting to the international STP must use the ITU-T protocol standard.
The gateway STP provides protocol conversion from a national standard to the ITU-T standard or some other standard, and vice-versa. A gateway STP is often used as an access to the international network.
The gateway STP serves as the interface into another network. Long distance service providers may have access into the local telephone company's database for subscriber information, or the local service provider may need access into the long distance service provider's database. In any case, this access is accomplished through a gateway STP. Gateway STPs use screening features to maintain network security. Screening is the capability to examine all incoming and outgoing packets and allow only those that are authorized. When considering the network constraints in terms of performance, one STP level seems preferable. However, considerations of reliability and availability dictate a solution with more than one level. The following guidelines are suggested by ITU-T:
In a hierarchical signaling network with a single STP level:
Each SP that is not also an STP is connected to at least two STPs.
The meshing of STPs is as complete as possible.
In a hierarchical signaling network with two STP levels:
Each SP that is not an STP is connected to at least two lower level STPs.
Each STP in the lower level is connected to at least two upper level STPs.
The STPs in the upper level are fully meshed 6 .
In Figure 2-2 we see an example of a hierarchical network with two STP levels. The four STPs at the lower part of the figure could be national STPs, while the other two STPs could be gateway STPs (or international STPs if the national SS7 is the ITU-T standard). Apart from the basic routing tasks, the STP performs measurements. There are two basic types of measurements: traffic measurements and usage measurements. Traffic measurements provide peg counts and statistical information regarding the type of messages entering and leaving the network. For maintenance purposes, network events are also recorded (such as link out-of-service duration, local processor outage, etc.). Because of the speed of the network and the quickness with which SS7 entities respond to problems, traffic measurements are the best way for maintenance personnel to keep track of what is happening in the network and to prevent network failures. Usage measurements are always peg counts and record the number of messages by message type that enter and leave the network. These peg counts are aggregated by a collection process and stored on magnetic tape. The tape is then used to create invoices for the network's customers. In the local SS7 network, the STP receives messages from the SSP. These packets are related either to call connections or to database queries. Database access is provided through another SS7 entity, the SCP (see next section). If the SSP does not know the address of the destination SCP, the STP must provide the address. In this case, the SSP sends a database query directed to the local STP. The STP will look at the dialed digits (the so-called global title digits) and determine, through its translation tables, the address of the database. This is referred to as global title translation. The STP is the most versatile of all the SS7 entities, providing a wide array of services to the users of the network.
6 We say that we have a full mesh when every STP has a direct link to every other STP.
2.2.1.3 The Service Control Point (SCP)
The SCP serves as an interface to telephone company databases. The SCP does not necessarily store the information, but acts as an interface to the mainframe or minicomputer system that houses the information. These databases are used to store information regarding subscribers' services, routing of special service numbers, or calling card validation and fraud protection. The SCP is usually a computer used as a front end to the database system. This database system is usually linked to the SCP through X.25 links, but in an integrated STP/SCP the database is resident in the SCP. The SCP can perform protocol conversion from SS7 to X.25, or it can provide an interface to access the database directly. The protocol used to access and interface to the databases is TCAP (see section 2.2.2.3). The type of database depends on the network. Each service provider has different requirements, and their databases will differ. The most commonly used databases are:
Call Management Services Database (CMSDB): Provides routing instructions for special service numbers (such as 800, 976 or 900 numbers) and billing information. It also provides routing instructions to avoid congested nodes.
Local Number Portability (LNP): This database contains the information that allows subscribers to change telephone companies without having to change their telephone numbers. As the office code portion of a telephone number can no longer be used to identify the destination, this database provides the information needed to route the call.
Line Information Database (LIDB): It provides information regarding subscribers, such as calling card service, third-party billing instructions, and custom calling features such as call forwarding and speed dialing.
Business Services Database (BSD): The purpose of this database is to allow subscribers to store call processing instructions, network management procedures, and other data relevant only to their own private network.
Home Location Register (HLR): This kind of database appears in cellular networks. The HLR stores information regarding billing and services allowed, as well as the current location of the cellular telephone.
Visitor Location Register (VLR): It is used to store the current locations of visiting subscribers when they roam outside of their home areas.
As seen, each database contains information for a specific application. Each database is also given an address, called a subsystem number, used in routing queries from SSPs through the SS7 network to the actual database entity.
2.2.1.4 The Signaling Links
Links are bi-directional and full-duplex, working at speeds varying from 4.8 kbps to 1.536 Mbps, depending on the national SS7 network standard. Links are placed into groups, called linksets. All the links in a linkset must have the same adjacent node. The switching equipment will alternate transmission across all the links in a linkset to ensure equal usage of all facilities. Up to 16 links can be assigned to one linkset. In the common case that a node has links to a mated STP pair, the links are assigned to two linksets, one linkset per node. Both linksets can then be configured as a combined linkset. Combined linksets are used for load sharing, where the sending SP can send messages to both STPs of the pair, spreading the traffic load evenly across the links. Inside the SS7 network, alternate linksets are used to provide alternate paths for messages. An alternate linkset is used when congestion occurs over the primary links, thus taking advantage of the diversity of paths to overcome congestion. Links must remain available for SS7 traffic at all times, with minimal downtime (a maximum of 10 minutes of downtime per year is allowed for any one linkset). When a link fails, the other links within its linkset must take the traffic. Likewise, if an SS7 entity (such as an STP) fails, its mate must assume the load. This means links can suddenly be burdened with more traffic than they can handle. For this reason, SS7 entities are restricted to loading any link to less than 40% of its capacity. In case of a failure, a link may suddenly have to carry the failed link's traffic in addition to its own. Even at 80%, the links still have enough capacity to carry SS7 network management messages as well as the extra traffic. If the average message length is 40 bytes, i.e. 320 bits, and we consider the ANSI specifications of SS7 with 56 kbps links, working at 40% gives 22.4 kbps of available capacity, which could carry up to 70 messages per second. This simple formula is used to dimension the network. As seen in Figure 2-2, signaling links are labeled according to their function. There are six different types of links used in SS7:
Access links (A): They are used between the SSP and the STP, or SCP and STP. These links provide access into the network and to databases through the STP. There are always at least two A links, one to each STP of the home pair (except in the highly unusual case that STPs are not deployed in pairs). The maximum number of A links connecting an SSP to one STP is 16. A links can be configured in a combined linkset, with 16 links to each STP, providing 32 links to the mated pair.
Bridge links (B): B links are used to connect mated STPs to other mated STPs at the same hierarchical level. B links are deployed in a quad fashion, as seen in Figure 2-2. A maximum of eight B links can be deployed between mated STPs.
Cross links (C): The C links connect an STP to its mate STP. Normal SS7 traffic is not routed over these links, except in congestion conditions or when a node becomes isolated and the only available path is over the C links. The only messages that travel between mated STPs during normal conditions are network management messages. At most eight C links can be used between STP pairs.
Diagonal links (D): D links are used to connect mated STP pairs at a primary hierarchical level to another STP mated pair at a secondary hierarchical level. Otherwise, they have the same characteristics as C links.
Extended links (E): They are used to connect an SSP to remote STP pairs. They are used as an alternate route for SS7 messages in the event that congestion occurs within the home STP pairs. A maximum of 16 E links may be used between any remote STP pairs.
Fully associated links (F): F links are used when a large amount of traffic may exist between two SSPs, or when it is not economical to provide a direct connection between an SSP and an STP. When traffic is particularly heavy between two end offices, the STP may be bypassed altogether. Only call setup and teardown procedures would be sent over this linkset.
There is no difference between the various links as such. Only the way in which the links are used during message transfer, and their interaction with network management, is different.
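The dimensioning rule described earlier in this subsection (a 56 kbps ANSI link loaded to at most 40%, with an average message length of 40 bytes) can be written as a short calculation. The following is only a sketch of that arithmetic; the function name and its parameters are chosen here for illustration and do not come from any SS7 specification.

```python
def max_messages_per_second(link_rate_bps: float,
                            max_load: float,
                            avg_message_bytes: int) -> float:
    """Messages per second that one signaling link may carry.

    link_rate_bps     -- nominal link speed in bits per second
    max_load          -- allowed fraction of the link capacity (0.40 in the text)
    avg_message_bytes -- assumed average message length in bytes
    """
    usable_bps = link_rate_bps * max_load      # capacity left after the load limit
    bits_per_message = avg_message_bytes * 8   # 40 bytes -> 320 bits
    return usable_bps / bits_per_message


# Figures from the text: 56 kbps link, 40% load, 40-byte messages.
rate = max_messages_per_second(56_000, 0.40, 40)
print(f"about {rate:.0f} messages per second")
```

Doubling the load to 80%, as happens when a link takes over the traffic of a failed one, doubles the message rate while still leaving some headroom for network management messages.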
2.2.2 Protocol Architecture of SS7
So far, we have been discussing SS7 architecture in terms of the way in which functions are organized to create a packet-switching control network. The term architecture can also be used to refer to the structure of protocols that specify SS7. Like the Open Systems Interconnection (OSI) 7 model, the SS7 standard is a layered architecture. The term level in SS7 is used in the same context as layer in the OSI model. Figure 2-3 shows the current structure of SS7 (in its ANSI version) and relates it to OSI.
Figure 2-3: SS7 Protocol Architecture
Some of the functions called for in the OSI model have no purpose in the SS7 network and are, therefore, undefined. It should also be noted that the functions in the SS7 protocol have been refined over the years and tailored for the specific requirements of the SS7 network. For this reason, there are many discrepancies between the two protocols and their corresponding functions. Regardless of the differences, the SS7 protocol has proven to be a highly reliable packet-switching protocol, providing all of the services and functions required by the telephone service providers. SS7 continues evolving to adapt to bigger networks and new services provided by telephone companies. The lowest three levels of the SS7 architecture, referred to as the Message Transfer Part (MTP), provide a reliable but connectionless (datagram style) service for routing messages through the SS7 network.
7 The OSI model was developed and published in 1982 by the International Standards Organization (ISO). Its name comes from the fact that it deals with connecting open systems, that is, systems that are open for communication with other systems. It has seven layers, each of them performing a well-defined task, and it is often used to describe the kind of functions that a protocol provides. For a good yet short introduction to OSI read section 1.4.1 of [Tan1996].
[Figure 2-3 labels. SS7 side: MTP Level 1, MTP Level 2, MTP Level 3, SCCP, TCAP, ISUP. OSI side: 1 Physical, 2 Data Link, 3 Network, 4 Transport, 5 Session, 6 Presentation, 7 Application.]
MTP does not provide the complete set of functions and services specified in the OSI layers 1-3, most notably in the areas of addressing and connection-oriented service. In the 1984 version of SS7, an additional module was added, which resides in level four of SS7, known as the Signaling Connection Control Part (SCCP). The SCCP and MTP together are referred to as the Network Service Part (NSP). SCCP defines a variety of different network-layer services to meet the needs of various users of NSP. The remaining modules of SS7 are considered to be at level four and comprise the various users of NSP. NSP is simply a message delivery system; the remaining parts deal with the actual contents of the messages. The ISDN User Part (ISUP) provides the control signaling needed in an ISDN to deal with ISDN subscriber calls and related functions, mostly to set up and tear down telephone connections between end offices. ISUP was derived from the Telephone User Part (TUP) (which is the ITU-T equivalent to ISUP and is not used in the ANSI SS7 specifications). Apart from TUP functionality, ISUP offers the added benefit of supporting IN functions and ISDN services. The Transaction Capabilities Application Part (TCAP), first introduced in 1988, provides the mechanisms for transaction-oriented (as opposed to connection-oriented) applications and functions. There are some other protocols that, like TUP, are part of the SS7 family but do not appear in Figure 2-3. One example is the Broadband ISDN User Part (BISUP), which is used for setting up and tearing down Broadband ISDN (BISDN) circuits and is still being refined. Another protocol that is not present in Figure 2-3 is the Data User Part (DUP), designed to provide data transmission capabilities for circuit-mode data networks. Unlike ISUP, DUP is not intended for ISDN; it is already obsolete and is not presently in use in North American SS7 networks.
In the next subsections we will discuss the user and application parts of primary importance to SS7.
2.2.2.1 The Message Transfer Part (MTP)
The MTP protocol is at the lowest level in the SS7 protocol stack, and it is a transport protocol used by all the other members of the SS7 suite. It is actually divided into three different levels with the same functionality as layers one, two and three of the OSI model. MTP Level 1 (MTP1) allows the use of any digital-type interface. Common interfaces in most SS7 networks today include E1 (2.048 Mbps; 32 channels of 64 kbps), DS1 (1.544 Mbps; 24 channels of 64 kbps), V.35 (64 kbps), DS0 (64 kbps), and DS0A (56 kbps). To be compatible with some older versions of SS7 (such as the one deployed in Japan), SS7 can operate at speeds as low as 4.8 kbps, although that can cause unacceptably long delays. The most common interface in the U.S. is DS0A, but there already exist some preliminary standards on the usage of a full DS1 facility at 1.544 Mbps as a signaling link, which will reduce the number of multiplexers used in the network (saving one level of multiplexing, from one DS1 channel to 24 DS0 channels). MTP Level 2 (MTP2) provides the functions necessary for basic error detection and correction. This protocol is concerned only with the reliable delivery of signal units between two exchanges or SPs; it does not consider anything outside of the signaling link and it has no knowledge of the final destination. This level provides flow control and sequence numbering of the signal units sent across this point-to-point signaling link. Another function of level two is error correction. There are two types of error correction procedures. Basic error correction is used for links with a delay under 15 ms. It uses Go-Back-N retransmission, where a bad frame (lost or corrupted) and all subsequently transmitted frames are retransmitted by the sender. The Preventive Cyclic Retransmission (PCR) scheme is used in links with longer delay, such as satellite signaling links. In PCR the transmitted signal units are retransmitted automatically during idle periods until they are acknowledged. MTP2 also performs two important signaling link error rate monitoring functions. The signal unit error rate monitor counts signal unit errors using a leaky bucket scheme: a counter is incremented by one whenever a signal unit with errors is detected, and is decremented by one after every 256 signal units received (as long as the counter is positive). If the counter reaches 64, an indication is sent to MTP3. The alignment error rate monitor is used to ensure that signal unit alignment is maintained: alignment is considered to be lost if more than 6 consecutive one bits are received 8 or if a signal unit is received that is greater than the allowed maximum size. A counter is incremented after the receipt of every 16 octets until alignment is reestablished. If the counter crosses a threshold, an appropriate indication is passed to MTP3. The MTP Level 3 (MTP3) protocol has the responsibility of transporting messages between SPs. This layer performs two broad categories of functions: network management and message handling. Network management is in charge of reconfiguring the signaling network in the case of link or SP failures. It also controls traffic in case of congestion. The signaling network management functions are:
Signaling link management: Activates new links and reinitializes or removes from operation failed signaling links (following MTP2 indications). Signaling link management only informs the adjacent SP at the other end of the problematic link about the problem. Therefore, signaling link management is a local function. Traffic rerouting due to link failures is not a task of signaling link management. Another feature provided by some SSPs that is the responsibility of signaling link management is automatic allocation, which consists of taking voice circuits out of voice service to use them as SS7 signaling links, and vice versa.
Signaling traffic management: Performs a task similar to that of signaling link management, since it also deals with signaling link replacement. However, it deals with signaling links that have suffered a complete malfunction (for example, a backhoe has dug up a link facility). The messages used to remove the signaling link that caused the trouble are sent through a different path. So, the basic difference between signaling traffic management and signaling link management lies in the mechanism used to inform the adjacent SP about the failure. Signaling link management is then the one in charge of taking the link out of service.
Signaling route management: It is used to advise other SPs about the inability of one SP to reach another SP. Therefore, when an SP realizes that it cannot communicate with an adjacent SP, it tells the other SPs to avoid sending signal units to the unreachable SP.
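The MTP2 indications mentioned above originate in the error rate monitors described earlier in this section. As an illustration, the leaky bucket behaviour of the signal unit error rate monitor can be sketched as follows. This is a minimal model built only from the figures given in the text (increment on each errored signal unit, decrement after every 256 signal units, indication at 64). The class and method names are hypothetical, and counting errored units towards the 256-unit interval is an assumption of this sketch.

```python
class SignalUnitErrorRateMonitor:
    """Leaky bucket counter for signal unit errors (illustrative sketch)."""

    def __init__(self, threshold: int = 64, leak_interval: int = 256):
        self.threshold = threshold          # counter value that triggers the indication
        self.leak_interval = leak_interval  # signal units between decrements
        self.counter = 0
        self.units_since_leak = 0

    def on_signal_unit(self, has_error: bool) -> bool:
        """Process one received signal unit.

        Returns True when the counter has reached the threshold, i.e. when
        an indication would be passed to MTP3.
        """
        if has_error:
            self.counter += 1
        # Assumption: every received unit, errored or not, counts towards the leak.
        self.units_since_leak += 1
        if self.units_since_leak == self.leak_interval:
            self.units_since_leak = 0
            if self.counter > 0:            # the counter never goes below zero
                self.counter -= 1
        return self.counter >= self.threshold


monitor = SignalUnitErrorRateMonitor()
# A burst of consecutive errored units trips the monitor on the 64th unit.
tripped_at = next(i for i in range(1, 1000) if monitor.on_signal_unit(True))
print(tripped_at)
```

With this scheme, an isolated error is forgiven after the next 256 signal units, so only a sustained error rate drives the counter up to the threshold.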
Inside the message handling category we can find these three major functions:
Message discrimination: Determines whether an MTP2 message belongs to this SP or another based upon the message's routing label. If the routing label contains the address of the local signaling point, the message is handed off to message distribution. Otherwise, it is passed to message routing.
Message distribution: If the message belongs to this SP, the message is passed to the appropriate MTP user (ISUP or TCAP) or MTP3 function.
Message routing: If the MTP2 message received is to be relayed to another SP, or if the message originated at this SP, it must be forwarded through another signaling link, chosen using the information provided by the routing table.
8 MTP2 uses the fixed bit pattern 01111110 as an opening and closing flag of the signal units. As a result of this, the sender must apply bit stuffing, inserting a 0 after every five consecutive 1s. At the receiver, any 0 following five consecutive 1s is deleted.
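The three message handling functions listed above can be summarized in a few lines of code. The sketch below is only a simplified model of the decision described in the text: the `Message` fields, the routing table format and the "discard" outcome for an unknown destination are assumptions made for this illustration, not part of the MTP3 specification.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Message:
    dpc: int    # destination point code taken from the routing label
    user: str   # MTP user the message is meant for, e.g. "ISUP"


def handle_message(msg: Message,
                   local_point_code: int,
                   routing_table: Dict[int, str]) -> Tuple[str, Optional[str]]:
    """Return the action taken and its target (user part or linkset)."""
    # Message discrimination: is the message addressed to this SP?
    if msg.dpc == local_point_code:
        # Message distribution: hand the message to the MTP user.
        return ("distribute", msg.user)
    # Message routing: pick the outgoing linkset from the routing table.
    linkset = routing_table.get(msg.dpc)
    if linkset is None:
        # Assumption of this sketch: unknown destinations are dropped.
        return ("discard", None)
    return ("route", linkset)


# Hypothetical point codes and linkset names, for illustration only.
table = {200: "linkset-A", 300: "linkset-B"}
print(handle_message(Message(dpc=100, user="ISUP"), 100, table))
print(handle_message(Message(dpc=300, user="ISUP"), 100, table))
```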
As seen, a large part of the signaling network functional specification is concerned with procedures for overcoming link failures and congestion. Procedures are specified for quickly determining when a link has failed, removing it from service, rerouting traffic, and bringing the link back into service after repair. There is an overriding concern for network reliability, the goal being 99.998% availability. This goal is achieved in SS7 by both equipment redundancy and the network's dynamic reconfiguration and rerouting functions.
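To give a feeling for what the 99.998% availability goal means in practice, it can be converted into an allowed downtime per year. The short calculation below is a sketch that assumes a 365-day year; the function name is chosen here for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assuming a 365-day year


def allowed_downtime_minutes(availability: float) -> float:
    """Yearly downtime, in minutes, permitted by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


downtime = allowed_downtime_minutes(0.99998)
print(f"{downtime:.1f} minutes per year")
```

The result is roughly ten and a half minutes per year, which is in line with the limit of about 10 minutes of yearly downtime quoted for linksets in section 2.2.1.4.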
2.2.2.2 The Signaling Connection Control Part (SCCP)
The MTP was originally designed to meet the real-time requirements of telephone network signaling and, for that reason, provides a connectionless network service. Some applications, however, require a connection-oriented transfer capability and a larger, more complete address space than the MTP makes available. The MTP provides both the Origination Point Code (OPC) and the Destination Point Code (DPC), each 14 bits long. In both cases, the point code is from a node-to-node perspective. Moreover, MTP has a limited distribution capability at the node using a 4-bit indicator in the Service Indicator Octet (SIO) field of a signal unit. This addressing capability is adequate only for a very limited set of services. One major enhancement provided by the SCCP is its expanded addressing functionality. The SCCP supplements MTP addressing by defining an additional field called the Subsystem Number (SSN), which consists of local addressing information used to identify SCCP users at each node. The combination of OPC plus SSN forms the calling party address, and the DPC plus SSN is the called party address. Another SCCP enhancement is its ability to use global titles as addresses. A global title is a special address, such as an 800 number, that does not provide information usable for routing. SCCP is the protocol that performs the global title translation. In practice SCCP is used only with TCAP, although the standards also specify its use with ISUP (as appears in Figure 2-3). This could in theory allow ISUP messages associated with an already established connection to be routed using end-to-end routing, as with TCAP messages. However, that functionality has not been implemented in SS7 networks.
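The addressing ideas of this section, a point code plus an SSN and the translation of a global title into such an address, can be illustrated with a small lookup. The following is a minimal sketch: the table contents, the point code and SSN values, and the longest-prefix matching rule are all assumptions made for the example, since real translation tables are operator-specific.

```python
from typing import Dict, NamedTuple, Optional


class SccpAddress(NamedTuple):
    point_code: int  # identifies the node (DPC for the called party)
    ssn: int         # subsystem number identifying the SCCP user at that node


def translate_global_title(digits: str,
                           table: Dict[str, SccpAddress]) -> Optional[SccpAddress]:
    """Map global title digits to a routable address.

    Assumption of this sketch: the longest matching digit prefix wins.
    Returns None when no entry matches.
    """
    for length in range(len(digits), 0, -1):
        address = table.get(digits[:length])
        if address is not None:
            return address
    return None


# Hypothetical translation table: digit prefix -> (point code, SSN).
gtt_table = {
    "800": SccpAddress(point_code=1001, ssn=254),
    "800555": SccpAddress(point_code=1002, ssn=254),
}
print(translate_global_title("8005551234", gtt_table))
```

The caller only needs to know the dialed digits; the node holding the table, typically an STP, supplies the point code and SSN of the database that can answer the query.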
2.2.2.3 The Transaction Capabilities Application Part (TCAP)
TCAP provides a general purpose, remote operation function for SS7. It provides the capability for an application at one node to invoke the execution of an operation at another node and to receive the results from that remote process. TCAP was originally designed to support queries into databases, although its role can include additional functions. TCAP comprises two protocol sublayers called the Transaction Sublayer (TSL) and the Component Sublayer (CSL). The TSL is the lower TCAP sublayer and it defines how the transaction or dialogue will take place, that is, the context in which the remote operation will take place. There are two types of dialogues: the unstructured dialogue, which is a one-way communication in which the remote peer processes our message but does not send any response back, and the structured dialogue, which is analogous to a virtual connection where queries produce responses. The CSL is the upper TCAP sublayer, and defines the actual messages, called components, that are contained in the TSL messages. There are four types of CSL components: invoke (to request a remote operation), return result (containing the response of the requested operation), return error (indicating some kind of error), and reject (indicating some kind of syntax error). Both invoke and return result have single-message and multiple-message versions (in case a single message is not enough). The TCAP services are provided to an upper user application called the Application Service Element (ASE), which is responsible for providing the information that a specific application needs, such as translating an 800 number into a routable number or obtaining a billing number from a telephone calling card.
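The four component types can be pictured as the possible outcomes of a remote operation request. The sketch below models that idea only; it does not follow the TCAP encoding, and the `invoke_id` field, the payload layout, the operation name and the rule that an unknown operation is answered with a reject are assumptions made for the illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict


class ComponentType(Enum):
    INVOKE = "invoke"
    RETURN_RESULT = "return result"
    RETURN_ERROR = "return error"
    REJECT = "reject"


@dataclass
class Component:
    type: ComponentType
    invoke_id: int       # hypothetical field pairing a response with its invoke
    payload: Any = None  # for an invoke: (operation name, argument)


def execute(component: Component,
            operations: Dict[str, Callable[[Any], Any]]) -> Component:
    """Process one invoke at the remote node and build the response component."""
    payload = component.payload
    well_formed = (
        component.type is ComponentType.INVOKE
        and isinstance(payload, tuple)
        and len(payload) == 2
        and isinstance(payload[0], str)
        and payload[0] in operations
    )
    if not well_formed:
        # The request itself could not be understood.
        return Component(ComponentType.REJECT, component.invoke_id, "malformed invoke")
    name, argument = payload
    try:
        result = operations[name](argument)
    except Exception as exc:
        # The operation was understood but could not be completed.
        return Component(ComponentType.RETURN_ERROR, component.invoke_id, str(exc))
    return Component(ComponentType.RETURN_RESULT, component.invoke_id, result)


# Hypothetical ASE operation: translate an 800 number into a routable number.
numbers = {"8005551234": "2125550100"}
ops = {"translate_800": lambda dialed: numbers[dialed]}
reply = execute(Component(ComponentType.INVOKE, 1, ("translate_800", "8005551234")), ops)
print(reply.type.value, reply.payload)
```

In a structured dialogue the caller would wait for such a reply, while in an unstructured dialogue the invoke would simply be sent without expecting any response.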
2.2.2.4 The ISDN User Part (ISUP)
ISUP is a circuit-related protocol, used to set up, manage and release trunks carrying voice and data calls over the PSTN. It is used for both ISDN and non-ISDN calls and it was adopted by the ANSI SS7 to replace TUP, which did not support data transmission or digital circuits. However, ISUP does not support broadband technologies. These new technologies will be addressed by a new version of ISUP called BISUP, which is still under development by the ITU-T. ISUP may use the transport services provided by either the MTP or SCCP, as can be seen in Figure 2-3. However, the interface between ISUP and SCCP has not been implemented yet. MTP services are used for the transport of call-related signaling messages between ISDN central offices, while the SCCP may be employed for additional connectivity services as well as end-to-end signaling. ISUP is compatible with the ISDN protocol, which was developed as an extension of SS7 to the subscriber. The purpose of the ISDN compatibility is to allow subscriber switches to send signaling information to remote subscribers. This can be used to support called-invoked features such as conference calling or automatic callback. Not all SS7 networks use ISUP as the basis for their ISDN services. Most European countries, North America and Japan use ISUP, but the United Kingdom, for example, uses the National User Part (NUP), developed in the early 1980s and largely based on TUP because ISUP was not yet available.
2.3 The largest computer network: The Internet
How many people have not heard about the Internet? In the developed countries, not that many. In the last few years the Internet and the services it offers have become a mass phenomenon, which is quickly gaining more and more popularity. In the near future, having an Internet connection at home will be as normal as having a TV set or a telephone line is today. Behind this growing use there is a quite old protocol, the Internet Protocol (IP) [Pos1981a]. IP has proven to be a very robust network protocol that can face the introduction of new technologies with minimal changes, being still valid about 30 years after its initial design. This is because, unlike most older network layer protocols, it was designed from the beginning with internetworking in mind.
The number of users of the Internet has always been growing, but only recently has the Internet user community become a significant group. This happened thanks to new kinds of applications that make use of the Internet and make a new era of communication possible. In the next subsection we briefly review the history of the Internet, from its beginning in the late 1960s to the new millennium. Then we will briefly discuss the structure of IP and the protocols that use its services to communicate through the Internet.
2.3.1 A quick history of the Internet: From military use to worldwide business tool
In the mid-1960s, at the height of the Cold War, the Department of Defense (DoD) of the U.S. wanted to have a network that could survive a nuclear war (knowing who would make use of that network after the nuclear attack was another issue). As traditional circuit-switched networks were considered too weak, because the loss of a single node or line would terminate all the communications using it and could even split the network, the DoD turned to its research arm, the Advanced Research Projects Agency (ARPA) 9 , to investigate a new network using the then-radical idea of packet switching. With a datagram subnet, if some lines or nodes were destroyed the messages could be automatically rerouted along alternative paths. ARPA gave some grants to universities to investigate this topic, and finally in December 1969 a packet switching network with four nodes was born, the ARPANet. ARPANet grew rapidly, and a few years later experiments showed that the existing ARPANet protocols were not suitable for running over multiple networks. This observation led to more research on protocols, culminating in the invention of the Transmission Control Protocol (TCP) [Pos1981c] and the TCP/IP model in 1974. TCP/IP was specifically designed to handle communication over internetworks. By 1983 ARPANet was stable and successful, with more than 200 networks and hundreds of hosts, TCP/IP being the only standard protocol used. The Domain Name System (DNS) [Moc1987] was created during the 1980s to organize machines into domains and map hostnames into IP addresses. In the 1980s the ARPANet was connected to several nodes outside the U.S., mostly in Europe and Japan, but the real growth and evolution of the Internet was happening in North America. By 1990, the ARPANet had been overtaken by newer networks that it itself had spawned, so it was shut down and dismantled. But already in the late 1970s the National Science Foundation (NSF) of the U.S. realized the deep impact that the ARPANet had on universities and research centers, as it was a very good means to share ideas and projects. The main problem was that universities wanting to join ARPANet had to have a research contract with its owner, the DoD. This was not always the case, so the NSF began designing a high-speed successor to the ARPANet, open to all universities. The result of this research was the NSFNet, founded in the mid-1980s using the same hardware as ARPANet. NSFNet was an instantaneous success. A few years later it connected thousands of hosts placed in universities, research laboratories, libraries and museums, including the computers connected to the ARPANet.
9 ARPA was founded in 1958, after the Soviet Union launched Sputnik 1 into Earth's orbit in 1957, with the mission of applying state-of-the-art technology to U.S. defense and of avoiding being surprised by technological advances of the enemy, i.e. the Soviet Union.
NSFNet became a victim of its own success, and in the subsequent years the links used for the backbone had to be upgraded from the 56 kbps links of its foundation to 1.5 Mbps links in 1990. However, these upgrades were not free of charge, and it became obvious that the government could not finance networking forever. So that same year some companies formed a nonprofit corporation called Advanced Networks and Services (ANS), this being the first step towards the commercialization of the NSFNet. ANS took over the NSFNet and upgraded its links to 45 Mbps. By this time the Internet connected around 200,000 computers in about 3,000 networks. In 1991, the U.S. Congress approved the creation of the National Research and Educational Network (NREN), the successor of the NSFNet, running at gigabit speeds. During the early 1990s commercial companies started to deploy their own IP-based networks, so the NSFNet backbone was no longer needed. It was sold to America Online in 1995, and since then the Internet as a whole has no longer been maintained by the U.S. and local governments. Until the early 1990s the traditional services provided by the Internet were e-mail (the most popular application since the ARPANet times), news (to create international forums on the most diverse topics), remote login (normally using the Telnet [Pos1983] protocol to access remote computers) and file transfer (to make copies of files using the File Transfer Protocol (FTP) [Pos1985] or the Trivial File Transfer Protocol (TFTP) [Sol1992]). Those services were mostly used by academic, government and industrial researchers.
But in 1990, Tim Berners-Lee, a scientist working at the Conseil Européen pour la Recherche Nucléaire (CERN) 10, created the Hypertext Transfer Protocol (HTTP) [Ber1996], the language computers use to communicate hypertext documents 11 over the Internet. He also designed a scheme to give documents addresses on the Internet, the Uniform Resource Identifier (URI) [Ber1994]. At the end of 1990 he created a server of hypertext documents and a client program (browser) to retrieve and view those documents. He called this application the World Wide Web (WWW). The following year, 1991, he made his web server and client software publicly available on the Internet, and what we know today as the Web started to take off. Berners-Lee's browser was specifically designed for the personal computer he was using, so others, mostly students, started to program their own web browsers. Among those early web browsers was Erwise, written by students of the Helsinki University of Technology 12, which ran on UNIX machines. The first browser with multimedia support was Mosaic, written at the National Center for Supercomputing Applications (NCSA) in 1993, and from that moment on things moved too fast to follow.
The WWW was the killer application that attracted millions of new, nonacademic users to the net, and it is what has made the Internet so popular. Its first use was to make the sharing of documents among scientists and researchers easier, but nowadays its use is mostly commercial. There is virtually no known company that does not have its web
10 In English, the European Council for Nuclear Research.
11 A hypertext document is a document containing text and embedded sound, images and even video (in which case it is usually called hypermedia), including links to other hypertext documents. Hypertext documents are formatted using the Hypertext Markup Language (HTML) (also invented by Berners-Lee), whose latest version is published in [W3C1999].
12 After a visit from Robert Cailliau, a close colleague of Tim Berners-Lee, a group of students at Helsinki University of Technology joined together to write a web browser as a master's project. Since the acronym of their department was "OTH", they called the browser "erwise", as a joke on the word "otherwise". The final version was released in April 1992 and included several advanced features, but it was not developed further after the students graduated and went on to other jobs.
page selling its products, and governments maintain web pages where many bureaucratic procedures can be carried out.
Figure 2-4 shows the exponential growth of the Internet in the last twenty years as published by the Internet Software Consortium (ISC). The latest measurement of the number of hosts connected to the Internet was taken in January 2001, when there were about 110 million hosts. If the growth rate continues, there will be about 175 million hosts by the end of 2001, and the first billion hosts would be reached at some point during 2005.
Figure 2-4: Internet's growth (1981-2001)
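As a rough check of the projection quoted above, the following Python sketch extrapolates the two host counts given in the text. It assumes (our simplification, not necessarily the method used by the ISC) that the two figures are about one year apart and that the same annual growth factor keeps holding. The names used are purely illustrative.

```python
import math

HOSTS_JAN_2001 = 110e6   # hosts measured in January 2001 (from the text)
HOSTS_END_2001 = 175e6   # hosts expected by the end of 2001 (from the text)
TARGET_HOSTS = 1e9       # the "first billion hosts"


def years_to_reach(start, target, annual_factor):
    """Years of constant exponential growth needed to go from start to target."""
    return math.log(target / start) / math.log(annual_factor)


# Assumption: the two host counts are roughly one year apart.
annual_factor = HOSTS_END_2001 / HOSTS_JAN_2001
crossing_year = 2001 + years_to_reach(HOSTS_JAN_2001, TARGET_HOSTS, annual_factor)

print(f"annual growth factor: {annual_factor:.2f}")
print(f"one billion hosts reached around: {crossing_year:.1f}")
```

Under those assumptions the one-billion mark falls within 2005, in line with the estimate given above.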
The methods used to measure the number of hosts connected to the Internet differ, and their results vary considerably from one source to another. Moreover, the number of hosts is not the only figure that gives an idea of the Internet's tremendous success. As an example, the company Nua gives in its web pages [Nua2001] an estimate of 513 million Internet users worldwide in August 2001. That estimate is shown in Figure 2-5.
Figure 2-5: Worldwide Internet Population (August 2001)
[Figure 2-4 plot: number of hosts (thousands) on a vertical axis from 0 to 120 000, against the year on a horizontal axis from 1981 to 2001.]
[Figure 2-5 chart: U.S. and Canada 180.7 million; Europe 154.6 million; Asia/Pacific Rim (including Australia) 144 million; South America 25.3 million; Middle East 4.7 million; Africa 4.2 million.]
The times when North America was almost the sole owner of the Internet are gone. However, still only the developed countries have a significant number of Internet users. North America, the European Community, Japan, South Korea and Australia have more than 77% of all Internet users worldwide, while their population is about 14% of the world total. Among those countries we can highlight Sweden which, with a population of about 8.8 million inhabitants, has 5.6 million Internet users, 63.5% of its population, making it the country with the highest Internet penetration in the world.
However, IP was not ready for such an incredible success. Due to the new applications that make the Internet interesting for the general public, the number of online users has been growing exponentially since the mid-1990s, and it is expected to keep growing in the coming years. Moreover, millions of people with wireless portables may use them to keep in contact with their home base, and with the convergence of the computer, communication and entertainment industries, it may not be long before every television or mobile phone in the world is an Internet node. This brings two problems. On the one hand, IP addresses are 32-bit numbers, which gives a theoretical maximum of about 4 billion addressable hosts, but the practice of organizing the address space in classes to help routing wastes millions of them. So, with the enormous growth of the Internet, IP addresses have become a scarce commodity. On the other hand, such a huge number of hosts makes the routing algorithms inefficient, making routing both slower and more resource consuming.
Under these circumstances, it became apparent that IP had to evolve and become more flexible. So, more than ten years ago, in 1990, the Internet Engineering Task Force (IETF), the international organization that produces the Internet standards, started to work on a new version of IP.
The main characteristic of this new version should be the use of a bigger address space so it would never run out of addresses, but at the same time it should solve a variety of other problems [Hui1998]:
- Reduce the size of the routing tables.
- Simplify the protocol, thus allowing routers to do their job faster.
- Provide better security (authentication and privacy) than the former version of IP.
- Pay more attention to type of service, particularly for real-time data.
- Aid multicasting by allowing scopes to be specified.
- Make it possible for a host to roam without changing its address.
- Make the protocol open enough to evolve in the future.
- Permit both versions of IP to coexist for years.
The IETF issued a call for proposals and received 21 responses. In December 1992, seven serious options were on the table, ranging from simple patches to IP to completely different protocols. The following year, the three best proposals were chosen out of those seven: the one created by Deering, the one created by Francis, and the Katz and Ford proposal. The final protocol chosen was a modified combination of the Deering and Francis proposals, and was given the designation Internet Protocol version 6 (IPv6) [Dee1998]. IPv6 is not fully deployed yet, and virtually the only protocol used in the Internet is still the previous version of IP, now called IPv4. However, there are already several complete IPv6 implementations that should start working, together with IPv4, within the next few years.
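To make the address shortage discussed in this section more concrete, the following Python sketch computes the size of the 32-bit address space and illustrates how the historical class A, B and C allocation wastes addresses. The organization with 2,000 hosts is a hypothetical example of ours, and the helper names are illustrative only.

```python
TOTAL_IPV4_ADDRESSES = 2 ** 32  # 32-bit addresses: about 4 billion

# Number of host bits in each class of the historical classful scheme.
HOST_BITS = {"A": 24, "B": 16, "C": 8}


def usable_hosts(address_class):
    """Hosts per network; the all-zeros and all-ones host parts are reserved."""
    return 2 ** HOST_BITS[address_class] - 2


def smallest_class_for(hosts_needed):
    """Smallest class whose networks can hold the requested number of hosts."""
    for address_class in ("C", "B", "A"):
        if usable_hosts(address_class) >= hosts_needed:
            return address_class
    raise ValueError("no single classful network is large enough")


def wasted_addresses(hosts_needed):
    """Addresses left unused when one classful network is assigned."""
    return usable_hosts(smallest_class_for(hosts_needed)) - hosts_needed


# Hypothetical organization with 2,000 hosts: too big for class C,
# so it receives a class B network and leaves most of it unused.
print(TOTAL_IPV4_ADDRESSES)
print(smallest_class_for(2000), wasted_addresses(2000))
```

In this hypothetical case a single organization leaves more than 60,000 addresses unused, which shows how quickly such assignments add up to the millions of wasted addresses mentioned above.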
2.3.2 The basis of the Internet: The internals of the Internet Protocol (IP)
We have already seen how the Internet became what it is today, and we have also mentioned that the whole Internet is possible thanks to a protocol that rules it: IP.
But why is IP so important? IP is what keeps all the networks together: it is the language that all the computers connected to the Internet must speak to be able to communicate with each other. IP is a network protocol that provides a best-effort way to carry pieces of information called datagrams from our computer to any remote one, and vice versa, no matter whether these machines are on the same network or not.
Communication on the Internet works as follows. A protocol operating at a higher level fragments the data it wants to transfer to another host. The address of the remote host must be provided to the IP layer along with the data itself, and then IP transfers that data to the receiver, probably going through different networks on its way, and possibly further fragmenting the data into smaller units. When all the pieces reach their destination, the IP layer at the receiver side reassembles them into the original datagram and identifies which upper-level protocol originated it, passing the datagram to the right receiving process.
Let us take a closer look at the structure of both IPv4 and IPv6. Figure 2-6 shows their message headers. The first field in both IPv4 and IPv6 is the Version field. It keeps track of which version of the protocol the datagram belongs to, which allows the transition between versions to take years. Obviously, it is set to 4 in IPv4 and to 6 in IPv6.
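Since the Version field occupies the first four bits of both headers, a receiver can decide how to interpret the rest of the datagram by looking at the first byte only. The following Python sketch illustrates this. The function name and the sample bytes are ours, chosen for illustration.

```python
def ip_version(datagram: bytes) -> int:
    """Return the value of the Version field (upper 4 bits of the first byte)."""
    if not datagram:
        raise ValueError("empty datagram")
    return datagram[0] >> 4


# Illustrative first bytes: 0x45 starts a typical IPv4 header,
# 0x60 starts an IPv6 header. The remaining bytes are zero padding.
ipv4_like = bytes([0x45]) + bytes(19)
ipv6_like = bytes([0x60]) + bytes(39)

for sample in (ipv4_like, ipv6_like):
    print(f"first byte 0x{sample[0]:02x} -> IP version {ip_version(sample)}")
```

Because the field sits in the same position in both versions, a node that understands both can dispatch each incoming datagram to the right processing code.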
Figure 2-6: The IP header
[Figure 2-6 diagram, both headers drawn 32 bits wide (bit positions 00 to 31):
(a) Internet Protocol version 4 (IPv4): Version, IHL, DSCP, ECN, Total Length, Identification, DF and MF flags, Fragment Offset, Time to Live, Protocol, Header Checksum, Source Address, Destination Address, Options (0-40 bytes).
(b) Internet Protocol version 6 (IPv6): Version, DSCP, ECN, Flow Label, Payload Length, Next Header, Hop Limit, Source Address, Destination Address.]
The next field in IPv4 is the Internet Header Length (IHL), needed only in IPv4 since, as can be seen in Figure 2-6 (a), the IPv4 header length is variable, while the IPv6 header has a fixed length. IPv4 can include Options in its header, and this is the reason why the IHL field exists. The options allow the datagram sender to indicate the path the datagram should follow, to create a log of the routers visited on its way, or even to tell how secret the information carried in the datagram is (something only valuable to spies). However, the space left for the options is insufficient, just 40 bytes. Moreover, most routers ignore the options, so the conclusion is that having to check the header length just makes datagram processing slower. That is the reason why IPv6 headers have a fixed length of 40 bytes.
However, to provide extension possibilities IPv6 has the Next Header field. The IPv6 header was simplified as much as possible, but to allow capabilities to be added to the protocol there can be additional extension headers. They are located right after the main header and identified by the Next Header field. Each of them has in turn its own Next Header field, so that several extension headers can be included in the same datagram, placed in a daisy chain. The Next Header field of the last one is set to the identifier of the payload protocol carried by the IPv6 datagram, an identifier that in IPv4 is located in the Protocol field.
The byte used by the Differentiated Services Codepoint (DSCP) and the Explicit Congestion Notification (ECN) fields has been quite an unstable one. As defined in the initial specification of IPv4 [Pos1981a] it was called the Type of Service (TOS) byte. The first three bits were the so-called Precedence field, which indicated the importance of the information carried by the IPv4 datagram. The next three bits were three flags, Delay (D), Throughput (T) and Reliability (R), allowing the host to specify what it cares most about.
The last two bits were set to zero. In practice, routers ignored the TOS field altogether. Later, the seventh bit of the Type of Service field was converted into another flag like the D, T and R bits, used to indicate a preference for low monetary cost [Alm1992]. Still, the TOS field was not popular enough to bother routers, which usually did not pay attention to its value. Similarly, in the first specification of IPv6, bits 4-7 (the second half of the first byte) of the main header formed a field named Priority, which had a use similar to that of the Precedence field in IPv4. In the final specification of IPv6 [Dee1998], that field was enlarged to one byte and became the Traffic Class field, but its use was not defined.
Taking advantage of this anarchic situation, another modification was made to this byte in both versions of IP. As per [Nic1998], the first six bits became the DSCP field, and the last two bits were left unused. Those six bits define 64 codepoints that are mapped to specific behaviors in the routers along the path that the datagram follows. However, this was not the end of the story. Later, in [Ram2001], the last two unused bits became the ECN field. These two bits are used to avoid relying on IP packet drops in routers as the only indication of congestion. Instead, when a router becomes congested, it uses this field to indicate that it is suffering from congestion. The data receiver then uses the acknowledgements of the transport protocol to report the congestion to the data sender, which should decrease its sending rate.
The Total Length field of IPv4 gives the total length of the datagram, header included. It corresponds to the Payload Length field of the IPv6 header, which does not include the length of the main header. Theoretically the biggest IP datagrams are about 64 Kbytes long, but in practice they are around 1,500 bytes due to the limitations of the data link layer.
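The final layout of the former TOS byte, six DSCP bits followed by two ECN bits, can be illustrated with a few lines of Python. This is only a sketch: the function names are ours, the sample value 46 is the commonly used Expedited Forwarding codepoint, and the behavior of the hypothetical mark_congestion helper (setting both ECN bits on datagrams that declared ECN support, leaving other datagrams untouched) follows our reading of the ECN specification [Ram2001].

```python
def split_traffic_byte(value):
    """Split the former TOS byte into (DSCP, ECN): upper 6 bits, lower 2 bits."""
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in one byte")
    return value >> 2, value & 0b11


def join_traffic_byte(dscp, ecn):
    """Build the byte from a 6-bit DSCP and a 2-bit ECN value."""
    if not 0 <= dscp < 64 or not 0 <= ecn < 4:
        raise ValueError("DSCP must be 0..63 and ECN must be 0..3")
    return (dscp << 2) | ecn


def mark_congestion(value):
    """Sketch of a congested router: set both ECN bits, keep the DSCP.

    Assumption: an ECN value of 0 means the sender did not declare ECN
    support, so the byte is returned unchanged in that case.
    """
    dscp, ecn = split_traffic_byte(value)
    if ecn == 0:
        return value
    return join_traffic_byte(dscp, 0b11)


print(split_traffic_byte(0xB8))    # DSCP 46 with ECN bits cleared
print(hex(mark_congestion(0xB9)))  # same DSCP, both ECN bits set
```

Six bits give the 64 codepoints mentioned above, and the two remaining bits are manipulated without disturbing them.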
The Identification field is used when a datagram is further fragmented in the network. All the fragments of a datagram carry the same Identification value. This field, together with the Fragment Offset field, which indicates the position of the fragment within the original datagram (in multiples of 8 bytes), and the More Fragments (MF) flag, which is set in every fragment except the last one and thus marks the end of a fragmented datagram, makes reassembly possible at the receiver. The Don't Fragment (DF) flag is an order to the routers not to fragment the datagram because the destination is incapable of putting the pieces back together again. In IPv6 there is no field providing fragmentation capabilities in the main header, simply because routers are not allowed to fragment datagrams. Instead, there is a fragmentation extension header that can be used only at the sender and receiver sides. Routers simply discard datagrams that are too big and send back an error message, which simplifies their job. We will discuss fragmentation further in section 5.4.
The Time to Live field helps to prevent datagrams from wandering around forever due to corrupted routing tables. Initially it was designed to count seconds, but in practice it is decremented by one at each router, so it just counts hops. In IPv6 the Hop Limit field does the same job, but this time the name reflects its real function.
The Header Checksum protects just the header. It is assumed that if the upper-layer user wants to protect the data carried by IP, it will provide its own error detection scheme. However, as fields such as Time to Live change at every hop, the checksum must be recalculated in every router. IPv6 does not include any checksum in its main header, but there are two extension headers available to overcome this problem: the Authentication Header (AH) [Ken1998b] provides connectionless integrity and data origin authentication, and the Encapsulating Security Payload (ESP) [Ken1998c] provides encryption.
The Source Address and Destination Address fields, present in both versions of IP, indicate the origin and the final destination of the IP datagram. While IPv4 addresses are 4-byte numbers, in IPv6 they are 16-byte values.
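The per-hop recalculation of the IPv4 Header Checksum described above can be sketched in Python as follows. The checksum is the one's complement of the one's complement sum of the 16-bit words of the header. The 20-byte sample header is a commonly used textbook example without options, and the helper forward_one_hop is a hypothetical simplification of ours of what a router does (a real router also performs the routing itself and reports expired datagrams).

```python
import struct


def header_checksum(header: bytes) -> int:
    """One's complement of the one's complement sum of the 16-bit header words."""
    if len(header) % 2:
        raise ValueError("header length must be a multiple of 2 bytes")
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:                      # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def forward_one_hop(header: bytes) -> bytes:
    """Decrement Time to Live (byte 8) and rewrite the checksum (bytes 10-11)."""
    hdr = bytearray(header)
    if hdr[8] <= 1:
        raise ValueError("Time to Live expired; the datagram would be discarded")
    hdr[8] -= 1
    hdr[10:12] = b"\x00\x00"                # checksum field counts as zero
    hdr[10:12] = struct.pack("!H", header_checksum(bytes(hdr)))
    return bytes(hdr)


# Sample header: TTL 0x40, protocol 0x11 (UDP), checksum 0xb861.
SAMPLE = bytes.fromhex("45000073" "00004000" "4011b861" "c0a80001" "c0a800c7")

forwarded = forward_one_hop(SAMPLE)
print(header_checksum(SAMPLE) == 0)         # a valid header sums to zero
print(forwarded[8], forwarded[10:12].hex())
```

A header whose checksum field is correct yields zero when the function is run over all its words, which is how a receiver verifies it. Changing only the Time to Live byte is enough to force a new checksum value at every hop.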
Deering originally proposed 8-byte addresses, but during the review process people felt that with 8-byte addresses IPv6 would have the same problem as IPv4 within a few decades. People then suggested 20-byte addresses, or even variable-length ones, and after much discussion it was decided that fixed-length 16-byte addresses were the best compromise. To get an idea of the number of addresses, we can calculate that, if they were equally distributed over the Earth, there would be about 6.7 × 10^23 addresses per square meter (a little more than the Avogadro number). Nonetheless, part of those 16 bytes carries routing information, allowing routers to work efficiently in such a big network [Hin1998].
Finally, we have the Flow Label field in the IPv6 main header, whose function has not been strictly defined in the IPv6 specifications. It is still an experimental field that may be used by a source to label sequences of packets for which it requests special handling by the IPv6 routers, such as non-default quality of service or real-time service.
Nevertheless, even though IP is the most important protocol in the Internet because it makes communication among remote networks possible, it is just a tiny part of the whole suite of protocols that the Internet uses. Figure 2-7 shows some of those protocols and their relation to the OSI reference model. At the lowest protocol levels, the Internet can use any of the available technologies, from Token Ring or Token Bus to high-performance Local Area Networks (LANs) such as FDDI (the ones represented in Figure 2-7 are just a few of them). However, one of the most widely deployed LANs in IP networks is Ethernet 13 (working at any of the speeds available). In the Internet architecture, the Physical and Data Link layers are joined together in a single level, which is usually called the Host to Network layer.
13 As a result, the de facto standard length of IP datagrams is 1,500 bytes, which is the Maximum Transmission Unit (MTU) of Ethernet networks.
Then, at a higher level, we have the IP protocol trying to make a single network out of a huge number of dispersed LANs. In the Internet architecture IP works at the Internet level, which is equivalent to the Network layer in the OSI model. Working on top of IP we have the Transport level, in which we usually find only two protocols: the User Datagram Protocol (UDP) [Pos1980] and TCP. UDP is very simple and is used for connectionless applications or unreliable connection-oriented applications. TCP is a connection-oriented protocol that provides reliability and congestion control. They have been the most used transport protocols in the last 20 years, but nowadays we also have SCTP, which is close to TCP in the functionality it provides but improves on it. SCTP is the main topic of this Master's Thesis, so we will say a lot more about it later.
Transport protocols are not the only ones on top of IP. Some control protocols operate directly over IP without any other protocol in between. Examples of such protocols are the Internet Control Message Protocol (ICMP) ([Pos1981b] for IPv4 and [Con1998] for IPv6), which is used to report errors at the IP level, and some routing protocols such as Open Shortest Path First (OSPF) [Moy1998], used to calculate the routing tables.
Figure 2-7: Some members of the Internet protocol suite
[Figure 2-7 diagram, by OSI layer: Application (7): TELNET, FTP, SMTP; Transport (4): SCTP, TCP, UDP; Network (3): IP; Data Link (2) and Physical (1): ETHERNET, TOKEN RING.]
In the Internet model we do not have the Session or Presentation layers. Instead, directly on top of the Transport layer we have the Application layer. The protocols shown in Figure 2-7 are just a very small portion of the total. Apart from Telnet (used for remote login), FTP (used for file transfer) and the Simple Mail Transfer Protocol (SMTP) [Kle2001] (used for e-mail transport) shown in the figure, there are many more protocols in use in the Internet, such as the Network File System (NFS) [She2000] (used to provide transparent file access for client applications) or the Simple Network Management Protocol (SNMP) [Cas1990] (used to manage remote systems through the network). Most of those protocols can use either TCP or UDP, but usually one of them is preferred.
All in all, IP networks provide a very flexible, well-known packet-switched network that is able to carry a multitude of protocols designed for the widest range of uses.
2.4 A marriage of convenience: reasons for SS7 and IP networks integration
The SS7 network and the Internet have been two independent networks that perform different tasks and provide different services. SS7 is used for telephony signaling and the Internet for data transfer in packet-switched networks. However, in recent years the services they offer have been merging. On the one hand, telephone users are demanding new services that involve access to the Internet. On the other hand, the services provided by IP networks have developed so that they are able to assure certain levels of quality. This makes IP networks more suitable for the transport of delay-sensitive data such as speech or telephony signaling. In the next subsections we review some of the most important reasons why it is convenient to merge SS7 and IP networks.
2.4.1 Voice over IP
During the last five years, IP Telephony or Voice over IP (VoIP) has become a hot topic. Its share of the global number of telephone calls is increasing, and it is becoming more and more popular. This popularity comes from the fact that it makes better use of the resources that transfer the voice stream, so the price that companies offer for IP Telephony can be lower, especially for long distance calls. However, its biggest problem is that it cannot offer the same quality levels as the PSTN does.
As we have seen, the traditional telephone system is circuit-switched. When someone makes a telephone call, a dedicated circuit is reserved for that specific call and remains open for its whole duration. IP telephony, in contrast, does not make any exclusive reservation of resources, but handles the call over the network as just another stream of data.
Normally, an IP telephony user dials a toll-free number and connects to the IP telephony gateway, also dialing the necessary information such as his account number and the destination telephone number. The gateway bridges the public telephone network and the IP network providing the service, and it is in charge of receiving the voice stream, compressing the speech and transporting it through a public or private IP network to the receiver gateway. That gateway connects to the call receiver using the local telephone network, decompressing the speech carried in the IP packets and passing the voice stream to the desired subscriber. The main difference between these two schemes is that the long distance carrier is replaced by an IP network. So we convert a long distance call into two local calls plus long distance IP transport. Thus the IP Telephony Service Provider (ITSP) can offer a cheaper price to its customers 14.
The costs of transporting speech over an IP network are lower than those of a long distance carrier, as all the facilities are shared among all the users and there are no dedicated channels. If we have a dedicated full-duplex circuit to transmit a telephone conversation we make poor use of it, as most of the time at least one of the parties will be silent (at least that is the idea) and its channel unused.
Initially, Internet telephony meant software that was able to establish a phone call with another computer also connected to the Internet. Those calls were free but offered poor quality at the beginning, as those programs were mostly the products of someone's hobby. The first of such programs appeared in 1995, and since then interworking trials between IP networks and the PSTN have been carried out. In 1997 the first phone-to-phone service was launched. Presently, many ITSPs offer long distance calls.
However, the PSTN and the Internet have very different characteristics that presently make it difficult to use IP networks for transporting voice. Some of those differences are summarized in Table 2-1.
14 This is phone-to-phone IP telephony. If, instead of connecting two telephone subscribers, either or both of the users involved in the call use a computer connected to the Internet and running the right software, then the call can even be free of charge.
Description | PSTN | Internet
Designed for | Voice only | Packetized data
Bandwidth Assignment | 64 kbps (dedicated) | Full-line bandwidth over a period of time
Delivery | Guaranteed | Not guaranteed
Delay | 5-40 ms (distance-dependent) | Not predictable (usually more than PSTN)
Cost for the Service | Per-minute charges: long distance; monthly flat rate: local access 15 | Monthly flat rate for access
Voice Quality | Toll quality | Depends on customer equipment
Quality of Service | Real-time delivery | Not real-time delivery
Table 2-1: Differences between the telephone and IP networks
One of the biggest differences between them is the Quality of Service (QoS) 16 they provide. While the PSTN has been designed as a highly reliable network in which a packet is rarely lost or delayed, the Internet is on the contrary just a best-effort network that every now and then loses a packet 17 and that cannot even provide a delay limit, which can severely damage speech quality. The QoS offered by VoIP is highly dependent on network congestion, degrading as the available bandwidth decreases. This problem can be tackled by simply adding bandwidth, but this would be only a temporary solution. More appropriate network-based mechanisms must be used in order to guarantee the necessary bandwidth to services such as VoIP and to help carriers minimize their costs while still achieving a satisfactory QoS.
Nevertheless, the future in this respect does not seem hopeless. The IETF has developed several technologies to add QoS features to IP networks that could address the problems raised by IP telephony and the transport of real-time media in general. Among those efforts we can highlight these:
- The Resource Reservation Protocol (RSVP) [Bra1997] is used by hosts to request a specific QoS from the network for application data streams. Routers use RSVP to communicate QoS requests to all nodes along the path of the flow, and to establish and maintain state. RSVP requests normally result in resources being reserved in each node along the data path. The desired level of QoS is assured by reserving the resources beforehand.
- The Resource Allocation Protocol (RAP) [Yav2000] is a protocol used by RSVP-capable routers to communicate with policy servers within the network. When there are not enough resources to satisfy all the RSVP requests, the policy servers determine who will be granted network resources and which requests will have priority.
- The Common Open Policy Service (COPS) [Dur2000] is the base protocol for communicating policy information between policy servers and routers within the RAP framework.
15 In the U.S. local calls are free (included in the monthly rate), unlike in most other countries.
16 QoS is specified in quantitative or statistical terms of throughput, delay, jitter and/or loss, or may otherwise be specified in terms of some relative priority of access to network resources.
17 In fact, TCP needs packet loss, which it uses as feedback to provide congestion control.
- The Real-time Transport Protocol (RTP) [Sch1996] is a protocol specially designed to carry real-time data. It operates on top of UDP and can be used in media-on-demand or VoIP. It consists of two parts. The data part, called RTP Data Transfer Protocol, is a thin protocol that provides timing reconstruction, loss detection, security and content identification. The control part, or RTP Control Protocol (RTCP), checks the quality of the transmission and controls the state of the participants. RTP itself does not provide any mechanism to ensure timely delivery or other QoS guarantees, but relies on lower-layer services such as RSVP to do so.
- The Real Time Streaming Protocol (RTSP) [Sch1998] is a control extension to RTP. It adds functions such as rewind, fast forward and pause.
- The Session Initiation Protocol (SIP) [Han1999] is an application-layer control protocol that can be used to set up, manage and terminate multimedia sessions (including VoIP). SIP can be used to establish multi-party sessions, and Internet telephony gateways that connect PSTN parties can also use SIP to set up calls between them. It also defines new types of Uniform Resource Locators (URLs) that help to translate phone numbers into IP addresses and back again (those URLs were revised in [Vh2000], where we can find the tel, fax and modem URL schemes).
- The Session Description Protocol (SDP) [Han1998] was defined to describe multimedia sessions for the purposes of session announcement, session invitation and other ways of initiating multimedia sessions. The session announcement itself is made using the Session Announcement Protocol (SAP) [Han2000], by multicasting an announcement that contains the description of the session.
- The use of Differentiated Services (DiffServ) [Bla1998] [Nic1998] enables service providers to classify packets with various priorities using the DSCP field of the IP header. It is expected that routers throughout a network will recognize those priority labels and give packets certain throughput privileges according to them.
- The Multiprotocol Label Switching Architecture (MPLS) [Ros2001a] essentially imposes some kind of circuit switching on an IP network. Packets can be grouped by tagging them with a common label, which permits expedited passage through the network. The labels not only inform the routers about the QoS to be applied but also supersede the routing decisions, which need to be made just once and are then applied to the whole group of packets.
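As an illustration of how the RTP data part listed above supports loss detection and timing reconstruction, the following Python sketch builds and parses the 12-byte fixed RTP header (version, payload type, 16-bit sequence number, 32-bit timestamp and 32-bit source identifier) and counts the packets missing from a sequence. It is a simplified sketch of ours, not part of any of the cited specifications: it ignores the padding, extension and marker bits, assumes in-order arrival, and uses illustrative values for the payload type, timestamp and source identifier.

```python
import struct

# Fixed RTP header: two flag bytes, sequence number, timestamp, source id.
RTP_HEADER = struct.Struct("!BBHII")


def build_rtp_header(payload_type, sequence, timestamp, ssrc):
    """Pack a version-2 header with no padding, extension or CSRC entries."""
    first_byte = 2 << 6
    return RTP_HEADER.pack(first_byte, payload_type & 0x7F,
                           sequence & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)


def parse_rtp_header(packet):
    """Return the fields a receiver needs for ordering and playout timing."""
    first, second, sequence, timestamp, ssrc = RTP_HEADER.unpack_from(packet)
    return {"version": first >> 6, "payload_type": second & 0x7F,
            "sequence": sequence, "timestamp": timestamp, "ssrc": ssrc}


def count_lost(sequence_numbers):
    """Count missing packets from gaps in 16-bit sequence numbers.

    Assumption: packets arrive in order; the modulo handles wrap-around.
    """
    lost = 0
    for previous, current in zip(sequence_numbers, sequence_numbers[1:]):
        lost += (current - previous - 1) % 65536
    return lost


header = build_rtp_header(payload_type=0, sequence=65535, timestamp=160, ssrc=0x1234)
print(parse_rtp_header(header))
print(count_lost([65534, 65535, 1, 2]))  # the packet numbered 0 never arrived
```

The sequence number lets the receiver notice the gap, while the timestamp lets it schedule playout at the right instant. Neither mechanism recovers the lost packet, which is why RTP still depends on the network for QoS.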
However, the IETF is not the only standardization organization that has published standards to help the development of VoIP. The ITU-T has published its recommendation H.323 [ITU2000], which deals with multimedia communication services (especially audio) over packet-switched networks that may not provide the necessary QoS (such as the Internet). Since the ratification of H.323 in 1998 and its subsequent revisions, this recommendation has been widely adopted to provide interoperability between VoIP products over local and wide area networks.
However, once a reasonable quality for speech transported over IP networks has been achieved, there are other tasks to solve. One of the biggest issues related to IP telephony has been its limited ability to interoperate with the PSTN. VoIP gateways are able to provide a means for the transport of a raw voice stream, but many of the services provided by the PSTN come from the signaling network it uses: the SS7 network.
The functionality provided by the SS7 network to carriers includes a wide range of features, from simple caller identification to more complicated IN-based features. Only when proper interworking between IP networks and SS7 is provided will VoIP services be widely adopted by customers. As a simple example, without a complete SS7 interconnection, ITSPs have to continue with their cumbersome multi-stage dialing practices (subscribers must first dial the gateway, then their customer ID and finally the number of the callee).
Moreover, a true integration of the voice (SS7) and data (IP) networks would bring the long-run benefits of VoIP supporting multimedia and multi-service applications, something that today's telephone system cannot compete with. Beyond replacing the circuit-switched network, VoIP has the potential to make phone service as flexible and programmable as e-mail and web services, to speed up the availability of multimedia communications, and to integrate phone service with existing common Internet services. Today, most of the interest in VoIP comes from its lower cost for long distance calls. However, in the future, the real benefits of VoIP will come from the possibility of offering these new services.
And it is not only a matter of the services offered. An integration of the voice and data networks would allow more standardization and would reduce the total amount of equipment needed. Also, having a single network for both voice and data would make its management easier.
2.4.2 The 3rd Generation Mobile Telephony
Once the telephone had been invented and electromagnetic wave propagation had been studied enough to become a means of communication (first using Morse code, which was the first digital code ever used), the next step forward was freeing the telephone from its wire boundaries and making it wireless. The first mobile telephone service was provided in the U.S. in the 1940s, and in the early 1950s in Europe. These were analog car phones, restricted in their mobility and number of subscribers. They were bulky and expensive, very susceptible to interference, with very high power consumption and poor speech quality. In the early 1980s there were about one million subscribers worldwide.
In the late 1970s and early 1980s, the introduction of cellular systems was a quantum leap in mobile communications. Thanks to semiconductors and microprocessors, new lighter, smaller and more sophisticated phones became a reality. These early cellular systems, which were only able to transmit analog voice, are known as the 1st Generation Mobile Telephony (1G), the most prominent ones being the Advanced Mobile Phone System (AMPS) in America, part of Europe and Russia, Australia and part of Asia; the Nordic Mobile Telephone (NMT) in the Nordic countries; and the Total Access Communication System (TACS) in Great Britain. There were about 20 million customers of 1G by 1990 and it is still in use.
The 2nd Generation Mobile Telephony (2G) is the one we use today. It is digital, thus providing a new range of services such as fax, short messages and data transmission, even with the possibility of encryption. Moreover, it provides advanced mobility services (roaming) that make it possible for customers to move to areas where different telephone companies operate while still having service (as long as they use the same technology).
The most successful of the 2G cellular standards is Global System for Mobile Communications (GSM), born in 1991 (footnote 18) and supporting about 66% of the some 860 million users of mobile telephony in July 2001 [GSM2001]. GSM is used mostly in Europe, but it is spreading to the urban areas of the U.S. and is present in about 170 countries worldwide, reaching penetrations of up to 80% in countries such as Finland. Some other important 2G systems are Code Division Multiple Access (CDMA), mostly used in the Asia-Pacific region; Time Division Multiple Access (TDMA), still in use in the U.S.; and Personal Digital Cellular (PDC), serving customers in Japan.
Footnote 18: The first public GSM call was made on 1st of July 1991 in a city park in Helsinki, Finland.
But 2G networks are far from perfect. There are several incompatible standards, which makes mobile terminals useless in areas or countries with a different technology; the bit rate for data transmission (9.6 kbps in GSM) is far too slow; the speech quality is good but could be improved; and users are demanding new services, such as multimedia applications, that do not fit in the 2G networks.
While a new generation of mobile phones is being developed, some new technologies have been added to the GSM networks, such as High Speed Circuit Switched Data (HSCSD), which provides up to 57.6 kbps by opening several circuit-switched channels; the General Packet Radio Service (GPRS), allowing data transfer at up to 171.2 kbps using IP packets (although that bit rate is not available yet); Enhanced Data rates for GSM Evolution (EDGE), with two versions based on HSCSD and GPRS respectively, which should provide 384 kbps (but is still under development); and Adaptive Multi Rate (AMR), to optimize speech quality. GSM, together with these new technologies, is often known as 2G+.
In this environment, the 3rd Generation Mobile Telephony (3G) will soon be a reality. The 3G networks are being specified by the worldwide 3rd Generation Partnership Project (3GPP) (footnote 19) with the main aim of building a global mobile communication system that provides multimedia services. The 3G networks (collectively known as International Mobile Telecommunications 2000 (IMT-2000), and as Universal Mobile Telecommunications System (UMTS) in Europe) started to be designed in the mid-1990s and were supposed to be ready by 2000. However, the first 3G license was granted in 1999 to a Finnish operator, and it is expected that 3G networks will start to provide service during 2002.
Footnote 19: 3GPP is a joint venture of several standardization bodies: the European Telecommunications Standards Institute (ETSI), the Standardization Committee T1-Telecommunications (T1) from the U.S., the Association of Radio Industries and Business/Telecommunication Technology Committee (ARIB/TTC) from Japan, the Telecommunications Technology Association (TTA) from South Korea and the Chinese Wireless Telecommunication Standard (CWTS).
UMTS is backward compatible with GSM and also uses SS7 networks for signaling. However, UMTS places more emphasis on packet switching than GSM, using it not only for signaling but also for user data. UMTS offers up to 1,920 kbps (under certain circumstances) for multimedia applications such as videoconferencing. Dedicating a channel of 1,920 kbps to each user would be too costly, so the resources must be shared among all users, which calls for a packet-switched network. In the first release of the UMTS specifications, 3GPP R99, Asynchronous Transfer Mode (ATM) was the packet-switched network chosen. This was mostly because it provides a way of assuring the necessary QoS, and also because its addressing space was big enough. However, the latest advances regarding QoS in IP and the development of IPv6, with its much wider range of addresses, changed the situation. After all, most of the servers containing the data that will be transferred to UMTS users are expected to be located in the Internet. Thus it made sense to use IP to transfer that data to the terminals (which will be IPv6 hosts) through the UMTS network. Moreover, IP networks are cheaper. But once we have an IP network transporting the user data, it is desirable that the signaling network is also IP-based, so that the same network can be used for both purposes.
There have been two more releases of UMTS, 3GPP R4 and 3GPP R5, in which the role of IP has been enlarged. In 3GPP R5, also known as the All IP release, the transport network uses IP networking as much as possible. IP and overlying protocols will be used in network control too, and the user data flows are also expected to be mainly IP based. In other words, a mobile network implemented according to the 3GPP R5 specifications will be an end-to-end packet-switched cellular network using IP as the transport protocol instead of SS7. But the IP-based network should still support circuit-switched services, and UMTS must be compatible with GSM. This means that we will still need a way to use the SS7 protocols in our IP network. As a result, the new 3G networks needed a way of carrying SS7 messages over IP. The interested reader can find a very good introduction to 3G networks in [Kaa2001].
2.5 This is what we were looking for
While people in the Multiparty Multimedia Session Control (MMUSIC) working group of the IETF were in charge of providing the necessary means to improve the QoS capabilities of IP networks, a new working group, Signaling Transport (SIGTRAN), was founded on November 23rd, 1998, with the mission of addressing the transport of packet-based PSTN signaling over IP networks. The approach chosen was to keep both the SS7 and IP stacks and to define an interface that would make it possible to transport both the voice stream and the SS7 signaling data through IP networks. As stated on the SIGTRAN working group page:
"The primary purpose of this working group will be to address the transport of packet-based PSTN signaling over IP Networks, taking into account functional and performance requirements of the PSTN signaling."
The first step was to produce an informational document [Ong1999], published in October 1999, identifying functionality and performance requirements to support telephony signaling over IP. Signaling messages have very stringent loss and delay requirements, and security and resilience must also be addressed. That document, among other things, defines the architectural model shown in Figure 2-8. In that figure we can see the gateways that connect the SS7 and IP networks, where the red lines represent voice channels and the black lines represent signaling links. We can identify the following three elements:
- A Media Gateway (MG) terminates PSTN media streams, packetizes the voice and delivers the packets to the IP network. At the receiver side, it performs the reverse function.
- The Signaling Gateway (SG) is a signaling agent that receives the native signaling and translates it to send it through the IP network, and vice versa.
- The Media Gateway Controller (MGC) handles the registration and management of resources at the MG, with the possibility of authorizing resource usage based on local policy.
The SS7-IP gateways not only translate and transport the SS7 signaling through the IP network, but can also receive management messages directly addressed to them. They provide transparent transport of message-based signaling protocols over IP networks. In this way, both the media data and the signaling can traverse the IP network and reach the destination, providing the same kind of services that the PSTN offers while making better use of the network that carries the voice stream.
Figure 2-8: SIGTRAN functional model
2.5.1 The need of a new transport protocol
Even before the architecture was completely defined, people from SIGTRAN started to define the protocols to be used to provide such translation from SS7 messages (see section 8.2). Obviously a transport protocol was needed, but there was no agreement about which one should be used, and they referred to it simply as the Common Transport Protocol (CTP). There was an initial attempt not to complicate the issue any further and just use either TCP or UDP. Both were implemented in almost every operating system, had gone through years of review, criticism and adjustment, and had been very successful. The functionality that the CTP was expected to support was the following [Ong1999]:
- Transport of a variety of Switched Circuit Network (SCN) protocol types, such as MTP3, ISUP, SCCP, TCAP, etc., with the ability to identify the specific SCN protocol being transported.
- Provide a common base protocol defining header formats, security extensions and procedures for signaling transport, and support extensions to add individual SCN protocols if needed.
- Together with IP, provide the relevant functionality as defined by the appropriate SCN lower layer. That relevant functionality may include:
  - Flow control.
  - In-sequence delivery of signaling messages within a control stream.
  - Error detection.
  - Recovery from failure of components in the transit path.
  - Retransmission and other error correcting methods.
  - Detection of unavailability of peer entities.
[Figure 2-8 labels: SSP, MG, MGC, SG, IP Network, SS7 Network]
- Support the ability to multiplex several higher layer SCN sessions on one single signaling transport session. In general, in-sequence delivery is required for signaling messages within a single control stream, but is not necessarily required for messages that belong to different control streams. The protocol should, if possible, take advantage of this property to avoid blocking delivery of messages in one control stream due to a sequence error within another control stream.
- Be able to transport complete messages of greater length than the underlying SCN segmentation/reassembly limitations.
- Allow for a range of suitably robust security schemes to protect signaling information being carried across networks.
- Signaling transport shall be able to operate over proxyable sessions, and be able to be transported through firewalls.
- Provide means for congestion avoidance and reaction to network congestion.
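The requirement above about independent control streams can be illustrated with a small sketch. It is not taken from [Ong1999] or from any SIGTRAN draft: the message tuples, the stream names "A" and "B" and both function names are assumptions made only for illustration. The sketch compares a receiver that releases messages in one global order with a receiver that keeps a separate sequence per control stream, when the first message is lost and arrives last.

```python
from collections import defaultdict

# Each message: (global_seq, stream, stream_seq, payload). Hypothetical layout.
ARRIVALS = [
    (1, "B", 0, "b0"),
    (2, "A", 1, "a1"),
    (3, "B", 1, "b1"),
    (0, "A", 0, "a0"),  # lost at first, retransmitted and received last
]


def deliver_global_order(arrivals):
    """Release payloads strictly in global sequence order (one ordered flow)."""
    pending, expected, log = {}, 0, []
    for global_seq, _stream, _stream_seq, payload in arrivals:
        pending[global_seq] = payload
        released = []
        while expected in pending:
            released.append(pending.pop(expected))
            expected += 1
        log.append(released)  # what the application receives after this arrival
    return log


def deliver_per_stream(arrivals):
    """Release payloads in order within each stream, independently of others."""
    pending = defaultdict(dict)
    expected = defaultdict(int)
    log = []
    for _global_seq, stream, stream_seq, payload in arrivals:
        pending[stream][stream_seq] = payload
        released = []
        while expected[stream] in pending[stream]:
            released.append(pending[stream].pop(expected[stream]))
            expected[stream] += 1
        log.append(released)
    return log


global_log = deliver_global_order(ARRIVALS)
stream_log = deliver_per_stream(ARRIVALS)
```

With a single global order nothing is delivered until the missing message arrives, whereas with per-stream ordering the messages of stream "B" are delivered immediately and only stream "A" waits.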
UDP was not even considered, and initially TCP was suggested as the candidate to become the CTP. However, detailed analysis showed that TCP had some deficiencies that made it unsuitable for PSTN signaling transport across IP networks. Among them, we can identify the following [Ste2000]:
- TCP is a transport protocol that provides both reliable data transfer and strict order-of-transmission delivery of data. This is what is normally desired, but there are some applications that need reliable transfer without sequence maintenance, or with just partial ordering of the data. An application with such needs suffers from the head-of-line (HOL) blocking (footnote 20) that TCP produces, which causes an unnecessary and undesirable delay.
- TCP is stream oriented, which can also be an inconvenience for some applications, since they usually have to include their own marks inside the stream so that the beginning and end of their messages can be identified. In addition, they must explicitly make use of the push facility to ensure that the complete message is transferred in a reasonable time.
- TCP was never designed to be multihomed (footnote 21). The limited scope of TCP sockets makes it difficult to design any data transfer mechanism in which a multihomed host could use several network cards at the same time. This would provide the high availability that some applications need.
- TCP does not scale well, since the maximum number of simultaneous TCP connections depends on kernel limitations. This is because TCP is generally implemented at the operating system level.
- In TCP there is no possibility of timer control. TCP generally does not allow application control over its initialization, shutdown and retransmission timers.
- TCP is relatively vulnerable to denial of service attacks. This kind of attack tries to make a service unavailable, commonly by exhausting the resources it uses. One such well-known attack is the so-called SYN attack (more about this attack in section 4.2).
Footnote 20: TCP manages messages as a single string of bytes without internal structure. Thus, if we use a single TCP connection to send several unrelated messages, the receiver will deliver them to their upper user in the same order they were sent. If a datagram is lost, this will affect all the subsequent messages, which will be retained until the lost message arrives. This is the HOL blocking.
Footnote 21: A multihomed host is one that has several network cards, and can make use of a number of IP addresses at the same time.
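The remark above that applications on a byte stream must insert their own marks to delimit messages can be sketched as follows. This is only one common convention (a 4-byte length prefix in network byte order), chosen here as an assumption; the function names and payloads are hypothetical and are not taken from any SIGTRAN document.

```python
import struct


def frame(message: bytes) -> bytes:
    """Prefix a message with its length so it can be found in a byte stream."""
    return struct.pack("!I", len(message)) + message


def unframe(buffer: bytearray) -> list:
    """Remove every complete message from buffer; keep partial data in place."""
    messages = []
    while len(buffer) >= 4:
        (length,) = struct.unpack_from("!I", buffer, 0)
        if len(buffer) < 4 + length:
            break  # the rest of this message has not arrived yet
        messages.append(bytes(buffer[4:4 + length]))
        del buffer[:4 + length]
    return messages


# The byte stream may be cut at arbitrary points, unrelated to message limits.
stream = frame(b"setup") + frame(b"release")
buffer = bytearray()
received = []
for piece in (stream[:6], stream[6:11], stream[11:]):
    buffer += piece
    received += unframe(buffer)
```

A message-oriented transport does this delimiting itself, so the application never sees partial messages.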
The transport of PSTN signaling across an IP network is one kind of application for which all of these limitations of TCP are relevant. There was an initial attempt to modify or enhance TCP to meet those requirements. However, the idea was discarded, mostly because other similar IETF investigations on transport issues had already pointed out how hard it would be. Therefore, the working group decided to design a new, suitable transport protocol that would operate on top of UDP. Apart from the necessary functionality that the CTP was expected to provide, some other features were identified as desirable [Ste2000]:
- Ability to discover the Maximum Transmission Unit (MTU) of the path used from the IP sender address to the IP receiver address, and possibility to fragment user data to conform to the discovered MTU.
- Possibility of sending user messages within multiple streams inside the same association.
- Sequenced delivery of the user messages sent through the same stream, and possibility of order-of-arrival delivery of individual user messages.
- Possibility of bundling multiple user messages into a single packet.
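The bundling feature in the list above can be sketched as a simple greedy packing of user messages into packets that respect a size limit derived from the path MTU. The 4-byte per-message header, the function name and the sizes used are assumptions for illustration only; they do not reproduce the format of any real protocol.

```python
PER_MESSAGE_HEADER = 4  # hypothetical per-message overhead, in bytes


def bundle(messages, max_payload):
    """Pack messages, in order, into packets of at most max_payload bytes."""
    packets, current, used = [], [], 0
    for message in messages:
        needed = PER_MESSAGE_HEADER + len(message)
        if needed > max_payload:
            # Such a message would have to be fragmented first (not shown here).
            raise ValueError("message does not fit in a single packet")
        if current and used + needed > max_payload:
            packets.append(current)  # close the packet that is already full
            current, used = [], 0
        current.append(message)
        used += needed
    if current:
        packets.append(current)
    return packets


packets = bundle([b"a" * 10, b"b" * 10, b"c" * 20, b"d" * 5], max_payload=32)
```

Bundling saves one packet header per bundled message, at the cost of a little bookkeeping at the sender.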
With these objectives fixed, people at SIGTRAN started to work on the design of a new protocol that could overcome TCP's problems.
2.5.2 A proposal that IETF could not refuse
Late in 1998, at the Orlando IETF meeting, several authors submitted proposals for protocols that totally or partially met those requirements. One of them was called Reliable UDP (RUDP) [Bov1999], which supported acknowledged data and retransmissions, but it provided neither support for multihoming nor any congestion avoidance algorithm, so it was finally abandoned. Another proposal was UDP for TCAP (T/UDP) [Ma1998], which included flow control and reliable data transfer, but it was equally abandoned. Yet another protocol with similar characteristics was the Simple SCCP Tunneling Protocol (SSTP) [Sn1999] (an evolution of the Connectionless SCCP over IP Adaptation Layer (CSIP) [Sn1998] protocol), which was able to run on top of UDP or TCP, but again, after two versions the idea was discarded. The PURDET [Ton1999] protocol was another option, using UDP and supporting sequencing, flow control, protocol identification, error retransmission and link loss detection. However, after its first version it was forgotten, like the rest. People at SIGTRAN were not only looking for new protocols; they also took into account some existing protocols that could be valid candidates, such as the ITU-T Service Specific Connection-Oriented Protocol (SSCOP) [ITU1994] and H.323 Annex E [ITU1999], or even RTP, but none of them was considered suitable for the purposes of SIGTRAN.
Nevertheless, there was a proposal submitted by Randall R. Stewart and Qiaobing Xie, the Multi-Network Datagram Transmission Protocol (MDTP) [Ste1998], which attracted the attention of the SIGTRAN working group. The design of MDTP started in 1997, independently of the SIGTRAN work, as a solution for some of TCP's weaknesses. After getting most of the general concepts together and having a working implementation, the authors decided to submit it to the IETF for consideration in summer 1998.
In its preliminary design, MDTP was an application level protocol working on top of UDP that incidentally met most of the requirements imposed by SIGTRAN on the CTP. This proposal was the only one that supported multihoming and avoided HOL blocking, and there was even an available implementation working with a performance similar to TCP's. These were good reasons to choose MDTP to become the CTP, and during the next 10 months it was improved and eight more versions were written. However, it never became a Request For Comments (RFC) and it was abandoned as well, the reason being that it was deeply modified and used as the basis of SCTP. The acronym by then stood for Simple Control Transport Protocol, but later on the designers realized that it was not that simple and that it was not limited to control messages. So the intention was first to change its name to Signaling Common Transport Protocol, but that name was never used and the protocol was finally renamed, in its 9th version, to the present Stream Control Transmission Protocol.
The change from MDTP to SCTP involved not only a change in the name but also a deep revision of the protocol itself. It was then that the protocol datagram header and its internal structure were almost completely modified (see section 3.1) so that it became highly extensible; the cookie mechanism was adopted in the initialization to avoid denial of service attacks similar to the known SYN attack of TCP (see section 4.2); the TCP congestion control features [All1999] were included in SCTP (discussed in section 5.2); and some other features such as stream negotiation (see section 5.3), message bundling and data fragmentation were also changed (as explained in section 5.1).
Later on, in January 2000, another big change was introduced: the working group revised the protocol stack to run SCTP directly on top of IP. This change was very controversial because it implicitly meant that SCTP should be implemented inside the operating system kernel. Thus, SCTP implementations would not be ready for some years, having to wait until the operating system vendors made it available in their products. Moreover, having SCTP inside the kernel would make it more difficult (if not impossible) to control the values of the timers and some parameters in order to adapt SCTP to different environments.
However, the benefits of locating SCTP in its architecturally right place in the IP stack outweighed all these problems. Moving SCTP on top of IP and giving it its own port number space opened the way for SCTP to become a major transport protocol, at the same level as TCP, making SCTP useful for a wider range of applications than telephony signaling transport. The first version of the Internet-Draft specifying SCTP was submitted in September 1999, and many modifications were made from then until late October 2000. Then, the 14th version of the SCTP Internet-Draft was raised to RFC status and was published by the IETF as RFC 2960, a Proposed Standard. During these almost 14 months of work, the design of the SCTP protocol was discussed daily in a distribution list that contained more than 1,000 members at some stages of the design, proposing changes and highlighting specification errors in more than 4,000 messages. SCTP evolved so much during its design that today it would be hard to say that MDTP was its predecessor. SCTP was initially designed to be a transport protocol for telephony signaling. It was not the original idea to design a protocol that could compete with TCP. In fact, in the second version of MDTP the following paragraph could be read in its Introduction section:
Comparing to traditional TCP [3], MDTP design is more tuned towards a special set of applications, that is the time critical fault tolerant applications using redundant LANs. It is not designed to replace TCP as a general purpose transmission protocol.
However, this paragraph was deleted seven months later, in April 1999, in the 5th version of MDTP, when the authors realized that they should not limit the scope of application of what they were designing.
During its long design time, many features were added to the original sketch, most of them trying to solve problems that had already been noticed when using TCP/IP, even if they were not that important for its main use, PSTN signaling transport across private IP networks. In February 2001 the discussion about SCTP was moved from SIGTRAN to the Transport Area Working Group (TSVWG), another working group of the Transport Area of the IETF. This effectively meant that the SIGTRAN working group had been very successful in designing SCTP to be useful to a wide range of applications, and thus it started to be regarded as a general-purpose transport protocol rather than a signaling-specific one. Since October 2000 some editorial and technical defects have been found, and it is planned that in the near future a new version of the SCTP specifications will be released, making the present one obsolete (more about this in chapter 9). Moreover, many people already know about SCTP and are writing extensions, so that valuable or needed features can be added to make SCTP more suitable for different environments. The extensibility of SCTP makes this a relatively easy job, and the present extensions will be discussed in section 8.1. SCTP could be used as the main transport protocol in the Internet, substituting TCP in the future. For this reason SCTP can be thought of as a renewed version of TCP with extended capabilities and, used together with IPv6, it is expected to change the way in which information is sent and received in the Internet.
3. THE DESIGN OF SCTP: DATAGRAM STRUCTURE
During its long design time, the feature set of SCTP grew. People from the distribution list where its design was debated sent valuable comments that shaped the structure of the protocol itself. All the modifications to the protocol were made after rough consensus was reached. Thus, the final specification of SCTP (contained in RFC 2960) is the result of the joint work of many people specialized in areas ranging from checksum protection to IP routing. They contributed their ideas and advice, pointed out errors, or tested their own implementations during the test sessions, discovering possible points of failure and improvements. In this chapter we will take a look at the internal structure of SCTP. As in the following chapters, we will explain not only what is written in the SCTP specifications, but also what is not there. We will discuss the motivations that made the designers choose some specific solutions and not others, and the reasons for including certain features while others seem to be absent. We will explain the design of SCTP from a historical point of view, highlighting the main pitfalls discovered in its evolution and the design errors that were later corrected. However, in this chapter we will just show the shape of the SCTP datagram, quickly reviewing the function of its fields. We will also briefly introduce the state diagram that represents the behavior of SCTP. In the next chapters we will go further, explaining the way SCTP performs its tasks.
3.1 Shape of SCTP datagrams: An evolution from MDTP
The internal shape of SCTP datagrams has completely changed since the first version of MDTP. Its features have been greatly improved and many mistakes have been corrected. But the basic ideas remained, and the final design of SCTP is internally much closer to MDTP than might be thought at first sight. As SCTP is an evolution of MDTP, we will first describe MDTP's header and internal structure as it was in its first version. Then we will discuss its final design in SCTP as published in [Ste2000].
3.1.1 Common header and internal structure of MDTP
When the first MDTP version was submitted to the IETF editor in August 1998, the datagram structure was TCP-like and looked as shown in Figure 3-1 (the numbers on top of the figure indicate the bit position). As we can see from Figure 3-1, the datagram format resembles that of TCP, revealing which protocol was the basis for starting to design MDTP. TCP is nowadays the most successful of the transport protocols used in the Internet, and so it was the best reference to start with. Later, with the evolution of MDTP and SCTP, the similarities with TCP remained more in the internal behavior than in the external shape. We are not going to explain in detail the meaning of the fields in the first version of the MDTP protocol, but we can note the following similarities and differences with TCP:
Figure 3-1: MDTP datagram structure in its first version
Every MDTP datagram had an overhead of 8 bytes containing the identifier of MDTP itself, while TCP has nothing like that. The reason is that MDTP was designed as an application protocol running on top of UDP, so the IP header field that identifies the protocol carried would always be set to 17, identifying UDP. But it is important for proxies, firewalls and even routers to know which protocol is carried in the IP datagrams, because then they can better decide what to do with them. So, these 8 bytes of overhead were a lesser evil. The MDTP Protocol Identifier 1 and MDTP Protocol Identifier 2 fields were originally set to the hexadecimal numbers F7873072 and 17074012 respectively. As is not uncommon when designing a new protocol, those values were chosen randomly and used because it was highly improbable that any data carried by UDP would start with these 8 bytes. Later on, it was decided that this way of identifying the MDTP protocol was not clean, as one would have to dig inside the application data carried in the UDP datagram to know what was being transported. So, it was accepted that certain UDP port numbers would be used when sending MDTP datagrams to help identify them. However, to allow protocol multiplexing (so that other protocols could share the same UDP port) the 8-byte identifier was not immediately removed. First, the MDTP Protocol Identifier 2 field disappeared in the 7th version of MDTP, and then the MDTP Protocol Identifier 1 field was reduced from 32 to 28 bits in the next version. Later, in the 9th version, it was made optional (sharing its 28 bits with an optional Cyclic Redundancy Code (CRC) instead). Finally, the MDTP identifier was completely removed when the first version of SCTP was released, as its identification inside UDP datagrams relied exclusively on the use of a reserved port (a port that, by the way, was never specified). But that was not a good solution either. Having the port number as the only way of identifying SCTP datagrams encapsulated in UDP would eventually limit the number of associations between two single-homed hosts to just one. This problem was finally eliminated when SCTP started to run on top of IP and was given the protocol number 132 as its identifier.
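The identification scheme of the first MDTP version, as described above, amounts to checking the first 8 bytes of the UDP payload. A minimal sketch of such a check follows. The two constants come from the text; the function name, the assumption that Identifier 1 precedes Identifier 2, and the use of network byte order are illustrative assumptions, not details confirmed by the MDTP drafts.

```python
import struct

MDTP_PROTOCOL_IDENTIFIER_1 = 0xF7873072
MDTP_PROTOCOL_IDENTIFIER_2 = 0x17074012


def looks_like_mdtp(udp_payload: bytes) -> bool:
    """Return True if the payload starts with both MDTP identifiers.

    Assumes Identifier 1 comes first and both are in network byte order.
    """
    if len(udp_payload) < 8:
        return False
    first, second = struct.unpack_from("!II", udp_payload, 0)
    return (first, second) == (
        MDTP_PROTOCOL_IDENTIFIER_1,
        MDTP_PROTOCOL_IDENTIFIER_2,
    )


sample = struct.pack(
    "!II", MDTP_PROTOCOL_IDENTIFIER_1, MDTP_PROTOCOL_IDENTIFIER_2
) + b"rest of the datagram"
```

The sketch also shows why the approach was considered unclean: a firewall or router has to inspect the UDP payload itself, instead of relying on a header field such as the IP protocol number.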
[Figure 3-1 fields, over bit positions 00 to 31: MDTP Protocol Identifier 1, MDTP Protocol Identifier 2, Acknowledgement Number (Seen), Sequence Number (Send), Part, Of, Data Size, In Queue, Version, flag bits, Data]
The Acknowledgement Number and Sequence Number were completely equivalent to the fields of TCP with the same name. The change in their relative position is anecdotal. The Data Size field was included at the beginning in the MDTP header. TCP has the Data Offset field, which includes only the length of the header. This is enough since the IP layer tells how large is the entire TCP datagram. This did not work with MDTP because it was designed to add padding (bytes set to 0) at the end of the datagram to make its length a multiple of 4 bytes, and those bytes are not part of the user data. This padding (originally optional) was added because most of the computers used nowadays read data in pieces of 32 bits or larger. So, it is effectively quicker to read 4 bytes in a row even if only one of them is useful. This field was moved from the header to the chunks 22 in the first version of SCTP, as it is explained later in section 3.1.2. MDTP was message oriented, which is quite a big difference from TCP. TCP just manages a stream of bytes, and it is the application that delimits the different messages carried in the same flow of bytes and cut that stream in pieces. In MDTP you sent messages, identified by the Part and Of fields. If the datagram contained a whole message, then Part was set to 0 and Of was set to 1 (note that a single datagram could contain more than one message, all bundled together to make the transmission more efficient, saving the bytes of the header). Otherwise, Of told to the receiver the number of fragments of the message and Part indicated the order of the fragment (from 0 to Of 1) so the message could be correctly reassembled. That made that the length of the biggest message transmitted through MDTP was 255 times the MTU of the network used to transmit the datagrams minus the IP and MDTP headers. 
If we are using an Ethernet, whose MTU is 1,500 bytes, and IPv4 as the network protocol, with a typical header without options of 20 bytes, that makes 255(1,500 20 24) = 371,280 bytes. Even though this value should be more than enough for a single message, the mechanism was later on modified in SCTP making the maximum length of a message technically infinite. There were 16 flag bits grouped in two bytes called Flags and Mode. This was similar to TCP, with the difference that the added functionality in MDTP needed more bits to perform correctly (only two of those bits were free, the RE1 and RE2 ones). Those bits were used as a negotiation during the establishment phase to ask for optional services, and also to tell the receiver which was the internal structure of the Data field (that could contain several kinds of information as well as user data). The two free bits were already used in the 4 th version of MDTP, limiting the possibilities of extension of MDTP. Moreover, having such a big quantity of flags was somehow ugly and difficult to manage, making the datagram process quite clumsy. In the 8 th version, the MDTP datagram was highly transformed, including a Control Parameter Part and Data Part areas. The flags were replaced by 2 bits indicating if either the control and/or data areas were present, plus a 6-bit identifier of the control parameter. In addition, 8 bits were reserved for future use. The Version field represented the version number of the protocol. This field was reduced to 4 bits in the 8 th release of the MDTP specifications and finally discarded in the 6 th version of SCTP. The way SCTP was designed made it so
22 A chunk is a unit of information within an SCTP packet, consisting of a chunk header and chunk-specific content.
easily extendable that a Version field did not make much sense. If you cannot extend SCTP to include the feature you want to add, you probably need a new protocol, not a new version of SCTP.
The In Queue field contained the number of messages the sender of the datagram had in its incoming queue, waiting to be read by the application. It was used for flow control purposes. This field was equivalent to the Window field in TCP, with the difference that it counted unread messages, not bytes. There was a long discussion about the use of this field, and it was deleted from the header in the 8th version of MDTP. It was agreed that the information about data sent but not yet acknowledged (referred to as outstanding data) was enough for the congestion avoidance algorithms, and so this field was a waste of space in the header. But as soon as the reference implementation of MDTP was updated, it was noted that knowledge of the outstanding data did not provide the necessary information about the state of the receiver's incoming buffer. It is clear that if the receiver acknowledges the receipt of certain datagrams but not the previous ones, and if they have to be delivered in order to the upper user, the received data must be occupying space in the MDTP buffer, but the converse does not hold. Even when there was no outstanding data, the receiver's buffer could be full if the upper user did not retrieve the data received. If the buffer is full, all the incoming data will be discarded, and network resources are wasted. The point is that there are two different problems to be addressed: congestion control (which is related to the network) and buffer control (which happens at the receiver side). The In Queue field was a useful hint to the data sender about the state of the receiver's buffer. However, it would be more useful if the information carried in that field were expressed in bytes, not in messages.
So the final decision was that the receiver's buffer size would be exchanged during the establishment phase and that a similar field would be used again (called Advertised Receiver Window), this time not in the header but in the acknowledgement chunks. This change was made in the first version of SCTP.
As its name tells us, the Data field carried the user data, but this was not the whole truth. Apart from data, and depending on the value of the Flags and Mode fields, it could carry MDTP control information related to its internal behavior. That made the Data field a kind of wildcard field that could be used for almost anything. This was not a neat design, and it was changed in the 8th version of MDTP, which distinguished a Control Parameter Part field from a Data Part field. This structure evolved and was finally converted into control chunks and data chunks.
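The difference between congestion control and buffer control discussed in this section can be illustrated with a toy model. It is a sketch under stated assumptions, not the behavior of any real implementation: the `Receiver` class and the `can_send` function are hypothetical names, and the sender rule (outstanding data must fit both in the congestion window and in the last advertised receiver window) is a simplification.

```python
class Receiver:
    """Toy receiver with a fixed buffer that the application drains."""

    def __init__(self, buffer_size: int) -> None:
        self.buffer_size = buffer_size
        self.buffered = 0  # bytes received and acknowledged but not yet read

    def advertised_window(self) -> int:
        """Free buffer space in bytes, the value advertised to the sender."""
        return self.buffer_size - self.buffered

    def receive(self, nbytes: int) -> bool:
        """Accept data if it fits; return False if it had to be discarded."""
        if nbytes > self.advertised_window():
            return False
        self.buffered += nbytes
        return True

    def application_read(self, nbytes: int) -> None:
        """The upper user retrieves data, freeing buffer space."""
        self.buffered -= min(nbytes, self.buffered)


def can_send(nbytes: int, outstanding: int, cwnd: int, peer_window: int) -> bool:
    """Allow sending only if both the network limit and the buffer limit permit it."""
    return outstanding + nbytes <= min(cwnd, peer_window)
```

If the receiver accepts 4,000 bytes into a 4,000-byte buffer and the application reads nothing, the sender has no outstanding data, yet `can_send` refuses further data because the advertised window is zero. Outstanding data alone would not have revealed this.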
3.1.2 Common header and internal structure of SCTP
We have just reviewed the initial structure of MDTP. About 27 months later, SCTP looked as shown in Figure 3-2. As we can see there, SCTP's common header is completely different from MDTP's, and it is far less complex. That makes SCTP datagrams easier to process. Nonetheless, the internal structure is somewhat more elaborate, since an SCTP datagram contains several structures at different levels.
Figure 3-2: Structure of SCTP datagrams
Every single SCTP datagram has a common header of 12 bytes, followed by one or more structures called chunks. The common header has the following elements:
In January 2000 SCTP became a transport protocol running on top of IP. Therefore, the information carried in the UDP header had to be moved to the SCTP header: the Source Port Number and the Destination Port Number had to appear in the SCTP header. However, placing SCTP at the same level as protocols such as TCP or UDP also meant that SCTP had to face the interactions with other protocols running on top of IP. One of those protocols, ICMP [Pos1981b], is the one in charge of telling the IP users about problematic situations such as lack of buffer space in a router, an unreachable address, or inefficient routing tables. ICMP was initially defined for IPv4, but it has an IPv6 version (ICMPv6) [Con1998]. When any situation in the IP network triggers the transmission of an ICMP message, that message includes the beginning of the IP packet that originated the anomalous situation. The ICMPv6 messages include as many bytes
of the original IP datagram as will fit without the ICMPv6 packet exceeding the minimum IPv6 MTU (1,280 bytes), while the ICMP messages contain only the IP header and the first 8 bytes of the payload of the IP datagram. As the Source Port Number and the Destination Port Number fields are vital to identify which SCTP association triggered the sending of the ICMP message, those two fields had to be located in the first 8 bytes of the SCTP datagram. When SCTP was modified to run directly on top of IP, it was also agreed that all the TCP ports used for well-known applications would be automatically reserved in the SCTP port address space. That would not only make the migration of applications from TCP to SCTP easier, but would also avoid a lot of bureaucratic work with the Internet Assigned Numbers Authority (IANA) (see footnote 23).
The Verification Tag field is the evolution of the Acknowledgement Number and the Sequence Number fields in MDTP. It plays the same role as those fields in the establishment phase (together with the Initiate Tag field of the INIT chunks, as we will see in chapter 4), but it also gives protection against blind attacks. It is a 32-bit integer, randomly chosen and exchanged in the establishment phase. It is used without modification throughout the whole life of the association to validate that incoming datagrams were really sent by the peer endpoint and not by an attacker. Otherwise, it would be really easy for an attacker to forge one of our peer's addresses and, for example, abort the association. As happened with the Source Port Number and the Destination Port Number, the position of the Verification Tag became an important issue when it was decided that SCTP should run directly on top of IP. Some ICMP and ICMPv6 messages can be used to attack TCP, and they could also be used to attack SCTP in the same way. One of these types of messages is the ICMP Source Quench message (note that ICMPv6 does not have this kind of message).
That message is typically sent by a router that suffers from lack of buffer space, or that is receiving IP datagrams at a higher rate than it can process them, so it has to discard those datagrams. The arrival of this kind of message at a TCP sender causes the congestion window (cwnd) to be set to one segment, initiating slow start (see section 21.10 of [Ste1994]). As a result, the sender is allowed to have less outstanding data, and so the speed of data transmission may decrease. This is because the sender must wait for the acknowledgement of the data already sent before transmitting again. If the Round Trip Time (RTT) is longer than the time the TCP sender takes to send cwnd bytes, it will stay idle part of the time waiting for the acknowledgements. The MTU discovery algorithm is described in [Mog1990] for IPv4, and in [McC1996] for IPv6. It relies on the use of the ICMP Destination Unreachable message (with a code meaning Fragmentation Needed and DF Set) and the ICMPv6 Packet Too Big message. The ICMP message is sent by a router to the sender of an IPv4 packet when the Don't Fragment (DF) bit in the header of the IPv4 datagram is set (meaning that this IPv4 packet can not be fragmented, see section 2.3.2) but the packet would have to traverse a network whose MTU is smaller than the size of the packet, and so it must be discarded. In IPv6, an ICMPv6 Packet Too Big message must be sent by a router in response to a packet that it can not forward because the packet is larger than the MTU of the outgoing link (note that
23 IANA is the organization that assigns the port numbers, protocol identifiers, IP addresses, domain names, and basically every number that identifies something in the Internet.
in IPv6 the routers never fragment IPv6 datagrams). When either of these two messages is received, TCP decreases the segment size and retransmits the segment that triggered the sending of the ICMP message (see [Ste1994], section 24.2). The smaller the segment size, the bigger the overhead of the IP and TCP headers, making the transmission less efficient, so the segment size should not be reduced unless it is unavoidable. An attacker could easily send us any of those ICMP messages to reduce our throughput. Therefore, the SCTP header was designed so that the Verification Tag is present in the ICMP messages. When processing a received ICMP packet, the association affected is determined by using the Source Address and Destination Address fields (carried in the IP header inside the ICMP message), and the Source Port Number and Destination Port Number (carried in the first 4 bytes of the SCTP header also included in the ICMP message). Then, to check that this ICMP packet was not sent by an attacker, the SCTP Verification Tag also carried in the ICMP message is compared with the one expected for that association. This field appeared in the first version of SCTP. When it was decided, in the 6th version, that SCTP should run directly on top of IP, it was agreed that this field should be placed within the first 8 bytes of the header, for the reasons stated above.
The Checksum field has proved to be quite a controversial one. As we have seen, the first versions of MDTP did not have any kind of checksum because, as MDTP was transported by UDP, it seemed that the checksum UDP provides would be enough. Later, in the last MDTP version, an optional 16-bit Cyclic Redundancy Check (CRC-16) like the one standardized by the ITU-T (see [ITU1996], section 8.1.1.6.1) was added to the header. This approach was maintained in the first five versions of SCTP.
Then, in the 6th version of SCTP it was changed again to a 32-bit checksum, the Adler-32 checksum [Deu1996], and this is the one that was finally included in the SCTP specifications. But later on it was agreed that Adler-32 provides weak error detection for small frames, and taking into account that telephony-specific messages typically use packets of less than 128 bytes, another kind of check should be used. The discussion has been long, but it has been agreed that the 32-bit Cyclic Redundancy Check (CRC-32) standardized by ITU-T (see section 8.1.1.6.2 of [ITU1996]) will be used. This checksum issue will be further discussed in section 9.1.
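The layout of the common header and its role in validating ICMP messages can be sketched as follows. This is a minimal illustration in Python, not an implementation of SCTP: the function names are hypothetical, the checksum is left at zero rather than computed, and the comparison of IP addresses that would normally precede the port lookup is omitted.

```python
import secrets
import struct

# Source Port (16 bits), Destination Port (16 bits),
# Verification Tag (32 bits), Checksum (32 bits): 12 bytes in network byte order.
COMMON_HEADER = struct.Struct("!HHII")


def new_verification_tag() -> int:
    """Pick a random 32-bit tag to be exchanged in the establishment phase."""
    return secrets.randbits(32)


def build_common_header(src_port: int, dst_port: int, tag: int, checksum: int = 0) -> bytes:
    """Pack the four fields of the common header."""
    return COMMON_HEADER.pack(src_port, dst_port, tag, checksum)


def icmp_matches_association(quoted: bytes, src_port: int, dst_port: int, tag: int) -> bool:
    """Check the first 8 bytes of our datagram as quoted inside an ICMP message.

    The ports identify the association; the Verification Tag shows that the
    sender of the ICMP message has actually seen one of our datagrams.
    """
    if len(quoted) < 8:
        return False
    q_src, q_dst, q_tag = struct.unpack("!HHI", quoted[:8])
    return (q_src, q_dst) == (src_port, dst_port) and q_tag == tag
```

Because the ports and the tag together occupy exactly the first 8 bytes, even an ICMP message that quotes only 8 bytes of the SCTP datagram carries everything this check needs. A blind attacker who guesses the ports still has to guess the 32-bit tag.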
After the common header, there must be at least one chunk. A chunk is an independent structure with its own identifier and meaning but with a common Type-Length-Value (TLV) structure. Chunks are used in a broad sense to send requests to the peer endpoint and to receive the answers to those requests. The chunks initially defined in [Ste2000] are the following:
The Initiation (INIT), Initiation Acknowledgement (INIT ACK), State Cookie (COOKIE ECHO) and Cookie Acknowledgement (COOKIE ACK) chunks. They are used in the establishment phase.
The Payload Data (DATA) and the Selective Acknowledgement (SACK) chunks. They are used for the data transfer.
The Heartbeat Request (HEARTBEAT) and Heartbeat Acknowledgement (HEARTBEAT ACK) chunks. These chunks are used to track the state of the different network interfaces used in the association.
The Operation Error (ERROR) chunk, which is used to report a non-fatal error.
The Shutdown (SHUTDOWN), Shutdown Acknowledgement (SHUTDOWN ACK) and Shutdown Complete (SHUTDOWN COMPLETE) chunks. They are the ones used during the graceful termination of the association.
The Abort (ABORT) chunk, which reports a fatal error and terminates the association.
The Explicit Congestion Notification Echo (ECNE) and the Congestion Window Reduced (CWR) chunks. However, these two chunks are not really defined in [Ste2000]; only their identifiers have been reserved. The reason is that the Explicit Congestion Notification (ECN) mechanism was still being studied at the time (and there has not been great progress in its application to SCTP so far); that work was finally published in [Ram2001].
This structure is what makes SCTP most different from TCP. As we have seen before, the initial design of MDTP was quite close to TCP, with lots of flags and fixed length fields. SCTP structure tried to avoid two problems present in TCP:
TCP has a major problem with its extensibility. TCP has only 40 bytes of space to include options in its header, and six free bits (with the extension of TCP for ECN [Ram2001] only four bits remain unused). This makes TCP hard to extend (and that is exactly why a new protocol was needed), so the designers of SCTP tried to avoid ending up in the same situation in the future. The initial design including many flags was abandoned.
When TCP was designed, one of the key design principles was to make it as efficient as possible in terms of the overhead produced by its header. If something made the processing of a TCP datagram somewhat harder but saved one byte of its header, it was considered a good choice. There are even standards such as the one described in [Jac1990], which defines a method to compress TCP's header from its typical 20 bytes to an average of 3 bytes. This is possible due to the similarity between TCP headers in segments belonging to the same connection. These efforts can be easily understood if we take into account that in the 1970s and 1980s, when TCP started to be used, the data links could go as fast as several tens of kilobytes per second in the best case. But this is not the case anymore. Nowadays it is quite common for a computer to be connected to the Internet through a 100 Mbps Ethernet card, more than 1,000 times faster than the connections used about 30 years ago. So the key question for a transport protocol at present is not whether its overhead is a little bigger or smaller, but whether it is simple, easy to extend, and its datagrams are fast to process (see footnote 24). Some tests made by Randall R. Stewart, one of the parents of SCTP, showed that it was quicker to process TLV structures than to check the value of specific bits inside a byte.
24 However, when the TCP/IP packets are sent not over a fast LAN but over a low-capacity radio link, compressing the headers can greatly improve the performance. Compression not only lowers the header overhead, allowing the use of small packets for delay-sensitive low data-rate traffic and improving the interactive response time, but also reduces the packet loss rate over lossy links because fewer bits are sent per packet. One of the latest efforts to compress TCP/IP headers (both IPv4 and IPv6, including the IPv6 extension headers) is published in [Pri2001].
Moreover, the extensibility achieved with this model (which is the same one used in IPv6) is excellent. A few more bytes of overhead are worth these features, especially since several of those structures can be bundled inside a single datagram, saving the space of a repeated common header.
Therefore, the first version of SCTP was kind of a revolution in this aspect as it was deeply modified from the last MDTP version. As can be seen in Figure 3-2, all the chunks follow a basic guideline in their definition. They have the following fields:
A chunk first has a one-byte field, the Chunk Type, which distinguishes one chunk from another and tells the receiver what to do with it. As we have stated before, there are 15 values defined for the Chunk Type field in [Ste2000] (all the values from 0 to 14), and the rest are reserved by the IETF. Originally, the value 254 was used for vendor-specific chunk extensions. That was designed when SCTP was running over UDP and when it was expected that every company interested in SCTP would implement its own version. But as SCTP evolved, it was moved to run on top of IP, and moreover, there were plans to include its code inside the kernel of the operating system (this has already been done for UNIX and Linux). So it would be more likely that companies interested in SCTP would get it directly from their operating system provider. This would make it expensive for a single company to ask for a specific extension to SCTP. Also, there was a growing feeling that letting vendors define and use their own chunks would lead to future interoperability problems. That was not necessarily true: it has been shown in the past that the absence of such a vendor-specific extension field in other standards did not stop a company from defining non-interoperable versions of an IETF protocol and forcing the industry to adapt to it. A vendor extension is only useful between two hosts that understand the extension (normally from the same vendor), in which case the vendor could introduce its new functionality in any way it wanted. In a discussion about the issue on the mailing list, it was agreed that defining this kind of non-interoperable extension would tempt people to use it, and so this possibility was removed from the 11th version of SCTP. It did not add any value to the protocol, and vendors could still propose extensions meeting their needs to the IETF (in the end, SCTP was defined by several vendors discussing on a mailing list).
As the chunks can be created to fulfill the needs created by any new feature, the receiver can find new chunk types that it may or may not understand. The design group could have easily chosen to simply discard the chunks that are unknown, but this does not always work. Sometimes the sender of the chunk must know if the receiver understood it or not, and some other times, the processing of that chunk can be of vital importance to continue processing the rest of the datagrams. So, the first two bits of the Chunk Type field tell the receiver what to do in case it does not recognize the chunk. Depending on their value the next actions should be performed:
00: The receiver should discard the whole datagram, not processing any further chunks within it.
01: Same as with 00, but also reporting to the sender that it did not recognize the chunk.
10: The receiver should discard this chunk, but continue processing the rest.
11: Same as with 10, but also reporting to the sender that it did not recognize the chunk.
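The four cases above map directly onto two bit tests. The following Python sketch assumes that the "first two bits" are the two most significant bits of the one-byte Chunk Type; the function name `unknown_chunk_action` is hypothetical.

```python
def unknown_chunk_action(chunk_type: int) -> tuple[bool, bool]:
    """Decide what to do with a chunk whose type is not recognized.

    Returns (discard_whole_datagram, report_to_sender). When the first
    element is False, only the unknown chunk itself is discarded.
    """
    if not 0 <= chunk_type <= 255:
        raise ValueError("Chunk Type is a one-byte field")
    top_bits = chunk_type >> 6                  # the first two bits of the field
    discard_whole_datagram = (top_bits & 0b10) == 0
    report_to_sender = (top_bits & 0b01) == 1
    return discard_whole_datagram, report_to_sender
```

For example, an unknown chunk of type 191 (binary 10111111) is skipped silently and the rest of the datagram is processed, while type 127 (binary 01111111) causes the whole datagram to be discarded and reported.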
With this convention, the negotiation phase used to request new services is simpler, because if the receiver does not have that feature it can always be forced to send a negative answer to our request. This idea was taken from IPv6, where the first three bits of an unknown option type tell the receiver whether it has to discard the whole datagram or just skip the option, whether it has to send back an ICMPv6 message, and whether the option can be modified en route.
The next field is the Chunk Flags field. The flag-based structure was abandoned as we have explained before, but that did not mean flags were completely useless. Some of the fields of a chunk can be expressed as Boolean values, and for those the flag structure is still valid. The meaning of the flags is chunk dependent, and only three chunks defined in [Ste2000] have flags: the DATA, SHUTDOWN COMPLETE and ABORT chunks.
As the length of the chunks can be variable, it is necessary to include a Chunk Length field.
Not all the chunks have Fixed Fields. Some of them do not need to provide any more information than their own Chunk Type. Nevertheless, most of them have some other fields to give extra information. Of course, the structure of the Fixed Fields is chunk dependent.
If the chunk has to include any optional or variable length data, it can carry parameters.
Finally, if the Chunk Length is not a multiple of four bytes, padding bytes must be appended at the end. They are compulsory, and are included because nowadays most computers use at least 32-bit buses and read the data in pieces of 4 bytes or more, so padding makes their buffer management easier.
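The chunk layout and the padding rule can be sketched as a small encoder and decoder in Python. The text above only fixes the one-byte Chunk Type; the one-byte Chunk Flags, the two-byte Chunk Length, and the convention that the length counts the 4 header bytes but not the padding are assumptions taken from the published SCTP specification rather than from this section. The names `encode_chunk` and `iter_chunks` are hypothetical.

```python
import struct


def encode_chunk(chunk_type: int, flags: int, value: bytes = b"") -> bytes:
    """Build one chunk: Type, Flags, Length, value, then zero padding."""
    length = 4 + len(value)        # header fields plus value, padding excluded
    padding = (-length) % 4        # bytes needed to reach a multiple of four
    header = struct.pack("!BBH", chunk_type, flags, length)
    return header + value + b"\x00" * padding


def iter_chunks(data: bytes):
    """Walk the chunks bundled after the common header, skipping padding."""
    offset = 0
    while offset + 4 <= len(data):
        chunk_type, flags, length = struct.unpack_from("!BBH", data, offset)
        if length < 4 or offset + length > len(data):
            raise ValueError("malformed chunk length")
        yield chunk_type, flags, data[offset + 4:offset + length]
        offset += length + (-length) % 4   # next chunk starts on a 4-byte boundary
```

A chunk with a 5-byte value is encoded with a Chunk Length of 9 and occupies 12 bytes on the wire. Because the length field excludes the padding, the receiver recovers the exact value and still finds the next bundled chunk on a 4-byte boundary.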
The parameters are similar to the chunks but at a lower level. They were created to be able to include optional or variable length information inside the chunks, and to provide more possibilities of extension (it should be noted, however, that chunk types 63, 127, 191 and 255 are reserved for IETF-defined chunk extensions). As can be seen in Figure 3-2 they have the following structure:
The parameters were designed to make the possibilities of expansion of SCTP virtually unlimited. Therefore, they have a Parameter Type field of two bytes. The resulting 65,536 different parameter types should be more than enough. For comparison, only 8 parameters have been defined in [Ste2000], and another one has been reserved for ECN (some others have been defined in Internet-Drafts specifying SCTP extensions, see section 8.1). As in the case of the chunks, the first two bits of the Parameter Type tell the receiver what to do in case it does not understand it. The behavior is basically the same as described above but, as we are at a lower level, when the first bit of the Parameter Type is set to zero the receiver should discard not the whole datagram but the whole chunk.
As the parameters have variable length, there is a Parameter Length field that tells the receiver about the size of the parameter. The Parameter Value contains the actual information (note that the Parameter Value is optional, the Parameter Type alone can be enough in some cases). As in the case of the chunks, there are also padding bytes at the end of the parameter if its length is not a multiple of four bytes.
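The parameter rules described above can be combined into one parsing loop. The sketch below is in Python and follows the behavior as stated in this section (an unknown parameter whose first bit is zero causes the whole chunk to be discarded). The two-byte Parameter Length that counts the type and length fields but not the padding is an assumption taken from the published SCTP specification, and `parse_parameters` is a hypothetical name.

```python
import struct


def parse_parameters(data: bytes, known_types: set[int]):
    """Parse the parameters carried by one chunk.

    Returns (accepted, to_report, discard_chunk), where accepted is a list of
    (type, value) pairs and to_report lists unknown types to be reported.
    """
    accepted, to_report = [], []
    offset = 0
    while offset + 4 <= len(data):
        ptype, length = struct.unpack_from("!HH", data, offset)
        if length < 4 or offset + length > len(data):
            raise ValueError("malformed parameter length")
        value = data[offset + 4:offset + length]
        if ptype in known_types:
            accepted.append((ptype, value))
        else:
            if ptype & 0x4000:         # second bit set: report it to the sender
                to_report.append(ptype)
            if not ptype & 0x8000:     # first bit clear: discard the whole chunk
                return [], to_report, True
        offset += length + (-length) % 4   # skip the padding bytes
    return accepted, to_report, False
```

A parameter whose length is exactly 4 has no Parameter Value at all, which corresponds to the case where the Parameter Type alone is enough.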
A parameter can normally be sent inside only one type of chunk, but this is not required, so the Parameter Type field must be unique across all the chunks. There is yet another structure used inside SCTP datagrams, as Figure 3-2 shows: the so-called Error Cause. Error causes are basically the same as parameters. They have exactly the same structure, with Cause Code, Cause Length and Cause Value fields, which are completely equivalent to the Parameter Type, Parameter Length and Parameter Value fields. The main differences between the error causes and the parameters are:
Error causes are only included inside the ERROR and ABORT chunks (more about them in sections 6.2 and 7.2 respectively). They inform about some problematic situation, such as the receipt of an unrecognized chunk or parameter, lack of resources to open a new association, or a type of address that can not be managed. These two types of chunks can not carry parameters. The first two bits of the Cause Code field do not have the meaning they have in the Parameter Type value. This way of specifying how to act upon the receipt of an unknown error cause is not necessary due to two main reasons. First, no error cause should be originated by another error cause (this is basically the same idea as that no ICMP message is sent about ICMP messages). Second, error causes different than the ones already defined in [Ste2000] will be sent only in response to a chunk or parameter defined in an SCTP extension. If the sender of the chunk or parameter that triggered the sending of the error cause knows about that extension, then it should know about the error cause as well. Thus, in theory a host could never receive an unknown error cause (unless any of the two endpoints involved has bugs in its implementation).
As we can see, this design of SCTP is what makes possible one of its key features: its extensibility. A simple example showing that many different needs will appear is that SCTP was created to transport several different telephony signaling protocols with different requirements, and later its range of use became wider, turning it into a possible replacement for TCP. Thanks to the chunk structure of an SCTP datagram, it is really easy to design new chunks and parameters that provide a new feature, and to define new error causes that report problems originated by the use of that new extension. Such a new feature will always be backwards compatible with implementations of SCTP that do not support it since, as we have seen, the chunk sender may require to be informed if the receiver is not able to understand that chunk. The chunk sender will eventually decide whether it can continue without the features provided by the SCTP extension or whether it tears down the association. There is complete freedom for the design and use of new SCTP chunks as well as parameters and error causes. The only problem with doing this is that the new feature should then be presented to the IETF to be accepted as a valid SCTP extension, and the time involved in this process can be significant. This is because an agreement about its use and design should be reached inside the proper IETF working group (SCTP was first part
of the SIGTRAN working group, and then it became a matter of TSVWG). As we have seen with the design of SCTP, it can take a long time. Nowadays, several extensions of SCTP are being designed. We will deal with them in chapter 8.
3.2 SCTP association management: The state diagram
As in TCP, the steps required to establish and release associations can be represented as a finite state machine. In SCTP it has 8 states instead of the 11 that TCP has, as represented in Figure 3-3. In the figure we can clearly identify the 8 different states as rounded rectangles (note that the CLOSED state appears twice, and that the element in the upper part of the diagram labeled Any State is not another state but stands for any of the 8 states). The two computer icons represent the other host, to which we send datagrams and from which we receive them. As shown in the legend in the bottom right area of Figure 3-3, there are three types of arrows with different meanings:
The notched arrows with associated text in bold letters mean that the upper user makes a primitive call, namely Associate (to start an association), Shutdown (to gracefully terminate an association) and Abort (to abort an association).
The arrows with text in italics over them represent the control chunks sent to the peer or received from it. They go from any of the rectangles representing a state to the other host, and vice versa.
Finally, the arrows without any adjacent text represent changes in our internal state. They go from one rectangle representing a state to another one.
The diagram is colorful, and many arrows have different starting and ending colors. This is not only a matter of the author's preferences; it also helps to understand what the diagram tells us. An upper user primitive call or an incoming chunk usually triggers a state change, and quite often a chunk is sent as well, so the key to identifying which output is related to a specific input is the color. The response to a given primitive call or incoming chunk is the arrow whose starting color is the same as the ending color of the arrow representing the primitive call or incoming chunk. This applies not only to our state, but also to the answers of the other host to our outgoing control chunks. As an example, we see that the incoming INIT chunk arriving in the CLOSED state appears as an arrow ending in rose. Thus, we have to follow the rose outgoing arrow from the CLOSED state, and we will see that the response to that chunk is that we stay in the same state and send an INIT ACK chunk back to the host. Moreover, the arrow representing the INIT ACK chunk ends in pink when it reaches the host, and so its answer to the INIT ACK chunk is the outgoing arrow from the host whose initial color is also pink, that is, it sends us back a COOKIE ECHO chunk. There are two state changes, marked with the * symbol, that are not produced by any represented incoming chunk or primitive call; they occur when the peer acknowledges all the outstanding data we could have.
Figure 3-3: SCTP connection management finite state machine
[Figure not reproduced. States: CLOSED, COOKIE-WAIT, COOKIE-ECHOED, ESTABLISHED, SHUTDOWN-PENDING, SHUTDOWN-SENT, SHUTDOWN-RECEIVED, SHUTDOWN-ACK-SENT, plus the Any State placeholder. User primitive calls: ASSOCIATE, SHUTDOWN, ABORT. Control chunks: INIT, INIT ACK, COOKIE ECHO, COOKIE ACK, SHUTDOWN, SHUTDOWN ACK, SHUTDOWN COMPLETE, ABORT. Legend: user primitive call, state change, control chunk sent or received. * The state is changed and the signal is sent when there are no more outstanding DATA chunks.]
Another hint for understanding the figure is that its right part represents the actions taken when we play the active part in the establishment and termination of the association (for example when we act as a client connecting to the server and then we finish the connection), while the left part is just the opposite (when we act as a server waiting for a client to connect to us, and then the client is the one who releases the association). This rule is broken in only one case, the SHUTDOWN chunk sent by the right host (this is done to keep the diagram clear). At first sight, one could think that SCTP's finite state machine representation is quite similar to TCP's. This would not be a surprise, not only because TCP is one of the ancestors of SCTP but also because most transport protocols have roughly the same finite state machine representation. But there are big differences between them, the most important ones being these:
SCTP uses for the establishment phase a four-way handshake while TCP uses a three-way one. This has to do with the so-called cookie mechanism, used to avoid an attack similar to the known SYN attack in TCP (described in section 4.2). The establishment phase will be shown in detail in the next chapter. The termination of an association is simpler. In addition, in SCTP there is not the concept of half-open connections. This issue about termination of an association will be further commented in chapter 7.
While the initiation process has been accepted as a great improvement over TCP, the termination phase has been widely criticized due to its lack of the half-open connection concept. We will see more about these phases in their respective chapters.
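The state machine described in this section can be written down as a transition table. The following Python sketch is a simplified reading of the diagram, not a complete implementation: the event strings and the `step` function are hypothetical, timers, retransmissions and cookie validation are ignored, and the two transitions marked with * in the figure are modeled by an explicit "no outstanding data" event.

```python
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    COOKIE_WAIT = "COOKIE-WAIT"
    COOKIE_ECHOED = "COOKIE-ECHOED"
    ESTABLISHED = "ESTABLISHED"
    SHUTDOWN_PENDING = "SHUTDOWN-PENDING"
    SHUTDOWN_SENT = "SHUTDOWN-SENT"
    SHUTDOWN_RECEIVED = "SHUTDOWN-RECEIVED"
    SHUTDOWN_ACK_SENT = "SHUTDOWN-ACK-SENT"


# (current state, event) -> (next state, control chunk sent, or None)
TRANSITIONS = {
    (State.CLOSED, "ASSOCIATE"): (State.COOKIE_WAIT, "INIT"),
    (State.CLOSED, "rcv INIT"): (State.CLOSED, "INIT ACK"),
    (State.CLOSED, "rcv COOKIE ECHO"): (State.ESTABLISHED, "COOKIE ACK"),
    (State.COOKIE_WAIT, "rcv INIT ACK"): (State.COOKIE_ECHOED, "COOKIE ECHO"),
    (State.COOKIE_ECHOED, "rcv COOKIE ACK"): (State.ESTABLISHED, None),
    (State.ESTABLISHED, "SHUTDOWN"): (State.SHUTDOWN_PENDING, None),
    (State.SHUTDOWN_PENDING, "no outstanding data"): (State.SHUTDOWN_SENT, "SHUTDOWN"),
    (State.SHUTDOWN_SENT, "rcv SHUTDOWN ACK"): (State.CLOSED, "SHUTDOWN COMPLETE"),
    (State.ESTABLISHED, "rcv SHUTDOWN"): (State.SHUTDOWN_RECEIVED, None),
    (State.SHUTDOWN_RECEIVED, "no outstanding data"): (State.SHUTDOWN_ACK_SENT, "SHUTDOWN ACK"),
    (State.SHUTDOWN_ACK_SENT, "rcv SHUTDOWN COMPLETE"): (State.CLOSED, None),
}


def step(state: State, event: str):
    """Apply one event and return (next state, chunk sent or None)."""
    if event == "ABORT":            # user primitive, valid in any state
        return State.CLOSED, "ABORT"
    if event == "rcv ABORT":        # ABORT chunk from the peer, valid in any state
        return State.CLOSED, None
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.value} on {event!r}") from None
```

Note that the receipt of an INIT chunk leaves the server side in the CLOSED state: it only answers with an INIT ACK and changes state when the COOKIE ECHO arrives, which is what gives the four-way handshake its protection against the SYN-style attack.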
4. AN ASSOCIATION'S BIRTH: FROM A TWO-WAY TO A FOUR-WAY HANDSHAKE
In this chapter we will take a close look at the establishment phase in SCTP. The way SCTP sets up a new association is one of its major improvements over TCP. The establishment procedure was modified several times during the design of SCTP, and the final design is very robust (providing protection against one of the most typical attacks on TCP) while still allowing data transmission to start quickly. In the next sections we will explain in detail how SCTP associations are formed and the main advantages of this approach. We will also describe how this procedure evolved and the reasons behind the changes.
4.1 The evolution of the establishment phase
Initially, the establishment of an association was really simple. MDTP was first designed to be used for telephony signaling transport. In the telecommunications business, signaling means not only call establishment and shutdown but also billing, and when money comes into play no precaution is ever enough. So it was taken for granted that MDTP would be used inside the private IP networks of telecommunication companies, without any connection to the Internet, and little effort was put into avoiding attacks, as no hacker would be able to get into the network (and if you have attackers inside your own company then you really have a problem). Moreover, those IP networks would be properly engineered, meaning that they would have more bandwidth than they were expected to need. So they would never be congested and would not lose any message, making IP almost reliable (although failures that make routers misbehave could still happen in the network). In addition, the main objective was to make the establishment phase as fast as possible, so that the hosts involved in the association could start sending data with the least possible delay. As a result, MDTP used the simple two-way handshake connection algorithm shown in Figure 4-1.
Figure 4-1: Establishment procedure in MDTP
[Figure 4-1 shows the two MDTP datagrams of the handshake. The first one, sent by the initiator, carries Sequence Number = Tag A and Acknowledgement Number = 0. The second one, sent by the initiatee, carries Sequence Number = Tag B and Acknowledgement Number = Tag A.]
In the environment we were dealing with, it was supposed that packets were not lost in the network. The fastest way to start an association is then simply to send data packets to the peer endpoint with whom the association is to be started, allowing the receiver to send data back as soon as it receives the first data packet. This would mean no delay at all, but this design could cause some problems because one cannot completely
trust an unreliable network (one should not even completely trust a reliable one). Among other reasons, if a router fails there may still be, for example, old delayed packets inside the network. If the receiver of one of those packets immediately opens a connection, it will likely start waiting for more data that will never come, and that connection will stay open forever (or at least until some other mechanism gets rid of it), wasting resources. So, before properly opening a connection, one should ask the sender of the packet to confirm that it really wants to open it (this is exactly one of the reasons for the existence of the three-way handshake in TCP). Finally, it was decided that a three-way handshake was not necessary, but that there should be at least a minimal establishment phase. As can be seen in Figure 4-1, the procedure used to open a new association was quite simple. The initiator of the association sent an MDTP datagram with several flags set to indicate that a new association was to be established. It also used the Sequence Number field of the MDTP datagram (see Figure 3-1) to send a key number to the initiatee. That number had to be sent back inside the Acknowledgement Number field of a datagram in which the initiatee also sent its own key number in the Sequence Number field. This is exactly what TCP does; even the names of the header fields are the same. The difference is that, once this was done, the MDTP association was open, and there was no need to acknowledge the receipt of the second message. The initiatee could start sending data from the moment it sent the answer to the initiator, who could start sending data right after receiving that answer.
That made the initiatee able to send data two round-trip times earlier than it could in a normal TCP connection (actually, it would be perfectly legal in TCP to send data inside the three initial segments, but this is never done because, with the standard sockets interface defined for TCP, the upper user must first open the connection before it can send any data). As explained in section 3.1.1, the Data field of MDTP was also used to exchange other kinds of information, such as the receiver buffer size, the number of streams (see section 5.3) or the valid IP addresses that could be used in the association. So MDTP was not allowed to include data in the datagrams sent during the establishment phase. However, the more the protocol evolved, the clearer it became that it should not be restricted to signaling transport. The designers were focusing their efforts on creating a protocol that could some day even compete with TCP in its present role as the main Internet transport protocol. As a consequence, they started to see external attackers as a real threat, and so they could not use this initiation scheme anymore.
4.2 Cookies against the attackers
This initiation procedure was kept until the last version of MDTP. During the IETF meeting in Oslo in July 1999, a new and revolutionary establishment phase started to be sketched. A short time later, at a designers' meeting in Santa Clara, Randall R. Stewart explained the main idea of this new mechanism (that almost mythical moment is remembered as the birth of a cookie). It completely removed the problem of the so-called SYN attack in TCP. This attack is very simple and can affect any system connected to the Internet that provides TCP-based network services (such as an HTTP, FTP or mail server). There is a very good description of this attack in [CER1996]. Let us see briefly how this basic attack is performed. In TCP, the connection phase consists of a three-way handshake, the first two legs being exactly the same as they were in MDTP (Figure 4-1).
The third one is simply the acknowledgement of the second message. These three packets are usually called SYN (from Synchronization, as it has the SYN flag set, which is used only during the establishment), SYN-ACK (it has both the SYN and ACK flags set) and ACK (a simple acknowledgement message with the ACK flag set). The problem is that the receiver of the SYN not only sends back the SYN-ACK but also keeps some information about the packet received while it waits for the ACK message (a server in this state is said to have a half-open connection). The memory used to keep the information about all pending connections is of finite size, and it can be exhausted by intentionally creating too many half-open connections. This makes the attacked system unable to accept any new incoming connections and thus causes a denial of service to other users wanting to connect to the server. A timer removes half-open connections from memory when they have been in this state for too long, which will eventually let the system recover, but nothing will change if the attacker continues sending SYN messages. There is no generally accepted solution to this attack. Packet filtering, that is, discarding the IP datagrams coming from the attacker, would solve the problem if the attacker were not able to forge its IP source address, something commonly referred to as IP spoofing [CER1995]. This practice, which not only makes packet filtering useless but is also very effective in hiding the identity of the attacking machines, is trivial under any of the various UNIX-like operating systems. Fortunately, in a fluke of laziness (or good judgement?) that has saved the Internet from untold levels of disaster, Microsoft's engineers never fully implemented the complete UNIX Sockets specification in any version of Windows prior to Windows 2000.
As a consequence, Windows machines, which are the most widespread among Internet users, have been blessedly limited in their ability to generate deliberately invalid Internet packets (compared to UNIX machines). It is impossible for an application running under any version of Windows 3.x/95/98/ME or NT to spoof its source IP address or to generate malicious TCP packets such as the ones used to produce SYN floods (see footnote 25). The attack itself works as represented in Figure 4-2 below:
Figure 4-2: SYN attack in TCP
[Figure 4-2 shows the attacker sending SYN segments with fake IP source addresses A, B, ..., Z to the attacked server, which answers with SYN-ACK segments sent to IP destination addresses A, B, ..., Z.]
25 For the interested reader, [Gib2001] gives a very good account of a SYN attack that caused a denial of service at the Gibson Research Corporation.
As we see, the attacker uses IP spoofing, which makes it unable to receive the SYN-ACK segments produced; this is not a problem, since it will never answer them. All those SYN-ACK segments will be lost unless there is a host with a TCP service listening on the port and address used as the source of the SYN segment. In that case that host will answer with a segment carrying the RST (from Reset) flag set, and the attacked system will delete the information for that specific half-open connection. It seems that with the release of the latest versions of Windows (Windows 2000 and Windows XP), which give access to raw IP sockets and allow the programmer to modify the whole IP header, this kind of basic attack could become much more common. So servers relying on TCP as their transport protocol could be in danger. Or maybe not, but in any case SCTP, with its cookie mechanism, gives this kind of attack no chance of success. When the designers of SCTP started to think about how to deal with SYN flooding, they quickly saw that two things were necessary in order not to create a new transport protocol with this same weakness:
The server (the initiatee of a new association) should not use even a byte of memory until the association is completely established.
There must be a way to recognize that the client (the initiator of the association) is using its real IP address.
Usually, to meet the second requirement, the server sends some kind of key number to the client, who will only receive it if the source address used in its IP datagram is the real one. Once the client has that information, it can send a confirmation to the server using that key number, thus proving that it was telling the truth. This means that the server also needs to save that key number somewhere, so that it can verify that the number returned is the right one. But then the server is forced to store that value and use some memory while waiting for an answer that might never come. Therefore, the idea was: instead of storing that information in our system, why not make it stay all the time in the network or in the client's memory? Of course, one immediately thinks that if a datagram coming from the client is going to provide the very information we check the client's answer against, we have only made the situation worse. The client could tell us whatever it wants and then completely open an association by sending us a single message. But this is not necessarily true if we manage to convert the two problems into another one: the server has to sign the information sent to the client with a secret key. So, when it receives that information back from the client, it can recognize, thanks to the signature and the secret key, that it did send exactly that information and that it is unmodified, and so it can be as confident in it as if it had never left the server's buffers. And that is the cookie mechanism. Apparently (and truly) so simple, but at the same time so powerful against the flooding attack described above. In any case, the mechanism was basically the same as the one used in Photuris (a session-key management protocol specified in [Kar1999]).
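The idea of a signed cookie can be illustrated with a short sketch. It is only an illustration under stated assumptions: the signature is an HMAC-SHA256, the cookie carries a timestamp with a 60-second lifetime, and the names make_cookie, check_cookie, SECRET_KEY and COOKIE_LIFETIME are hypothetical. None of these details is taken from the SCTP specification. The point shown is that the server keeps only its secret key, while everything specific to the pending association travels inside the cookie.

```python
import hashlib
import hmac
import os
import struct
import time
from typing import Optional, Tuple

# Hypothetical server-wide secret; the server stores nothing per client.
SECRET_KEY = os.urandom(32)
COOKIE_LIFETIME = 60.0  # seconds, an assumed value for this sketch
MAC_LEN = hashlib.sha256().digest_size
HEADER = struct.Struct("!dII")  # timestamp, client tag, server tag


def make_cookie(client_addr: str, client_tag: int, server_tag: int,
                now: Optional[float] = None) -> bytes:
    """Pack the association data and sign it with the server's secret key."""
    timestamp = time.time() if now is None else now
    payload = HEADER.pack(timestamp, client_tag, server_tag) + client_addr.encode()
    mac = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return mac + payload


def check_cookie(cookie: bytes, client_addr: str,
                 now: Optional[float] = None) -> Optional[Tuple[int, int]]:
    """Return (client_tag, server_tag) if the cookie is authentic, else None."""
    current = time.time() if now is None else now
    if len(cookie) < MAC_LEN + HEADER.size:
        return None
    mac, payload = cookie[:MAC_LEN], cookie[MAC_LEN:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        return None  # forged or modified cookie
    timestamp, client_tag, server_tag = HEADER.unpack(payload[:HEADER.size])
    if payload[HEADER.size:] != client_addr.encode():
        return None  # cookie returned from a different address
    if not 0 <= current - timestamp <= COOKIE_LIFETIME:
        return None  # cookie too old (or from the future)
    return client_tag, server_tag
```

Only when check_cookie succeeds would the server allocate memory for the association, which is how both requirements listed above are met at once.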
4.3 The first two legs: The INIT and the INIT ACK chunks
So, let us look at the establishment phase in SCTP, represented in Figure 4-3, where the datagrams exchanged in the first two legs of the four-way handshake are augmented to see their internal structure:
Figure 4-3: Establishment phase in SCTP (first two legs)
As we can see the client first sends a datagram to the server containing the INIT chunk, and the server answers sending back an INIT ACK chunk. These two chunks are very similar and apart from the Chunk Type (which is set to 1 in the INIT chunk and 2 in the INIT ACK chunk), Chunk Flags (which are not used and are reserved for future use) and Chunk Length fields, they carry the following information:
The Initiate Tag in the INIT chunk plays the same role as the Sequence Number field of the TCP header. It matches the INIT chunk sent to the server (equivalent in this case to the SYN segment) with the expected INIT ACK chunk (the counterpart of the SYN-ACK segment). The big difference is that in TCP only this first exchange is protected with this key number, while in SCTP the number is kept and all the datagrams exchanged during the whole life of an association must be tagged with it. The randomly chosen value contained in the Initiate Tag field will be included in the Verification Tag field of the common header of the datagrams sent by the server as a validity check: the client will never accept a datagram coming from the server if its Verification Tag is not set to the right value (except for some special cases explained in chapter 7).
[Figure 4-3 shows the four datagrams of the handshake, each with a common header made of Source Port Number, Destination Port Number, Verification Tag and Checksum. The datagram with the INIT chunk (Chunk Type = 1) has Verification Tag = 0 and carries Initiate Tag = Tag A, the Advertised Receiver Window Credit, the Number of Outbound Streams, the Number of Inbound Streams, the Initial TSN and the parameters. The datagram with the INIT ACK chunk (Chunk Type = 2) has Verification Tag = Tag A and carries Initiate Tag = Tag Z, the same fields, and the State Cookie plus other parameters. The third datagram (Chunk Type = 10) has Verification Tag = Tag Z and carries the received cookie; the fourth (Chunk Type = 11) has Verification Tag = Tag A.]
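The way the Initiate Tag of one endpoint becomes the Verification Tag used by the other can be sketched in a few lines. This is a toy model with hypothetical names (Endpoint, new_initiate_tag); it leaves out the cookie, the chunks and all other state, and it simply assumes that a tag is a random non-zero 32-bit number, zero being the value carried by the datagram with the INIT chunk.

```python
import secrets


def new_initiate_tag() -> int:
    """Pick a random non-zero 32-bit tag (zero is used only with the INIT)."""
    while True:
        tag = secrets.randbits(32)
        if tag != 0:
            return tag


class Endpoint:
    """Toy endpoint that tracks only the two tags of one association."""

    def __init__(self) -> None:
        self.my_tag = new_initiate_tag()  # sent in our Initiate Tag field
        self.peer_tag = 0                 # learned from the peer's Initiate Tag

    def learn_peer_tag(self, initiate_tag: int) -> None:
        self.peer_tag = initiate_tag

    def outgoing_verification_tag(self) -> int:
        # Datagrams we send carry the tag chosen by the peer.
        return self.peer_tag

    def accepts(self, verification_tag: int) -> bool:
        # Datagrams we receive must carry the tag we chose ourselves.
        return verification_tag == self.my_tag


client, server = Endpoint(), Endpoint()
server.learn_peer_tag(client.my_tag)   # taken from the INIT chunk
client.learn_peer_tag(server.my_tag)   # taken from the INIT ACK chunk
print(client.accepts(server.outgoing_verification_tag()))
```

A datagram built by someone who has not seen either chunk would have to guess a 32-bit value for accepts to return True.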
The Verification Tag field was included to avoid blind attacks (see footnote 26). In our case, a blind attacker would not know the value of the Verification Tag, and so its datagrams would be rejected by the receiver. With the 32-bit Verification Tag the risk of blind attacks is drastically reduced, as the attacker would need to send on average 2^31 datagrams before one of them is accepted. This would take a very long time, and long before that many datagrams with a wrong Verification Tag had arrived some alarm bells should already be ringing. However, for stronger protection against attacks one should use the procedures defined in [Ken1998a], which involve a tradeoff between the security of the association and the time spent processing the datagrams. Of course, that average of 2^31 attempts will only hold if we make the Verification Tag as random as possible. Otherwise, as the Verification Tag is a basic defense against blind attacks, there is the possibility of suffering attacks similar to the so-called Sequence Number Attack described in [Bel1996], whose power lies in the possibility of guessing the value of a new pseudorandom number if the attacker knows the ones generated during a short period of time. Random numbers are hard to produce in a computer, but the hints given in [Eas1994] can help to achieve the desired level of randomness. As the datagram containing the INIT chunk is the first one of an association, its Verification Tag field is set to zero. As seen in Figure 4-3, the datagram containing the INIT ACK chunk already uses the Initiate Tag included in the INIT chunk received. The INIT ACK chunk itself also contains an Initiate Tag that the client will use as the Verification Tag of its subsequent datagrams directed to the server.
The Advertised Receiver Window Credit tells the receiver of the chunk how much buffer space, in bytes, the sender of the chunk has reserved to store incoming data.
This field changed a lot during the evolution of MDTP and SCTP (already discussed in section 3.1). In the first version of SCTP, the In Queue field of the MDTP header evolved into a new field, called the Receiver Window Credit, which was included both in the INIT and in the INIT ACK chunks. This information told the receiver of the INIT or INIT ACK how many outstanding messages it could have. But this did not help much: the information was still given in number of messages instead of number of bytes, and it was still related to the number of outstanding data messages and not to the real state of the receiver's buffer. In the next version it was changed to express the value in bytes instead of messages, but the conceptual error was still there, as there was no direct link between the outstanding bytes and the buffer space at the receiver. Finally, the mistake was fixed in the 6th version of SCTP with the Advertised Receiver Window Credit field, also included in the acknowledgement chunks, which allows the data sender to track the buffer space at the receiver side. This was somehow going back to the roots, as the Window field of TCP's header performs exactly this same function. The main difference is that the Window field is 16 bits long, while the Advertised Receiver Window Credit uses 32 bits. When the Receiver Window Credit field was first used in SCTP it was 16 bits long as well. But when it was changed to express the value in octets instead of messages, it was immediately upgraded to 32 bits to avoid a problem that TCP has with its Window field.
26 In a blind attack the attacker is not able to read datagrams that are not directed to it, and it does not have access to the data exchanged between the peers involved in an association.
TCP has a 16-bit Window field that can report at most 64 Kbytes. That quantity, while enough when TCP was designed, quickly became too small. As described in [Jac1992], TCP performance depends not upon the transfer rate itself, but rather upon the product of the transfer rate and the round-trip delay. This Bandwidth-Delay product measures the amount of data that has already been sent but has not yet reached its destination (the bits that are still on the way). It is the buffer space required at the receiver to obtain maximum throughput on the TCP connection over the path, i.e., the amount of unacknowledged data that TCP must handle in order to keep the pipeline full. As networks evolve into Gigabit networks, TCP's small Window field causes performance problems, especially on long-distance connections. Let us consider, for example, that we are transmitting data from Madrid to Helsinki, that the receiver has a buffer of 64 Kbytes, and that the link used transmits at 1 Gbps through fiber. The example is represented in Figure 4-4:
Figure 4-4: Transmission of 64 kilobytes from Madrid to Helsinki
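The rough numbers of this Madrid to Helsinki example can be reproduced with a short script. It is a minimal sketch, assuming that the sender stays idle until the first acknowledgement returns, that header overhead and processing time are ignored, and that the window is exactly 64 x 1024 bytes. The function and constant names are illustrative only.

```python
def window_limited_throughput(window_bytes: int, rtt_s: float) -> float:
    """Bits per second achievable when one window is sent per round trip."""
    return window_bytes * 8 / rtt_s


def bandwidth_delay_product(link_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the link busy for a round trip."""
    return link_bps * rtt_s / 8


LINK_BPS = 1e9               # 1 Gbps link
DISTANCE_KM = 3_000          # distance between the two cities in the example
FIBER_SPEED_KM_S = 200_000   # propagation speed in fiber used in the example
WINDOW_BYTES = 64 * 1024     # window advertised through a 16-bit field

one_way_s = DISTANCE_KM / FIBER_SPEED_KM_S
rtt_s = 2 * one_way_s
send_time_s = WINDOW_BYTES * 8 / LINK_BPS
throughput = window_limited_throughput(WINDOW_BYTES, rtt_s)
needed_window = bandwidth_delay_product(LINK_BPS, rtt_s)

print(f"time to send one window: {send_time_s * 1e6:.0f} microseconds")
print(f"round-trip time:         {rtt_s * 1e3:.0f} ms")
print(f"effective throughput:    {throughput / 1e6:.1f} Mbps")
print(f"window needed:           {needed_window / 1e6:.2f} Mbytes")
```

Under these assumptions the window needed to keep the link full is several megabytes, far beyond what a 16-bit field can express.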
In the example, Figure 4-4 (a) shows the initial state, just before the host in Madrid starts sending data. Let us make some rough calculations that will show how big the problem can be. The 64 Kbytes (524,288 bits) of the Window are sent in about 500 µs, as shown in Figure 4-4 (b). If we consider that the speed of light inside the fiber is about 200,000 km/s, the datagrams will take about 15 ms to cover the 3,000 km between the two cities (Figure 4-4 (c) shows the moment when the first datagram sent reaches Helsinki). At that moment the acknowledgements start to be sent, and they reach Madrid 15 ms later, as
represented in Figure 4-4 (d). The arrival of those acknowledgements at the data sender allows it to send more data. Meanwhile, it must stay idle waiting for the answer, and so it is sending data at the 1 Gbps rate only during 500 µs every 30 ms. This is less than 2% of the time, converting our excellent 1 Gbps link into a poor link of less than 20 Mbps. Fortunately, this problem was solved in [Jac1992] with a new Window Scale option in TCP that allows the Window field to be shifted up to 14 bits to the left, thus allowing windows of up to 2^30 bytes. SCTP will not suffer this problem for a very long time, as the Advertised Receiver Window Credit can describe a buffer of up to 4 Gbytes.
The Number of Outbound Streams and Number of Inbound Streams are used to negotiate the number of streams (see footnote 27) used in the association by each endpoint. Every SCTP association is composed of at least one outbound stream going from each host. So, every host has at least an outbound stream to send data to the other host, and an inbound stream to receive data from the other host. During the initialization phase, the client sends inside the INIT chunk the information about how many inbound streams it is willing to accept and how many outbound streams it would like to open. The server also includes this information in its answer, and for each direction the smaller of the number of outgoing streams requested and the number of incoming streams the peer can manage is chosen. The streams feature first appeared in the 6th version of MDTP (streams were first called flows, and the name was changed to avoid confusion, as that term was used for other purposes in other protocols somehow related to MDTP). It was a compulsory feature, but it added complexity to the protocol, so it was decided that it should be optional in the next revision of MDTP.
However, as streams were a convenient remedy for head-of-line (HOL) blocking (as explained in section 5.3), they became compulsory again in the following version, as they are in SCTP. In the 6th version of MDTP the streams had to be opened one at a time, using a special stream opening procedure. When opening a large number of streams, this procedure was long and inconvenient. So the last version of MDTP included the possibility of opening several streams during the establishment phase. When SCTP came into play with its extensibility, only this initial opening was kept, and the possibility of opening and closing streams during the life of the association was completely removed. This was done because, if it is proven in the future that this feature is desired, one can always make an easy extension to SCTP to provide it. Meanwhile, it is better not to add features to the protocol that may never be used. In the early stage of design, stream 0 was reserved for control purposes. This was elegant in a way, but a few weeks were enough for the designers to realize that streams were only related to upper user data transmission, and thus it was at least paradoxical to use one for control purposes (it should be the upper user who specifies the use of the different streams). Anyway, stream 0 kept its special status for a while, as it was always implicitly open when an association was established (presently, stream 0 must be explicitly and compulsorily open).
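The negotiation of the number of streams described above reduces to taking a minimum in each direction. The following sketch shows this rule with hypothetical names (StreamOffer, negotiate); the check that every count is at least one reflects the statement that each host has at least one outbound and one inbound stream.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StreamOffer:
    """Stream counts carried in an INIT or INIT ACK chunk."""
    outbound: int  # streams the sender of the chunk would like to open
    inbound: int   # streams the sender of the chunk is willing to accept


def negotiate(local: StreamOffer, peer: StreamOffer) -> Tuple[int, int]:
    """Return (streams we send on, streams we receive on) for the local side."""
    counts = (local.outbound, local.inbound, peer.outbound, peer.inbound)
    if min(counts) < 1:
        raise ValueError("each endpoint needs at least one stream per direction")
    send = min(local.outbound, peer.inbound)     # limited by what the peer accepts
    receive = min(peer.outbound, local.inbound)  # limited by what we accept
    return send, receive


client = StreamOffer(outbound=10, inbound=5)
server = StreamOffer(outbound=8, inbound=4)
print(negotiate(client, server))  # client's view of the association
print(negotiate(server, client))  # server's view of the association
```

Both endpoints reach consistent results without any further message, since each one applies the same rule to the same two offers.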
27 The term stream is used in SCTP to refer to a sequence of user messages that are to be delivered to the upper-layer protocol in order with respect to other messages within the same stream. This is in contrast to its usage in TCP, where it refers to a sequence of bytes. The use of the streams will be further explained later in section 5.3.
The value of the first Transmission Sequence Number (TSN) must be included in the Initial TSN field of both the INIT and INIT ACK chunks. The TSN is a number included in every DATA chunk to allow the receiving SCTP endpoint to acknowledge its receipt and detect duplicate deliveries (thus with a functionality equivalent to that of the Sequence Number in TCP). The Initial TSN is simply the value of the TSN that the INIT or INIT ACK sender will include in the first DATA chunk it sends. It is usually set to the same value as the Verification Tag. The last part of the INIT and INIT ACK chunks contains parameters. We will deal with them in the next sections.
When the client sends the INIT chunk requesting the establishment of the association, it creates the Transmission Control Block (TCB), a data structure that keeps the information needed to manage that association. The TCB will be used during the whole lifetime of the association, keeping the information about timers, received and sent TSNs, and all the data necessary to keep the association up and running. It is important to note that the server will not create the TCB until it receives the answer to the INIT ACK it sent.
4.3.1 The parameters
All the parameters defined in the basic SCTP specifications are meant to be used during the first two legs of the establishment phase. Thus, only the INIT and INIT ACK chunks are able to carry parameters so far. The ERROR and ABORT chunks can carry error causes, which are syntactically the same, but with different semantics (more about this in sections 6.2 and 7.2). Also the HEARTBEAT and HEARTBEAT ACK chunks can carry a similar TLV structure, but its internal structure is implementation-specific (see section 6.1). Some SCTP extensions use new parameters, but they have not been standardized yet (as discussed in chapter 8). All the INIT and INIT ACK parameters that appear in the SCTP specifications are discussed in the next sections.
4.3.1.1 What is your address?
The IP Address parameters in the INIT chunk list the valid IP addresses that the client will use as the source of its datagrams and that the server can use as the destination of its datagrams (and vice versa in the INIT ACK chunk). Unlike TCP, an SCTP association can take advantage of a multihomed host by using all the IP addresses the host owns. This feature is one of the most important ones in SCTP, as it gives some network redundancy that is really valuable when dealing with telephony signaling. As seen in section 2.2.1, in the SS7 world everything is duplicated, and the idea of losing a TCP connection due to the failure of one of the network cards was one of the major problems that made SCTP necessary. Initially, multihoming was also used for load sharing. The idea was to use the available destination addresses in a round-robin fashion, sending 1/n of the traffic to each of the n available destinations and thus avoiding congestion. This idea was quickly discarded, as SS7 links are engineered to be loaded to at most 40% of their capacity, and so they should never be congested. Moreover, selecting the destination address in a round-robin fashion actually means that if any of the network cards suddenly stops working, the association will not be lost but there will be undesired retransmissions, delaying the transmission of the information (so we are
multiplying by a factor of n the probability of suffering some kind of network failure). Even more, every change in the address used would likely cause datagrams to arrive out of order at the receiver. This is generally not a nice thing, as it produces extra buffer consumption, the sending of more acknowledgements and even retransmission of packets (more about this in section 5.2). So, in SCTP only one address is used, the Primary Address, while the rest are left as a backup in case the Primary Address becomes unavailable. Another discarded idea regarding the use of multihoming was to send duplicates of the datagrams to all the destination addresses. This idea was abandoned since it would multiply the load by the number of destination addresses, and there were doubts about the gain that it could provide. At the beginning, only IPv4 addresses were considered. This was a very shortsighted design that fortunately was modified in the first version of SCTP after the Oslo IETF meeting in July 1999. However, this was not the last addition to the IP Address parameter suite. Listing inside the body of the SCTP datagrams the addresses that are going to be used, instead of only using the one that appears as the source address in the IP header, causes some operational problems when dealing with a Network Address Translator (NAT) [Sri2001]. NATs are a special kind of router created as a short-term solution to IPv4 address depletion. The 32-bit field for IPv4 addresses yields a total of 4,294,967,296 addresses. This quantity would be enough to give an address to most of the people in the world. However, only about 20 to 30% of those addresses can be used if routing is to remain efficient enough (which requires a hierarchy in the address allocation). NATs are a lesser evil that is lasting longer than expected (IPv6 with its 128-bit addresses is the long-term solution that will make NATs obsolete).
The arguments in favor of and against NATs frequently take on religious tones, with each side passionate about its position. The author sides with those against the use of NATs. NATs are always placed at the borders of stub domains (see footnote 28), and they take advantage of the fact that only a small percentage of the hosts in a stub domain are communicating outside of the domain at any given time (indeed, many hosts never communicate outside of their stub domain). Because of this, only a subset of the IP addresses inside a stub domain needs to be translated into globally unique IP addresses when outside communication is required. Meanwhile, the addresses used inside the domain can be reused in several different stub domains. So, one globally unique Class C IPv4 network (more about IPv4 network classes in [Tan1996, section 5.5.2]) can be used by more than 254 hosts in the whole world (usually one of them acting as a router). The basic operation of a NAT is shown in Figure 4-5. The figure represents two stub domains, each of them having a NAT that connects a LAN to the Internet. Let us call Stub A the one that externally uses the globally unique Class C IPv4 address block 195.217.176.0/24 and is connected to the Internet through the NAT with IPv4 address 195.217.176.1 (the one on the left of Figure 4-5). Let us call Stub Z the one that externally uses the globally unique Class C IPv4 address block 195.17.34.0/24 and is connected to the Internet through a NAT whose IPv4 address is 195.17.34.1 (on the right in the figure). As we see, both NATs use globally unique Class C IPv4 addresses, while both stub domains use Class A IPv4 network addresses inside their domains (both use network 10.0.0.0/8). This kind of network can contain more than 16 million hosts. However, these Class A addresses can be used only inside the domain itself, as they are not globally unique.
28 A stub domain is a domain, such as a corporate network, that only handles traffic originated by or destined to hosts in the domain.
Figure 4-5: Basic NAT operation
In Figure 4-5 the NAT operation is explained with an example of an IPv4 packet traversing two NATs (a situation which is normally referred to as Twice NAT). A host in Stub A, internally represented as 10.114.206.48, sends a packet to the IPv4 address 195.17.34.9 (the destination address is known using an Application Level Gateway (ALG) [Sri1999] that returns the right answer to a DNS query, but we will not discuss that earlier phase). The IPv4 packet will have 10.114.206.48 as its source address and 195.17.34.9 as its destination address. When that IPv4 packet reaches the NAT router, it translates the source address, changing it from 10.114.206.48 to 195.217.176.131. In fact, the router could have chosen any address of its Class C network (from 195.217.176.2 to 195.217.176.254) that is not being used at that moment by any host inside Stub A for external communications (that is, an address that is not part of any connection between a host in Stub A and any other host outside the stub domain). Then, in case the IPv4 datagram is the first datagram of a connection (for example, if it is a TCP SYN segment), the NAT reserves that address (195.217.176.131) as the address used by host 10.114.206.48 outside Stub A. So, IPv4 packets that arrive at the router of Stub A and that are destined to 195.217.176.131 will be internally sent to host 10.114.206.48. This change means, for example, that the IPv4 Header Checksum has to be recalculated, and depending on the type of information carried by the IPv4 packet some more changes have to be made (for example, if it carries a TCP segment, the TCP Checksum will also be recalculated, or if it carries an ICMP message, the source address of the IPv4 header inside the ICMP message will also have to be modified). The IPv4 packet is then sent to the Internet, where it will be routed to the NAT at the border of Stub Z (195.17.34.1).
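The address translation walked through above can be sketched as a pair of lookup tables. This is only an illustrative model that assumes a one-to-one address mapping: the class `BasicNat` and its methods are hypothetical names, the static binding in Stub Z stands in for the internal table drawn in Figure 4-5, and a real NAT would also recalculate the checksums mentioned above.

```python
from ipaddress import IPv4Address


class BasicNat:
    """Toy one-to-one NAT for a stub domain (hypothetical, illustrative only)."""

    def __init__(self, external_pool):
        self.free = [IPv4Address(a) for a in external_pool]
        self.out_map = {}   # internal address -> external address
        self.in_map = {}    # external address -> internal address

    def bind(self, internal, external):
        """Install a fixed binding, e.g. for a host reachable from outside."""
        internal, external = IPv4Address(internal), IPv4Address(external)
        self.out_map[internal] = external
        self.in_map[external] = internal

    def outbound(self, src, dst):
        """Rewrite the source address of a packet leaving the stub domain."""
        src = IPv4Address(src)
        if src not in self.out_map:
            if not self.free:
                raise RuntimeError("no free external address")
            self.bind(src, self.free.pop(0))   # reserve an unused address
        return self.out_map[src], IPv4Address(dst)

    def inbound(self, src, dst):
        """Rewrite the destination address of a packet entering the domain."""
        return IPv4Address(src), self.in_map[IPv4Address(dst)]


# Addresses taken from Figure 4-5.
nat_a = BasicNat(["195.217.176.131"])
nat_z = BasicNat([])
nat_z.bind("10.170.8.92", "195.17.34.9")

hop1 = nat_a.outbound("10.114.206.48", "195.17.34.9")   # leaving Stub A
hop2 = nat_z.inbound(*hop1)                              # entering Stub Z
reply = nat_a.inbound(*nat_z.outbound("10.170.8.92", hop2[0]))
```

Note that only the addresses in the IP header are rewritten here; any address carried inside the payload would pass through untouched, which is exactly the difficulty that address parameters inside SCTP chunks run into.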
The NAT of Stub Z will use an internal table to know that packets directed to 195.17.34.9 must actually be sent to host 10.170.8.92, so it will make the necessary changes to the incoming packet and forward it, so that finally the right host receives the packet. The answer to that packet will undergo the same changes on its way back to host 10.114.206.48, but in the reverse order.
As we see, NATs make it possible to have more than 254 hosts while using a Class C network. This is helpful as the 16,382 Class B networks (with space for up to 65,534 hosts each) were almost exhausted, used by companies that normally have more than 254 hosts in their network but far fewer than 65,534, thus wasting lots of IPv4 addresses. This feature cannot always be provided transparently to the hosts because the NAT does not always have all the necessary information to translate the addresses (especially in issues related to security, where IPv4 packets carry things such as digital signatures). In any case the solution has the disadvantage of breaking the End-to-End (E2E) principle (see footnote 29) inside an IP network, and making up for it with increased state in the network.
SCTP is also affected by the existence of NATs as, due to its multihoming capabilities, the addresses used in the association are included inside parameters in the INIT and INIT ACK chunks. If those addresses were not translated as well, the receiver of the INIT chunk would mistakenly use those non globally unique addresses. This problem forced one of the next three solutions:
- Updating the software of existing NATs to look inside the SCTP datagrams and determine whether some modifications should be made to their content, translating the addresses carried inside parameters.
- Not including any IP Address parameter (i.e., not using multihoming) if there is any NAT in between.
- Waiting until IPv6 is deployed, so that there is no longer any need for NATs and we do not have to worry about them.
As NATs are widely used and SCTP was expected to be used before IPv6 becomes a reality in the Internet, none of these solutions could be seriously considered. So, after some debate on the mailing list, a solution to the NAT traversal problem was found thanks to the extensibility features of SCTP, and in its 9th release the Host Name Address parameter was included. This parameter simply includes the host name of the sender of the INIT or INIT ACK chunk, so the receiver can make the DNS query and the NATs do not need to look inside SCTP datagrams, which makes the whole operation with NATs easier. However, the idea was almost discarded as it brought some potential security problems (regarding the need for a DNS query) that were finally fixed.
The parameters also carry information about the Supported Address Types. This parameter was introduced at the same time as the Host Name Address parameter. The problem it solves is that, if the INIT receiver wants to send us a Host Name Address parameter and we are not able to resolve that kind of address, we will not even be able to answer the INIT ACK chunk, and there will be no way to establish the association. Telling the peer that we do not support Host Name Address parameters avoids this situation (provided that, apart from host names, the peer can also send us IPv4 Address and/or IPv6 Address parameters). Of course, this parameter is only useful inside the INIT chunk.
29 The so-called End-to-End principle notes that certain functions can only be performed in the endpoints, thus they are in control of the communication, and the network should be a simple datagram service that moves bits between these points. This improves network reliability. A discussion about this model is held in [Car2000].
The multihoming features of SCTP also pose some problems for the use of the IP Security Protocol (IPsec) ([Ken1998a] defines the whole architecture of IPsec, and all the encryption and authentication algorithms, key management and security protocols are specified in RFCs 2402 to 2412). This is because the whole IPsec model was designed with connections in mind that did not make use of multihoming. Every source-destination pair of addresses has to use a single key that must first be securely exchanged using a protocol such as the Internet Key Exchange (IKE) [Har1998]. So, even if it is possible to create and exchange a key for every source-destination pair of IP addresses, when the number of IP addresses used by the endpoints is large, the whole process of maintaining all those security associations becomes clumsy. The work regarding the use of SCTP with IPsec is published in [Bel2001].
During the early stage of SCTP design, there existed the implicit feature of starting an association with an endpoint on behalf of another one. There were quite a few security implications if this was allowed (and few reasons to do it). So it was finally forbidden, and the source address of the SCTP datagram carrying the INIT or INIT ACK chunk is always part of the association (unless a Host Name Address parameter is used, but in that case the resulting INIT ACK will be discarded if it is not directed to the INIT sender). There is one Internet-Draft, [Coe2001], which compiles the issues raised by SCTP in regard to multihoming on the Internet.
4.3.1.2 The king of the parameters: The State Cookie
The INIT ACK chunk always carries a special parameter, the State Cookie (normally referred to simply as the Cookie). This is the parameter that makes it possible to get rid of attacks similar to the SYN attack used against TCP and shown in section 4.2. It was included in the first version of SCTP, being the basis of the whole establishment phase. It does not really have any prescribed internal structure, as it must be transparently echoed by the receiver of the INIT ACK chunk, for whom the Cookie is meaningless. However, as the intention of the Cookie is to move to the network and the client the task of storing the information needed to open the association when the Cookie is echoed to its sender, there must be a method to validate that it remained unmodified during its round trip through the network. So it is highly recommended to include a Message Authentication Code (MAC) in the Cookie. The currently recommended MAC is the Keyed-Hashing algorithm for Message Authentication (HMAC) described in [Kra1997]. HMAC makes use of any iterative cryptographic hash function such as Message Digest 5 (MD5) [Riv1992] or Secure Hash Algorithm 1 (SHA-1) [NBS1995] (which are the two most widely used cryptographic hash functions nowadays), in combination with a secret key. Thus, the INIT ACK sender should calculate the HMAC of the Cookie using a secret key that is not known by anybody else (and that should be changed every now and then). When the Cookie is echoed back and received by the INIT ACK sender, it should recalculate the HMAC of the bytes of the Cookie, using again its secret key. If the result is the same as the one contained in the Cookie, it means that nobody modified it (or that a wise attacker somehow guessed the secret key).
During the first releases of SCTP, it was suggested that MD5 should be used for the HMAC. Later on, that suggestion was withdrawn due to the fact that MD5 is considered a weak cryptographic function nowadays, as explained next.
The strength of any one-way hash function is defined by how well it can randomize an arbitrary message and produce a unique output. One might think that it would take on the order of $2^m$ operations to subvert an m-bit message digest, but in fact, $2^{m/2}$ will often do using the Birthday Attack (see footnote 30) [Yuv1979]. It can be proven mathematically that if some function, when supplied with a random input, returns one of k equally likely values, then by repeatedly evaluating the function for different inputs, we expect to obtain the same output after about $1.2\,k^{1/2}$ evaluations. MD5 generates a digest of 128 bits, so it would be expected that about $2^{64}$ messages would have to be processed before we find two messages with the same digest (using only for this purpose the most recently designed supercomputer in the U.S., able to perform about $10^{15}$ operations a second, it would still take about a year to calculate such a quantity of MD5 digests). However, by studying the internal structure of MD5, [Dob1996] described a way in which one could find, in about 10 hours and with a Pentium PC, two messages with the same digest with a probability of about 0.05% (while this kind of attack does not yet threaten practical applications of MD5, it comes rather close).
[Ste2000] recommends that the Cookie should be as small as possible to avoid fragmentation. A Cookie is usually smaller than 100 bytes. Apart from the MAC already discussed, most SCTP implementations include the following fields in the Cookie:
- The information exchanged in the INIT and INIT ACK chunks: the Verification Tags of both the client and the server, the client's Advertised Receiver Window Credit and Initial TSN, the number of incoming and outgoing streams and the valid IP addresses used by the client (or its host name).
- The lifetime of the Cookie, so that a hypothetical attacker would not have enough time to crack the MAC included.
- The Tie-Tags.
The Tie-Tags are two 32-bit values that are normally set to 0. However, in case the INIT is received when an association is already established (or is in its establishment phase), they carry copies of both the client's and the server's Verification Tags at the moment the INIT arrived at the server. This information, together with some rules regarding the choice of the Verification Tag depending on the state of the receiver of the INIT (see section 5.2 of [Ste2000]), helps to identify situations such as: initialization collision, restart of the peer, receipt of old or retransmitted datagrams and false packets generated by attackers. The concept of the Tie-Tag was first included during the last stage of SCTP design in response to the impossibility of differentiating the situations stated above in some cases.
As stated above, the lifetime of the Cookie is limited. Thus, in case the delay between the two hosts is large and the lifetime of the Cookie is too short, establishing an association might become impossible. So, the INIT sender may ask for an extension of the Cookie lifetime with a Cookie Preservative parameter. It simply includes the suggested Cookie life-span increment. The receiver of this parameter may choose to ignore it for its own security reasons.
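The way a server might build and later validate such a Cookie can be sketched as follows. This is a minimal sketch under stated assumptions, not the layout of any real implementation: the function names, the 60-second default lifetime and the field order are invented for illustration, and HMAC with SHA-1 is used only because it is one of the hash functions named above.

```python
import hashlib
import hmac
import os
import struct
import time

SECRET_KEY = os.urandom(32)             # known only to the server, rotated now and then
MAC_LEN = hashlib.sha1().digest_size    # 20 bytes for HMAC with SHA-1


def make_cookie(assoc_info, lifetime=60.0, now=None, key=SECRET_KEY):
    """Return expiry time + association data + MAC as one opaque blob."""
    now = time.time() if now is None else now
    body = struct.pack("!d", now + lifetime) + assoc_info
    return body + hmac.new(key, body, hashlib.sha1).digest()


def check_cookie(cookie, now=None, key=SECRET_KEY):
    """Return the association data, or None if the Cookie is invalid or stale."""
    now = time.time() if now is None else now
    if len(cookie) < 8 + MAC_LEN:
        return None
    body, mac = cookie[:-MAC_LEN], cookie[-MAC_LEN:]
    expected = hmac.new(key, body, hashlib.sha1).digest()
    if not hmac.compare_digest(mac, expected):
        return None                     # modified in transit, or forged
    (expires,) = struct.unpack("!d", body[:8])
    if now > expires:
        return None                     # lifetime exceeded
    return body[8:]


# Hypothetical content: both Verification Tags, a_rwnd, Initial TSN,
# outbound and inbound stream counts, and the two Tie-Tags (0 by default).
info = struct.pack("!IIIIHHII", 0xA, 0xB, 65535, 1, 10, 10, 0, 0)
cookie = make_cookie(info, now=1000.0)
```

Because everything needed to open the association travels inside the blob, the server can forget the INIT as soon as the INIT ACK is sent. A Cookie Preservative request would simply translate into a larger `lifetime` argument, if the server chooses to honor it.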
30 The name of this attack comes from the answer to the question "How many people do you need before the probability of having two or more of them with the same birthday exceeds 50%?". The answer is that only 23 people are needed. Taking into account that with 23 people one can make (23x22)/2 pairs, each of them with a probability of 1/365 of being a hit, it is not really so surprising.
4.3.1.3 Other parameters
SCTP capabilities can be extended by creating new chunks and/or parameters. As the sender might need no answer to the new chunks or parameters, there exists an ambiguity between a receiver that actually processes the chunk or parameter, acts as it is supposed to and does not send back any answer, and a receiver that simply discards the received information because it does not know how to handle it. In the latter case, depending on the Chunk Type or Parameter Type (as explained in section 3.1.2) the receiver may send back an Unrecognized Parameters parameter inside the INIT ACK, or an ERROR chunk (more about this in section 6.2). The receiver of such a parameter may decide to set up the association without the extended functionality, or abort the establishment procedure.
The last defined parameter is the ECN Capable parameter. Its internal format has not been specified, and just its Parameter Type has been reserved for future use of ECN. This parameter should indicate that the INIT or INIT ACK sender understands ECN messages.
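How a receiver might decide what to do with a type it does not recognize can be sketched as below. The sketch assumes the encoding used in RFC 2960 [Ste2000], where the two highest-order bits of the Chunk Type (8 bits) or Parameter Type (16 bits) select the action; the function name is hypothetical.

```python
def action_for_unknown(type_value, width=16):
    """Return (stop_processing, report_back) for an unrecognized type.

    Assumed encoding of the two highest-order bits:
      00 stop and discard        01 stop, discard and report
      10 skip and continue       11 skip, continue and report
    """
    top_bits = (type_value >> (width - 2)) & 0b11
    stop_processing = not (top_bits & 0b10)
    report_back = bool(top_bits & 0b01)
    return stop_processing, report_back


# 0x8000 is the Parameter Type reserved for ECN Capable in RFC 2960.
ecn_capable = action_for_unknown(0x8000)
unknown_chunk = action_for_unknown(0xC1, width=8)
```

With this encoding, an endpoint that does not know the ECN Capable parameter skips it silently and the association is still established, which is what allows such extensions to be deployed incrementally.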
4.4 The last two legs: The COOKIE ECHO and COOKIE ACK chunks
The last two legs of the whole four-way handshake are much simpler than the first two. They are shown in Figure 4-6.
Figure 4-6: Establishment phase in SCTP (last two legs)
Basically, the receipt of the INIT ACK chunk triggers the sending of the COOKIE ECHO chunk, which carries the same Cookie received inside the INIT ACK chunk. Of
course, the datagram carrying that chunk must have its Verification Tag set to the Initiate Tag value received in the INIT ACK chunk. Upon the receipt of the COOKIE ECHO chunk, the server may open a new association with the client (if it has resources and the received Cookie is valid and not yet stale). It is at this moment that the server creates its TCB; before the receipt of the COOKIE ECHO nothing is saved in the server about the association that is in its establishment phase. Then, the server sends back the COOKIE ACK chunk, which does not really carry any extra information but tells the client that the new association was successfully created.
As stated before, the initial goal of the establishment phase was to be able to send data as soon as possible. It might seem that the use of a four-way handshake instead of a two-way one delays the sending of data by one Round Trip Time (RTT). But this is not necessarily true, as the last two datagrams exchanged in the SCTP establishment phase can carry any other chunk (including the DATA chunk) bundled with the COOKIE ECHO or COOKIE ACK chunks. Therefore, when comparing SCTP's and MDTP's establishment phases, we see that the client must wait for a single RTT before it can send any data, which is the same amount of time in both protocols. The server must wait for an RTT between the receipt of the INIT and the receipt of the COOKIE ECHO chunk. This means one extra RTT of waiting compared with MDTP. However, as usually the server cannot send any data to the client before the client itself has made a request, in the normal case both the client and the server suffer the same delay with the two-way handshake and the four-way one, but the four-way one is much more secure.
The so-called Cookie Mechanism is a very neat solution to most of the problems with which SCTP has to deal, and it is one of SCTP's greatest improvements over TCP.
Doing the hard work: Transmission of data 67
5. DOING THE HARD WORK: TRANSMISSION OF DATA
The aim of any transport protocol is the transmission of data. In this respect, SCTP evolved a lot from the first version of MDTP to the publication of the RFC. As the designers of SCTP had complete freedom, they included almost all the features that in TCP are defined as successive extensions (some of which cannot be used at the same time, mostly due to space problems in the TCP options field). In this chapter we will explain the evolution of data transmission in SCTP, and how new additions to TCP's functionality fit inside SCTP, such as the congestion control mechanism, the selective acknowledgements or the report of the receipt of duplicate data.
5.1 Basic data transmission
The two chunks used for data transmission are the DATA chunk, used by the data sender to carry the user data, and the SACK chunk, used by the data receiver to acknowledge the receipt of the DATA chunks. In Figure 5-1 we see the normal way in which data transmission takes place.
As we can see from the figure, every DATA chunk is identified by its TSN. This value plays the same role as the Sequence Number field of the TCP header, with a subtle difference: the TSN counts the DATA chunks sent and not the bytes carried in them, as the Sequence Number does. Therefore, two consecutive DATA chunks will have two consecutive TSNs. During the first 6 releases of the MDTP specification, the MDTP and TCP behaviors were exactly the same in this respect (even the fields had exactly the same names, see Figure 3-1). But in one of the many design team discussions held in April 1999 it was decided that packet marking instead of byte marking was more desirable for signaling transport. In this way, SCTP can make somewhat better use of the 32 bits of the TSN (but this is not a big deal, since one should have $2^{31}$ bytes outstanding, 2 Gbytes, before the difference could be of any help, and this is highly unlikely).
This packet marking can be done in SCTP because the user data is sent to the network inside data blocks, the DATA chunks, each of which can be uniquely identified by its TSN, and so can all the bytes included inside it. In TCP every byte is marked depending on its position in the byte stream being sent to the receiver, and bytes do not belong to any higher-level structure. Thus, a TCP sender is able to freely rearrange the number of bytes of user data it wants to include in a segment. Once the user data has been sent inside several TCP segments (and thus fragmented into specific pieces), those segments can be joined or split later on.
So, for example in case of retransmission, the TCP data sender has the possibility of including in a single segment what was previously included in several different (and consecutive) segments. Joining DATA chunks is not a problem in SCTP either, due to its bundling ability (present since the first version of MDTP). This means that more than one chunk can be included in a single SCTP datagram. So in case of retransmissions, an SCTP data sender can put together in a single datagram several DATA chunks that were previously sent each inside its own datagram. However, what is a real limitation in SCTP is that once a DATA chunk has been sent, the data carried inside it cannot be split later on and sent inside several smaller DATA chunks. This can be a problem if the MTU decreases (see section 5.4).
As seen in the figure, when a DATA chunk arrives at the receiver of data, the receiver must send back a SACK chunk reporting its receipt. The Cumulative TSN Ack is used in the same way as the Acknowledgement Number in TCP. But again, it acknowledges the receipt of all the previous TSNs up to and including the Cumulative TSN Ack, while in TCP bytes are acknowledged, not the segments that carry them.
After the Cumulative TSN Ack, the SACK chunks carry the Gap Ack Blocks. They are used to acknowledge data received out of order. The Cumulative TSN Ack acknowledges all the DATA chunks received up to the TSN it states (acknowledging that TSN as well). However, as the DATA chunks can arrive out of order at their destination, or some of them may even be lost, we need a mechanism to tell the data sender that we have received some TSNs out of order. Thus, if a TSN falls inside a Gap Ack Block it means that it has reached its destination and the data sender does not have to retransmit it even if the Cumulative TSN Ack does not acknowledge it. The Gap Ack Block Start and Gap Ack Block End are 16-bit numbers because they express TSNs relative to the Cumulative TSN Ack. TCP can only report the last byte received in order (using the Acknowledgement Number) unless it uses the option for selective acknowledgement defined in [Mat1996] (see footnote 31). As happened with some other TCP extensions, this ability was directly included in SCTP in its basic specification.
We can also see in Figure 5-1 that the SACK chunk carries at the end a list of Duplicate TSNs. The use of this list is not explained anywhere in the specification of SCTP, but this is not a mistake. This feature was added in the 8th version when the congestion control experts of the TSVWG suggested incorporating it.
At that time they were working on an extension to the Selective Acknowledgements for TCP that could also report duplicate data segments received, work that was finally published in [Flo2000]. Again, SCTP inherited this TCP functionality.
As expected, to provide reliability, if the acknowledgement of a certain TSN is not received within an interval of time, the DATA chunk is retransmitted. However, in SCTP we can play with one more variable than in TCP, which is the set of addresses used by the data receiver. A data loss might mean that the path to that IP address is congested, that a router on the way is misbehaving and losing datagrams, or simply that the network card of the receiver is broken. So, when several addresses can be used, it is advised that when retransmitting a DATA chunk we use a different address than the one to which the DATA chunk was sent the last time. In this way, the sender takes advantage of the multihoming capabilities of SCTP to provide extended reliability (if any of the receiver addresses is working properly, the data transfer will effectively take place). There are, however, some concerns about a malicious use of multihoming to artificially enlarge the sending limits (thus using more network resources than allowed) that will be explained in section 5.2.
The DATA chunks only send data, and the SACK chunks only acknowledge data. In TCP, if both ends are transmitting data, a data segment can also acknowledge data received. This is a nice feature since it saves the bandwidth consumed by acknowledgements. It can also be achieved in SCTP by bundling DATA chunks with a SACK chunk.
31 Both in SCTP and in TCP the acknowledgement of data that arrived out of order is taken as advisory only. User data is not considered fully delivered until it is acknowledged by the Cumulative TSN Ack or the Acknowledgement Number respectively. This is because the data receiver can drop received data that has not been delivered to the upper user yet (although this should be done only in extreme circumstances such as buffer shortage).
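The content of a SACK chunk, as described above, can be derived from the TSNs a receiver has seen. The following is a simplified sketch: the function name and the returned dictionary are hypothetical, and the wrap-around of the 32-bit TSN space is deliberately ignored.

```python
def build_sack(prev_cum_tsn, arrivals):
    """Compute the SACK fields after a series of DATA chunk arrivals.

    prev_cum_tsn: last TSN already acknowledged cumulatively.
    arrivals: TSNs of the DATA chunks received since then, in arrival order.
    """
    seen, duplicates = set(), []
    for tsn in arrivals:
        if tsn <= prev_cum_tsn or tsn in seen:
            duplicates.append(tsn)          # goes into the Duplicate TSN list
        else:
            seen.add(tsn)

    cum = prev_cum_tsn
    while cum + 1 in seen:                  # advance over in-order chunks
        cum += 1

    gaps = []                               # [start, end] offsets from cum
    for tsn in sorted(t for t in seen if t > cum):
        offset = tsn - cum                  # must fit in 16 bits on the wire
        if gaps and gaps[-1][1] == offset - 1:
            gaps[-1][1] = offset
        else:
            gaps.append([offset, offset])

    return {
        "cum_tsn_ack": cum,
        "gap_ack_blocks": [tuple(g) for g in gaps],
        "duplicate_tsns": duplicates,
    }


# TSN 13 and 16 are missing; TSN 12 arrived twice.
sack = build_sack(10, [11, 12, 14, 15, 17, 12])
```

In this example the sender learns that everything up to TSN 12 was delivered in order, that TSNs 14, 15 and 17 need no retransmission for now, and that TSN 12 was received twice.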
Figure 5-1: Basic data transmission
As seen before, MDTP could send piggybacked acknowledgements as TCP does without further problems, so bundling was initially designed for another reason. The reason is that TCP transports a simple stream of bytes, and it is the task of the upper user to insert the proper marks inside the user data so that the receiver can identify the individual data units inside a single byte string received. As SCTP was initially designed to carry telephony signaling packets, whose length is usually in the range of 100 bytes, sending every message in a single SCTP datagram would cause a lot of overhead. So it was one of the design goals that an SCTP endpoint could send several small messages inside a single datagram to reduce the header overhead, and that is the reason why bundling was necessary. Messages inside the
byte string received were initially identified in MDTP thanks to the Part and Of fields (see Figure 3-1). In SCTP they are identified by their Stream Sequence Number (SSN). The use of the SSN field and the streams is discussed below in section 5.3.
SCTP was designed to be able to carry a number of signaling protocols (the adaptation layers defined so far are mentioned in section 8.2). Since the beginning of the existence of the SIGTRAN working group, it was accepted that one of the features that SCTP should support was the identification of the upper protocol it was transporting as its payload (see section 2.5). However, no protocol identifier field was included anywhere in SCTP until its 6th version.
There were several possible options to identify the protocol. One of the easiest was to simply use different SCTP well-known ports for the different protocols carried by SCTP (in the same way that TCP uses port 80 when carrying HTTP or 21 when FTP is the payload protocol). This way, it would be very easy for middle boxes such as proxies or firewalls to know which protocol is being transported by SCTP and act accordingly. However, this had the problem that only one SCTP association transporting one type of protocol could be established between two endpoints. Moreover, if a firewall relies on the SCTP port to decide whether or not to discard a datagram, this barrier can be bypassed by simply using some other port.
There was also the possibility of adding a protocol identifier field in the common header as IPv4 and IPv6 do (with their Protocol and Next Header fields respectively). This would have the same advantages as the well-known ports approach, and none of its drawbacks. But there was a feeling that, since the messages handled by signaling protocols are quite short, it would be nice to have the possibility of bundling several messages of different protocols in the same SCTP datagram.
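Bundling several short messages, each with its own payload protocol, behind one common header can be sketched as follows. The field order follows the DATA chunk drawn in Figure 5-1 and the layout published in RFC 2960 [Ste2000]; the function names, port numbers and protocol identifier values are hypothetical, and the checksum is left at zero rather than computed.

```python
import struct

U_BIT, B_BIT, E_BIT = 0x04, 0x02, 0x01      # flag bits of the DATA chunk


def data_chunk(tsn, stream_id, ssn, ppid, payload, flags=B_BIT | E_BIT):
    """Encode one DATA chunk (type 0) carrying a complete user message."""
    length = 16 + len(payload)              # chunk header plus user data
    chunk = struct.pack("!BBHIHHI", 0, flags, length,
                        tsn, stream_id, ssn, ppid) + payload
    return chunk + b"\x00" * (-len(chunk) % 4)   # pad to a 32-bit boundary


def sctp_datagram(src_port, dst_port, verification_tag, chunks):
    """Prefix bundled chunks with the 12-byte common header."""
    checksum = 0                            # placeholder: not computed here
    header = struct.pack("!HHII", src_port, dst_port,
                         verification_tag, checksum)
    return header + b"".join(chunks)


# Two 100-byte signaling messages of different (made-up) payload protocols.
msg_a, msg_b = b"a" * 100, b"b" * 100
bundled = sctp_datagram(5000, 5001, 0x1234, [
    data_chunk(tsn=1, stream_id=0, ssn=0, ppid=1, payload=msg_a),
    data_chunk(tsn=2, stream_id=1, ssn=0, ppid=2, payload=msg_b),
])
```

Sent separately, the two messages would each pay for their own common header (and their own IP header); bundled, they share one, and each chunk still states which upper protocol it carries.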
So, finally, it was decided to add a Payload Protocol Identifier field inside the DATA chunks, as seen in Figure 5-1. Initially that field was going to be only one byte long, but finally it was decided that it should be a 32-bit value (not only because one byte might not be enough, but also because a 32-bit value fitted perfectly in the existing DATA chunk). Of those 32 bits, 16 bits would be used for the protocol identifier, 8 bits for the variant, and the last 8 bits for the version.
One nice feature of TCP that avoids sending too many acknowledgement segments without data is the so-called Delayed ACK Algorithm as described in [All1999]. Basically it consists of sending an acknowledgement for every second received segment containing data, and never delaying the acknowledgement of a segment more than a fixed amount of time (usually 500 milliseconds). As happened with almost every nice feature of TCP, SCTP also inherited it.
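The delayed acknowledgement rule just described can be sketched as a small event-driven state machine. The class and method names are hypothetical, time is passed in explicitly instead of reading a clock, and the 0.5-second default simply mirrors the 500 milliseconds quoted above.

```python
class DelayedAck:
    """Decide when an acknowledgement has to be sent (illustrative only)."""

    def __init__(self, max_delay=0.5):
        self.max_delay = max_delay
        self.pending = 0          # data packets received but not yet acked
        self.deadline = None      # latest time the pending ack may wait until

    def on_data(self, now):
        """Call when a packet with data arrives; True means 'ack now'."""
        self.pending += 1
        if self.pending >= 2:     # every second packet is acknowledged
            self._reset()
            return True
        self.deadline = now + self.max_delay
        return False

    def on_timer(self, now):
        """Call periodically; True means the delay limit was reached."""
        if self.pending and now >= self.deadline:
            self._reset()
            return True
        return False

    def _reset(self):
        self.pending = 0
        self.deadline = None


acks = DelayedAck()
events = [acks.on_data(0.0), acks.on_data(0.1),      # second packet: ack
          acks.on_data(1.0), acks.on_timer(1.2),     # lone packet: wait
          acks.on_timer(1.5)]                        # limit reached: ack
```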
5.2 Some solutions to avoid congestion
The behavior described in the previous section is the way in which data transfer should take place in case nothing goes wrong. Unfortunately, it is hard to find such a perfect network, especially when dealing with the transmission of data through the Internet. In a real situation packets are reordered on their way, and some of them are discarded or even duplicated. The probability of this actually happening is related to the network usage: if the network is used to send more packets than it is prepared for, all these problems arise.
So, when dealing with data transfer, one of the typical problems is designing algorithms that help the data sender to know the state of the network, and also the processing capabilities of the receiver. The goal of such algorithms is something which is not really easy: not sending more traffic than the network and the receiver can handle (so that the retransmissions of lost packets are kept to a minimum), and avoiding unnecessary retransmissions (so that we only retransmit those packets that were really lost). Normally, congestion in the network produces packet loss, which in turn triggers retransmissions (usually causing the duplicate receipt of several packets), which leads to more congestion. Therefore, the best cure for congestion is prevention, as once it occurs it is hard to deal with.
Packet loss due to congestion has two origins. These two different problems are usually illustrated hydraulically, as is done in Figure 5-2.
Figure 5-2: Two causes of congestion
In Figure 5-2 (a) we see a tap pouring water into a funnel. The tap represents the data sender, and the water drops are the equivalent of the SCTP packets. The funnel and the pipe can be considered as the Internet, and the glass receiving the water drops is the memory buffer of the data receiver. Eventually there will be someone drinking that water, who plays the role of the upper user of SCTP processing the received information and freeing the buffer space. So, in the (a) case, the pipe is thick (thus the bandwidth is big and the network is not congested), but the receiver has a small capacity (a small buffer). So if we open the tap too much (we send lots of datagrams), the receiver will be flooded and it will lose part of the water (data) sent before anybody can drink it (before it is passed to the upper user). This waste of water (datagrams dropped) could be avoided if we simply knew the capacity of the glass (the buffer space) and opened the tap accordingly.
Figure 5-2 panels: (a) Congestion at the receiver; (b) Congestion in the network.
This kind of congestion is relatively easy to manage. We have already discussed in section 3.1.1 how MDTP solved this problem with its In Queue field. SCTP addresses this problem by the use of the Advertised Receiver Window Credit in the INIT and INIT ACK chunks (already seen in section 4.3), as well as in the SACK chunks. This is basically the same as what TCP does. Every time the data receiver sends a SACK, it tells the data sender about the state of its buffers at that moment. So, the data sender should not send more data than the receiver can buffer. When the SACK reaches the data sender the buffer space at the receiver might be different, but as the receiver also reports the TSNs seen so far, the data sender can easily calculate how much outstanding data it is allowed to have. In Figure 5-2 (b) the capacity of the receiver is not a problem, as instead of having a glass we have a whole bucket, but we still have problems. As the thin pipe (the congested network) can evacuate less water (data) than the tap (data sender) is pouring, there will be a moment in which the water level in the funnel (data travelling in the Internet) will grow so much that, again, water will be lost. And this problem is much more difficult to address, as the width of the pipe is not really known; at most it can be guessed from some other information. This is where the congestion avoidance algorithms come into play. MDTP dealt with congestion in the network by having a variable limit on the number of outstanding datagrams. In its initial specification, a simple table said how to decrease or increase that limit when certain quantities of datagrams were lost or acknowledged, but this was a very primitive basis for what was finally used. From the first version of SCTP the same congestion algorithms used in TCP were adopted, with several variations but with the same Additive Increase Multiplicative Decrease (AIMD) behavior.
These algorithms are published in [All1999], and were firstly devised by Van Jacobson in [Jac1988]. They have been used for a long time now by most of the TCP implementations, and they were chosen not only because they work but because it is convenient to have the same sending capabilities than TCP. Otherwise, if SCTP used algorithms that made it more congestion-sensitive than TCP, TCP flows would outcompete SCTP flows for capacity, and vice-versa. There are basically four intertwined algorithms that will be quickly described below. They use two variables, the Congestion Window and the Slow Start Threshold (normally called cwnd and ssthresh). The first one limits the number of outstanding bytes that the data sender can have, and the second one helps to use the right algorithm in the right moment. It is worth noting that in TCP there is one of such variables for the whole TCP connection, while in SCTP there is one per receiver address. The cwnd variable of a specific destination address indicates the quantity of bytes that can be outstanding on that particular address at a given time. So, the bigger a cwnd is, the more data is allowed to be injected into the network destined to that address. There is an open debate about the possibility of using a single cwnd for the whole association instead of one per destination address. Having several of them could allow the data sender to have more outstanding data than it is meant to, without breaking any rule of the protocol, just making load sharing among the interfaces. This is not the idea of multihoming, which is supposed to be used only as a backup in case the Primary Address crashes. However, as different interfaces usually mean different paths, and different states of congestion, there should be a way of applying different congestion variables to different destination address. 
This is one of the problems of multihoming: it had never been seriously tested before the creation of SCTP, and the consequences of its use are not completely known yet.
When the data transmission starts or when no data has been sent for a long time, SCTP uses the slow start algorithm. The initial name for this algorithm was soft start, which does not really give a better idea of what it is about, since it is neither really slow nor soft. Slow start is used to probe the network to determine the available capacity, so the idea is that the cwnd is initially set to at most twice the value of the MTU of the address. However, usually the network is able to carry much more than that quantity without major effort. So, during the slow start phase, when a SACK chunk is received, the value of cwnd is increased by the total size of the acknowledged DATA chunks (limiting this increase to one MTU worth of bytes if more data has been acknowledged). The result is that cwnd increases exponentially, at best doubling every RTT. The complete rules are a little bit more complicated, but the interested reader can check section 7.2.1 of [Ste2000]. When cwnd reaches the value of ssthresh, SCTP changes its behavior to the congestion avoidance algorithm. In this phase, the cwnd is increased by at most one MTU per RTT, so it grows linearly. Again, the complete rules are written in section 7.2.2 of [Ste2000]. If cwnd continues growing, we should reach a point in which the network starts losing packets. A packet loss is always considered a symptom of congestion because with modern technology it is quite unusual for a packet to be dropped due to its corruption when traversing a noisy channel. Therefore, unless there is a reasonable doubt (if we are using satellite links for example), network congestion is always declared responsible for the packet losses. So, if a DATA chunk is not acknowledged within a certain period of time (this time is called Retransmission Time-Out (RTO) and we will deal with it later, in section 5.5), it is retransmitted.
But this causes almost catastrophic consequences to the flow of data, as the cwnd is reduced to one MTU to avoid congestion, starting again with the slow start algorithm. To help recovering from this situation, ssthresh is set to one half of the old value of cwnd (so it takes few RTTs to recover our sending capabilities to one half of the ones we had before), but in any case the overall loss is quite big. To see this graphically, let us take a look at Figure 5-3. In Figure 5-3 (a) we see what should be the normal progression of a data transmission if there are no packet losses (the normal case sometimes even happens for small data transfers). For the sake of simplicity we measure both cwnd and ssthresh in MTUs (it is supposed that all the DATA chunks carry the maximum allowed quantity of bytes) as shown in the left axis, and the time is measured in RTTs. The values of cwnd and ssthresh appear as solid lines (blue and pink respectively). As we see, initially cwnd is set to 2 and ssthresh to 16 (as an example). The green circles represent the DATA chunks sent (whose TSN is the one that appears in the right axis), and the red squares are the SACK chunks (their height indicates the value of the Cumulative TSN Ack, measured in the right axis). We also assume that the data receiver is using the Delayed ACK Algorithm and that the RTT is about 30 times the time needed to put a whole MTU-sized packet on the line. That means that if we are using a 10 Mbps Ethernet with a 1,500-byte MTU, the RTT would be 36 milliseconds. Finally, we also make the unrealistic assumption of having an RTO that is set to 3 RTTs, which is convenient to keep the graph from becoming very large. As we see, during the first RTTs the cwnd is increased exponentially and in about 5 RTTs it reaches the chosen value of ssthresh (we go from 2 MTUs to 16 MTUs in 5 RTTs, which is quite a fast increment). Then, cwnd starts growing linearly, being increased by one MTU every RTT.
We see that at the end, after a little more than 17 RTTs (about 600 milliseconds in our example), cwnd is set to 27 MTUs and we have sent 287 TSNs (which would mean more than 400 Kbytes in the environment described), 260 of them already acknowledged. As no packets were lost, ssthresh was not modified at all.
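The interplay of slow start, congestion avoidance and the retransmission time-out described above can be sketched as a small per-destination state object. This is a simplified illustration and not the full rule set of sections 7.2.1 to 7.2.3 of [Ste2000]: the class and method names are hypothetical, the partial_bytes_acked counter and the floor of two MTUs on ssthresh are assumptions based on a reading of that specification, and conditions such as "the window must be fully utilized" are left out.

```python
from dataclasses import dataclass

MTU = 1500  # bytes; assumed Ethernet path MTU, as in the example of the text


@dataclass
class PathCongestionState:
    """Congestion variables kept for one destination address (simplified)."""
    cwnd: int = 2 * MTU            # initial window of the example: 2 MTUs
    ssthresh: int = 16 * MTU       # arbitrary example value, as in Figure 5-3
    partial_bytes_acked: int = 0   # helper counter for congestion avoidance

    def on_sack(self, acked_bytes: int) -> None:
        """Process a SACK that newly acknowledges acked_bytes of data."""
        if self.cwnd < self.ssthresh:
            # Slow start: grow by the acknowledged bytes, at most one MTU per SACK.
            self.cwnd += min(acked_bytes, MTU)
        else:
            # Congestion avoidance: one MTU once a whole window has been acknowledged.
            self.partial_bytes_acked += acked_bytes
            if self.partial_bytes_acked >= self.cwnd:
                self.partial_bytes_acked -= self.cwnd
                self.cwnd += MTU

    def on_timeout(self) -> None:
        """Retransmission timer expired: halve ssthresh, restart from one MTU."""
        self.ssthresh = max(self.cwnd // 2, 2 * MTU)  # floor is an assumption
        self.cwnd = MTU
        self.partial_bytes_acked = 0

    def on_fast_retransmit(self) -> None:
        """Fast retransmit: cwnd and ssthresh both become half of the old cwnd."""
        self.ssthresh = max(self.cwnd // 2, 2 * MTU)  # floor is an assumption
        self.cwnd = self.ssthresh
        self.partial_bytes_acked = 0
```

With cwnd at 16 MTUs, a call to on_timeout() leaves ssthresh at 8 MTUs and cwnd at a single MTU, which is the situation shown in Figure 5-3 (b). An SCTP sender would keep one such object per destination address, whereas TCP keeps a single pair of variables per connection.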
Figure 5-3: Evolution of cwnd with and without packet losses
We can see the devastating influence of a single packet loss in the whole transmission in Figure 5-3 (b). The beginning of the transmission is exactly the same as in Figure 5-3 (a), but right after reaching the congestion avoidance phase, TSN 34 is lost. The sender continues sending normally, but as the incoming SACKs all have the same Cumulative TSN Ack, the cwnd is not increased during 3 RTTs. Then, the timer expires and TSN 34 is resent, ssthresh is set to 8 MTUs (one half of cwnd), and cwnd is set to a single MTU. This drastically lowers the sending speed. As we see, it takes about 5 RTTs to leave the slow start phase, and then cwnd continues growing slowly. After 17 RTTs, cwnd is set to 11 MTUs, ssthresh to 8 MTUs and 129 TSNs have been sent, 118 of them already acknowledged. Summarizing, a single lost packet roughly halves the throughput of an association (see footnote 32). However, we are not the first ones to notice this, and luckily people already made some fixes to this behavior, so this is not exactly the way in which things really work. To mitigate the effects of a single packet drop another algorithm called fast retransmit is used. The heart of the algorithm is to retransmit a DATA chunk early, when the SACKs show that several other DATA chunks sent later than that DATA chunk have already arrived at the destination while the DATA chunk itself is still unacknowledged. In this way we can avoid the time-out of the retransmission timer.
Footnote 32: Although the figures completely depend on how much data we have to send and when the retransmission happens.
Figure 5-3 panels: (a) No Packet Loss; (b) One Packet Lost. Horizontal axis: Time (RTTs). Left axis: Congestion Window (MTUs). Right axis: TSNs. Series: cwnd, ssthresh, TSN sent, TSN acknowledged.
In TCP, a data segment is fast retransmitted upon the arrival of 3 duplicate ACKs (4 consecutive ACKs with the same Acknowledgement Number). Due to the use of Delayed ACKs (only used when there are no gaps in the incoming data), a data segment is fast retransmitted when the data receiver has gotten 3 or 4 later segments. This algorithm was defined for TCP before the use of its option for selective acknowledgement was widely deployed. So, in SCTP, due to its compulsory use of Gap Ack Blocks, the algorithm is slightly different: if a TSN is not acknowledged in 4 consecutive received SACKs while any other newer TSN is acknowledged in any Gap Ack Block of those 4 SACKs, the TSN must be retransmitted. Moreover, both cwnd and ssthresh variables are set to one half of the value of cwnd in the moment of the fast retransmission. In practice, this should work pretty well, but SCTP specification has a bug related with this fast retransmit procedure that makes it only work when there are few TSNs outstanding. Otherwise the same procedure is applied several times and the final result is sometimes even worse than when the fast retransmit procedure is not used. As stated before, SCTP specification is being studied and there are some needed changes so far, one of them being this fast retransmit issue. Those changes are published in [Ste2002b], and Figure 5-4 shows the differences between the use of fast retransmit in [Ste2000] and [Ste2002b].
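The counting rule described above (a TSN is fast retransmitted once 4 SACKs have reported it missing while acknowledging newer TSNs in their Gap Ack Blocks) can be sketched as follows. The function name and the representation of a SACK as a pair of cumulative acknowledgement and set of gap-acknowledged TSNs are assumptions made for this illustration, and TSN wrap-around is ignored. Each TSN is marked at most once in this sketch.

```python
def tsns_to_fast_retransmit(outstanding, sacks, threshold=4):
    """Return the TSNs reported missing by `threshold` SACKs, in trigger order.

    outstanding: iterable of TSNs sent but not yet acknowledged.
    sacks: list of (cumulative_tsn_ack, gap_acked) pairs, where gap_acked is
           the set of TSNs covered by the Gap Ack Blocks of that SACK.
    TSN wrap-around is ignored to keep the sketch short.
    """
    missing_reports = {tsn: 0 for tsn in outstanding}
    marked = []
    for cumulative_tsn_ack, gap_acked in sacks:
        # sorted() makes a copy, so entries can be deleted while looping.
        for tsn in sorted(missing_reports):
            if tsn <= cumulative_tsn_ack or tsn in gap_acked:
                del missing_reports[tsn]      # acknowledged, stop tracking it
            elif any(newer > tsn for newer in gap_acked):
                missing_reports[tsn] += 1     # a TSN sent later arrived first
                if missing_reports[tsn] == threshold:
                    marked.append(tsn)
    return marked
```

For example, if TSN 34 is lost and the next four SACKs keep a Cumulative TSN Ack of 33 while gap-acknowledging TSNs 35 to 38, the function marks TSN 34 on the fourth SACK and no other TSN.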
Figure 5-4: Use of fast retransmit in [Ste2000] and [Ste2002b]
Figure 5-4 panels: (a) Using Fast Retransmit as defined in [Ste2000]; (b) Using Fast Retransmit as defined in [Ste2002b]. Horizontal axis: Time (RTTs). Left axis: Congestion Window (MTUs). Right axis: TSNs. Series: cwnd, ssthresh, TSN sent, TSN acknowledged.
The main problem with fast retransmit in the present specification of SCTP is that it allows the same TSN to be fast retransmitted several times (every fourth received SACK not acknowledging it and acknowledging subsequent TSNs). So when there are several
TSNs and their acknowledgements in flight, the same algorithm is repeatedly applied, causing cwnd and ssthresh to decrease much more than would be desirable. This behavior is shown in Figure 5-4 (a), while the revised one appears in Figure 5-4 (b). We can see that there is a subtle difference between them. There is another problem with the fast retransmit procedure for SCTP. The data receiver should stop using the Delayed ACK algorithm when it finds any gap in the incoming sequence of TSNs. So, if the datagrams are reordered in the network, and one TSN arrives at the destination before a number n of other TSNs (at least 4 TSNs), there will be n - 3 TSNs that will be retransmitted due to the fast retransmit algorithm. This issue is also solved in [Ste2002b] by only triggering a fast retransmission of a TSN upon the receipt of 4 SACKs that not only acknowledge other TSNs sent later, but that also do not increase the Cumulative TSN Ack. In other words, as long as the acknowledgements arrive in sequence no fast retransmission will be issued (of course the first SACK containing the gap breaks the sequence, but we would need 3 more in which the Cumulative TSN Ack is not advanced). Finally, the fourth algorithm used for congestion control is called fast recovery, also defined in [All1999] and used right after a fast retransmission. TCP without the Selective Acknowledgement option cannot inform the data sender about anything else but the last data segment received in order. As a fast retransmission is triggered by the acknowledgements generated by data received out of order, even if the data is making it to the receiver, there is no possibility of sending anything else but a duplicate acknowledgement (also interesting is the use of NAKs by the data receiver, described in [Fox1989], which make the sender send a specific data segment, but this option has never been widely implemented).
In the typical case, the data sender receives several duplicate acknowledgements and suddenly, when the retransmitted data reaches the destination, an acknowledgement segment with a big advance in the Acknowledgement Number is received (this can be seen for the SCTP case in Figure 5-4). As this is the expected behavior, we can anticipate it by already increasing the cwnd while the duplicate acknowledgements are still arriving, and this is basically the fast recovery algorithm. SCTP, however, does not need that algorithm due to its use of Gap Ack Blocks, so the problem is elegantly solved. An SCTP data sender should follow these guidelines so as not to flood the network with excessive traffic, but as is usually recommended, any other option that is less aggressive (in the sense of injecting fewer packets into the network) is always accepted. Finally, the Duplicate TSNs at the end of the SACK chunks can be used for congestion avoidance purposes. Although there is no standard algorithm that could be used to take advantage of this information, there are some guidelines about its use. The idea is that this information can help to recognize when an unnecessary retransmission was done, so that we can then take the appropriate actions. It is generally accepted that the Duplicate TSNs can be useful to create an adaptive fast retransmit algorithm as discussed in section 8.1.3. Some Internet-Drafts regarding this issue have already been published.
5.3 Several connections inside a single association: the use of streams
A stream, as defined in [Ste2000], is:
o Stream: A uni-directional logical channel established from one to another associated SCTP endpoint, within which all user messages are delivered in sequence except for those submitted to the unordered delivery service.
Therefore, in a way, a stream is a kind of subconnection inside an SCTP association. The number of streams used is negotiated during the association establishment as already shown in section 4.3. UDP packets do not carry any sequencing information that allows the receiver to order them upon arrival. Some applications do not need such ordering, but it is often desirable (if not necessary) that the messages are delivered to the upper user in the same order as they were sent to the network. Telephony signaling applications are of this kind. Those applications that need to keep the order of the data sent through the network may choose TCP as their transport protocol. TCP is ordered, but strict order-of-transmission data delivery is also a restriction for many applications. TCP is byte-stream-oriented and that means that it does not have any way to recognize the beginning and ending of individual messages. So, the whole flow of bytes is managed the same way, and the data must be delivered to the upper user in the same order it was sent to the network. This is because there is no way a TCP receiver can know that several parts of the continuous byte stream are unrelated. This way of transferring data causes the so-called Head-Of-Line (HOL) blocking, which is illustrated with an example in Figure 5-5. If we make a typical request to an HTTP server to download a web page when surfing the Internet, normally we will receive several different files, containing text, graphics or sound. In Figure 5-5 the right side represents the client, and the left side is the server side, which sends three small files (differentiated by the color of the packets containing them) upon the request of the client. For the sake of clarity, we suppose that each file is contained in two datagrams. To illustrate the problem we suppose that the first part of the first file transferred is lost.
So, in Figure 5-5 (a) we see what happens when the three files are sent using a single TCP connection: as the first datagram is lost, even though the second and third files arrived entirely at the client they cannot be delivered to the upper user, as all the data sent by the server must be passed to the user strictly in order. This is the HOL blocking problem. Normally, when dealing with HTTP transfers, things do not work like this. Usually the client opens several different TCP connections, one independent connection per file, and closes each one once its file is completely transferred. This is shown in Figure 5-5 (b), and as we see, this way of working is not affected by the HOL blocking, as the files can be delivered to the user as soon as they arrive (of course all of the files but the first one, whose beginning is still missing). However, we still suffer from the delay involved in opening and closing the TCP connections, and what is worse, we are wasting resources by having several TCP connections open at the same time between the same two endpoints. As the servers have a limit on the number of open TCP connections they can have at the same time, using several of them for the same client lowers the overall number of clients that can be served simultaneously. In Figure 5-5 (c) we see the way SCTP could handle this problem: using a single association with different streams for different files. This way, the server can save resources, not having more than one association per client. But there is another advantage of using a single SCTP association with several streams instead of using several TCP connections. Apart from saving resources and avoiding the delay of establishing those TCP connections, as all the streams belong to the same association, they all share the same
congestion avoidance mechanisms discussed in section 5.2. In Figure 5-5 (b), the different TCP connections have different congestion avoidance parameters, and that can give the client an excessive share of bandwidth. When using different TCP connections, each connection does not know about the existence of the others, and behaves as if it was the only one. So, in our example, the client would use three times the bandwidth of a single TCP connection because at the server side these three connections are managed independently. This would hurt other clients having a single TCP connection, and makes congestion avoidance a harder issue.
Figure 5-5: Head of Line Blocking
Figure 5-5 panels: (a) A single TCP connection; (b) One TCP connection per file; (c) A single SCTP association with several Streams; (d) A single SCTP association sending unordered user messages.
As seen in Figure 5-1 the DATA chunks carry two sequence numbers: the TSN and the SSN. The TSN is global for the whole association, and it is used to recognize packet losses as a DATA chunk is uniquely recognized by its TSN. No matter to which stream it is directed, the TSN of a new DATA chunk will always be set to the last used TSN
incremented by one 33 . The acknowledgements also use this number. In Figure 5-5 (c) the TSN is the number that appears first in the datagrams, and globally identifies them. In SCTP the TSNs mainly do the same work as the Sequence Numbers in TCP. So they are not anything new, even if TSNs identify DATA chunks and Sequence Numbers identify single bytes in the overall flow of data. The new stuff is the SSN and the Stream Identifier that appear in the DATA chunks. The Stream Identifier, as can be guessed by its name, identifies the stream to which this DATA chunk is directed. In Figure 5-5 (c) the server makes use of three streams, but as the Stream Identifier is 2 bytes long, it could use up to 65,536 different streams. This number was thought to be a reasonable compromise between stream capabilities and overhead during the design of SCTP. The SSN identifies a message sent to a given stream, so that all the pieces of a user message have the same SSN and they carry consecutive TSNs. As the user messages can be bigger than one MTU worth of data, there is a need for Fragmentation, so bigger messages can be chopped and included in several SCTP DATA chunks. Fragmentation is a new feature in SCTP, as TCP does not have any need for it because it manages every single byte as an independent entity identified by its Sequence Number. We have already seen in section 3.1.1 how MDTP was able to fragment user messages of up to 255 times the MTU minus the space of the headers (up to 371,280 bytes in the typical case of using IPv4 and having a MTU of 1,500 bytes). SCTP uses another way to fragment virtually any message no matter its length. The mechanism used is similar to the one used by IPv4 to fragment datagrams when the MTU of the next network the datagram must traverse is smaller than the size of the datagram itself. 
In IPv4, the Fragment Offset field indicates where in the whole original IPv4 datagram this fragment is located, and the More Fragments flag is set to 0 if this fragment is the last one. Thanks to them (and the length of the fragments) the receiver can organize the pieces received and determine when it has received all of them. In SCTP the DATA chunks use two flags, the B (Beginning Fragment) flag and the E (Ending Fragment) flag. In the DATA chunk containing the first part of a user message the B bit is set, while the E bit is set only in the last one. So, an unfragmented message carried inside a single DATA chunk will have both flags set, and an intermediate fragment of a user message will have these two flags unset. Moreover, all the DATA chunks containing fragments originated by a single user message will have the same SSN and their TSNs will be consecutive. In this way, the only limitation we have is the TSN, and so there is a possibility of sending fragmented messages of up to 2^31 times the MTU (in Ethernet this is more than 2 Terabytes). This message size is far bigger than any expected buffer space within the next decades. This way, the datagrams sent to a specific stream will be delivered to the user in the same order they were put into the network (and so the client will have a clean copy of the files the HTTP server sent). But as the files were sent using different streams, the order of delivery of the files will not necessarily be the same in which the server transferred them. So, we avoid the HOL blocking and the overhead of having one open SCTP association per file transferred (the streams are cheap to manage). As the final user of a web page is a human who will read it, the problem described inside this HTTP context does not seem to be that horrible (but it is still quite nice to be
Footnote 33: The TSN is a 32-bit number and so, once it reaches the value ffffffff (hexadecimal), it wraps and the next value is 0. However, in the early versions of MDTP the 32-bit Sequence Number field could use numbers from 0 to 7fffff000 (hexadecimal). Before reaching that value, a special procedure to reset the Sequence Number had to be carried out. This scheme did not last long.
able to start reading text before all the images have been downloaded for example). However, we should think about some more critical applications such as telephony signaling. We could for example use one SCTP stream to carry the signaling information of a specific phone call, and even though several calls will be managed inside a single SCTP association, they will be internally treated as different flows of data, and so delays or losses in one of them will not affect the others. Other applications that could make use of streams are real time multimedia applications. Thinking about teleconferences we could send voice and image through different streams, and so, in the typical case that the link does not have enough bandwidth to carry the images (at least not all of them on time), we could still hear what is happening independently of the image. But this is not the only feature of SCTP regarding the ordering of data. As seen in Figure 5-1 the DATA chunk uses three flags, the B and E flags already commented, and the U (Unordered) flag. An unordered user message is delivered as soon as it arrives (all its fragments) to the destination. All the DATA chunks of an unordered message have the U flag set. This kind of DATA chunks do not use the SSN or the Stream Identifier fields, but they can still be reassembled thanks to the TSN and the B and E flags. These kind of messages could also help to avoid the HOL blocking problem, as one could simply send unordered messages that would be delivered to the upper user as soon as they arrive. This in fact would provide a functionality similar to UDP with the possibility of sending fragmented messages that would be reassembled in the right way at the destination. This possibility is shown in Figure 5-5 (d) and as we see, it also solves the HOL blocking problem. So, as we have seen, SCTP DATA chunks are ordered at three levels:
At the user message level: All the DATA chunks containing fragments of a bigger user message are always ordered at the destination, so the user messages are always reconstructed in the right way. This level of ordering is always present and it is provided by the B and E flags, as well as the TSN.
At the stream level: The user messages contained in the DATA chunks are also ordered inside streams. However, DATA chunks directed to different streams are unrelated and there is no required ordering among them, and the unordered messages are always delivered upon arrival, without any attempt at ordering. The SSN and the Stream Identifier provide this level of sequencing.
At the association level: All the DATA chunks sent inside an association are sequenced so they can be unambiguously acknowledged. The TSN carries the information to make this ordering possible.
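The user message level of ordering can be made concrete with a minimal sketch of fragmentation and reassembly using the B and E flags and consecutive TSNs. The DataChunk class and both function names are hypothetical, max_payload stands for the room left in one packet after the headers, and TSN wrap-around is ignored.

```python
from dataclasses import dataclass


@dataclass
class DataChunk:
    """Only the DATA chunk fields needed for this illustration."""
    tsn: int
    stream_id: int
    ssn: int
    begin: bool    # B flag: first fragment of the user message
    end: bool      # E flag: last fragment of the user message
    payload: bytes


def fragment(message, max_payload, first_tsn, stream_id, ssn):
    """Split one user message into chunks with the same SSN and consecutive TSNs."""
    if not message:
        raise ValueError("a user message must contain at least one byte")
    pieces = [message[i:i + max_payload]
              for i in range(0, len(message), max_payload)]
    last = len(pieces) - 1
    return [DataChunk(first_tsn + i, stream_id, ssn, i == 0, i == last, piece)
            for i, piece in enumerate(pieces)]


def reassemble(chunks):
    """Return the original message, or None while fragments are still missing."""
    ordered = sorted(chunks, key=lambda c: c.tsn)
    consecutive = all(b.tsn == a.tsn + 1 for a, b in zip(ordered, ordered[1:]))
    if not ordered or not (ordered[0].begin and ordered[-1].end and consecutive):
        return None
    return b"".join(c.payload for c in ordered)
```

A message that fits in one chunk comes out with both flags set, and the middle chunk of a three-chunk message has both flags unset, exactly as described for the B and E bits.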
Thanks to these three levels of ordering, the user messages can be sent to the receiver either in the same order as they were sent by the data sender, partially ordered (using streams), or completely unordered (sending unordered DATA chunks), even if the messages must be fragmented. These three possibilities seem to be sufficient for almost any ordering scheme.
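The stream level of ordering, and the way it avoids HOL blocking, can be sketched with a small receiver that delivers messages in SSN order inside each stream and delivers unordered messages at once. The class and its attribute names are hypothetical, each stream is assumed to start at SSN 0, and the 16-bit width of the SSN is an assumption taken from the specification rather than from this section.

```python
class StreamReceiver:
    """Delivers complete user messages in order within each stream (sketch)."""

    SSN_MODULO = 65536  # assumed 16-bit Stream Sequence Number

    def __init__(self):
        self.next_ssn = {}    # stream id -> next SSN expected on that stream
        self.pending = {}     # stream id -> {ssn: message} waiting for a gap
        self.delivered = []   # messages handed to the upper user, in order

    def receive(self, stream_id, ssn, message, unordered=False):
        if unordered:
            # U flag set: deliver upon arrival, no stream ordering applied.
            self.delivered.append(message)
            return
        queue = self.pending.setdefault(stream_id, {})
        queue[ssn] = message
        expected = self.next_ssn.get(stream_id, 0)
        # Deliver every message that is now in sequence on this stream only.
        while expected in queue:
            self.delivered.append(queue.pop(expected))
            expected = (expected + 1) % self.SSN_MODULO
        self.next_ssn[stream_id] = expected
```

If the first message of stream 0 is delayed, messages arriving on streams 1 and 2 are still delivered immediately, and the buffered message of stream 0 is released as soon as the missing one arrives. This corresponds to the situation of Figure 5-5 (c).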
5.4 Size matters: MTU discovery
We have already spoken in previous sections about the MTU and fragmentation, both for TCP and SCTP, but we have not explained yet why this has to be done, and how the MTU is calculated. In this section we will try to answer both questions.
Neither TCP nor SCTP has any Total Length field in its headers, so, apparently, they do not have any need for fragmentation. If the user gives us 1 Mbyte of data, why not simply make a datagram containing that 1 Mbyte in the user data field and send it to the receiver? In TCP it would be as easy as that, and in SCTP we would make several DATA chunks of up to 65,535 bytes each, including the DATA chunk header, and bundle them in a single SCTP datagram. In fact, this cannot be done due to the limitations of the network layer. In IP there is an upper limit for the length of the datagram of 65,535 bytes (in IPv4 this also includes the header, in IPv6 it does not). Thus TCP or SCTP should provide datagrams to the IP layer that are small enough to fit in an IP datagram. But again, the IP datagrams must be sent using the physical network that connects the sender to the Internet, and the 65,535 bytes limit of IP is too much for any existing network. So, that is not the real limit for the IP datagrams and one cannot see IP datagrams of 64 Kbytes going around in the Internet. The physical network used to transmit the IP datagrams has a maximum size for its frames, the MTU, varying from some hundreds of bytes up to several thousands. There are several reasons that explain this limitation in the maximum size of packets:
- The Medium Access Control (MAC) layer protocol itself. If, for example, it has a 2-byte Length field, then the biggest packet will be at most 65,535 bytes long.
- Hardware limitations. If, for example, Time Division Multiplexing (TDM) is used, then the speed of the network limits the size of the biggest packet.
- The power of the checksum decreases as the length of the packet grows. Therefore, to achieve a certain level of protection against corruption, packets must have a maximum size.
Table 5-1 shows some of the most typical MTU values found in the Internet, taken from [Mog1990]. As can be seen, the MTUs of different networks differ widely. In practice nobody cares about MTUs smaller than 576 bytes, and some of the values in the table are only rarely used (the 1,500-byte MTU of Ethernet is by far the most common). Fortunately, similar MTUs fall into groups called plateaus (the difference between the biggest and smallest MTU in each group is given in the Variation column of Table 5-1). This makes things easier for the algorithm that discovers the MTU, as explained below.

What happens if the sender gives the IP layer a datagram that fits in a Token Bus frame (IEEE 802.4) and the receiver is located inside an Ethernet network? The answer (in IPv4) is that the router in between will fragment the datagram into pieces that fit the MTU of the second network and send them. The IP layer at the receiver will reassemble the received fragments and deliver the original datagram to the TCP or SCTP engine as if it had never been fragmented. The same would happen if the sender and the receiver were each located in their own Token Bus network, connected by an Ethernet network in between.

So, if the IP layer is able to fragment datagrams, why not simply make TCP or SCTP datagrams that fit in the biggest IP datagram and let that layer do all the fragmentation needed? There are several reasons why this is not the best approach. Let us look at the example in Figure 5-6. A host sends a datagram that fits the local network. That big datagram reaches the router, which must forward it through another network with a smaller MTU. The datagram is too large for that network, so the router fragments it into, let us say, three smaller pieces and sends them. All those fragments have to traverse the Internet before they arrive at the final router, and they can follow different paths. In the example shown in the figure, the third fragment is lost somewhere in the Internet, so even if the other two pieces arrive correctly at the receiver, they will simply be stored at the IP layer waiting for the third piece, which will never arrive. After some time, the whole original datagram will be retransmitted.
MTU      Type of Network                   Variation (%)
65,535   Official maximum MTU
65,535   Hyperchannel                      0.00
17,914   16 Mb IBM Token Ring              0.00
8,166    IEEE 802.4                        0.00
4,464    IEEE 802.5 (4 Mb Max)
4,352    FDDI (Revised)                    2.57
2,048    Wideband Network
2,002    IEEE 802.5 (4 Mb Recommended)     2.30
1,536    Exp. Ethernet Nets
1,500    Ethernet Networks
1,500    Point-to-Point
1,492    IEEE 802.3                        2.95
1,280    Official Minimum IPv6 MTU         0.00
1,006    SLIP
1,006    ARPANET                           0.00
576      X.25 Networks
544      DEC IP Portal
512      NETBIOS
508      IEEE 802/Source-Rt Bridge
508      ARCNET                            13.36
296      Point-to-Point (low delay)        0.00
68       Official Minimum IPv4 MTU         0.00
Table 5-1: Some MTUs found in the Internet
So, the result of sending a big message is that if any of the pieces is lost (and having several pieces makes losing one of them more likely) the whole datagram, and not only the missing pieces, will be retransmitted. Moreover, fragmentation complicates the receiver's operation, since several pieces must be assembled before the datagram can be delivered to the IP user. Reassembly algorithms are not very efficient, and some effort has gone into designing the best possible one, such as the algorithm specified in [Cla1982]. Also, keeping the fragments in memory until the last one arrives wastes resources at the IP layer when one of the fragments has actually been lost.

On lossy channels, fragmentation can cause a severe loss of throughput, because it is harder to ensure the arrival of several smaller fragments than that of one single big datagram. In an extreme case where every datagram is fragmented into, for example, 10 pieces, and the channel loses on average 1 of every 10 packets that traverse it, we can effectively have zero throughput. Moreover, routers need more time to handle an IPv4 datagram if they find that it must be fragmented. So it is not surprising that in case of congestion they start dropping datagrams that exceed the MTU of the next hop. A discussion of the problems of IP fragmentation and how to overcome them can be found in [Cha1998].
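The cost of fragmentation on a lossy channel can be put into numbers. The following Python sketch assumes that every fragment is lost independently with the same probability; that loss model, like the function names, is an assumption made for illustration and is not part of the discussion above.

```python
def delivery_probability(fragments: int, loss_rate: float) -> float:
    """Probability that every fragment of one datagram arrives.

    Assumes independent fragment losses (an assumption of this sketch).
    """
    if fragments < 1:
        raise ValueError("a datagram consists of at least one fragment")
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    return (1.0 - loss_rate) ** fragments


def expected_transmissions(fragments: int, loss_rate: float) -> float:
    """Average number of times the whole datagram must be sent."""
    p = delivery_probability(fragments, loss_rate)
    return float("inf") if p == 0.0 else 1.0 / p


if __name__ == "__main__":
    # One unfragmented datagram versus ten fragments, 10% loss in both cases.
    print(delivery_probability(1, 0.1))
    print(delivery_probability(10, 0.1))
    print(expected_transmissions(10, 0.1))
```

Under this model a datagram split into 10 fragments arrives complete only about 35% of the time (0.9 to the power of 10), against 90% for an unfragmented one. The zero-throughput case mentioned in the text corresponds to the harsher situation in which exactly one of every ten consecutive packets is dropped, so that every fragmented datagram loses a piece.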
Figure 5-6: IP fragmentation
Finally, there is an even stronger reason for keeping packets within the MTU: IPv6 routers do not fragment datagrams that do not fit in the next network's MTU. Instead, they send back to the sender an ICMPv6 Packet Too Big message that includes the MTU of the network that was unable to carry the IPv6 datagram.

So, not exceeding the MTU is convenient, but staying as close as possible to that limit is also important. If we simply send small IP datagrams to avoid any problem with MTUs, carrying only a few bytes of user data in each packet, we waste network resources, because the datagrams carry little information and a lot of overhead due to the IP header.

In a TCP connection, both endpoints exchange the Maximum Segment Size (MSS) option during the establishment phase. It carries the maximum segment size that the network of the sender of the option can handle. Basically it is set to the MTU of the network minus the length of the IP and TCP headers (thus, on an Ethernet with a 1,500-byte MTU and using IPv4, the MSS would be set to 1,460). This establishes an upper limit that must not be exceeded, but it does not reveal the smallest MTU along the path (there can be networks on the path from one peer to the other with a smaller MTU). SCTP does not have anything like that.

As we see, there are good reasons why transport protocols such as TCP or SCTP should implement the so-called Path MTU Discovery algorithm. This algorithm is specified for IPv4 in [Mog1990] and for IPv6 in [McC1996]; both share the same basic idea, with some differences. The IPv4 header has a flag called Don't Fragment (DF). When set, it means that routers must not fragment this IPv4 datagram (thus behaving as in IPv6). The flag was meant to advise routers that the receiver might not be able to reassemble fragments.
So, if the MTU of the next network is smaller than the size of the datagram, the router sends back the ICMP Fragmentation Needed and DF Set message (including the MTU of that network if the router is [Mog1990] compliant). Therefore, the main idea of the Path MTU Discovery algorithm is to start by sending IP datagrams at most as big as the local hop allows (and, in TCP, no bigger than the received MSS), with the DF bit set. Then, as soon as we receive an ICMP message telling us that the packet was too large, we start using the next lower MTU value in Table 5-1 (or the MTU value inside the ICMP message, if it includes that information). The values in the table are grouped into the so-called plateaus, which help to converge on the MTU value more quickly. In this way we notice when the MTU decreases. To notice when the MTU grows, the data sender periodically increases the value of the MTU, again following Table 5-1. If the datagram turns out to be too big and the sender receives the ICMP message back, it returns to the previous MTU value (or uses the MTU value inside the ICMP message, if it includes that information). The lost packet will have to be retransmitted (either by retransmitting the same datagram with the DF bit unset, by fragmenting the IP datagram at the source, or, for TCP, by placing the user data in several smaller datagrams). Retransmitting one datagram every so often is considered a lesser evil than sending smaller packets. Using this method we discover the smallest MTU of the networks on the path from the sender to the receiver, which is exactly what we are looking for.

In IPv6 everything is basically the same, except for two subtle differences. First, there is no DF bit at all, so packets are never fragmented in a router. Second, when the packet is too big for the next network the router sends back an ICMPv6 Packet Too Big message that always includes the size limit of that network. So Table 5-1 is not used, because the ICMPv6 message always carries the exact MTU of the next hop. When we want to test for a bigger MTU value, it is enough to send a datagram as big as the MTU of the local network and then use the information in the ICMPv6 message received, if any.
Again, the packet that triggered the ICMPv6 message is lost, and so the information it carried must be retransmitted. This time there is no DF bit to unset, so we must either fragment the IPv6 datagram at the source or use smaller TCP segments.

There are some problems with the IPv4 implementation of MTU Discovery, as discussed in [Lah2000]. The most important one is the so-called Black Hole Detection. A Black Hole is a router that discards an IPv4 datagram because of its size, while for some reason the datagram sender never receives the corresponding ICMP message. This can be caused simply by bugs in the router software, or by firewalls that filter those ICMP messages so that they never reach their destination. This kind of problem is hard to find; it usually leads to time-outs, and finally the connection is aborted.

There are some other practical problems. If the peer tests a new MTU at a moment of heavy data transfer, several IP datagrams will be lost before we receive the ICMP message and restore the old value of the MTU. Moreover, once a DATA chunk has been created in SCTP and a TSN has been assigned to it, the TSN sequence must be followed, so the already created DATA chunk cannot be divided into two chunks with different TSNs (something that is possible in TCP).

To avoid these problems, the author of this Master's Thesis has decided to use another approach to MTU discovery in his SCTP implementation. Instead of enlarging the MTU and sending normal SCTP packets with the new, bigger value, a HEARTBEAT chunk of the desired length is sent (more about HEARTBEAT chunks in section 6.1). This datagram has the DF bit set, like every datagram sent. Since the internal structure of the HEARTBEAT chunk is not defined in [Ste2000], it is easy to make a HEARTBEAT chunk of the desired size.
If we receive the subsequent HEARTBEAT ACK, the tested MTU is valid, and the MTU can be enlarged up to that value. If instead of the HEARTBEAT ACK we receive an ICMP message telling us that the HEARTBEAT sent was too big, we do not have to modify anything, as the MTU was never increased. Only if the ICMP message contains an MTU value that differs from ours should we modify our MTU. If we do not receive anything, it might mean that we are dealing with a Black Hole, that the network is congested and the HEARTBEAT was lost, or that the destination is down.
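The decisions described in this section can be summarised in a short Python sketch. It is only an illustration under stated assumptions: the plateau list is taken to be the smallest MTU of each group in Table 5-1, and the function names and the outcome labels ("ack", "too_big", "timeout") are hypothetical, not part of any specification or of the implementation mentioned above.

```python
# Assumed plateau values: the smallest MTU of each group in Table 5-1.
PLATEAUS = [65535, 17914, 8166, 4352, 2002, 1492, 1280, 1006, 508, 296, 68]
MIN_MTU = PLATEAUS[-1]


def on_too_big(current_mtu, reported_mtu=None):
    """New path MTU after an ICMP 'too big' report for a normal datagram.

    Uses the MTU carried in the ICMP message when it is present and
    plausible; otherwise falls back to the next lower plateau.
    """
    if reported_mtu and MIN_MTU <= reported_mtu < current_mtu:
        return reported_mtu
    lower = [p for p in PLATEAUS if p < current_mtu]
    return max(lower) if lower else MIN_MTU


def next_probe_size(current_mtu, local_mtu):
    """Size to test when checking whether the path MTU has grown."""
    higher = [p for p in PLATEAUS if p > current_mtu]
    candidate = min(higher) if higher else current_mtu
    return min(candidate, local_mtu)  # never exceed the local hop


def on_probe_result(current_mtu, probe_size, outcome, reported_mtu=None):
    """Path MTU after probing with a HEARTBEAT chunk of probe_size bytes."""
    if outcome == "ack":
        return max(current_mtu, probe_size)  # the tested size is valid
    if outcome == "too_big" and reported_mtu:
        return reported_mtu                  # trust the router's value
    # No answer: Black Hole, congestion or dead peer. Keep the old MTU.
    return current_mtu


if __name__ == "__main__":
    mtu = on_too_big(1500)             # no MTU in the ICMP message
    probe = next_probe_size(mtu, 1500)
    print(mtu, probe, on_probe_result(mtu, probe, "ack"))
```

Because the probe is a HEARTBEAT chunk rather than user data, a failed probe costs no retransmission of DATA chunks, which is the point of the approach described above.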
5.5 I will wait for you: RTO calculation
The time a datagram takes to reach the receiver plus the time its acknowledgement takes to arrive back at the sender is a very important quantity for the data sender. This time is called the Round Trip Time (RTT). Knowing the RTT matters because it is the measure from which we set the Retransmission Time-Out (RTO), which is used for the retransmission timers. One could think that once we have the RTT, calculating the RTO should be easy, but that is not completely true. The problem has been studied for a long time in TCP, and many algorithms have been tested and used. So the RTO calculation is one of the properties SCTP has inherited from TCP, with the main difference that SCTP keeps an RTO for every destination address in use.

The RTT calculation is quite straightforward. It is as simple as saving the time when a TSN was sent and, when the acknowledgement is received, calculating how long it took to arrive. There are, however, some things that must be taken into account. Of course, when measuring the RTT one has to use chunks that are acknowledged upon receipt, such as INIT, COOKIE ECHO, DATA or HEARTBEAT chunks (normally DATA chunks are used for the RTT measurement of the Primary Address and HEARTBEAT chunks for the rest of the addresses). With HEARTBEAT chunks there is no problem, as one can always include inside the chunk itself the time when it was sent. That information comes back inside the HEARTBEAT ACK chunk (see section 6.1) and we use it to make our measurement. However, if DATA chunks are used, we must take care, because if a DATA chunk was retransmitted one never knows which transmission of the DATA chunk triggered the acknowledgement, so no RTT measurement should be taken from a retransmitted chunk. This apparently simple rule has its own name: Karn's algorithm (Karn's rule also says that the RTO should be doubled every time a retransmission is issued).
It is not surprising that SCTP uses it. Some TCP implementations avoid the problem with retransmissions by using TCP's Timestamps option, as defined in [Jac1992]. Another thing to take into account is that, because of delayed SACKs, a received TSN is not acknowledged immediately, and that affects the accuracy of the RTT measurement.

It is precisely the variation in the value of the RTT (due to multiple reasons) that makes the calculation of the RTO harder. If the RTO is set too small, retransmissions may be made when they are not needed. In the opposite case, if the RTO is too large, the retransmission of a lost packet can be delayed for too long. Therefore, the problem is finding the right RTO, taking into account that its value depends on the state of the network and changes over time.

As has already been said, MDTP was at first only meant to be used for telephony signaling transport. That meant that it would be used in well-behaved networks, which would not be congested and would always be under our supervision. In such an environment the RTT was not expected to change, and it was likely to be quite small. So the designers chose a fixed RTO of 160 ms. Then they changed it to 160 ms plus the last calculated RTT. Finally, in the last versions of MDTP, the RTO was set to 160 ms, plus the maximum RTT ever measured, plus the maximum time an acknowledgement could be delayed (in MDTP that last value was negotiated beforehand in the establishment phase).

The algorithm was still very simple, and with the first version of SCTP things changed. As the designers expected to use it in the Internet instead of private networks, the problem posed by the variation of the RTT was completely different. The expected probability density of acknowledgement arrival times changed from something similar to Figure 5-7 (a) to something like Figure 5-7 (b).
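Figure 5-7: Probability density of acknowledgement arrival times: (a) in a private network; (b) in the Internet (horizontal axis: Round Trip Time in ms; vertical axis: probability)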
As can be seen in the figure, in a private network calculating the RTO poses no big problem. But in the Internet the measured RTT can vary rapidly, making the choice of the RTO a more difficult task, so more elaborate algorithms must be used to calculate it. TCP has always used a more complicated algorithm. As defined in [Pos1981c], a TCP implementation should calculate the Smoothed Round Trip Time (SRTT) by using the low-pass filter:

$$\mathrm{SRTT} = \alpha \cdot \mathrm{SRTT} + (1 - \alpha) \cdot \mathrm{RTT}$$

Then the RTO is calculated as $\beta \cdot \mathrm{SRTT}$. The value of $\alpha$ was typically set to 7/8, and $\beta$ was always set to 2. But in 1988 Van Jacobson showed that the fixed value of $\beta$ made the estimate fail to respond when the variance went up, so that it could adapt to loads of at most 30% [Jac1988]. To improve this he proposed using the mean deviation of the RTT values as an easy-to-calculate approximation to the standard deviation. That algorithm was finally published as an RFC in [Pax2000]; it basically adds another calculation before that of the SRTT, namely the calculation of the Round Trip Time Variation (RTTVAR) from each new RTT measurement $R$ with the formula:

$$\mathrm{RTTVAR} = (1 - \beta) \cdot \mathrm{RTTVAR} + \beta \cdot |\mathrm{SRTT} - R|$$

Here $\beta$ is a smoothing gain (1/4 in [Pax2000]), not the multiplier of the older formula. Finally, the estimation of the RTO is modified to be:

$$\mathrm{RTO} = \mathrm{SRTT} + 4 \cdot \mathrm{RTTVAR}$$
Moreover, after every retransmission time-out the RTO value must be doubled. The RTO has a lower and an upper limit, usually set to 1 and 60 seconds respectively. As with many other TCP features, the SCTP designers did not surprise us: SCTP inherited this scheme from TCP, and it is the one that appears in [Ste2000].
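The estimator described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular implementation: the class and method names are hypothetical, the weight of 7/8 on the old SRTT follows the notation used above, and the initial RTO of 3 seconds and the handling of the first sample are assumptions borrowed from [Pax2000], since the text does not state them.

```python
class RtoEstimator:
    """Per-destination RTO estimator (illustrative sketch)."""

    def __init__(self, rto_initial=3.0, rto_min=1.0, rto_max=60.0,
                 srtt_weight=7 / 8, rttvar_gain=1 / 4):
        self.srtt = None            # no measurement taken yet
        self.rttvar = None
        self.rto = rto_initial      # assumed starting value, in seconds
        self.rto_min = rto_min
        self.rto_max = rto_max
        self.srtt_weight = srtt_weight
        self.rttvar_gain = rttvar_gain

    def on_rtt_sample(self, rtt, retransmitted=False):
        """Update SRTT, RTTVAR and RTO from one RTT measurement."""
        if retransmitted:
            return self.rto         # Karn's rule: ambiguous sample, ignore
        if self.srtt is None:       # first sample (assumed initialisation)
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the SRTT before this sample.
            g = self.rttvar_gain
            self.rttvar = (1 - g) * self.rttvar + g * abs(self.srtt - rtt)
            w = self.srtt_weight
            self.srtt = w * self.srtt + (1 - w) * rtt
        rto = self.srtt + 4 * self.rttvar
        self.rto = min(max(rto, self.rto_min), self.rto_max)
        return self.rto

    def on_timeout(self):
        """Double the RTO after a retransmission time-out, up to the limit."""
        self.rto = min(self.rto * 2, self.rto_max)
        return self.rto


if __name__ == "__main__":
    est = RtoEstimator()
    for sample in (0.5, 0.5, 0.8):
        print(est.on_rtt_sample(sample))
    print(est.on_timeout())
```

Since SCTP keeps one RTO per destination address, an implementation along these lines would hold one such object for every address of the peer.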
5.6 The ideas left on the way
During the design of SCTP, people proposed several modifications to the data transmission scheme described above that never gained the support of the community and were discarded. One of them was a special chunk called the CANCEL chunk. Its aim was to avoid retransmitting stale data when SCTP was used as the transport protocol of real-time applications. The proposal was to send this special chunk (basically a DATA chunk containing only the TSN and no data) instead of the original one if the sender already knew that the receiver would discard the retransmitted packet because the information would arrive too late. In this way the otherwise wasted bandwidth could be used for other purposes, and the retransmission of an old chunk would not delay the delivery of other, more relevant packets. Finally this proposal was discarded, mainly for two reasons. First, it would add more complexity to the protocol (especially when dealing with fragmented messages) for what finally seemed to be little gain. Second, with CANCEL chunks the data sender would not be sure of what the receiver really got (and thus we would turn a reliable transport protocol into an unreliable one) or of the state of its buffer. To avoid the second problem, it was suggested that the SACK chunk could carry a list of cancelled chunks, but this made the first problem worse. It was then proposed to send the same DATA chunk with the data field simply removed. However, under a standard Berkeley socket Application Programming Interface (API), a zero-length read would mean the end of the connection. The final decision was to avoid sending zero-length DATA chunks and to leave the issue open so that the feature could be added in the future. That is the nice thing about having a protocol with the extensibility of SCTP: if you are not sure that something will work, you can always leave the problem to future generations.
For some time, the idea of creating and destroying streams on demand was also considered, and an Internet draft was about to be written. However, people on the mailing list agreed that the extended functionality was not worth the added complexity, because there was a rough consensus that this ability was not really useful and would be seldom used. One could always tear down an established association and open a new one with the necessary number of streams. It was also pointed out that one could simply open the maximum number of outbound streams. All in all, you can always program your SCTP implementation so that a stream only consumes resources if it has been used at least once (the only problem is that it is the SCTP data receiver that must be programmed in this way). Finally the proposal was forgotten.
6. IT IS NOT ALL PLAIN DATA
Once we have established a new connection, most of the SCTP datagrams will contain either DATA or SACK chunks (or both). But they are not the only ones exchanged by the peers involved in the association. SCTP has a mechanism to verify that the peer endpoint is up and running even when no data transfer is under way. This procedure helps to keep track of the state of associations that are seldom used. Moreover, as SCTP peers can be multihomed, normally only one of the peer's addresses, the Primary Address, is used, while the others remain as a backup in case the Primary Address fails. But if we are not sending data to those other addresses, we need some other way to know their state. This is the so-called path heartbeat mechanism, discussed in the next section.

However, with the path heartbeat mechanism we can only learn about a complete failure of one of the peer's addresses, or of the whole peer. So SCTP also has a way to tell the other host that something is going wrong on our side, even if it does not necessarily prevent us from continuing to work. This information may help the peer to adapt better to our needs, or simply to know why things are not working as expected. We will speak about this in section 6.2.
6.1 Are you alive? The path heartbeat mechanism
As has already been said, one of the main features of SCTP is its use of multihoming. But when the peer endpoint has several IP addresses in use, one of them is considered the Primary Address and is the one to which datagrams are normally directed. The rest are kept as a backup and are only used if the Primary Address fails. This is somewhat problematic, because there must be a way to know the state of an address that is only rarely used. Knowing the state of the unused addresses is vital to making the right choice when the Primary Address goes down.

TCP has the controversial keepalive mechanism specified in [Bra1989], which basically consists of sending data that is outside the window, which should trigger the sending of an acknowledgement. Upon receipt of that acknowledgement, we conclude that the peer is still alive; if we do not receive anything, it could mean either that the peer is down or that the packet was lost in the network, and so we should try again. The mechanism is controversial because it can tear down an otherwise perfectly good connection if the network is congested (and packets are therefore being lost). Moreover, it normally consumes unnecessary bandwidth (if the connection is not being used, who cares whether it is still in good condition?), which could even cost money on an Internet path that charges per packet. For these reasons, the keepalive mechanism should only be invoked in server applications that might otherwise hang indefinitely and consume resources unnecessarily if a client crashes or aborts a connection during a network failure.

The equivalent algorithm in SCTP is the path heartbeat mechanism. It was added at an early stage of the design of MDTP, in the 4th version, because the designers were concerned about not having a way of keeping track of the state of unused addresses (both their reachability and their RTT). It has never been criticized as strongly as TCP's keepalive mechanism, because it solves a similar yet different problem. In SCTP we use only one destination address to send data, the Primary Address, and if that address fails we must use one of the others. But if one of the addresses has failed, the probability that some other address is not working either is higher. This is because the addresses normally belong to the same physical host, and it is highly probable that datagrams directed to any of those addresses share part of their path to the peer. As the idea is to use one of the backup addresses to solve the problem quickly, we must be quite sure that the new address is in good condition. In any case, some people think that this feature is rather useless, so the path heartbeat mechanism can be disabled if the upper user so decides.

Initially, the heartbeat datagram used in MDTP had a fixed format, with 8 bytes for the time at which the datagram was sent. Upon receipt of the heartbeat datagram, that information had to be included in the answer, directed to the source address of the received datagram, so that the heartbeat sender could take an RTT measurement. The sending frequency was initially set to one heartbeat every 4 seconds to any address that stayed idle³⁴ during that time. Later it was made adaptive by adding the last measured RTT to those 4 seconds. But this change did not really make much difference, as the RTT is usually on the order of some tens of milliseconds. If a certain number of heartbeat datagrams went unanswered, the destination address was considered unreachable.

In the first version of SCTP the same structure was kept (using the HEARTBEAT chunk and its acknowledgement, the HEARTBEAT ACK chunk).
Nevertheless, to avoid flooding the network with HEARTBEAT chunks, only a single HEARTBEAT chunk could be sent (to any of the idle addresses) every 4 seconds plus the last measured RTT. In the 5th version of the SCTP specifications the path heartbeat mechanism was deeply modified. These were the main changes:
- Both the HEARTBEAT and HEARTBEAT ACK chunks were modified. Instead of having 8 bytes for the time at which the HEARTBEAT chunk was sent, the chunk now included an opaque TLV structure of undetermined size that had to be copied into the HEARTBEAT ACK chunk. This was done because some SCTP implementations were unable to choose the source address of their datagrams, so upon receipt of the HEARTBEAT ACK chunk it could be difficult to find out to which address the HEARTBEAT chunk had been sent. An opaque structure gave implementations more freedom to include whatever they wanted.
- The designers considered that allowing only one unanswered HEARTBEAT chunk per association at a time was not enough. So they undid the previous change and managed every address independently of the rest.
- The period of heartbeating was also modified and set to the RTO of the address to which the HEARTBEAT chunk was sent. That value was actually the smallest heartbeating period, because the upper user could define any period as long as it was bigger than the RTO. But usually all the RTOs are set to the minimum value of 1 second, so to avoid sending the HEARTBEAT chunks in bursts, they should be sent once per RTO with jittering of +/- 50%, and with exponential back-off of the RTO if the previous HEARTBEAT chunk was unanswered.
³⁴ An address is considered to be idle during a period of time if no chunk eligible to measure the RTT (INIT, COOKIE ECHO, DATA or HEARTBEAT) has been sent during that period of time.
- The discussion about whether only one HEARTBEAT chunk should be in flight per destination address or per association continued. The final decision was the latter, because with many destination addresses the overhead produced by the heartbeat algorithm was considered too high. So in the 10th version of the SCTP specification this feature was modified again, allowing only one unanswered HEARTBEAT chunk per association.
- Until the 9th version of the SCTP specifications, a HEARTBEAT chunk was considered lost if it remained unanswered one RTO (with jittering of +/- 50%) after it was sent. But the designers wanted to give implementors more freedom to adjust this time, so they created the Heartbeat Interval concept. The Heartbeat Interval is simply an amount of time configurable by the upper user. When a HEARTBEAT chunk is sent to a specific address, it is considered lost after the RTO of that address (with jittering of +/- 50%) plus the value of the Heartbeat Interval.
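The timing rule that resulted from these changes can be written down as a small Python sketch. It follows the description above (RTO of the destination with a jitter of +/- 50%, exponential back-off of the RTO after unanswered HEARTBEAT chunks, plus the Heartbeat Interval); the function name, its parameters and the 60-second cap are assumptions made for this example.

```python
import random


def heartbeat_timeout(rto, heartbeat_interval, unanswered=0,
                      rto_max=60.0, rng=random):
    """Seconds to wait before a HEARTBEAT chunk is considered lost.

    rto                -- current RTO of the destination address, in seconds
    heartbeat_interval -- Heartbeat Interval configured by the upper user
    unanswered         -- consecutive HEARTBEAT chunks left unanswered so far
    rng                -- any object with a uniform(a, b) method
    """
    # Exponential back-off of the RTO, capped at the assumed upper limit.
    backed_off = min(rto * (2 ** unanswered), rto_max)
    # Jitter of +/- 50% so that heartbeats to several addresses do not
    # leave in bursts when all the RTOs share the same minimum value.
    jitter = rng.uniform(0.5, 1.5)
    return backed_off * jitter + heartbeat_interval


if __name__ == "__main__":
    # With an RTO of 1 s and a Heartbeat Interval of 30 s, the result
    # lies between 30.5 s and 31.5 s.
    print(heartbeat_timeout(1.0, 30.0))
```

Passing the random source as a parameter keeps the function easy to test with a fixed or seeded generator.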
Figure 6-1: The path heartbeat mechanism in SCTP
This was the last change in the heartbeat algorithm. Figure 6-1 shows the internal structure of the HEARTBEAT and HEARTBEAT ACK chunks. The Heartbeat Information field typically carries the IP address to which the HEARTBEAT chunk was directed, as well as the time when it was sent. So upon receipt of the HEARTBEAT ACK we can take the RTT measurement needed to calculate the RTO (see section 5.5). However, as the internal structure of the Heartbeat Information field is completely undefined, one can even use the heartbeat algorithm to measure the MTU (more about this in section 5.4).

[Figure 6-1 fields. HEARTBEAT datagram: Source Port Number, Destination Port Number, Verification Tag = Tag Z, Checksum, Chunk Type = 4 (HEARTBEAT), Chunk Flags (Reserved), Chunk Length, Heartbeat Info Type = 1, Heartbeat Info Length, Sender-specific Heartbeat Info. HEARTBEAT ACK datagram: the same fields with Verification Tag = Tag A and Chunk Type = 5 (HEARTBEAT ACK).]
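As an illustration of the chunk layout in Figure 6-1, the following Python sketch builds a HEARTBEAT chunk, echoes it as a HEARTBEAT ACK and recovers the information needed for the RTT measurement. The chunk and parameter type values come from the figure; the content of the sender-specific field (an IPv4 address followed by an 8-byte timestamp, plus optional filler for MTU probing) is only an assumption of this example, since that field is opaque. The SCTP common header and its checksum are left out, and all function names are hypothetical.

```python
import socket
import struct

HEARTBEAT, HEARTBEAT_ACK = 4, 5   # chunk types shown in Figure 6-1
HEARTBEAT_INFO = 1                # Heartbeat Info parameter type


def build_heartbeat(dest_ip, sent_at, filler=0):
    """HEARTBEAT chunk whose opaque info holds dest_ip and sent_at."""
    info = socket.inet_aton(dest_ip) + struct.pack("!d", sent_at)
    info += b"\x00" * filler                      # e.g. to probe an MTU
    param = struct.pack("!HH", HEARTBEAT_INFO, 4 + len(info)) + info
    chunk = struct.pack("!BBH", HEARTBEAT, 0, 4 + len(param)) + param
    return chunk + b"\x00" * (-len(chunk) % 4)    # pad to a 4-byte boundary


def make_ack(heartbeat):
    """Receiver side: copy the Heartbeat Info unchanged into the ACK."""
    ctype, _flags, length = struct.unpack("!BBH", heartbeat[:4])
    if ctype != HEARTBEAT:
        raise ValueError("not a HEARTBEAT chunk")
    return struct.pack("!BBH", HEARTBEAT_ACK, 0, length) + heartbeat[4:]


def parse_ack(ack):
    """Sender side: recover (dest_ip, sent_at) from a HEARTBEAT ACK."""
    ctype, _flags, _length = struct.unpack("!BBH", ack[:4])
    ptype, plen = struct.unpack("!HH", ack[4:8])
    if ctype != HEARTBEAT_ACK or ptype != HEARTBEAT_INFO:
        raise ValueError("not a HEARTBEAT ACK with Heartbeat Info")
    info = ack[8:4 + plen]        # plen counts the 4-byte parameter header
    dest_ip = socket.inet_ntoa(info[:4])
    (sent_at,) = struct.unpack("!d", info[4:12])
    return dest_ip, sent_at


if __name__ == "__main__":
    hb = build_heartbeat("192.0.2.1", 12.5)
    print(len(hb), parse_ack(make_ack(hb)))
```

With the address and the sending time echoed back, the sender can tell which destination answered and compute the RTT as the current time minus `sent_at`, whatever source address the peer chose for its reply.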
6.2 You are wrong: the Operational Error chunk
When designing a protocol, one always specifies how things should be done. However, many circumstances can make things go differently, from simple implementation bugs or hardware failures to corruption of packets in the network or even external attacks. SCTP is quite a complicated protocol and many problems can appear. Some of them can even be solved by the SCTP implementation itself if it knows what is happening. If the problem is so serious that it needs fixes outside the SCTP protocol itself, one can always look at the packet traces taken from a protocol analyzer. Quite often, however, one needs to know the state of the other peer to really understand what is going on.

ICMP (and ICMPv6) is a protocol designed exclusively to report errors in the processing of IP datagrams and to give the network manager some diagnostic tools. It is used, among other things, to verify the existence of a path to a specific IP address, to report congestion in a router, to indicate that a specific datagram could not be delivered, or even to implement the Neighbor Discovery algorithm. ICMP not only serves to debug problems at the IP layer; it is also used, for example, by TCP to implement the MTU Discovery algorithm (making use of the Packet Too Big message), as explained in section 5.4, or to modify its sending rate when the Source Quench message is received (more about this in section 3.1.2). But some problems are too specific to the transport layer to be solved with ICMP. Thus, it is useful to have a mechanism that reports errors at the TCP or SCTP level.

TCP does not have any method to report errors. It handles transmission errors, such as received datagrams apparently not directed to the host that received them, by responding with a datagram that has its RST flag set.
The receipt of such a datagram aborts the connection, leaving the TCP implementation no possibility of fixing any problem.

MDTP was not initially a very complicated protocol (in fact, the designers considered the 6th version of MDTP already too complicated and cut part of its functionality in the next version). Thus it did not have any way to report errors to the peer endpoint. During its evolution it became more complicated, and when SCTP was born the designers decided to include a mechanism to notify the peer endpoint of certain error conditions. This design idea resulted in the inclusion of a dedicated chunk, the ERROR chunk, whose shape did not change at all during the whole design phase of SCTP. Figure 6-2 shows what it looks like:
Figure 6-2: The ERROR chunk in SCTP
[Figure 6-2 layout. Common header: Source Port Number, Destination Port Number, Verification Tag = Tag Z, Checksum. ERROR chunk: Chunk Type = 9 (ERROR), Chunk Flags (Reserved), Chunk Length, followed by the Error Causes.]
The ERROR chunk contains one or more error causes. As shown in Figure 3-2, the error causes have been TLV structures since the 2nd version of the SCTP specifications. In the final version, 10 different types of error cause are defined, and some others have been defined in the extensions to SCTP. Let us take a closer look at the error causes that are present in [Ste2000]:
The Invalid Stream Identifier error cause is sent when the peer sends us a DATA chunk directed to a nonexistent stream. Normally this means that the peer is broken and there is not much to be done, but the receipt of this error helps to fix the implementation bug that originated it.

If a mandatory parameter is missing in a received INIT or INIT ACK chunk, the Missing Mandatory Parameter error cause should be sent in response. This error cause was defined in the first version of SCTP, when some variable-length mandatory parameters were expected to be defined in the future. In reality the only such parameter is the State Cookie of the INIT ACK chunk, so the use of this error cause is very limited. It probably means that the INIT ACK sender (the server) is not working properly.

As explained in chapter 4, the State Cookie included in the INIT ACK chunk has a limited lifetime. If the server is too restrictive and sets that life span to a very small value, or if there are long delays in the path joining the two hosts involved in the association, the State Cookie may already be stale when the COOKIE ECHO chunk reaches the server. In that case, the server should send an error cause of the Stale Cookie Error type, giving the client a hint about the problem that aborted the establishment of the association. This error cause reports, in microseconds, how late the State Cookie arrived. Normally this will trigger another attempt to establish the association, this time including the Cookie Preservative parameter in the INIT chunk to try to extend the lifetime of the State Cookie.

Another reason to abort the establishment phase in SCTP is not having enough resources to open a new association. In that case, the peer lacking memory should send the Out of Resource error cause, so that the initiator can try to establish the association later.
The INIT chunk can include three types of parameters specifying destination addresses to be used by the server: an IPv4 Address, IPv6 Address or Host Name Address parameter. The server might not support some of these address types, in which case it should send in response the Unresolvable Address error cause, including those addresses that cannot be used. The client might simply give up, or try again without those address types.

The SCTP protocol has been designed to be easily extensible. However, this means that new chunks and parameters will be defined in the future, and implementations that do not know about those extensions will not understand them. Obviously the sender of those chunks or parameters might want to know whether they had the desired effect. So both the Chunk Type and the Parameter Type have one bit that makes the receiver send back the Unrecognized Chunk Type or Unrecognized Parameters error cause if it does not support the extension (see section 3.1.2).

A broken implementation can set any of the parameters of the INIT or INIT ACK to an invalid value. The receiver of such an invalid chunk should send back the Invalid Mandatory Parameter error cause to help fix the bug.
The receipt of a DATA chunk that does not include any data is a symptom that the data sender has some problem. The No User Data error cause is thus sent in response to such a DATA chunk to help fix the bug in the data sender.

If a COOKIE ECHO chunk is received showing that the peer has restarted, we should set up a new association. However, if the receiver of such a chunk is in the SHUTDOWN-ACK-SENT state, it means that the peer crashed while trying to shut down the association. In that case it makes no sense to establish a new association. So the receiver of the COOKIE ECHO chunk must send a datagram with an ERROR chunk containing the Cookie Received While Shutting Down error cause, bundled with a SHUTDOWN ACK chunk (more about this chunk in section 7.3). The receiver of that datagram should answer with a SHUTDOWN COMPLETE chunk, and it will probably not try to re-establish the association.
Some of those error causes help the SCTP implementations to solve a problem that might be transitory. Others are normally included inside an ABORT chunk (see section 7.2) instead of an ERROR chunk, because they are sent in response to a datagram that proves that the peer has some serious bug, so the association must be terminated. In either case they are useful and help to find problems that would otherwise be more difficult to fix.
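The TLV encoding of the error causes can be illustrated with a short Python sketch. The cause codes and the ERROR chunk type are those listed in RFC 2960; the function names are hypothetical, and the handling of padding is simplified (the chunk length here counts the padding of the last cause, which does not matter for the causes built in the example because they are already multiples of 4 bytes).

```python
import struct
from typing import List, Tuple

# Cause codes as listed in RFC 2960 (section 3.3.10).
INVALID_STREAM_IDENTIFIER = 1
MISSING_MANDATORY_PARAMETER = 2
STALE_COOKIE_ERROR = 3
OUT_OF_RESOURCE = 4
UNRESOLVABLE_ADDRESS = 5
UNRECOGNIZED_CHUNK_TYPE = 6
INVALID_MANDATORY_PARAMETER = 7
UNRECOGNIZED_PARAMETERS = 8
NO_USER_DATA = 9
COOKIE_RECEIVED_WHILE_SHUTTING_DOWN = 10

ERROR_CHUNK_TYPE = 9


def pack_cause(code: int, info: bytes = b"") -> bytes:
    """Encode one error cause as a TLV: code, length, cause-specific info."""
    length = 4 + len(info)  # the length covers the 4-byte header, not the padding
    tlv = struct.pack("!HH", code, length) + info
    return tlv + b"\x00" * (-len(tlv) % 4)  # pad to a multiple of 4 bytes


def pack_error_chunk(causes: List[bytes]) -> bytes:
    """Encode an ERROR chunk (type 9, flags 0) carrying the given causes."""
    body = b"".join(causes)
    # Simplification (assumption): the padding of the last cause is counted.
    return struct.pack("!BBH", ERROR_CHUNK_TYPE, 0, 4 + len(body)) + body


def unpack_causes(chunk: bytes) -> List[Tuple[int, bytes]]:
    """Decode the (code, info) pairs carried by an ERROR chunk."""
    chunk_type, _flags, chunk_len = struct.unpack_from("!BBH", chunk)
    if chunk_type != ERROR_CHUNK_TYPE:
        raise ValueError("not an ERROR chunk")
    causes, offset = [], 4
    while offset + 4 <= chunk_len:
        code, length = struct.unpack_from("!HH", chunk, offset)
        if length < 4:
            raise ValueError("malformed error cause")
        causes.append((code, chunk[offset + 4:offset + length]))
        offset += length + (-length % 4)  # skip the padding as well
    return causes


# A State Cookie that arrived 250000 microseconds late, plus a bad stream id 7.
stale = pack_cause(STALE_COOKIE_ERROR, struct.pack("!I", 250_000))
bad_stream = pack_cause(INVALID_STREAM_IDENTIFIER, struct.pack("!HH", 7, 0))
chunk = pack_error_chunk([stale, bad_stream])
print(unpack_causes(chunk))
```

The same cause encoding would be reused inside an ABORT chunk, which is why the sketch keeps `pack_cause` separate from the chunk header.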
7. THIS IS THE END: THE SHUTDOWN AND ABORT ALGORITHMS
Releasing a connection is always easier than establishing it. Even so, one can find more difficulties than expected, and so the final design is the result of an evolution in which the pitfalls that appeared were solved. The final procedure as it appears in [Ste2000] will be slightly modified, but in any case let us see what those problems were and how they were handled. As can be seen from the state diagram of Figure 3-3, there are two ways to end an association: the graceful shutdown procedure and the abortion of the association. This has not always been so. In the next sections we will explain how the terminating process evolved from a simple one-way procedure in MDTP to the abort and shutdown procedures in SCTP. These two mechanisms to terminate an association will be discussed in separate sections.
7.1 Terminating associations in MDTP
In the initial versions of MDTP there were two ways of finishing an association. One of them was the so-called Endpoint Drain, which basically consisted of sending a special message to the peer endpoint of an association. That message did not need to be acknowledged, and the association was simply terminated: the sender erased all information about it as soon as the message was sent, and the receiver did the same as soon as the message was received.

The other way of finishing associations was the so-called Termination of an Endpoint. When this procedure was called, all the associations were terminated by sending another special message to all the peer endpoints. In the end it was much the same as the Endpoint Drain, sending an unacknowledged message to terminate the association. The only difference was that all the associations were terminated and not just one.

There was no acknowledged way of shutting down an association. The explanation is the same as for some other early MDTP properties: MDTP was meant to run in an environment in which packet losses were a really rare event. As the delivery of the packets was assured by the reliability of the network itself, the acknowledgement did not seem necessary. Moreover, one of the initial design principles was that associations should be established and terminated as quickly as possible. Thus, having to wait for an acknowledgement was considered a waste of time.

This was the scheme used in the first 7 versions of MDTP. But the protocol was gaining popularity and starting to be seen as a much more general protocol than a simple telephony signaling transport, and this change in purpose had to be translated into changes in its design. This rather naive terminating process had to be changed, as anybody forging the peer's address could tear down an association. Several changes were made.
The Termination of an Endpoint procedure (which was meant to be used only rarely, when the endpoint had serious problems) was left as it was, changing only the format of the datagram sent, which was then called an Abort datagram. But the Endpoint Drain procedure was modified and renamed Graceful Shutdown of an Association.

The new mechanism was still quite simple. The main improvement was the birth of the Verification Tag concept. At that time, however, it was not located in the Common Header of every datagram. The Verification Tag was inserted only when the datagram carried sensitive information, as in establishment datagrams, stream management datagrams, and terminating datagrams (basically all the datagrams except those that simply carried data or acknowledgements). So, the shutdown initiator sent a special datagram (the Shutdown datagram) carrying the peer's Verification Tag and the last in-order TSN received. But as the peer might still have some data to send at that point, it could continue sending data until all of it was acknowledged. After this it erased all the information about the terminated association and replied with the Shutdown Acknowledgement datagram, carrying the shutdown initiator's Verification Tag. At that moment the shutdown initiator also erased the information about the association and the whole process was finished.
7.2 A hard end for an association's life: Aborting an association in SCTP
When SCTP came into play, this same scheme was used. Again, there was an abort procedure, in which the party wanting to abort the association simply sent an ABORT chunk and deleted the information about the association. And there was also a shutdown procedure, in which one peer sent the SHUTDOWN chunk, which was answered with a SHUTDOWN ACK chunk just as described above. The abort procedure was kept mostly the same as in the last versions of MDTP.

But then the members of the mailing list started to think about what would happen if the ABORT or SHUTDOWN ACK chunk was lost when running SCTP in lossy environments. In that case, one peer would be terminated while the other would still think that the association was up and running. Following the normal procedure of aborting an association after a maximum number of consecutive data retransmissions, it could take minutes to consider the peer unreachable. If the other peer is not sending data and does not have the heartbeat mechanism enabled (see section 6.1), the resources allocated for that association might never be freed.

Therefore, the concept of the Out Of The Blue (OOTB) datagram arose. An OOTB datagram is one that seems to be valid but that is not directed to any of the open associations (due to a bug in the sending party, or because we crashed and have just recovered). A host that receives an OOTB datagram should reply with an ABORT. But as the host does not know the peer's Verification Tag, it should use the one carried in the incoming OOTB datagram instead (the reply is sent with the Reverse Verification Tag, in SCTP jargon).
There are some exceptions to the management of OOTB datagrams: the INIT and COOKIE ECHO chunks (they fit the OOTB datagram definition, but obviously, when somebody is trying to establish a new association we do not know anything about it in advance), the ABORT chunk (which should not be answered at all, to avoid a datagram storm), and the SHUTDOWN chunk (which should be answered with a SHUTDOWN ACK instead 35). With the OOTB datagram concept in place, as soon as one datagram was sent to an already terminated host, we would receive an ABORT chunk back, thus quickly closing our side of the association.

35 In the initial designs the SHUTDOWN ACK chunk carried an all-zeros Verification Tag, but this was modified so that it carries a copy of the Verification Tag of the received datagram, as in the case of the ABORT chunk.

Not only the way of using the ABORT chunk was modified, but also the ABORT chunk itself. Initially the ABORT chunk did not have any body at all: it had only the compulsory chunk header. To tell the peer something about the cause of the error that led to the abortion of the association, one had to bundle an ERROR chunk with the ABORT chunk (with the ABORT chunk as the last one in the datagram, since otherwise the ERROR chunk would not be read). This was considered clumsy, so the ABORT chunk was finally modified to carry the same error causes used in the ERROR chunks, as explained in section 6.2.

Some time later, due to the obligation of sending an ABORT chunk in response to an OOTB datagram, another modification was made. One of the reserved flags in the ABORT chunk was renamed the T (TCB Missing) flag. This flag is set when the ABORT chunk is sent in response to an OOTB datagram, meaning that no Transmission Control Block (TCB) was found for this association. As not having the TCB means that we do not know the peer's Verification Tag, a datagram carrying an ABORT chunk with the T flag set has its Verification Tag field set to the same value as the Verification Tag of the received OOTB datagram (i.e., it carries the Reverse Verification Tag). The receipt of an ABORT chunk with its T flag set normally means that the peer has restarted. The final ABORT procedure is shown in Figure 7-1 below:
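Figure 7-1: The abort procedure in SCTP
[Figure 7-1 layout. Common header: Source Port Number, Destination Port Number, Verification Tag = Tag Z, Checksum. ABORT chunk: Chunk Type = 6 (ABORT), Reserved flags plus the T flag, Chunk Length, followed by the Error Causes.]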
As can be seen there, the abort procedure is really simple, but it still gives the receiver of the ABORT enough information to at least figure out the reason for its receipt. In any case, the abort procedure should rarely be used, and any peer wanting to tear down an association must always try the graceful shutdown mechanism explained in the next section first. Only when that procedure fails, or if the host has some internal problems, should the ABORT chunk be sent.
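The handling of OOTB datagrams described in this section can be summarized as a small decision function. This is only a sketch of the rules as stated in the text above (including the SHUTDOWN exception, which belongs to the design stage discussed here and not necessarily to the final specification); the names `Verdict` and `handle_ootb` are hypothetical, and chunk types are represented as plain strings.

```python
from typing import NamedTuple, Optional


class Verdict(NamedTuple):
    action: str                             # "establish", "discard" or "reply"
    chunk: Optional[str] = None             # chunk sent back when action == "reply"
    verification_tag: Optional[int] = None  # tag used in the reply's common header
    t_flag: bool = False                    # T (TCB Missing) flag of the ABORT chunk


def handle_ootb(chunk_type: str, received_tag: int) -> Verdict:
    """Decide what to do with a datagram that matches no open association."""
    if chunk_type in ("INIT", "COOKIE ECHO"):
        # Somebody is setting up a new association: normal establishment code.
        return Verdict("establish")
    if chunk_type == "ABORT":
        # Never answer an ABORT, otherwise two hosts could abort each other forever.
        return Verdict("discard")
    if chunk_type == "SHUTDOWN":
        # Answered with a SHUTDOWN ACK that reflects the received tag.
        return Verdict("reply", "SHUTDOWN ACK", received_tag)
    # Anything else: ABORT with the T flag and the Reverse Verification Tag.
    return Verdict("reply", "ABORT", received_tag, t_flag=True)


print(handle_ootb("DATA", 0x1234ABCD))
```

The important detail is that every reply reuses the tag of the incoming datagram, since without a TCB the host has no other tag to offer.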
7.3 I am done, could you finish as well? The shutdown procedure
We have already mentioned that the last versions of MDTP had a way to gracefully shut down an association. In the first version of SCTP the shape of the datagram changed considerably, and thus the shutdown procedure was also modified. In any case, the basis of the process remained the same: the closing side sends a SHUTDOWN chunk, which has to be answered by the peer with a SHUTDOWN ACK chunk once it has received the acknowledgement of all the data it sent.

However, when SCTP reached the final revision (or at least one of the first 6 final revisions), a problem related to the graceful closing of associations was highlighted. When a host that had sent the SHUTDOWN chunk at least twice received a SHUTDOWN ACK chunk with the Reverse Verification Tag, there was no way to distinguish between the following two situations. It could be that a previous SHUTDOWN chunk made it to the peer endpoint but the corresponding SHUTDOWN ACK with the right tag was lost. Or it could be that the peer endpoint simply restarted (possibly sending us an ABORT chunk that was lost) and so it directly replied to our SHUTDOWN chunk with a SHUTDOWN ACK chunk carrying the Reverse Verification Tag. The problem is that the SHUTDOWN chunk sender does not really know whether the SHUTDOWN ACK chunk was lost or the peer crashed (probably losing some data).

In TCP this situation is mitigated to some extent by the TIME-WAIT state. This state basically consists of keeping the information about a connection for some time (common implementation values are 30 seconds, 1 minute or 2 minutes [Ste1994]) after sending the final acknowledgement, just in case it is lost and we have to send another one later.

There was another issue: the difference between TCP's and SCTP's shutdown procedures. In TCP there is the concept of the half-closed connection 36. TCP treats every duplex connection as two simplex ones that must be closed independently. So you can tell the peer that you are done with your data and will not send anything else, thus closing your part of the connection, while the peer is still sending you data (so the overall connection is just half-closed).
This means that TCP's closing procedure is a four-way handshake, in which one of the peers sends a segment with the FIN (from Finalization) flag set, telling the other that it will not send any more data. Then it receives the acknowledgement of that segment (as a normal data acknowledgement, since the FIN occupies one byte in the sequence space), and finally the procedure is repeated in the other direction. Normally this procedure is shortened by also setting the FIN flag in the segment that acknowledges the first FIN, and so half-closed connections are not so common. However, half-closed connections are really useful for a commonly used application [Ste1994]: the Remote Shell (RSH). This application is used in the UNIX environment and executes a command on a remote system. For example, if we are on a host called helsinki and we type the command:
helsinki % rsh madrid sort < datafile
the sort command will be executed on the host madrid (which runs an rshd server), with standard input for the rsh command being read from the file named datafile. At that moment rsh creates a TCP connection between itself (on the helsinki host) and the program being executed on the madrid host (sort in this case). The rsh program copies standard input (datafile) to the connection, and then copies from the connection to standard output (our terminal). On the madrid host, the rshd server executes the sort command so that it takes its standard input from the TCP connection and writes its standard output to the same connection. But the sort program, like many other programs, cannot generate any output until all the input has been read (in other words, until the end-of-file in the input is reached). Therefore, the sort program will only start sending its results back to helsinki once we close the outgoing flow of data from helsinki to madrid (thus providing the required end-of-file mark). That is the reason why half-closed connections are sometimes valuable in TCP (the same result could have been obtained using two TCP connections, but using a single one with half-close is better).

36 Although it is mostly a matter of taste (is the bottle half-full or half-empty?), a half-closed connection is one in which only one direction of data flow has been closed, while a half-open association is one in which only one side of the connection thinks that it is open (see section 4.2). Sometimes the term half-open is used in both cases.

So, after some deliberation, the closing procedure was modified in the 11th version of the SCTP specification. Figure 7-2 shows the chunks that are involved in the whole procedure. As can be seen in the figure, the first two chunks were not modified: the SHUTDOWN chunk carrying the Cumulative TSN Ack, and its reply, the SHUTDOWN ACK chunk. However, a new chunk was added to the procedure, the SHUTDOWN COMPLETE chunk. The TCB is erased at the initiator side as soon as it receives the SHUTDOWN ACK, and the other side deletes its TCB when it receives the SHUTDOWN COMPLETE chunk.
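The half-close behaviour that the rsh and sort example relies on can be reproduced with the standard Python socket API. The sketch below is an illustration only: it uses a local `socket.socketpair()` as a stand-in for the TCP connection between helsinki and madrid (an assumption made to keep the example self-contained), and the function names are hypothetical. The call that matters is `shutdown(socket.SHUT_WR)`, which closes only the outgoing direction.

```python
import socket
import threading


def sort_server(conn: socket.socket) -> None:
    """Play the role of sort on madrid: read until end-of-file, then answer."""
    with conn:
        received = b""
        while True:
            block = conn.recv(4096)
            if not block:  # empty read: the peer closed its sending direction
                break
            received += block
        lines = sorted(received.splitlines())
        conn.sendall(b"\n".join(lines) + b"\n")


def remote_sort(data: bytes) -> bytes:
    """Play the role of rsh on helsinki: send, half-close, collect the reply."""
    client, server = socket.socketpair()
    worker = threading.Thread(target=sort_server, args=(server,))
    worker.start()
    with client:
        client.sendall(data)
        client.shutdown(socket.SHUT_WR)  # half-close: we will send nothing more
        reply = b""
        while True:
            block = client.recv(4096)  # the receiving direction is still open
            if not block:
                break
            reply += block
    worker.join()
    return reply


print(remote_sort(b"madrid\nhelsinki\nespoo\n").decode(), end="")
```

Without the `shutdown` call the server would never see the end-of-file and both sides would block forever, which is exactly the situation the text describes for sort.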
Figure 7-2: The shutdown procedure in SCTP
[Figure 7-2 layout. Each datagram starts with the common header (Source Port Number, Destination Port Number, Verification Tag, Checksum). SHUTDOWN: Verification Tag = Tag Z, Chunk Type = 7 (SHUTDOWN), Chunk Flags (Reserved), Chunk Length, Cumulative TSN Ack. SHUTDOWN ACK: Verification Tag = Tag A, Chunk Type = 8 (SHUTDOWN ACK), Chunk Flags (Reserved), Chunk Length. SHUTDOWN COMPLETE: Verification Tag = Tag Z, Chunk Type = 14 (SHUTDOWN COMPLETE), Reserved flags plus the T flag, Chunk Length.]

In addition, there was another modification. If a SHUTDOWN ACK chunk is received and there is no TCB belonging to that association (i.e., the SHUTDOWN ACK is an OOTB datagram), the receiver will still answer by sending back a SHUTDOWN COMPLETE chunk. But as the sender of the SHUTDOWN COMPLETE does not have any knowledge about the association, it uses the Reverse Verification Tag. Like the ABORT chunk, the SHUTDOWN COMPLETE also has a T flag, which must be set in these cases.

The problem of losing the SHUTDOWN COMPLETE chunk and leaving one side with the association open is still there. But now there is a difference. Even if we had to retransmit the SHUTDOWN ACK chunk and then received a SHUTDOWN COMPLETE with the T flag set, we know that the peer was done with its data, as it started the shutdown procedure by sending us the SHUTDOWN chunk. And we also know that the peer received all our data, since it had to acknowledge it before we were able to send the SHUTDOWN ACK chunk. So whether the peer restarted or not, the final result would have been the same.

However, the peer that sent the SHUTDOWN COMPLETE chunk cannot be sure that the other one received it and closed the association. So, why not add a fourth leg to the procedure, so that we can wait for the acknowledgement of the SHUTDOWN COMPLETE and close the association being sure that the peer did the same? Unfortunately that does not work either. There is a famous problem regarding this issue, called the two-army problem (see section 6.2.3 of [Tan1996]). Imagine that there is a Russian army in the middle of a valley, surrounded by two Finnish armies, one on each of the two hills beside the valley. Each of the two Finnish armies is smaller than the Russian army, so if either of them attacks alone, it will be defeated by the Russians. This situation is shown graphically in Figure 7-3.
Figure 7-3: The two-army problem
However, the two Finnish armies together are bigger than the Russian one. Therefore, the Finns will only be victorious if they attack the Russians simultaneously with their two armies. The point is that they have to agree on a date for the attack. In the very improbable case that none of the Finns in the two armies has a mobile phone or any other way of communicating with the other army, they must send one of their soldiers across the valley to pass the day of the attack on to the other army. This way, once both armies know the date, they can attack at the same moment and defeat the Russian army.

Let us imagine that the left Finnish army sends one of its men to the right side to tell them to attack on December the 6th. But what would happen if the soldier were captured on his way? Then the right army would not know the agreed date of the attack and would not move, and so the left army would be defeated. Considering this possibility, the left army will probably not attack either. To avoid this situation, they tell the soldier to ask the right army to send another soldier back, so that they can be sure that their soldier made it to the other side of the valley. But now the right army is in a similar situation: they know that the left army is willing to attack on December the 6th, but how can they be sure that their man will arrive safely at the left hill? As there is the possibility that the soldier is captured, they cannot take the risk of charging into battle, as the left army may not do so either. Let us improve the process by sending a third soldier from the left hill to tell the right army that their brave soldier arrived, so the left army knows that they know what will happen on December the 6th. But then, how will the left army know that the right army knows that the left army knows that the right army knows about the date? Adding a fourth trip will not help.

In fact it can easily be proven that there is no perfect way of doing this. Suppose that there is a perfect procedure; then the arrival of the last soldier is either necessary or not. If it is not necessary, do not send him, and check whether the previous soldier was necessary, and so on. This way we end up with a procedure in which the last soldier has to reach the other side of the valley or the whole procedure will fail. So what happens if that last soldier is captured? The other army will not attack. Consequently, the army that sent the soldier, knowing about this possibility, will not attack either. Even if the soldier got through, the army that received him would know that the other army cannot be sure that they know the date, so they will not attack.

If the soldiers are replaced by SCTP datagrams, the Finnish armies by two hosts with an SCTP association between them, and the valley with the Russian army by a lossy channel such as the Internet, we have exactly the same problem as when closing the association and trying to be sure that the other host also closed it. As having a half-open association is not as serious as being defeated in a war, a three-way handshake is usually good enough for our purposes.
With this modification the shutdown mechanism became a three-way handshake, closer to the one used in TCP (when the second and third legs are merged, as explained above). But the scheme is still asymmetrical, because one end forces the other to stop sending data: as soon as the SHUTDOWN chunk arrives, the upper user is told not to pass any new data to SCTP. There was a proposal to use one of the flags in the SHUTDOWN ACK either to mean simply that the SHUTDOWN chunk was received, or to indicate also that the host has sent all its data and is waiting for the SHUTDOWN COMPLETE. This would have kept TCP's semantics unmodified in SCTP. But this idea was finally discarded. After some discussion there was a consensus about not keeping TCP's half-close semantics. The reason is that there are badly behaved clients that never close their flow of data, so the TCP connection is never released at the server, which ends up flooded with open connections. To avoid this problem, some TCP implementations start a timer while in the FIN-WAIT-2 state (after receiving the acknowledgement of the first FIN segment sent). When that timer expires, they close the connection instead of waiting any longer for a segment with the FIN flag set.

Yet another issue related to this has recently appeared. The peer receiving the SHUTDOWN chunk will not send the SHUTDOWN ACK chunk until it has finished sending its data. Meanwhile it will keep sending its DATA chunks, and so the peer wanting to close will not terminate the association, as it is still receiving data. Therefore, an implementation could prevent an association from ever being closed simply by accepting new data from the upper user, or by sending duplicate DATA chunks. To avoid this, one of the future modifications to the SCTP specification that appears in [Ste2002b] is a guard shutdown timer that is started right after sending the first SHUTDOWN chunk. When that timer expires, we close the association regardless of whether the peer is still sending us data.
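The three-way shutdown described in this chapter can be traced with a toy state model. This is a sketch under strong simplifications, not an implementation: chunks are plain strings, a single SACK is assumed to acknowledge all outstanding data, retransmissions are left out, and the class and method names are hypothetical. The state names follow the ones used in the text.

```python
from typing import Optional


class Endpoint:
    """Toy model of one side of an SCTP association during shutdown."""

    def __init__(self, outstanding: int = 0) -> None:
        self.state = "ESTABLISHED"
        self.outstanding = outstanding  # DATA chunks the peer has not acknowledged
        self.has_tcb = True

    def _erase_tcb(self) -> None:
        self.state = "CLOSED"
        self.has_tcb = False

    def _ack_if_drained(self) -> Optional[str]:
        if self.outstanding == 0:
            self.state = "SHUTDOWN-ACK-SENT"
            return "SHUTDOWN ACK"
        return None  # keep sending DATA until everything is acknowledged

    def close(self) -> str:
        """Upper user asks for a graceful close; the guard timer starts here."""
        self.state = "SHUTDOWN-SENT"
        return "SHUTDOWN"

    def guard_timer_expired(self) -> None:
        """Stop waiting for the SHUTDOWN ACK even if DATA keeps arriving."""
        if self.state == "SHUTDOWN-SENT":
            self._erase_tcb()

    def receive(self, chunk: str) -> Optional[str]:
        """Process one incoming chunk; return the chunk to send back, if any."""
        if not self.has_tcb:
            # Out of the blue: a SHUTDOWN ACK is answered with the T flag set.
            return "SHUTDOWN COMPLETE (T)" if chunk == "SHUTDOWN ACK" else None
        if chunk == "SHUTDOWN":
            self.state = "SHUTDOWN-RECEIVED"  # no new user data is accepted
            return self._ack_if_drained()
        if chunk == "SACK" and self.state == "SHUTDOWN-RECEIVED":
            self.outstanding = 0  # simplification: one SACK covers all our data
            return self._ack_if_drained()
        if chunk == "SHUTDOWN ACK" and self.state == "SHUTDOWN-SENT":
            self._erase_tcb()
            return "SHUTDOWN COMPLETE"
        if chunk.startswith("SHUTDOWN COMPLETE") and self.state == "SHUTDOWN-ACK-SENT":
            self._erase_tcb()
        return None


initiator, peer = Endpoint(), Endpoint(outstanding=2)
first = peer.receive(initiator.close())  # None: the peer still has unacknowledged data
ack = peer.receive("SACK")               # the peer can now send its SHUTDOWN ACK
done = initiator.receive(ack)            # the initiator erases its TCB and answers
peer.receive(done)                       # the peer erases its TCB as well
print(initiator.state, peer.state)
```

Calling `guard_timer_expired()` on an initiator whose peer never drains its data closes that side unilaterally, which is the behaviour the guard timer is meant to provide.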
8. AND NOW? SCTP EXTENSIONS AND SCTP USERS
The Internet is a changing world. New technologies appear almost daily, and a good transport protocol should be able to adapt to new environments. TCP's possibilities for extension are limited to 6 reserved header bits (reduced to 4 by the ECN extension for TCP described in [Ram2001]) and 40 bytes for options. This is far from enough to make new versions of TCP that are backwards compatible and include the features needed in some fields. This was precisely one of the reasons why a new transport protocol for telephony signaling was designed instead of enhancing TCP. Avoiding this same limitation in the future was one of the design principles of SCTP.

The extensibility possibilities of SCTP are practically unlimited due to its internal structure, shown graphically in Figure 3-2. The only problem found with this architecture is related to its fixed common header. In the first versions of SCTP there were some reserved bits in the common header that could be used in the future to indicate special processing of the whole datagram, including the header itself. However, that reserved field disappeared when the checksum was enlarged from 16 to 32 bits. The designers felt the lack of spare bits in the common header right after the final publication of the SCTP specifications, during a long discussion about whether it was better to use a strong and expensive checksum or a weaker but cheaper one. It would have been easier to have several checksum schemes and a flag in the common header telling which one had been used. It is true that an equivalent result could have been achieved by negotiating the use of one checksum or another during the establishment phase, or simply by using a weak checksum and including a new chunk carrying the strong checksum when necessary, but these solutions would have been less efficient (more about the checksum problem in section 9.1).
Apart from this problem with the common header, we can say that the extensibility possibilities of SCTP are excellent, and there is even one section in RFC 2960 that deals precisely with this: how the protocol can be extended through IANA. IANA is in charge not only of the Chunk Types, Parameter Types and Error Cause Types, but also of port numbers and Payload Protocol Identifiers.

The RFC containing the SCTP specifications was released on October 31st, 2000. SCTP is not widely deployed yet and the existing implementations are still experimental (19 different SCTP implementations were tested in the third interoperability session, held in April 2001 in Nice, France). Nevertheless, there have already been quite a few attempts to extend its features. Some of them were made even before SCTP was completely finished. Also, some applications are interested in using SCTP as their transport protocol. In the following sections we will take a look at the main extensions to SCTP, and we will also briefly discuss some of the Internet-Drafts and RFCs that document applications that use SCTP.
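What makes it safe to deploy new chunk types is the way a receiver treats the ones it does not implement, using the report bit mentioned in section 6.2. In RFC 2960 the two highest-order bits of the Chunk Type encode that treatment; the following sketch decodes them. The function name and the returned strings are hypothetical and chosen only for illustration.

```python
from typing import Tuple


def unknown_chunk_action(chunk_type: int) -> Tuple[str, bool]:
    """Return (action, report) for a chunk type the receiver does not implement."""
    if not 0 <= chunk_type <= 255:
        raise ValueError("the chunk type is a single byte")
    # Highest bit: 0 = stop and discard the whole packet, 1 = skip this chunk only.
    action = "skip chunk" if chunk_type & 0x80 else "discard packet"
    # Next bit: 1 = report back with the Unrecognized Chunk Type error cause.
    report = bool(chunk_type & 0x40)
    return action, report


for value in (0x3F, 0x7F, 0xBF, 0xC1):
    print(hex(value), unknown_chunk_action(value))
```

A designer of an extension can therefore choose, just by picking the chunk type number, whether old implementations should ignore the new chunk silently or complain about it.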
And now? SCTP extensions and SCTP users 103
8.1 The SCTP extensions
RFC 2960 took about 27 months of work to write (counting from the first version of MDTP), which is quite a long time. The main problem was that many people wanted to modify many things, and there was always the desire to change the wording and to add more and more features to the basic SCTP specifications. The authors (mainly Randall R. Stewart and Qiaobing Xie) did an excellent job of reaching consensus while avoiding unnecessary changes that would have delayed the publication of the RFC even more. One of their best weapons against changes was precisely the ease with which SCTP can be extended. So the tactic was not to add to the basic specification any fancy feature that would be helpful only in restricted environments, but to write an extension document instead. As a result, the complete SCTP specifications will be spread across many RFCs, but that is better than not having any of them at all. The most important extensions to SCTP are described in the next subsections.
8.1.1 This is my new address: Adding and deleting addresses, and per stream flow control
One of the main features of SCTP is its ability to use several source and destination IP addresses in a single association. However, one of the biggest problems of this feature is that it is not flexible at all. The addresses to be used are negotiated during the association establishment and they cannot be changed afterwards unless the association is restarted (which is not a clean way to do it). There were several reasons why the ability to change the IP addresses in use was important. The first idea was simply to be able to plug or unplug the network cards of a host and add or delete the corresponding IP address to or from the association. This would not only help to remove a broken card on the fly and replace it with another one (with a different IP address from the old one), but would also provide the same type of service that exists in the SS7 world, where a link can be added to a link set without interfering with the operation of the link set. Another problem that this extension could solve is related to the renumbering feature of IPv6. In IPv6 a site may renumber all of its nodes, for example when it switches to a new network service provider. This already causes some problems to TCP connections, which must be terminated before the renumbering takes place (see section 4.1 of [Tho1998]). A TCP implementation can at most tell the upper-layer user that an address is about to change. Also, as the new address should be available in advance, most TCP connections should already be using the new address by the time the old one is released. But for long-lasting connections this does not help either. The Internet-Draft containing the extension to add or delete IP addresses is called (for obvious reasons) the AddIP draft [Ste2002c]. It has evolved a lot since its first release, which was published even before the SCTP specifications were ready as an RFC.
Figure 8-1 shows this evolution graphically. Initially the AddIP draft included two different extensions. One was obviously the possibility of adding and deleting our source IP addresses, but as the addition or deletion request had to be acknowledged, another new feature was added in the same draft. The authors of the extension thought that other extensions would also make requests that needed to be reliably acknowledged. Therefore they specified a general way to send parameters inside a new control chunk that had to be acknowledged (basically making use of serial numbers for both the chunk and the parameters), the Reliable Request Procedure. Moreover, this new feature was designed not to interfere with the congestion control mechanism defined in SCTP, so the reliable requests were treated as if they were DATA chunks from the congestion control point of view. At the same time, almost the same authors of the AddIP draft wrote another one, called the Srwnd draft (from Stream Receiver Window), which added the possibility of applying flow control on a per-stream basis and used the Reliable Request Procedure. As with multihoming, the avoidance of head-of-line blocking by using several streams is one of the basic features of SCTP (see section 5.3). But so far flow control is performed on both a per-association and a per-address basis (as explained in section 5.2). So there is still the possibility that a single stream uses all the resources, exhausting the buffer capacity of the receiver. Basically, this extension proposes dividing the Receiver Window space among the streams in use. As a single SCTP association is expected to carry the signaling data of several telephone calls, one per stream, this new extension was warmly welcomed as a very valuable one.
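The idea of dividing the Receiver Window among the streams can be illustrated with a small sketch. It is not taken from the Srwnd draft: the class name, the equal split of the window and the byte-based accounting are all assumptions made only for illustration.

```python
class PerStreamReceiverWindow:
    """Hypothetical receiver-side accounting: each stream gets an equal share."""

    def __init__(self, rwnd_bytes, stream_ids):
        stream_ids = list(stream_ids)
        share = rwnd_bytes // len(stream_ids)   # equal split is an assumption
        self.free = {sid: share for sid in stream_ids}

    def accept(self, stream_id, size):
        """Buffer `size` bytes for a stream if its own share still has room."""
        if size > self.free[stream_id]:
            return False                        # only this stream is blocked
        self.free[stream_id] -= size
        return True

    def deliver(self, stream_id, size):
        """The application read `size` bytes, so the stream's share is released."""
        self.free[stream_id] += size


window = PerStreamReceiverWindow(rwnd_bytes=8000, stream_ids=[0, 1])
greedy_first = window.accept(0, 4000)    # stream 0 fills its whole share
greedy_second = window.accept(0, 1)      # stream 0 is now refused
other_stream = window.accept(1, 1000)    # stream 1 still has room
```

With a single window per association, the first 8000 bytes could all have come from stream 0. With the split, a greedy stream can only exhaust its own share.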
Figure 8-1: Evolution of the AddIP draft
Draft information shown in Figure 8-1:

Draft Name        N. of Versions   Date First Vers.   Date Last Vers.
stewart-srwnd     2                Sep 11 2000        Nov 3 2000
stewart-addip     2                Sep 7 2000         Nov 15 2000
sigtran-srwnd     1                Jan 31 2001        Jan 31 2001
sigtran-relreq    2                Feb 2 2001         Feb 23 2001
sigtran-addip     2                Feb 2 2001         Feb 23 2001
tsvwg-addip       5                May 7 2001         Jan 29 2002

Figure legend: Draft Information; Draft Evolution; Upper Draft uses Lower Draft.

Over time, the AddIP draft was divided into two: the AddIP draft itself and the RelReq draft. This was quite a straightforward move, as the Reliable Request Procedure was a very general one that had nothing to do with its specific use for adding or deleting addresses. Apart from this change, another nice feature was added to the AddIP draft: the possibility of recommending to the peer which address should be its Primary Address, a valuable suggestion when the Primary Address is about to be deleted. In parallel, the Srwnd draft continued to evolve. In the following months, the RelReq draft was modified to better provide the functionality needed by both the AddIP and Srwnd drafts. Thus, the initial aim of having a very general way of reliably transferring control chunks was being lost. So, after some comments on the list and some IETF meetings, it was decided that the RelReq draft would be discarded and its functionality added to the AddIP draft, adapting it better to its needs. As the Srwnd draft also used the RelReq draft, it was merged with the new AddIP draft, and some modifications were made, such as the possibility of limiting the flow of a stream to a number of bytes as well as to a number of user messages. Over the whole life of the AddIP draft, 14 versions have been issued. The last AddIP draft written so far was published in November 2001. It is expected to become an RFC soon, and some SCTP implementations already include its functionality. It is expected to be tested for the first time in the next interoperability session, which will be held in San José, California, in March 2002.
8.1.2 Can I trust you? Reliable and unreliable streams
By definition, as it appears in its specifications, "SCTP is a reliable transport protocol". That means that the data sent to the peer using SCTP is guaranteed to reach its destination (unless the network or the hosts are not working at all), because data that is not acknowledged is retransmitted. When transporting telephony signaling this seems to be the right thing to do, but SCTP has a wider range of operation, and some applications do not really want this. Examples are receiving the broadcast of a radio station through the Internet after joining a multicast group, or using any application that transmits digitized speech over IP. In these cases, it is usually desirable not to retransmit lost packets and not to delay the transmission of new ones. The listener will then notice some cuts and interruptions when packets are lost. But the data can be consumed at the receiver at most as quickly as it is produced (if we are, for example, listening to uncompressed audio at a fixed rate of 64 kbps, there is no way we can listen to a minute of radio broadcast in less than one minute). Thus, retransmitting old packets while holding back the new ones that are being created will also cause interruptions, and the sending queue (and the receiver's buffer) will grow fuller and fuller. In some other applications, data simply expires. Retransmitting it when it is already stale not only makes subsequent data more likely to arrive late at the receiver (since its transmission must be delayed while the previous data is retransmitted), but also floods the network with useless packets that will be discarded when they reach their destination. However, data that arrives late or never arrives is not the only cause of problems. In our example of real-time voice over IP, it is also preferable to listen to a slightly corrupted transmission than not to hear anything at all.
If someone is listening to somebody speaking, a few corrupted bits will surely not make a big difference to the listener's ears, while interruptions are much worse. So discarding a datagram that arrived corrupted at its destination is not always the best option. UDP already solves these two problems to some extent. In UDP there are no acknowledgements, and thus no retransmissions at all, but that also means that there is no congestion control. Moreover, the checksum can be turned off by simply setting it to all zeros. This alleviates the second problem, but leaves unprotected not only the data carried inside the UDP datagram but also the whole UDP header. Thus UDP is not exactly the best possible solution. So, some of the authors of the SCTP specification started to write a draft called Usctp (from Unreliable SCTP) [Xie2001a]. There they defined a new parameter used to mark some outbound streams as unreliable, so that a DATA chunk sent on any of those streams would never be retransmitted (this concept later evolved into a limited number of retransmissions). Instead, when the retransmission timer expired, a special chunk was sent to advance the peer's Cumulative TSN Ack (note that a similar feature, the CANCEL chunk, was about to be included in the SCTP specifications and was finally discarded, as explained in section 5.6). This way, a single association could have some streams used to transmit unreliable data that is not likely to be retransmitted (for example real-time multimedia traffic), and some other streams used to transfer reliable information (such as data files). This draft started to be written even before the final specifications of SCTP were published as an RFC. This was the only new feature contained in the first version of the Usctp draft. In the next release, the problem of corrupted data stated above was also addressed by including a special kind of DATA chunk that was only partly covered by the Adler-32 Checksum. But this brought another problem: before checksumming an incoming datagram, one had to parse it to see whether any of those special DATA chunks was present, and then calculate the Adler-32 Checksum over the right bytes. So in a way the receiver was being too confident, assuming that the data inside the datagram was not corrupted before verifying it, and so, why calculate the checksum at all? After some discussion on the list about how to avoid this problem, the proposed solutions were just making things too complicated. So the feature of including data not covered by the checksum was finally dropped. The point was that the checksum verification procedure would consume too much processing time, and in any case one could simply accept SCTP datagrams with a wrong Adler-32 Checksum if desired (however, the advantage of the Usctp extension was that it would protect the headers). After further discussion at the IETF meeting, the whole draft was finally withdrawn after its 6th version and more than one year of work. Things were getting too hard while the advantages of having this draft were shrinking.
Some of the discussed problems were these [Xie2001b]:
- Having new functionality always makes things more complicated. Feedback received from SCTP application designers was that things were already fairly hard and that there could be interoperability problems if a transport service is too complex.
- There were already some limitations in the Usctp draft. For example, unordered DATA chunks could not be used, and unreliable data could not be fragmented. That made the whole draft less useful than expected.
- TCP is the basis of SCTP, so the designers were a little bit scared of taking SCTP so far away from it.
- All the data should get through, and the receiving application could use unordered DATA chunks to deal with datagrams that arrive too late.
- Canceling a piece of data sent but not yet acknowledged was quite a new feature, and the sender would not really know whether the receiver got the data or not. Even if people wanted this feature, more experience was needed.
- There should be many easier ways to send data unreliably than using SCTP. Most of its complexity comes from its reliability, so making things harder just to avoid using that feature does not make much sense.
There was a long discussion about whether the draft should be abandoned or not, but after a couple of weeks of mail exchange on the distribution list, it was accepted that the draft was no longer of interest, and so it expired without any new release. But this does not necessarily mean that SCTP will never have an unreliable data transfer mode. This was already the second attempt to include such functionality, and this possibility will surely be studied and developed further in the future.
8.1.3 Be ready to adapt to your environment: The adaptive Fast Retransmit algorithm
Fast retransmit is an algorithm, already discussed in section 5.2, that helps to avoid a retransmission time-out by quickly retransmitting a certain TSN when subsequent TSNs are arriving at the peer. It was mostly copied from [All1999] and modified to adapt it to SCTP's characteristics (mostly due to the existence of Gap Ack Blocks in the SACK chunks). It has proven to be a valuable algorithm that improves throughput. But SCTP has a nice feature that has not been used to make the fast retransmit algorithm even better: the Duplicate TSNs at the end of the SACK chunk. This field was included after a proposal made by an Internet congestion expert, but it is not presently used. It was left there to be used in the future, once some studies show how it could be used. According to [Ste2000], we must receive four consecutive SACK chunks reporting a TSN as missing before we fast retransmit that TSN. Why four and not another number? Simply because it seems to be a reasonable number that does not give the retransmission timer time to expire, but at the same time avoids unnecessary fast retransmissions. But different networks behave differently, and what seems to be a reasonable trade-off in one of them is not so in another. The reordering of packets in the network is one of the worst enemies of the fast retransmit algorithm. It can trigger unnecessary fast retransmissions, which not only waste network resources but also reduce the throughput, as already seen in Figure 5-4. And, contrary to what was generally thought, reordering is not such a rare event. The study in [Ben1999] shows that, under certain network load, more than 90% of TCP connections suffered from reordering. The receiver of a duplicate TSN is required to report it by including a Duplicate TSN inside a SACK chunk. That tells the data sender that it made an unnecessary retransmission.
So it could undo the last changes in the congestion avoidance variables (namely the values of the cwnd and ssthresh variables), which would bring the data sender back to the state previous to the retransmission. This was the basic idea that the author of this Master's Thesis, the main author of SCTP (Randall R. Stewart) and an expert in Internet congestion and co-author of TCP's congestion avoidance algorithms (Mark Allman) used in [Ari2001] to modify SCTP's current fast retransmit algorithm. The main procedure was to create, every time a fast retransmission was issued, a record containing the TSNs retransmitted and the cwnd and ssthresh values. If some time later the data sender receives a SACK chunk containing as Duplicate TSNs all the TSNs that were retransmitted, it means that the whole fast retransmission was unnecessary, and then ssthresh should be set to the old value of cwnd, so that SCTP can exponentially reach that value again in a few RTTs. That would undo the damage done to the sender's transmission capabilities, but that is not all. In SCTP, exactly four SACK chunks reporting a TSN as missing are needed before a fast retransmission is issued. If, thanks to the stated algorithm, we notice that some of the retransmissions are spurious because of reordering in the network, that number of SACK chunks could be increased. On the other hand, if the retransmission timer expires and some of the TSNs to be retransmitted had already been reported as missing several times, it might well mean that our fast retransmission threshold is too high, and so it should be decreased. In this way the fast retransmission algorithm becomes adaptive. Finally, the authors decided that some real testing was needed before the draft could be published, to prove that the whole algorithm really worked. What is more, the draft actually covered two different problems. One of them was what to do when realizing that a spurious fast retransmission had been issued, and the other was how to know that the retransmission was bogus. The receipt of the Duplicate TSNs was a very neat way to know it, but not the only one. For example, a surprisingly quick acknowledgement of a retransmitted TSN might also mean that the SACK was sent due to the receipt of a previous copy, and so the retransmission was unnecessary anyway. Even a specialized mechanism could be created, such as an extension to include time stamps that would tell the data sender whether a SACK was triggered by the last transmission of a TSN or by a previous one. Finally the document was divided into two. [Bla2001a] contains the algorithm to detect spurious retransmissions, either by inspecting the Duplicate TSNs in SCTP or by using the TCP extension to report the receipt of duplicate data segments documented in [Flo2000]. The other document, [Bla2001b], discusses the algorithm that both restores the congestion control state previous to the fast retransmission (modifying the values of cwnd and ssthresh) and modifies the fast retransmit threshold. It also allows some delay to be introduced before the fast retransmission is made, instead of making it right after the fast retransmit threshold is reached (if we realize that we have made an unnecessary retransmission).
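The adaptive behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the algorithm of [Ari2001], [Bla2001a] or [Bla2001b]: the class and method names, the assumed MTU, the step of one unit, the lower bound for the threshold and the reading of "several times" as two or more reports are all assumptions.

```python
from dataclasses import dataclass, field

MTU = 1500            # assumed path MTU in bytes
MIN_THRESHOLD = 2     # hypothetical lower bound for the adaptive threshold


@dataclass
class AdaptiveFastRetransmit:
    cwnd: int
    ssthresh: int
    threshold: int = 4                      # missing reports required by RFC 2960
    records: list = field(default_factory=list)

    def on_fast_retransmit(self, tsns):
        # Remember the state previous to the retransmission.
        self.records.append((frozenset(tsns), self.cwnd, self.ssthresh))
        # Usual reaction of RFC 2960 to a detected loss.
        self.ssthresh = max(self.cwnd // 2, 2 * MTU)
        self.cwnd = self.ssthresh

    def on_sack(self, duplicate_tsns):
        duplicates = set(duplicate_tsns)
        for record in list(self.records):
            tsns, old_cwnd, _old_ssthresh = record
            if tsns <= duplicates:          # every retransmitted TSN arrived twice
                self.ssthresh = old_cwnd    # slow start can climb back to the old cwnd
                self.threshold += 1         # tolerate more reordering from now on
                self.records.remove(record)

    def on_timeout(self, missing_reports):
        # missing_reports maps each timed-out TSN to the number of SACK chunks
        # that had already reported it as missing.
        if any(count >= 2 for count in missing_reports.values()):
            self.threshold = max(self.threshold - 1, MIN_THRESHOLD)


sender = AdaptiveFastRetransmit(cwnd=12000, ssthresh=24000)
sender.on_fast_retransmit([10, 11])
state_after_loss = (sender.cwnd, sender.ssthresh)
sender.on_sack(duplicate_tsns=[10, 11])     # both retransmissions were spurious
```

After the spurious retransmission is detected, ssthresh holds the cwnd value saved in the record and the threshold has grown from four to five. A later time-out on a TSN already reported as missing would bring it back down.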
8.2 Is anybody using SCTP? Some applications that use SCTP
SCTP was born about one year ago and it is not widely known yet. Nevertheless, some applications already use SCTP as their transport protocol. Most of them are, however, new protocols related to telephony signaling transport, which was the field for which SCTP was initially designed. Let us first discuss those adaptation protocols. To make SS7 signaling transport over IP networks possible, an SS7-IP gateway must provide the means for translating SS7 messages into IP datagrams, and vice versa. However, that translation can be done at several layers. Even though there is no need to provide translation at all levels of the SS7 stack, authors are writing adaptation modules that can translate SS7 signaling at the SCCP level, as well as at the MTP3 and MTP2 levels (there are even two proposals for MTP2). The SCCP-User Adaptation Layer (SUA) [Lou2002] is a protocol designed to transport any SCCP-User signaling (such as TCAP) over IP using SCTP, in a seamless way. SUA can be used between a Signaling Gateway (SG) and an IP signaling endpoint (a Service Switching Point (SSP) or Service Control Point (SCP)), but can also transport SCCP user information directly between IP endpoints rather than through an SG. The SG is needed only to assure interoperability with SS7 signaling in the switched-circuit network. SUA is able to support both SCCP unordered and in-sequence connectionless services, as well as bi-directional connection-oriented services, either with or without flow control and detection of message loss and out-of-sequence errors (i.e., SCCP protocol classes 0 through 3). As seen in Figure 2-3, there is an interface defined between ISUP and SCCP. However, it has not been implemented yet, and thus SUA will not be able to carry ISUP messages until that interface becomes available. The first release of SUA was submitted in March 2000, and after 11 versions it is expected to become an RFC soon.
The MTP3-User Adaptation Layer (M3UA) [Sid2002] works at a lower layer than SUA. It directly replaces MTP3, and it supports the transfer of all SS7 MTP3-User Part messages, such as ISUP or SCCP, over IP using SCTP.
M3UA can be used between an SG and a Media Gateway Controller (MGC) or IP telephony database. M3UA extends access to MTP3 services at the SG to remote IP endpoints. If the IP endpoint is connected to several SGs, the M3UA layer at the IP endpoint keeps track of the status of the configured SS7 destinations and routes messages depending on the availability and congestion status of the routes to these destinations via each SG. M3UA accommodates blocks larger than the 272-byte limit of MTP2, without the need for segmentation and re-assembly at the upper layer. At the SG, the M3UA layer provides interworking with MTP3 management functions to support seamless operation of signaling between the SS7 and IP networks. The M3UA layer at an IP endpoint keeps the state of the routes to remote SS7 destinations and may request the state of remote SS7 destinations from the M3UA layer at the SG. The M3UA layer at an IP endpoint may also indicate to the signaling gateway that M3UA at an IP endpoint is congested. The definition of M3UA started more than two and a half years ago and, like SUA, it is expected to become an RFC soon.
Figure 8-2: SS7-IP adaptation layers
[Figure 8-2 shows four protocol-stack panels for the nodes STP, SG and SCP or MGC: (a) Adaptation with SUA, (b) Adaptation with M3UA, (c) Adaptation with M2UA, (d) Adaptation with M2PA. The stacks combine MTP1, MTP2, MTP3, SCCP and TCAP on the SS7 side with IP, SCTP and the corresponding adaptation layer on the IP side; a NIF appears at the SG in panels (a) and (c).]

At the MTP2 level we have two different protocols that translate SS7 into IP. One of them is the MTP2-User Adaptation Layer (M2UA) [Mor2002], and the other is the MTP2-User Peer-to-Peer Adaptation Layer (M2PA) [Geo2001]. They both replace the MTP2 protocol, adapting the MTP3 protocol to the SCTP/IP stack. M2UA provides its users with functionality equivalent to what MTP2 provides to MTP3. It is used between an SG and an MGC. The SG keeps the availability state of all MGCs to manage signaling traffic flows across active SCTP associations. M2PA also provides the same functionality as MTP2. However, unlike M2UA, M2PA supports complete MTP3 message handling and network management between any two SS7 nodes communicating over an IP network. IP signaling points work as normal SS7 nodes, using the IP network instead of the SS7 network. Every IP signaling point has an SS7 point code, and thus they are SS7 nodes. M2PA makes the integration of SS7 and IP networks easier by allowing nodes in the SS7 networks to access IP telephony databases and other nodes in IP networks making use of SS7 signaling. In turn, M2PA makes it possible for IP telephony applications to access SS7 databases. Both M2UA and M2PA are still Internet-Drafts and, like SUA and M3UA, they are expected to become RFCs within the next few months. The differences among these four adaptation layers can be seen in Figure 8-2. The figure represents the case in which an SG connects one STP in the SS7 network with an SCP that is located in an IP network, and shows the protocol stack used when the STP sends TCAP queries to the database. In case (c), we observe a new protocol layer that we have not mentioned yet, the Nodal Interworking Function (NIF). Basically, the NIF serves as an interface between MTP2 and M2UA within the SG. In [Mor2001] another adaptation layer is specified, the ISDN Q.921-User Adaptation Layer (IUA). The ITU-T recommendation Q.921 defines the data link level protocol used in ISDN signaling, also known as the Link Access Procedures on the D-channel (LAPD).
IUA replaces Q.921 and uses SCTP as the transport layer, providing transparent adaptation to Q.921 users, such as Q.931. However, SCTP has also been proposed as the transport protocol for protocols not related to telephony signaling. There are proposals to run SIP and SDP over SCTP ([Ros2001b] and [Fai2001] respectively). [Jun2001] describes the usage of the Transport Layer Security (TLS) [Die1999] protocol over SCTP. The TLS protocol provides communications privacy over the Internet and allows client/server applications to communicate in a way that is designed to prevent eavesdropping, tampering, or message forgery.
9. CHANGES TO BE MADE IN RFC 2960
The design of SCTP was carried out taking great care over every change, listening to every dissenting voice and consulting specialists in fields such as congestion avoidance algorithms and Internet security, and even the creators of IPv6, so that there would be no unexpected problems in the future. In addition, two interoperability sessions were organized before the publication of RFC 2960, to ensure empirically that there were no major hidden problems in the specifications of SCTP. Those test sessions revealed some weak points of SCTP, which were modified. However, all the care taken was not enough: after the publication of the SCTP specification as an RFC, another interoperability session was organized and some more errors were found. Simple debate on the distribution list also raised some other issues. All those defects of an editorial or technical nature that appear in RFC 2960 are documented in several Internet-Drafts. All the drafts documenting the changes to be made to the present specifications of SCTP will be merged with RFC 2960 itself to produce a new, modified RFC in the future, as happened for example with the specifications of IPv6. In the next sections we will discuss those changes.
9.1 The checksum dilemma
The history of the checksum that appears in the common header of every SCTP datagram has been quite eventful. We can distinguish several stages in this evolution. At the very beginning of the SCTP design, no checksum was used at all. Then, weak checksums such as the ones used in IP and TCP were used. Later on, when the designers started to realize that SCTP would not be confined to the SS7 networks, they looked for stronger data integrity protection. In the end the Adler-32 Checksum was chosen, which eventually proved to be weaker than expected. Several months after RFC 2960 was published, the designers decided to modify the checksum scheme. This is the biggest modification that will be made to the SCTP specifications and, unlike all the others, this change is not backwards compatible. In the next sections we will follow the steps taken in the selection of the different checksum mechanisms used during the design of SCTP, and the reasons behind those changes.
9.1.1 The good old days: Letting others protect the data integrity
Initially, MDTP datagrams did not carry any kind of checksum, as shown in Figure 3-1. As with many other initial features of MDTP, the reason was that the designers were designing for an ideal world, as SS7 networks are meant to be. The detection of corrupted data was delegated to the communication links and platforms that carried the MDTP packets. Some time later it was noticed that this approach to the problem was not the right one. As telephony networks go digital (the local loop, however, is still mostly an analog twisted copper pair), corruption of packets due to noisy channels is infrequent. Much of the corruption happens not during data transmission, but during buffering in switches, when data is copied. This kind of data corruption at the network layer cannot be detected at lower layers, so some kind of protection against it must be supplied at a higher level. At the beginning, MDTP frames were supposed to be carried inside UDP datagrams, which protect the data with the so-called TCP Checksum [Pos1980], also known as the Internet Checksum. The TCP Checksum is simply the 16-bit one's complement of the one's complement sum of a pseudo header containing information from the IP header (the source address, the destination address, the protocol, and the UDP length), the UDP header, and the data, padded with zero octets at the end (if necessary) to make a multiple of two octets. This sum is used by IP (without any kind of pseudo header in its calculation), UDP and TCP. It catches any 1-bit error in the data, and over uniformly distributed data values it is expected to miss other kinds of errors at a rate proportional to 1 in $2^{16}$. However, it has two major limitations [Pat1995]: the sum of a series of 16-bit values is the same regardless of the order in which the values appear, and the value of the checksum is unaffected by the addition or deletion of zeros. The TCP Checksum is used in the Internet because it offers a good trade-off between performance and error detection capabilities, but in the SS7 world there is a need for stronger protection against corrupted messages. MTP calls for less than one undetected error in every $10^9$ received packets (see footnote 37), and the TCP Checksum was not sufficient to meet this requirement. Some studies published in section 3.3 of [Pax1997] show that, on average, one in every 5,000 TCP packets arrives corrupted at the destination.
This high level of corruption in TCP packets is mostly due to router bugs rather than to problems in the transmission lines. As a 16-bit checksum is expected to miss 1 out of every 65,536 packets received with errors, the final result is that on average about 1 packet out of $3 \cdot 10^8$ arrives corrupted and is accepted as a valid one. So a different scheme had to be used to meet the SS7 requirements.
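The two limitations of the TCP Checksum and the arithmetic behind the last figure can be checked with a short sketch. The function below sums 16-bit words with end-around carry over a plain byte string. The pseudo header is left out, and the function name and sample bytes are arbitrary choices made for illustration.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                     # pad to a multiple of two octets
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # end-around carry
    return ~total & 0xFFFF


original = bytes.fromhex("1234abcd0f0f")
reordered = bytes.fromhex("abcd0f0f1234")          # same 16-bit words, other order
with_zeros = original + b"\x00\x00"                # an extra zero word appended
single_bit_error = bytes.fromhex("1235abcd0f0f")   # one bit flipped

same_after_reorder = internet_checksum(original) == internet_checksum(reordered)
same_after_zeros = internet_checksum(original) == internet_checksum(with_zeros)
detects_bit_flip = internet_checksum(original) != internet_checksum(single_bit_error)

# One corrupted packet in 5,000, of which a 16-bit checksum misses one in 65,536.
packets_per_undetected_error = 5000 * 2**16        # 327,680,000, about 3 * 10**8
```

The reordered and zero-extended strings produce the same checksum as the original, while the single flipped bit is caught. The product is roughly a third of the $10^9$ packets that the ANSI limit requires.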
9.1.2 The quest for a stronger scheme: The Cyclic Redundancy Check
As the TCP Checksum could not meet the SS7 requirements, it was proposed to make the use of the IPsec Authentication Header (AH) [Ken1998b] compulsory in the IP packets carrying MDTP datagrams. As the AH includes a strong error check (an Integrity Check Value (ICV) using by default HMAC with either MD5 or SHA-1, as defined in [Mad1998a] and [Mad1998b] respectively), its use would reduce the number of undetected errors. But AH was not a cheap solution in terms of the time it takes to calculate. Moreover, the multihoming capabilities of MDTP would make things even worse, since the keys used are valid only for a given pair of source and destination IP addresses. So the use of AH was never recommended. Some other solutions were examined, and finally it was decided to include a Cyclic Redundancy Check (CRC) of 16 bits to protect data from corruption. After some debate, it was decided that the checksum would protect both the data and the MDTP header, and that it would be calculated over the MDTP datagram itself, without any pseudo header containing IP parameters such as the one used for both TCP and UDP. Although the CRC was also a 16-bit checksum, the main difference between it and the TCP Checksum is that, due to the way the CRC is calculated, it provides specific protection against some usual errors.

Footnote 37: The ANSI specification of SS7 allows at most one undetected error in every $10^9$ received packets. The ITU-T limit is more restrictive, calling for less than one undetected error in every $10^{10}$ packets.

Section 3.2.2 of [Tan1996] discusses the internal structure of CRC, and we will also show its basic properties here. A CRC is a Polynomial Code. A Polynomial Code treats a bit string as a representation of a polynomial whose only coefficients are 0 and 1. Therefore, a bit string of length k represents the degree k-1 polynomial $b_{k-1}x^{k-1} + b_{k-2}x^{k-2} + \dots + b_1 x + b_0$, where $b_n$ represents the value (1 or 0) of the bit in position n of the bit string. As an example, the 8-bit string 11010011 represents the degree 7 polynomial $x^7 + x^6 + x^4 + x^1 + 1$. As the values of the coefficients can only be 1 or 0, the polynomial arithmetic is done modulo 2, according to the rules of algebraic field theory. So, in modulo 2, subtractions and additions are both equivalent to the EXCLUSIVE OR logic operator, without carries for addition or borrows for subtraction. A long division is carried out the same way as it is in binary, except that the subtraction is done modulo 2. If a Polynomial Code is used, the sender and the receiver agree in advance on the use of a Generator Polynomial, G(x), of degree g (thus represented by a string containing g + 1 bits). For that polynomial, both the high and low order bits must be 1. The whole idea is that, if we have a datagram of m + 1 bits (m must be bigger than g) that represents the polynomial M(x) of degree m, we append at the end of M(x) a g-bit checksum so that the whole datagram including the checksum is divisible by G(x). The algorithm for computing the checksum is as follows:
1. Append g zero bits to the low-order end of the datagram, so that it now contains m + g + 1 bits and corresponds to the degree m + g polynomial $x^g M(x)$.
2. Divide the polynomial $x^g M(x)$ by G(x) using modulo 2 division.
3. Subtract the remainder of that division (which is always g or fewer bits) from $x^g M(x)$ using modulo 2 subtraction. The result is the checksummed frame that will be transmitted. We will call C(x) the polynomial it represents, which is divisible by G(x).
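The three steps above can be sketched with integers standing in for bit strings. The generator $x^4 + x + 1$ and the 10-bit message are small illustrative values chosen for this sketch, not the ones used by MDTP or SCTP, and the function names are hypothetical.

```python
def mod2_remainder(value: int, nbits: int, gen: int) -> int:
    """Remainder of the nbits-bit polynomial `value` divided by `gen`, modulo 2."""
    g = gen.bit_length() - 1                      # degree of the generator
    rem = 0
    for pos in range(nbits - 1, -1, -1):
        rem = (rem << 1) | ((value >> pos) & 1)   # bring down the next bit
        if rem >> g:                              # leading term present
            rem ^= gen                            # modulo 2 subtraction is XOR
    return rem


def checksummed_frame(message: int, mbits: int, gen: int) -> int:
    g = gen.bit_length() - 1
    shifted = message << g                        # step 1: x^g * M(x)
    rem = mod2_remainder(shifted, mbits + g, gen) # step 2: divide by G(x)
    return shifted ^ rem                          # step 3: subtract the remainder


G = 0b10011                  # x^4 + x + 1: high and low order bits are both 1
M = 0b1101011011             # a 10-bit message
C = checksummed_frame(M, 10, G)

receiver_remainder = mod2_remainder(C, 14, G)               # intact frame
corrupted_remainder = mod2_remainder(C ^ (1 << 5), 14, G)   # one flipped bit
```

With these values the remainder appended to the message is 1110, the receiver obtains a zero remainder for the intact frame, and flipping any single bit leaves a non-zero remainder.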
When the receiver gets C(x), it only has to perform the division using the same G(x); if the remainder is not 0, the received frame is corrupted. One can show mathematically the kinds of errors this checksum can identify. If the receiver, instead of C(x), receives a frame with errors, that frame can be represented as C(x) + E(x). Due to the use of modulo 2 algebra, the coefficients of E(x) with a non-zero value represent the bits that arrived corrupted. So when the receiver performs the division, the remainder of [C(x) + E(x)] / G(x) equals the remainder of E(x) / G(x), since the remainder of C(x) / G(x) is 0. Only the errors that correspond to a multiple of G(x), for which E(x) / G(x) leaves no remainder, go undetected; all the rest are caught. If E(x) is composed of a single term, $E(x) = x^i$, where i is the position of the errored bit, then a G(x) with more than one term will never divide E(x), so all single-bit errors are detected. If there have been two isolated errors, then $E(x) = x^i + x^j$ with i > j, which can be written as $E(x) = x^j (x^{i-j} + 1)$. As the coefficient of $x^0$ in G(x) is 1 (its low-order bit must be 1 by definition), G(x) is not divisible by x and shares no factor with $x^j$, so G(x) will not divide E(x) (and will therefore discover the double error) unless it divides $(x^{i-j} + 1)$. There are simple low-degree polynomials, such as $x^{15} + x^{14} + 1$, that will not divide any term of the form $(x^k + 1)$ for any k below 32,768, thus giving protection against any double error in frames shorter than 32,768 bits (4 Kbytes). This is a major improvement over the TCP Checksum, which can miss double-bit errors when the errored bits are separated by a multiple of 16 bits.
Another interesting mathematical property is that, in the modulo 2 system, no polynomial with an odd number of terms has (x + 1) as a factor. This is easy to show: if E(x) contained (x + 1) as a factor, it could be expressed as E(x) = (x + 1)F(x), and then, as 1 + 1 = 0 modulo 2, evaluating at x = 1 would give E(1) = (1 + 1)F(1) = 0 · F(1) = 0. This is not compatible with having an odd number of terms, since substituting 1 for x yields a modulo 2 sum of an odd number of 1 terms, which is 1 and not 0. This means that if G(x) contains (x + 1) as a factor, it will detect all errors that flip an odd number of bits. Finally, a Polynomial Code with a generator polynomial G(x) of degree g will detect all burst errors (footnote 38) of length g or less. A burst error of length k is represented as $E(x) = x^i (x^{k-1} + \dots + 1)$, where i determines how far from the right-hand end of the received frame the burst is located. If G(x) contains an $x^0$ term, it will not have $x^i$ as a factor, so if the degree of the parenthesized expression is less than the degree of G(x), the remainder can never be zero. If the burst length is g + 1, the remainder of the division by G(x) will be zero if and only if the burst is identical to G(x). By the definition of a burst error and of G(x), the first and last bits are 1 in both cases, so the burst matches G(x) only if the g − 1 intermediate bits match. If all combinations are regarded as equally likely, the probability of such an incorrect frame being accepted as valid is $(1/2)^{g-1}$. For any burst error longer than g + 1 bits, the probability of a bad frame getting through unnoticed is $(1/2)^g$, assuming that all bit patterns are equally likely. Therefore, due to its much better capabilities of catching transmission errors, the designers decided to include a 16-bit CRC checksum in the header of MDTP.
As the generator polynomial they chose the one standardized by the ITU-T, $x^{16} + x^{12} + x^5 + 1$, called CRC-CCITT, which can be found in section 8.1.1.6.1 of [ITU1996]. It contains (x + 1) as a prime factor, and so it catches all single and double errors, all errors with an odd number of bits, all burst errors of length 16 or less, 99.9969% of 17-bit error bursts, and 99.9985% of 18-bit and longer bursts. Some people were against this decision, claiming that the CRC-CCITT takes too long to calculate compared to the TCP Checksum. It is true that, in software, the calculation of the TCP Checksum is faster. This is because processors are efficient at additions, and because the implementation of the TCP Checksum has been enhanced in many ways during its long life, for example by the use of incremental updates when it is used in IP, as shown in [Rij1994]. However, in hardware a CRC can be implemented as a simple shift register circuit with some XOR gates. This makes it much faster than any implementation of the TCP Checksum (hardware implementation of the TCP Checksum has also been studied, for example in [Tou1996]). In practice this circuit is almost always present (especially for the calculation of the CRC-32 in Ethernet and Token Ring LANs), and for CRC-CCITT it looks like the one in Figure 9-1:
[Figure: shift register with cells X15 down to X0; the input is labelled "Message to be Checksummed"]
Figure 9-1: Hardware implementation of CRC-CCITT
38 A burst error of length k is an error in which all the errored bits are contained in a fragment of the original frame of at most k bits long, being the first and the last bits of that substring errors, irrespective of the value of the rest of the bits of the substring between the first and last one.
Of course, the circuit shown in Figure 9-1 is a simplified one, but it is still valid. The bits (what we noted as $x^g M(x)$, with g equal to 16) enter from the right side, and every time a new bit enters the circuit, the shift registers move the bits they contain to the left and take in a new one from the right (the one generated by the XOR gate). When the last bit of the message has entered the circuit, the value of the bits in the registers is the checksum. When sending a message, that value should be subtracted (modulo 2) from $x^g M(x)$, so that when the receiver applies this same algorithm to the received checksummed message, all the bits in the registers are 0 at the end of the process. With this method, any leading 0 bit would leave the whole circuit in the same state (all zeros), so a message whose leading zeros have been truncated would still have a valid checksum. For this reason, the CRC-CCITT sets all the bits of the registers to 1 before starting the process. The software implementation is not that simple. However, there are quite a few enhancements that make things less painful. A table of 256 two-byte words allows the CRC to be calculated one byte at a time (instead of one bit at a time), which makes the whole calculation much faster. Bigger tables (one of 65,536 elements, for example) are not normally used, because then the cache memory does not help that much and the whole process is actually slower. [Wil1993] contains a very complete discussion of high-performance implementations of different CRCs. In any case, especially for small devices, the CRC calculation could be very costly, so it was made optional by means of a flag in the MDTP header. When that flag was set, it meant that the datagram was protected by the CRC-CCITT. This mechanism was also used during the first five versions of SCTP.
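The byte-at-a-time table technique mentioned above might look like the following sketch. It assumes most-significant-bit-first processing with the registers preset to all ones and no final inversion; these conventions, like the names `build_table` and `crc_ccitt`, are assumptions of the sketch rather than details confirmed by the MDTP drafts.

```python
POLY = 0x1021   # x^16 + x^12 + x^5 + 1, with the x^16 term left implicit


def build_table() -> list[int]:
    """Precompute the effect of each possible byte on an empty register."""
    table = []
    for byte in range(256):
        reg = byte << 8
        for _ in range(8):
            reg = ((reg << 1) ^ POLY) if reg & 0x8000 else (reg << 1)
            reg &= 0xFFFF
        table.append(reg)
    return table


TABLE = build_table()   # 256 words of 2 bytes each


def crc_ccitt(data: bytes, init: int = 0xFFFF) -> int:
    reg = init                                   # registers preset to all ones
    for byte in data:
        reg = ((reg << 8) & 0xFFFF) ^ TABLE[(reg >> 8) ^ byte]
    return reg


msg = b"123456789"
crc = crc_ccitt(msg)
residue = crc_ccitt(msg + crc.to_bytes(2, "big"))   # receiver side: expect 0

# With the registers preset to zero, leading zero bytes go unnoticed.
zero_preset_blind = crc_ccitt(b"\x00\x00" + msg, 0) == crc_ccitt(msg, 0)
ones_preset_blind = crc_ccitt(b"\x00\x00" + msg) == crc_ccitt(msg)
```

Appending the computed value to the message and running the same routine over the result leaves the registers at zero, which mirrors the receiver-side check described for the hardware circuit. The last two lines illustrate why the registers are preset to ones.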
9.1.3 From a 16-bit to a 32-bit checksum
The CRC-CCITT, with its 16 bits and its error detection capabilities, is usually limited to messages of less than 4 Kbytes (there are enough ways to corrupt messages larger than 4 Kbytes that catching only 99.9985% of them is considered inadequate). Initially, SCTP was designed for telephony signaling transport, where the messages are usually a few hundred bytes long, so 4 Kbytes was not a big limitation. Nevertheless, there was an increasing feeling that SCTP would be capable of broader applications. Therefore, the designers started to look for a 32-bit checksum, and several were proposed. Firstly, there was the idea of simply extending the TCP Checksum to 32 bits (TCP-32 Checksum). The problem was that it would keep the same kind of problems as the TCP Checksum. Maybe the overall rate of undetected errors would go from 1 in $2^{16}$ to 1 in $2^{32}$, but such a checksum had never been tested, and this idea was quickly discarded. Some people suggested the use of the Fletcher Checksum. This checksum was first published in 1982, and it has been studied for a long time. It was even proposed for TCP in [Zwe1990] by means of a TCP option, but in practice TCP has never used any checksum other than the TCP Checksum. There are two flavors of the Fletcher Checksum: the 8-bit Fletcher Checksum (which results in a 16-bit checksum) and the 16-bit Fletcher Checksum (which in turn gives a 32-bit checksum). The second one was considered as a possible candidate. Basically, the 16-bit Fletcher Checksum treats the message to be checksummed as a list of 16-bit fragments, from F[1] to F[n], and uses two 16-bit accumulators, A and B, initially set to zero. The main loop calculates (with i ranging from 1 to n):
A = A + F[i]
B = B + A
The additions are performed in 16-bit one's complement arithmetic (although some versions use two's complement instead). So, at the end of the loop, A contains the 16-bit one's complement sum of all the 16-bit fragments of the message, and B contains $n F[1] + (n-1) F[2] + \dots + F[n]$. The two 16-bit accumulators are joined at the end of the process to form the checksum (A × 65,536 + B). When performed in two's complement, the 16-bit Fletcher Checksum detects all single bit errors, a single error of less than 16 bits in length, and all double bit errors separated by 16 bits or less [Pat1995]. Unlike in the TCP Checksum, in the 16-bit Fletcher Checksum the order in which the bytes appear in the original message is reflected in the value of the checksum, so if a message is corrupted in a way that reorders some bytes, the checksum should catch the error. The major known failing of the checksum is that it is unaffected by zeros being added to or deleted from the front of a packet [Pat1995]. Still, the Fletcher algorithm was stronger than the TCP Checksum, and it had been studied for almost two decades (and used in the ITU-T X.224 / ISO 8073 standard), so in the end a variation of it was chosen, the so-called Adler-32 Checksum [Deu1996]. It is an extension and improvement of the 16-bit Fletcher Checksum. The differences are that the A and B accumulators are still 16 bits long, but the F[n] fragments are only 8 bits long. Moreover, the additions are done without any kind of one's or two's complement, modulo 65,521 (the biggest prime number smaller than 65,536). The A accumulator is initially set to 1 (thus avoiding the problem of leading zeros being added or deleted, stated above) and B is set to 0. Finally, the result is stored in 32 bits as B × 65,536 + A.
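Both loops are short enough to write down directly. The sketch below models the one's complement additions of the Fletcher variant as sums modulo 65,535, which is an assumption of this illustration, and compares the Adler-32 loop against the `zlib.adler32` function from the Python standard library.

```python
import zlib


def fletcher_16bit(fragments: list[int]) -> int:
    """16-bit Fletcher Checksum over a list of 16-bit fragments."""
    a = b = 0
    for f in fragments:
        a = (a + f) % 65535      # one's complement sum modelled as mod 2^16 - 1
        b = (b + a) % 65535
    return a * 65536 + b         # joined as A x 65,536 + B, as in the text


def adler32(data: bytes) -> int:
    """Adler-32: 8-bit fragments, sums modulo 65,521, A starts at 1."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % 65521
        b = (b + a) % 65521
    return b * 65536 + a         # stored as B x 65,536 + A


order_matters = fletcher_16bit([1, 2]) != fletcher_16bit([2, 1])
fletcher_blind_to_leading_zeros = fletcher_16bit([0, 0, 1, 2]) == fletcher_16bit([1, 2])
adler_sees_leading_zeros = adler32(b"\x00ab") != adler32(b"ab")
matches_zlib = adler32(b"Wikipedia") == zlib.adler32(b"Wikipedia")
```

The comparisons reproduce the properties described above: the Fletcher value depends on the order of the fragments but not on leading zero fragments, while setting A to 1 makes Adler-32 sensitive to leading zeros.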
9.1.4 The Adler-32 Checksum: We have a problem
The Adler-32 Checksum was the one finally chosen, and it is the one that appears in [Ste2000]. However, this is not the end of the story: several months later, some researchers took a look at the SCTP checksum, and they complained. The problem with the Adler-32 Checksum is that, for short packets, it is noticeably weaker than the alternatives. Since the primary application of SCTP is signaling transport, and call signaling typically uses packets of less than 200 bytes, this is a major problem. As the accumulators in the Adler-32 Checksum add up bytes, it is unlikely (or even impossible) that small packets make those accumulators wrap, and so the checksum is guaranteed to give poor coverage of the available bits. The resulting checksum is not random enough, being highly correlated with the number of bytes in the packet (when that packet is small enough). The A accumulator is a simple addition of the values of the bytes in the message to be checksummed. As the maximum value of an 8-bit unsigned number is 255, and A is initially set to 1, it will never wrap if the message contains less than (65,521 − 1) / 255 ≈ 257 bytes, which is the normal case. If we study this more deeply, and consider that obviously not all the bytes will be set to 255, and that the value 0 is quite popular, the results are even worse. The B accumulator is the addition of the successive values of A. In the best case, all the bytes of the packet are set to 255, and so, as A is initialized to 1, for a message of n bytes we have $B = (1 + 255) + (1 + 2 \cdot 255) + \dots + (1 + n \cdot 255) = n + 255(1 + 2 + \dots + n) = n + 255 \cdot \frac{n(n+1)}{2}$. The B accumulator wraps when it reaches 65,521, so solving the equation for B = 65,521 shows that for packets smaller than 22 bytes it is impossible for B to wrap. If we consider 128 as the value of the bytes instead of 255, that value grows to 32 bytes. So, B will almost always wrap (the smallest SCTP datagram is 32 bytes long), but A will not, and we are almost wasting the 16 bits of the A value. This problem started a discussion about three months long (which has not really finished). People on the list debated whether it was worthwhile to modify the checksum (because that would be backwards incompatible), or whether another kind of method should be used (such as an initial negotiation). Finally, it was accepted that, as not too many implementations of SCTP existed yet, and in any case they could be easily changed (as they were implemented in software), it was better to use a single checksum algorithm than to allow several, as that kind of design freedom usually causes interoperability problems.
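The wrap thresholds can be recomputed from the closed forms in the preceding analysis. This is a minimal sketch under the same best-case assumption that every byte of the message has the same value; the helper names are hypothetical.

```python
MOD = 65521


def a_after(n: int, value: int) -> int:
    """A after n bytes of the given value, before any modulo reduction."""
    return 1 + value * n


def b_after(n: int, value: int) -> int:
    """B after n bytes of the given value, before any modulo reduction."""
    return n + value * n * (n + 1) // 2


def first_wrap(acc, value: int) -> int:
    """Smallest message length at which the accumulator reaches MOD."""
    n = 1
    while acc(n, value) < MOD:
        n += 1
    return n


a_wrap_255 = first_wrap(a_after, 255)
b_wrap_255 = first_wrap(b_after, 255)
b_wrap_128 = first_wrap(b_after, 128)
```

Under these assumptions A first wraps at 257 bytes, while B first wraps at 23 bytes when every byte is 255 and at 32 bytes when every byte is 128. These values agree with the approximate figures quoted above: B cannot wrap for messages of up to 22 and 31 bytes respectively.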
9.1.5 Going back to the roots: Using the CRC-32 as the checksum
Once it was decided that the checksum should be changed, the question was which one to use, and there were several proposals. The simplest one was to modify the Adler-32 Checksum algorithm to make the additions two bytes at a time, which would make the accumulators always wrap. Some other proposals that had already been discarded in the past, such as the TCP-32 Checksum or the 16-bit Fletcher Checksum, were studied again. But the one that finally seemed to win the competition was the CRC-32, which uses the same principle as any CRC-16 but with a checksum twice as long. The main burden of the CRC-32 was that people thought it would be too time-consuming to calculate. Small devices with low speed (such as mobile phones) would especially suffer from this problem. But the error detection capabilities of a CRC-32 were much better than those of any other checksum. To illustrate this, let us see Table 9-1, with figures taken from [Oti2001a]:
Table 9-1: Error-Detection capabilities of several checksums
Four error detection algorithms were used, three of them already discussed above; the fourth is the so-called Fletcher-Adler Checksum. This checksum is basically the Adler-32 Checksum, but it initializes the A accumulator to 21,845 instead of 1, making the A and B accumulators more likely to wrap (which did not really help the randomness of its value). The test consisted of transferring the data stored on two hard drives using SCTP as the transport protocol. The files were chopped into pieces of up to 1,452 bytes (an MTU of 1,500 bytes was assumed, and the overhead of the IPv4 header, the SCTP header and the DATA chunk header was subtracted from it). Then the SCTP common header and the DATA chunk header were added, and the checksum was calculated. As most errors do not originate in the network lines (due to noise) but when the datagrams are copied inside the buffers of routers, a stuck bit error was simulated. In the simulation, the value of a specific bit was changed every 4 bytes, simulating a 32-bit memory cell with one of its 32 bits damaged. The results are quite striking: the 16-bit Fletcher Checksum performs much worse than any other checksum, and even though the Adler-32 Checksum was much better than the 16-bit Fletcher Checksum, it still failed to find the error many times. The last row of the table, Bytes, is to be read as follows: if we suppose that an n-bit checksum should miss just one error in every $2^n$, then, given the ratio of missed errors to total errors, the number shown in the table is the value of n for which we miss one error in every $2^n$. Once people were convinced that CRC-32 was much better than any other candidate, they were still unsure whether the extra time involved in calculating it would be a major problem. As explained above, when the CRC is implemented in hardware it is much quicker to calculate than any other checksum, and when implementation enhancements such as tables are used (to process more than one bit of the CRC at a time), the difference in software calculation time is drastically reduced. So, Randall R. Stewart measured the time needed to calculate several checksums [Ste2001a], and the results appear in Table 9-3 below:
Checksum Used        Minimum Calculation Time (µs)   Maximum Calculation Time (µs)
CRC-32               3                               128
Adler-32             2                               91
Modified Adler-32    40                              60
16-bit Fletcher      15                              50
TCP-32               3                               15
Table 9-3: Calculation time consumed by several checksums
The way this measurement was made can be consulted in [Ste2001b], but basically the different algorithms were applied to the same pieces of 1,000 bytes of random data. The Modified Adler-32 Checksum is the Adler-32 Checksum making the additions two bytes at a time instead of one. Some later calculations with improved algorithms showed that the real overhead of calculating the CRC-32 over the Adler-32 Checksum would be about 5-10%, and so it was decided that it was worthwhile to use CRC-32 instead. All in all, after 3 months of discussion, Moore's Law (footnote 39) says that processors were already about 12% more powerful, so it did not really make any sense to continue the discussion. However, once it was decided that the checksum used by SCTP should be changed to CRC-32, there were still more problems. It has been explained above in this section that calculating the CRC basically means finding which bits should be put at the end of a datagram so that the polynomial represented by all the bits in the datagram is divisible (modulo 2) by the Generator Polynomial. So, it makes sense to insert the CRC field at the end of the datagram. In this way, when verifying the value of the CRC, one only has to perform the division, and if there is no remainder, one supposes that the datagram is not corrupted. This, apart from being a clean way to calculate the checksum, would save the time otherwise spent saving the received value of the checksum, setting those 32 bits to zero, and making a comparison at the end of the process. At the time of writing, it seems that the 32 bits of the checksum will not be moved from their present position.
39 Moore's Law dates from 1965, when Gordon E. Moore predicted that the number of transistors per integrated circuit would keep doubling at a regular interval [Moo1965]; the figure commonly quoted is every 18 months. Moore's Law still holds true.
Moreover, there is not a single CRC-32 but several, so another decision had to be made regarding the checksum. The most popular CRC-32 is the one standardized by the ITU [ITU1996] and used in Ethernet, Token Ring or Fiber Distributed Data Interface (FDDI) networks among others, which is $x^{32} + x^{26} + x^{23} + x^{22} + x^{16} + x^{12} + x^{11} + x^{10} + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1$. Another one is the one studied by Castagnoli in [Cas1993], which is $x^{32} + x^{28} + x^{27} + x^{26} + x^{25} + x^{23} + x^{22} + x^{20} + x^{19} + x^{18} + x^{14} + x^{13} + x^{11} + x^{10} + x^9 + x^8 + x^6 + 1$. This second polynomial produces checksummed frames that have a bigger Hamming Distance (footnote 40) for messages of up to 1 Kbyte (and so it is better for low-noise binary channels). Apparently, the one studied by Castagnoli (CRC-32c) is the one that will be chosen. The document that describes this checksum change in SCTP is [Ste2002a], but it is not the only proposal submitted. There are another two Internet-Drafts regarding the checksum change. One of them, [Ahm2001], proposes using CRC-32 instead of the Adler-32 Checksum while keeping the possibility of interacting with old SCTP implementations. The algorithm is the simplest possible: apply CRC-32 to the first received packet containing the INIT chunk and, if it does not work, apply the Adler-32 Checksum, then keep applying the one that worked for the whole life of the association. This allows associations to be established with old implementations, but only if we are not the initiator (otherwise, our datagram containing the INIT chunk will use CRC-32 and the peer will discard the whole packet).
The proposal specified in [Oti2001b] goes further, removing the checksum field from the common header and defining a new CHECKSUM chunk that should also be the first one in every SCTP datagram, which can then use several different checksums. Basically, this is the same as extending the common header to contain an identifier of the checksum in use. It is likely that the simple change from the Adler-32 Checksum to CRC-32c documented in [Ste2002a] will be the chosen option. Two excellent discussions of the different checksum algorithms covered in this section, with their advantages and disadvantages, appear in the Internet-Drafts [Cav2001] and [She2001].
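To make the two generator polynomials and the fallback idea of [Ahm2001] more concrete, here is a minimal bit-at-a-time sketch. It assumes the bit-reflected convention with the register preset to all ones and a final complement, as used by common CRC-32 software; the text does not confirm these details for SCTP. The `verify` helper and its `data` argument (the packet with its checksum field already zeroed) are hypothetical.

```python
import zlib

CRC32_EXPONENTS = [32, 26, 23, 22, 16, 12, 11, 10, 8, 7, 5, 4, 2, 1, 0]
CRC32C_EXPONENTS = [32, 28, 27, 26, 25, 23, 22, 20, 19, 18,
                    14, 13, 11, 10, 9, 8, 6, 0]


def reflected_poly(exponents: list[int]) -> int:
    """Bit-reversed 32-bit mask of the generator (the x^32 term is implicit)."""
    poly = 0
    for e in exponents:
        if e < 32:
            poly |= 1 << (31 - e)        # coefficient of x^e goes to bit 31 - e
    return poly


def crc32_bitwise(data: bytes, poly: int) -> int:
    reg = 0xFFFFFFFF                     # assumed preset: all ones
    for byte in data:
        reg ^= byte
        for _ in range(8):
            reg = (reg >> 1) ^ poly if reg & 1 else reg >> 1
    return reg ^ 0xFFFFFFFF              # assumed final complement


def verify(data: bytes, received: int, poly: int) -> str:
    """Try the CRC first and fall back to Adler-32, in the spirit of [Ahm2001]."""
    if crc32_bitwise(data, poly) == received:
        return "crc"
    if zlib.adler32(data) == received:
        return "adler32"
    return "corrupt"


ETHERNET = reflected_poly(CRC32_EXPONENTS)
CASTAGNOLI = reflected_poly(CRC32C_EXPONENTS)
sample = b"123456789"
```

With the Ethernet polynomial the routine agrees with `zlib.crc32` from the standard library, which gives some confidence that the exponents listed above were transcribed correctly. A receiver following the fallback idea would remember which branch of `verify` succeeded on the first packet and keep using it for the rest of the association.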
9.2 Errata: The Implementors Guide
The change in the checksum is the most important one to be made, but by no means the only one. After about a year of inspecting the SCTP specifications, it seems that they contain plenty of mistakes. Fortunately, most of them are just minor typos of an editorial nature, caused by the well-known cut-and-paste habit. However, there are a few that definitely have to be changed, and so the designers of SCTP proposed a second version of RFC 2960, called RFC 2960 Bis. This second version basically only included the changes related to the checksum (surprisingly proposing the use of the 16-bit Fletcher Checksum), but was expected to evolve, as RFC 2960 itself did, to include all the necessary changes.
40 If we have a group of codewords (in our case the ones formed by checksummed messages), the Hamming Distance between two codewords is the number of bit positions in which they differ. The Hamming Distance of the code is the minimum of such Hamming Distances among every possible pair of codewords. If the Hamming Distance of a code is H, then at least H single-bit errors are needed for a corrupted frame to be accepted.
However, this is not the way the IETF works nowadays. Instead of writing a second version of the RFC, all the proposed changes are compiled in a separate Internet-Draft called the Implementors Guide [Ste2002b]. This document is co-authored by the author of this Master's Thesis. One of the main changes is related to the restart process. In the normal scenario, the crashed host sends the INIT chunk again, the receiver of that chunk identifies it as belonging to an already established association, and it sends back an INIT ACK chunk containing a State Cookie with the Tie-Tags set to the old Verification Tag values (instead of setting them to 0 as usual). When, later on, the COOKIE ECHO chunk is received containing those Tie-Tag values, the receiver of that chunk recognizes that the peer has restarted, and the association is reset. However, there is a big security problem with this mechanism. If an attacker sends an INIT chunk using a fake source address, setting it to a valid source address of one of the already established associations (and also including its own IP address so it can receive the INIT ACK chunk), then once the whole procedure is finished the old association will have been unnecessarily restarted. If we add to this the possibility of using the AddIP extension (see section 8.1.1) to delete the fake address used, the association will have been completely hijacked. Moreover, as the Tie-Tags are sent as plain text, it would be easy for an attacker to guess their value (for example, by sending INIT chunks to both peers and comparing the State Cookies received). As the Tie-Tags are set to the Verification Tag values, the attacker would be able to send us valid datagrams. To avoid this, the Implementors Guide states that a restart attempt will not be accepted if the INIT chunk contains any new IP address that was not part of the old association.
Also, a new error cause is defined to indicate this situation, so the crashed host can restart the association with fewer addresses (and eventually tear it down to be able to reinitiate it and use the whole set of addresses, if that is necessary). Another interesting change is related to the fast retransmit algorithm. The present SCTP specifications state that once a TSN is reported as missing (i.e., the TSN is unacknowledged while any subsequent TSN is acknowledged inside a Gap TSN ACK) in 4 consecutive SACK chunks, the TSN should be fast retransmitted. This causes three major problems:
1. There is no limit on how many times a TSN can be fast retransmitted. In the normal case, and especially in a network with a high bandwidth-delay product, at any given time there will be several DATA and SACK chunks in flight. So, immediately after issuing the fast retransmission, the old SACK chunks (and the ones produced by the DATA chunks in flight) will still be arriving and reporting that TSN as missing, thus triggering another unnecessary fast retransmission of the same TSN.
2. If a TSN is reordered in the network and arrives at the receiver before a number n of TSNs, it will trigger the sending of n SACK chunks containing missing TSNs. So, if n is bigger than or equal to 4, there will be n − 3 TSNs that will be unnecessarily fast retransmitted, because they were not lost but simply reordered.
3. Moreover, if the data receiver is waiting for a specific TSN to fill a gap in its TSN sequence, that TSN will be delaying the delivery of all the subsequent TSNs (assuming the data must be delivered in order). Once the gap is filled, the data receiver will suddenly have a big amount of data to deliver to its upper user and it will free its buffer. If the data sender is waiting for the receiver's buffer to empty, the arrival of the SACK chunk acknowledging the receipt of the retransmitted TSN with an updated Advertised Receiver Window Credit will suddenly allow the data sender to send a big amount of data. This would produce an excessive burst of traffic that could flood the network.
Avoiding the first problem is quite easy: the Implementors Guide simply allows a TSN to be sent only once via the fast retransmit algorithm. The second problem is a little harder to solve. The main issue is that once a TSN has arrived out of order, all the unacknowledged TSNs below it are considered missing, no matter in which order the other TSNs arrive. Let us say that we sent DATA chunks from TSN 1 to TSN 6, and the order of arrival is 1, 6, 2, 3, 4 and 5. When TSN 6 arrives, TSNs 2 to 5 are reported as missing, which is right. But as per RFC 2960, when TSN 2 arrives, TSNs 3, 4 and 5 are again reported as missing, because there is a later TSN that is acknowledged in a Gap Ack Block. So, even though TSNs 2 to 5 arrive in order, in the end TSN 5 will be fast retransmitted. This can be avoided in a simple and neat fashion: only those unacknowledged TSNs that precede the highest newly acknowledged (footnote 41) TSN in the received SACK chunk are counted as missing. This is what the Implementors Guide proposes. The third problem is not hard to avoid either. A new protocol parameter called Max.Burst is introduced, which limits the maximum size of a burst of traffic (its recommended value is 4). Some other minor problems covered in the Implementors Guide are related to the path heartbeat mechanism (when to start and finish, and how an unacknowledged HEARTBEAT chunk should be treated), to the shutdown procedure (how long to wait for the SHUTDOWN ACK chunk), and to some editorial defects, as well as clarifications of implicit features of SCTP that have traditionally caused problems for people on the distribution list.
41 A newly acknowledged TSN in an incoming SACK chunk is a TSN that has been acknowledged for the first time in the received SACK chunk.
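The reordering example with TSNs 1 to 6 can be replayed with a small simulation. It is only a sketch under simplifying assumptions that are not part of either specification: one SACK chunk is generated per arriving DATA chunk, no SACK chunk is lost, no TSN arrives twice, and a TSN is fast retransmitted at most once when its missing count reaches 4. The function and parameter names are hypothetical.

```python
def fast_retransmits(arrival_order: list[int], sent: list[int],
                     newest_rule: bool) -> list[int]:
    """Return the TSNs that the data sender would fast retransmit.

    newest_rule=False counts every unacknowledged TSN below an acknowledged
    one as missing (the RFC 2960 behaviour described in the text).
    newest_rule=True counts only those below the TSN newly acknowledged
    by the SACK chunk (the behaviour attributed to the Implementors Guide).
    """
    acked: set[int] = set()
    misses = {tsn: 0 for tsn in sent}
    retransmitted: list[int] = []
    for tsn in arrival_order:                 # each arrival triggers one SACK
        acked.add(tsn)
        highest_acked = max(acked)
        for t in sent:
            if t in acked or t > highest_acked:
                continue                      # not reported as missing
            if newest_rule and t > tsn:
                continue                      # above the newly acknowledged TSN
            misses[t] += 1
            if misses[t] == 4 and t not in retransmitted:
                retransmitted.append(t)       # fast retransmit only once
    return retransmitted


sent = [1, 2, 3, 4, 5, 6]
reordered = [1, 6, 2, 3, 4, 5]                # the example from the text
old_rule = fast_retransmits(reordered, sent, newest_rule=False)
new_rule = fast_retransmits(reordered, sent, newest_rule=True)
real_loss = fast_retransmits([1, 3, 4, 5, 6], sent, newest_rule=True)
```

In this simulation the old rule fast retransmits TSN 5 even though it was only reordered, the new rule retransmits nothing, and a TSN that is really lost (TSN 2 in the last call) is still retransmitted under the new rule.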
10. CONCLUSIONS
We have just seen the history of SCTP so far. At this point, three and a half years after the first version of what was then called MDTP, SCTP is still a practically unknown protocol, and we could say that it is still under development. There is no practical application that uses it yet, because there is no commercial implementation of SCTP available on the market. Despite all these problems, there is a deep feeling that it will succeed. People at SIGTRAN just needed a relatively simple protocol, with specific requirements for signaling transport, but things got complicated. The authors of SCTP could have simply designed the protocol that was needed, leaving out all the features that people on the distribution list were constantly proposing. But in the end, even if at some stages it seemed that SCTP would never be ready, and that the delay in its publication as an RFC would lead companies to develop their own solutions, all the care taken in its design and all the time spent were worthwhile. Now we have a new transport protocol that not only offers the necessary support for signaling transport, but could also compete with one of the giants of the Internet, TCP. SCTP has several features that make it more suitable than TCP for common Internet applications. One of the weak points of TCP is the famous SYN attack explained in section 4.2. That attack was hard to carry out from computers running the Microsoft Windows operating system (at least without being discovered), but the new 2000 and XP versions make things easier for attackers. Now, a transport protocol such as SCTP, which is immune to that attack due to its cookie mechanism, is needed more than ever, and this simple fact can speed up the deployment of SCTP. The use of streams makes SCTP particularly suitable for HTTP servers. Presently, when we download a web page, a TCP connection must be set up for every graphic element it contains, as well as for sound or video.
For small graphics of a few Kbytes, the five datagrams that must be sent to establish and tear down a TCP connection are a considerable overhead. Using SCTP, the server could simply open as many streams as needed to transport those pictures and send each one over an independent stream. Moreover, the congestion control algorithms are there to provide a means of sharing the resources of the Internet equally, and if a host has many established connections with a server, all of them are considered independent and each is given its own portion of those resources. A practical effect of this is that if, for example, 11 users access an HTTP server, 10 of them downloading pages containing just text (thus using a single TCP connection each) and one of them requesting a page containing 9 pictures (which would require 10 TCP connections), this last user holds 10 of the 20 connections and will therefore consume as much of the HTTP server's bandwidth and processing time as the other 10 users together. Using SCTP, the 11 users would each receive the same portion of resources.

Before SCTP, there was no transport protocol in the Internet able to take advantage of multihomed hosts. The use of several network cards is quite common nowadays, especially in servers with high traffic demand. With SCTP, a multihomed host not only ensures that a data connection will not be closed if any of its cards stops working, but also makes it possible to divert traffic away from congested paths. If the multihomed host has its cards connected to different networks, it can distribute the data flow among them or change the one it is using as soon as it experiences congestion. In TCP, once a connection has been established there is no choice about which card to use (see footnote 42), as only one can be used.

SCTP is message-oriented, as UDP is. TCP has no concept of a message, and what it transports is seen as a simple flow of bytes. As a result, applications must provide their own marks to separate the different messages sent through a TCP connection, or use UDP instead. But UDP is unreliable and does not offer many of the features that TCP does, such as congestion control. In SCTP, user messages are identified by their SSN, which makes it possible to identify specific portions of the whole data transfer. Applied to our previous example of an HTTP server, the different parts of a web page could be transferred as different messages, which would make them easier to identify at the receiver side even if a single stream is used.

TCP relies on ICMP to report problems such as a server that is not listening on a specific port or an unreachable peer. The problems reported by ICMP are always at the IP level, and TCP itself has no way to report problems at the transport level. SCTP, however, can use ERROR or ABORT chunks to notify the peer of certain error conditions. Thus, an SCTP endpoint can tell the other, for example, that it is out of resources or that the received cookie was stale, and the peer can act more consistently than if it simply noticed that, for some reason, the association was not established.

But SCTP does not only have new features; it is heavily inspired by TCP, and that is a good thing, since TCP has proven to be a very robust protocol over many years of use. The congestion control algorithms were taken directly from those of TCP, and some other features that are optional in TCP became compulsory in any SCTP implementation.
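The remark above that applications running over TCP must provide their own marks to separate messages can be illustrated with a short Python sketch. It is only one possible approach: the 4-byte length prefix and the names `frame` and `Deframer` are assumptions made for this illustration and do not come from any specification. With SCTP this work would not be needed, since the protocol itself preserves message boundaries.

```python
import struct
from typing import List

# Hypothetical framing format for this sketch: a 4-byte big-endian length prefix.
HEADER = struct.Struct("!I")


def frame(message: bytes) -> bytes:
    """Prefix a message with its length so it can be recovered from a byte stream."""
    return HEADER.pack(len(message)) + message


class Deframer:
    """Rebuilds message boundaries from arbitrarily split pieces of a byte stream."""

    def __init__(self) -> None:
        self._buffer = bytearray()

    def feed(self, data: bytes) -> List[bytes]:
        """Add newly received bytes and return every message completed by them."""
        self._buffer.extend(data)
        messages = []
        while len(self._buffer) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buffer)
            end = HEADER.size + length
            if len(self._buffer) < end:
                break  # the rest of this message has not arrived yet
            messages.append(bytes(self._buffer[HEADER.size:end]))
            del self._buffer[:end]
        return messages


if __name__ == "__main__":
    # Two application messages written back to back on the same connection.
    stream = frame(b"<html>page text</html>") + frame(b"picture bytes")
    receiver = Deframer()
    # A byte stream may deliver the data in pieces unrelated to the original writes.
    for piece in (stream[:7], stream[7:21], stream[21:]):
        for message in receiver.feed(piece):
            print(message)
```

The receiver has to buffer partial data and parse the prefix itself, which is exactly the kind of application-level bookkeeping that a message-oriented transport makes unnecessary.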
Among the features that are optional in TCP but compulsory in SCTP, we could mention the use of selective acknowledgements, the ability to report the receipt of duplicate TSNs, the support for ECN and the path heartbeat mechanism.

SCTP is much more extensible than TCP. In TCP, the restricted space available for options makes them virtually useless. The few bits reserved in the TCP header are a very scarce resource, and any new feature added to TCP that has to use some of those reserved bits must be designed to use as few of them as possible. This usually complicates the design, or may even make the whole feature impossible to add. In SCTP, adding a new feature is easy and the designers do not have to worry about the space available for extensions: they just define new chunks or new parameters and include in them as much information as needed. The number of undefined chunk and parameter types is large enough to ensure that we will not run out of them in the future.

The number of applications that use TCP is huge, and it would take a long time to modify them to use SCTP instead. However, this is alleviated by the socket interface presently being defined for SCTP [Ste2002d], which is very similar to the TCP one. For simple applications that use a single stream, the code changes needed to use SCTP instead of TCP are so minimal that basically one just has to manage the socket in the old way, specifying when it is opened that SCTP should be used instead of TCP or UDP. This would make things much easier and would facilitate the quick deployment of SCTP.

Moreover, there are several open source SCTP implementations that can be freely downloaded from the Internet, so nobody who would like to use SCTP really has to write their own implementation. The so-called reference implementation (written by the creators of SCTP to help them find errors in the specifications of MDTP and SCTP) has been available since the times of MDTP and is constantly updated. Even if SCTP is not a simple protocol, there are some implementations that occupy less than 100 Kbytes, making SCTP suitable for small devices with memory limitations.

Some tests [Jun2000] showed not only that SCTP performance was no worse than that of TCP, but that the throughput achieved by SCTP was even better than that of TCP under some circumstances. Moreover, SCTP and TCP implementations share resources equally (as they have the same congestion avoidance algorithms). This behavior is highly desirable, as it facilitates a gradual conversion of applications from TCP to SCTP and makes the coexistence of both protocols easier.

On the whole, SCTP has many advantages over TCP and very few drawbacks, and we can expect that, apart from being used for signaling transport, SCTP will replace TCP in the Internet in the future. However, that will not happen overnight. As an example we can cite IPv6, whose design process was relatively similar to that of SCTP. IPv6 was chosen among several other proposals about 10 years ago, it took some years to finish its design, and the specification was finally revised in 1998. Today, more than three years later, IPv6 is not widely deployed yet, but it will be in the future (otherwise the Internet will collapse). SCTP is now being revised, and possibly within this year we will see another RFC containing the new specification of SCTP. We do not know how many years it will take, but very probably, in the future, the TCP/IP architecture will be replaced by another similar architecture, SCTP/IPv6.

42 Note that there also exists the possibility of using another network card while still using the same source and destination IP addresses. However, usually an IP address is assigned to each network card in such a way that IP datagrams sent through a specific network card will always have the same IP source address.
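As an illustration of the earlier remark that a simple single-stream application only has to specify SCTP when the socket is opened, the following Python sketch opens a connection-oriented socket for either protocol. It is a sketch under assumptions, not a definitive implementation: it assumes a host whose kernel offers SCTP through the ordinary sockets interface, the helper names `open_stream_socket` and `describe` are hypothetical, and the interface described in [Ste2002d] may differ in its details.

```python
import socket

# 132 is the IP protocol number assigned to SCTP. Python only defines
# socket.IPPROTO_SCTP on platforms whose C headers provide it, hence the fallback.
IPPROTO_SCTP = getattr(socket, "IPPROTO_SCTP", 132)


def open_stream_socket(use_sctp: bool) -> socket.socket:
    """Open a connection-oriented socket, choosing SCTP or TCP at creation time."""
    protocol = IPPROTO_SCTP if use_sctp else socket.IPPROTO_TCP
    return socket.socket(socket.AF_INET, socket.SOCK_STREAM, protocol)


def describe(use_sctp: bool) -> str:
    """Try to open the socket and report what happened, without raising."""
    try:
        with open_stream_socket(use_sctp) as sock:
            return f"opened socket with protocol number {sock.proto}"
    except OSError as error:
        # Raised, for example, when the kernel has no SCTP support loaded.
        return f"not available on this host: {error}"


if __name__ == "__main__":
    print("TCP :", describe(False))
    print("SCTP:", describe(True))
```

The only difference between the two cases is the protocol argument passed at creation time; the rest of a simple client or server (connect, send, receive, close) would be handled as with TCP.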
APPENDIX A: CONTENTS OF THE CD-ROM
At the end of this Master's Thesis you will find a CD-ROM. It includes most of the documents cited in the Bibliography section that are available in electronic format, as well as some other files related to SCTP.

All the documents that are publicly available in the Internet appear in the /bibliography folder. This includes all the documents available in the IETF pages (RFCs and Internet-Drafts) as well as some other papers that can be freely distributed. Basically, all the titles of the bibliography are in the CD-ROM except those that are books or magazines, or those published by the ITU-T, the Institute of Electrical and Electronics Engineers (IEEE) or the Association for Computing Machinery (ACM), which are not free of charge. For these documents, the link in the Bibliography section can only be accessed by those who have the necessary subscription to those publications. All the documents included in the CD-ROM appear in the Bibliography section with their reference name written in bold letters. The names of the files in the /bibliography folder correspond to the names that appear inside the square brackets in the Bibliography section. All those documents are written in English and are saved in .txt, .htm or .pdf format.

The CD-ROM also includes all the previous MDTP and SCTP versions. The folder /mdtp contains the old MDTP Internet-Drafts (from mdtp-00.txt to mdtp-08.txt), and /sctp includes all the previous releases of RFC 2960 and the RFC itself (from sctp-00.txt to sctp-14.txt).

In the /extras folder there is a compilation of Internet-Drafts published by the IETF that are related to SCTP and not included in the bibliography. These are the latest releases of the documents describing how applications not mentioned in this Master's Thesis can use SCTP as their transport protocol, the Management Information Base (MIB) and some other documents. They are all saved in the CD-ROM under the original names with which they were published by the IETF.
The /rfcs folder contains all the IETF RFCs from RFC 1 to RFC 3238 available in the IETF pages. They are all in English and saved in .txt format.

Inside the /mail folder there is an extensive mail archive of both SIGTRAN and TSVWG. The archives are included as a .pst file, so they should be opened using Microsoft Outlook. The archive includes all the messages sent to SIGTRAN from November 1999 to January 2002, both months included, and the messages sent to the TSVWG distribution list from February 2001 to January 2002. There are about 13,000 messages altogether.

The CD-ROM also includes four publicly available implementations of SCTP inside the /implementations folder. One of them is the so-called Reference Implementation written by Randall R. Stewart and Qiaobing Xie, the two primary designers of SCTP. It is a user-space implementation that runs on Linux, FreeBSD, NetBSD, Lynx O/S, Solaris and, in general, most UNIX-like systems that provide a classic sockets API and a method for sending raw IP datagrams. The archives of release 4.0.5 were taken from the CD-ROM included in their recently published book about SCTP, [Ste2001c], and are located in the /reference subfolder.

There is a kernel implementation for Linux based on the Reference Implementation, lksctp. It is a SourceForge project and has been developed by a team of programmers from Motorola, Cisco, IBM and Intel. Release 2.4.1-0.3.2 appears in the CD-ROM inside the /lksctp subfolder.

The CD-ROM includes another public implementation, published under the GNU public license, called sctplib, a cooperative work of the University of Essen and Siemens AG, Munich. It is a user-space implementation programmed by Andreas Jungmaier, Herbert Hölzlwimmer, Achim Weber and Michael Tüxen that runs under Linux, FreeBSD, Solaris and Mac OS X. Release 1.0.0-pre14 is included in the /sctplib subfolder.

The last public implementation of SCTP appears in the /strsctp subfolder. It is release 0.7.6 of the kernel implementation for Linux called STREAMS SCTP, which has been developed by OpenSS7.

The specific functionality provided by each of those implementations is described in the README files included in each subfolder. They are not completely compliant with the SCTP specifications, and some are beta versions that might include bugs. New releases appear every now and then (see Appendix B).

There is a nice network protocol analyzer that supports SCTP. Its name is Ethereal, and it is included in the /ethereal folder of the CD-ROM. Ethereal is a free network protocol analyzer for Windows, Unix and Unix-like operating systems. It allows you to examine data from a live network or from a capture file on disk. You can interactively browse the capture data, viewing summary and detail information for each packet. Ethereal has several powerful features, including a rich display filter language. The CD-ROM includes the Windows version of Ethereal 0.8.20 (including the necessary WinPcap packet capture driver, which must be installed before Ethereal can be used) ready to be installed, inside the /windows subfolder. The /ethereal folder also contains another subfolder, /others, which includes the source code of Ethereal for Linux, Solaris, FreeBSD, Sequent PTX v4.4.5, Tru64 UNIX (formerly Digital UNIX), Irix, AIX and Windows as well.
In order to work properly, Ethereal needs GTK+ and Glib (graphical user interface libraries) and libpcap (a packet capture and filtering library), all included in the same subfolder. Perl is also required to build the documentation, and the zlib library allows Ethereal to read gzip-compressed files on the fly. Both are also included in the CD-ROM.

SCTP support was added to tcpdump by Jerry Heinz (Temple University), John Fiore (University of Pennsylvania), and Armando Caro (University of Delaware). It is available starting with version 3.7. Tcpdump is the de facto standard among packet sniffing tools and is published under the BSD software license. It comes pre-packaged with most major UNIX/Linux distributions. Its source code, together with the libpcap library and the tcpslice tool, is included in the /tcpdump folder of the CD-ROM.

This CD-ROM also contains an SCTP module as a patch for NS-2 (release 2.1b8), published under the BSD software license. This module was developed by Armando Caro and Janardhan Iyengar of the University of Delaware. NS-2 is a discrete event simulator, and it is the network simulator most commonly used in the research community today. The NS-2 simulator itself and the patch have been included in the /ns-2 folder, as well as a manual for the NS-2 simulator. This patch currently supports most of the features in sections 6 and 7 of the SCTP specification.

Finally, in the /thesis folder we can find this document in electronic format, in both .ps and .pdf formats.
APPENDIX B: OTHER SOURCES OF INFORMATION ABOUT SCTP
SCTP is still a very young protocol; however, there are already quite a few sources of information about it, mostly in the Internet.

The only book published so far about SCTP is [Ste2001c], on the shelves since November 2001. It is written by the two primary designers of SCTP, Randall R. Stewart and Qiaobing Xie, and it is definitely worth reading. It should be seen as a companion to the SCTP specification, and it includes many examples that help in understanding the difficult parts of the specification.

Possibly the most complete web page about SCTP can be found at http://www.sctp.de/. There we can find not only links to many other SCTP resources in the Internet, including RFCs and Internet-Drafts, but also the latest releases of the sctplib implementation of SCTP.

The latest version of the Reference Implementation is located at http://www.sctp.org/. There you can find information about SCTP extensions as well.

At http://sourceforge.net/projects/lksctp/ you will find the latest version of the SCTP kernel implementation lksctp. Another web page, http://www.openss7.org/, contains the latest versions of the strsctp implementation of SCTP.

At http://playground.sun.com/sctp/ there is a kernel implementation of SCTP publicly available for the Solaris™ Operating Environment. However, due to U.S. export laws it cannot be downloaded from several countries.

At http://www.ethereal.com/ you can find all the information about Ethereal and the downloads for the different platforms.

The NS-2 page is located at http://www.isi.edu/nsnam/ns/. The latest releases can be obtained there.

The web page http://www.watersprings.org/ contains an impressively extensive collection of IETF Internet-Drafts (both expired and current ones) and RFCs.

The official IETF page of the TSVWG is http://www.ietf.org/html.charters/tsvwg-charter.html, and that of the SIGTRAN working group is http://www.ietf.org/html.charters/sigtran-charter.html.
They contain most of the RFCs and current Internet-Drafts related to SCTP. All the RFCs can be accessed from the IETF web page, http://www.ietf.org.

If you want to dive into the mail archives of SIGTRAN and TSVWG to discover for yourself the reasons behind some design decisions, you can go either to ftp://ftp.ietf.org/ietf-mail-archive/sigtran/ (which only holds the archive since March 2001) or to http://www17.nortelnetworks.com/archives/sigtran.html for the SIGTRAN archives, and to ftp://ftp.ietf.org/ietf-mail-archive/tsvwg/ for the TSVWG archives. To participate in those mailing lists you can send your messages to sigtran@standards.nortelnetworks.com for SIGTRAN or to tsvwg@ietf.org for TSVWG. Instructions on how to subscribe appear in the official pages of those IETF working groups, http://www.ietf.org/html.charters/sigtran-charter.html for SIGTRAN and http://www.ietf.org/html.charters/tsvwg-charter.html for TSVWG.
BIBLIOGRAPHY
[Ahm2001] AHMED, H., and BOFFA, S.: SCTP Dynamic Checksum Selection, Internet-Draft, August 2001. Work in progress. http://www.watersprings.org/pub/id/draft-ahmed-tsvwg-sctpdsum-00.txt
[All1999] ALLMAN, M., PAXSON, V., and STEVENS, W. R.: TCP Congestion Control, RFC 2581, April 1999. http://www.ietf.org/rfc/rfc2581.txt
[Alm1992] ALMQUIST, P.: Type of Service in the Internet Protocol Suite, RFC 1349, July 1992. http://www.ietf.org/rfc/rfc1349.txt
[Ari2001] ARIAS-RODRÍGUEZ, I., STEWART, R. R., and ALLMAN, M.: SCTP Adaptive Fast Retransmit, Internet-Draft, June 2001. Work in progress.
[Bel1996] BELLOVIN, S. M.: Defending Against Sequence Number Attacks, RFC 1948, May 1996. http://www.ietf.org/rfc/rfc1948.txt
[Bel2001] BELLOVIN, S. M., IOANNIDIS, J., KEROMYTIS, A. D., and STEWART, R. R.: On the use of SCTP with IPsec, Internet-Draft, October 2001. Work in progress. http://www.watersprings.org/pub/id/draft-ietf-ipsec-sctp-02.txt
[Ben1999] BENNETT, J. C. R., PARTRIDGE, C., and SHECTMAN, N.: Packet Reordering is not Pathological Network Behavior, IEEE/ACM Transactions on Networking, Vol. 7, Issue 6, December 1999. http://ieeexplore.ieee.org/iel5/90/17613/00811445.pdf
[Ber1994] BERNERS-LEE, T.: Universal Resource Identifiers in WWW: A Unifying Syntax for the Expression of Names and Addresses of Objects on the Network as used in the World-Wide Web, RFC 1630, June 1994. http://www.ietf.org/rfc/rfc1630.txt
[Ber1996] BERNERS-LEE, T., FIELDING, R. T., and FRYSTYK, H.: Hypertext Transfer Protocol -- HTTP/1.0, RFC 1945, May 1996. http://www.ietf.org/rfc/rfc1945.txt
[Bla1998] BLAKE, S., BLACK, D. L., CARLSON, M. A., DAVIES, E., WANG, Z., and WEISS, W.: An Architecture for Differentiated Services, RFC 2475, December 1998. http://www.ietf.org/rfc/rfc2475.txt
[Bla2001a] BLANTON, E., and ALLMAN, M.: Using TCP DSACKs and SCTP Duplicate TSNs to Detect Spurious Retransmissions, Internet-Draft, August 2001. Work in progress. http://www.watersprings.org/pub/id/draft-blanton-dsack-use-01.txt
[Bla2001b] BLANTON, E., and ALLMAN, M.: Adjusting the Duplicate ACK Threshold to Avoid Spurious Retransmits, Internet-Draft, July 2001. Work in progress. http://www.watersprings.org/pub/id/draft-blanton-dupack-thresh-adjust-00.txt
[Bov1999] BOVA, T., and KRIVORUCHKA, T.: Reliable UDP Protocol, Internet-Draft, expired August 1999. http://www.watersprings.org/pub/id/draft-ietf-sigtran-reliable-udp-00.txt
[Bra1989] BRADEN, R. (editor): Requirements for Internet Hosts -- Communication Layers, RFC 1122, October 1989. http://www.ietf.org/rfc/rfc1122.txt
[Bra1997] BRADEN, R. (editor), ZHANG, L., BERSON, S., HERZOG, S., and JAMIN, S.: Resource Reservation Protocol (RSVP) -- Version 1 Functional Specification, RFC 2205, September 1997. http://www.ietf.org/rfc/rfc2205.txt
[Car2000] CARPENTER, B. E.: Internet Transparency, RFC 2775, February 2000. http://www.ietf.org/rfc/rfc2775.txt
[Cas1990] CASE, J., FEDOR, M., SCHOFFSTALL, M., and DAVIN, J.: A Simple Network Management Protocol, RFC 1157, May 1990. http://www.ietf.org/rfc/rfc1157.txt
[Cas1993] CASTAGNOLI, G., BRÄUER, S., and HERRMANN, M.: Optimization of Cyclic Redundancy-Check Codes with 24 and 32 Parity Bits, IEEE Transactions on Communications, Vol. 41, Issue 6, June 1993. http://ieeexplore.ieee.org/iel1/26/5993/00231911.pdf
[Cav2001] CAVANNA, V., and WAKELEY, M.: iSCSI Digest, CRC or Checksum?, Internet-Draft, expired September 2001. http://www.watersprings.org/pub/id/draft-cavanna-iscsi-crc-vs-cksum-01.txt
[Cla1982] CLARK, D. D.: IP Datagram Reassembly Algorithms, RFC 815, July 1982. http://www.ietf.org/rfc/rfc815.txt
[CER1995] CERT: IP Spoofing Attacks and Hijacked Terminal Connections, CERT Advisory CA-1995-01, January 1995. http://www.cert.org/advisories/CA-1995-01.html
[CER1996] CERT: TCP SYN Flooding and IP Spoofing Attacks, CERT Advisory CA-1996-21, September 1996. http://www.cert.org/advisories/CA-1996-21.html
[Cha1998] CHANDRANMENON, G. P., and VARGHESE, G.: Reconsidering Fragmentation and Reassembly, Proceedings of the 17th Annual ACM Symposium on Principles of Distributed Computing, pages 21-29, July 1998. http://www.acm.org/pubs/articles/proceedings/podc/277697/p21-chandranmenon/p21-chandranmenon.pdf
[Con1998] CONTA, A., and DEERING, S. E.: Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6) Specification, RFC 2463, December 1998. http://www.ietf.org/rfc/rfc2463.txt
[Coe2001] COENE, L., TÜXEN, M., VERWIMP, G., LOUGHNEY, J., STEWART, R. R., XIE, Q., HOLDREGE, M., BELINCHÓN, M. C., JUNGMAIER, A., and ONG, L.: Multihoming issues in the Stream Control Transmission Protocol, Internet-Draft, November 2001. http://www.watersprings.org/pub/id/draft-coene-sctp-multihome-01.txt
[Dee1998] DEERING, S. E., and HINDEN, R. M.: Internet Protocol, Version 6 (IPv6) Specification, RFC 2460, December 1998. http://www.ietf.org/rfc/rfc2460.txt
[Deu1996] DEUTSCH, L. P., and GAILLY, J. L.: ZLIB Compressed Data Format Specification version 3.3, RFC 1950, May 1996. http://www.ietf.org/rfc/rfc1950.txt
[Die1999] DIERKS, T., and ALLEN, C.: The TLS Protocol Version 1.0, RFC 2246, January 1999. http://www.ietf.org/rfc/rfc2246.txt
[Dob1996] DOBBERTIN, H.: The Status of MD5 After a Recent Attack, RSA Laboratories' CryptoBytes, Volume 2, Number 2, Summer 1996. ftp://ftp.rsasecurity.com/pub/cryptobytes/crypto2n2.pdf
[Dur2000] DURHAM, D. (editor), BOYLE, J., COHEN, R., HERZOG, S., RAJAN, R., and SASTRY, A.: The COPS (Common Open Policy Service) Protocol, RFC 2748, January 2000. http://www.ietf.org/rfc/rfc2748.txt
[Eas1994] EASTLAKE, D. E., CROCKER, S. D., and SCHILLER, J. I.: Randomness Recommendations for Security, RFC 1750, December 1994. http://www.ietf.org/rfc/rfc1750.txt
[Fai2001] FAIRLIE-CUNINGHAME, R.: Guidelines for specifying SCTP-based media transport using SDP, Internet-Draft, May 2001. Work in progress.
[Flo2000] FLOYD, S., MAHDAVI, J., MATHIS, M., and PODOLSKY, M.: An Extension to the Selective Acknowledgement (SACK) Option for TCP, RFC 2883, July 2000. http://www.ietf.org/rfc/rfc2883.txt
[Fox1989] FOX, R.: TCP Big Window and Nak Options, RFC 1106, June 1989. http://www.ietf.org/rfc/rfc1106.txt
[Geo2001] GEORGE, T., DANTU, R., KALLA, M., SCHWARZBAUER, H. J., SIDEBOTTOM, G., and MORNEAULT, K.: SS7 MTP2-User Peer-to-Peer Adaptation Layer, Internet-Draft, July 2001. Work in progress. http://www.watersprings.org/pub/id/draft-ietf-sigtran-m2pa-03.txt
[Gib2001] GIBSON, S.: The Strange Tale of the Denial of Service Attack against GRC.COM, June 2001. http://media.grc.com/files/grcdos.pdf
[GSM2001] GSM World: Member Statistics, December 2001. http://www.gsmworld.com/membership/mem_stats.html
[Han1998] HANDLEY, M., and JACOBSON, V.: SDP: Session Description Protocol, RFC 2327, April 1998. http://www.ietf.org/rfc/rfc2327.txt
[Han1999] HANDLEY, M., SCHULZRINNE, H., SCHOOLER, E., and ROSENBERG, J.: SIP: Session Initiation Protocol, RFC 2543, March 1999. http://www.ietf.org/rfc/rfc2543.txt
[Han2000] HANDLEY, M., PERKINS, C., and WHELAN, E.: Session Announcement Protocol, RFC 2974, October 2000. http://www.ietf.org/rfc/rfc2974.txt
[Har1998] HARKINS, D., and CARREL, D.: The Internet Key Exchange (IKE), RFC 2409, November 1998. http://www.ietf.org/rfc/rfc2409.txt
[Hin1998] HINDEN, R. M., and DEERING, S. E.: IP Version 6 Addressing Architecture, RFC 2373, July 1998. http://www.ietf.org/rfc/rfc2373.txt
[Hui1998] HUITEMA, C.: IPv6: The new Internet Protocol, Second Edition, Prentice-Hall International, 1998.
[ITU1996] ITU-T: Error-correcting procedures for DCEs using asynchronous-to-synchronous conversion, Recommendation V.42, October 1996.
[Jac1988] JACOBSON, V.: Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, pages 314-329, August 1988. http://www.acm.org/pubs/articles/proceedings/comm/52324/p314-jacobson/p314-jacobson.pdf
[Jac1990] JACOBSON, V.: Compressing TCP/IP Headers for Low-Speed Serial Links, RFC 1144, February 1990. http://www.ietf.org/rfc/rfc1144.txt
[Jac1992] JACOBSON, V., BRADEN, R., and BORMAN, D.: TCP Extensions for High Performance, RFC 1323, May 1992. http://www.ietf.org/rfc/rfc1323.txt
[Jun2000] JUNGMAIER, A., SCHOOP, M., and TÜXEN, M.: Performance Evaluation of the Simple Control Transmission Protocol (SCTP), Proceedings of the IEEE Conference on High Performance Switching and Routing, June 2000. http://tdrwww.exp-math.uni-essen.de/pages/forschung/atm2000.pdf
[Jun2001] JUNGMAIER, A., RESCORLA, E., and TÜXEN, M.: TLS over SCTP, Internet-Draft, November 2001. Work in progress. http://www.watersprings.org/pub/id/draft-ietf-tsvwg-tls-over-sctp-00.txt
[Kaa2001] KAARANEN, H., AHTIAINEN, A., LAITINEN, L., NAGHIAN, S., and NIEMI, V.: UMTS Networks. Architecture, Mobility and Services, First Edition, John Wiley & Sons, 2001.
[Kar1999] KARN, P., and SIMPSON, W. A.: Photuris: Session-Key Management Protocol, RFC 2522, March 1999. http://www.ietf.org/rfc/rfc2522.txt
[Ken1998a] KENT, S., and ATKINSON, R.: Security Architecture for the Internet Protocol, RFC 2401, November 1998. http://www.ietf.org/rfc/rfc2401.txt
[Ken1998b] KENT, S., and ATKINSON, R.: IP Authentication Header, RFC 2402, November 1998. http://www.ietf.org/rfc/rfc2402.txt
[Ken1998c] KENT, S., and ATKINSON, R.: IP Encapsulating Security Payload, RFC 2406, November 1998. http://www.ietf.org/rfc/rfc2406.txt
[Kes1998] KESSLER, G. C., and SOUTHWICK, P. V.: ISDN, Signature Edition, McGraw-Hill, 1998.
[Kle2001] KLENSIN, J. (editor): Simple Mail Transfer Protocol, RFC 2821, April 2001. http://www.ietf.org/rfc/rfc2821.txt
[Kra1997] KRAWCZYK, H., BELLARE, M., and CANETTI, R.: HMAC: Keyed-Hashing for Message Authentication, RFC 2104, February 1997. http://www.ietf.org/rfc/rfc2104.txt
[Lou2002] LOUGHNEY, J., SIDEBOTTOM, G., MOUSSEAU, G., LORUSSO, S., COENE, L., VERWIMP, G., KELLER, J., ESCOBAR, F., SULLY, W., FURNISS, S., and BIDULOCK, B.: SS7 SCCP-User Adaptation Layer (SUA), Internet-Draft, January 2002. Work in progress. http://www.watersprings.org/pub/id/draft-ietf-sigtran-sua-11.txt
[Ma1998] MA, G.: T/UDP: UDP for TCAP, Internet-Draft, expired May 1999. http://www.watersprings.org/pub/id/draft-ma-tudp-00.txt
[Mad1998a] MADSON, C. (editor), and GLENN, R. (editor): The use of HMAC-MD5-96 within ESP and AH, RFC 2403, November 1998. http://www.ietf.org/rfc/rfc2403.txt
[Mad1998b] MADSON, C. (editor), and GLENN, R. (editor): The use of HMAC-SHA-1-96 within ESP and AH, RFC 2404, November 1998. http://www.ietf.org/rfc/rfc2404.txt
[Mat1996] MATHIS, M., MAHDAVI, J., FLOYD, S., and ROMANOW, A.: TCP Selective Acknowledgement Options, RFC 2018, October 1996. http://www.ietf.org/rfc/rfc2018.txt
[McC1996] McCANN, J., DEERING, S. E., and MOGUL, J.: Path MTU Discovery for IP Version 6, RFC 1981, August 1996. http://www.ietf.org/rfc/rfc1981.txt
[Moc1987] MOCKAPETRIS, P.: Domain Names - Concepts and Facilities, RFC 1034, November 1987. http://www.ietf.org/rfc/rfc1034.txt
[Mod1992] MODARRESSI, A. R., and SKOOG, R. A.: An Overview of Signaling System No. 7, Proceedings of the IEEE, Vol. 80, No. 4, April 1992. http://ieeexplore.ieee.org/iel1/5/3687/00135382.pdf?isNumber=3687
[Mog1990] MOGUL, J., and DEERING, S.: Path MTU Discovery, RFC 1191, November 1990. http://www.ietf.org/rfc/rfc1191.txt
[Moo1965] MOORE, G. E.: Cramming More Components onto Integrated Circuits, Electronics, Volume 38, Number 8, April 1965. http://www.intel.com/research/silicon/moorespaper.pdf
[Mor2001] MORNEAULT, K., RENGASAMI, S., KALLA, M., and SIDEBOTTOM, G.: ISDN Q.921-User Adaptation Layer, RFC 3057, February 2001. http://www.ietf.org/rfc/rfc3057.txt
[Mor2002] MORNEAULT, K., DANTU, R., SIDEBOTTOM, G., GEORGE, T., BIDULOCK, B., and HEITZ, J.: SS7 MTP2-User Adaptation Layer, Internet-Draft, January 2002. Work in progress. http://www.watersprings.org/pub/id/draft-ietf-sigtran-m2ua-13.txt
[Moy1998] MOY, J.: OSPF Version 2, RFC 2328, April 1998. http://www.ietf.org/rfc/rfc2328.txt
[NBS1995] NATIONAL BUREAU OF STANDARDS: Secure Hash Standard, Federal Information Processing Standards Publication 180-1, April 1995. http://csrc.nist.gov/publications/fips/fips180-1/fip180-1.txt
[Nic1998] NICHOLS, K., BLAKE, S., BAKER, F., and BLACK, D.: Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers, RFC 2474, December 1998. http://www.ietf.org/rfc/rfc2474.txt
[Nua2001] Nua: How Many Online?, Nua Internet Surveys, 2001. http://www.nua.ie/surveys/how_many_online/index.html
[Ong1999] ONG, L., RYTINA, I., HOLDREGE, M., COENE, L., GARCÍA, M. A., SHARP, C., JUHASZ, I., LIN, H. P., and SCHWARZBAUER, H. J.: Framework Architecture for Signaling Transport, RFC 2719, October 1999. http://www.ietf.org/rfc/rfc2719.txt
[Oti2001a] OTIS, D.: RE: [TSVWG] SCTP and Checksums, email sent to the TSVWG distribution list, 14th May 2001. ftp://ftp.ietf.org/ietf-mail-archive/tsvwg/2001-05.mail
[Oti2001b] OTIS, D.: Integrity-Authentication Digest for SCTP, Internet-Draft, June 2001. Work in progress. http://www.watersprings.org/pub/id/draft-otis-sctp-digest-02.txt
[Pat1995] PARTRIDGE, C., HUGHES, J., and STONE, J.: Performance of Checksums and CRCs over Real Data, Proceedings of SIGCOMM '95 Conference, ACM, pages 68-76, August 1995. http://www.acm.org/pubs/articles/proceedings/comm/217382/p68-partridge/p68-partridge.pdf
[Pax1997] PAXSON, V.: End-to-End Internet Packet Dynamics, Proceedings of SIGCOMM '97 Conference, ACM, pages 139-152, September 1997. http://www.acm.org/pubs/articles/proceedings/comm/263105/p139-paxson/p139-paxson.pdf
[Pax2000] PAXSON, V., and ALLMAN, M.: Computing TCP's Retransmission Timer, RFC 2988, November 2000. http://www.ietf.org/rfc/rfc2988.txt
[Pos1980] POSTEL, J. (editor): User Datagram Protocol, RFC 768, August 1980. http://www.ietf.org/rfc/rfc768.txt
[Pos1981a] POSTEL, J. (editor): Internet Protocol, RFC 791, September 1981. http://www.ietf.org/rfc/rfc791.txt
[Pos1981b] POSTEL, J. (editor): Internet Control Message Protocol, RFC 792, September 1981. http://www.ietf.org/rfc/rfc792.txt
[Pos1981c] POSTEL, J. (editor): Transmission Control Protocol, RFC 793, September 1981. http://www.ietf.org/rfc/rfc793.txt
[Pos1983] POSTEL, J., and REYNOLDS, J. K.: Telnet Protocol Specification, RFC 854, May 1983. http://www.ietf.org/rfc/rfc854.txt
[Pos1985] POSTEL, J., and REYNOLDS, J. K.: File Transfer Protocol (FTP), RFC 959, October 1985. http://www.ietf.org/rfc/rfc959.txt
[Pri2001] PRICE, R., HANCOCK, R., McCANN, S., WEST, M. A., SURTEES, A., OLLIS, P., ZHANG, Q., LIAO, H., ZHU, W., and ZHANG, Y.: TCP/IP Compression for ROHC, Internet-Draft, November 2001. Work in progress.
[Ram2001] RAMAKRISHNAN, K. K., FLOYD, S., and BLACK, D. L.: The Addition of Explicit Congestion Notification (ECN) to IP, RFC 3168, September 2001. http://www.ietf.org/rfc/rfc3168.txt
[Rij1994] RIJSINGHANI, A. (editor): Computation of the Internet Checksum via Incremental Update, RFC 1624, May 1994. http://www.ietf.org/rfc/rfc1624.txt
[Riv1992] RIVEST, R. L.: The MD5 Message-Digest Algorithm, RFC 1321, April 1992. http://www.ietf.org/rfc/rfc1321.txt
[Ros2001a] ROSEN, E., VISWANATHAN, A., and CALLON, R.: Multiprotocol Label Switching Architecture, RFC 3031, January 2001. http://www.ietf.org/rfc/rfc3031.txt
[Ros2001b] ROSENBERG, J., SCHULZRINNE, H., and CAMARILLO, G.: SCTP as a transport for SIP, Internet-Draft, November 2001. Work in progress. http://www.watersprings.org/pub/id/draft-ietf-sip-sctp-01.txt
[Rus1998] RUSSELL, T.: Signaling System #7, Second Edition, McGraw-Hill, 1998.
[Sn1998] SÁNCHEZ, D.: Connectionless SCCP over IP Adaptation Layer (CSIP), Internet-Draft, expired May 1999. http://www.watersprings.org/pub/id/draft-sanchez-CSIP-v0r0-00.txt
[Sn1999] SÁNCHEZ, D.: A Simple SCCP Tunneling Protocol (SSTP), Internet-Draft, expired July 1999. http://www.watersprings.org/pub/id/draft-sanchez-garcia-SSTP-v1r0-00.txt
[Sch1996] SCHULZRINNE, H., CASNER, S., FREDERICK, R., and JACOBSON, V.: RTP: A Transport Protocol for Real-Time Applications, RFC 1889, January 1996. http://www.ietf.org/rfc/rfc1889.txt
[Sch1998] SCHULZRINNE, H., RAO, A., and LANPHIER, R.: Real Time Streaming Protocol (RTSP), RFC 2326, April 1998. http://www.ietf.org/rfc/rfc2326.txt
[She2000] SHEPLER, S., CALLAGHAN, B., ROBINSON, D., THURLOW, R., BEAME, C., EISLER, M., and NOVECK, D.: NFS Version 4 Protocol, RFC 3010, December 2000. http://www.ietf.org/rfc/rfc3010.txt
[She2001] SHEINWALD, D., SATRAN, J., THALER, P., CAVANNA, V., and WAKELEY, M.: iSCSI CRC/Checksum Considerations, Internet-Draft, May 2001. Work in progress. http://www.watersprings.org/pub/id/draft-sheinwald-iscsi-crc-00.txt
[Sid2002] SIDEBOTTOM, G., PASTOR-BALBAS, J., RYTINA, I., MOUSSEAU, G., ONG, L., SCHWARZBAUER, H. J., GRADISCHNIG, K., MORNEAULT, K., KALLA, M., GLAUDE, N., BIDULOCK, B., and LOUGHNEY, J.: SS7 MTP3-User Adaptation Layer (M3UA), Internet-Draft, January 2002. Work in progress. http://www.watersprings.org/pub/id/draft-ietf-sigtran-m3ua-11.txt
[Sol1992] SOLLINS, K. R.: The TFTP Protocol (Revision 2), RFC 1350, July 1992. http://www.ietf.org/rfc/rfc1350.txt
[Sri1999] SRISURESH, P., and HOLDREGE, M.: IP Network Address Translator (NAT) Terminology and Considerations, RFC 2663, August 1999. http://www.ietf.org/rfc/rfc2663.txt
[Sri2001] SRISURESH, P., and EGEVANG, K. B.: Traditional IP Network Address Translator (Traditional NAT), RFC 3022, January 2001. http://www.ietf.org/rfc/rfc3022.txt
[Sta1995] STALLINGS, W.: ISDN and Broadband ISDN with Frame Relay and ATM, Third Edition, Prentice-Hall International, 1995.
[Ste1994] STEVENS, W. R.: TCP/IP Illustrated, Volume 1, First Edition, Addison-Wesley Professional Computing Series, 1994.
[Ste1998] STEWART, R. R., and XIE, Q.: Multi-Network Datagram Transmission Protocol, Internet-Draft, expired January 1999. http://www.watersprings.org/pub/id/draft-stewart-xie-mdtp-00.txt
[Ste2000] STEWART, R. R., XIE, Q., MORNEAULT, K., SHARP, C., SCHWARZBAUER, H. J., TAYLOR, T., RYTINA, I., KALLA, M., ZHANG, L., and PAXSON, V.: Stream Control Transmission Protocol, RFC 2960, October 2000. http://www.ietf.org/rfc/rfc2960.txt
[Ste2001a] STEWART, R. R.: [TSVWG] SCTP and Checksums, email sent to the TSVWG distribution list, 4th May 2001. ftp://ftp.ietf.org/ietf-mail-archive/tsvwg/2001-05.mail
[Ste2001b] STEWART, R. R.: Re: [TSVWG] sctp error check, again, email sent to the TSVWG distribution list, 5th June 2001. ftp://ftp.ietf.org/ietf-mail-archive/tsvwg/2001-06.mail
[Ste2001c] STEWART, R. R., and XIE, Q.: Stream Control Transmission Protocol (SCTP), A Reference Guide, First Edition, Addison-Wesley, 2001.
[Ste2002a] STEWART, R. R., STONE, J., and OTIS, D.: SCTP Checksum Change, Internet-Draft, January 2002. Work in progress. http://www.watersprings.org/pub/id/draft-ietf-tsvwg-sctpcsum-02.txt
[Ste2002b] STEWART, R. R., ONG, L., ARIAS-RODRÍGUEZ, I., and POON, K.: SCTP Implementors Guide, Internet-Draft, January 2002. Work in progress. http://www.watersprings.org/pub/id/draft-ietf-tsvwg-sctpimpguide-03.txt
[Ste2002c] STEWART, R. R., RAMALHO, M. A., XIE, Q., TUEXEN, M., RYTINA, I., and CONRAD, P.: SCTP Extensions for Dynamic Reconfiguration of IP Addresses, Internet-Draft, January 2002. Work in progress. http://www.watersprings.org/pub/id/draft-ietf-tsvwg-addip-sctp-04.txt
[Ste2002d] STEWART, R. R., XIE, Q., YARROLL, L., WOOD, J., POON, K., and FUJITA, K.: Sockets API Extensions for SCTP, Internet-Draft, January 2002. Work in progress. http://www.watersprings.org/pub/id/draft-ietf-tsvwg-sctpsocket-03.txt
[Tan1996] TANENBAUM, A. S.: Computer Networks, Third Edition, Prentice-Hall International, 1996.
[Tho1998] THOMSON, S., and NARTEN, T.: IPv6 Stateless Address Autoconfiguration, RFC 2462, December 1998. http://www.ietf.org/rfc/rfc2462.txt
[Ton1999] TONEY, K.: PURDET. Reliable Transport Extensions on UDP, Internet-Draft, expired September 1999. http://www.watersprings.org/pub/id/draft-toney-purdet-00.txt
[Tou1996] TOUCH, J., and PARHAM, B.: Implementing the Internet Checksum in Hardware, RFC 1936, April 1996. http://www.ietf.org/rfc/rfc1936.txt
[Vh2000] VÄHÄ-SIPILÄ, A.: URLs for Telephone Calls, RFC 2806, April 2000. http://www.ietf.org/rfc/rfc2806.txt
[W3C1999] WORLD WIDE WEB CONSORTIUM: HTML 4.01 Specification, W3C Recommendation, December 1999. http://www.w3.org/TR/html401/html40.pdf.gz
[Wil1993] WILLIAMS, R. N.: A Painless Guide to CRC Error Detection Algorithms, Third Version, August 1993. ftp://ftp.rocksoft.com/papers/crc_v3.txt
[Xie2001a] XIE, Q., STEWART, R. R., SHARP, C., and RYTINA, I.: SCTP Unreliable Data Mode Extension, Internet-Draft, expired October 2001. http://www.watersprings.org/pub/id/draft-ietf-tsvwg-usctp-00.txt
[Xie2001b] XIE, Q.: [TSVWG] Not proceeding with U-SCTP, email sent to the TSVWG distribution list, 11th September 2001. ftp://ftp.ietf.org/ietf-mail-archive/tsvwg/2001-09.mail
[Yav2000] YAVATKAR, R., PENDARAKIS, D., and GUERIN, R.: A Framework for Policy-based Admission Control, RFC 2753, January 2000. http://www.ietf.org/rfc/rfc2753.txt
[Yuv1979] YUVAL, G.: How to Swindle Rabin, Cryptologia Magazine, Vol. 3, pages 187-190, July 1979.
[Zwe1990] ZWEIG, J., and PARTRIDGE, C.: TCP Alternate Checksum Options, RFC 1146, March 1990. http://www.ietf.org/rfc/rfc1146.txt
INDEX
A
Adaptive fast retransmit algorithm, 107–8
Adding and deleting addresses, 103–5
Adler-32 Checksum, 116
Advertised receiver window credit, 56
Application Service Element, 17
ASE (see Application Service Element)
Associated signaling mode, 7
Automatic callback, 6

B
Birthday attack, 64
BISUP (see Broadband ISDN User Part)
Blind attack, 56
Broadband ISDN User Part, 14
Bundling, 67
Burst error, 114

C
CCS (see Common Channel Signaling)
Chunks, 39, 43–46
    ABORT chunk, 44
    CANCEL chunk, 87, 106
    CHECKSUM chunk, 119
    Chunk Flags, 46
    Chunk Length, 46
    Chunk Type, 45–46
    COOKIE ACK chunk, 43, 66
    COOKIE ECHO chunk, 43, 65
    CWR chunk, 44
    DATA chunk, 43, 67–88
    ECNE chunk, 44
    ERROR chunk, 44, 92–94
    Fixed Fields, 46
    HEARTBEAT ACK chunk, 44, 89–91
    HEARTBEAT chunk, 44, 89–91
    IETF-defined chunk extensions, 46
    INIT ACK chunk, 43, 55–65
    INIT chunk, 43, 55–65
    SACK chunk, 43, 67–88
    SHUTDOWN ACK chunk, 44
    SHUTDOWN chunk, 44
    SHUTDOWN COMPLETE chunk, 44, 100–101
    Vendor-specific chunks, 45
Circuit-switched network, 7
Common Channel Signaling, 4, 5
Common Open Policy Service, 27
Common Transport Protocol, 32
Congestion avoidance algorithm, 73
Connectionless SCCP over IP Adaptation Layer, 34
Cookie, 63–65
Cookie mechanism, 50, 54–66, 122
COPS (see Common Open Policy Service)
CRC (see Cyclic Redundancy Check)
CRC-16, 43, 112
CRC-32, 117
CRC-CCITT, 114
CSIP (see Connectionless SCCP over IP Adaptation Layer)
CTP (see Common Transport Protocol)
Cumulative TSN Ack, 68

D
Data User Part, 14
DC signaling, 4
Delayed ACK Algorithm, 70
Delayed SACKs, 85
Denial of service, 53
Differentiated Services, 28
DiffServ (see Differentiated Services)
Digital signaling, 5
DUP (see Data User Part)
Duplicate TSNs, 68, 107

E
Error causes, 47, 93–94, 97

F
Fast retransmit algorithm, 74, 107–8, 120
Fletcher Checksum, 115
Fletcher-Adler Checksum, 117
Fragmentation, 79

G
Generator polynomial, 113

H
H.323, 28
H.323 Annex E, 34
Half-closed connection, 98
Half-open connection, 50, 53
Hamming distance, 119
Head-of-line blocking, 33, 58, 77–78
Heartbeat interval, 91
HMAC (see Keyed-Hashing algorithm for Message Authentication)
HOL (see Head-of-line)

I
IANA (see Internet Assigned Numbers Authority)
Idle address, 90
Implementors guide, 119–21
In-band signaling, 4, 5
Initiate Tag, 55
Internet Assigned Numbers Authority, 42
Internet Checksum, 112
Internet Protocol
    ARPANet, 18
    Header, 22
    History, 18–21
    HTTP (see Hypertext Transfer Protocol)
    Hypertext Transfer Protocol, 19
    IP spoofing, 53
    NSFNet, 18
    SCTP over IP, 35
    Voice over IP, 25–31
    VoIP (see Voice over IP)
    World Wide Web, 19
    WWW (see World Wide Web)
Internet telephony, 25–31
Interoperability session, 37, 105, 111
IP (see Internet Protocol)
ISDN Q.921-User Adaptation Layer, 110
ISDN User Part, 17
ISUP (see ISDN User Part)

K
Karn's algorithm, 85
Keepalive mechanism, 89
Keyed-Hashing algorithm for Message Authentication, 63

L
LAPD (see Link Access Procedures on the D-channel)
Link Access Procedures on the D-channel, 110
LNP (see Local Number Portability)
Local Number Portability, 6

M
M2PA (see MTP2-User Peer-to-Peer Adaptation Layer)
M2UA (see MTP2-User Adaptation Layer)
M3UA (see MTP3-User Adaptation Layer)
MAC (see Message Authentication Code)
Maximum Transfer Unit
    Black hole detection, 84
    Discovery, 80–85
MDTP (see Multi-network Datagram Transmission Protocol)
Media Gateway, 31
Media Gateway Controller, 31
Message Authentication Code, 63
Message Digest 5, 63
Message Transfer Part, 14–16
MMUSIC (see Multiparty Multimedia Session Control)
Modified Adler-32 Checksum, 118
MPLS (see Multiprotocol Label Switching Architecture)
MTP (see Message Transfer Part)
MTP2-User Adaptation Layer, 109
MTP2-User Peer-to-Peer Adaptation Layer, 110
MTP3-User Adaptation Layer, 108
MTU (see Maximum Transfer Unit)
Multihoming, 103–5, 122
Multi-network Datagram Transmission Protocol, 34
    Acknowledgement Number, 39
    Biggest message, 39
    Data field, 40
    Data Size, 39
    Endpoint drain procedure, 95
    Establishment procedure, 51–52
    Flags, 39
    Header, 37–40
    In Queue field, 40
    Mode, 39
    Of field, 39
    Part field, 39
    Protocol Identifier field, 38
    Sequence Number, 39
    Termination of an endpoint procedure, 95
    Version field, 39
Multiparty Multimedia Session Control Working Group, 31
Multiprotocol Label Switching Architecture, 28

N
NAKs, 76
NAT (see Network Address Translator)
Network Address Translator, 60–62
Nonassociated signaling mode, 7

O
OOTB (see Out of the blue datagram)
Out of the blue datagram, 96
Out-of-band signaling, 5

P
Packet-switched network, 7
Padding, 46, 47
Parameters, 46–47
    INIT ACK parameters, 59–65
    INIT parameters, 59–65
    Parameter Length, 47
    Parameter Type, 46
Path heartbeat mechanism, 89, 121
Payload Protocol Identifier, 70
Per stream flow control, 103–5
Polynomial Code, 113
Primary Address, 104
Pseudo header, 112
PURDET, 34

Q
Q.921, 110
Q.931, 110
QoS (see Quality of Service)
Quality of Service, 27

R
RAP (see Resource Allocation Protocol)
Real Time Protocol, 28, 34
Real Time Streaming Protocol, 28
Reference implementation, 124
Reliable request procedure, 104
Reliable UDP, 34
Reordering of packets, 107
Resource Allocation Protocol, 27
Resource Reservation Protocol, 27
Retransmission Time-Out, 73, 85–87
Round Trip Time, 85–87
Round Trip Time Variation, 87
RSVP (see Resource Reservation Protocol)
RTO (see Retransmission Time-Out)
RTP (see Real Time Protocol)
RTSP (see Real Time Streaming Protocol)
RTT (see Round Trip Time)
RTTVAR (see Round Trip Time Variation)
RUDP (see Reliable UDP)

S
SCCP (see Signaling Connection Control Part)
SCCP-User Adaptation Layer, 108
Secure Hash Standard 1, 63
Sequence number attack, 56
Service Specific Connection-Oriented Protocol, 34
Session Announcement Protocol, 28
Session Description Protocol, 28
Session Initiation Protocol, 28
SHA-1 (see Secure Hash Standard 1)
Signaling Connection Control Part, 16
Signaling Gateway, 31
Signaling System #7
    Functional architecture, 7–13
    Global title translation, 10, 16
    International plane, 8
    Linkset, 11
    Message discrimination, 7
    National Plane, 8
    Protocol architecture, 13–17
    SCP (see Service Control Point)
    Screening, 10
    Service Control Point, 11
    Service Switching Point, 9
    Signaling Link, 8, 11–13
    Signaling Point, 7
    Signaling Transfer Point, 9–10
    SSP (see Service Switching Point)
    STP (see Signaling Transfer Point)
Signaling Transport Working Group, 31, 48, 70, 122
SIGTRAN (see Signaling Transport Working Group)
Simple SCCP Tunneling Protocol, 34
Slow start algorithm, 73
Smoothed Round Trip Time, 86
Socket interface, 123
SP (see Signaling Point)
SRTT (see Smoothed Round Trip Time)
SS7 (see Signaling System #7)
SSN (see Stream Sequence Number)
SSTP (see Simple SCCP Tunneling Protocol)
Stream Control Transmission Protocol
    Checksum, 43, 111–19
    Common header, 40–43
    Congestion avoidance algorithms, 70–76
    Cookie mechanism, 35
    Defects found, 36, 75, 111–21
    Destination Port Number, 41
    Establishment procedure, 51
    Extensibility features, 36, 47–48
    Primary address, 60
    Protocol number, 38
    Public implementation, 125, 127
    Source Port Number, 41
    State diagram, 48–50
    TCB (see Transmission Control Block)
    Transmission Control Block, 59
    Verification Tag, 42, 55
Stream Sequence Number, 70
Streams, 33, 58, 76–80, 87, 122
SUA (see SCCP-User Adaptation Layer)

T
T/UDP (see UDP for TCAP)
TCAP (see Transaction Capabilities Application Part)
TCP (see Transmission Control Protocol)
TCP-32 Checksum, 115
Telephone User Part, 14
Tie-Tags, 64, 120
TLV (see Type-Length-Value structure)
Transaction Capabilities Application Part, 16–17
Transmission Control Protocol
    Checksum, 112
    Extensibility problems, 44–45
    Problems to become the Common Transport Protocol, 33
    SYN attack, 33, 52–54
    TIME WAIT state, 98
    Timestamps, 85
Transmission Sequence Number, 67
Transport Area Working Group, 36, 48, 68
Transport Layer Security, 110
TSN (see Transmission Sequence Number)
TSVWG (see Transport Area Working Group)
TUP (see Telephone User Part)
Two-army problem, 100
Type-Length-Value structure, 43

U
UDP for TCAP, 34
Unreliable SCTP, 105

W
WATS (see Wide Area Telephone Service)
Wide Area Telephone Service, 6