
2010/02/15 {C} Herbert Haas


Communication Basics
Principles and Dogmas
In this chapter we discuss basic communication issues, such as synchronization,
coding, scrambling, modulation, and so on.

"Everything should be made as simple as possible, ...but not simpler."
Albert Einstein

Information
What is information?
- Carried by symbols
- Recognized by receiver (hopefully)
- Interpretation is the key.

What is information? This question may sound easy, but think about it for a bit.
Obviously we need symbols to represent information. But these symbols must
also be recognized as symbols by the receiver. In fact, philosophical considerations
conclude that information can only be defined through a receiver. The same
problem exists with art. What is art? Each decade and century had its own
definition. Today most critics use a general definition: art can only be defined in
context with the viewer.
In the following chapters on data communication we will deal with symbols
representing information. A symbol is not a 0 or a 1, but binary information
can be represented by symbols. Be patient...

Symbols
- Symbols (may) represent information
  - Voice patterns (Speech)
  - Sign language, Pictograms
  - Scripture
  - Voltage levels
  - Light pulses
(Figure: Blue Whale sonograms)
What is a good information source? From a theoretical point of view a random
pattern is the best because you never know what comes next. On the other hand,
if you receive a continuous stream of the same symbol, this would be boring. More
than boring: there is no information in it, because you can predict what comes
next! From this we conclude that a sophisticated coding, which represents the
information as efficiently as possible using symbols, is a critical step in the
communication process.
Throughout these chapters we will mainly deal with symbols such as voltage
levels or light pulses.
Look at the Blue Whale sonograms. The x-axis represents time, the y-axis
frequency, and the color represents power density. This communication pattern is
very complex (that of dolphins is even more complex). It is known that each
herd has its own traditional hymn. And: they like to communicate!

Symbols on Wire
- Discrete voltage levels = "Digital"
  - Resistant against noise
- How many levels?
  - Binary (easiest)
  - M-ary: More information per time unit!
(Figure: a binary signal next to an M-ary signal, here 4 levels, e.g. ISDN)
What symbols do we encounter on wire? Digital binary symbols are commonly
known and widely in use. Why? Consider information transmissions in groups of
symbols (for example, a group of 8 binary symbols is called a byte). We have
two parameters: the number base B and the group order C. A group can then
represent B^C different values. If you calculate the "costs" that you get for
arbitrary variations of B and C, and if we assume a linear progression (so that
cost = k*B*C), then for any given (constant) cost the perfect base, the one that
represents the most values, would be B = e, that is B = 2.7182...
In other words: the perfect base is a number between 2 and 3. The technically
easiest solution is to use B = 2. Note that these considerations assume a linear
cost progression.
In many cases we pay the price of higher effort and use a larger base. This leads
us to M-ary symbols and later to PAM and QAM.
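The claim that the ideal base lies between 2 and 3 can be checked numerically. The following is only a sketch under the linear cost model from the notes (cost = k*B*C); the function names and the example cost of 24 are assumptions made for illustration. For a fixed cost it computes how many values B^C a group can represent and searches a grid of candidate bases for the best one.

```python
import math

def capacity(base, cost, k=1.0):
    """Number of representable values B**C when the cost k*B*C is fixed."""
    group_order = cost / (k * base)   # C follows from the fixed cost
    return base ** group_order

def best_base(cost, candidates):
    """Return the candidate base that represents the most values."""
    return max(candidates, key=lambda b: capacity(b, cost))

cost = 24.0                                        # arbitrary example budget
candidates = [b / 100 for b in range(150, 500)]    # 1.50 ... 4.99
best = best_base(cost, candidates)

print(capacity(2, cost), capacity(3, cost), capacity(4, cost))
print(best, math.e)
```

Under this model base 3 would represent more values than base 2 for the same cost, and base 4 ties with base 2; binary wins in practice because it is the easiest to build, as the notes say.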

Synchronization
- Sender sends symbol after symbol...
- When should the receiver pick the signal samples?
  => Receiver must sync with the sender's clock!
(Figure: one signal with several possible sets of sampling instants and the
resulting interpretations, e.g. 00001100110, 000100111111, 001010010111; only
one of them is correct)
One of the most important issues in communication is that of
synchronization. Nature forbids absolute synchronization of clocks. Suppose you
are a receiver and you see alternating voltage levels on your receiving interface. If
you had no idea about the sending clock, you would never be able to
interpret the symbols correctly. When do you take a sample?

Synchronization
- In reality, two independent clocks are NEVER precisely synchronous
  - We always have a frequency shift
  - But we must also care for phase shifts
(Figure: effect of a worst-case phase shift and of different clock frequencies
on sampling the bit pattern 001010011110)
So we must assume that the receiver's clock is only approximately identical to the
sender's clock. At the very least we must deal with small phase and frequency gaps.
As you can see in the slide above, we still cannot be sure when to take samples.
What we need is some kind of synchronization method.

Serial vs Parallel
- Parallel transmission
  - Multiple data wires (fast)
  - Explicit clocking wire
  - Simple synchronization but not cost-effective
  - Only useful for small distances
- Serial transmission
  - Only one wire (-pair)
  - No clocking wire
  - Most important for data communication

In case of parallel transmission there is always a dedicated clock line. This is a
very comfortable synchronization method. A symbol pattern on the data lines
is sampled by the receiver each time a clock pulse is observed on the
clock line. Unfortunately, parallel transmission is too costly on long links.
In LAN and WAN data communication there are practically no parallel lines.
The most important transmission technique is serial transmission. Data is
transmitted over a single fiber or wire pair (or electromagnetic wave). There is
no clock line. How do we synchronize sender and receiver?

Asynchronous Transmission
- Independent clocks
  - Oversampling: much faster than the bit rate
  - Only the phase is synchronized
- Using start bits and stop bits
- Variable intervals between characters
  - Synchronism only during transmission
- Inefficient
(Figure: three characters separated by variable intervals; each character begins
with a start edge and a start bit and ends with stop bits)
One synchronization method is Asynchronous Transmission. Actually this
method cannot provide real synchronization (hence the name), but at least a
short-time quasi-synchronization is possible. The idea is to frame data symbols using
start and stop symbols (let's sloppily call them start and stop bits). Using
oversampling, the receiver is able to take a sample approximately in the middle of
each bit, but only for short bit sequences.
Asynchronous transmission is typically found in older character-oriented
technologies.
Example application: RS-232C
Relative overhead: 3/11
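The overhead figure can be reproduced with a small sketch. The notes do not spell out the frame layout, so the breakdown used here (1 start bit, 8 data bits, 2 stop bits) is an assumption that happens to give 3/11; the function names are hypothetical. Data bits are sent least significant bit first, as is usual on RS-232 style serial lines.

```python
from fractions import Fraction

def frame_character(byte, data_bits=8, stop_bits=2):
    """Line bits for one character: start bit (0), data LSB first, stop bits (1)."""
    data = [(byte >> i) & 1 for i in range(data_bits)]
    return [0] + data + [1] * stop_bits

def relative_overhead(data_bits=8, start_bits=1, stop_bits=2):
    """Share of transmitted bits that carry framing instead of data."""
    framing = start_bits + stop_bits
    return Fraction(framing, framing + data_bits)

frame = frame_character(0x41)      # ASCII 'A'
print(frame, len(frame))
print(relative_overhead())
```

With a single stop bit the same calculation gives 2/10, which shows how strongly the framing choice drives the efficiency of character-oriented links.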

Synchronous Transmission
- Synchronized clocks
  - Most important today!
  - Phase and frequency synchronized
- Receiver uses a Phase-Locked Loop (PLL) control circuit
  - Requires frequent signal changes
  => Coding or scrambling of data necessary to avoid long sequences without
     signal changes
- Continuous data stream possible
  - Large frames possible (theoretically endless)
  - Receiver remains synchronized
  - Typically each frame starts with a short "training sequence" aka
    "preamble" (e.g. 64 bits)
The most important method is Synchronous Transmission. Don't confuse this
with synchronous multiplexing: we are still on the physical layer! Two things are
necessary: a control circuit called a Phase-Locked Loop (PLL) and a signal that
contains frequent transitions. How do we ensure frequent transitions in our data
stream? There are two possibilities: coding and scrambling our data.
Synchronous Transmission is found in most modern bit-oriented technologies,
that is, in nearly anything you know.

Line Coding
(Figure: the bit sequence 1 0 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
encoded with NRZ, RZ, Manchester, Differential Manchester, NRZI, AMI and HDB3;
code violations are marked)
The trivial code is Non Return to Zero (NRZ), which is usually the naive human
approach.
Remarks:
RZ codes might also use a negative level for logical zeroes, a positive level for
logical ones, and a zero volt level in between to return to. RZ is for example used
in optical transmissions (simple modulation).
NRZI codes modulate either the logical ones or the logical zeroes. In this slide we
modulate the zeroes, that is, each logical zero requires a transition at the
beginning of the interval.
NRZI means Non Return to Zero Inverted (or Interchanged).
B8ZS: same as bipolar AMI, except that any string of eight zeros is replaced by a
string with two code violations.
Manchester is used with 10 Mbit/s Ethernet. Token Ring utilizes Differential
Manchester. Telco backbones (PDH technology) use AMI (USA) or HDB3
(Europe). Of course there are many other coding styles.
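To see why some of these codes help the receiver's clock recovery, the following sketch encodes bit sequences and measures the longest stretch without a level change. It is an illustration only: levels are modelled as +1/-1, the Manchester polarity (1 = low then high) follows the common IEEE 802.3 convention and is an assumption here, and the NRZI variant toggles on zeroes as described in the remarks above. All function names are hypothetical.

```python
def nrz(bits):
    """NRZ: one level per bit, 1 -> +1, 0 -> -1."""
    return [+1 if b else -1 for b in bits]

def manchester(bits):
    """Manchester: two half-bit levels per bit, always a mid-bit transition."""
    out = []
    for b in bits:
        out += [-1, +1] if b else [+1, -1]
    return out

def nrzi_on_zero(bits, level=+1):
    """NRZI variant from the slide: toggle at the start of each 0, hold on 1."""
    out = []
    for b in bits:
        if b == 0:
            level = -level
        out.append(level)
    return out

def longest_run(levels):
    """Length of the longest stretch of identical consecutive levels."""
    best = run = 0
    prev = None
    for v in levels:
        run = run + 1 if v == prev else 1
        best = max(best, run)
        prev = v
    return best

ones = [1] * 10
print(longest_run(nrz(ones)), longest_run(manchester(ones)))
```

A long run of ones leaves NRZ (and this NRZI variant) without any transition, while Manchester never goes more than two half-bit periods without one, at the price of doubling the signalling rate.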

Power Spectral Density
(Figure: spectral density versus normalized frequency f/R, from 0.5 to 2.0, for
NRZ/NRZI, HDB3, AMI, and Manchester/Differential Manchester)
The slide above compares the power density distribution of some of the codes
mentioned before. Obviously the code must match the spectral characteristics of
the transmission channel.
Note that these codes are still kinds of baseband transmission. Each one can be
modulated onto a carrier signal at a higher frequency to comply with a specific
channel characteristic.

Scrambling Example
(Figure: scrambler and descrambler, each built from a chain of 7 one-bit shift
register stages (T) with taps at t(n-4) and t(n-7), connected through the
channel; input s(n), channel signal t(n), output s(n))
Example:
Feedback polynomial = 1 + x^4 + x^7
Period length = 127 bit
Another method to guarantee frequent transitions is scrambling. Scramblers are
used with ATM and SONET/SDH, for example.
The feedback polynomial above can be written as
t(n) = s(n) XOR t(n-4) XOR t(n-7)
The descrambler recalculates the original pattern with the same function (exchange
s(n) and t(n)).
Period length = 2^R - 1, where R is the number of shift registers.
That is, even a single 1 on the input (with all registers set to 0) will produce a
127-bit pseudo-random sequence.
This scrambler is used with 802.11b (Wireless LAN).
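The recurrence t(n) = s(n) XOR t(n-4) XOR t(n-7) can be tried out directly. The sketch below is a minimal model with all registers initially 0; the function names and the use of a deque as shift register are choices made for this illustration, not taken from any standard's text.

```python
from collections import deque

def scramble(bits):
    """t(n) = s(n) XOR t(n-4) XOR t(n-7), registers initially 0."""
    reg = deque([0] * 7, maxlen=7)      # reg[0] = t(n-1) ... reg[6] = t(n-7)
    out = []
    for s in bits:
        t = s ^ reg[3] ^ reg[6]
        out.append(t)
        reg.appendleft(t)               # shift the transmitted bit in
    return out

def descramble(bits):
    """s(n) = t(n) XOR t(n-4) XOR t(n-7), fed by the received bits."""
    reg = deque([0] * 7, maxlen=7)
    out = []
    for t in bits:
        out.append(t ^ reg[3] ^ reg[6])
        reg.appendleft(t)               # shift the received bit in
    return out

impulse = [1] + [0] * 253               # a single 1 followed by zeroes
sequence = scramble(impulse)
print(sequence[:20])
print(descramble(sequence) == impulse)
```

Because the descrambler's registers are filled from the received stream, it needs no separate synchronization: after seven received bits its state matches that of the scrambler.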

Transmission System Overview
(Figure: transmission chain. Blocks shown: Information Source (10110001...),
Source Coding (filter unnecessary bits, compression), Channel Coding (FCS and
FEC, checksum), Line Coding (NRZ, RZ, HDB3, AMI, ...), Modulation
(band-limited pulses), channel with signal and noise, Demodulator, Filter,
Equalizer, Descrambler, Error Detection, Source Decoding, Information
Interpreter. The chain is divided into a DIGITAL and an ANALOGUE part.)
Not all coding is the same. The above slide gives you an overview of the different
purposes of coding. Even modulation is sometimes called coding.
Source coding tries to eliminate redundancy within the information. Source
coders must know the type of information that is delivered by the source
well.
Channel coding protects the non-redundant data stream by adding calculated
overhead. Typically a Frame Check Sequence (FCS) is added. Only on very
error-prone and/or long-delay links might a Forward Error Correction (FEC)
method be useful. FEC requires too much overhead in most terrestrial applications.
Line coding focuses on the line, that is, we want the symbols to be received
correctly even if noise and distortions are present. Furthermore, line coding
provides clock synchronization as discussed earlier.
Finally, modulation might be necessary in case the channel has better properties at
higher frequencies.

Communication Channels
- Usually low-pass behavior
  - Higher frequencies are more attenuated than lower ones
- Baseband transmission
  - Signal without a dedicated carrier
  - Example: LAN technologies (Ethernet etc)
- Carrierband transmission
  - The baseband signal modulates a carrier to match special channel properties
  - Medium can be shared by many users (different carriers), e.g. WLAN
Each communication channel exhibits a low-pass behavior, at least beyond a
very high frequency. Not only is the signal attenuated; phase shifts occur and
even nonlinear effects sometimes increase with higher frequencies. The result is a
smeared signal with little energy.
In most cases the signals do not need to be modulated onto a carrier. That is, all
the channel bandwidth can be used for this signal. We call this baseband
transmission.
Carrierband transmission puts the baseband signal onto a carrier with a higher
frequency. This is necessary with radio transmission because low frequencies
have very bad radiation characteristics. Another example is fiber optics, where
certain signal frequencies are significantly more attenuated and scattered than
others.

Channel Utilization Examples
(Figure: three plots of power density versus frequency: baseband transmission;
multiple carriers at f_c1, f_c2, f_c3; telephone channel from 0.3 to 3.4 kHz)
The above slide shows some examples of baseband and carrierband
transmission. In case we use multiple carriers we may also call it broadband
transmission.
The third picture (bottom of slide) shows the spectral characteristic of a telephony
channel (signal). The ITU-T defined an "attenuation hose" in great detail
(dynamics, ripples, edge frequencies, etc). As a rule of thumb we can expect low
attenuation between 300 Hz and 3400 Hz.

Maximal Signal Rate
- Maximal data rate is proportional to the channel bandwidth B
  - Rise time of a Heaviside step: T = 1/(2B)
  - So the maximum rate is R = 2B, also called the Nyquist rate
  - Note: We assume an ideal channel here, without noise!
- Bandwidth decreases with cable length
  - As a dirty rule of thumb: BW x Length ≈ const
  - But note that reality is much more complex
  - Solitons are remarkable exceptions.
(Figure: a step from 0 to 1 with rise time (2B)^-1. Maximum signal rate: at
least the amplitude must be reached)
Since each channel is a low-pass, and some channels even damp (very) low
frequencies, data can only be transmitted within a certain channel bandwidth B.
If we put a 0 to 1 transition on the line (with ideally zero transition time), the
receiver will see a slope with a rise time of T = 1/(2B).
So the maximal signal rate is R = 1/T = 2B in theory. In practice we need some
margin because there are noise, distortion, and imperfect devices.
The longer the cable, the more pronounced the low-pass behaviour. In other
words: on the same cable type we can transmit (let's say) 1,000,000,000 bits/s if
the cable is one meter in length, or only 1 bit/s if the cable is one million
kilometers in length.
It is interesting to note that some modern fiber optic transmission
methods violate this basic law. These methods are based on so-called soliton
transmission.

The Maximum Information Rate
- What about a real channel? What's the maximum achievable information rate in
  presence of noise?
- Answer by C. E. Shannon in 1948
  - Even when noise is present, information can be transmitted without errors
    when the information rate is below the channel capacity
- Channel capacity depends only on channel bandwidth AND SNR
  - Example: AWGN channel
    C = B log2(1 + S/N)
The great information theory guru Claude E. Shannon made a great discovery in
1948. Before 1948, it was commonly assumed that there is no way to guarantee
an error-free transmission over a noisy channel. However, Shannon showed that
transmission without errors is possible when the information rate is below the
so-called channel capacity, which depends on bandwidth and signal-to-noise ratio.
This discovery is regarded as one of the most important achievements in
communication theory.
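As a worked example of the capacity formula for the AWGN channel, the sketch below evaluates C = B log2(1 + S/N) with the logarithm to base 2, so that C comes out in bit/s. The bandwidth of 3100 Hz corresponds to the 300 to 3400 Hz telephone channel mentioned earlier; the SNR of 30 dB is an assumed value chosen only for illustration, and the function name is hypothetical.

```python
import math

def shannon_capacity(bandwidth_hz, snr_db):
    """Channel capacity in bit/s for an AWGN channel: C = B * log2(1 + S/N)."""
    snr_linear = 10 ** (snr_db / 10)      # convert dB to a power ratio
    return bandwidth_hz * math.log2(1 + snr_linear)

telephone = shannon_capacity(3100, 30)    # assumed 30 dB signal-to-noise ratio
print(round(telephone))                   # roughly 31 kbit/s
```

Doubling the bandwidth doubles the capacity, whereas doubling the signal power adds at most one bit per second per hertz, which is why bandwidth is usually the more valuable resource.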

Bitrate vs Baud
- Information rate: bit/s
- Symbol rate: Baud
- The goal is to send many (= as many as possible) bits per symbol
  => QAM (see next slides)
(Figure: a binary signal carries N bit/s at N Baud; a signal with four levels
(00, 01, 10, 11) carries 2N bit/s at the same N Baud)
Baud is named after the 19th century French inventor Baudot and originally referred
to the speed at which a telegrapher could send Morse code.
Today the symbol rate is measured in Baud, whereas the information rate is
measured in bit/s.
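The relation between the two rates is bit rate = symbol rate x bits per symbol, where an alphabet of M symbols carries log2(M) bits per symbol. A minimal sketch follows; the function name is hypothetical and M is restricted to powers of two to keep the arithmetic exact.

```python
def bit_rate(baud, symbols):
    """Information rate in bit/s for a symbol rate in Baud and M symbols."""
    if symbols < 2 or symbols & (symbols - 1):
        raise ValueError("symbols must be a power of two >= 2")
    bits_per_symbol = symbols.bit_length() - 1    # log2(M) for powers of two
    return baud * bits_per_symbol

for m in (2, 4, 8, 16):
    print(m, bit_rate(2400, m))
```

At 2400 Baud, alphabets of 4, 8 and 16 symbols give 4800, 7200 and 9600 bit/s, while the symbol rate on the line stays the same.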

Analogue Modulation Overview
(Figure: the bit sequence 1 0 1 shown over time t for Amplitude Shift Keying
(ASK), Frequency Shift Keying (FSK) and Phase Shift Keying (PSK))
g(t) = A_t cos(2π f_t t + φ_t)
These three parameters can be modulated
- EVERY transmission is analogue, but there are different methods to put a
  baseband signal onto a high-frequency carrier
- The simplest (and oldest) is ASK
- The illustrated ASK method is simple "On-Off Keying" (OOK)
- FSK and PSK are called "angle modulation" methods (nonlinear => spectrum
  shape is changed!)
- For digital transmission, QAM is almost always used
- The BER of BPSK is 3 dB better than that of simple OOK
The slide shows a general modulation equation. The 3 parameters of the equation
describe the 3 basic modulation types. All 3 parameters, the amplitude A_t, the
frequency f_t and the phase φ_t, can be varied, even simultaneously. In nature,
there is no real digital transmission; the binary data stream needs to be converted
into an analog signal. As a first step, the digital data is "directly" transformed
into an analog signal (0 or 1), which is called a baseband signal. In order to
utilize transmission media such as free space (or cables and fibers), the baseband
signal must be mixed with a carrier signal. This analog modulation shifts the
center frequency of the baseband signal to the carrier frequency to optimize the
transmission for a given attenuation/propagation characteristic.
Amplitude Shift Keying
A binary 1 or 0 is represented by different amplitudes of a sine oscillation.
Amplitude Shift Keying (ASK) requires less bandwidth than FSK or PSK since
natura non facit saltus. However, ASK is prone to interference. This modulation
type is also used with infrared-based WLAN.
Frequency Shift Keying
Frequency Shift Keying (FSK) is often used for wireless communication. Different
logical signals are represented by different frequencies. This method needs more
bandwidth but is more robust against interference. To avoid phase jumps, FSK
uses advanced frequency modulators (Continuous Phase Modulation, CPM).
Phase Shift Keying
The third basic modulation method is Phase Shift Keying (PSK). The digital signal
is coded through phase jumps. In the picture above you see the simplest variation
of PSK, using phase jumps of 180°. In practice, to reduce bandwidth, phase jumps
must be minimized, and therefore PSK is implemented using advanced phase
modulators (e.g. Gaussian Minimum Shift Keying, etc). The receiver must use the
same frequency and must be perfectly synchronized with the sender using a
Phase-Locked Loop (PLL) circuit. PSK is more robust against interference than
FSK, but needs complex devices.
With these modulation methods understood, QAM can be introduced, which is the
most important modulation scheme today for both wired and wireless transmission
lines.


QAM: Idea
"Quadrature Amplitude Modulation"
Idea:
1. Separate the bits into groups (words), e.g. of 6 bits in case of QAM-64
2. Assign a dedicated pair of amplitude and phase (A, φ) to each word
3. Create the complex amplitude A e^(jφ)
4. Create the signal Re{A e^(jφ) e^(jωt)} = A (cos φ cos ωt - sin φ sin ωt),
   which represents one (of the 64) QAM symbols
5. Receiver can reconstruct (A, φ)

QAM: Symbol Diagrams
(Figure: symbol diagrams in the I/Q plane. Standard PSK: two symbols, 1 and 0.
Quadrature PSK (QPSK): four symbols, 10 and 11 in the upper half, 00 and 01 in
the lower half. 16-QAM. Other example: Modem V.29, plotted over Re{U_i} and
Im{U_i} with amplitudes 1V, 3V, 5V; 2400 Baud, max. 9600 bit/s; 4800 bit/s for
noisy and distorted channels, 7200 bit/s for better channels, 9600 bit/s for even
better channels)
Worth knowing: simple Phase Shift Keying (PSK) uses only two symbols,
representing 0 and 1 respectively. Quadrature Phase Shift Keying (QPSK) uses
four symbols.
Usually the assignment of bit words to symbols is such that the error probability
due to noise is minimized. For example, a Gray code may be used between
adjacent symbols to minimize the number of wrong bits when an adjacent symbol
is detected by the receiver.
The above slide also shows the symbol distribution over the complex plane for
the V.29 protocol, which is/was used by modems. Depending on the noise power
of the channel, different sets of symbols are used.
14,400 bit/s requires 64 points
28,800 bit/s requires 128 points
Example QAM Applications
- One symbol represents a bit pattern
  - Given N symbols, each represents ld(N) = log2(N) bits
- Modems, 1000BaseT (Gigabit Ethernet), WiMAX, GSM, ...
- WLAN 802.11a and 802.11g:
  - BPSK @ 6 and 9 Mbps
  - QPSK @ 12 and 18 Mbps
  - 16-QAM @ 24 and 36 Mbps
  - 64-QAM @ 48 and 54 Mbps

It is important to understand that spread spectrum (or OFDM) techniques are always
combined with a symbol modulation scheme. Quadrature Amplitude Modulation (QAM)
is a general method from which practical methods such as BPSK, QPSK, etc are
derived.
The main idea of QAM is to combine phase and amplitude shift keying. Since
orthogonal functions (sine and cosine) are used as carriers, they can be modulated
separately, combined into a single signal, and (due to the orthogonality property)
separated again by the receiver.
And since A*cos(wt + phi) = A*{cos(wt)cos(phi) - sin(wt)sin(phi)}, QAM can be
easily represented in the complex domain as Real{A*exp(i*phi)*exp(i*wt)}.
The standard PSK method only uses phase jumps of 0° or 180° to describe a binary
0 or 1. In the right picture above you see an enhanced PSK method, Quadrature
PSK (QPSK). With QPSK each state (phase shift) represents 2 bits instead of 1.
Now it is possible to transfer the same data rate with half the bandwidth.
The QPSK signal uses (relative to the reference signal)
- 45° for a data value of 11
- 135° for a data value of 10
- 225° for a data value of 00
- 315° for a data value of 01
To reconstruct the original data stream the receiver needs to compare the incoming
signal with the reference signal. Synchronization is very important.
Why not code more bits per phase jump?
Especially in mobile communication there is too much interference and noise to
decode correctly. The more bits you use per phase jump, the closer together the
signal states get, until it becomes impossible to reconstruct the original data
stream. In wireless communication the QPSK method has proven to be
a robust and efficient technique.
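The phase table above can be turned into a small modulator and demodulator working on complex baseband symbols. This is a simplified sketch: unit amplitude, an ideal phase reference at the receiver, and the function names are all assumptions made for the example.

```python
import cmath
import math

# Phase in degrees for each pair of bits, as listed in the notes
QPSK_PHASE_DEG = {(1, 1): 45, (1, 0): 135, (0, 0): 225, (0, 1): 315}

def qpsk_modulate(bits):
    """Map pairs of bits to unit-amplitude complex symbols."""
    if len(bits) % 2:
        raise ValueError("need an even number of bits")
    return [cmath.rect(1.0, math.radians(QPSK_PHASE_DEG[(bits[i], bits[i + 1])]))
            for i in range(0, len(bits), 2)]

def qpsk_demodulate(symbols):
    """Decide by quadrant: upper half-plane -> first bit 1, right half -> second bit 1."""
    bits = []
    for z in symbols:
        bits += [1 if z.imag > 0 else 0, 1 if z.real > 0 else 0]
    return bits

data = [1, 1, 1, 0, 0, 0, 0, 1]
symbols = qpsk_modulate(data)
rotated = [z * cmath.rect(1.0, math.radians(20)) for z in symbols]  # phase error
print(qpsk_demodulate(symbols) == data, qpsk_demodulate(rotated) == data)
```

A phase error of 20° is still decoded correctly because each decision region spans 90°; with more phase states per symbol the regions shrink, which is the robustness trade-off described above.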

QAM Example Symbols (1)
(Figure: example QAM signals)
Note that the above QAM signals show different successive QAM symbols for
illustration purposes. In reality each symbol is transmitted many
hundred/thousand times.

QAM Example Symbols (2)
(Figure: further example QAM signals)
These diagrams have been generated using Octave, a free Matlab clone.

"The biggest problem with communication is the illusion that it has occurred."
Married?

2009/08/12 {C} Herbert Haas
Network Layers
Standardization Cruelty
This chapter introduces the layer concept widely used in data communication.
Most famous is the ISO/OSI 7-layer model, which is also discussed in great
detail here. Along the way, the interaction of layering and standardization is
explained.

"The good thing about standards is that there are so many to choose from."
Andrew S. Tanenbaum

Standards
- We need networking standards
  - Ensure interoperability
  - Large market, lower cost (mass production)
- Vendors need standards
  - Good for marketing
- Vendors create standards
  - Bad for competitors, hard to catch up
- But: Slow standardization processes freeze technology...
We need standards. Unfortunately. Otherwise, each vendor would create whatever
it wants and we would not be able to communicate across networks. This
situation has occurred very often in history. For example, the United Nations
initiated a world-wide telephony standardization board, known as CCITT
(today ITU-T). Or in the pre-Ethernet age, many vendors built completely
incompatible LAN protocols.
Specifically to enforce interoperability, many vendors of Internet equipment
initiated the TCP/IP Interoperability Conference in 1987, today known as
"INTEROP".

Who Defines Standards?
- ISO - Anything
- IETF - Internet
- ITU-T - Telco Technologies
- ATM Forum
- Frame Relay Forum
- IEEE - LAN Protocols
The above slide mentions the most important standardization organizations.
The Internet Engineering Task Force (IETF) is "actually" the most important
technical organization for the Internet working groups and is organized in
several areas. The area managers and the IETF chairman form the IESG (Internet
Engineering Steering Group). The IETF is also responsible for maintaining the
RFCs.

Standards Types
- De facto standards
  - Anyone can create them
  - E.g. Internet RFCs
- De jure standards
  - Created by a standardization organization
  - E.g. ISO/OSI, ITU-T

Not all standards are alike. De facto standards are more flexible and
speed up implementation. Usually everybody is allowed to extend them.
The whole Internet is built on such loose standards (RFCs). Unfortunately,
misinterpretations can occur.
De jure standards are like acts of law. For example, ITU-T standards explain
nearly every detail implementers may ask about.

Note
Standardization is applied to network layers and interfaces between them
The above sentence leads us to network layers. Break big problems into smaller
ones and write standards for them ("divide and conquer"). Of course the
interfaces between the layers must be standardized too. As a result, multiple
developers can work on different parts of the whole story.

Network Layers
- Divide the task of communication into multiple sub-tasks
- Hierarchically organized
  - Each layer receives services from the layer below
  - Each layer serves the layer above
- Good for interoperability
  - Encapsulated entities and interfaces
- But increases complexity
Network layers are an abstraction to hide complexity. Layers are organized
hierarchically, that is, there is a predefined command direction. Imagine what
would happen if we had a democratic model!
Note that network layers force a more complex development. Many
high-performance communication technologies have been developed in an ad-hoc
manner, or alternatively consist of only a few layers.

Where to Define Layers
- Group functions (services) together
- When changes in technology occur
- To expose services
- To allow changes in protocol and HW
- To utilize existing protocols and HW
A good layering structure requires an intelligent grouping of functions. Ideally,
technology improvements can be implemented immediately.
For example, the X.25 packetizing algorithm, which is written in software and is
part of a network driver of the operating system, can remain untouched while
the serial line hardware is updated, and vice versa.

The ISO/OSI Model
- International Organization for Standardization (ISO)
  - International agency for the development of standards in many areas
  - Founded 1946
  - Currently 89 member countries
  - More than 5000 standards until today
- 1988 US Government OSI Profile (GOSIP)
  - Requires Government products to support OSI layering
The ISO has standardized anything: character sets, paper sizes, screws, ..., and
network layers. In 1988 the US Government required any communication
device to comply with the ISO/OSI model (GOSIP). Note that the non-OSI
Internet was built much earlier, so many people expected the end of the
Internet. But the Internet (which was created to be nuclear-bomb resistant) not
only survived the ISO/OSI model but also displaced many OSI-compliant
protocols, such as CLNP.
Similarly, in Europe the "European Procurement Handbook for Open
Systems" (EPHOS) was released.

Purpose
- OSI model describes communication services and protocols
- No assumption about
  - Operating system
  - Programming language
- Practically, the OSI model
  - Organizes knowledge
  - Provides a common discussion base

Although every book on data communication mentions the ISO/OSI 7-layer
model, it is not that important in the real world: most technologies do not
comply with this model. It is merely a reference model that we can refer to
when we want to explain certain functions in our protocols. From this point of
view the OSI model is indeed important today.

OSI Basics
- Point-to-point, no shared media
- Nodes are called
  - End Systems (ES)
  - Intermediate Systems (IS)
- Each layer of the OSI model detects and handles errors (FCS)
- Dumb hosts and intelligent network
  - Compared with the Internet: dumb network, intelligent hosts
The original OSI model was created for point-to-point connections only (for
example, there was originally no specification for LAN-like shared media).

The OSI Truth
- OSI model was created before the protocols
  - Good: Not biased, general approach
  - Bad: Designers had little experience, no idea in which layers to put which
    functionality...
- Not widespread (complex, expensive)
- But serves as a good teaching aid!
Although the OSI model was created before the protocols, and the complete
model is therefore very complex and not practically elaborated, it is widely used
today to define and categorize most of the important protocols. OSI is not biased
because this reference framework is not associated with any particular vendor
philosophy. OSI represents a general approach for describing data
communication procedures, but this property is often considered a big
disadvantage, because practical implementations can typically be described
with a much simpler model. On the other hand, the OSI architects had only
little experience with real-life implementations. Therefore, genuine OSI
protocols are not really widespread today, because of their complexity.
Nevertheless, the OSI model serves as a reference frame when discussing or
learning about protocols.

The 7 OSI Layers
(Figure: System A (sender process) and System B (receiver process), each with
the seven layers: Application Layer, Presentation Layer, Session Layer,
Transport Layer, Network Layer, Data Link Layer, Physical Layer)
Because the communication between different systems can be a very complex
task, OSI splits the communication aspects into smaller tasks. All layering is
based on the OSI Reference Model, which defines the tasks and interactions of
seven layers.
The user's data moves from the top layer (Application Layer) down through all
other layers. When two systems communicate with each other, only peer
layers talk to each other: the application layer only talks to the application
layer, and the network layer only communicates with the network layer of
system B. We can speak of parallel communication between the layers. Every
layer works on its own; it is not interested in what the other layers do.

Physical Layer
- Mechanical and electrical specifications
- Access to physical medium
- Generates bit stream
- Line coding and clocking
- Examples
  - LAN: Ethernet-PHY, 802.3-PHY
  - WAN: X.21, I.400 (ISDN), RS-232
The Physical Layer generates the bit stream. This layer provides access to the
physical medium by applying line coding (NRZ, Manchester, etc) and
synchronization (clocking, PLL), and it also includes mechanical specifications.
Layer 1 can also activate or deactivate the links between end systems (link
management).
The physical part of the Ethernet NIC is called "PHY" and is perhaps the most
complex entity of Ethernet, because the PHY consists of a number of sublayers
that take care of interoperability between different Ethernet speeds (10, 100,
1000, 10000 Mbit/s) and codings (Manchester, 4B5B, 8B10B, ...). Note that there is
a fundamental difference between "Ethernet" and IEEE "802.3": these are two
separate LAN specifications that are typically implemented on the same NIC. They
just share the same topology and use the same media access strategy. Most
people are not aware of that.
X.21 is a typical and widely available interface on a Cisco router. ISDN
layer 1 is specified in the ITU-T standard I.400 and describes both a 192
kbit/s synchronous multiplexing interface capable of transporting 2 B channels and
one D channel, and a high-speed 2.048 Mbit/s interface capable of
carrying 30 B channels and one D channel. These ISDN specifics are presented in
more detail in the N-ISDN chapter. The old, well-known Recommended
Standard (RS) 232 specifies the classical serial interface found on many PCs
and other peripheral devices.

Link Layer
Reliable transmission of frames between two NICs
Framing
FCS
Physical addressing of NICs
Optional error recovery
Optional flow control
Examples:
LAN: 802.2
PPP, LAPD, LAPB, HDLC
[Figure: OSI seven-layer stack]
The data link layer builds the frame. Framing, or frame synchronization, is
therefore the most important task on layer 2: Where is the beginning of the
frame? Where is the end? Layer 2 protocols such as HDLC or PPP use a special
bit pattern to mark the frame boundaries. Framing is also important for the
MTU (maximum transmission unit).
Frame checking, that is the detection and optionally the correction of
transmission errors on a physical link, is also implemented on layer 2. In
addition, each network interface card has a physical address, which is carried
in the data link layer header (e.g. the MAC address with Ethernet).
Error recovery and flow control may be realized in connection-oriented mode.

Network Layer
Transports packets between networks
Provides structured addresses to name networks
Fragmentation and reassembling
Examples:
CLNP
IP, IPX
Q.931, X.25
[Figure: OSI seven-layer stack]
The network layer builds the so-called "packet". Layer 3 transports the packets
between the different networks. Therefore layer 3 needs structured and routable
addresses to find the right networks. IP is the most important layer 3 protocol
today (IPv4 has a structured 4 byte address). The OSI Connectionless Network
Protocol (CLNP) is another example of a layer 3 protocol, but it is not widely
used today, except that some telcos and carriers use it for internal purposes.
IPX was developed by Novell in order to extend Novell networks over different
data link layer worlds. Q.931 is the ISDN layer 3 carried over the D channel
and is used for signaling purposes. Basically, Q.931 conveys the telephone
numbers and other service parameters. The classical packet-switched WAN
standard X.25 actually specifies only layer 3 of this technology and is used
to set up a number of virtual calls over a link layer running LAPB (HDLC in
asynchronous balanced mode).

Transport Layer
Reliable transport of segments between applications
Application multiplexing through T-SAPs
Sequence numbers and flow control
Optional QoS capabilities
Examples:
TCP (UDP)
ISO 8073 Transport Protocol
[Figure: OSI seven-layer stack]
The transport layer is needed to build a logical connection between
applications and to send data in so-called "segments". With the help of port
numbers (in TCP and UDP), a layer 4 protocol delivers the segments to the right
application. These port numbers are called T-SAPs in the OSI world. The
transport layer optionally takes care of flow control and reliable transmission
between end systems, and it is the most important layer for QoS capabilities.
Flow control requires connection-oriented mode.

Session Layer
Provides a user-oriented connection service
Synchronization points
Little capabilities, usually not implemented or part of application layer
Telnet: GA and SYNCH
FTP: re-get allows to continue an interrupted download
ISO 8327 Session Protocol
[Figure: OSI seven-layer stack]
The Session Layer coordinates and controls the dialogue between different end
systems. This layer is only seldom or sparsely implemented. For example, a
Telnet server gives the sending permission to the Telnet client via a Go Ahead
(GA) sequence. Pressing the BRK key triggers a SYNCH sequence, and the server
must synchronize with the client by flushing the buffered stream. FTP keeps
track of the data blocks transmitted and is able to continue an interrupted
session from this checkpoint on.
Session protocols are important for telephony applications such as H.323,
which employs H.225 to establish sessions. Another example is the IETF
Session Initiation Protocol (SIP). ISO 8327 is an OSI basic
connection-oriented session protocol specification.

Presentation Layer
Specifies the data representation format for the application
Examples:
MIME (part of L7) and UUENCODING (part of L7)
ISO: ASN.1 and BER
[Figure: OSI seven-layer stack]
Layer 6 is responsible for a common language between end systems. The
presentation layer specifies the "meaning" of the data and how each byte
should be interpreted.
In the Internet the presentation layer uses ASCII coding, and the meaning of
the data is specified by a so-called "Multipurpose Internet Mail Extensions"
(MIME) header. MIME is used by SMTP (email) and HTTP (web browsing), for
example. UUENCODING is one example of how to transform 8-bit bytes into 7-bit
characters, and it is typically used with Internet mail attachments. The
ISO/OSI world generally uses the "Abstract Syntax Notation One" (ASN.1) as the
common presentation layer. This language is used to specify data structures
and contents. On the wire the data is transmitted using the "Basic Encoding
Rules" (BER).
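The 8-bit to 7-bit transformation performed by UUENCODING can be tried out with
Python's standard binascii module. This is only a sketch of the idea: the
helper name uu_line and the sample bytes are assumptions made for illustration
and do not come from the slides.

```python
import binascii


def uu_line(data: bytes) -> bytes:
    """Encode up to 45 arbitrary bytes as one line of uuencoded 7-bit text."""
    if len(data) > 45:
        raise ValueError("a uuencoded line carries at most 45 bytes")
    return binascii.b2a_uu(data)


raw = bytes([0xFF, 0x80, 0xC3, 0xA4, 0x7F])  # sample bytes, some with bit 8 set
line = uu_line(raw)                          # only 7-bit characters remain
restored = binascii.a2b_uu(line)             # the receiver reverses the mapping
```

Every three input bytes become four printable characters, so the encoded text
is about one third longer than the original data.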

Application Layer
Provides network access for applications
Examples:
ISO 8571 FTAM File Transfer Access + Management, X.400 Electronic Mail, CMIP
SMTP, FTP, SNMP, HTTP, Telnet, DNS, ...
[Figure: OSI seven-layer stack]
The Application Layer supports the user with common network applications, for
example file transfer or virtual terminals. Layer 7 also supports basic network
procedures in order to implement distributed applications (e.g. transaction
systems). Note that the application layer is not identical with the
application! The application itself "sits" upon the application layer and uses
the service primitives provided by the application layer to access the network.
Application layer protocols either use "inline" or "inband" control sequences
(as in Telnet), where control bytes are mixed with the data stream, or they use
a predefined frame structure consisting of header and body. Another method is
to open a dedicated logical control connection that only exchanges control
information (as in FTP).

Encapsulation Principle
[Figure: user DATA passing down through layers L7 to L1. Each layer prepends
its header, forming the A-PDU, P-PDU, S-PDU, T-PDU or "Segment", N-PDU or
"Packet" and L-PDU or "Frame", which layer 1 transmits as a bit stream.]
The data moves through all 7 layers, and every layer adds its own header. The
data together with the layer 7, 6, 5 and 4 headers is called a "segment". A
segment plus the layer 3 header is called a "packet". The so-called "frame"
(data plus six headers) is transported over layer 1 to the destination system.
There the frame moves up through all 7 layers again, and each layer removes
its own header.
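The header nesting can be sketched in a few lines of Python. The bracketed
text headers such as [L4] are purely hypothetical placeholders chosen for
readability; real protocols use binary headers with many fields.

```python
# Layers that add a header, listed top-down (layer 1 only transmits bits).
LAYERS = ["L7", "L6", "L5", "L4", "L3", "L2"]


def encapsulate(data: bytes) -> bytes:
    """Sender side: each layer prepends its own (toy) header."""
    pdu = data
    for layer in LAYERS:
        pdu = f"[{layer}]".encode() + pdu
    return pdu


def decapsulate(frame: bytes) -> bytes:
    """Receiver side: each layer removes exactly its own header, bottom-up."""
    pdu = frame
    for layer in reversed(LAYERS):
        header = f"[{layer}]".encode()
        if not pdu.startswith(header):
            raise ValueError(f"missing {layer} header")
        pdu = pdu[len(header):]
    return pdu


frame = encapsulate(b"DATA")  # b"[L2][L3][L4][L5][L6][L7]DATA"
```

The outermost header belongs to the lowest layer, which is why the receiving
system can process the frame strictly bottom-up.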

Practical Encapsulation
[Figure: nested boxes, from outside to inside: Ethernet Frame, IP Packet,
TCP Segment, HTTP Message, HTML Webpage]
The idea of encapsulation is fundamental in the data communication world.
Adjacent layers encapsulate or decapsulate information by adding or removing
additional "overheads" or "headers" in order to implement layer-specific
functionalities. The whole process can be regarded as a Matryoshka doll
principle.
In our example let's suppose a webserver sends a webpage (HTML code) to a
client. The webpage is carried via the Hypertext Transfer Protocol (HTTP),
which provides error and status messages, encoding styles and other things.
The HTTP header and body are carried in TCP segments, which are sent in IP
packets. On some links in between, the IP packets might be carried inside
Ethernet frames.

Internet Encapsulation
[Figure: nested encapsulation, from inside to outside]
HTML-Content (Webpage): This is what the user wants
HTTP Header | HTTP-Data: This is what the application wants
TCP Header | TCP-Data: Will reach the target application
IP Header | IP-Data: Will reach the target host
Eth Header | Ethernet-Data | Eth Trailer: Will reach the next Ethernet DTE

OSI Speak (1)
Entities
Anything capable of sending or receiving information
System
Physically distinct object which contains one or more entities
Protocol
Set of rules governing the exchange of data between two entities
Entities:
Any hardware or software module that acts upon a single layer is called an
"entity". Several entities exist peer to peer within a given layer and are able
to communicate with each other. This type of communication is referred to as
"horizontal" communication. This is actually what we mean when we talk about a
"protocol".
System:
Several entities make up a "system". For example, a PC is a "system" because
it consists of an Ethernet PHY entity, a MAC entity, an LLC entity, an IP
entity, a TCP entity and several L7 entities. A system is merely a term that
reflects the physical separation of groups of entities.
Protocol:
We already described the meaning of protocol above together with the
definition of an entity, but a protocol can be explained more simply: a
protocol is a set of rules that are necessary to exchange data in an ordered
and unambiguous way.

OSI Speak (2)
Layer
A set of entities
Interface
Boundary between two layers
Service Access Point (SAP)
Virtual port where services are passed through
Layer:
A "layer" in the OSI jargon is a set of entities, but do not confuse layers
with systems! The entities of a layer reside on the same hierarchy level, and a
single layer spans several systems. On the other hand, a system comprises
several layers, but typically only one entity (or a limited number of entities)
is available on each layer of a system. For example: in order to communicate in
the Internet, all devices must support layer 3 (the IP layer). That is, each
system must provide at least one IP entity.
Interface:
An "interface" is simply the logical boundary between two layers. Note that
interfaces are typically not physically visible because they represent the
boundary between two layers as a whole. The local representation of an
interface is called a "Service Access Point" or SAP. The Service Access Point
is one of the most frequently used terms in data communication and simply
denotes the piece of hardware or software that acts as an interface between two
layers. The OSI interface described above is meant globally, while the SAP has
local meaning, i.e. within one system. A SAP is a practical term; in some
technologies such as IEEE 802.2 it is just a field in the header indicating the
destination and source layer. If you use an Ethernet NIC with an AUI interface,
then this electrical interface can also be considered a SAP because "service
primitives" are passed through this interface. Service primitives are explained
below.
Service Access Point:
An "Interface Data Unit" (IDU) is, practically speaking, the piece of data that
is passed through a SAP to the next layer's entity. It contains the ICI and the
SDU, which are described below. When an IDU is passed through a SAP to the next
layer, this layer extracts and processes the Interface Control Information
(ICI).

OSI Speak (3)
Interface Data Unit (IDU)
Data unit for vertical communication (between adjacent layers of same system)
Protocol Data Unit (PDU)
Data unit for horizontal communication (between same layers of peering systems)
Interface Data Unit:
An "Interface Data Unit" (IDU) is, practically speaking, the piece of data that
is passed through a SAP to the next layer's entity. It contains the ICI and the
SDU, which are described below. When an IDU is passed through a SAP to the next
layer, this layer extracts and processes the Interface Control Information
(ICI).
Note that data is passed through a SAP using "service primitives". Service
primitives are implementation-specific functions (for example an API) that are
used to pass data from one layer to another on the same system. These service
primitives actually pass on the IDUs.
Protocol Data Unit:
The SDU represents the payload plus the headers of the upper layers. The SDU is
transported horizontally together with a header used at this layer. SDU and
header together are called a "Protocol Data Unit" (PDU). The PDU is the most
frequently used of all the terms mentioned here, so you should at least
remember the PDU.

OSI Speak (4)
Interface Control Information (ICI)
Part of IDU
Destined for entity in target layer
Service Data Unit (SDU)
Part of IDU
Destined for further communication
Contains actual data ;-)



OSI Speak Summary (1)
[Figure: the (N-1), (N) and (N+1) layers, separated by interfaces. Peer
entities within the same layer communicate using a "protocol"; entities of
adjacent layers exchange service primitives through Service Access Points
(SAPs).]
The ISO/OSI model defines four service primitives: request, indication,
response and confirm.
Note that the service primitives are only used for vertical communication.

OSI Speak Summary (2)
[Figure: vertical communication: the (N+1) layer entity passes an IDU
(ICI + SDU) through the SAP at the interface to the (N) layer entity.
Horizontal communication: the (N) layer entity adds its header NH to the SDU
and sends the resulting N-PDU to its peer (N) layer entity.]

Layer 1 Devices
Adapts to different physical interfaces
Amplifies and/or refreshes the physical signal
No intelligence
Repeater, Hub, NT1
[Figure: two OSI seven-layer stacks connected by a repeater at the physical
layer]
To connect different systems with each other we need special devices. If you
want to connect two systems at the physical layer only, you need a so-called
"hub" or "repeater".
Devices of this kind have no intelligence; they are only used to amplify or
refresh the physical signal, or to connect systems with different physical
interfaces.

Layer 2 Devices
Filters/forwards frames according to link layer address
Incorporates layers 1-2
LAN-Bridge ("Switch")
[Figure: two OSI seven-layer stacks connected by a bridge at the data link
layer]
A so-called "bridge" or "switch" is a device that connects systems at the data
link layer. Devices of this kind terminate the physical layer and forward
frames according to the link layer address, for example the MAC address with
Ethernet. Note that a bridge utilizes two or more physical layer entities
(NICs); that is, a bridge is able to convert encodings and signal rates.
Note the difference between "bridge" and "switch": a bridge is implemented in
software, whereas a switch is built in hardware. Today only switches are used,
because they are much faster.
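The filter/forward decision can be sketched as a small address table. Note the
assumption: the notes do not say how a bridge learns where an address lives;
the sketch uses the common transparent-bridging approach of learning from
source addresses and flooding unknown destinations. The class and method names
are hypothetical.

```python
class Bridge:
    """Toy layer 2 device that forwards frames by link layer address."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.table = {}  # link layer address -> port where it was last seen

    def receive(self, in_port, src, dst):
        """Return the list of ports on which the frame is sent out."""
        self.table[src] = in_port            # learn the sender's location
        out_port = self.table.get(dst)
        if out_port is None:                 # unknown destination: flood
            return sorted(self.ports - {in_port})
        if out_port == in_port:              # destination on same segment: filter
            return []
        return [out_port]                    # known destination: forward
```

A short usage example: on a three-port bridge, the first frame from station
"aa" to the still unknown station "bb" is flooded, while the reply from "bb"
is forwarded only to the port where "aa" was seen.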

Layer 3 Devices
"Packet Switch" or "Intermediate System"
Forwards packets to other networks according to structured address
Terminates links
Router, WAN-Switch
[Figure: two OSI seven-layer stacks connected by a router at the network layer]
The most important device in the Internet is the so-called "router". A router
consists of several layer 1 and layer 2 entities and a single layer 3 entity;
thus it can forward packets to other networks according to structured addresses
(remember IP addresses). By terminating layers 1 and 2, a router is able to
connect totally different network technologies with each other, for example
Ethernet on one side and ATM on the other.
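How a structured address selects the outgoing link can be sketched with
Python's standard ipaddress module. The routing table entries and interface
names below are invented for illustration, and the longest-prefix-match rule
is an assumption taken from common IP practice, not from these notes.

```python
import ipaddress

# Hypothetical routing table: (destination network, outgoing interface).
ROUTES = [
    ("10.0.0.0/8", "eth0"),
    ("10.1.0.0/16", "serial0"),
    ("0.0.0.0/0", "atm0"),      # default route
]


def next_interface(dst: str, routes=ROUTES) -> str:
    """Pick the interface of the most specific network containing dst."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, interface in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, interface)
    if best is None:
        raise LookupError(f"no route to {dst}")
    return best[1]
```

For example, next_interface("10.1.2.3") matches both 10.0.0.0/8 and
10.1.0.0/16 and selects the more specific entry, serial0.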

A Practical Example
[Figure: a Netscape browser and an Apache webserver communicating end to end.
Transport (TCP) uses port numbers: "What is my destination application?"
Network (IP) uses IP addresses: "Where is my destination network?"
The path crosses three links: Ethernet over twisted pair, HDLC over a serial
line, and FDDI over a fiber ring. The link layer uses MAC addresses or a
simple or dummy address: "Just move this frame to the next NIC".]
The picture above gives a good, symbolic example of how the different layers
talk with each other. The link layer only looks for the right NIC address. IP
only wants to reach the destination network, and TCP is the protocol used to
communicate between applications. Most importantly, notice that packets are
sent over different link layer technologies such as Ethernet, HDLC or FDDI.
This is exactly the reason why a common network layer is needed to allow
communication over different "networks" (links).
Don't be confused by the different usages of the term "network". People say
"network" and mean "a bunch of devices interconnected with each other". To be
more exact, a network is identified by a unique network identifier, such as the
network ID of the IP address. Since a contiguous link layer implementation
(such as an Ethernet LAN) can be assigned a single IP net ID, each link can
be regarded as a network.

Padlipsky's Rule
If you know what you're doing, three layers is enough. If you don't, even
seventeen won't help.
Until now we have discussed the famous OSI seven-layer reference model, but
real implementations typically consist of a subset of this 7-layer model. On
the one hand, not all OSI layers are necessary in real-world applications; on
the other hand, many important technologies were created before the OSI
standard.

Stevens 4-Layer Model
[Figure: two four-layer stacks: Process Layer, Transport Layer, Network Layer,
Data Link Layer]
Equivalent to the DoD Model (Internet)
The picture above shows the 4-layer model of W. Stevens, which is also used in
the Internet. The Internet layer model is also called the "Department of
Defense" (DoD) model.

Tanenbaum 5-Layer Model
[Figure: two five-layer stacks: Application Layer, Transport Layer, Network
Layer, Data Link Layer, Physical Layer]
The famous computer scientist Andrew S. Tanenbaum defined a more practical
approach utilizing five layers. Unlike the DoD or Stevens 4-layer model, it
defines the physical specifications in a separate layer.

Summary
Network layers ensure interoperability and ease standardization
ISO/OSI 7-layer model is an important reference model
Practical technologies employ a different layer set, but it's always possible
to refer to OSI


The Internet perspective is implement it, make it work well, then write it
down.
The OSI perspective is to agree on it, write it down, circulate it a lot and
now we'll see if anyone can implement it after it's an international standard
and every vendor in the world is committed to it.
One of those processes is backwards, and I don't think it takes a Lucasian
professor of physics at Oxford to figure out which.
Marshall Rose, "The Pied Piper of OSI"

Quiz
Explain layer-2 capabilities!
What could be the task of a layer-4 device?
What is a "gateway"?
How does the (N) layer tell the (N+1) layer that it has data to hand over?
Why have OSI protocols not been successful on the market?


Hints
Q1: Framing, Protection, Access, ...
Q2: Layer 4 device might deal with QoS, sequencing and flow control
Q3: According to OSI a layer 1-7 device, according to IETF a router.
Q4: Using Service Primitives (Indicate)
Q5: OSI is too complex and general, several fields in headers might have
variable length, sometimes ignores byte- and word-delineation, ...

1
2005/03/11 {C} Herbert Haas
Protocol Principles
Framing, FCS and ARQ

Link Layer Tasks
Framing
Frame Protection
Optional Addressing
Optional Error Recovery
Connection-oriented or connectionless mode
Optional Flow Control
The data link layer of the OSI model is the first layer above the physical
layer and performs logical tasks such as:
Framing is the task of packing the information of higher layers so that, for
example, the start and end of a frame can be detected, plus some optional
features which will be discussed on the following slides.
Frame protection is used to detect possible errors during data transmission.
Addressing is optional and is normally only used in point-to-multipoint data
link technologies.
Error recovery can be used to retransmit frames if data errors are detected by
the frame protection mechanism.
The data link layer can operate in connection-oriented or connectionless mode.
Newer technologies mainly use the connectionless mode because the
connection-oriented functionality is typically provided by higher OSI layers,
e.g. TCP on OSI layer 4.
Flow control can be used to prevent buffer overflow situations on the receiver
side.

Building a Frame
[Frame layout: Preamble | SD | Control | DATA | FCS | ED]
Consists of
Data
Metadata (Header or "Overhead")
Requires synchronous physical transmission (PLL)
Arbitrary frame lengths
A basic layer 2 transmission frame consists of the following components:
Preamble - is used to synchronize the receiver's clock with the sender's
transmission clock. This is necessary to allow the detection of the individual
bit boundaries.
SD - the Start Delimiter is needed to detect the actual beginning of the layer
2 transmission frame. From this point on, data is fed from the physical layer
into the receive buffer.
Control Field - provides optional addressing, connection establishment, error
recovery and flow control
Data - is the payload provided by higher OSI layers
FCS - the Frame Check Sequence is used for error detection
ED - the End Delimiter is used to determine the end of the layer 2 frame

Preamble
[Frame layout: Preamble | SD | Control | DATA | FCS | ED]
Enables PLL synchronization
Typically a 0101010... pattern
Example: 8 byte preamble in Ethernet frames
Note: Only necessary when sender and receiver clock are not synchronized
between frames
Asynchronous physical layer

The purpose of the Preamble is to lock the receiver clock to the sender clock
with the help of Phase Locked Loop (PLL) circuits. The Preamble differs
depending on the type of data link technology that is used.
In Ethernet technology, for example, the bit pattern consists of 62
alternating logical 1s and 0s, followed by two logical 1s to indicate the
start of frame.
The Preamble is obviously only needed for synchronous physical layers. If an
asynchronous physical layer is used, e.g. the COM port of a PC or an async
serial interface on a router, the Preamble is not needed because sender and
receiver clocks don't need to be synchronized.

Frame Synchronization
[Frame layout: Preamble | SD (Starting Delimiter) | Control | DATA | FCS |
ED (Ending Delimiter)]
Beginning and ending of a frame is indicated by SD and ED symbols
Bit patterns or code violations
Length field can replace ED (802.3)
Idle line can replace ED (Ethernet)
Also called "Framing"
There are different methods available to indicate the start and the end of a
data frame. The simplest method is the use of a special bit pattern. In the
HDLC protocol and its derivatives the bit pattern "01111110" is used to
indicate the start and end of a frame.
In Token Ring technology a code violation is used. A code violation is an
intentional break of the coding rules.
In Ethernet technology the SD is indicated by the bit pattern "11" following
the Preamble, but the end of the frame is indicated by an idle line, which
means silence on the wire for a specified period of time.
Optionally the ED can be omitted if a length field is present inside the data
link frame. In this case the end of the frame can be determined by counting the
number of bytes received.

Protocol Transparency
What if delimiter symbols occur within the frame?
Solution:
Byte-Stuffing
Bit-Stuffing
[Frame layout: Preamble | SD | Control | DATA | FCS | ED, with SD and ED
patterns (marked "!") occurring inside the data portion]
If a special bit pattern is used to indicate the start and end of a frame, it
is necessary to prevent this pattern from appearing inside the data portion of
the frame. Otherwise the receiver would misinterpret the frame boundaries.
There are two principal methods to achieve this goal: either by modifying
single bits of the data stream (bit-stuffing) or by inserting or replacing
whole bytes (byte-stuffing).

Byte Stuffing
Some character-oriented protocols divide data stream into frames
Old technique, not so important today
Data Link Escape (DLE) character indicates special meaning of next character
Data to send: A B C DLE E F G ETX H I STX H
Transmitted frame: DLE STX A B C DLE DLE E F G ETX H I STX H DLE ETX
Typically, character-oriented protocols use asynchronous transmission
(start-stop bits), so the receiver gets a stream of characters and has to find
the frames in it. For this purpose the special control characters Start of
Text (STX, 0x02) and End of Text (ETX, 0x03) are used to indicate the start and
end of a transmission frame.
Obviously STX and ETX may also appear inside the data portion, so an additional
special control character, Data Link Escape (DLE, 0x10), is used. STX and ETX
are only interpreted as control characters if the DLE pattern is found in front
of them. This means that if an STX or ETX bit pattern is found inside the data
stream without the DLE pattern in front of it, it is interpreted as data.
If the DLE pattern itself appears inside the data portion, it is doubled to
indicate that it is only data.
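The rules above can be sketched in Python as follows. The function names are
made up for this illustration; the character codes are the ones given in the
text.

```python
STX, ETX, DLE = 0x02, 0x03, 0x10


def byte_stuff(payload: bytes) -> bytes:
    """Frame the payload as DLE STX ... DLE ETX, doubling every DLE in the data."""
    body = payload.replace(bytes([DLE]), bytes([DLE, DLE]))
    return bytes([DLE, STX]) + body + bytes([DLE, ETX])


def byte_unstuff(frame: bytes) -> bytes:
    """Strip the delimiters and turn each doubled DLE back into a single one."""
    if frame[:2] != bytes([DLE, STX]) or frame[-2:] != bytes([DLE, ETX]):
        raise ValueError("missing frame delimiters")
    body = frame[2:-2]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] == DLE:
            # Inside the data portion a DLE is only valid as a doubled DLE.
            if i + 1 >= len(body) or body[i + 1] != DLE:
                raise ValueError("unescaped DLE inside data")
            out.append(DLE)
            i += 2
        else:
            out.append(body[i])
            i += 1
    return bytes(out)


payload = b"ABC\x10EFG\x03HI\x02H"   # A B C DLE E F G ETX H I STX H
frame = byte_stuff(payload)
```

The sample payload is the data from the slide: the single DLE is doubled, while
the bare ETX and STX characters pass through unchanged because no DLE precedes
them.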

Bit Stuffing
Used in bit-oriented protocols
Used by most protocols
Bits represent smallest transmission unit
HDLC-like framing: 01111110-pattern
Rule:
Transceiver-HW inserts a zero after five ones
Receiver rejects each zero after five ones
Data to send: 010011111000111111100101100110
Transmitted frame: 01111110 01001111100001111101100101100110 01111110
In HDLC technology and its derivatives, bit stuffing is used to prevent the SD
and ED pattern from appearing inside the data frame.
Sender and receiver follow a common rule: after every fifth consecutive logical
1, the sender inserts an additional logical 0 into the data stream. The
receiver removes each logical 0 that follows five consecutive logical 1s.
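The stuffing rule can be sketched in Python, with bits represented as a string
of "0" and "1" characters for readability (real hardware works on the serial
bit stream). The function names are hypothetical.

```python
FLAG = "01111110"  # HDLC start/end delimiter


def bit_stuff(bits: str) -> str:
    """Sender: insert a 0 after every run of five consecutive 1s."""
    out = []
    ones = 0
    for b in bits:
        out.append(b)
        if b == "1":
            ones += 1
            if ones == 5:
                out.append("0")   # stuffed bit
                ones = 0
        else:
            ones = 0
    return "".join(out)


def bit_unstuff(bits: str) -> str:
    """Receiver: drop the bit that follows five consecutive 1s.

    Assumes the input is a correctly stuffed data portion without flags.
    """
    out = []
    ones = 0
    i = 0
    while i < len(bits):
        b = bits[i]
        out.append(b)
        if b == "1":
            ones += 1
            if ones == 5:
                i += 1            # skip the stuffed 0
                ones = 0
        else:
            ones = 0
        i += 1
    return "".join(out)


data = "010011111000111111100101100110"
stuffed = bit_stuff(data)
frame = FLAG + stuffed + FLAG
```

With the data from the slide, two zeros are inserted, so the stuffed data
portion can never contain six consecutive 1s and therefore never imitates the
flag.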

Code Violations
[Figure: signal waveforms for Manchester, AMI and Differential Manchester
coding]
A code violation is an intentional breach of the coding rules. It can be used
for signaling purposes.
In Token Ring systems, for example, the differential Manchester code is used.
Violations of the differential Manchester code are used for the SD and ED
patterns to indicate the start and the end of a frame.
The differential Manchester code violation symbols are called J and K. The
code violation is achieved by omitting the signal transition (from 1 to 0 or
from 0 to 1) in the middle of a bit period.

Frame Protection
A frame check sequence (FCS) protects the integrity of our frame
From sunspots, mobile phones, noise, Heisenberg and others
FCS is calculated upon data bits
Different methods based on mathematical efforts: Parity, Checksum, CRC
Receiver compares its own calculation with FCS
[Frame layout: Preamble | SD | Control | DATA | FCS | ED, with the protected
part of the frame marked]
The Frame Check Sequence (FCS) allows the receiver to detect possible errors
in the data stream.
The FCS is calculated by the sender and is attached to the end of the frame.
The calculation of the FCS is typically performed in hardware.
There are many different techniques available to calculate an FCS, such as:
Parity bit calculation
XOR operation
Modulo operations
Cyclic Redundancy Check (CRC)
Forward Error Correction (FEC)

FCS Methods
Parity
Even (100111011) or odd (100111010) parity bits
Examples: Asynchronous character-transmission and memory protection
Checksum
Sum without carries (XOR operation)
Many variations and improvements
Examples: TCP and IP Checksums

The simplest FCS method is the parity bit calculation. This method is
typically used in asynchronous, character-based transmission systems.
One parity bit is computed for each single character. The two least
significant bits are XORed together, the output of this operation is then
XORed with the next more significant bit, and so on. The output of the final
operation is the even parity bit; inverting it gives the odd parity bit. For
the character 10011101, for example, the even parity bit is 1 and the odd
parity bit is 0.
Obviously the parity check only protects against single bit errors (more
precisely, against any odd number of bit errors). A two-bit failure, for
example one bit changing from 1 to 0 and another from 0 to 1, cannot be
detected by the parity check.
For packet-based transmission systems a similar method can be used to calculate
a checksum. Instead of single bits, 16 or 32 bit long words are combined with
the XOR operation.
The XOR operation is also known as a modulo-2 adder, since the output of the
XOR operation between two binary digits is the same as the addition of the two
digits without a carry bit.
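Both calculations are short enough to sketch in Python. The function names are
hypothetical, and the bit string is the example character from the slide.

```python
from functools import reduce
from operator import xor


def even_parity_bit(bits: str) -> int:
    """XOR all bits together; the result makes the total number of 1s even."""
    return reduce(xor, (int(b) for b in bits), 0)


def odd_parity_bit(bits: str) -> int:
    """The odd parity bit is simply the inverted even parity bit."""
    return even_parity_bit(bits) ^ 1


def xor_checksum(words) -> int:
    """Word-wise 'sum without carries' over 16 or 32 bit words."""
    return reduce(xor, words, 0)


character = "10011101"
even = even_parity_bit(character)   # appended bit for even parity
odd = odd_parity_bit(character)     # appended bit for odd parity
```

Flipping two bits of the character, for example changing it to 10011110,
yields the same parity bit, which illustrates why double errors go unnoticed.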

Checksum Example: ISBN
100% protection against
Single incorrect digits
Permutation of two digits
Method:
10 digits, 9 data + 1 checksum
Each digit weighted with factors 1-9
Checksum = Sum modulo 11
If checksum=10 then use "X"
ISBN 0-13-086388-2
0*1+1*2+3*3+0*4+8*5+6*6+3*7+8*8+8*9 = 244
244 modulo 11 = 2
ISBN stands for International Standard Book Number.
The ISBN code, which you may use to order your favourite books, is protected
by a checksum as well. This checksum is built on the basis of a modulo 11
operation.
The ISBN code itself consists of 10 digits, where the first nine digits
represent the book code and the tenth digit is the checksum used for error
detection.
The ISBN checksum detects a single incorrect digit or the transposition of two
digits with a probability of 100%.
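The method from the slide can be written down directly. This is a minimal
sketch for 10-digit ISBNs only; the function names are hypothetical.

```python
def isbn10_check_digit(first_nine: str) -> str:
    """Weight the nine data digits with 1..9, sum them and reduce modulo 11."""
    digits = [int(c) for c in first_nine]
    if len(digits) != 9:
        raise ValueError("expected exactly nine data digits")
    remainder = sum(w * d for w, d in enumerate(digits, start=1)) % 11
    return "X" if remainder == 10 else str(remainder)


def isbn10_is_valid(isbn: str) -> bool:
    """Recompute the check digit and compare it with the tenth character."""
    cleaned = isbn.replace("-", "")
    if len(cleaned) != 10 or not cleaned[:9].isdigit():
        return False
    return isbn10_check_digit(cleaned[:9]) == cleaned[9].upper()


check = isbn10_check_digit("013086388")   # weighted sum 244, 244 mod 11 = 2
```

Changing one digit or swapping two digits of 0-13-086388-2 makes
isbn10_is_valid return False, for example for 0-31-086388-2.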

Cyclic Redundancy Check
CRC is one of the strongest methods
Based on polynomial codes
Several standardized generator polynomials
CRC-16: x^16 + x^15 + x^2 + 1
CRC-CCITT: x^16 + x^12 + x^5 + 1
Checksums which are built on the basis of the XOR operation do not provide
reliable detection of error bursts.
The Cyclic Redundancy Check is therefore based on polynomial arithmetic and is
able to detect even bursts of erroneous bits inside the data stream.
For each transmitted frame the sender generates a single set of check bits
using a generator polynomial and appends it to the tail of the frame. The
receiver then performs the same computation on the complete frame and
compares the received checksum with its own calculated checksum.
There are many different standardized generator polynomials, such as CRC-16,
CRC-32 or CRC-CCITT.
The CRC-16 check, for example, will detect all error bursts of up to 16 bits
and most longer error bursts. CRC-16 and CRC-CCITT are mainly used in WAN
technologies, while CRC-32 is mainly used in LAN environments.
It is quite simple to implement these CRC checks in hardware with the use of
16 or 32 bit long shift registers.
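The shift-register idea can be mimicked in software. The following is a minimal sketch, assuming an all-zero initial register, most significant bit first and no final inversion; real protocols often use other initial values or bit orders, so the resulting numbers are not interchangeable with every "CRC-16" found in practice. All function names are illustrative.

```python
CRC_CCITT_POLY = 0x1021  # x^16 + x^12 + x^5 + 1 (the x^16 term is implicit)
CRC_16_POLY = 0x8005     # x^16 + x^15 + x^2 + 1


def crc16(data: bytes, poly: int = CRC_CCITT_POLY) -> int:
    """Bitwise CRC over data, emulating a 16 bit shift register."""
    reg = 0
    for byte in data:
        reg ^= byte << 8                 # feed the next 8 bits into the top
        for _ in range(8):
            if reg & 0x8000:             # bit shifted out is 1: subtract the generator
                reg = ((reg << 1) ^ poly) & 0xFFFF
            else:
                reg = (reg << 1) & 0xFFFF
    return reg


def append_crc(payload: bytes, poly: int = CRC_CCITT_POLY) -> bytes:
    """Sender side: append the check bits to the tail of the frame."""
    return payload + crc16(payload, poly).to_bytes(2, "big")


def frame_is_intact(frame: bytes, poly: int = CRC_CCITT_POLY) -> bool:
    """Receiver side: the CRC over payload plus check bits must be zero."""
    return crc16(frame, poly) == 0


frame = append_crc(b"Hello, Tokyo")
damaged = bytearray(frame)
damaged[3] ^= 0xFF   # 16 bit error burst spanning two bytes
damaged[4] ^= 0xFF
print(frame_is_intact(frame), frame_is_intact(bytes(damaged)))
```

Running the CRC over the whole received frame and testing for zero is equivalent to recomputing the checksum and comparing it, as described above.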

14
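Forward Error Correction
Required for "extreme" conditions
High BER, EMR
Long delays, space-links
Introduces much redundancy
Example: Reed-Solomon codes, Hamming-codes

[Figure: 4x4 data block with parity bits; the last column holds the row parity,
the last row the column parity. In the right block the third bit of the first
row differs from the left block.]

1 1 1 0 | 1        1 1 0 0 | 1
0 1 0 0 | 1        0 1 0 0 | 1
0 0 1 1 | 0        0 0 1 1 | 0
0 1 0 1 | 0        0 1 0 1 | 0
1 1 0 0 | 0        1 1 0 0 | 0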
All error protection technologies discussed so far lead to a drop or an optional
retransmission of the corrupted data frame. With the help of Forward Error
Correction (FEC) technologies it is possible to fix the faults inside a data frame
up to a certain extent.
One error correction coding scheme is the Hamming FEC scheme. In this
scheme the data bits plus the additional check bits are called the codeword. The
minimum number of bit positions in which two valid codewords differ is
known as the Hamming distance of the code.
It can be shown that to correct n errors we need a code with a Hamming
distance of at least 2n + 1.
In practice the number of check bits needed for error correction is much larger
than the number of bits needed just for error detection. Therefore most
transport systems still use ARQ techniques for error correction.
FEC technologies are mainly used in niche technologies, e.g. communication
with space probes, where the RTT is very high and sometimes only a
unidirectional communication channel exists.
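As an illustration of a code with Hamming distance 3 (so n = 1 error can be corrected), here is a minimal sketch of the classic Hamming(7,4) code. The bit layout (parity bits at positions 1, 2 and 4) is the textbook convention and not taken from the slide; the function names are illustrative.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7 bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data bits, position of the corrected bit or 0 if none)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # syndrome = 1-based index of the bad bit
    if position:
        c[position - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                          # single bit error on the line
print(hamming74_correct(word))
```

Three check bits protect four data bits here, which shows how much redundancy even single-error correction costs compared with a plain parity bit.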

15
Control Field
Contains protocol information
Addressing
Sequence numbers
Acknowledgement flag
Frame type
SAP or payload type
Signalling information
The contents of the control field depend on the tasks that need to be
performed by the data-link protocol.
So the control field could contain:
Address information, needed especially in point-to-multipoint environments
Sequence numbers, which can be seen as serial numbers for each single frame
Acknowledgement flags to indicate that the data was received properly
Frame type information to indicate whether the frame carries data or
control information
Service Access Point (SAP) or payload type information to indicate what is
transported by the frame
Signalling information, used by connection-oriented protocols to build up a
connection

16
Connection-Oriented Protocols
Different definitions
Some say "protocols without addressing information" and think of
circuit-switched technologies
Some say "protocols that do error recovery"
Correct: "protocols that require a connection establishment before sending
data and a disconnection procedure when finished"
In the past a connection-oriented protocol was very often seen as a protocol
that performs a connection setup procedure and supports error recovery and
flow control.
But newer technologies like Frame Relay typically do not carry address
information, do not need a connection setup procedure in the case of PVC service
and do not support error recovery and flow control. They are still
connection-oriented.
So we may say there are two different types of connections: temporary (like
SVC) and permanent (like PVC).
For temporary connections we obviously need addressing and connection
setup procedures.
For permanent connections a virtual circuit identifier is enough.
In both cases, temporary or permanent, we may optionally support
error recovery and flow control.

17
CO-Protocol Procedures (1)
[Figure: time-sequence diagram between Station A and Station B: Connection
Request, Connection Established, DATA, Disconnection Request, Disconnected]
In connection-oriented protocols a connection is established before data is
allowed to flow.
The connection establishment is done by special control frames like connection
request and connection established.
Then follows the data exchange phase, which may use error recovery and flow
control. When the data transmission is finished, special control frames
like disconnect request and disconnected tear down the connection again.

18
CO-Protocol Procedures (2)
[Figure: time-sequence diagram, connection already established: Keepalive and
Keepalive ACK are exchanged periodically]
Other synonyms: "Hello" or "Receiver Ready"
Connection-oriented protocols use special keep-alive procedures to make sure
the connection is still up during periods in which no data frames are transmitted.

19
CO-Issues
Establishment delay
Traffic descriptor during establishment (QoS)
Additional frame types necessary
Connection establishment
Disconnect
Keepalive
ARQ possible (Error Recovery)
Obviously it takes some time to establish a connection before the data is
allowed to flow. But the establishment delay is typically in the range of
milliseconds. Even ISDN provides a connection establishment worldwide in
less than one second.
A traffic descriptor may be used optionally by all technologies that support
Quality of Service features. The traffic descriptor holds the information about
the service parameters, e.g. delay, burst size etc., that should be used for this
connection.
There are also separate frame types defined for connection establishment,
disconnect procedures and keep-alives.
The use of Automatic Repeat Request (ARQ) is optional, depending on the
technology used. The purpose of ARQ is to provide error recovery in case
of transmission failures.

20
Automatic Repeat Request
ARQ protocols guarantee correct delivery of data
Receiver sends acknowledgements
Acknowledgements refer to sequence numbers
Missing data is repeated
When do we need this?
For most data traffic (FTP, HTTP, ...)
Not for real-time traffic (VoIP)
ARQ techniques are used to allow data retransmissions in the case of
transmission errors and packet loss.
With the help of sequence numbers (the serial number of a data packet), applied by
the sender, the receiver is able to detect packet loss and is further able to
acknowledge properly received frames.
Reliable transmission techniques are mainly used for data traffic, e.g. SMTP,
HTTP, FTP etc., because we don't want to receive corrupted mails or HTML
pages.
For real-time traffic like Voice over IP we prefer unreliable
transmission techniques, because it makes no sense to retransmit a lost word a
few milliseconds later. This would destroy the harmony of the speech
even more than the lost word. For real-time systems Forward Error Correction
would be much more useful.

21
ARQ Variants
Idle-RQ
Continuous-RQ: Selective ACK, Positive ACK, GoBackN, SREJ
There are two major families of ARQ: the idle-RQ system and the
continuous-RQ system.
The continuous-RQ system comes in four different flavors: Selective
ACK, Positive ACK, GoBackN and the Selective Reject (SREJ) method.
Sometimes even a mix of these basic methods is used.

22
Idle-RQ
[Figure: Sender and Receiver strictly alternate: Data, Ack, Data, Ack, ...]
The idle-RQ technique is a stop-and-go protocol. This means that after the sender
has sent one data frame it must wait for the corresponding acknowledgement
before it is allowed to send the next frame.
The idle-RQ protocol operates in half-duplex mode and is typically used in
master-slave environments. The master sends a frame and must wait for the
response of the slave, which tells whether the frame was properly received or not.
In today's networks the idle-RQ technique is seen very rarely, because it
introduces large delays and is not able to fill the data pipes we are currently used
to.

23
Without Sequence Numbers:
[Figure: the Sender sends Data "ABCD"; the Ack is lost ("No Ack?
Retransmission!"), so Data "ABCD" is sent again and acknowledged; then Data
"EFGH" follows. Buffer contents shown: ABCD, EFGH]
There are two different ways in which the idle-RQ technique can be implemented.
This graphic shows an idle-RQ system without sequence numbers.
The sender sends out a data frame, starts a retransmission timer and
waits for the reception of an acknowledgement.
A data frame that has been sent remains in the send buffer and may only be deleted
once a proper acknowledgement is received.
A data frame is retransmitted if the retransmission timer expires
before an acknowledgement is received. This happens even if the transmission itself
was successful and only the acknowledgement was lost, which can lead to
duplicate data frames being delivered.

24
With Sequence Numbers:
[Figure: the Sender sends Data "ABCD" S=0; Ack=0 is lost ("No Ack?
Retransmission!"), so Data "ABCD" S=0 is sent again and answered with Ack=0;
then Data "EFGH" S=1 is answered with Ack=1. Buffer contents shown: ABCD, EFGH]
In the second scenario the idle-RQ system uses sequence numbers.
In this case the sender sends out a data frame with a valid sequence number,
keeps the data frame in the send buffer and starts the retransmission timer. The
acknowledgement frame is lost again and the data frame is retransmitted.
But now the receiver is able to recognize, with the help of the sequence number,
that the received data frame is a duplicate and is able to discard it.
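The receiver side of this scenario can be sketched as follows. This is only an illustration with made-up names; it assumes a one-bit sequence number (0, 1, 0, 1, ...) and an acknowledgement that echoes the sequence number of the received frame, as in the figure.

```python
class IdleRqReceiver:
    """Stop-and-wait receiver that filters duplicates by sequence number."""

    def __init__(self):
        self.expected = 0      # sequence number of the next new frame
        self.delivered = []    # payloads handed to the upper layer

    def on_frame(self, seq, payload):
        """Process one data frame and return the Ack number to send back."""
        if seq == self.expected:
            self.delivered.append(payload)
            self.expected ^= 1             # toggle between 0 and 1
        # A duplicate is not delivered again, but it is still acknowledged,
        # because the earlier Ack was obviously lost.
        return seq


receiver = IdleRqReceiver()
acks = [receiver.on_frame(0, "ABCD"),   # first copy, its Ack gets lost
        receiver.on_frame(0, "ABCD"),   # retransmission after the timeout
        receiver.on_frame(1, "EFGH")]
print(acks, receiver.delivered)
```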

25
Slow !
[Figure: Vienna sends Data to Tokyo and waits for the Ack before the next Data
frame is sent]
"Stop and Wait":
Data is traveling round the earth while the sender waits for the
acknowledgement.
In the meantime no data can be sent !!
As already mentioned, the idle-RQ technique is not suited for today's
data transport systems. The stop-and-wait procedure would introduce a lot of
delay, especially on long-distance connections, and would not be able to
use the network infrastructure efficiently.

26
Empty Pipe !
[Figure: 1.5 Mbit/s link between Vienna and Tokyo; a 1 KB Data frame leaves at
t = 0 s and its Ack returns at t = 350 ms]
This pipe allows 1.5 Mbit/s × 350 ms = 525,000 bit ≈ 64 KB of data.
But stop-and-wait only allows one frame to be outstanding !!!
Assume MTU = 1 KByte, then the max rate is
(1024 × 8) bit / 0.35 s ≈ 23 Kbit/s
In this scenario we have a 1.5 Mbit/s connection between Vienna and Tokyo.
A 1 KByte data frame is sent from Vienna to Tokyo with a transport delay of
175 milliseconds in one direction. With the idle-RQ technique it would
take at least 350 milliseconds before an acknowledgement is received and the
next data frame can be sent.
In this case a maximum throughput of only 23 Kbit/s can be achieved.
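The arithmetic of this example is easy to replay. The sketch below uses the same simplification as the slide: the time needed to clock the frame onto the line is ignored, only the round trip counts. Function names are illustrative.

```python
def stop_and_wait_rate(frame_bytes, rtt_s):
    """Upper bound in bit/s when only one frame may be outstanding per round trip."""
    return frame_bytes * 8 / rtt_s


def pipe_capacity_bits(bandwidth_bps, rtt_s):
    """Amount of data that fits 'into the pipe' during one round trip."""
    return bandwidth_bps * rtt_s


RTT = 0.350          # Vienna - Tokyo - Vienna, in seconds
BANDWIDTH = 1.5e6    # bit/s

rate = stop_and_wait_rate(1024, RTT)
capacity = pipe_capacity_bits(BANDWIDTH, RTT)
print(f"stop-and-wait rate: {rate / 1000:.1f} kbit/s")
print(f"pipe capacity: {capacity:.0f} bit = {capacity / 8 / 1024:.1f} KB")
print(f"link utilization: {rate / BANDWIDTH:.1%}")
```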

27
Idle-RQ Facts
Old and slow method
But small code and only few resources necessary
At least two sequence numbers necessary
To distinguish new from old data
Half-duplex protocol
Example: TFTP
The only advantage of the idle-RQ technique compared to the more
sophisticated continuous-RQ techniques is the small amount of memory and
processor resources it needs.
It is therefore very easy to implement in ROM-based systems, e.g. Cisco
routers support TFTP from the ROM monitor.

28
Continuous RQ
[Figure: comparison of Idle-RQ (one Data frame, then one Ack) and Continuous-RQ
(several Data frames and Acks in flight at the same time)]
Data and Acks are sent continuously !!
In continuous-RQ technology data frames and their corresponding
acknowledgements are sent continuously.
The sender is allowed to put a certain number of frames into the send buffer
and transmit them all in one go. The number of frames the sender is allowed to
send is either negotiated during the connection establishment phase or
dynamically adjusted by the maximum window size announced in the received
acknowledgements.

29
Full Pipe !
[Figure: 1.5 Mbit/s link between Vienna and Tokyo, t = 0 s to t = 350 ms, filled
with 1 KB Data frames and returning Acks]
This situation corresponds with a sliding window
With the continuous-RQ technique and a properly adjusted send window
size it is now possible to use the complete capacity of our Vienna
Tokyo connection.

30
Need For Retransmission Buffer
[Figure: Vienna sends Data S=0, S=1, S=2, S=3 to Tokyo and keeps frames 0 1 2 3
in its retransmission buffer; after the timeouts Data S=0, S=1, ... are
retransmitted and acknowledged]
Four packets are sent, but due to a network failure none of them arrive (or
equivalently they do arrive but the Acks are lost)
Timeouts !!!
Retransmissions
In this example four data frames are sent from Vienna to Tokyo and all of them
are lost, or the corresponding acknowledgements do not arrive.
This situation leads to a timeout of the retransmission timer at the Vienna
location and causes a retransmission of all four data frames.

31
Continuous-RQ Resources
Continuous-RQ requires dramatically more resources than Idle-RQ or
connectionless protocols !!!
Retransmission Timers
Retransmission Buffers
Receive Buffers (to maintain sequence)
Might result in high CPU loads !!!
Reason for DoS Attacks
In practice retransmission timers are started for groups of packets and not for
each single packet. The retransmission timer calculation is based upon the
measured round-trip time (RTT), which may vary between milliseconds and
minutes (Internet). Several more or less sophisticated algorithms adjust the
retransmission timers in order to maximize overall data throughput.
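One widely known example of such an algorithm is the smoothed RTT estimator that RFC 6298 specifies for TCP. The sketch below follows its update rules but leaves out the clock granularity term and the recommended lower bound for the timeout; the class and attribute names are made up, and the text above does not say which algorithm a particular protocol uses.

```python
class RtoEstimator:
    """Retransmission timeout from measured round-trip times (after RFC 6298)."""

    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT variation
    K = 4           # safety factor applied to the variation

    def __init__(self, initial_rto=1.0):
        self.srtt = None      # smoothed round-trip time in seconds
        self.rttvar = None    # smoothed deviation of the round-trip time
        self.rto = initial_rto

    def update(self, measured_rtt):
        """Feed one RTT sample and return the new retransmission timeout."""
        if self.srtt is None:                 # first measurement
            self.srtt = measured_rtt
            self.rttvar = measured_rtt / 2
        else:                                 # variation first, then the mean
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - measured_rtt))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * measured_rtt
        self.rto = self.srtt + self.K * self.rttvar
        return self.rto


estimator = RtoEstimator()
for sample in (0.350, 0.350, 0.420, 0.340):
    print(f"RTT {sample:.3f} s -> RTO {estimator.update(sample):.3f} s")
```

A stable RTT lets the timeout shrink towards the measured value, while jitter widens the safety margin again.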

32
Selective Acknowledgements
[Figure: Vienna sends Data S=0, S=1, S=2, S=3; S=1 is lost. Tokyo returns Ack=0,
Ack=2, Ack=3. After the timeout for S=1 Vienna sends the 1st retransmission of
Data S=1 and receives Ack=1. Tokyo has received the frames in the order
0, 2, 3, 1: reordering necessary]
In Selective Acknowledgement technology every single frame that is sent out
needs to be acknowledged. As soon as an acknowledgement arrives, the
corresponding data frame may be deleted from the send buffer.
If a data frame is lost, only the acknowledgement for this frame will be missing.
This again causes a retransmission timeout at the sender and leads to a
retransmission of the data frame. But now the data frames are received out of
order at the receiver side.
The receiver is able to reorder the data frames with the help of the sequence
numbers before they are handed over to the next higher-level software process.

33
SACK: Duplicates
[Figure: Vienna sends Data S=0, S=1, S=2, S=3; all arrive in Tokyo, but Ack=1 is
lost while Ack=0, Ack=2 and Ack=3 arrive. After the timeout for S=1 Vienna sends
the 1st retransmission of Data S=1; Tokyo recognizes the duplicate and answers
with Ack=1]
The Selective Acknowledgement technique causes retransmissions even
when only the acknowledgement frame is lost, because as long as the Ack frame is
missing the sender is not allowed to remove the corresponding data frame from the
send buffer.
Nevertheless duplicate data frames are recognized by the receiver with the
help of the sequence numbers.

34
SACK
Application:
New option for TCP to accommodate long fat pipes with high BER
Optionally, retransmissions might be sent immediately when an unexpected
(the next but one) ACK occurs
Opposite idea: Cumulative ACK
SACK technology is nowadays also supported by modern TCP protocol
stacks and its use is negotiated during connection setup.
Optionally retransmissions may occur immediately when an unexpected ACK
is received.

35
GoBack N
[Figure: Vienna sends Data S=0 (answered with Ack=1), Data S=1 (lost), Data S=2
and Data S=3; Tokyo answers the out-of-sequence frames with NACK = 1. Vienna
then retransmits Data S=1, S=2 and S=3, which are answered with Ack=2 and
Ack=4. Tokyo receives 0, 1, 2, 3: sequence maintained !]
In the GoBackN ARQ technique data frames are only acknowledged when the
series of received sequence numbers is continuous.
In the above example the data frame with the sequence number S=1 is lost.
The receiver recognizes that a data frame is missing when it receives the
frame with the sequence number S=2. Now a so-called Negative
Acknowledgement (NACK) frame is sent back to the sender. The sender
recognizes which frame was lost by the sequence number carried in the NACK
frame.
Starting with this sequence number all data frames are retransmitted. This
causes more overhead than SACK but makes sure that the order
of the data frames is maintained.
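The go-back behaviour can be replayed with a deliberately simplified simulation. It assumes that a NACK reaches the sender before the next frame is transmitted and that each lost frame is lost only once; a loss of the last frame is recovered by a timeout. The function and parameter names are made up for this sketch.

```python
def go_back_n_transfer(frames, lost_first_attempt):
    """Return (sequence numbers in transmission order, delivered payloads)."""
    lost = set(lost_first_attempt)
    sent_log, delivered = [], []
    expected = 0   # receiver: next in-sequence number it will accept
    seq = 0        # sender: next sequence number to transmit
    while expected < len(frames):
        if seq >= len(frames):
            seq = expected           # nothing left to send: timeout, go back
            continue
        sent_log.append(seq)
        if seq in lost:
            lost.discard(seq)        # only the first transmission is lost
        elif seq == expected:
            delivered.append(frames[seq])
            expected += 1
        else:
            # Gap detected: the receiver discards the frame and sends
            # NACK=expected, so the sender restarts from that number.
            seq = expected
            continue
        seq += 1
    return sent_log, delivered


log, data = go_back_n_transfer(["A", "B", "C", "D"], lost_first_attempt={1})
print(log, data)
```

Frame S=2 is transmitted twice although its first copy arrived intact, which is the extra overhead mentioned above; in return the receiver never has to reorder anything.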

36
GoBack N - Facts
Maintains order at receiver-buffer
Reordering was too time-consuming in earlier days
Still used by
HDLC and clones ("REJECT")
TCP
Variant known as "fast retransmit"
Uses duplicate acks as NACK
The GoBackN procedure is used by the quite old HDLC protocol because
reordering of data packets consumed too much time and memory
for the processors in the early days of data communication.
In today's TCP implementations a variant of the NACK procedure is also used:
the duplicate ACK. As soon as a TCP stack recognizes that a data frame is
missing it sends out a duplicate acknowledgement. The sender thus recognizes
that a data frame was lost and retransmits it. This
speeds up the error recovery procedure because the sender does not have to
wait for the retransmission timeout in this case.

37
Selective Reject ARQ
Modern modification of GoBack N
Only those frames are retransmitted that receive a NACK
Or those that time out
Receiver must be able to reorder frames
Application:
Optional for modern HDLC clones
The SACK procedure combined with NACK frames is a mix of different ARQ
procedures. In this technique only data frames for which a NACK
frame is received, or those that time out, are retransmitted.
The data frames may therefore arrive out of order.

38
Positive Ack
[Figure: Vienna sends Data S=0 (answered with Ack=1), Data S=1 (lost), Data S=2
and Data S=3; Tokyo buffers 0, 2, 3 without sending further Acks. After the
timeout for S=1 Vienna retransmits Data S=1 and Tokyo answers with the
cumulative Ack=4]
In the positive ACK technique an ACK is only sent when the order of the
arriving data frames is correct.
In the example above the data frame with the sequence number S=1
gets lost. The next frame that arrives is the data frame with S=2. The
receiver recognizes discontinuous sequence numbers and stops transmitting
ACK frames. In the meantime all other frames are received and stored in the
receive buffer but are not acknowledged.
When the timer for data frame S=1 expires, the frame is retransmitted. The receiver
recognizes that the missing data frame has arrived and sends an Ack=4
frame to acknowledge all frames received so far.
Obviously it may happen that another data frame is retransmitted as well, depending
on the remaining timeout period and the RTT of the connection.

39
Positive Ack - Facts
Always together with cumulative acks
Any frame received is buffered
Receiver must be able to reorder
Problem:
Only timeouts trigger retransmission
Application:
Early TCP
The positive ACK procedure is always used together with cumulative
acknowledgements. Data frames may arrive out of order and must
be reordered. Only timeouts trigger retransmissions of lost data frames.

40
Windowing
As shown, the sender must buffer unacknowledged frames in case of
retransmissions
The necessary sender-buffer size is called "window"
Window size depends on
Bandwidth of channel
Round-Trip-Time (RTT)
The sender needs to buffer all transmitted frames until the corresponding
acknowledgement arrives. The size of this transmit buffer is called the window
size.
The optimum window size directly depends on the bandwidth of the channel as
well as on the round-trip time.
Older protocols use a fixed window size negotiated during connection setup.
More modern protocols such as TCP use a method called adaptive windowing,
which automatically adapts the window size to the needs of the transport
system.

41
Remember: Full Pipe !
[Figure: 1.5 Mbit/s link between Vienna and Tokyo, t = 0 s to t = 350 ms, filled
with 1 KB Data frames and returning Acks]
This situation corresponds with a sliding window
In this scenario a proper window size is used, which guarantees the most efficient
use of the transport capacity.

42
Sliding Window Basics (1)
[Figure: three snapshots of a row of frames 1 to 16 with a window of four
frames. Labels: "Frames to be sent", "Window", "Frames on flight", "Already sent
and acknowledged". The window moves on after Ack=3 and again after Ack=7]
In this example a fixed window size of four is used. This means that up to four
data frames may be sent without having been acknowledged.
As soon as the first acknowledgement, Ack=3, arrives, the sliding window moves
to the left. This is possible because the data frames 1 and 2 are now removed
from the send buffer. Now the data frames 5 and 6 are sent and kept in the send
buffer until their acknowledgements arrive.
On the left side of the window we find the data frames that will be sent in the
future, while on the right side of the sliding window the frames already sent and
acknowledged can be found. So the window moves continuously from
right to left, hence the name sliding window.
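The bookkeeping behind this example fits into a few lines. The sketch below assumes, as in the figure, that Ack=n acknowledges every frame with a number below n; the class and method names are made up.

```python
class SlidingWindowSender:
    """Fixed-size sliding window over consecutively numbered frames."""

    def __init__(self, first_frame, last_frame, window):
        self.base = first_frame        # oldest frame not yet acknowledged
        self.next_frame = first_frame  # next frame that was never sent
        self.last_frame = last_frame
        self.window = window

    def send_allowed(self):
        """Send (here: return) every new frame that fits into the window."""
        sent = []
        while (self.next_frame < self.base + self.window
               and self.next_frame <= self.last_frame):
            sent.append(self.next_frame)
            self.next_frame += 1
        return sent

    def on_ack(self, ack):
        """Ack=n: frames below n leave the send buffer, the window slides."""
        if ack > self.base:
            self.base = min(ack, self.next_frame)


sender = SlidingWindowSender(first_frame=1, last_frame=16, window=4)
print(sender.send_allowed())   # initial window
sender.on_ack(3)
print(sender.send_allowed())   # frames 1 and 2 acknowledged
sender.on_ack(7)
print(sender.send_allowed())
```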

43
Sliding Window Basics (2)
Window Size in Bytes = BW × RTT
If smaller: jumping window
Extreme case: Idle-RQ for W=1
How many identifiers?
At least W+1
If all W frames must be retransmitted, receiver must distinguish from new data
W < (MaxSeqNum+1) / 2
To avoid troubles on wrap around
Assume we have a window size W=4 and the same number of identifiers
(sequence numbers 0..3). When we send four frames at once and unfortunately
all corresponding acknowledgements get lost, our timers expire and we must
retransmit those frames. Now we again send frames 0..3 but the receiver cannot
recognize them as the second incarnation. Thus, the first real new frame must
have a sequence number of 4, that is the next frames have identifiers 4, 0, 1, 2.
The result is that we need at least W+1 identifiers.
Even then, another tricky scenario might happen. Assume we have a sequence
number space 0..7 (8 identifiers) and W=7. We send frames 0,1,2,3,4,5,6 and
get all acknowledgements. Then we send 7, 0, 1, 2, 3, 4, 5. How can the
receiver be sure that frames 0,1,2,3,4,5 are not retransmitted frames? (Frame 7
might get lost!)
Because of this, we use a smaller window size W=4. Now we send frames
0,1,2,3 and get all acknowledgements. Then we send 4,5,6,7. The receiver is
not confused.
So the final rule requires W to be smaller than half the number of identifiers.
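Both scenarios can be checked mechanically with a small model. The sketch makes a strong simplification: a full window is delivered, every acknowledgement is lost and the sender retransmits the whole window, while the receiver already accepts the next `receiver_window` sequence numbers. The function name and the `receiver_window` parameter are assumptions of this sketch, not part of the text above.

```python
def retransmission_is_ambiguous(window, num_ids, receiver_window):
    """True if a retransmitted frame could be mistaken for a new one."""
    # Identifiers of the frames the sender transmits again after the timeout.
    old = {seq % num_ids for seq in range(window)}
    # Identifiers the receiver is currently willing to accept as new data.
    new = {seq % num_ids for seq in range(window, window + receiver_window)}
    return bool(old & new)


# Receiver accepts only the next frame in sequence (GoBackN style):
print(retransmission_is_ambiguous(4, 4, receiver_window=1))
print(retransmission_is_ambiguous(4, 5, receiver_window=1))
# Receiver buffers a whole window of frames (reordering style):
print(retransmission_is_ambiguous(7, 8, receiver_window=7))
print(retransmission_is_ambiguous(4, 8, receiver_window=4))
```

In this simplified model a window of exactly half the identifier space, as in the W=4 example, is still unambiguous; the strict inequality on the slide stays on the safe side.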

44
Jumping Window
[Figure: frame exchange between Vienna and Tokyo]
This example shows what happens if the window size is too small.
The sender in Vienna has sent out data frames until it reached the maximum
window size. Now the sender in Vienna has to wait for acknowledgements to
arrive.
The incoming acknowledgements free up buffer space and the sender may
continue to send. In this scenario the chosen window size is obviously too small,
leading to an inefficient use of the transport capacity.

45
Jumping Window
[Figure: frame exchange between Vienna and Tokyo]
Evidently, the specific situation depicted above can only occur when the
sender employs a dynamic window size (assume the sender started with W=20
and suddenly reduced the window size to W=4).

46
Flow Control
Too large window sizes
Might require too many retransmissions (especially with GoBackN)
Result in network congestion (imagine thousands of users)
Receiver buffer overflow
Flow control #1: Adaptive Windowing
Flow control #2: Stop and Go
Imagine our window size is W=1,000,000 and after sending the 1,000,000th
frame the receiver sends us a NACK(1). That is, we have to repeat 999,999
frames, although most of them might have been transmitted without problems.
This would cause congestion in the network as well as buffer overflow
problems at the receiver side.

47
Flow Control
Adaptive Windowing
The receiver adjusts the sender's window size (sent together with ack)
TCP's approach
Stop and Go
Dedicated flow control frames
HDLC's approach (RR and RNR)
Ethernet's approach (Pause-Frame)
Flow control to prevent buffer overflow conditions at the receiver side can be
implemented either by adaptive windowing or by some
kind of stop-and-go procedure.
The adaptive windowing technique, which is used by the TCP protocol stack,
announces the receiver's available buffer size within the acknowledgements that
are sent back to the sender. So the window size changes dynamically
depending on the buffer conditions at the receiver.
The stop-and-go technique, which is mainly used by some older protocols like
HDLC and its derivatives, uses the special control frames Receiver Ready
(RR) and Receiver Not Ready (RNR). These protocols typically use a
fixed window size of 7 or 127 packets.
There is also a newer approach, 802.3x, to implement flow control in Ethernet
technology, which allows a congested device to specify transmission pauses in
terms of bit times.

48
Medium Access
In case of shared media
Collisions possible
Who may send?
Basic Techniques
Aloha (Ethernet Principle "CSMA/CD")
Token (Token Ring, FDDI, Token Bus)
Polling (IEEE 802.12)
Time Slices (GSM)
Will be discussed later together with these related technologies
The access to the physical medium depends on whether the medium needs to be
shared between many different users or whether its usage is reserved for a single
user only.
In the case of single-user connections, e.g. serial connections, ISDN circuits etc.,
the medium access is "grab it if you need it".
For shared-medium environments many different technologies have been
developed to control the access to the medium, e.g. CSMA/CD, token
control, polling, time slices etc.

49
Summary
Most link layer protocols utilize CRC for frame protection
ARQ Techniques: Idle-RQ, GoBackN, Selective-Ack, Positive-Ack
Additionally Cumulative-Ack possible
Only Continuous-RQ fills pipe
Flow control
Either by controlling window size
Or dedicated stop and go messages



50
" If a packet hits a pocket on a socket on a port,
And the bus is interrupted as a very last resort,
And the address of the memory makes your floppy disk abort,
Then the socket packet pocket has an error to report!
If your cursor finds a menu item followed by a dash,
And the double-clicking icon puts your window in the trash,
And your data is corrupted 'cause the index doesn't hash,
then your situation's hopeless, and your system's gonna crash!
If the label on the cable on the table at your house,
Says the network is connected to the button on your mouse,
But your packets want to tunnel on another protocol,
That's repeatedly rejected by the printer down the hall,
And your screen is all distorted by the side effects of gauss,
So your icons in the window are as wavy as a souse,
When the copy of your floppy's getting sloppy on the disk,
And the microcode instructions cause unnecessary risc,
Then you have to flash your memory and you'll want to ram your rom.
Quickly turn off the computer and be sure to tell your mom! "

51
Quiz
What's the problem when putting IP-packets directly onto ISDN?
What are Hamming-Codes?
What maximum bit-rate can TCP-hosts utilize when connected via satellite?
Explain why windowing protocols are more prone to DoS-attacks
2005/03/11 (C) Herbert Haas
HDLC
King of the Link

2
What is HDLC ?
High-Level Data Link Control
Early link layer protocol
Based on SDLC (Synchronous-DLC, IBM)
Access control on half-duplex modem-lines
Connection-oriented or connectionless
Framing
Frame Protection
Mother of many LAN and WAN protocols

3
Half-Duplex Management
[Figure: two modems on a half-duplex line in four phases; the RTS, CTS and DCD
signals control the line turnaround. Frames shown: P=0 DATA, P=1 DATA in one
direction, F=0 DATA, F=1 DATA in the other]

4
Same on Multipoint Lines (1)
[Figure: Primary Station and Secondary Stations C1, C2, C3, C4 on a multipoint
line; the primary sends "P=0, A=C2 DATA" and then "P=1, A=C2 DATA"]

5
Same on Multipoint Lines (2)
[Figure: Primary Station and Secondary Stations C1, C2, C3, C4 on a multipoint
line; secondary C2 answers with "A=C2, F=0 DATA" and then "A=C2, F=1 DATA"]

6
Early HDLC Example
[Figure: Mainframe attached via Escon or IBM Channel to a front-end processor
(FEP); HDLC lines with 9600..19200 bit/s run over modems (M, MSD) to cluster
controllers (CC1, CC2) serving terminals and Token Rings. DTE and DCE roles
are marked]

7
HDLC Basics (1)
Synchronous Transmission
Bit-oriented (Bit-Stuffing)
Developed by ISO
ISO 3309 and ISO 4335
Supports
Half- and full-duplex lines
Switched and non-switched channels
Point-to-point and multipoint lines

8
HDLC Basics (2)
Why do we use it today?
Framing
Frame protection
Error recovery
Building Blocks
SDLC is now a subset of HDLC

9
HDLC Basics (3)
Three types of stations
Primary Station
Secondary Station
Combined Station
Three modes
Normal Response Mode (NRM)
Asynchronous Response Mode (ARM)
Asynchronous Balanced Mode (ABM)

10
HDLC Modes (1)
NRM
Secondary sends only when permitted by primary
No communication between secondaries
Typically used in multipoint lines
ARM
Only a single secondary in ARM
This ARM-secondary may transmit whenever it wants (hereby avoiding collisions)

11
HDLC Modes (2)
ABM
Most important mode today !!!
Requires combined stations
Best mode for point-to-point lines

12
Non-operational Modes
Normal Disconnected Mode (NDM)
For unbalanced modes only
Secondary not able to receive
Asynchronous Disconn. Mode (ADM)
For balanced mode only
Combined station not able to receive
Initialization Mode (IM)
Parameter exchange or SW download

13
HDLC Family
[Figure: family tree of HDLC-derived protocols: HDLC, LAPB, LAPF, LAPD, LLC,
SDLC, QLLC, PPP, V.120, LAPM; grouped by standardization body: ITU-T, IEEE,
IBM, IETF]

14
HDLC Frame Format
Flag (7E) | Address (N × 8 bits) | Control (N × 8 bits) | Data | CRC (16 or 32 bits) | Flag (7E)

Control field, bits 0 1 2 3 4 5 6 7:
Information Frame:  0   | Send sequence number | P/F | Receive sequence number
Supervisory Frame:  1 0 | Supervisory Code     | P/F | Receive sequence number
Unnumbered Frame:   1 1 | Code                 | P/F | Code
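The three control field layouts can be told apart with a few bit operations. The sketch below rests on assumptions that are not stated on the slide: the control field is a single octet (3 bit sequence numbers), and bit 0 of the figure is the least significant bit of that octet. The supervisory codes are read as a two-bit number (0 = RR, 1 = RNR, 2 = REJ, 3 = SREJ). Names are illustrative.

```python
SUPERVISORY_CODES = {0: "RR", 1: "RNR", 2: "REJ", 3: "SREJ"}


def decode_control(octet):
    """Classify a one-octet HDLC control field and extract its sub-fields."""
    pf = (octet >> 4) & 1                     # Poll/Final bit
    if octet & 0b1 == 0:                      # bit 0 = 0: Information frame
        return {"type": "I", "ns": (octet >> 1) & 0b111,
                "pf": pf, "nr": (octet >> 5) & 0b111}
    if octet & 0b11 == 0b01:                  # bits 0,1 = 1,0: Supervisory frame
        return {"type": "S", "code": SUPERVISORY_CODES[(octet >> 2) & 0b11],
                "pf": pf, "nr": (octet >> 5) & 0b111}
    # bits 0,1 = 1,1: Unnumbered frame, the code is split around the P/F bit
    return {"type": "U", "code_bits": ((octet >> 2) & 0b11, (octet >> 5) & 0b111),
            "pf": pf}


print(decode_control(0xB4))   # I frame
print(decode_control(0x05))   # S frame
print(decode_control(0x3F))   # U frame
```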

15
Supervisory Frames
Control field: 1 0 | Supervisory Code | P/F | Receive sequence number
Supervisory codes:
0 0  RR (Receiver Ready)
0 1  RNR (Receiver Not Ready)
1 0  REJ (Reject)
1 1  SREJ (Selective Reject)

16
Unnumbered Frames
Control field: 1 1 | Code | P/F | Code
[Table: code bit patterns with the corresponding command and response names;
the pairing of bit patterns and names is not recoverable from the extracted text]
Commands and responses listed: UI / UI, SNRM, DISC / RD, UP, UA, NR0 / NR0,
NR1 / NR1, NR2 / NR2, NR3 / NR3, SIM / RIM, FRMR, SARM / DM, RSET, SARME,
SNRME, SABM, XID, SABME, XID

17
XID Frames
Used for user data exchange
For upper layer protocols prior to connection establishment
Used for address resolution
Used on switched lines only
Used for parameter negotiation
Max send and receive frame sizes
Window sizes
Extensions, etc...

18
ARQ (1)
Default: GoBack N without dedicated NACK frame (!)
Receive-Sequence Number indicates next frame expected
"Checkpointing"
Sender triggers (N)ACK information with P/F bit

19
ARQ (2)
Optional: Reject (REJ)
Dedicated NACK frame
Can be sent at any time (no checkpointing)
Optional: Selective Reject (SREJ)
Requests retransmission of single frame
Flow control with RR and RNR

20
HDLC Classes
Unbalanced Normal (UN): I, RR, RNR, SNRM, UA, DISC, DM, FRMR
Unbalanced Asynchronous (UA): I, RR, RNR, SARM, UA, DISC, DM, FRMR
Balanced Asynchronous (BA): I, RR, RNR, SABM, UA, DISC, DM, FRMR
Extensions:
1 Switched Circuits (XID, RD)
2 Reject (REJ)
3 Selective Reject (SREJ)
4 Unnumbered Information (UI)
5 Initialization (SIM, RIM)
6 Group Polling (UP)
7 Extended Addressing (16 bit)
8 Delete Response Frames
9 Delete Command Frames
10 7 bit sequence numbering
11 RESET
12 Data Link TEST
13 Request Disconnect (RD)
14 32 Bit CRC

21
Summary
Access control with P/F bit
Three modes: NRM, ARM, ABM
Error recovery uses Checkpointing
Mother of many LAN and WAN protocols
Extensible through building blocks
Quiz
What is Cisco-HDLC?
Does Ethernet (802.3) utilize connection-oriented HDLC?
What is Q.921 used for?
Which HDLC variant can be used on erroneous links?

2005/03/11 {C} Herbert Haas
Multiplexing Methods
Daubing the Information

"I think there is a world market for about five computers."
Thomas Watson, chairman of IBM, 1943

If he was right, then internetworking is useless and we can forget the whole chapter...

Multiplexing Types
TDM
  Most important
  Statistical and Deterministic
SDM
FDM and (D)WDM
CDM
Will be covered in other chapters

In this chapter we discuss Time Division Multiplexing (TDM) techniques, the most common transport technology in use today.
TDM can be used in a deterministic way, which means dedicated bandwidth and constant delay, or in a statistical manner, which means shared bandwidth and variable delay.
There are also some alternative multiplexing techniques:
Space Division Multiplexing (SDM): data is sent across physically separated media
Frequency Division Multiplexing (FDM): uses different electrical frequencies to transport data on one and the same physical medium
Dense Wavelength Division Multiplexing (DWDM): mainly used in fiber optic systems; data is transported on separate wavelengths of light
Code Division Multiplexing (CDM): data is transported (and differentiated) by different types of code

TDM (1)
[Figure: with SDM, users A to D reach users a to d over four separate lines; with TDM, the same four channels share one line in framed mode, which saves wires]

In this scenario we see a comparison between SDM and TDM.
First, the users a, b, c and d are connected using SDM, which requires one physical connection (one wire pair or fiber optic link) per communication pair. This is obviously very expensive, so the technique is rarely seen today.
In the TDM example we use only one physical connection for four communication pairs. The communication pairs are separated on the physical medium by time. This saves wires or fibers, but the single connection needs four times the transport capacity of one connection in the SDM example.

TDM (2)
Requires framed Iink Iayer
Saves wires
Is sIower than SDM
Requires muItipIexers and
demuItipIexers

Two fundamentaIIy different methods: Two fundamentaIIy different methods:

Deterministic TDM Deterministic TDM

StatisticaI TDM StatisticaI TDM


To implement TDM data needs to be packed in Irames especially in statistical
TDM techniques. It saves network inIrastructure costs because it needs much less
physical medias than SDM systems.
TDM is obviously slower than SDM because the available bandwidth is shared
between diIIerent communication channels and it requires devices that perIorm
the multiplexing and demultiplexing task.
Deterministic TDM has constant delay and bandwidth and is used in techniques
like ISDN, PDH or SDH.
Statistical TDM has variable delay and bandwidth and is used in technologies like
X25, Frame-relay or ATM.

Deterministic TDM (1)
[Figure: users A1 to D1 and A2 to D2 are connected through two multiplexers; the "trunk" carries repeating frames with the fixed slot order A B C D, separated by framing]

Deterministic TDM systems use transport frames such as E1, T1 or STM-1, into which the actual data is filled transparently. The framing is needed for synchronization, network management and sometimes error detection between multiplexer and demultiplexer.
Each communication channel on a deterministic TDM connection is identified by its position in time inside the TDM frame. In principle, the payload needs no further headers or address information.
The major disadvantage of deterministic TDM is the fixed correlation between communication channel and time slot position: a channel that is not in use still occupies its time slot and sends some kind of idle pattern.
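The slot-position principle can be sketched in a few lines. This is a toy model, not a real E1 or T1 framer: symbols are single characters, the idle pattern `IDLE` is a made-up placeholder, and framing bits are left out.

```python
IDLE = "~"  # hypothetical idle pattern for unused timeslots

def mux(channels):
    """Interleave per-channel symbol lists into frames with a fixed slot order."""
    n_frames = max(len(c) for c in channels)
    frames = []
    for i in range(n_frames):
        # Slot k of every frame always belongs to channel k, used or not.
        frames.append([c[i] if i < len(c) else IDLE for c in channels])
    return frames

def demux(frames, n_channels):
    """Recover the channels purely from slot position; no addresses are needed."""
    out = [[] for _ in range(n_channels)]
    for frame in frames:
        for slot, symbol in enumerate(frame):
            if symbol != IDLE:
                out[slot].append(symbol)
    return out

channels = [list("AAA"), list("B"), [], list("DD")]
frames = mux(channels)
print(frames)
print(demux(frames, 4) == channels)
```

Half of the twelve slots in this example carry only the idle pattern, which is the utilization problem described above.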

Deterministic TDM (2)
Trunk speed = Number of slots × User access rate
Each user gets a constant timeslot of the trunk
[Figure: four users with 64 kbit/s each share one trunk: 4 × 64 kbit/s + F ≈ 256 kbit/s]

The bandwidth needed on a deterministic TDM trunk is always the sum of all communication channels on the trunk plus some administrative (framing) overhead, because of the fixed correlation between communication channel and timeslot.
In our example there are four communication channels with a capacity of 64 kbit/s each, so the trunk needs a transport capacity of 256 kbit/s plus the framing overhead.
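The dimensioning rule is simple arithmetic; a minimal sketch follows. The function name and the `framing_kbps` parameter are hypothetical. The second call uses the well-known E1 layout of 32 timeslots of 64 kbit/s, two of which are not available for user channels.

```python
def trunk_rate_kbps(slots, access_rate_kbps, framing_kbps=0):
    """Trunk speed = number of slots x user access rate, plus framing overhead."""
    return slots * access_rate_kbps + framing_kbps

print(trunk_rate_kbps(4, 64))                      # the slide example without framing
print(trunk_rate_kbps(30, 64, framing_kbps=128))   # E1: 30 user slots plus 2 overhead slots
```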

Deterministic TDM - Facts
Order is maintained
Frames must have same size
No addressing information required
Inherently connection-oriented
No buffers necessary (QoS)
Protocol transparent
Bad utilization of trunk

In deterministic TDM systems the order of the data packets is maintained; packets cannot overtake each other or change their time slot position.
The frames must always have the same size because the timeslots have a constant length.
Address information is not required because the destination is determined by the time slot position.
Deterministic TDM is connection-oriented because a point-to-point connection is either set up on demand (SVC technique) or permanently established (PVC technique).
Buffers are not needed because the data stream is sent out at exactly the same speed as it is received.
It is protocol transparent because in theory no further encapsulation is needed and the destination is determined by the timeslot position.
Trunk utilization is bad if only a few of the reserved timeslots are in use.

Statistical TDM (1)
Trunk speed dimensioned for average usage
Each user can send packets whenever she wants
[Figure: users A1 to D1 and A2 to D2 share a 256 kbit/s trunk; packets from A, B, C and D appear on the trunk in no fixed order; average data rates 64 kbit/s]

In statistical TDM systems there is no fixed correlation between timeslot position and communication channel, as there is in deterministic TDM systems. The trunk speed can therefore be chosen according to the average transport needs of the users.
Any user is allowed to send data at any time. This requires a separate addressing and framing scheme, because the timeslot position no longer identifies the destination.

Statistical TDM (2)
If other users are silent, one (or a few) users can fully utilize their access rate
[Figure: only user D is sending; D's packets fill the 256 kbit/s trunk]

One of the major advantages of statistical TDM over deterministic TDM is that a single user may use the complete transport capacity of the trunk when the trunk is otherwise empty.
On the other hand, all users may want to use the trunk at the same time. Because the trunk capacity is dimensioned statistically, the users may then feed in more data than the trunk can carry.
For such cases the statistical TDM devices need buffers to compensate for the speed differences. If a buffer overflows, data may even be lost.
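The buffering behaviour can be sketched with a tick-based toy model. All numbers are invented for illustration: `arrivals` holds the total number of packets offered by all users per tick, the trunk drains `trunk_capacity` packets per tick, and the buffer holds at most `buffer_size` packets.

```python
def simulate_stat_mux(arrivals, trunk_capacity, buffer_size):
    """Return (sent, dropped, max_queue) for a statistical multiplexer."""
    queue = sent = dropped = max_queue = 0
    for offered in arrivals:
        queue += offered
        # The trunk forwards at most its capacity in every tick.
        out = min(queue, trunk_capacity)
        queue -= out
        sent += out
        # Packets that do not fit into the buffer are lost.
        if queue > buffer_size:
            dropped += queue - buffer_size
            queue = buffer_size
        max_queue = max(max_queue, queue)
    return sent, dropped, max_queue

# Average load is below the trunk capacity, but the burst in ticks 3 and 4 overflows the buffer.
print(simulate_stat_mux([4, 0, 8, 8, 0, 0], trunk_capacity=4, buffer_size=4))
```

The queue length also determines how long a packet waits, which is the variable delay mentioned for statistical TDM.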

Statistical TDM - Facts
Good utilization of trunk
  Statistically dimensioned
Frames can have different size
Multiplexers require buffers
Variable delays
Address information required
Not protocol transparent

Statistical TDM allows good utilization of the trunk because no bandwidth is wasted on idle patterns and the capacity is determined by the average needs of the users.
The frame size may vary depending on the needs of the users.
Buffering is required under trunk overload conditions.
The delay is variable because of buffering.
Address information is needed because the time slot position no longer identifies the destination.
Statistical TDM is not protocol transparent because separate framing and addresses are needed.

Networking: Fully Meshed
Metcalfe's Law: n(n-1)/2 links
Good fault tolerance
Expensive
[Figure: six users A to F, each connected directly to every other user]

A fully meshed network is what everybody wants, because it gives 100% redundancy and an optimal path to each destination. Unfortunately only very few can afford it, because the cost of the network infrastructure grows according to Metcalfe's law, which is expressed by the formula n × (n-1)/2. This means that ten sites connected in an any-to-any topology need 45 connections.
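The link count is easy to check numerically; the helper name below is hypothetical.

```python
def full_mesh_links(n):
    """Number of point-to-point links needed to connect n sites any-to-any."""
    return n * (n - 1) // 2

for sites in (6, 10, 100):
    print(sites, full_mesh_links(sites))
```

Six sites already need 15 links and one hundred sites need 4950, which shows how quickly the cost grows with the number of sites.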

Networking: Switching
Only 6 links
Switch supports either deterministic or statistical TDM
[Figure: six users A to F, each connected by one link to a central switch]

One way to save costs is to use network switches, which handle the traffic between the different destinations.
The switches may use a technology based on either deterministic or statistical TDM. In this case we need only six links instead of fifteen to establish communication between all sites.

Circuit Switching
[Figure: user A2 on trunk TA reaches user B5 on trunk TB through four switches that are interconnected by trunks T1 to T4. Each switch holds a switching table, incoming trunk(timeslot) -> outgoing trunk(timeslot):
First switch: TA(1) -> T1(4) (A1-C9), TA(2) -> T2(7) (A2-B5), TA(3) -> T2(6) (A3-D1)
Second switch: T2(6) -> T4(1), T2(7) -> T3(18)
Third switch: T3(18) -> T4(5), T3(19) -> T1(1)
Fourth switch: T4(4) -> TB(9), T4(5) -> TB(5)
Connection A2-B5: TA(2) -> T2(7) -> T3(18) -> T4(5) -> TB(5)]

Circuit switching technology is based on deterministic TDM.
Every switch holds a switching table that maps each incoming trunk/timeslot to an outgoing trunk/timeslot.
In our example the connection between users A2 and B5 is established by four switches and their switching tables. To both users this connection looks like a dedicated point-to-point link; they are not aware of what is going on inside the network cloud.
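The table lookups for the connection A2-B5 can be traced with a short sketch. The table contents are taken from the figure; the assumption that this connection passes the four tables in exactly this order, and the names `SWITCH_TABLES` and `trace_circuit`, are illustrative only.

```python
# One table per switch: (incoming trunk, timeslot) -> (outgoing trunk, timeslot)
SWITCH_TABLES = [
    {("TA", 1): ("T1", 4), ("TA", 2): ("T2", 7), ("TA", 3): ("T2", 6)},
    {("T2", 6): ("T4", 1), ("T2", 7): ("T3", 18)},
    {("T3", 18): ("T4", 5), ("T3", 19): ("T1", 1)},
    {("T4", 4): ("TB", 9), ("T4", 5): ("TB", 5)},
]

def trace_circuit(entry, tables):
    """Follow one circuit through the given switches, one table lookup per switch."""
    hops = [entry]
    for table in tables:
        entry = table[entry]   # position alone decides the outgoing trunk and slot
        hops.append(entry)
    return hops

print(trace_circuit(("TA", 2), SWITCH_TABLES))
```

No address appears anywhere in the lookup: the incoming trunk and timeslot fully determine where the data goes.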

Circuit Switching - Facts
Based on deterministic TDM
  Minimal delay
  Protocol transparent
  Possibly bad utilization
  Good for isochronous traffic (voice)
Switching table entries
  Static (manually configured)
  Dynamic (signaling protocol)
  Scales with number of connections!

Circuit switching based on deterministic TDM has a minimal, fixed delay and is protocol transparent, but network utilization may be bad because of connections that are currently unused.
Circuit switching is therefore very well suited for isochronous traffic like voice communication or video conferencing. It is the typical technology used by telcos.
The switching table entries needed for proper data forwarding are either generated manually with the help of network management software or dynamically by a signaling protocol.

Typical User-Configuration
[Figure: a router and a PBX connect through a CSU/DSU to the switch of the service provider]
Example: V.35/RS-530/RS-422 synchronous serial ports
Channel Service Unit/Data Service Unit (CSU/DSU or "modem")
CSU performs protective and diagnostic functions
DSU connects a terminal to a digital line
Example: E1 or T1 circuit

In real life a typical user configuration looks like the one shown in our example. Some users are connected via a shared medium to a router, for example a Cisco router. The router is connected to a Channel Service Unit/Data Service Unit (CSU/DSU) over a synchronous serial interface with a data rate of up to 2 Mbit/s.
The CSU/DSU terminates the TDM circuit supplied by the service provider and converts between the synchronous serial interface and the TDM interface. In our scenario a PDH E1 (2.048 Mbit/s) or T1 (1.544 Mbit/s) circuit is used.
The connection supplied by the service provider may be shared between the router and the Private Branch Exchange (PBX). The router uses reserved timeslots of the E1/T1 trunk for data traffic while the PBX uses other timeslots to establish phone calls.

Packet Switching
Each switch must analyze address information
"Store and Forward"
[Figure: same topology as in the circuit switching example; packets between user A2 and user B5 carry address information]

Packet switching, which is based on statistical time division multiplexing, needs addresses. Remember that there is no correlation between timeslot and destination.
Each switch must analyze the destination address of every data packet in order to forward it according to a forwarding table.
In our example user A2 communicates with user B5 with the help of addresses.

Technology Differences
Datagram Principle
  Global and routable addresses
  Connectionless
  Routing Table
Virtual Call Principle
  Local addresses
  Connection-oriented
  Switching Table

Two major technologies make use of the statistical TDM principle.
The Datagram principle uses globally unique and routable addresses. Forwarding decisions are made using statically or dynamically generated routing tables, and the data transport is connectionless. Examples of the Datagram principle are IP, IPX and AppleTalk.
The Virtual Call principle uses locally significant addresses, known as virtual circuit identifiers. The data transport is connection-oriented and the forwarding decisions are made using switching tables. The switching tables hold the incoming trunk/circuit identifier and the corresponding outgoing trunk/circuit identifier. Examples of Virtual Call services are X.25, Frame Relay and ATM.

Datagram
[Figure: user A.2 sends packets (source A2, destination B5) to user B.5 across routers R1 to R5. Routing tables, Destination -> Next Hop:
R1: A -> local, B -> R2, C -> R2
R2: A -> R1, B -> R4, C -> R3
R4: A -> R2, B -> R5, C -> R2
R5: A -> R4, B -> local, C -> R4]

With datagram technology, user A.2 sends out data packets destined for user B.5. Every single datagram carries the sender and receiver addresses.
The datagram forwarding devices, routers in our example, hold a routing table in memory. The routing table maps the destination address of a data packet to the outgoing interface and the next hop router. Data packets are thus forwarded through the network hop by hop.
The routing tables are set up either manually by the administrator or with the help of dynamic routing protocols like RIP, OSPF or IS-IS. Dynamic routing protocols may reroute traffic after a network failure, so packets may overtake each other in these systems.
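Hop-by-hop forwarding with the routing tables of the figure can be sketched as follows. The assignment of the four tables to R1, R2, R4 and R5 is inferred from their next-hop entries, the table of R3 is not shown on the slide and is therefore missing, and the function name `forward` is hypothetical.

```python
# Routing tables from the figure: destination network -> next hop
ROUTING_TABLES = {
    "R1": {"A": "local", "B": "R2", "C": "R2"},
    "R2": {"A": "R1", "B": "R4", "C": "R3"},
    "R4": {"A": "R2", "B": "R5", "C": "R2"},
    "R5": {"A": "R4", "B": "local", "C": "R4"},
}

def forward(first_router, dest_address):
    """Return the routers a datagram visits on its way to dest_address."""
    network = dest_address.split(".")[0]   # topology part of the address, e.g. "B"
    router = first_router
    path = [router]
    # Every router decides on its own, using only the destination address.
    while ROUTING_TABLES[router][network] != "local":
        router = ROUTING_TABLES[router][network]
        path.append(router)
    return path

print(forward("R1", "B.5"))
print(forward("R5", "A.2"))
```

Because every packet is looked up independently, changing one table entry immediately changes the path of all following packets, which is what makes rerouting (and packet overtaking) possible.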

Datagram - Facts (1)
Addresses contain topological information
  Must be globally unique
Routing table is configured
  Static (manually)
  Dynamic (routing protocols)
Endless circling in case of routing loops
  Important issue among routing protocols
Requires "routable" or "routed" protocols

The addresses used in datagram technologies need to be unique and structured. Structured means that one part of the address identifies the user while another part carries topology information (it describes the network where the user is located).
As already mentioned, routing can be based on static configuration or on dynamic routing protocols.
Inconsistent information in the routing tables may cause routing loops, which lead to endlessly circling packets. Some protocols, like IP, use a Time to Live field in their header to get rid of endlessly circling data packets.
Networks built on datagram technology typically need two different types of protocols: routed protocols, which are used by the end users, and routing protocols, which run between routers to build up the routing tables.
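How a Time to Live field stops a circling packet can be sketched with two deliberately inconsistent routing tables. The tables, the function name `deliver` and the simplified TTL handling (decrement per forwarding router, drop at zero) are assumptions for illustration.

```python
def deliver(tables, start, network, ttl):
    """Forward hop by hop; return ("delivered" | "dropped", number of hops taken)."""
    router, hops = start, 0
    while True:
        next_hop = tables[router][network]
        if next_hop == "local":
            return ("delivered", hops)
        ttl -= 1                 # every forwarding router decrements the TTL
        if ttl <= 0:
            return ("dropped", hops)
        router = next_hop
        hops += 1

LOOP = {"R1": {"B": "R2"}, "R2": {"B": "R1"}}       # R1 and R2 point at each other
GOOD = {"R1": {"B": "R2"}, "R2": {"B": "local"}}

print(deliver(LOOP, "R1", "B", ttl=8))
print(deliver(GOOD, "R1", "B", ttl=8))
```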

Datagram - Facts (2)
No connection establishment necessary
  Faster delivery of first data
  No resource reservation (bad QoS)
Sequence not guaranteed
  Rerouting on topology change
  Load sharing on redundant paths
  End stations must care

Datagram services typically operate in connectionless mode. This gives a slightly faster delivery of the first datagrams because no time is spent on establishing a connection.
Reserving resources for QoS support is very difficult because the path of the data packets through the network may change during a session.
Topology changes cause rerouting when dynamic routing protocols are used, and load sharing is applied when two or more paths with identical distance lead to the destination. Rerouting and load balancing may also lead to packet overtaking, so the correct order of packet arrival is not guaranteed.

Datagram - Facts (3)
Best effort service
  Router may drop packets
  Reliable data transport requires good transport layer ("Dumb network, smart hosts")
Simple protocols
  Easy to implement (Internet's success)
Proactive flow control difficult
  Since routes might change

Networks based on datagram technology support only best effort service, which means "as good as it gets".
Routers that drop data packets because of buffer overflow or other problems do not care about error recovery. Error recovery is a task of the end stations of a network. They have to take care of retransmissions in case of packet loss or transmission errors. This is typically done by layer 4 protocols like TCP, which works in connection-oriented mode.
Because of this behavior of datagram networks, the protocols that drive this kind of network can be kept simple.
Proactive flow control is also very difficult to establish because paths, and with them transport capacities, may change while data packets travel through the network.

Examples
IP
IPX
AppleTalk
OSI CLNP

Remember: typical examples of datagram networks are IP, IPX, AppleTalk and the little-known OSI CLNP protocol stack.

Virtual Call - CR
[Figure: user A.2 sends a call request (CR, destination B5, source A2) to user B.5 through the packet switches PS1, PS2, PS4 and PS5. Each switch forwards the CR using its routing table (Destination -> Next Hop) and creates a switching table entry (In -> Out, written as port:identifier):
PS1: routing A -> local, B -> PS2, C -> PS2; switching P0:44 -> P2:10
PS2: routing A -> PS1, B -> PS4, C -> PS3; switching P0:10 -> P3:02
PS4: routing A -> PS2, B -> PS5, C -> PS2; switching P1:02 -> P2:69
PS5: routing A -> PS4, B -> local, C -> PS4; switching P0:69 -> P2:19
The CR carries the identifiers 44, 10, 02 and 69 on successive links and reaches B.5 as an incoming call (IC) with identifier 19]

Virtual Call Service technology uses addresses as well, but in a different manner than datagram services. The address information is only used at the beginning of a conversation to set up a connection.
Once the connection is established, data packets are forwarded according to virtual circuit identifiers, which are held in switching tables.
In our example user A.2 sends a connection setup request to user B.5. The network forwards this request using routing tables. These routing tables can be configured manually by an administrator or built dynamically with the help of routing protocols, e.g. PNNI.

Virtual Call - CA
[Figure: user B.5 answers with a call accepted (CA) message, which travels back along the switching table entries with the identifiers 19, 69, 02 and 10 and reaches user A.2 as call connected (CC) with identifier 44]

The connection setup request builds a tunnel-like connection out of virtual circuit identifiers held in switching tables.
User B.5 hopefully answers with a connection accept message, which is sent back through the already established tunnel. From now on only the switching tables with their circuit identifiers are used to forward the data packets.
The entries in the switching tables are created dynamically by each network node during the connection setup procedure.

Virtual Call - Data
[Figure: data packets from A.2 to B.5 carry only the circuit identifier, which is swapped at every switch: 44 -> 10 -> 02 -> 69 -> 19]

During the data transport phase addresses are no longer needed.
Data packets are forwarded using virtual circuit identifiers, which change hop by hop. A circuit identifier has only local meaning, in combination with the trunk it is used on.
This behavior also prevents packet overtaking and makes it easier to implement QoS technologies in the network.
If a connection between two nodes is lost due to a network failure, a new connection has to be established, starting the connection setup procedure from the beginning.
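The identifier swapping of the data phase can be sketched with the switching table entries from the figures. The cabling between the switches (`LINKS`) is inferred from the port names in those entries, and the function name `switch_packet` is hypothetical.

```python
# Switching tables created during call setup: (in port, in id) -> (out port, out id)
SWITCHING_TABLES = {
    "PS1": {("P0", 44): ("P2", 10)},
    "PS2": {("P0", 10): ("P3", 2)},
    "PS4": {("P1", 2): ("P2", 69)},
    "PS5": {("P0", 69): ("P2", 19)},
}
# Assumed cabling: (switch, out port) -> (next switch, in port)
LINKS = {
    ("PS1", "P2"): ("PS2", "P0"),
    ("PS2", "P3"): ("PS4", "P1"),
    ("PS4", "P2"): ("PS5", "P0"),
}

def switch_packet(switch, in_port, ident):
    """Return the circuit identifiers a data packet carries along its path."""
    idents = [ident]
    while True:
        out_port, ident = SWITCHING_TABLES[switch][(in_port, ident)]
        idents.append(ident)                 # the identifier is rewritten at every hop
        if (switch, out_port) not in LINKS:  # port leads to the end system
            return idents
        switch, in_port = LINKS[(switch, out_port)]

print(switch_packet("PS1", "P0", 44))
```

Each lookup is an exact match on a short key, with no address analysis at all.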

Virtual Call - Facts (1)
Connection establishment
  Through routing process (!)
  Globally unique topology-related addresses necessary
  Creates entries in switching tables
  Can reserve switching resources (QoS)
Packet switching relies on local identifiers
  Not topology related
  Only unique per port

Remember that routing processes are needed even in Virtual Call Service technologies to allow the setup of a connection. The addresses used for connection setup need to be structured and globally unique.
The connection setup procedure creates entries in the switching tables to support the data forwarding phase.
It is quite easy to reserve transport resources (QoS) during connection establishment because the path through the network remains the same for the whole conversation.
Data packets are forwarded according to virtual circuit identifiers that are local and unique only per port.

Virtual Call - Facts (2)
Packet switching is much faster than packet forwarding of routers
  Routing process is complex, typically implemented in software
  Switching is simple, typically implemented in hardware

Why is routing slower? We give just a short explanation here. First, a router must determine which part of the address is topology relevant; with IP addresses this so-called network identifier has variable length. Second, the router must find the best ("longest") match of the destination net-ID among the routing table entries. Third, the next hop might not be the physical next hop; in this case a recursive routing table lookup is necessary. Fourth, because of the topology-related addresses (and the associated complex forwarding processes) the routing table cannot easily be stored in a high-performance data structure. All this is typically implemented in software.
Switching is completely different. The addresses are unstructured and not topology related. The switching process simply looks up the correct entry in the switching table, determines the outgoing interface and modifies the logical channel number (the local connection identifier). The whole process can be implemented in hardware. Additionally, switching is greatly accelerated by hashing functions (CAM tables).
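The contrast between a longest-match lookup and an exact-match lookup can be sketched with Python's standard `ipaddress` module. The prefixes and next-hop names in `ROUTES` are invented, and the linear scan is only a simple way to show the principle; real routers use specialised data structures.

```python
import ipaddress

# Hypothetical routing table: prefix -> next hop
ROUTES = {
    "0.0.0.0/0": "R9",
    "10.0.0.0/8": "R1",
    "10.1.0.0/16": "R2",
    "10.1.2.0/24": "R3",
}

def longest_prefix_match(dest, routes):
    """Routing: compare against every prefix and keep the most specific match."""
    addr = ipaddress.ip_address(dest)
    best_len, best_hop = -1, None
    for prefix, next_hop in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len, best_hop = net.prefixlen, next_hop
    return best_hop

# Switching: one exact-match lookup on (port, identifier), here a plain dict.
SWITCHING_TABLE = {("P0", 44): ("P2", 10)}

print(longest_prefix_match("10.1.2.3", ROUTES))
print(SWITCHING_TABLE[("P0", 44)])
```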

Virtual Call - Facts (3)
Connection can be regarded as virtual pipe
  Sequence is guaranteed
  Resources can be guaranteed
Network failures disrupt pipe
  Connection re-establishment necessary
  Datagram networks are more robust

Remember that a connection in Virtual Call Service technologies can be seen as a virtual pipe or tunnel. Therefore the correct sequence of data packets is guaranteed and resources can be reserved quite easily.
Network failures lead to a teardown of the connection and to a new connection setup procedure.
Datagram networks are more robust because setting up a proper connection is more difficult than forwarding data packets hop by hop. The connection setup procedure needs more sophisticated protocols, especially when QoS parameters have to be taken into account.

Virtual Call - Facts (4)
Virtual call multiplex
  Multiple virtual pipes per switch and interface possible
  Pipes are locally distinguished through connection identifier
Other names for connection identifier
  LCN (X.25)
  DLCI (Frame Relay)
  VPI/VCI (ATM)

All WAN switching technologies use the principle described above, but the connection identifier has different names. In X.25 it is called the Logical Channel Number (LCN). In Frame Relay it is the Data Link Connection Identifier (DLCI). ATM cells are switched using the Virtual Path Identifier/Virtual Circuit Identifier (VPI/VCI). No matter how complicated the name, it is simply a dumb identifier without any special meaning.

Example
[Figure: a node in the center of the network fails (BANG)]

This example shows what happens if a node in the center of a network collapses. All connections through the collapsed node are torn down and new connections have to be established using signaling. The new connection setup requests cause a lot of overhead. In Virtual Call Service technology it is up to the end devices to set up a new connection through the network.
In datagram technology the network itself would fix this problem by rerouting.

Two Service Types
Switched Virtual Circuit (SVC)
  Dynamic establishment as shown
  At the end a proper disconnection procedure necessary
Permanent Virtual Circuit (PVC)
  No establishment and disconnection procedures necessary
  Switching tables preconfigured by administrator

Virtual Call Service technology offers two basic types of connections: Switched Virtual Circuits (SVC) and Permanent Virtual Circuits (PVC).
SVCs establish a connection dynamically when it is needed and tear it down when the data transfer is finished. SVCs are mainly used with X.25 and ATM services.
PVCs are permanently up and can be seen as leased line services. PVCs are mainly used with Frame Relay and ATM services.

Taxonomy
Circuit Switching
  Deterministic multiplexing, low latency, designed for isochronous traffic
  Dynamic signaling (Q.931, SS7, ...) or static configuration (manual configuration)
  ISDN, PDH, SONET/SDH
Packet Switching
  Statistical multiplexing, store and forward, addressing necessary, designed for data traffic
  Datagram (connectionless): IP, IPX, AppleTalk
  Virtual Call (connection-oriented): X.25, Frame Relay, ATM

This slide gives an overview of the TDM technologies discussed so far.
At the top we find the two basic flavors of TDM systems: circuit switching based on deterministic TDM and packet switching based on statistical TDM.
Current circuit switching technologies are ISDN and PDH systems, which can be used for SVC services with the signaling protocols Q.931 and Signaling System 7 (SS7), or for PVC services with manually configured SONET/SDH channels.
Current packet switching technologies are divided into datagram services like IP and IPX and Virtual Call services like X.25, ATM and Frame Relay.

Summary
Only two worlds: circuit switching or packet switching
  The first is good for voice, the latter is good for data
  Everybody wants to have the best of both worlds
Datagram (CL) versus Virtual Call (CO)
  Different address types (!)



Synchronization Revisited
[Figure: layer N+1 with (a)synchronous multiplexing on top of layer N with (a)synchronous transport]

This slide wants to tell you that the world is not only black and white but is always made up of shades of grey.
The same is true for networking. Networks are made up of layers, and each layer has its own identity and properties, with interfaces to the next higher and lower layer.
So it is quite easy to take a synchronous layer and put something asynchronous on top of it, like ATM on top of SONET/SDH.

Quiz
Derive Metcalfe's law. Which well-known formula looks very similar?
Let's improve the VC principle! What's the advantage of using more than one label per packet?
How do hash tables work?
How can we get the best of both worlds (circuit/packet)?

Q1: Each of the n users has n-1 connections to the n-1 other users. Divide this by 2 because every line serves both directions and would otherwise be counted twice. This is also known as the Gauss formula for the sum 1 + 2 + ... + (n-1).
Q2: Traffic aggregation, fewer switching-table entries (MPLS, ATM)
Q3: Index = hash(key), where the hash is typically a modulo-prime operation
Q4: Cells, good queuing algorithms, HW-based routing, MPLS, ATM, optical packet switching
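The answer to Q3 can be illustrated with a minimal hash table that uses a modulo-prime hash and chaining for collisions. The table size `PRIME`, the integer keys and the function names are assumptions made for this sketch.

```python
PRIME = 97                               # hypothetical table size, a prime number
table = [[] for _ in range(PRIME)]       # one bucket (list) per index

def index(key):
    """Hash function: key modulo a prime gives the bucket index."""
    return key % PRIME

def insert(key, value):
    bucket = table[index(key)]
    for i, (k, _) in enumerate(bucket):
        if k == key:                     # key already present: overwrite
            bucket[i] = (key, value)
            return
    bucket.append((key, value))

def lookup(key):
    for k, v in table[index(key)]:       # only one bucket is searched
        if k == key:
            return v
    return None

insert(44, "P2:10")
insert(44 + PRIME, "P3:02")              # same index as key 44: a collision
print(lookup(44), lookup(44 + PRIME), lookup(5))
```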

2005/03/11 {C} Herbert Haas
X.25
Slow, Safe and Reliable

What is X.25?
Connection-oriented
Packet Switching
WAN Technology
Specifies User to Network Interface (UNI)
Does not specify network itself (!)

X.25 is a wide area network service based on the virtual circuit technique. X.25 only specifies the user to network interface between an X.25 DTE (e.g. a router) and an X.25 DCE (packet switch). It is connection-oriented and based on the store-and-forward principle of packets (packet switching technology).

Roots of X.25
Created by CCITT for Telco data networks in 1976
  Example: Datex-P
Adopted and extended by ISO
  Defined as OSI-layer 3 protocol

X.25 was created in 1976 by the CCITT (today ITU-T) as a data communications technology. It enabled telcos to offer data communication interfaces to their customers. Later it was adopted by the ISO because X.25 fitted perfectly into the OSI model (layer 3).

Features
Reliable
  Flow control and error recovery on layer two
  Optionally on layer three
  Can be used on bad links
Secure
  Often used with encryption
  Network checks caller-ID
  High accountability

X.25 was developed for low-quality, low-speed lines. It uses error recovery and flow control on layer 2 to control the transmission of frames over physical lines, and flow control plus optional error recovery on layer 3 to control the transmission of packets over a virtual circuit. This makes X.25 very safe and usable on very bad links. X.25 is available worldwide and is today mostly used for transactions (e.g. Visa).

X.25 Network
[Figure: X.25 DTEs attach over the UNI to X.25 DCEs at the edge of the X.25 network; a DCE consists of a modem and a Packet Switching Exchange (PSE)]

The network consists of three components:
1) Data terminal equipment (DTE), which is the user device and the logical X.25 end system
2) Data communication equipment (DCE, also called data circuit-terminating equipment), which consists of modem and packet switch
3) Packet Switching Exchange (PSE), or simply the packet switch

Logical Channels (2)
Logical Channel Number (LCN)
  Identifies connection
  Local significance only (!)
PVCs or SVCs
Store and Forward Technology
  Variable delays (!)

One of the most important concepts of X.25 is the logical channel number (LCN). Each virtual circuit is identified by its LCN.
Virtual circuits appear to end systems as transparent transport pipes (logical point-to-point connections).

X.25 Layer Model
Other Services: Data
Layer 3, X.25 PLP: LCN | Data
Layer 2, LAPB: F | A | C | LCN | Data | CRC | F
Layer 1: X.21, X.21 bis, EIA/TIA-232, EIA/TIA-449, EIA-530, G.703

Several physical layer standards have been specified for X.25. One of the most important is X.21bis, which defines the mechanical and electrical interface. X.21bis allows synchronous transmission of data at up to 19.2 kbit/s.

X.25 PLP (1)
X.25 PLP
LCN (local significance) 0-4095
X.121 DTE addresses (unique)
Virtual Circuit Services
Prioritizes precedence data
Flow control
Optional end-to-end error recovery (D-bit)
LCN 0 (zero) is reserved for diagnostics.
X.121 addresses are structured (routable) addresses. An address is a sequence
of numbers associated with continent, country, city and so on.
Priority packets are sent using an interrupt request. Each interrupt packet
must be acknowledged by an Interrupt Confirmation packet before the next
interrupt packet can be sent (Idle-RQ method). The length of interrupt packets
is only 32 bytes.
Flow control is based on windowing and RNR (RR) messages. When the receiver
delays its acknowledgments, the send window of the sender closes (windowing).
The default window size is 2. Optional end-to-end error recovery (Go-Back-N)
can be achieved using the D-bit in the X.25 PLP header. In this mode, X.25
REJECT messages can be sent.
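
The windowing behaviour described in these notes can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: the
class SendWindow and its method names are made up for this example, and the
modulo-8 sequence numbering is an assumption that the notes do not state.
Only the default window size of 2 is taken from the text.

```python
MODULO = 8  # assumption: modulo-8 sequence numbering (not stated in the notes)


class SendWindow:
    """Hypothetical sender-side view of a PLP-style sliding window."""

    def __init__(self, size=2):
        self.size = size    # default window size from the notes
        self.next_seq = 0   # S: sequence number of the next data packet
        self.acked = 0      # R from the last RR: next packet the peer expects

    def outstanding(self):
        # Number of packets sent but not yet acknowledged.
        return (self.next_seq - self.acked) % MODULO

    def is_open(self):
        return self.outstanding() < self.size

    def send(self):
        if not self.is_open():
            raise RuntimeError("window closed: wait for an RR")
        seq = self.next_seq
        self.next_seq = (self.next_seq + 1) % MODULO
        return seq

    def receive_rr(self, r):
        # RR R=r acknowledges every packet sent before sequence number r.
        self.acked = r % MODULO
```

With the default size, two calls to send() close the window, and
receive_rr(2) opens it again.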

X.25 PLP (2)
[Figure: call setup and clearing across the network, with timers T21, T23,
T11 and T13. Calling side: Connection Request, Call Connected, Clear Request,
Clear Confirm. Called side: Incoming Call, Call Accepted, Clear Indication,
Clear Confirm. LCNs shown: 4100 and 55.]
The above picture shows the basic call establishment procedure, which is the
task of layer 3 (X.25 PLP). Note that this layer is responsible for logical
channel numbers, so the X.25 PLP takes care of (de)multiplexing different
virtual calls over the same physical medium.

Window=2 and D=0
[Figure: sequence diagram with window size 2 and D=0. Data packets shown:
S=0 R=0, S=1 R=0, S=2 R=0, S=3 R=0, S=4 R=2, S=5 R=2. Acknowledgements shown:
RR R=2, RR R=4, RR R=5, RR R=6. The diagram marks where the window is closed
and where it is opened again.]
The above example shows an X.25 communication without using the D bit. Here,
data is reliably sent from the left PC (DTE) to the left switch (DCE) and from
the left switch to the right (remote) switch. But an acknowledgement is
generated as soon as a packet arrives at the remote switch, so there is
actually no guarantee that this packet will arrive at the right PC (DTE).
In normal cases, however, the local DTE-DCE link is reliable enough because of
LAPB.

Window=2 and D=1
[Figure: sequence diagram with window size 2 and D=1. Data packets shown:
S=0 R=0, S=1 R=0, S=2 R=0, S=4 R=2, S=5 R=2, S=6 R=2. Acknowledgements shown:
RR R=1, RR R=2, RR R=3, RR R=5, RR R=6, RR R=7. The diagram marks where the
window is closed and where it is opened again.]
The above example shows the effect of an end-to-end acknowledgement, which
is provided if D=1. Additionally, it can be seen that the sequencing on the
left side is completely decoupled from the sequencing on the right side.
Consider the data packet sent with S=2. It arrives at the right switch shortly
after the right host sent RR=6. This RR=6 is transformed to RR=2 by the left
switch. Usually we might conclude that the right host expects the left host's
packet with S=3. But the right switch will send the current packet (S=2) as
S=6. Note that both switches might have no idea of the sequence numbers used
on the other side.

X.121 Addresses
Public data network numbering (ITU-T)
Only used to establish SVCs
Aka International Data Number (IDN)
4 + up to 10 digits
Example: 2 2 3 2 2 5 2 3 1 0 0 0 0
Structure: DNIC (Country + PSN), followed by the NTN
DNIC...Data Network Identification Code
NTN...National Terminal Number
PSN...Public Switched Network
The Data Network Identification Code (DNIC) is optional and typically
omitted inside a specific public switched network.
The first digit in the DNIC identifies the zone. For example, Zone 2 covers
Europe and Zone 3 includes North America. The NTN identifies the DTE and
can be up to 10 digits in length. It is possible to map an IP address into the
NTN, see RFC 1236. By the way, the example address above belongs to the
University of Vienna.
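
The address structure can be made concrete with a small parser. This is a
sketch only: the function name parse_x121 is hypothetical, the split of the
DNIC into a three-digit country part and a one-digit network part is an
assumption based on the "Country / PSN" labels of the slide, and the address
in the usage note is a made-up digit string, not a real subscriber.

```python
def parse_x121(address):
    """Split an X.121 address into its DNIC and NTN parts (illustrative)."""
    digits = address.replace(" ", "")
    if not digits.isdigit():
        raise ValueError("X.121 addresses consist of decimal digits only")
    if not 5 <= len(digits) <= 14:
        raise ValueError("expected a 4-digit DNIC plus 1 to 10 NTN digits")
    dnic, ntn = digits[:4], digits[4:]
    return {
        "zone": dnic[0],      # first DNIC digit identifies the zone
        "country": dnic[:3],  # assumption: three-digit country part
        "psn": dnic[3],       # assumption: one digit selects the network
        "dnic": dnic,
        "ntn": ntn,           # identifies the DTE, up to 10 digits
    }
```

For the made-up address "3110 1234567" this yields zone 3, DNIC 3110 and
NTN 1234567.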

LCN Ranges
Outgoing requests succeed over coincident incoming calls with same LCN
Predefined LCN ranges
Minimize probability of LCN collisions
[Figure: LCN axis from 0 to 4095 between DTE and DCE, divided by the markers
LIC, HIC, LTC, HTC, LOC and HOC]
Three ranges can be predefined by the provider to avoid collisions. Two
threshold markers are associated with each range. These are:
1) LIC (lowest incoming channel) and HIC (highest incoming channel)
2) LTC (lowest two-way channel) and HTC (highest two-way channel), which mark
the range for incoming and outgoing channels
3) LOC (lowest outgoing channel) and HOC (highest outgoing channel)
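
The three ranges can be read as a simple lookup. The sketch below is an
illustration only: the function classify_lcn is hypothetical and the threshold
values in EXAMPLE_RANGES are invented, since real values are configured by
the provider.

```python
# Invented example thresholds; a provider would define the real ones.
EXAMPLE_RANGES = {
    "incoming": (1, 99),     # LIC..HIC
    "two-way": (100, 199),   # LTC..HTC
    "outgoing": (200, 299),  # LOC..HOC
}


def classify_lcn(lcn, ranges=EXAMPLE_RANGES):
    """Return the name of the predefined range that contains the LCN."""
    if not 0 <= lcn <= 4095:
        raise ValueError("LCN must be in 0..4095")
    for name, (lowest, highest) in ranges.items():
        if lowest <= lcn <= highest:
            return name
    return "unassigned"
```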

X.25 Facilities (1)
Essential Facilities
Provided by all X.25 devices
Have default values
Examples
Maximum packet size (Default: 128 Bytes)
Window size
Throughput class (75, ..., 48000 bit/s)
Transit delay
The X.25 standard describes a number of so-called "facilities" that identify or
enhance an X.25 session. There are two types of facilities: essential and
optional.
X.25 supports various packet sizes up to 4 KB. The maximum data rate defined
for X.25 is 2 Mbit/s.

X.25 Facilities (2)
Optional Facilities
Don't need to be provided
Default values and negotiation possible
Examples
Packet error recovery (REJ support)
Fast Select and Fast Select Acceptance
Closed user groups
Reverse charging
Hunt groups
Call redirection
Negotiation of optional facilities can be done in advance between user and
service provider, by online registration, or during call setup.
REJ support means optional ARQ on layer 3. This service utilizes the so-called
"D-bit". Fast Select allows data to be sent immediately with the first packet
that is sent for connection establishment. This feature was invented
especially for credit-card transactions to speed up this payment method. Closed
user groups guarantee privacy so that only dedicated users can communicate,
which is very important for commercial networks. Reverse charging is one of the
unpleasant facilities. DTEs can be collected into a so-called hunt group to
improve accessibility. If an incoming call occurs, each DTE within the hunt
group is alerted, following a predefined order. Call redirection is a
comfortable feature that lets others do your job.

Fragmentation (1)
Switch may fragment packets
If one DTE requires smaller packet sizes
Using M-bit ("More")
M=0 means unfragmented packet or last fragment
M=1 means first or middle fragment
Switch may combine packets in the reverse direction

Fragmentation (2)
In case of end-to-end acks (D=1)
We want an ACK for each sequence
Not for each fragment
Two types of packets
In-sequence packets (M=1, D=0)
Single or end-sequence packets (M=0, D=1)
These two types of packets are also called category-A (in-sequence) and
category-B (single or end-sequence) packets.
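
How the M-bit and D-bit combine during fragmentation can be sketched as
follows. The function names fragment and reassemble are hypothetical and the
tuples are a simplification: a real PLP packet carries these bits inside its
header rather than as separate values.

```python
def fragment(payload, max_size):
    """Split payload into (m_bit, d_bit, chunk) tuples for D=1 operation."""
    if max_size < 1:
        raise ValueError("max_size must be positive")
    chunks = [payload[i:i + max_size]
              for i in range(0, len(payload), max_size)] or [b""]
    last = len(chunks) - 1
    # All but the last fragment are in-sequence packets (M=1, D=0).
    # The last or only fragment ends the sequence (M=0, D=1).
    return [(0 if i == last else 1, 1 if i == last else 0, chunk)
            for i, chunk in enumerate(chunks)]


def reassemble(fragments):
    """Concatenate fragments until the first packet with M=0."""
    data = bytearray()
    for m_bit, _d_bit, chunk in fragments:
        data += chunk
        if m_bit == 0:
            return bytes(data)
    raise ValueError("sequence ended without an M=0 fragment")
```

Splitting 300 bytes for a DTE that accepts at most 128 bytes per packet gives
two in-sequence packets and one end-sequence packet, so only one end-to-end
acknowledgement is requested for the whole sequence.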

LAPB
Link Access Procedures Balanced
HDLC variant (ABM)
Error recovery and flow control
Addresses are useless on point-to-point links, used to separate commands
and responses
Since LAPB utilizes the ABM mode there is no master/slave relationship. The
P/F bit is used for checkpointing purposes only, in cases when either end
becomes unsure about proper frame sequencing because of a possibly missing
acknowledgement.
Point-to-point communication does not require any addressing scheme.
However, HDLC provides addresses and so does LAPB. But LAPB utilizes
this field to separate commands and responses. For example, the address 0x01
is used for commands from DTE to DCE and responses to these commands
from DCE to DTE. The address 0x03 is used for frames containing commands
from DCE to DTE and associated responses from DTE to DCE.
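
The address rule in the notes fits in a small decision function. This is a
sketch with a hypothetical function name; the two address values and their
meaning are taken from the notes above.

```python
ADDR_DTE_COMMANDS = 0x01  # commands DTE -> DCE, responses DCE -> DTE
ADDR_DCE_COMMANDS = 0x03  # commands DCE -> DTE, responses DTE -> DCE


def classify_lapb_frame(address, sender):
    """Tell whether a frame is a command or a response.

    sender is "DTE" or "DCE", the station that transmitted the frame.
    """
    if sender not in ("DTE", "DCE"):
        raise ValueError("sender must be 'DTE' or 'DCE'")
    if address == ADDR_DTE_COMMANDS:
        return "command" if sender == "DTE" else "response"
    if address == ADDR_DCE_COMMANDS:
        return "command" if sender == "DCE" else "response"
    raise ValueError("unexpected LAPB address")
```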

Scope of Each Layer
[Figure: scope of each layer between end systems and network]
X.21
LAPB: Reliable Transmission
X.25 PLP: Addressing
Higher Layers
The picture above shows the basic idea and usage of X.25. Higher layer data is
carried in X.25 packets that identify the associated virtual calls using unique
address information upon call setup and LCNs afterwards. LAPB does not
differentiate between virtual calls and therefore handles all packets equally.
Remember that X.25 is an interface specification only (a UNI) and the internals
of the "X.25 network" are not specified.

PAD (1)
Packet Assembler/Disassembler (PAD)
Commonly found in X.25 applications
Used when DTE is a character-oriented device
Too simple for full X.25 functionality
Three functions
Buffering
Packet Assembly (chars to packets)
Packet Disassembly (strips X.25 header)
The "Packet Assembler/Disassembler" (PAD) is an optional device that is
necessary to connect a dumb asynchronous (character-oriented) device to the
X.25 network. The PAD converts the byte stream into X.25 packets.

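PAD (2)
[Figure: a dumb character terminal (DTE) talks X.28 to the PAD, whose
behaviour is defined by X.3; the PAD attaches via X.25 to the DCEs of the
network; X.29 runs between the PAD and the remote packet station]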
X.28 defines communication issues between a non-packet DTE and a PAD.
X.29 defines how a PAD and a remote packet station may exchange control
information. The remote station can be a packet DTE or also a PAD. X.29
identifies these control packets using the Q-bit in the X.25 PLP header. Note
that the X.29 protocol allows the configuration of a remote PAD.
The X.3 standard specifies the functionality of a PAD to handle different
terminal types and determines how the PAD communicates with the user DTE.
It specifies parameters such as escape from data transfer, data forwarding
signal, terminal speed, flow control, linefeed handling, echo, forward only
full packets, forward a packet upon carriage return, send service signals to
user, send interrupt packet upon receipt of a BREAK, etc.

X.75
Signalling system to connect two X.25 networks on international circuits
Layer 2: LAPB
Layer 3: X.75
X.75 is very similar to X.25 but includes a variable length field for network
utilities
Note that X.75 can also interconnect packet-switched networks other than
X.25.

Summary
CCITT and ISO standard for connection oriented packet switching UNI
LAPB for reliable link transmission
X.25 PLP for VC services
Slow - mostly used for transactions today
World-wide available
Note that ITU-T replaced the CCITT in 1993. The CCITT's origins go back to
1865.


Quiz
Who uses X.25 today?
Do shops have both ISDN and X.25 separately installed?
What is AX.25?
How can we speed up X.25?


Hints
Q1: Chancelleries (ambassador's office), bank terminals, airport terminals,
press agencies, Lotto,...
Q2: Usually they put X.25 (VISA...) over the D-channel. X.25 over B channels
is also in use.
Q3: AX.25 is used for amateur packet radio. The difference is that the header
must include the callsigns.
Q4: Reduce protocol overhead (double flow control and ARQ!) - which leads us
to FR

1
2005/03/11 {C} Herbert Haas
Frame Relay
Bigger, Longer, Uncut

What is Frame Relay?
Connection-oriented packet switching (Virtual Circuit)
WAN Technology
Specifies User to Network Interface (UNI)
Does not specify network itself (!)
Sounds like X.25 ...?
Frame Relay is a technology that appeared in the beginning of the 1990s and
was developed to replace X.25 WAN technology.
Like X.25, Frame Relay is based on the technique of Virtual Call Service. So
Frame Relay is a connection-oriented WAN technology, today mainly used as a
PVC service instead of leased line services.
Originally, Frame Relay only specified the User Network Interface (UNI), while
the switch-to-switch communication inside the provider's cloud was not
standardized. In order to support the connection of two different Frame Relay
networks, a Network to Network Interface (NNI) standard was created.

Basic Difference to X.25
Reduced overhead
No error recovery (!)
Hence much faster
Requires reliable links (!)
Outband signaling
Good for bursty and variable traffic
Quality of Service
Congestion control
The most important difference to X.25 is the lack of error recovery and flow
control. Note that X.25 performs error recovery and flow control on each link
(unlike TCP, for example). Obviously this extremely reliable service suffers
from delays. But Frame Relay is an ISDN application, and ISDN provides
reliable physical links, so why use ARQ techniques on lower layers at all?
The second important difference is that X.25 sends virtual circuit service
packets and data packets in the same virtual circuit. This is called "Inband
Signaling". Frame Relay establishes a dedicated virtual circuit for signaling
purposes only.
Thirdly, Frame Relay can deal with traffic parameters such as "Committed
Information Rate" (CIR) and "Excess Information Rate" (EIR). That is, the
Frame Relay provider guarantees the delivery of data packets below the CIR
and offers at least a best-effort service for higher data rates. We will
discuss this later in much greater detail.
And finally, although Frame Relay does not retransmit dropped frames, the
network at least responds with congestion indication messages to choke the
user's traffic.
Basically, Frame Relay can be viewed as a streamlined version of X.25,
especially tuned to achieve low delays.

History of Frame Relay
First proposals 1984 by CCITT
Original plan was to put Frame Relay on top of ISDN
Slow progress
1990: Cisco, Northern Telecom, StrataCom, and DEC founded the Gang of Four
(GoF)
Focus on Frame-Relay development
Collaborating with CCITT
ANSI specified Frame Relay for USA
GoF became Frame Relay Forum (FRF)
Joined by many switch manufacturers
In 1988 the ITU-T recommendation I.122 was released, entitled
"Framework for Providing Additional Packet Mode Bearer Services", today
known as "Frame Mode Bearer Service", or simply "Frame Relay". I.233
describes Frame Relay between two S/T reference points.
Due to the slow standardization process of the ITU-T (formerly the CCITT), a
private organization, the Gang of Four (GoF), later the Frame Relay Forum, was
founded to push the development of new Frame Relay standards.
Additionally, ANSI came up with its own Frame Relay standards for the US
market. Thus, today there are three different standardization bodies whose
Frame Relay standards differ in some parts.

Frame Relay Network
[Figure: several FR DTEs attach via the UNI to FR DCEs at the edge of the
Frame Relay network]
The network consists of four components:
1) Data terminal equipment (DTE), which is actually the user device and the
logical Frame Relay end-system
2) Data communication equipment (DCE, also called data circuit-terminating
equipment), which consists of modem and packet switch
3) Packet Switching Exchange (PSE), or simply: the packet switch itself
4) The provider cloud, which is not covered by the Frame Relay standard

Logical Channels (2)
Data Link Connection Identifier (DLCI)
Identifies connection
Only locally significant
Some implementations support so-called "Global addresses"
Actually also locally significant
Destination address = DLCI
In Frame Relay the virtual circuit identifiers are called Data Link Connection
Identifiers (DLCI). Ten bits in the Frame Relay header are reserved for the
DLCI, so up to 1024 different DLCI values are possible. Some of them are
reserved by the different standards for signaling and congestion indication.
Some implementations of Frame Relay even support so-called "global
addresses", where the DLCI might be used as a destination address.

Addressing for SVCs
(Public) FR networks using SVCs use either
X.121 addresses (X.25)
E.164 addresses (ISDN)
Advantage of X.121 addresses:
Contain DNICs (Data Network Identification Codes) which are obligatory
Although only a few service providers offer SVC Frame Relay service, it is
still possible and part of the standard. In order to establish an SVC, a DTE
must know a globally unique host address of the destination. Typically, the
X.121 or E.164 address plans are also utilized for Frame Relay. Don't confuse
X.121 and E.164 addresses with the previously mentioned global addresses.

NNI (1)
[Figure: FR Net of Provider X and FR Net of Provider Y joined by an NNI; each
network offers a UNI to its users]
NNI had been defined to connect different Frame Relay networks together
Example: Public FR Net with Private
Because the Frame Relay standards do not cover the Frame Relay cloud itself,
a Frame Relay Network to Network Interface (NNI) was standardized to allow the
connection of different Frame Relay networks using different vendor equipment.
The NNI standardizes the FR-DCE to FR-DCE communication, e.g. when connecting
a private Frame Relay network to a public Frame Relay network.

NNI (2)
[Figure: virtual circuits crossing UNI, NNI and UNI; DLCIs shown: 100, 200,
10, 20, 500, 600]
Sequence of DLCIs associated to each VC
With the Frame Relay NNI, a sequence of DLCIs is established that represents
the virtual connection. This means the connection between two FR-DTEs is
determined by a sequence of DLCIs, in our example DLCI 200, 20, 600.
The DLCI number in the Frame Relay header is changed appropriately at the
UNI and NNI interfaces when a frame travels through the network.
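
The DLCI rewriting along a virtual connection can be pictured as one lookup
table per network. The sketch below is an assumption-laden illustration: the
table layout and the names relay, NETWORK_X and NETWORK_Y are invented, and
only the DLCI sequence 200, 20, 600 comes from the example in the notes.

```python
# One swap table per network: incoming DLCI -> outgoing DLCI.
NETWORK_X = {200: 20}  # frame enters on the UNI with DLCI 200, leaves on the NNI with 20
NETWORK_Y = {20: 600}  # frame enters on the NNI with DLCI 20, leaves on the UNI with 600


def relay(dlci, hops):
    """Return the list of DLCIs a frame carries on its way through hops."""
    seen = [dlci]
    for table in hops:
        if dlci not in table:
            raise KeyError(f"no virtual circuit provisioned for DLCI {dlci}")
        dlci = table[dlci]
        seen.append(dlci)
    return seen
```

Because every table is consulted only with the locally valid DLCI, the same
number can be reused on other interfaces without conflict.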

Outband Signaling
[Figure: DTE and DCE connected by the VCs DLCI 100, DLCI 200 and DLCI 300
plus a signaling channel on DLCI 0 or 1023]
"Local Management Interface" (LMI)
Signaling through dedicated virtual circuit = "Outband Signaling"
Signaling protocol is LMI
The Local Management Interface (LMI) was developed to inform Frame Relay
users about the condition of the Frame Relay network itself.
With the LMI protocol, the provider can announce the addition, deletion and
status of DLCIs to the users.
Unfortunately, the standardization organizations implement LMI differently.
All of them use out-band signaling for LMI, but on different DLCIs and with
partly different functionality.
The ITU-T with its Q.933 Annex A standard uses DLCI 0, as does ANSI with its
T1.617 Annex D. Both standards only allow the announcement of the addition,
deletion and status (active or inactive) of a PVC.
The FRF uses DLCI 1023 for the LMI service and additionally allows the
announcement of bandwidth and flow control parameters.

ITU-T PVC Service Model
Control-Plane (PVC-LM): Q.933 Annex A over Q.922 DL-core (LAPF)
User-Plane (PVC): User specified over Q.922 DL-core (LAPF)
Physical layer (both planes): I.430, I.431
Annex A is for PVC only
Every protocol that employs outband signaling has a vertically divided layer
architecture. Here the left part (in the slide above) corresponds to the layers
used for outband signaling, while the layer stack on the right handles data
packet delivery through virtual circuits. Additionally, the outband path is
called the "Control Plane" and the data-VC path is called the "User Plane".
Take it as it is.
Most Frame Relay service providers only offer so-called "Annex-A" service; in
other words, they only support PVCs with LMI support.

ITU-T SVC Service Model
Control-Plane (SVC): Q.933 over Q.922 DL-upper (error recovery and flow
control) over Q.922 DL-core (LAPF)
User-Plane (SVC): User specified over Q.922 DL-core (LAPF)
Physical layer (both planes): I.430, I.431
But Frame Relay can also support SVC services. In this case we don't use
Annex A but rather plain "Q.933". Furthermore, SVC mode requires a reliable
Q.922 connection to the DCE, which is handled by the so-called "Q.922
DL-upper". The Frame Relay layer itself is the Q.922 DL-core layer, which must
always be present.

Layer Description
LAPF is a modified LAPD (ISDN)
Specified in Q.922
Q.922 consists of
Q.922 core (DLCIs, F/BECN, DE, CRC)
Q.922 upper (ARQ and Flow Control)
Q.933 is based on Q.931 (ISDN)
Annex A for PVC management (LMI)
The Link Access Procedure Frame-relay (LAPF) is a modified variant of the
Link Access Procedure D-channel (LAPD), which ISDN uses on the D-channel to
reliably transport Q.931 signaling messages.
The LAPF protocol is divided into two sub-variants: the Q.922 core, which is
used for PVC service with LMI status reports, and the Q.922 upper, which is
used with Frame Relay SVCs for the reliable transport of Q.933 Frame Relay
signaling messages.
Q.933 is based on the Q.931 signaling protocol and supports the connection
setup and teardown of Frame Relay SVCs with the help of E.164 or X.121
addresses. Q.933 Annex A is used in combination with PVC services only.

ANSI PVC Service Model
Control-Plane (PVC-LM): T1.617 Annex D over T1.618
User-Plane (PVC): User specified over T1.618
Physical layer (both planes): ANSI Physical Layer Standards
Annex D here (instead of Annex A)
This slide shows the corresponding standards of the ANSI committee for Frame
Relay PVC service. The ANSI standard T1.618 describes the basic Frame Relay
frame with DLCI, BECN, FECN, etc. and corresponds to the Q.922 core standard
from the ITU-T.
The T1.617 Annex D standard describes the LMI service and can be seen as
equivalent to the Q.933 Annex A standard from the ITU-T.

ANSI SVC Service Model
Control-Plane (SVC): T1.617 over T1.602
User-Plane (SVC): User specified over T1.618
Physical layer (both planes): ANSI Physical Layer Standards
The ANSI T1.602 standard is equivalent to the ITU-T Q.922 upper standard and
supports the reliable transport of signaling messages to set up Frame Relay
SVCs.
T1.617 is equivalent to the Q.933 standard and uses E.164 and X.121
addresses for the setup of SVCs.

ANSI Layer Description
T1.602 specifies LAPD
Based on Q.921
T1.618 is based on a subset of T1.602 called the "core aspects"
DLCIs, F/BECN, DE, CRC
T1.617
Signaling specification for Frame Relay Bearer Service
Annex D for PVCs (LMI)
This is a summary of the ANSI Frame Relay standards discussed so far.

Frame Relay Forum (FRF)
FRF.1.1 User to Network Interface (UNI)
FRF.2.1 Network to Network Interface (NNI)
FRF.3.1 Multiprotocol Encapsulation
FRF.4 SVC
FRF.5 FR/ATM Network Interworking
FRF.6 Customer Network Management (MIB)
FRF.7 Multicasting Service Description
FRF.8 FR/ATM Service Interworking
FRF.9 Data Compression
FRF.10 Network to Network SVC
FRF.11 Voice over Frame Relay
FRF.12 Fragmentation
FRF.13 Service Level Agreements
FRF.14 Physical Layer Interface
FRF.15 End-to-End Multilink
FRF.16 Multilink UNI/NNI
This list gives an overview of the standards published by the FRF.
The FRF.1.1 standard describes the UNI and, in combination with the FRF.4
standard, can be seen as an equivalent to the Q.922 and Q.933 standards of
the ITU-T.
The FRF.2.1 standard specifies the connection of Frame Relay DCE to DCE for
mixed vendor support.
FRF.11 describes the direct transport of voice on top of Frame Relay frames
and FRF.12 deals with fragmentation. FRF.11 and FRF.12 are needed in
combination to establish voice over Frame Relay networks.

Voice over FR
VoFR Standard FRF.11 (Annex C)
Multiple subframes in a single FR-Frame
30 Byte Voice Payload per subframe
Additional identifier CID (Channel ID) to identify separate streams
Dedicated CID for signaling (Cisco: CID 0)
Voice + Data in same PVC: Delay Problem
Solution: FRF.12 (Fragmentation)
Data packets are fragmented and interleaved with voice packets
Voice-frames should keep "inter-frame-delay" <10ms
Adjustments of fragment-size based on AR
Cisco: fr-fragment-size
The FRF.11 standard describes how multiple voice communication channels
can be transported across a Frame Relay network. The voice channels are
packed into separate subframes of up to 30 bytes in length, with an additional
FRF.11 header in front of them. The FRF.11 header carries a Channel ID (CID),
which is needed to distinguish between the different voice channels. Several
subframes can be transported by one Frame Relay frame, depending on the
maximum allowed frame size.
Here the FRF.12 standard comes into play, because the size of the Frame Relay
frames needs to be reduced to adapt to the delay and jitter requirements of
voice communication. Normally Frame Relay, depending on the standard, allows
maximum payload sizes between 1600 and 8192 bytes. In Voice over Frame Relay
systems the maximum payload size is configurable between 16 and 1600
bytes. Cisco uses a default value of 53 bytes.

Physical Interfaces
Some UNI Specifications (FRF.1)
ITU-T G.703 (2.048 Mbps)
ITU-T G.704 (E1, 2.048 Mbps)
ITU G.703 (E3, 34.368 Mbps)
ITU-T X.21
ANSI T1.403 (DS1, 1.544 Mbps)
ITU-T V.35
ANSI/EIA/TIA 613 A 1993 High Speed Serial Interface (HSSI, 53 Mbps)
ANSI T1.107a (DS3, 44.736 Mbps)
ITU V.36/V.37 congestion control
Frame Relay is a typical data-link technology that can be used on top of
many different layer 1 techniques. This slide gives a short overview of the
layer 1 techniques most commonly used in combination with Frame Relay.

Layer 2 Tasks
Q.922 Annex A (LAPF) or T1.618 specifies
Frame multiplexing according to DLCI
Frame alignment (HDLC Flag)
Bit stuffing
16-bit CRC error detection but no correction
Checks minimum size and maximum frame size
Congestion control
Q.922 Annex A and ANSI T1.618 cover the following tasks:
Multiplexing of different communication channels on one physical connection
with the help of the DLCI.
Frame alignment, which means detection of the start and end of a frame plus
synchronization with the help of the HDLC flag.
Bit stuffing to prevent the flag bit pattern from appearing inside the payload
area of the frame.
A 16-bit Cyclic Redundancy Check for error detection inside the Frame Relay
network. Frames in error are only discarded; no error recovery functions are
implemented.
Determination of maximum and minimum Frame Relay frame sizes, depending
on the configuration (e.g. voice).
Congestion control and indication with the help of the FECN and BECN bits or
the CLLM system.
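
Bit stuffing is easy to demonstrate. The sketch below works on strings of
"0" and "1" characters for readability, which is an illustrative choice and
not how an interface chip handles bits; the function names stuff and unstuff
are hypothetical. It follows the usual HDLC rule of inserting a 0 after five
consecutive 1 bits, so the flag 01111110 cannot appear inside the payload.

```python
FLAG = "01111110"  # HDLC flag pattern that delimits frames


def stuff(bits):
    """Insert a 0 after every run of five consecutive 1 bits."""
    out = []
    ones = 0
    for bit in bits:
        out.append(bit)
        ones = ones + 1 if bit == "1" else 0
        if ones == 5:
            out.append("0")  # stuffed bit
            ones = 0
    return "".join(out)


def unstuff(bits):
    """Remove the 0 that follows every run of five consecutive 1 bits."""
    out = []
    ones = 0
    skip_next = False
    for bit in bits:
        if skip_next:
            # This is the stuffed 0 inserted by the sender: drop it.
            skip_next = False
            ones = 0
            continue
        out.append(bit)
        ones = ones + 1 if bit == "1" else 0
        if ones == 5:
            skip_next = True
    return "".join(out)
```

A payload that happens to contain the flag pattern is transmitted as
011111010, and the receiver restores the original bits with unstuff().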

The Frame Relay Frame
Frame (lengths in octets): Flag (1) | Header (2) | Information | FCS (2) | Flag (1)
Header octet 1: DLCI (MSB) | C/R | EA
Header octet 2: DLCI (LSB) | FECN | BECN | DE | EA
Legend:
DLCI Data Link Connection Identifier
C/R Command/Response
EA Extended Addressing
FECN Forward Explicit Congestion Notification
BECN Backward Explicit Congestion Notification
DE Discard Eligibility
The DLCI field length is typically 10 bits. Optionally, it can be extended
using the EA bit (max. 16 bits according to FRF and GoF). The EA bits are used
such that the first and middle DLCI address octets are indicated by EA=0,
whereas the last address octet is indicated by EA=1.
Note that the second address octet always contains the FECN, BECN, and DE
bits. Currently only 10-bit DLCIs are supported, but the EA flag allows the use
of longer DLCIs in the future. Today, MPLS utilizes the Extended Address field
of the FR header.
The C/R bit is a rudimentary bit inherited from HDLC. It is not used within
Frame Relay!
According to FRF, the maximum length of the information field is 1600 bytes.
The other standards allow lengths up to 8192 bytes (theoretically), but the
CRC-16 only protects 4096 bytes. In practice, maximum frame sizes of up to
1600 bytes are used.
The usage of the FECN and BECN bits is explained in a few seconds...
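
The two-octet header can be packed and unpacked with a few shifts. This is a
sketch for the common 10-bit DLCI format only; the function names are
hypothetical, and the exact bit positions (six DLCI bits, C/R and EA in the
first octet; four DLCI bits, FECN, BECN, DE and EA in the second) follow the
field order of the slide and are an assumption as far as precise bit numbers
are concerned.

```python
def build_fr_header(dlci, cr=0, fecn=0, becn=0, de=0):
    """Pack a 10-bit DLCI and the flag bits into two header octets."""
    if not 0 <= dlci <= 1023:
        raise ValueError("a 10-bit DLCI must be in 0..1023")
    first = ((dlci >> 4) << 2) | (cr << 1)  # upper 6 DLCI bits, C/R, EA=0
    second = ((dlci & 0x0F) << 4) | (fecn << 3) | (becn << 2) | (de << 1) | 1  # EA=1
    return bytes([first, second])


def parse_fr_header(header):
    """Unpack the two header octets into a dictionary of fields."""
    if len(header) != 2:
        raise ValueError("this sketch handles the two-octet header only")
    first, second = header[0], header[1]
    if first & 0x01 or not second & 0x01:
        raise ValueError("expected EA=0 in the first and EA=1 in the last octet")
    return {
        "dlci": ((first >> 2) << 4) | (second >> 4),
        "cr": (first >> 1) & 1,
        "fecn": (second >> 3) & 1,
        "becn": (second >> 2) & 1,
        "de": (second >> 1) & 1,
    }
```

Under these assumptions DLCI 100 without any flag bits is encoded as the two
octets 0x18 0x41.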

Congestion Control (1)
FECN indicates congestion to the receiver
BECN indicates congestion to the sender
Problem: DTEs do not need to react (!)
[Figure: a congested switch sets FECN towards the receiver and BECN towards
the sender]
The Frame Relay network can indicate congestion situations to its users
with the help of the BECN and FECN bits located in the Frame Relay header.
With these two bits, not only a congestion situation but also the direction
of the congestion can be indicated. In the direction of the congestion, the
congested Frame Relay switch sets the FECN bit in the header of the passing
packets, while in the opposite direction the BECN bit is set.

Congestion Control (2)
Routers can be configured to react upon receiving a BECN
Only a few higher layer protocols react upon receiving a FECN
Only some OSI and ITU-T protocols
TCP does not
So the sender will receive packets with the BECN bit set, while the receiver
receives packets with the FECN bit set. Now it is completely up to the sender
to reduce the amount of traffic it injects (traffic shaping, configurable by
software).
Typically routers do not react to received packets with the FECN bit set.
But if there is no return traffic, routers can be configured to send dummy
Frame Relay packets back to the sender so that the BECN bit can be set.

CLLM
Consolidated Link Layer Management
ITU-T and ANSI development
Optional out-band signaling for congestion indication messages
DLCI 1023
Before congestion, DCE sends CLLM message to DTE
Associated DLCIs specified
The Consolidated Link Layer Management (CLLM) was developed by the ITU-T and
ANSI to provide a more sophisticated tool for congestion indication.
An additional out-band channel (DLCI 1023) is used to actively signal a
congestion situation to the users before the congestion actually happens.
Compared to the FECN and BECN bits, which provide reactive congestion
indication, CLLM provides a proactive congestion indication tool.

CLLM Message
CLLM message is carried inside LAPF Frame
Ctrl = 0xAF (XID)
Format ID = 10000010 (ANSI/ITU)
Group ID = 00001111
Group Value Field
Parameter-ID (1 octet)
Parameter Length (1 octet)
Parameter Value (n octets)
Frame layout: Flag | Header | Ctrl | Format ID | Group ID | Group Length |
Group Value Field | FCS | Flag
(octet counts printed in the figure: 2, 1, 1, 1, variable)
This is an example of a CLLM message carried inside a LAPF frame.
The control field is set to 0xAF, which corresponds to an Exchange
Identification (XID) message. The Format ID field indicates the
standardization organizations.
The group ID and the group value field indicate which DLCIs are congested.
Apart from congestion indication, CLLM can also inform the users about the
cause of congestion, e.g. short-term network congestion due to excessive
traffic or long-term equipment failure.

Traffic Control
Statistical multiplexing is cheaper for service providers than
deterministic-synchronous multiplexing
Users are supposed to require less than the access rate on average
Otherwise congestion will occur and frames are dropped
Which causes the end-stations to retransmit...and further overload the
network
Traffic control in Frame Relay is based on statistical TDM, where
connections are typically dimensioned for the average traffic needs of all
connected users. The service providers try to take advantage of the users'
traffic behavior, because it is very unlikely that all users use their
complete access rate towards the provider at the same time.
Nevertheless, if congestion happens, frames are dropped by the Frame Relay
switches, which causes retransmissions by the end-stations due to error
recovery functions on higher network layers, e.g. TCP. This behavior may
lead to a further overload of the network.

Time to Transmit 1 kByte
Leased Line (e.g. ISDN):
10 Mbit/s (0.8 ms) - 64 kbit/s (125 ms) - 10 Mbit/s (0.8 ms)
Frame Relay Network (AR=2 Mbit/s, CIR=64 kbit/s):
10 Mbit/s (0.8 ms) - 2 Mbit/s (4 ms) - 155 Mbit/s (0.052 ms) - 2 Mbit/s (4 ms) - 10 Mbit/s (0.8 ms)
In this example we want to show the advantages oI Frame-relay compared to
leased line services.
In the case oI a leased line connection the bandwidth and thereIore the capacity
and the delay oI a connection is Iixed.
In Frame-relay we will Iind several values which determine the properties oI a
Frame-relay connection. The Committed InIormation Rate (CIR) that is agreed
between provider and customer is based on the average usage oI the
connection. This is what the customer pays Ior. The actual physical Access
Rate supplied by the provider is typically higher than the agreed CIR.
This means in our example the customer gets the same guaranted bandwidth oI
64 Kbit/s as in the leased line example, but has a much smaller delay because
oI the 2 Mbit/s access rate towards the service provider. In times oI low
provider network utilization (maybe during the night) the customer may even
try to send more than the agreed 64 Kbit/s.
Practically Frane Relay is more cost eIIective rather than cheap.
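The per-hop figures on the slide can be reproduced with a few lines of Python. This is only a sketch: it assumes 1 kByte = 1000 bytes and sums pure serialization delays, ignoring propagation and queuing; the function and variable names are made up for illustration.

```python
# Serialization delay of a 1 kByte frame, hop by hop (link rates from the slide).
FRAME_BITS = 1000 * 8  # assumption: 1 kByte = 1000 bytes = 8000 bits


def transmit_ms(rate_bit_s: float, bits: int = FRAME_BITS) -> float:
    """Time in milliseconds needed to clock `bits` onto a link of the given rate."""
    return bits / rate_bit_s * 1000.0


def path_ms(hop_rates) -> float:
    """Sum of the serialization delays over all hops (store and forward)."""
    return sum(transmit_ms(rate) for rate in hop_rates)


leased_line = [10e6, 64e3, 10e6]             # LAN - 64 kbit/s line - LAN
frame_relay = [10e6, 2e6, 155e6, 2e6, 10e6]  # LAN - AR - backbone - AR - LAN

print(f"leased line: {path_ms(leased_line):.2f} ms")
print(f"frame relay: {path_ms(frame_relay):.2f} ms")
```

The 64 kbit/s hop dominates the leased-line path, which is why the higher access rate matters more for delay than the committed rate does.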

Bursty Traffic (1)
FR allows differentiating between Access Rate (AR) and Committed Information Rate (CIR)
CIR corresponds to average data rate
AR > CIR
Sporadic bursts can use line up to AR
Optionally limited by Excess Information Rate (EIR)
As discussed before, the main parameters that determine the transport
capacity of a Frame Relay connection are the physical AR, the CIR and the
Excess Information Rate (EIR).
Typically the capacity of the CIR is guaranteed by the service provider at any
time. In burst situations the customer may try to send more data than the CIR
allows, but the service provider gives no delivery guarantee for this
additional data.
Most service providers allow over-utilization up to the AR; others limit it
with a separate traffic parameter called the EIR.

Bursty Traffic (2)
CIR and EIR are defined via a measurement interval Tc
CIR = Bc / Tc (Bc...Committed Burst Size)
EIR = (Bc+Be) / Tc (Be...Excess Burst Size)
When traffic can be mapped on these parameters (provided by provider) then FR is ideal for bursty traffic
Example: LAN to LAN connection
Parameters (Bc, Be, Tc, AR) are defined in a traffic contract
The CIR and the EIR are defined via a measurement time interval Tc, which is
set to 1 second in most cases. The Committed Burst Size Bc defines the number
of bits per Tc with guaranteed delivery. The Excess Burst Size Be specifies
the maximum number of additional bits per Tc that may be sent without a
delivery guarantee.
All of these parameters plus the physical AR need to be negotiated with the
service provider and are written down in a traffic contract.
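The two formulas from the slide translate directly into code. The sketch below follows the slide's definition of EIR as (Bc + Be) / Tc; the Be value in the usage lines is a hypothetical number chosen only for illustration.

```python
def cir(bc_bits: float, tc_s: float) -> float:
    """Committed Information Rate in bit/s: CIR = Bc / Tc."""
    return bc_bits / tc_s


def eir(bc_bits: float, be_bits: float, tc_s: float) -> float:
    """Excess Information Rate in bit/s, as defined on the slide: (Bc + Be) / Tc."""
    return (bc_bits + be_bits) / tc_s


print(cir(64000, 1.0))         # Bc = 64000 bits, Tc = 1 s
print(cir(64000, 2.0))         # same Bc, Tc = 2 s
print(eir(64000, 64000, 1.0))  # hypothetical Be = 64000 bits
```

Doubling Tc while keeping Bc fixed halves the CIR, which is exactly the difference between the first two parameter examples.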

Parameter Example (1)
[Figure: transmitted bits over time; Tc = 1 s, Bc = 64000 bits, AR = 128,000 bit/s, CIR = 64,000 bit/s]
In this example the measurement time interval Tc is set to 1 second, Bc to
64000 bits and the physical access rate is 128 kbit/s. The red line indicates
the actual traffic pattern on this connection. In this scenario the traffic
remains within the CIR of 64 kbit/s.

Parameter Example (2)
[Figure: transmitted bits over time; Tc = 2 s, Bc = 64000 bits, AR = 128,000 bit/s, CIR = 32,000 bit/s]
In this scenario a measurement interval Tc of 2 seconds is chosen. The
committed burst size Bc is 64000 bits, so the CIR according to the formula
CIR = Bc / Tc is 32 kbit/s. The actual traffic pattern indicated by the red
line again remains within the limits of the CIR.

Parameter Example (3)
[Figure: many small bursts over time; Tc = 2 s, Bc = 64000 bits, AR = 128,000 bit/s, CIR = 32,000 bit/s]
This example shows a more realistic scenario with many small data bursts
which in sum do not exceed the CIR. Router manufacturers actually use a burst
interval much smaller than the Tc. For example, a Cisco router by default
sends out a small data burst every 125 milliseconds on a Frame Relay
connection. The maximum size of these bursts is calculated from the
parameters Tc, Bc, Be and AR, which are defined in the traffic contract.

Traffic Management
Traffic Shaping
User's task
Goal: smooth traffic profile, mitigate bursts
Token bucket methods
Traffic Policing
Provider's task
Goal: Drop (excess) frames violating the traffic contract
Traffic shaping according to the negotiated parameters is the responsibility
of the end users. End-user traffic that is outside the traffic contract will
be discarded by the first Frame Relay switch in the provider's network.
So it is in the user's own interest to smooth and shape the traffic according
to the parameters. Traffic shaping with the token bucket method can be used
to achieve this goal.

Token Bucket
[Figure: a token generator fills the token bucket; each token opens the valve that releases data onto the wire]
The token bucket method consists of a token bucket and a data bucket. The
valve of the data bucket, which controls the amount of data that can be sent
out, can only be opened by inserting a token. This means data can only be
sent if there are tokens available in the token bucket.
Tokens are generated according to the Frame Relay traffic parameters, so they
guarantee that the user does not violate the negotiated traffic parameters.

Traffic Shaping
TB = Token Bucket (=Bc+Be)
Maximal speed = TB/Tc
Typically, traffic above maximal speed is buffered in a traffic shaping queue
So the size of the token bucket itself corresponds to the value of Bc + Be and
the rate of token generation corresponds to the term (Bc + Be) / Tc.
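A token bucket shaper along these lines can be sketched as follows. It is a simplified model, not a vendor implementation: one token stands for one bit, the bucket holds Bc + Be tokens and refills at (Bc + Be) / Tc, and frames that find too few tokens wait in a shaping queue. The class and method names are invented for this example.

```python
from collections import deque


class TokenBucketShaper:
    """Toy token bucket shaper; one token permits one bit to be sent."""

    def __init__(self, bc_bits: int, be_bits: int, tc_s: float):
        self.capacity = bc_bits + be_bits   # TB = Bc + Be
        self.rate = self.capacity / tc_s    # token generation rate = TB / Tc
        self.tokens = self.capacity         # start with a full bucket
        self.queue = deque()                # traffic shaping queue (frame sizes in bits)

    def enqueue(self, frame_bits: int) -> None:
        """Hand a frame to the shaper; it is sent on a later tick."""
        self.queue.append(frame_bits)

    def tick(self, dt_s: float) -> list:
        """Advance time by dt_s seconds and return the frames released to the wire."""
        self.tokens = min(self.capacity, self.tokens + self.rate * dt_s)
        sent = []
        while self.queue and self.queue[0] <= self.tokens:
            frame = self.queue.popleft()
            self.tokens -= frame
            sent.append(frame)
        return sent


shaper = TokenBucketShaper(bc_bits=64000, be_bits=0, tc_s=1.0)
for _ in range(3):
    shaper.enqueue(32000)
print(shaper.tick(0.0))   # the full bucket covers the first two frames
print(shaper.tick(0.25))  # not enough tokens yet
print(shaper.tick(0.25))  # third frame leaves after half a Tc
```

Frames larger than the bucket would never be sent in this model; a real shaper would have to reject or fragment them.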

Traffic Shaping for Voice
Tc <= 10 ms
Provides continuous traffic flow
Additionally BECN can be used to decrease CIR
Cisco: MinCIR - traffic shaping is not calculated using the provider CIR but with higher values
On receiving a BECN the traffic rate is reduced to MinCIR (= provider CIR)
Cisco proactive traffic shaping: "Foresight"
Throttles traffic before congestion occurs
Only supported on Cisco FR switches
Voice traffic is much more delay- and jitter-sensitive than data traffic. To
accommodate the needs of voice, Tc is reduced from the 125 milliseconds used
for data to a value below 10 milliseconds, which generates a continuous
traffic flow with minimum jitter. This obviously needs to be done in
combination with the configuration of smaller datagram sizes of 50 to 100 bytes.
In the case of congestion, indicated by the BECN bits in the header, the
traffic rate is reduced if traffic shaping is switched on. The way the router
shapes can be adjusted on Cisco devices with the mincir and the cir
parameters. Typically the cir parameter is set to the EIR or AR and the
mincir parameter is set to the CIR.
Under normal conditions the router sends data at the rate of the cir
parameter (EIR or AR); when BECN bits arrive, the router gradually reduces
the speed until the mincir parameter (CIR) is reached.
Cisco has also developed a proprietary method for Frame Relay traffic shaping
called Foresight, which allows proactive traffic shaping even before the
actual congestion occurs. With Foresight the Frame Relay switch is able to
determine the maximum data rate that might be used by the Frame Relay DTE.
This technology can only be used between Cisco routers and Cisco (formerly
StrataCom) Frame Relay switches.

Traffic Management
[Figure: bits over Time/Tc (ticks at 0.5 and 1) with the levels Bc and Be+Bc and the slopes AR, EIR and CIR; above Bc: mark frames with DE bits; above Be+Bc: discard frames. (Example only)]
This example shows what might happen if more traffic is injected than the
Frame Relay traffic parameters allow. The behavior in real life depends
entirely on the negotiated traffic contract and might differ from our
scenario.
As long as the traffic remains within the limits of the CIR, all frames are
accepted by the Frame Relay switch and will be delivered to their destination.
Data frames above the CIR but below the EIR are marked with the Discard
Eligibility (DE) bit. This bit is located in the Frame Relay header and can be
set either by the end user or by the first Frame Relay switch in the provider
cloud. Frames marked with the DE bit are discarded first in the case of
congestion inside the provider cloud.
So it might be better for the end user to set the DE bit himself, simply to
control which type of traffic should definitely arrive and which might get
lost.
In our scenario, all traffic above the EIR is discarded by the provider.
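The policing decision in this scenario can be sketched as a small classifier. It is a simplification under stated assumptions: the counter covers exactly one interval Tc, discarded frames still count against the interval, and the function name and verdict strings are invented for illustration.

```python
def police(frame_sizes_bits, bc_bits: int, be_bits: int) -> list:
    """Classify the frames that arrive within one measurement interval Tc.

    Up to Bc bits: forward unchanged. Between Bc and Bc + Be: mark with the
    DE bit. Beyond Bc + Be: discard.
    """
    total = 0
    verdicts = []
    for size in frame_sizes_bits:
        total += size  # simplification: discarded frames are counted too
        if total <= bc_bits:
            verdicts.append("forward")
        elif total <= bc_bits + be_bits:
            verdicts.append("mark DE")
        else:
            verdicts.append("discard")
    return verdicts


# Five 30000-bit frames in one Tc, Bc = 64000 bits, hypothetical Be = 32000 bits
print(police([30000] * 5, bc_bits=64000, be_bits=32000))
```

A provider that follows the more typical behavior shown on the next slide would return "mark DE" instead of "discard" for everything above Bc, up to the access rate.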

Traffic Management (4)
[Figure: bits over Time/Tc with the levels Bc and Be+Bc and the slopes AR, EIR and CIR; up to Bc frames carry DE=0, above Bc DE=1: mark frames with DE bits but try to deliver with best effort. (Example only)]
This is the typical service provider behavior. Typically, a customer just pays
for the CIR and the rest of the bandwidth up to the access rate is free.
However, there is no guarantee that every excess packet is delivered to the
receiver.

Typical Provider Offering
[Figure: data rate over time; traffic up to CIR: what you pay for; traffic between CIR and AR: free but no guarantees]
This graph shows the benefits of a Frame Relay connection. The CIR is what
you pay for, but very often it is possible to use provider capacity above the
CIR for free.

Local Management Interface
LMI extends Frame Relay
Global Addressing
Status messages
Multicasting
LMI is more of a protocol than an interface (!)
The Local Management Interface (LMI) is a protocol that runs on a reserved
DLCI to supply you with information about the condition of your PVCs.
It also supports global addressing and the use of multicast PVCs.

LMI Details
Three LMI Types
ANSI T1.617 (Annex D)
ITU-T Q.933 (Annex A)
LMI (Original, FRF)
No fragmentation of LMI messages (!)
MTU determines maximal PVC number
E.g. MTU 1500 allows 296 DLCIs
Each standardization organization developed its own LMI. In fact, only the
FRF LMI is named LMI, but don't be too picky about it. Unfortunately (as you
might expect) these signaling standards are not compatible. In practice, the
service provider must tell us which standard is supported by its DCEs; some
modern routers perform auto-sensing and determine the switch type
automatically.
Full Status Messages contain all currently used DLCIs within a single frame.
Because of this, the maximum number of PVCs is limited by the MTU of the
link. LMI messages must not be fragmented.
Note: When the frame MTU size is too small, not all PVC status messages can
be communicated. One symptom of this mistake is bouncing PVCs (repeated
up/down indications).
You can easily calculate the maximum number of DLCIs per interface yourself.
The equation is MaxDLCIs = (MTUbytes - 20) / 5, because each entry has
5 bytes. For example, an MTU of 4000 bytes supports 796 DLCIs.
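The rule of thumb from the notes, written as a one-line helper. The constants 20 (fixed part of the message) and 5 (bytes per PVC entry) come from the equation above; the function name is made up.

```python
def max_dlcis(mtu_bytes: int) -> int:
    """Maximum number of DLCIs that fit in one unfragmented full status message."""
    return (mtu_bytes - 20) // 5


print(max_dlcis(1500))
print(max_dlcis(4000))
```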

LMI Message Format
LMI message is carried inside LAPF Frame
Ctrl = 0x03 (UI)
Protocol Discriminator
00001000 (ANSI/ITU)
00001001 (GOF)
Call Reference
00000000 (only used for SVC)
Message Type
0111 1101 (Status)
0111 0101 (Status Enquiry)
0111 1011 (Status Update, GOF only)
Frame layout: Flag | Header | Ctrl (1) | Prot. Dis. (1) | Call Ref. (1) | Msg. Type (1) | Information Elements (IE, variable) | FCS | Flag
The Information Elements contain the PVC status information
The LMI messages are packed in standard Frame Relay frames and are
transported on DLCI 0 according to the ITU-T and ANSI standards or on DLCI
1023 according to the FRF standard.
The LMI messages are sent in connectionless mode, indicated by the value of
the control field (0x03).
The Protocol Discriminator indicates whether the FRF, ANSI or ITU-T standard
is used.
The Call Reference is only used in combination with SVC service, where it is
needed to distinguish between the different connection setup procedures.
The Message Type specifies whether the LMI message is a status enquiry, a
status report or a full status update including bandwidth and congestion
information. The full status update is only supported by the FRF standard.
Finally, the information field itself holds the complete status information
of all PVCs in use.

LMI Operation
Every 10 seconds the DTE polls the DCE with a Status Enquiry message
Either for a dumb response ("Yes I'm here")
Or for channel status information
(Full) Status Response
Contains information about VCs
Every 5-30 seconds (typically 10 seconds) the DTE polls the DCE to receive
status information.
The response from the DCE is either a small Hello message or, every 60
seconds, a full status report about the PVCs in use.

Inverse ARP
Automatic remote-node-address to local-DLCI mapping
Supports IP, IPX, XNS, DECnet, Banyan VINES, AppleTalk
Extension of existing ARP
Not only for Frame Relay
RFC 1293
If a layer 3 protocol like IP, IPX, XNS, etc. is transported via a Frame Relay
connection, layer 2 to layer 3 address mapping needs to be done. The layer 2
address is a DLCI number in case of PVC service or an E.164/X.121 address in
case of SVC service.
For PVC service the Inverse ARP protocol was developed to allow the automatic
mapping between a DLCI number and the corresponding layer 3 address. In X.25,
the predecessor of Frame Relay, this had to be configured manually.
In Frame Relay SVC service the mapping between the E.164/X.121 address and
the corresponding layer 3 address must be configured manually, because the
E.164 address is needed to start the connection setup procedure before the
actual connection is up.

Inverse ARP and LMI Operation
[Figure: router 10.0.0.1 (local DLCI 100) and router 20.0.0.1 (local DLCI 300) connected through a Frame Relay network]
Each router sends a Status Inquiry and learns "Local DLCI 100 Active" or "Local DLCI 300 Active"
"Hello, I am 10.0.0.1" creates the FR-Map entry 10.0.0.1 - 300 on the remote router
"Hello, I am 20.0.0.1" creates the FR-Map entry 20.0.0.1 - 100 on the remote router
Inverse ARP messages are repeated every 60 seconds!
This scenario shows the function and the interaction of the LMI and the
Inverse ARP protocol.
With the help of the status enquiry and status report messages of the LMI
protocol, the nodes at both ends are informed about their DLCI number and the
condition of the DLCI.
Both nodes then send small Hello messages with their own layer 3 address into
their active DLCI. This Hello procedure is repeated every 60 seconds.
Now both nodes can build a Frame Relay mapping table which contains their
own DLCI number and the layer 3 address of the opposite side. So they know
who is on the other side.

DLCI Plan
0          LMI (ANSI, ITU-T) or FRF in-channel signaling
1023       LMI (FRF) or ITU-T/ANSI in-channel signaling
1-15       reserved
993-1007   Frame Relay bearer service, Layer 2 management (ANSI/ITU-T)
1008-1018  reserved
1019-1022  multicast connections
FRF: Usable DLCIs from 16 to 1007
ANSI/ITU-T: Usable DLCIs from 16 to 992
This slide gives an overview of the DLCIs reserved for signaling and the
DLCIs that may be used for user traffic.
According to the FRF the DLCIs in the range of 16 to 1007, and according to
the ANSI/ITU-T specifications the DLCIs 16 to 992, can be used to transport
user traffic.
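The usable ranges can be captured in a small lookup, for instance to validate a configuration before it is applied. The dictionary keys and the function name are assumptions made for this sketch; only the numeric ranges come from the slide.

```python
# Usable DLCI ranges for user traffic, as listed on the slide (inclusive bounds).
USABLE_DLCIS = {
    "FRF": (16, 1007),
    "ANSI": (16, 992),
    "ITU-T": (16, 992),
}


def dlci_usable(dlci: int, standard: str) -> bool:
    """Return True if the DLCI may carry user traffic under the given standard."""
    low, high = USABLE_DLCIS[standard]
    return low <= dlci <= high


print(dlci_usable(100, "FRF"))
print(dlci_usable(1000, "ANSI"))
```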

Bi-directional LMI (1)
Standard LMI is unidirectional
Sufficient for UNI signaling
NNI signaling requires a bi-directional LMI variant
PVC status must be reported in both directions
Symmetrical approach necessary
[Figure: FR net of provider X and FR net of provider Y joined by an NNI running bidirectional LMI; customers attach via UNI running LMI]
Common LMI is unidirectional and can only be used for UNI interfaces.
For an NNI connection between two different Frame Relay clouds, a
bidirectional LMI protocol needs to be supported to report the PVC status to
both ends.

Bi-directional LMI (2)
Using Bi-LMI each network is notified about PVC status in the other network
Only supported by ITU-T and ANSI
DLCI 0
Not defined by GOF
Additional fields
Inactivity reason, country code, national network identifier
When bidirectional LMI is used, every network gets the PVC status information
of the opposite side. Bidirectional LMI is only supported by the ANSI and
ITU-T standards and uses the same DLCI number that is used for unidirectional
LMI.
Bidirectional LMI transports some additional information, such as country
codes and network identifiers.

Summary
Frame Relay has reduced overhead compared to X.25
Outband signaling (LMI)
Efficient for bursty traffic
Parameters (Bc, Be, Tc or CIR, EIR)
Congestion Notification
FECN, BECN
Frame Relay Forum, ITU-T, and ANSI

Quiz
What's the Tc when using Voice over Frame Relay?
What's the main difference between FR and Ethernet, when putting IP upon them?
What's the typical practical usage of BECN?

Hints
Q1: Milliseconds (max 10 ms)
Q2: Broadcast medium. Main problem with routing protocols
Q3: BECN is used by the provider to throttle the customer if he violates the traffic contract

2005/03/11 {C} Herbert Haas
ATM Introduction
The Grand Unification

Agenda
What is it? Who wants it? Who did it?
Header and Switching
ATM Layer Hypercube
Adaptation Layers
Signaling
Addresses

What is ATM?
High-Speed Virtual Circuits
PVC and SVC
No error recovery
UNI and NNI defined
Constant frame sizes: Cells
Based on B-ISDN specifications
Voice, Video, Data




Design Ideas
[Figure: ATM copies the strengths of asynchronous and synchronous TDM and fakes the rest]
Asynchronous TDM: best trunk utilization; flexible channel assignment through addresses
Synchronous TDM: fast switching and short delays through constant timeslots; protocol transparent
ATM: copies asynchronous TDM; fakes synchronous TDM
Solved through constant frame sizes
Solved through adaptation layers


Cell Switching and Jitter
Voice and FTP over Frame Relay: delay variations (!)
Constant delays possible with ATM

Cell Switching
Forwarding of cells implemented in HW
Very fast
But still packet switching
Store and forward
Asynchronous multiplexing
Because of constant cell size the queuing algorithms can guarantee
Bounded delay
Maximum delay variations
For telephony a constant delay is strictly necessary because otherwise echo
cancelers would not work.


ATM Usage
Public and private networks
LAN, MAN, WAN
Backbone high-speed networks
Public (Telcos) or private
Original goal: World-wide ATM network
But Internet technology and state-of-the-art Ethernet are more attractive today
New importance as backbone technology for mobile applications
Cellular networks for GSM, GPRS, UMTS, ...




ATM Network
[Figure: four ATM DTEs attached via UNI to a mesh of four ATM DCEs, which are interconnected via NNI]
UNI + NNI defined

Who Did It?
CCITT (now ITU-T) issued first recommendations for B-ISDN in 1988
Recommendation I.121
Aspects and Terms only
Switch vendors founded ATM-Forum
To accelerate development
Majority rule instead of consensus
Also pushed ITU-T standardization
The CCITT (ITU-T) standardization process is very time-consuming because the
final result should meet all demands of all participants, such as governments,
vendors, users, and other industry representatives. Because of this, the ATM
Forum was founded to accelerate the development. Although these efforts also
helped the ITU-T standardization, there are important differences between
both standards.
The ATM Forum was founded in 1991.

Public and Private Networks
ITU-T: Public ATM Networks
Public UNI: E.164 addressing
Public NNI: Static routing
ATM-Forum: Private ATM Networks
Private UNI: OSI NSAP like addressing
Private NNI: Dynamic routing (PNNI)
Note that both public and private networks cover SVCs, but in order to
establish SVCs we need routing tables. The routing tables can be created
automatically in private ATM networks using the ATM routing protocol PNNI
(Private-NNI). In public networks the routing tables are managed manually.
PNNI is a link-state routing protocol that enables quality of service routing.

NNI Types
[Figure: two public ATM networks and one private ATM network; Public NNI inside a public network, B-ICI (NNI-ICI) between the public networks; ICI...Inter Carrier Interface]
NNI-ISSI (Public NNI)
ISSI: Inter Switch System Interface
Used to connect two switches of one public service provider
NNI-ICI (B-ICI)
ICI: Inter Carrier Interface
Used to connect two ATM networks of two different service providers
Private NNI
Used to connect two switches of different vendors in private ATM networks

What is B-ISDN?
ITU-T identified several demands
Emerging need for broadband services
High speed switching
Improved data and image processing capabilities available to the user
Support for real-time services
Support for interactive services
Support for distribution services
Circuit and packet mode
Interactive services require a two-way exchange of information.
Distribution services are one-to-many and are also called multicast.

ATM and B-ISDN
B-ISDN are broadband (=highspeed) services for the user
ATM to transport B-ISDN
Alternatives to B-ISDN
IEEE 802.6 (DQDB) pushed by data communication industry (dying out)
Gigabit Ethernet (new)
IEEE 802.6 is a MAN standard, also known as "Distributed Queue Dual Bus".
Interestingly, DQDB is very similar to ATM in many aspects (same frame sizes,
same idea of adaptation layers, etc.). However, this alternative, which was
pushed by many proponents of the data communication industry, vanished from
the market.
Currently (Gigabit) Ethernet seems to replace ATM in many areas since it is
easier to deploy, easier to manage and less expensive. However, many customers
doubt its reliability and quality of service.

ATM Header
[Figure: UNI header (GFC, VPI, VCI, PT, CLP, HEC) next to NNI header (VPI, VCI, PT, CLP, HEC)]
8 bit VPI for users, 12 bit VPI inside the network
The Generic Flow Control (GFC) field is only used on the UNI and is not
transported into the network. The GFC is not used today as there are better
methods available (special flow-control cells).
The Virtual Path Identifier (VPI) is four bits longer inside the network (on
NNIs) in order to support better traffic aggregation (Virtual Path Switching).
The Payload Type (PT) is used to identify the cell payload (OAM, Resource
Management, ...).
The Cell Loss Priority (CLP) has the same meaning as the DE bit in Frame Relay.
Using the CLP we can distinguish between important and not-so-important cells
(CLP=1). Of course we hope that the network will be so kind as to drop CLP=1
cells first in case of congestion.
The Header Error Check (HEC) is a CRC-8 that protects the header only, not the
payload! You may ask how framing is accomplished. For this purpose a receiver
has to compute the CRC-8 over every 4 bytes and look for a match with the
following byte. After 6 successive hits the ATM layers are synchronized.
Note: Although ATM is an asynchronous TDM technology it is actually
implemented synchronously. There are no gaps between cells but idle cells
(VPI/VCI set to zero and payload is 01010101010101010101...).
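The hunt for cell boundaries described above can be sketched in Python. Two details are not stated in these notes and are taken as assumptions from ITU-T I.432: the generator polynomial x^8 + x^2 + x + 1 and the pattern 01010101 that is added to the CRC remainder. The function names are invented for this example, and a real receiver works on a bit stream rather than on aligned bytes.

```python
CELL_SIZE = 53     # 5 header bytes + 48 payload bytes
HEC_POLY = 0x07    # x^8 + x^2 + x + 1 (assumption, see lead-in)
HEC_COSET = 0x55   # pattern XORed onto the remainder (assumption, see lead-in)


def hec(header4: bytes) -> int:
    """CRC-8 over the first four header bytes, then XOR with the coset pattern."""
    crc = 0
    for byte in header4:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ HEC_POLY) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc ^ HEC_COSET


def find_cell_boundary(stream: bytes, hits: int = 6):
    """Return the first byte offset at which `hits` successive cells pass the HEC check."""
    for offset in range(CELL_SIZE):
        for k in range(hits):
            start = offset + k * CELL_SIZE
            if start + 5 > len(stream):
                break  # not enough data left to confirm this offset
            if hec(stream[start:start + 4]) != stream[start + 4]:
                break  # HEC mismatch, try the next offset
        else:
            return offset
    return None


header = bytes([0x00, 0x00, 0x00, 0x01])
cell = header + bytes([hec(header)]) + bytes(48)
stream = bytes(3) + cell * 6  # three stray bytes before the first cell
print(find_cell_boundary(stream))
```

Requiring several successive hits keeps the receiver from locking onto payload bytes that happen to look like a valid header.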

Payload Type
100 OAM F5 segment
101 OAM F5 end-to-end
110 Resource Management (RM)
Other combinations: user data
First bit: user data (0) or OAM (1)
Second bit: set to (1) if congested
Third bit: user signaling bit; also used by AAL5 to indicate end of block (EOB)
"Flow 5" (F5) is identical to a VC, F4 is the VP flow, F3 is a SONET/SDH Path,
F2 is a SONET Line (SDH Mux-section), F1 is a SONET section (SDH
regenerator section).
So an OAM F5 segment cell (PT=100) is processed by the next segment, while an
OAM F5 end-to-end cell (PT=101) is only processed by an ATM end station
(terminating an ATM link). Operation, Administration, and Maintenance (OAM)
is discussed in another module (ATM QoS).


Header Fields
Cell Loss Priority (CLP)
Similar to DE bit in Frame Relay
Identifies less important cells
Header Error Check
CRC-8 to protect the header only
I.432.1: Used for cell delineation (6 successive hits necessary)


VC Switching
VC switching distinguishes each virtual circuit according to its VPI/VCI
Many table entries necessary
[Figure: switch whose ports carry VPI/VCI label pairs such as 10/12, 20/44, 73/10, 27/99 and 19/19, each swapped individually]


Connection Types
Point-to-point: unidirectional or bidirectional
Point-to-multipoint: unidirectional only


ATM Protocol Architecture
[Figure: layers Higher Layer, ATM Adaptation Layer (AAL), ATM Layer, Physical Layer; planes Management Plane, Control Plane, User Plane]
AAL: additional headers and fragmentation according to service
ATM Layer: creates ATM cells and headers


...And In Detail
[Figure: ATM protocol reference model in detail]
Management Plane: plane and layer management (resources, parameters, OAM flow, meta-signaling)
Control Plane: signaling and control; outband signaling in designated VCs (I-LMI)
User Plane service classes: Class A (CBR for circuit emulation), Class B (VBR for audio and video), Class C (connection-oriented data), Class D (connectionless data)
AAL: service-dependent Convergence Sublayer (CS) and Segmentation and Reassembly (SAR); AAL1, AAL2, AAL3/4 or 5
ATM Layer
Physical Layer: Transmission Convergence (TC) and Physical Medium Dependent (PMD); PDH and SONET/SDH


Control Plane
Signaling through dedicated virtual circuit = "Outband Signaling"
[Figure: DTE to DCE on VPI/VCI 0/5 (Q.2931); DCE to DCE on VPI/VCI 0/18 (PNNI)]


Reserved Labels
VPI  VCI    Function
0    0-15   ITU-T
0    16-31  ATM Forum
0    0      Idle Cell
0    3      Segment OAM Cell (F4)
0    4      End-to-End OAM Cell (F4)
0    5      Signaling
0    16     ILMI
0    17     LANE
0    18     PNNI


Physical Layer
Transmission Convergence (TC) allows simple change of physical media
PDH, SDH, SONET
HEC and cell delineation
Physical Medium Dependent (PMD) cares for (e.g.)
Line coding
Signal conversions


Interface Examples
Standard       Speed (Mbit/s)  Medium       Comments    Encoding   Connector   Usage
SDH STM-1      155.52          Coax         75 Ohm      CMI        BNC         WAN
PDH E4         139.264         Coax         75 Ohm      CMI        BNC         WAN
PDH DS3        44.736          Coax         75 Ohm      B3ZS       BNC         WAN
PDH E3         34.368          Coax         75 Ohm      HDB3       BNC         WAN
PDH E2         8.448           Coax         75 Ohm      HDB3       BNC         WAN
PDH J2         6.312           TP/Coax      110/75 Ohm  B6ZS/B8ZS  RJ45/BNC    WAN
PDH E1         2.048           TP/Coax      120/75 Ohm  HDB3       9pinD/BNC   WAN
PDH DS1        1.544           TP           100 Ohm     AMI/B8ZS   RJ45/RJ48   WAN
SDH STM-4      622.08          SM fiber                 SDH        SC          LAN/WAN
SDH STM-1      155.52          SM fiber                 SDH        ST          LAN/WAN
SDH STM-1      155.52          MM fiber     62.5 um     SDH        SC          LAN/WAN
SDH STM-4      622.08          SM fiber                 NRZ        SC (ST)     LAN
SDH STM-4      622.08          MM (LED)                 NRZ        SC (ST)     LAN
SDH STM-4      622.08          MM (Laser)               NRZ        SC (ST)     LAN
SDH STM-1      155.52          UTP5         100 Ohm     NRZ        RJ45        LAN
SDH STM-1      155.52          STP (Type1)  150 Ohm     NRZ        9pinD       LAN
Fiber Channel  155.52          MM fiber     62.5 um     8B/10B                 LAN
TAXI           100             MM fiber     62.5 um     4B/5B      MIC         LAN
SONET STS-1    51.84           UTP3                     NRZ        RJ45        LAN
ATM 25         25.6            UTP3                     NRZ        RJ45        LAN


ATM Layer
Multiplexing and demultiplexing of cells according to VPI/VCI
Switching of cells
"Label swapping"
Note: origin of MPLS
Error management: OAM cells
Flow Control
QoS negotiation and traffic shaping

Adaptation Layers
ATM only provides bearer service
ATM cannot be used directly
Applications must use adaptation layers to access the ATM layer
Consist of SAR and CS
Part of DTEs only
Transparent for switches (DCEs)
The ATM adaptation layers translate between the specific worlds of higher
layer protocols such as IP, X.25, or PCM voice and the cell nature of the ATM
layer itself. Using specific adaptation layers, nearly every application can
be transported over ATM. This capability again emphasizes the B-ISDN idea
that has been realized.
Note: "Application" means simply "any higher layer communication protocol".
Just consider ATM as a "Transport Layer" (not to be confused with OSI layer 4!)
that provides "Bearer Services".

Adaptation Sub-Layers
Convergence Sublayer (CS)
Service dependent functions (clock recovery, message identification)
Adds special information (e.g. Frame Relay header)
Segmentation and Reassembly (SAR)
You name it...
[Figure: Application 1 and Application 2 each use their own SSCS (Service Specific CS) on top of one shared CPCS (Common Part Convergence Sublayer); together they form the Convergence Sublayer (CS)]
The Convergence Sublayer (CS) is divided into two further sublayers, the
Common Part Convergence Sublayer (CPCS) and the Service Specific Convergence
Sublayer (SSCS). The CPCS is common to all instances of a specific AAL.
Therefore only one CPCS has been defined per AAL, while many SSCSs can be
defined for the same AAL.


AAL1
Constant Bit Rate (CBR)
Circuit Emulation
Expensive
Overprovisioning like leased line necessary
Queuing prefers AAL1 cells over all other traffic (in case of congestion)


AAL2
Analog applications that require timing information but not CBR
Variable Bit Rate (VBR)
Compressed audio and video
Relatively new (1997/98)
Original standard withdrawn and later reinvented for mobile systems

AAL2 for Mobile Systems
Cellular communication issues
Packetization delay (QoS)
Bandwidth efficiency (Money)
Before AAL2, low-bit-rate real-time applications were carried by "partial filling" of ATM cells
Using "AAL0" or AAL1
Very inefficient (few bytes per cell only)
AAL2 is designed to be fast and efficient
AAL0 simply means using ATM without any adaptation layer. The AAL2 CPCS
allows a variable packet length between 1 byte and 45 bytes, so the
packetization delay can be kept very small, if needed. On the other hand,
bandwidth efficiency is achieved by multiplexing several AAL2 connections
within one ATM (VPI/VCI) pipe.
Note: Instead of having partially filled ATM cells, the CPCS will fill each
cell with AAL2 packets from multiple sessions!

AAL3 + AAL4
AAL3 designed to carry connection-oriented packets
Such as X.25 or Frame Relay
AAL4 designed to carry connectionless datagrams
Such as IP or IPX
Because of similarity both adaptation layers were combined to AAL3/4

AAL3/4
Can multiplex different streams of data on the same ATM connection
Up to 2^10 streams using the same VPI/VCI
But too much overhead
Sequence numbers unnecessary when not interleaving
One CRC for whole packet would be sufficient
Length unnecessary
Nearly totally replaced by AAL5



AAL5
Favorite for data communication
AAL5 simulates connectionless data interface
Allows simple migration to ATM
Smallest overhead
Convergence Layer: 8 byte trailer in last cell
SAR Layer: just marks EOM in ATM header (PT)
AAL5 is the most widely used AAL today. UNI signaling, ILMI and PNNI
signaling are also carried over AAL5.
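How many cells a packet needs under AAL5 follows from the numbers given in this module: an 8-byte trailer in the last cell and 48 payload bytes per 53-byte cell. The sketch below assumes the packet is padded so that packet, padding and trailer exactly fill a whole number of cells; the function names are made up.

```python
CELL_BYTES = 53      # ATM cell size
HEADER_BYTES = 5     # ATM cell header
PAYLOAD_BYTES = CELL_BYTES - HEADER_BYTES  # 48 bytes of payload per cell
AAL5_TRAILER = 8     # convergence sublayer trailer carried in the last cell


def aal5_cells(packet_bytes: int) -> int:
    """Number of cells needed for one packet plus the AAL5 trailer."""
    return -(-(packet_bytes + AAL5_TRAILER) // PAYLOAD_BYTES)  # ceiling division


def aal5_wire_bytes(packet_bytes: int) -> int:
    """Total bytes on the wire, including cell headers and padding."""
    return aal5_cells(packet_bytes) * CELL_BYTES


for size in (40, 41, 1500):
    print(size, aal5_cells(size), aal5_wire_bytes(size))
```

A 40-byte packet just fits into one cell, while a single extra byte forces a second, almost empty cell.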


Packets and Cell Loss (2)
Cells of damaged packets are still forwarded by ATM switches
Solution: Intelligent Tail Packet Discard or Early Packet Discard
IP routers can immediately drop the whole packet
And recover queuing resources
So BER can be much higher (!)




Signaling
ATM Forum UNI signaling specification
UNI 3.0, 3.1 and 4.0 standardized
UNI 2.0: PVC
UNI 3.0: PVC+SVC, CBR+VBR+UBR
UNI 4.0: +ABR, QoS Negotiation
Based on ITU-T Q.2931 (B-ISDN)

Signaling Layers
[Figure: signaling protocol stack, top to bottom]
Q.2931
SAAL (Signaling AAL), consisting of:
SSCS (Service Specific Convergence Sublayer): SSCF (Q.2130, Service Specific Coordination Function) over SSCOP (Q.2110, Service Specific Connection-oriented Protocol)
CPCS (Common Part Convergence Sublayer; AAL 3/4, I.363 or AAL 5)
SAR
ATM Layer
SSCOP is very similar to X.25.
ITU-T recommends AAL 3/4 for CPCS,
while ATM Forum recommends AAL 5.
The Q.2931 protocol has its origins in Q.931 (N-ISDN, D channel) and Q.933
(UNI signaling for Frame Relay). Q.2931 is responsible for:
Connection establishment
Negotiation of performance parameters
VPI/VCI use instead of a D-channel (N-ISDN)
Uses meta signaling to establish signaling paths and channels (ITU-T)
ITU-T reserved VPI/VCI 0/1 for meta-signaling (seldom used) and 0/2 for
broadcast signaling (both for UNI headers).
Additionally, the ATM Forum reserved 0/5 for point-to-point signaling, 0/16 for
I-LMI, and 0/18 for PNNI.


47 {C} Herbert Haas 2005/03/11
ATM Addresses
ATM Forum defined three address formats
  ISO DCC NSAP format
  ISO ICD NSAP format
  E.164 address format
Only public networks may use E.164 address format
  May also choose other formats



48 {C} Herbert Haas 2005/03/11
ATM Addresses
[Figure: 20-byte address = Prefix (13 bytes) | ESI (6 bytes) | Sel (1 byte)]
Different types of ATM addresses
All have 20 byte length
All consist of three main parts
  Prefix (basically topology information)
  End System Identifier (ESI)
  NSAP Selector (selects application)
The NSAP Selector field is basically the same as the port number in TCP.
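The three-part structure can be sketched as a simple byte split. This is only an illustration under the 13 + 6 + 1 byte layout given above; the function name `split_atm_address` and the returned dictionary keys are hypothetical.

```python
def split_atm_address(addr: bytes) -> dict:
    """Split a 20-byte ATM address into prefix, ESI and selector."""
    if len(addr) != 20:
        raise ValueError("an ATM address is always 20 bytes long")
    return {
        "prefix": addr[:13],   # basically topology information
        "esi": addr[13:19],    # End System Identifier, 6 bytes
        "sel": addr[19],       # NSAP Selector, 1 byte
    }
```

The split is purely positional, so it works the same way for all three address formats.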


49 {C} Herbert Haas 2005/03/11
Address Flavours
DCC ATM Address Format (AFI=39): AFI | DCC | DFI | AA | reserved | RD | AREA | ESI | Sel
ICD ATM Address Format (AFI=47): AFI | ICD | DFI | AA | reserved | RD | AREA | ESI | Sel
E.164 ATM Address Format (AFI=45): AFI | E.164 | RD | AREA | ESI | Sel
AFI ... Authority and Format Identifier
ICD ... International Code Designator
DFI ... Domain and Format Identifier
AA ... Administrative Authority
RD ... Routing Domain
AREA ... Area Identifier
ESI ... Endsystem Identifier
E.164 ... ISDN Number
Sel ... NSAP Selector


50 {C} Herbert Haas 2005/03/11
Summary
ATM is the solution for B-ISDN
  Different broadband services upon common cell relay technology
Remember: 53 bytes, 5 bytes header
Services via Adaptation Layers
  AAL1, AAL2, AAL3/4, AAL5 (IP)
Quality of Service
  Details in other module
VP and VC switching

51 {C} Herbert Haas 2005/03/11
Quiz
Q1: Which framing is used with xDSL?
Q2: What are the 4 ATM basic service types regarding QoS?
Q3: ATM flow control is similar to...?
Q4: Which concepts of ATM have been copied for IP networks?
A1: ATM cells
A2: CBR, VBR, ABR, UBR
A3: Frame Relay ECN
A4: Label Swapping (MPLS), QoS signaling (RSVP) and QoS marking (DSCP)

2005/03/11 {C} Herbert Haas
N-ISDN
"It still does nothing"

2 {C} Herbert Haas 2005/03/11
Why ISDN?
During the century, Telcos
  Created telephony networks
  Created separate digital data networks
Today: Demand for various different services
  Voice, fast signaling, data applications, realtime applications,
  video streaming and videoconferences, music, fax, ...
Why was ISDN invented and what is its basic idea? Originally there were
two types of Telco networks: one for voice and one for data. Since the two traffic
types behave totally differently, it was reasonable to implement two
different technologies. Basically, synchronous techniques were used for voice
and asynchronous protocols (X.25) were used for data.
Later, additional traffic types appeared, such as voice and video streaming,
various realtime applications, and so on. Today we call these traffic types
"services".
The inventors of ISDN proposed one single network to transport all these services
in order to reduce complexity, increase maintainability, improve scalability, and
basically to save money.

3 {C} Herbert Haas 2005/03/11
What it is...
Integrated Services Digital Network
ISDN is the digital unification of the telecommunication networks for different services
ISDN ensures worldwide interoperability
All-digital interfaces at subscriber outlet
This module describes N-ISDN (!)
  Narrowband ISDN (the "normal" ISDN)
N-ISDN means Narrowband ISDN, but you can also think of "Normal ISDN".
The planning of ISDN began as early as 1976, but real-world applications became
available only in the mid-1980s. Frame Relay is also regarded as part of the
ISDN family, because it can be transported over the physical layer of ISDN,
which we will discuss soon.

4 {C} Herbert Haas 2005/03/11
Technical Overview
ISDN provides standardized UNI
  Basic Rate Interface (BRI)
  Primary Rate Interface (PRI)
Synchronous and deterministic multiplexing
  Constant delays
  Constant bandwidth
Dynamic connection establishment
  User initiated
  Temporary
ISDN specifies only a User to Network Interface (UNI), much like X.25
and Frame Relay. The main difference is that ISDN relies on deterministic,
synchronized multiplexing.
Two data rates were defined: the Basic Rate Interface (BRI) and the Primary
Rate Interface (PRI). Both are explained in more detail on the next pages.
Synchronous and deterministic multiplexing provides constant delays and
bandwidth. Therefore, a user is able to put any type of traffic upon this layer;
it works fully transparently!
The connections are established dynamically by a signaling protocol. The user
dials a number and a temporary connection is created. The signaling protocol is
the famous "Q.931". It is explained later, but you should try to memorize it
already now.

5 {C} Herbert Haas 2005/03/11
Basic Rate Interface (BRI)
2 Bearer (B) channels with 64 kbit/s each
1 Data (D) channel with 16 kbit/s
  For outband signaling purposes (mainly)
[Figure: BRI (2 B + D) towards the Telco or provider network; 144 kbit/s plus overhead]
The picture above describes the BRI, which might be installed in every household.
The BRI specifies three channels: 2 Bearer (B) channels providing 64 kbit/s each
and one signaling or Data (D) channel, providing only 16 kbit/s.
The dedicated timeslot for the Data (D) channel ensures reliable outband
signaling. In many cases the D channel is also used for other data traffic, for
example X.25 packets.
The total bandwidth of all three channels is 64 + 64 + 16 = 144 kbit/s, not counting
the overhead information.
Note that the ISDN link is terminated at the switch of the Telco or provider
network. This termination is discussed in greater detail soon.
Unlike a normal telephone connection, an ISDN connection can have more than
one telephone number. Each of these is called an MSN (Multiple Subscriber
Number).

6 {C} Herbert Haas 2005/03/11
Primary Rate Interface (PRI)
30 Bearer (B) channels with 64 kbit/s each (USA: 23 B)
1 Data (D) channel with 64 kbit/s
  For outband signaling purposes (mainly)
[Figure: PRI (30 B + D), 2.048 Mbit/s (E1 frames)]
The PRI also contains B and D channels, but now there are 30 B channels, and
the D channel has the same bandwidth of 64 kbit/s. These 31 channels plus
an additional synchronization channel result in a total data rate of 2.048 Mbit/s,
which is transported in so-called E1 frames.
Note: In the USA and Japan the ISDN PRI offers a data rate of 1.544 Mbit/s.
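The interface rates follow directly from the channel counts. The sketch below simply redoes this arithmetic; the function names are hypothetical, and the T1 figure assumes 8000 frames per second with one framing bit per frame, which is not stated on this slide.

```python
B_CHANNEL = 64  # kbit/s per bearer channel


def bri_rate() -> int:
    """2 B channels plus one 16 kbit/s D channel (overhead not counted)."""
    return 2 * B_CHANNEL + 16


def pri_e1_rate() -> int:
    """30 B channels, one 64 kbit/s D channel, one synchronization channel."""
    return (30 + 1 + 1) * B_CHANNEL


def pri_t1_rate() -> int:
    """23 B channels, one 64 kbit/s D channel, plus 8 kbit/s framing
    (assumption: one framing bit per frame, 8000 frames per second)."""
    return (23 + 1) * B_CHANNEL + 8
```

The results are in kbit/s: 144 for the BRI, 2048 for the European PRI, and 1544 for the PRI in the USA and Japan.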

7 {C} Herbert Haas 2005/03/11
ISDN Services
CCITT defined three services
  Bearer services (circuit or packet)
  Teleservices (telephony, telefax, ...)
  Supplementary services
    Reverse charging
    Hunt groups
    etc.
The CCITT (today known as ITU-T) defined three services for ISDN.
Bearer services define the transport of information in real time without alteration of
the content of the message. Both circuit mode and packet mode (virtual call and
permanent virtual circuit) are supported.
Teleservices combine transport functions with information-processing
functions, e.g. telephony, teletex, telefax, videotex, and telex.
Supplementary services can be used to enhance bearer services or teleservices.
Examples of supplementary services are reverse charging, closed user group, line
hunting, call forwarding, calling line identification, multiple subscriber number
(MSN), and subaddressing.

8 {C} Herbert Haas 2005/03/11
Functional Groups
Terminal Equipment (TE)
  TE1 is the native ISDN user device (phone, PC card, ...)
  TE2 is a non-ISDN user device (analog telephone, modem, ...)
Network Termination (NT)
  NT1 connects TEs with ISDN
  NT2 provides concentration and supplemental services (PBX)
Terminal Adapter (TA)
  TA connects TE2 with NT1 or NT2
Several "functional groups" have been specified to differentiate technical
capabilities. An end device is called a "Terminal Equipment" (TE).
A TE1 is a true ISDN device such as an ISDN telephone.
A TE2 is any non-ISDN device that can be attached to the ISDN interface via a
Terminal Adapter (TA).
An NT1 connects the 4-wire TE1 to the 2-wire ISDN link to the Telco switch, also
known as the Local Exchange (LE).
An NT2 is an optional device that provides concentration of multiple local
premises phone lines and connection to the LE. This device is also called a
Private Branch Exchange (PBX) and might provide many additional services,
depending on the vendor.

9 {C} Herbert Haas 2005/03/11
Reference Points
Logical interfaces between functional groups
  R connects PSTN equipment with TA
  S connects TEs with NT2
  T connects NT2 with NT1
  U connects NT1 with Exchange Termination (ET)
Besides the functional groups, "reference points" have also been specified.
Reference points identify logical interfaces between the previously mentioned
functional groups.


10 {C} Herbert Haas 2005/03/11
Reference Diagram (BRI)
[Figure: Home: up to 8 TEs (TE1, TE1, and a TE2 attached via R to a TA) on the 4-wire S/T interface
of the NT1; 2-wire U interface from the NT1 to the phone company's ISDN switch (LT, V, ET).
The figure also marks the termination points in Europe and in the USA.]
LT ... Line Termination
ET ... Exchange Termination
TA ... Terminal Adapter
TE ... Terminal Equipment
NT ... Network Termination
A TE2 is, for example, a plain old telephone (POT) or an analog modem. The R
interface is typically EIA/TIA-232-C, V.24, or V.35.
Basically, the NT1 converts the U interface to the S/T interface: 2 wires to 4 wires, different
coding scheme, different bit rates (160 to 192 kbit/s). Furthermore, the NT1 takes care
of synchronization, multiplexing of B and D channels, and optional power
provision for TEs. Some people just call it an ISDN modem. Never say that.

11 {C} Herbert Haas 2005/03/11
Reference Diagram (PRI)
[Figure: Company: local telephones attach via S to the PBX (NT2); T connects the NT2 to the NT1;
U connects the NT1 to the phone company's ISDN switch (LT, V, ET). Annotation: "Can be a single device".]
The picture above shows the principle of a PRI installation, using a PBX (NT2)
which terminates all local telephones. Note that these telephones are not
necessarily ISDN-compliant telephones. Rather, vendor-proprietary technologies
are used here.

12 {C} Herbert Haas 2005/03/11
U-Interface
Recommendation G.961
  160 kbit/s (remaining capacity used for framing and synchronization)
Either echo cancellation or time compression (ping-pong)
2B1Q (ANSI T1.601)
  -2.5 V, -0.833 V, +0.833 V, +2.5 V
  Requires half the BW of NRZ
  Plus scrambling for synchronization and uniform PSD distribution
The U interface is defined in CCITT Recommendation G.961 and specifies a
160 kbit/s transmission method over two wires. Bidirectional communication is
provided either by echo cancellation or by "ping-pong" transmission, i.e. both
sides alternate between sending and receiving within short time periods.
"Two Binary One Quaternary" (2B1Q) digital coding is used on this interface.

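The idea behind 2B1Q, two bits per four-level symbol, can be sketched in a few lines. The voltage levels are the ones listed on the slide; the assignment of bit pairs to levels (first bit selects the sign, second bit the magnitude) is the commonly cited mapping and should be treated as an assumption here, as should the function name `encode_2b1q`.

```python
# Assumed mapping: first bit = sign, second bit = magnitude (0 = outer level).
LEVELS = {
    (1, 0): +2.5,
    (1, 1): +0.833,
    (0, 1): -0.833,
    (0, 0): -2.5,
}


def encode_2b1q(bits: list[int]) -> list[float]:
    """Map each pair of bits to one quaternary symbol (voltage level)."""
    if len(bits) % 2:
        raise ValueError("2B1Q needs an even number of bits")
    return [LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]
```

Because every symbol carries two bits, the symbol rate is half the bit rate, which is why 2B1Q requires only half the bandwidth of NRZ.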
13 {C} Herbert Haas 2005/03/11
ISDN Channels
TEs just require one D and 1 or 2 B channels
High-speed PRI applications can be connected with so-called H channels
  H0 (6B = 384 kbit/s)
  H11 (24B = 1536 kbit/s)
  H12 (30B = 1920 kbit/s)
H channels are bundles of B channels to obtain a higher data rate.

14 {C} Herbert Haas 2005/03/11
Layers
[Figure: ISDN layer model]
Control Plane (D channel): Q.931 over Q.921 (LAPD)
User Plane (B or H channel): user specified
Common physical layer: I.430 (BRI) or I.431 (PRI)
The diagram above shows the ISDN layer model. There is one common physical
layer, which is either I.430 (the BRI) or I.431 (the PRI). Note the vertical
separation above the common physical layer: a clear sign of outband signaling!
On the left side the signaling protocol Q.931 can be identified in this diagram.
This "Control Plane" protocol carries the dial numbers and is itself carried by
Q.921, an HDLC variant providing reliable delivery of data between two adjacent
interfaces, i.e. between TE and LE.
On the right side the "User Plane" is specified as an open interface. That is, the
user can put any service directly upon the synchronous physical layer.

15 {C} Herbert Haas 2005/03/11
Additional Standards
Q.920 (I.440)
  Layer 2 UNI general aspects
Q.921 (I.441)
  Layer 2 UNI specification and LAPD
Q.930 (I.450)
  Layer 3 UNI general aspects
Q.931 (I.451)
  Layer 3 UNI specification and call control procedures
Just for your interest, and to provide a complete description, the most important
standards are listed here as a summary.

16 {C} Herbert Haas 2005/03/11
I.430 S/T-Bus
S/T interface is implemented as bus
  Point-to-point
    Maximum distance between TE and NT is 1 km (!)
    Requires a PBX
  Multipoint
    Up to 8 TEs can share the bus
    Maximum distance between TE and NT is 200 meters (short bus) or 500 meters (extended bus)
An ISDN interface can be configured either in multipoint mode or in
point-to-point mode.
The point-to-point mode is the normal connection mode for business ISDN users.
The user can attach only one single device to the ISDN connection, which then
has to handle all calls (typically a PBX is used).
The ISDN provider assigns a range of numbers to the ISDN connection. Any
call within this number range is sent to the user. The ISDN provider
leaves the assignment of the last digits of the telephone number to the ISDN user. This
setup usually allows for additional features, but is also more expensive.

17 {C} Herbert Haas 2005/03/11
Multipoint Configuration
D channel is shared by all TEs
  To request usage of B channels
  Contention mode
B channels are dynamically assigned to TEs
  Exclusive usage only (!)
The multipoint configuration is typically used for private users. Here the D
channel is shared by up to 8 TEs. The D channel is used similarly to an Ethernet
bus medium: contention takes place! The winner gets a B channel for
communication. This B channel is dynamically assigned and immediately
released when the call is terminated.

18 {C} Herbert Haas 2005/03/11
S/T Bus Details
192 kbit/s = 144 kbit/s (2B+D) + 48 kbit/s for framing, D-echoing, and DC balancing
48-bit frames every 250 µs
  Modified AMI code (zero modulation)
  Bit stuffing
  Synchronization through code violation
Two B channels and one D channel plus 48 kbit/s overhead result in a sum of
192 kbit/s. This is the data rate actually provided by a BRI. The overhead is
necessary for framing, bus arbitration, and DC balancing. These details can be
seen on the next page.
Electrical details:
  RJ-45 connectors with 8 pins
    2 TX
    2 RX
    4 optional power feeds
  100 Ω termination impedance
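The 192 kbit/s figure can be checked from the frame timing. The sketch below assumes the per-frame bit counts of the S/T frame (each B channel contributes two 8-bit fields, the D channel four single bits per 48-bit frame); the names are hypothetical.

```python
FRAME_BITS = 48
FRAME_PERIOD_US = 250
FRAMES_PER_SECOND = 1_000_000 // FRAME_PERIOD_US  # 4000 frames per second

# Assumed payload bits per frame: two 8-bit fields per B channel, four D bits.
BITS_PER_FRAME = {"B1": 16, "B2": 16, "D": 4}


def rate_bps(bits_per_frame: int) -> int:
    """Bit rate resulting from a fixed number of bits in every frame."""
    return bits_per_frame * FRAMES_PER_SECOND


line_rate = rate_bps(FRAME_BITS)
payload_rate = sum(rate_bps(bits) for bits in BITS_PER_FRAME.values())
overhead_rate = line_rate - payload_rate
```

With these numbers the line rate is 192 kbit/s, the 2B+D payload is 144 kbit/s, and 48 kbit/s remain for framing, D-echoing, and DC balancing.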

19 {C} Herbert Haas 2005/03/11
S/T-Bus
[Figure: S/T frame structure, 48 bits in 250 µs; each B1 and B2 field is 8 bits wide]
TE to NT: F L | B1 L | D L | FA L | B2 L | D L | B1 L | D L | B2 L | D L
NT to TE: F L | B1 | E D A FA N | B2 | E D M | B1 | E D S | B2 | E D L
F ... Framing bit
L ... DC balancing bit
E ... D-echo channel bit
A ... Activation bit
FA ... Auxiliary framing bit
N ... Set to opposite of FA
M ... Multiframing bit
S ... Spare bits
F (+) followed by L (-) marks the start of a frame. To keep the F pattern from appearing
in the bit stream, code violations are used (normally, alternating pulses (+, -) are used for zeroes).
General rule: the first logical zero to be transmitted uses a code violation symbol.
In case of "all ones", the FA bit performs the code violation. The auxiliary framing bit FA
is always set to 0; N is always the inverse of FA (1 here).
L bits are used to guarantee DC balance:
  From NT to TE only one L bit is necessary.
  From TE to NT every part of the frame (B1, B2, and D) is balanced by individual L bits.
  Reason: every part of the frame (B1, B2, D) may be sent by a different TE, hence every TE
  must balance its own part.

20 {C} Herbert Haas 2005/03/11
D-Channel Access Control (1)
Before TE may use D channel: Carrier Sense
  At least eight ones (no signal activity) in sequence must be received
Then TE may transmit on D channel: Collision Detection
  If E bits are unequal to D bits, TE will stop transmission and wait for the next eight ones in sequence
In multipoint mode the S/T bus is used in contention mode, similar to Ethernet.
Before a TE may use the D channel it must listen whether some other TE is sending:
"carrier sense" is performed. At least eight "1" bits must be received in
sequence. Since inverse AMI coding is used, this means that nobody is
currently sending.
Then the TE may transmit data (e.g. a Q.931 packet within a Q.921 frame) on the
D channel. But while sending, this station must perform collision detection by
observing the echo bits, which reflect all sent bits back from the NT.

21 {C} Herbert Haas 2005/03/11
D-Channel Access Control (2)
When using D channel
  Bit stuffing prevents sequence of eight ones for the rest of the message
Fairness
  TE must release D channel after message was sent
  Next time, this TE must wait for a sequence of nine ones
Of course, measures must be implemented to avoid eight ones during sending;
otherwise another TE might assume that the S/T bus is idle! Thus bit stuffing is performed
in such cases (inserting a zero).
Furthermore, a TE that has recently succeeded must wait for nine ones before
grabbing the D channel again. This method ensures fairness among the TEs.
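The access procedure can be sketched as a tiny simulation. It rests on one assumption drawn from the text: a "1" means no signal activity, so a "0" sent by any TE dominates on the bus and is what the NT echoes back. All function names are hypothetical, and the frames are assumed to have equal length.

```python
def required_idle_ones(recently_sent: bool) -> int:
    """Ones a TE must see before sending: nine after a success, else eight."""
    return 9 if recently_sent else 8


def contend(frames: list[list[int]]) -> set[int]:
    """Return the indices of the TEs that survive D-channel contention.

    All TEs start sending at the same time. For every bit position the
    echoed E bit is 0 as soon as any active TE sends 0; a TE whose own
    D bit differs from the echo stops transmitting.
    """
    active = set(range(len(frames)))
    for pos in range(len(frames[0])):
        echo = min(frames[i][pos] for i in active)   # a zero dominates
        active = {i for i in active if frames[i][pos] == echo}
    return active
```

As long as the frames differ, exactly one TE remains, and its frame passes undamaged: the losers withdraw before they disturb any bit.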

22 {C} Herbert Haas 2005/03/11
PRI (I.431)
Point-to-point configuration only
Europe: E1
  30 B channels
  1 D channel (also 64 kbit/s)
  1 framing channel
USA: T1
  23 B channels
  1 D channel
T1 frame synchronization is achieved using a single bit at the beginning of the
frame. Both E1 and T1 are explained in another module (Telco Backbones).

23 {C} Herbert Haas 2005/03/11
LAPD (Q.921)
Link Access Procedure D-Channel
  Based on HDLC ABM mode
  2 byte address field (SAPI + TEI)
  Optionally extended sequence numbering (0-127)
Carries Q.931 packets
May also be used to carry user traffic
  For example X.25 packets
Note that the D channel is idle most of the time because it is only needed when
establishing or closing a connection. Because of this, many providers allow users to send
user data over the D channel, using for example X.25. Of course this is not a free
service, because the provider network has to transport this data, so users have to
pay for it.

24 {C} Herbert Haas 2005/03/11
LAPD Frame Format
[Figure: frame layout, one octet per row, bits 0-7]
Flag
Address, first octet: SAPI | C/R | EA
Address, second octet: TEI | EA
Control
Information
FCS
Flag
SAPI ... Service Access Point Identifier
TEI ... Terminal Endpoint Identifier
EA ... Address Field Extension Bit
C/R ... Command/Response Bit
The picture above shows the Q.921 or LAPD frame format. The Service Access
Point Identifier (SAPI) and the Terminal Endpoint Identifier (TEI) are described
next.
FYI: Together, the SAPI and TEI are also called the Data Link Connection Identifier (DLCI,
like in Frame Relay).
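The two address octets can be decoded with a few shifts and masks. This is a minimal sketch assuming the usual Q.921 bit layout (SAPI in the upper six bits of the first octet, then C/R and EA; TEI in the upper seven bits of the second octet, then EA); the function name `parse_lapd_address` is hypothetical.

```python
def parse_lapd_address(addr: bytes) -> dict:
    """Decode the 2-byte LAPD address field into its parts."""
    if len(addr) != 2:
        raise ValueError("the LAPD address field is 2 bytes long")
    first, second = addr
    return {
        "sapi": first >> 2,        # 6 bits
        "cr": (first >> 1) & 1,    # command/response bit
        "ea0": first & 1,          # extension bit of the first octet
        "tei": second >> 1,        # 7 bits
        "ea1": second & 1,         # extension bit of the second octet
    }
```

For instance, the octets 0xFC 0xFF decode to SAPI 63 and TEI 127.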

25 {C} Herbert Haas 2005/03/11
TEI
When TE occupies D channel, the ET (switch) assigns a Terminal Endpoint Identifier (TEI) to it
LAPD frames carry TEI
  To identify source (TE → ET)
  To identify destination (ET → TE)
Possible values: 0-127
A switch (LE) would otherwise not know which TE is currently active and has grabbed
the D channel. Therefore a Terminal Endpoint Identifier (TEI) is assigned to the
TEs. The LAPD frames carry the TEI, which can be compared to an Ethernet
MAC address, while the telephone number is similar to an IP address in this
context.

26 {C} Herbert Haas 2005/03/11
TEI Management
TEIs are either assigned automatically
  By switch (ET)
  TEI value range 64-126
Or preconfigured
  Checking for duplicates necessary
  TEI value range 0-63
TEI = 127 reserved for broadcasting
Note that the TEI is not used for primary rate interfaces (PRI) because PRIs do not
support multipoint connections. Here the TEI is always set to zero.
The local switching station (or, with an internal S0 bus, the PBX) assigns each end
device a Terminal Endpoint Identifier (TEI), either automatically or permanently. This simply
allows the addressing of the D channels. TEIs have the following values:
  0-63: permanent TEIs (e.g. 0 is used for point-to-point connections)
  64-126: automatically assigned
  127: broadcast to all devices (e.g. an incoming call)

27 {C} Herbert Haas 2005/03/11
SAPI
Service Access Point Identifier (SAPI)
  OSI interface to layer 3
  "Identifies payload"
    0: signaling information (s-type)
    16: packet data (p-type)
    63: management information
Additionally, a Service Access Point Identifier (SAPI) is needed to identify the
content of the LAPD frame. Each SAPI number identifies a layer 3 service. For
example, Q.931 services might be addressed, or the SAPI might indicate that
the LAPD payload is an X.25 data frame.
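The value ranges given for SAPI and TEI can be captured in two small lookup helpers. This is only an illustrative sketch; the function names and the returned strings are hypothetical, and only the values named in this module are covered.

```python
SAPI_MEANING = {
    0: "signaling information (s-type)",
    16: "packet data (p-type)",
    63: "management information",
}


def describe_sapi(sapi: int) -> str:
    """Return the payload type for the SAPI values named in this module."""
    return SAPI_MEANING.get(sapi, "not covered in this module")


def describe_tei(tei: int) -> str:
    """Classify a TEI according to the ranges 0-63, 64-126 and 127."""
    if not 0 <= tei <= 127:
        raise ValueError("TEI must be in the range 0-127")
    if tei <= 63:
        return "preconfigured"
    if tei <= 126:
        return "automatically assigned"
    return "broadcast"
```

A frame with SAPI 63 and TEI 127, for example, is a management message addressed to all TEs.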

28 {C} Herbert Haas 2005/03/11
TEI Management Messages
UI frames with SAPI = 63 and TEI = 127
Information field contains
  Reference indicator (RI) to correlate request and responses
  Action indicator (AI) to specify TEI in question
  Message type
Management messages are also identified by a special SAPI (63), combined with a
TEI of 127, which addresses all TEs (broadcast). These management messages are
used to assign TEIs to the TEs.
Examples of message types are:
  IDRequest, IDCheck Response, IDVerify (TE to NT) and
  IDAssigned, IDDenied, IDCheck Request, IDRemove (NT to TE)

29 {C} Herbert Haas 2005/03/11
Q.931
Carries signaling information
  Call control
  E.g. dial number and ring information
Terminated by ET
ET is real 7-layer gateway
  Translates Q.931 into Signaling System 7 (SS#7)
Country-dependent versions (!)
Q.931 is a signaling protocol used by N-ISDN and also (slightly enhanced) by
B-ISDN. Using Q.931 the dial number is forwarded to the Telco switch, which
terminates the D channels and puts all signaling information on top of another
signaling protocol. Typically SS#7 is used in most Telco networks.
FYI: Some special features
CLIP (Calling Line Identification Presentation) can be offered by the ISDN provider. When you
call somebody, your telephone number is transmitted to the other phone. The opposite of
CLIP is CLIR: one can (from call to call) restrict the presentation of one's own caller ID to the
other party.
COLP (Connected Line Identification Presentation) can also be offered by the ISDN provider.
COLP provides an extended dialing protocol. You receive feedback from your
telecommunication company about who picked up your outgoing call. Normally, you get the same
number as you dialed beforehand; however, with call diversion this could also be a different
number.

30 {C} Herbert Haas 2005/03/11
ISDN Switch Types
BRI
  Basic-net3 (Euro ISDN)
  5ESS, DMS-100, NT1 (USA)
  NTT (Japan)
  Basic 1TR6 (Germany, old)
  VN2, VN3 (France)
  TS013 (Australia)
PRI
  primary-net5 (Euro ISDN)
  4ESS, 5ESS, DMS-100 (USA)
  NTT (Japan)
  TS014
When configuring ISDN devices it is very important to know the switch
(LE) type because there are many flavors.
The list above presents the most important ISDN BRI and PRI interface variants.

31 {C} Herbert Haas 2005/03/11
Q.931 Packet Format
[Figure: packet layout, one octet per row, bits 0-7]
Protocol Discriminator
0 0 0 0 | Call Ref. Length
F | Call Reference (random number)
0 | Message Type
Information Elements
Message Types:
Call Establishment: ALERTing, CALL PROCeeding, CONNect, CONNect ACKnowledge, SETUP,
  SETUP ACKnowledge
Call Information Phase: RESume, RESume ACKnowledge, RESume REJect, SUSPend,
  SUSPend ACKnowledge, SUSPend REJect, USER INFOrmation
Call Clearing: DETach, DETach ACKnowledge, DISConnect, RELease, RELease COMplete,
  REStart, REStart ACKnowledge
Miscellaneous: CANCel, CONgestion CONtrol, FACility (Ack, Rej), INFOrmation,
  REGister (Ack, Rej), STATUS
The Q.931 packet format is given in the picture above only to provide a consistent
ISDN overview here. It is not necessary to memorize this structure in detail.
However, it should be noted that the actual information is carried in so-called
"Information Elements" (IE). Several Q.931 messages are listed on the right-hand
side of the packet. Each message type is identified in the corresponding field in
the header and supports a specific set of IEs.
The protocol discriminator is set to 0x08 (except 1TR6: 0x41).
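The header fields can be extracted as sketched below. The layout follows the packet format shown above; the function name `parse_q931_header` is hypothetical, and the numeric message type codes in the table are quoted from Q.931 rather than from this module, so they should be checked against the recommendation before being relied upon.

```python
# Message type codes as commonly listed for Q.931 (not taken from the slide).
MESSAGE_TYPES = {
    0x01: "ALERTING",
    0x02: "CALL PROCEEDING",
    0x05: "SETUP",
    0x07: "CONNECT",
    0x0D: "SETUP ACKNOWLEDGE",
    0x0F: "CONNECT ACKNOWLEDGE",
    0x45: "DISCONNECT",
    0x4D: "RELEASE",
    0x5A: "RELEASE COMPLETE",
}


def parse_q931_header(packet: bytes) -> dict:
    """Split a Q.931 packet into header fields and the raw IE bytes."""
    ref_len = packet[1] & 0x0F                 # upper four bits are zero
    ref = packet[2:2 + ref_len]
    flag = ref[0] >> 7 if ref_len else 0       # F bit of the call reference
    value = int.from_bytes(ref, "big") & ((1 << (8 * ref_len - 1)) - 1) if ref_len else 0
    msg_type = packet[2 + ref_len] & 0x7F      # top bit is always zero
    return {
        "discriminator": packet[0],
        "call_ref_flag": flag,
        "call_ref": value,
        "message": MESSAGE_TYPES.get(msg_type, hex(msg_type)),
        "ies": packet[3 + ref_len:],
    }
```

Everything after the message type octet is returned unparsed, because the information elements have their own structure.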

32 {C} Herbert Haas 2005/03/11
Information Elements Examples
0x04  Bearer Capability (e.g. 0x8890 ... dig. 64 kb/s circuit)
0x08  Cause (reason codes for call disconnect)
0x18  Channel Identification
0x1E  Progress Indicator (check for 56 kb/s connection)
0x2C  Keypad
0x6C  Calling Party Number
0x6D  Calling Party Subaddress
0x70  Called Party Number
0x71  Called Party Subaddress
0x7C  Low-Layer Compatibility
0x7D  High-Layer Compatibility
In order to get a practical understanding of how Q.931 works, the table above
shows some examples of important Information Elements. The left column shows
the Information Element Identifier, which is used at the beginning of each IE in
order to identify this IE. The IE structure is not shown in this chapter.
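How the identifiers are used when reading a message can be sketched as follows. Since the IE structure is not shown in this chapter, the sketch assumes the general Q.931 convention: a variable-length IE consists of identifier, length octet, and contents, while an identifier with the top bit set stands alone as a single-octet IE. The function name `parse_ies` is hypothetical.

```python
IE_NAMES = {
    0x04: "Bearer Capability",
    0x08: "Cause",
    0x18: "Channel Identification",
    0x1E: "Progress Indicator",
    0x2C: "Keypad",
    0x6C: "Calling Party Number",
    0x6D: "Calling Party Subaddress",
    0x70: "Called Party Number",
    0x71: "Called Party Subaddress",
    0x7C: "Low-Layer Compatibility",
    0x7D: "High-Layer Compatibility",
}


def parse_ies(data: bytes) -> list[tuple[str, bytes]]:
    """Walk through the IE part of a message and return (name, contents)."""
    result = []
    i = 0
    while i < len(data):
        ie_id = data[i]
        name = IE_NAMES.get(ie_id, hex(ie_id))
        if ie_id & 0x80:                       # assumed: single-octet IE
            result.append((name, b""))
            i += 1
            continue
        length = data[i + 1]
        result.append((name, data[i + 2:i + 2 + length]))
        i += 2 + length
    return result
```

The identifiers in `IE_NAMES` are the ones from the table above; unknown identifiers are reported by their hexadecimal value.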

33 {C} Herbert Haas 2005/03/11
Call Establishment
[Figure: message sequence between calling TE, LE, and called TE: Setup, Call Proceeding,
Alerting, Progress, Connect, Connect Ack]
Examples of Setup Information Elements are:
  Bearer Capability IE
    Voice/data call/fax, speed (64/56), transfer mode (packet/circuit),
    user info L2 (I.441/X.25 L2), user info L3 (I.451/X.25 L3)
  Channel Identification IE
    Defines which B channel is used
  Called-Party number IE
    Whom are you calling
  Calling-Party number IE
    Who is calling you (does not need to be delivered)
  Keypad IE
    Can be used instead of called-party number
  High-Layer Compatibility IE
    Used with the BC to check compatibility
Note: IEs vary among switch types (!)

34 {C} Herbert Haas 2005/03/11
Call Release
[Figure: message sequence between TE and LE, two alternatives (OR), each consisting of
Disconnect (cause), Release, Release Complete]
At the end of this chapter we release the ISDN call with the messages shown
above.

35 {C} Herbert Haas 2005/03/11
Summary
Dynamic circuit switching
BRI (2B+D) and PRI (30B+D)
  Bearer channels (B)
  Signaling channel (D)
Q.921 (LAPD) and Q.931 on D channel
Reference points (R, S, T, U)
Functional groups
  TE1, TE2, TA, NT1, NT2, ET



36 {C} Herbert Haas 2005/03/11
Quiz
What voltage might be supplied for power supply?
The U interface is full-duplex but there are only two wires...? How does it work?

2005/03/11 {C} Herbert Haas
Telco Scalable Backbones
PDH, SONET/SDH

"Everything
that can be invented
has been invented"
Charles H. Duell,
commissioner of the
US Office of Patents, 1899

3 {C} Herbert Haas 2005/03/11
Agenda
Basics
  Shannon
  Jitter
  Companding laws
  Digital hierarchies
PDH
SONET/SDH
This chapter gives an introduction to the complex world of Telco technologies.
First we discuss transmission basics related to voice and scalability issues.
In order to understand these technologies it is important to know about Shannon's
laws, jitter problems, signal-to-noise problems, and digital hierarchy concepts.
After these basics sections this chapter presents two important Telco backbone
technologies, PDH and SONET/SDH.

4 {C} Herbert Haas 2005/03/11
Long History
Origins in late 19th century
Voice was/is the yardstick
  Same terms ("circuit", "cross-connect")
  Same signaling principles
  Even today, although data traffic increases dramatically
  Led to technological constraints and demands
Telco technologies have a long history. Their origins date back to the late 19th
century. Originally voice transmission was the only goal. Even today the
characteristics of voice transmission form the basic design of Telco technologies
such as PDH and SONET/SDH.

5 {C} Herbert Haas 2005/03/11
General Goals
Interoperability
  Over decades
  Over different vendors
  World-wide!
Availability
  Protection lines in case of failures
  High non-blocking probability
The most important goals for Telco technologies are interoperability and
availability.
Telco backbones are laid throughout nations and must therefore function over
several decades and must integrate with older technologies and different vendors.
Actually, people expect to communicate from any phone on earth to any other
phone on earth.
Due to the big size of these networks even a small error probability can cause a
denial of service for thousands or even millions of users. Because of this, the
Telco backbones must be designed for high availability, for example using
redundant protection lines which are activated in case of failures.
Additionally, it cannot be economically justified to dimension a backbone
connection, for instance between two cities, so that it could support all possible
users at the same time. Therefore the user behavior must be estimated and complex
statistical calculations are made in order to dimension the link.

6 {C} Herbert Haas 2005/03/11
Sampling of Voice
Shannon's Theorem
  Any analogue signal with limited bandwidth f_B can be sampled and reconstructed
  properly when the sampling frequency is at least 2 f_B
Speech signal has most of its power and information between 0 and 4000 Hz
[Figure: power over frequency; telephone channel: 300-3400 Hz]
8000 Hz x 8 bit resolution = 64 kbit/s
Shannon's sampling theorem requires that each bandwidth-limited signal be
sampled at a rate of at least twice the cut-off bandwidth of the signal
in order to support an error-free (alias-free) reconstruction of the signal.
Since speech signals have most of their power below 4 kHz it has been agreed that
speech is to be sampled 8000 times per second.
From this it follows that when each signal sample is encoded by one byte, a data
rate of 64 kbit/s is necessary to transmit digital speech.
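The 64 kbit/s figure is just the product of sampling rate and sample size. The following lines redo this calculation as a small sketch; the function names are hypothetical.

```python
def min_sampling_rate(bandwidth_hz: int) -> int:
    """Lowest sampling rate allowed by the sampling theorem (2 x bandwidth)."""
    return 2 * bandwidth_hz


def pcm_bit_rate(sampling_rate_hz: int, bits_per_sample: int) -> int:
    """Bit rate of a digitized signal in bit/s."""
    return sampling_rate_hz * bits_per_sample


speech_rate = pcm_bit_rate(min_sampling_rate(4000), 8)
```

With a bandwidth of 4000 Hz and one byte per sample, `speech_rate` is 64000 bit/s, the rate of one B channel.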

7 {C} Herbert Haas 2005/03/11
Isochronous Traffic
Data rate end-to-end must be constant
Delay variation (jitter) is critical
  To enable echo suppression
  To reconstruct sampled analog signals without distortion
Next, it is important to understand the properties of isochronous traffic. "Iso"
means "equal" and "chronous" means "time". That is, each portion of data of an
isochronous traffic stream must be delivered with exactly the same delay.
Delay variations, also called "jitter", are very critical for isochronous traffic. For
example, telephony requires isochronous transmission: because of the bidirectional
communication, echo suppression is necessary. But how to suppress echoes when
they arrive at different times?

8 {C} Herbert Haas 2005/03/11
Realtime Traffic
Requires guaranteed bounded delay "only"
Examples:
  Telephony (< 1 s RTT)
  Interactive traffic (remote operations)
  Remote control
  Telemetry
Realtime traffic does not necessarily require "fast" transmission. It only demands
"fast enough" transmission. That is, a bounded delay is defined within which all
required data must be received.

9 {C} Herbert Haas 2005/03/11
Solutions
Isochronous network
  Common clock for all components
  Aka "synchronous" network
Plesiochronous network
  With end-to-end synchronization somehow
Totally asynchronous network
  Using buffers (playback) and QoS techniques
There are several solutions to support telephony, which has both isochronous and
realtime properties.
First, a totally synchronous network can be created, utilizing a common clock for all
network components.
Second, a plesiochronous network can be created, which is "nearly" synchronous
but at least synchronized between end users.
Third, an asynchronous network can be used, such as the Internet or similar. Here
it is very tricky to achieve end-to-end synchronization and bounded delays.
Modern Quality of Service (QoS) techniques make it possible to overcome the
problems of asynchronous networks at least partly.

10 {C} Herbert Haas 2005/03/11
Improving SNR
SNR improvement of speech signals
  Quantize loud signals much coarser than quiet signals
Expansion and compression specified by nonlinear function
  USA: µ-law (Bell)
  Europe: A-law (CCITT)
[Figure: quantization levels over analogue input signal]
Conversion is task of the µ-law world
The Signal-to-Noise Ratio (SNR) is an indicator of signal quality. Furthermore, a
better SNR allows lower signal strengths and higher data rates.
Digital voice is generally "companded", that is, the higher amplitude levels are
quantized at a lower resolution and the smaller amplitudes at a higher quantization
resolution. The characteristic of this compression and expansion technique is
expressed by a nonlinear function. In
the USA the so-called µ-law (Bell) is used, while in Europe the CCITT defined the A-law
function to improve the SNR.
Note that digital voice signals have to be converted when the µ-law world talks to
the A-law world or vice versa. The rule is that the conversion must be a task of
the µ-law world.
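The effect of such a nonlinear function can be sketched with the continuous µ-law characteristic. The formula and the value µ = 255 are the ones commonly given for the North American system and are an assumption here, as this module does not state them; the function names are hypothetical. Input and output are normalized to the range -1 to +1.

```python
import math

MU = 255  # assumed compression parameter


def compress(x: float) -> float:
    """Compress a sample: small amplitudes are stretched, large ones squeezed."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y: float) -> float:
    """Inverse of compress(), applied at the receiving side."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)
```

A quiet sample of 0.01 is mapped to roughly 0.23 before quantization, so it is spread over many more quantization levels than with a linear scale, which is exactly the finer resolution for small amplitudes described above.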

11 {C} Herbert Haas 2005/03/11
Plesiochronous Digital Hierarchy
Created in the 1960s as successor of analog telephony infrastructure
Smooth migration
  Adaptation of analog signaling methods
Based on Synchronous TDM
Still important today
  Telephony access level
  ISDN PRI
  Leased line
In the middle of the 20th century, the telephony network infrastructure was still
analog and very complex. Each connection was realized by a dedicated bundle of
wires and all terminated in the central office. Signaling was slow and primitive
and switching was a time-consuming process. Furthermore, speech quality degraded on
long-haul connections.
In the 1960s digital backbones were created and also digital signaling protocols
such as SS#7. Central office equipment became smaller and more efficient and
the number of wires was reduced drastically. This technology was called
Plesiochronous Digital Hierarchy (PDH) and is based on synchronous TDM;
however, it was not fully synchronous because of the technical restrictions of those days.
PDH is still important and used today.

Why Plesiochronous?
1960s technology: No buffering of frames at high speeds possible
Goal: Fast delivery, very short delays (voice!)
  Immediate forwarding of bits
  Pulse stuffing instead of buffering
Plesiochronous = "nearly synchronous"
  Network is not synchronized but fast
  Sufficient to synchronize sender and receiver

What exactly does "plesiochronous" mean? First, it was clear that a digital
backbone must be able to concentrate at least hundreds (or even thousands) of
telephone calls. Assuming a data rate of 64 kbit/s per call, a backbone carrying
about 480 calls already needs roughly 30 Mbit/s.
In the 1960s it was nearly impossible to design hardware able to buffer
frames at that rate. Moreover, buffering introduces delays, but isochronous
real-time traffic has to be transported. So how can slightly different data rates
be compensated?
Ideally each bit is immediately forwarded by the network nodes without
buffering. Bit rate differences are compensated by a so-called "pulse stuffing"
technique, which is also sometimes called "bit stuffing". Using this method any
node of the network can compensate phase drifts due to differences of the sending
rate by inserting or removing single bits of the stream.
Of course the lowest rates must be synchronized in order to obtain a correct signal.


Why Hierarchy?
Only a hierarchical digital multiplexing infrastructure
  Can connect millions of (low speed) customers across the city/country/world
Local infrastructure: Simple star
Wide area infrastructure: Point-to-point trunks or ring topologies
  Grooming required

Now we know the meaning of the term "plesiochronous". But what is meant by
the term "hierarchy" in this context? Telcos were supposed to supply
millions of users with a dial tone. Which topology would be most efficient? Only
a star topology can efficiently cover whole villages, cities, and even countries. A
star consists of many point-to-point connections: each spoke is connected to a hub.
The hub is called the "Central Office" (CO) and the spokes are either telephones
or multiplexers.
Traffic always concentrates toward the hubs but is also distributed from the hubs. The
hubs are interconnected by PDH trunks. Many trunks in turn act as spokes and are
concentrated in another (higher-level) hub. This principle is applied
recursively, forming a so-called Digital Hierarchy. The deeper you go into this
hierarchy, the higher the data rates.
The backbone itself consists of point-to-point or ring topologies. Rings have the
advantage of providing one redundant connection between any two nodes.
Of course the number of links is much lower in the heart of the hierarchy
(therefore the data rate per link is much higher). Hubs are responsible for collecting
all user signals that are destined for the same direction and putting them onto the
same trunk. This process is called "grooming".

Digital Hierarchy of Multiplexers
Example: European PDH
[Figure: tree of multiplexers combining 64 kbit/s channels into E1, E2, E3, and E4]
E1 = 30 x 64 kbit/s + Overhead
E2 = 4 x 30 x 64 kbit/s + Overhead
E3 = 4 x 4 x 30 x 64 kbit/s + Overhead
E4 = 4 x 4 x 4 x 30 x 64 kbit/s + Overhead
The picture above shows the digital multiplexing hierarchy used in European PDH
networks. The lowest data rate uses so-called "E1" frames, consisting of 30 user
signals. At each multiplexing level four lower rate channels can be combined into
one higher rate channel. This way an "E2", "E3", and "E4" are formed.
Higher multiplexing levels have also been defined, for example "E5", but they are
not used very often.
Digital Signal Levels
Differentiate:
  Signal (Framing Layer)
  Carrier (Physical Layer)
North America (ANSI)
  DS-n = Digital Signal level n
  Carrier system: T1, T2, ...
Europe (CEPT)
  CEPT-n = ITU-T digital signal level n
  Carrier system: E1, E2, ...

The Telco world differentiates between the digital signal level and the carrier
system. The signal level can be regarded as the OSI link layer, and the carrier
system is similar to the OSI physical layer. Note that this picture is not really
correct because the OSI model cannot really be applied to this world.
In North America the ANSI is responsible for Telco standardization efforts and
defined the so-called Digital Signal (DS) to identify the framing layer. For example,
DS-0 is the 64 kbit/s user signal and DS-1 denotes the first multiplexing level.
Correspondingly, the carrier system for DS-1 is called T1, DS-2 is carried upon T2,
and so on.
The same thing happened in Europe. The European Conference of Postal and
Telecommunications Administrations (CEPT, whose telecom standardization work later
moved to ETSI) defined signal levels CEPT-1, CEPT-2, and so on, to be carried
upon E1, E2, et cetera.

Worldwide Digital Signal Levels

North America
Signal    Carrier   Mbit/s     Channels
DS0       -         0.064      1
DS1       T1        1.544      24
DS1C      T1C       3.152      48
DS2       T2        6.312      96
DS3       T3        44.736     672
DS4       T4        274.176    4032

Europe
Signal    Carrier   Mbit/s     Channels
DS0       "E0"      0.064      1
CEPT-1    E1        2.048      32
CEPT-2    E2        8.448      128
CEPT-3    E3        34.368     512
CEPT-4    E4        139.264    2048
CEPT-5    E5        565.148    8192

Incompatible MUX rates
Different signalling schemes
Different overhead
µ-law versus A-law
The tables above summarize the North American and the European PDH systems.
These signal levels are related according to the following formulas:
ANSI T1.107 Hierarchy:
DS1C = 2 x DS1
DS2 = 4 x DS1
DS3 = 7 x DS2
DS4/NA = 3 x DS3 (international connections only)
DS4 = 6 x DS3 (rare)
ITU-T Hierarchy:
E(n+1) = 4 x En
Later the ANSI and ITU-T hierarchies were harmonized. The ANSI
international DS4/NA (not listed above) is compatible with the 139264 kbit/s E4.
The basic message of the slide above is that there are several inconsistencies
between the two systems, including MUX rates, signaling schemes, overhead
differences, and companding methods.
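The channel counts in the tables follow directly from the multiplexing factors listed above. The short sketch below recomputes them; the names `ANSI_STEPS`, `ansi_channels`, and `cept_timeslots` are hypothetical, and the only inputs are the factors and the base values (24 channels per DS1, 32 timeslots per CEPT-1) taken from the text.

```python
# (signal, tributary, factor) as listed in the ANSI hierarchy above
ANSI_STEPS = [
    ("DS1C", "DS1", 2),
    ("DS2", "DS1", 4),
    ("DS3", "DS2", 7),
    ("DS4", "DS3", 6),
]


def ansi_channels() -> dict:
    """Derive the number of 64 kbit/s channels per ANSI signal level."""
    channels = {"DS1": 24}
    for signal, tributary, factor in ANSI_STEPS:
        channels[signal] = factor * channels[tributary]
    return channels


def cept_timeslots(level: int) -> int:
    """Timeslots of CEPT-n, using E(n+1) = 4 x En and 32 timeslots in CEPT-1."""
    return 32 * 4 ** (level - 1)


if __name__ == "__main__":
    print(ansi_channels())
    print([cept_timeslots(n) for n in range(1, 6)])
```

Note that these are channel counts only; the line rates do not scale by the same factors because each multiplexing level adds its own overhead.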

Frame Duration
Each sample (byte) must arrive within 125 µs
  To receive 8000 samples (bytes) per second
  Higher order frames must ensure the same byte rate per user(!)
[Figure: frames of different sizes, all with a duration of 125 µs]
DS0: 1 byte (64 kbit/s)
E1: 32 bytes (2.048 Mbit/s)
E2: 132 bytes (8.448 Mbit/s)

Remember that voice transmission was and is the yardstick for Telco backbone
technologies. Since all higher digital signal levels are basically multiplex methods
to transport many DS0 signals, it is clear that each multiplex frame (e.g. an E1
frame or E2 frame) must be transmitted within the same time period as the
DS0 signal. A DS0 signal has 64 kbit/s, which is created by sending one byte of a
voice sample 8000 times per second.
As can be seen in the picture above, each user (each DS0) is assigned to one
timeslot in the higher rate frames. Moreover, there is exactly one byte for each
user. Thus, in order to assure proper delivery of the DS0 signal within a higher
rate frame, any higher rate frame must be sent within 125 µs, which is 1/8000 s.
We call this a "periodic frame".
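As a quick check of the frame sizes on the slide, the number of bytes that fit into one 125 µs frame follows from the line rate alone. This is a minimal sketch; `bytes_per_frame` is a hypothetical helper name.

```python
FRAMES_PER_SECOND = 8000  # one frame every 125 microseconds


def bytes_per_frame(rate_bit_s: int) -> int:
    """Number of bytes transmitted during one 125 microsecond frame period."""
    bits_per_frame = rate_bit_s // FRAMES_PER_SECOND
    return bits_per_frame // 8


if __name__ == "__main__":
    for name, rate in (("DS0", 64_000), ("E1", 2_048_000), ("E2", 8_448_000)):
        print(name, bytes_per_frame(rate))
```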

Plesiochronous Multiplexing
Bit interleaving at higher MUX levels
  Simpler with slow circuits (Bit stuffing!)
  Complex frame structures and multiplexers (e.g. M12, M13, M14)
DS1/E1 signals can only be accessed by demultiplexing
Add-drop multiplexing not possible
  All channels must be demultiplexed and then recombined
  No ring structures, only point-to-point

Since frequency shifts are compensated by bit stuffing, it is not possible to
implement byte-interleaving multiplexers at higher rates. Therefore higher
multiplex levels are bit-interleaved! This results in complex frame structures. For
example, an M12 multiplexer converts four E1s into one E2, whereas an M14
multiplexer converts 64 E1 signals (4 x 4 x 4) into one E4.
Obviously, single DS1/E1 signals can only be accessed by demultiplexing the
whole higher rate frame! Moreover, it is technically very difficult to implement
add-drop multiplexers, because Digital Cross Connects (DXCs) need access to the
individual DS1/E1 signals. The only way is to remove the bit stuffing and
resynchronize.

Synchronization
[Figure: two channel banks (CB) connected through a DS0 switch. On each side, E1
signals pass through M14+LT units and are carried as E4 over an asynchronous
transport network. The channel banks are synchronous MUXes driven by the network
clock (Stratum 1), which provides end-to-end synchronization.]
CB ........... Channel Bank
M14+LT ... MUX and Line Termination
Clocks are not synchronized centrally because this was impractical at the time
this scheme was created; however, drift stays within specified limits.
Note that asynchronous TDM (!) is actually used at higher levels!
"Pulse stuffing" is used to compensate clock differences. With pulse stuffing,
frequency shifts can be compensated because the total number of bits per frame can be
increased or decreased to adjust the bit rate.
A so-called Stratum 1 clock is used to synchronize E1 frames. This is an atomic
clock with a guaranteed accuracy of 10^-11 (0.00001 ppm). Using independent
Stratum 1 clocks would cause only one frame loss (slip) every 72.3 days. Stratum 1
clocks are typically only available in Central Offices because they are very
expensive. In practice the timing signal is embedded inside dedicated E1 channels
to supply branch offices (timing distribution).
Higher rate signals are asynchronous with respect to the transported E1 signals.
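The figure of 72.3 days can be reproduced with a short calculation. The sketch assumes the worst case, in which two independent clocks each deviate by the full accuracy in opposite directions, and counts the time until the accumulated offset equals one 125 µs frame. The function name is hypothetical.

```python
def days_between_frame_slips(accuracy: float, frame_period_s: float = 125e-6) -> float:
    """Time until two independent clocks drift apart by one frame period."""
    worst_case_offset = 2 * accuracy  # assumption: clocks drift in opposite directions
    seconds = frame_period_s / worst_case_offset
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    print(f"{days_between_frame_slips(1e-11):.1f} days")
```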

E1 Basics
CEPT standardized E1 as part of European channelized framing structure for PCM
transmission (PDH)
  E1 (2 Mbit/s)
  E2 (8 Mbit/s)
  E3 (34 Mbit/s)
  E4 (139 Mbit/s)
Relevant standards
  G.703: Interfacing and encoding
  G.704: Framing
  G.732: Multiplex issues

G.703 specifies electrical and physical characteristics such as 75 ohm coax cables
(unbalanced) or 120 ohm twisted pair (balanced), and the HDB3 encoding.
G.704 specifies framing structures for different interface rates. For example, E1 is
used at an interface rate of 2.048 Mbit/s and uses 32 timeslots (8 bits each) per
frame. The frame repetition rate is always 8000 Hz, therefore 32 x 8 x 8000 =
2.048 Mbit/s. Reserved E1 timeslots are also defined: timeslot 0 is used for frame
synchronization and allows distinction of frames and timeslots; timeslot 16 can be
used for signaling.
G.732 specifies the PCM multiplex equipment operating at 2.048 Mbit/s. These
frames use the structure defined in G.704. Furthermore, A-law must be used when
converting analog to digital. G.732 also describes loss and recovery of frame
alignment, fault conditions and consequent actions, and acceptable jitter levels.

E1 Frame Structure
8000 frames per second, 2.048 Mbit/s
Each frame: timeslot 0, timeslot 1, timeslot 2, timeslot 3, ..., timeslot 31
8 bits per timeslot
Timeslot 0 alternates between
  Frame Alignment Signal (FAS):      C 0 0 1 1 0 1 1
  Not Frame Alignment Signal (NFAS): C 1 A N N N N N
Timeslot 0 is used for frame checking and multiframe synchronization,
end-to-end!
The C (CRC) bit is part of timeslot 0 and can form an optional 4-bit CRC sequence
using the C bits of 4 successive FAS frames (the FAS appears in every second
frame). The A (Alarm Indication) bit can transmit a so-called "Yellow" alarm
(remote error) to signal a loss of signal (LOS) or out of frame
(OOF) condition to the remote station.
N (National) bits are reserved for national, vendor-specific use.

E1 Signaling: Timeslot 16
To connect PBXs via E1
  Timeslot 16 can be used as standard out-band signaling method
Common Channel Signaling (CCS)
  Dedicated 64 kbit/s channel for signaling protocols such as DPNSS, CorNet,
  QSIG, or SS7
Channel Associated Signaling (CAS)
  4 bit signaling information per timeslot (= user) every 16th frame
  30 independent signaling channels (2 kbit/s per channel)

Timeslot 16 can be used for so-called Channel Associated Signaling (CAS),
a classical method to carry out-band signaling information for all 30 user channels.
This method is typically used to interconnect two PBXs of different vendors.
It is more efficient to run a dedicated higher-level signaling protocol over timeslot
16, such as SS7 or QSIG. This method is generally known as Common Channel
Signaling (CCS).
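The 2 kbit/s per CAS channel quoted on the slide follows from 4 signaling bits per user once every 16 frames. A minimal sketch of that arithmetic, with hypothetical names:

```python
FRAMES_PER_SECOND = 8000


def cas_rate_bit_s(bits_per_user: int = 4, frames_per_multiframe: int = 16) -> float:
    """Signaling rate per user when bits_per_user are sent once per multiframe."""
    return bits_per_user * FRAMES_PER_SECOND / frames_per_multiframe


def timeslot16_bits_per_multiframe(frames_per_multiframe: int = 16) -> int:
    """Total bits timeslot 16 offers during one multiframe (8 bits per frame)."""
    return 8 * frames_per_multiframe


if __name__ == "__main__":
    print(cas_rate_bit_s())                    # signaling rate per user in bit/s
    print(30 * 4, "of", timeslot16_bits_per_multiframe(), "bits used by 30 users")
```

The second function shows that the 30 user channels fit into the 64 kbit/s of timeslot 16 with a few bits to spare per multiframe.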

T1 Basics
T1 is the North American PDH variant
  DS0 is basic element
  24 timeslots per T1 frame = 1.544 Mbit/s
[Figure: 8000 frames per second; each frame consists of one extra framing bit (F)
followed by timeslot 1, timeslot 2, timeslot 3, ..., timeslot 24, with 8 bits per slot]

In North America the PDH technology also originated from digital voice
transmission. Here the so-called T1 is the equivalent of the European E1. The "T"
stands for "Trunk". But T1 and E1 are not compatible because the T1 consists of
24 timeslots only.
Encoding and physical characteristics are also different:
AMI or B8ZS (Bipolar with 8-Zero Substitution)
100 ohm, twisted pair
The timeslots are numbered 1-24, and each timeslot carries 8 bits. Only one
extra bit is used for framing, so the total frame length is 193 bits. Since the frame
repetition rate must also be 8000 Hz, the resulting data rate is (24 x 8 + 1) x 8000 =
1.544 Mbit/s.
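The T1 and E1 line rates can be derived from the same small formula: bits per frame times 8000 frames per second. The sketch below uses a hypothetical helper `line_rate_bit_s`; only the timeslot counts and the single T1 framing bit come from the text.

```python
def line_rate_bit_s(timeslots: int, framing_bits: int = 0,
                    bits_per_slot: int = 8, frames_per_second: int = 8000) -> int:
    """Line rate of a periodic TDM frame."""
    bits_per_frame = timeslots * bits_per_slot + framing_bits
    return bits_per_frame * frames_per_second


if __name__ == "__main__":
    print("T1:", line_rate_bit_s(24, framing_bits=1))  # 193 bits per frame
    print("E1:", line_rate_bit_s(32))                  # 256 bits per frame
```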

T1 Basics
No reserved timeslot for signaling
Robbed Bit Signaling
Combinations of frames to superframes
  12 T1 frames (D4)
  24 T1 frames (Extended Super Frame, ESF)
Modern alternative: Common Channel Signaling

T1 framing is often used to connect PBXs (Private Branch Exchanges) via leased
lines; hence signaling information must be exchanged between the PBXs. But T1
defines no dedicated timeslot for CAS; instead "robbed bit signaling" is used.
With CAS the signaling information is transmitted by robbing certain
bits which are normally used for data. The signaling is placed in the LSB of
every timeslot in the 6th and 12th frame of every D4 superframe (A, B).
Using an Extended Super Frame (ESF) structure, the signaling information is
placed in the LSB of every timeslot in the 6th, 12th, 18th, and 24th frame of every
ESF superframe (A, B, C, D).
Robbed bit signaling does not noticeably affect PCM signals (analog sources) but
damages data channels completely!
Therefore only 56 kbit/s data channels (7 usable bits x 8000 per second) are
possible with CAS. Alternatively, CCS can be used in the same way as with E1. For
example, timeslot 24 can be used as a transparent signaling channel. In the USA,
ISDN is typically carried over CAS systems because there is still a lot of old
equipment in use across the country, so only 56 kbit/s per B channel is usable.
64 kbit/s B channels would require CCS, which is also called "Clear Channel
Capability (CCC)".
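Robbed bit signaling in a D4 superframe can be sketched as follows. The data layout (a superframe as a list of 12 frames, each a list of 24 byte values) and the function name `rob_bits` are assumptions made for illustration; the frame positions (6th and 12th) and the use of the least significant bit follow the description above.

```python
SIGNALING_FRAMES_D4 = (6, 12)  # 1-based frame numbers carrying the A and B bits


def rob_bits(superframe, a_bits, b_bits):
    """Return a copy of the superframe with the LSBs of frames 6 and 12 replaced.

    superframe: 12 frames, each a list of 24 byte values (one per timeslot).
    a_bits, b_bits: 24 signaling bits each, one per timeslot.
    """
    out = [list(frame) for frame in superframe]
    for frame_no, bits in zip(SIGNALING_FRAMES_D4, (a_bits, b_bits)):
        frame = out[frame_no - 1]
        for slot in range(24):
            # Clear the least significant bit, then insert the signaling bit.
            frame[slot] = (frame[slot] & 0xFE) | (bits[slot] & 1)
    return out


if __name__ == "__main__":
    voice = [[0xFF] * 24 for _ in range(12)]
    signaled = rob_bits(voice, a_bits=[0] * 24, b_bits=[1] * 24)
    print(hex(signaled[5][0]), hex(signaled[11][0]))
```

Because a data application cannot know which frames were modified, it can rely only on the upper 7 bits of every byte, which yields the 56 kbit/s mentioned above.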

PDH Limitations
PDH overhead increases dramatically with high bitrates
[Figure: bar chart of overhead per signal level, vertical axis from 1% to 11%]
Signal    Overhead (%)
DS1        0.52
DS2        2.70
DS3        3.90
DS4        6.60
CEPT-1     6.25
CEPT-2     9.09
CEPT-3    10.60
CEPT-4    11.76
The diagram above shows one of the main disadvantages of PDH technologies: the
overhead increases significantly with the data rate, i.e. with the multiplex level.
Thus it is not reasonable to create much higher signal levels with this technology.
Note that the North American bit robbing method also has one advantage: the total
overhead is much lower compared to the European PDH variant.
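The overhead percentages can be approximated from the line rates and the number of 64 kbit/s user channels. In the sketch below the European user channel counts (30, 120, 480, 1920) are an assumption derived from "E1 = 30 x 64 kbit/s" and the factor of four per level; robbed signaling bits are not counted as overhead. The results may therefore deviate slightly from the chart for some levels.

```python
# name: (line rate in kbit/s, assumed number of 64 kbit/s user channels)
SIGNALS = {
    "DS1": (1544, 24),
    "DS2": (6312, 96),
    "DS3": (44736, 672),
    "DS4": (274176, 4032),
    "CEPT-1": (2048, 30),
    "CEPT-2": (8448, 120),
    "CEPT-3": (34368, 480),
    "CEPT-4": (139264, 1920),
}


def overhead_percent(rate_kbit_s: int, user_channels: int) -> float:
    """Share of the line rate that does not carry user channels."""
    payload_kbit_s = user_channels * 64
    return 100 * (rate_kbit_s - payload_kbit_s) / rate_kbit_s


if __name__ == "__main__":
    for name, (rate, channels) in SIGNALS.items():
        print(f"{name:7s} {overhead_percent(rate, channels):6.2f} %")
```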

Why SONET/SDH?
Many incompatible PDH implementations
PDH does not scale to very high bitrates
  Increasing overhead
  Complex multiplexing procedures
Demand for a true synchronous network
  No pulse stuffing between higher MUX levels
  Better compensate phase shifts by floating payload and pointer technique
Demand for add-drop MUXes and ring topologies

In the early 1980s there was a big demand for another backbone technology
because of the severe drawbacks of the old PDH technology.
Over the decades, many different PDH implementations were built by different
vendors. Furthermore, PDH does not scale to high data rates because of the
overhead problem and because of the complex multiplexing method.
One thing was clear: a successor of PDH, which was supposed to scale up to
arbitrarily high data rates, must be truly synchronous. Flexible topology
configurations should also be possible.

History Take 1: USA
Many companies after divestiture of AT&T
  Many proprietary solutions for PDH successor technology
In 1984 ECSA (Exchange Carriers Standards Association) started on SONET
  Goal: one common standard
  A standard that almost wasn't: over 400 proposals!
SONET became an ANSI standard
  Designed to carry US PDH payloads

In 1984 the Exchange Carriers Standards Association (ECSA) started the
development of "Synchronous Optical Networks", SONET for short. The goal was to
define one common standard for all companies that were born after the divestiture
of AT&T. Over 400 proposals were submitted; but finally, after a long negotiation
period, the SONET standard was born and became an ANSI standard.
The first US nationwide SONET ring backbones were finished in 1997.

History Take 2: World
In 1986 CCITT became interested in SONET
  Created SDH as a superset
  Designed to carry European PDH payloads including E4 (140 Mbit/s)
Originally designed for fiber optics

In 1986 the CCITT (now ITU-T) became interested in SONET and defined the
"Synchronous Digital Hierarchy" (SDH) as a superset of SONET. Now SDH is
the world standard and SONET is considered a subset of SDH.
SDH was first published in the CCITT "Blue Book" in 1989, specifying the
interfaces and methods in G.707, G.708, G.709, and many more.

Network Structure
SONET (SDH) Terms
[Figure: chain of network elements PTE - REG - ADM or DCS - REG - PTE.
Path: spans from PTE to PTE. The PTE performs path termination, that is, service
(DSn or En) mapping and demapping.
Line (Multiplex Section): spans between a PTE and the ADM or DCS, which performs
line termination (MUX section termination).
Section (Regenerator Section): spans between adjacent elements; the REG performs
section (regenerator section) termination.]
The picture above shows the structure of a SONET/SDH network.
Although SONET and SDH are compatible, note the slightly different terms
used in the two worlds.
The "Terminal Multiplexer" represents a so-called "Path Termination" and
marks the edge of the SONET/SDH network (Path) by providing connectivity to
the PDH network devices. A Path is an end-to-end connection between those
Terminal Multiplexers. The "Regenerator" extends the possible distance and
improves the quality of a "Line". The Line spans between a Path termination and a
network node, for example an ADM or DCS. The Regenerator splits a Line into multiple
Sections.
The Add/Drop Multiplexer (ADM) is the main element for configuring paths on
top of line topologies (point-to-point or ring). Using an ADM it is possible to add
or drop multiplexed channels.
The Digital Cross Connect (DCS or DXC) is named after the historical patch
panels used in the early analog backbones. This device is basically a "static
switch" and connects equal-level channels with each other.

Layers and Overhead
SONET (SDH) consists of 4 layers
  Physical Layer
  Section (Regenerator Section) Layer
  Line (Multiplex Section) Layer
  Path Layer
All layers (except the physical) insert information into the so-called overhead
of each frame
Note:
  SONET and SDH are technically consistent, only the terms might be different
  In this chapter, each SONET term is named first, followed by the associated
  SDH term written in brackets

SONET/SDH consists of four layers which are not related to the OSI layers:
Physical Layer:
Optical-electrical and electrical-optical conversions and recovery of the
transmit clock for proper sampling of the incoming signal. No frame overhead
is associated with the physical layer! Line coding depends on the type of
interfaces used. For electrical interfaces the coding is compatible with PDH.
For optical interfaces, very simple binary encoding (NRZ) is used.
Section:
Deals with the transport of an STS-N frame across the physical medium. Typical
tasks: framing and scrambling, section error monitoring, and introducing
section level communications overhead. The Section (Regenerator Section) is
terminated by Section Terminating Equipment, STE (Regenerator Section
Terminating Equipment, RSTE, in the SDH world).
Line:
Transport of path layer payloads across the physical medium. Provides the
synchronization and multiplexing functions for the path layer and its
associated overhead functions. Includes maintenance and protection. Overhead is
interpreted and modified by Line Terminating Equipment (SONET) or
Multiplex Section Terminating Equipment (SDH).
Path:
Transport of various payloads between SONET/SDH terminal multiplexing
equipment. Maps payloads into the format required by the Line Layer and
communicates end-to-end via the Path Overhead (POH).

SONET Signals
Electrical signal: STS-n
  Synchronous Transport Signal level n
Optical signal: OC-n
  Optical Carrier level n
  OC-nc means concatenated
    No multiplexed signal
    Administrative overhead optimized compared to multiplexed signal
Frame format is independent from electrical or optical signals

SONET defines different terms for the electrical signal and the optical signal.
A concatenated OC-nc signal originates at that speed (e.g. ATM) instead of being
multiplexed from lower rate signals. Typically only the term OC-n is used
(instead of the STS-n terms).

SDH Signals
Electrical signal: STM-n
  Synchronous Transport Module level n
  STM-nc means concatenated
    No multiplexed signal
    Administrative overhead optimized compared to real multiplexed signal
Optical signal: STM-nO
Frame format is independent from electrical or optical signals
Typically only the term STM-n is used

SDH defines only one term for the electrical and the optical signal. Actually the
suffix "O" has been defined to differentiate between optical and electrical signals,
but this suffix is seldom used.
A concatenated STM-nc signal originates at that speed (e.g. ATM).

Two-dimensional Frame Model
Similar to PDH every frame has 125 µs time length
  To support 8 kHz sampled voice applications
Bytes organized into rows and columns
  Administrative channels are rate decoupled for easier processing
Basic SONET frame is STS-1
  9 rows and 90 columns = 810 bytes total
  810 bytes x 8 bits x 8000/s = 51.84 Mbit/s
Basic SDH frame is STM-1
  9 rows and 270 (3 x 90) columns = 2430 bytes total
  2430 bytes x 8 bits x 8000/s = 155.52 Mbit/s

As in the PDH world, the frames are periodic, and their overhead is best
viewed in a two-dimensional frame model. Again, each frame must be sent every 125
µs (1 s / 8000).
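The line rates of the basic frames follow from the frame dimensions and the 8000 frames per second. A minimal sketch with a hypothetical helper name:

```python
def frame_rate_mbit_s(rows: int, columns: int, frames_per_second: int = 8000) -> float:
    """Line rate of a rows x columns byte frame sent frames_per_second times."""
    bytes_per_frame = rows * columns
    return bytes_per_frame * 8 * frames_per_second / 1e6


if __name__ == "__main__":
    print("STS-1:", frame_rate_mbit_s(9, 90), "Mbit/s")
    print("STM-1:", frame_rate_mbit_s(9, 270), "Mbit/s")
```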

STS-1 (STM-0) Frame Structure
[Figure: frame of 9 rows and 90 columns. The first 3 columns hold the Transport
Overhead, consisting of Section Overhead and Line Overhead. The remaining 87 columns
form the Payload Envelope Capacity (Virtual Container Capacity), which carries the
Synchronous Payload Envelope (SPE) including its Path Overhead column.]
The STS-1 (STM-0) frame consists of a Transport Overhead (Section Overhead)
and a Payload Envelope Capacity (Virtual Container Capacity). Note that higher
level signals have the same percentage of overhead: the number of columns is
simply multiplied by the rate factor.
The Transport Overhead (TOH) consists of a Section Overhead, SOH (Regenerator
Section Overhead, RSOH), and a Line Overhead (Multiplex Section Overhead,
MSOH).
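The statement that higher level signals keep the same overhead percentage can be checked with the column counts from the frame structure (3 overhead columns out of 90 per STS-1). The helper names are hypothetical; the sketch considers only the Transport Overhead, not the Path Overhead.

```python
def transport_overhead_percent(n: int) -> float:
    """Transport overhead share of an STS-n frame built from n STS-1 column sets."""
    total_columns = 90 * n
    overhead_columns = 3 * n
    return 100 * overhead_columns / total_columns


def envelope_capacity_mbit_s(n: int = 1) -> float:
    """Capacity of the 87 non-overhead columns per STS-1, scaled by n."""
    return 9 * 87 * n * 8 * 8000 / 1e6


if __name__ == "__main__":
    for level in (1, 3, 12, 48):
        print(level, f"{transport_overhead_percent(level):.2f} %",
              f"{envelope_capacity_mbit_s(level):.3f} Mbit/s")
```

In contrast to PDH, where the overhead share grows with every multiplexing level, the share computed here stays constant for every n.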

Floating Payload
[Figure: Synchronous Payload Envelope with its Path Overhead column, located inside
the frame by the pointer bytes]

The payload is carried inside the Synchronous Payload Envelope or SPE. The SPE
may float inside the Payload Envelope Capacity (Virtual Container Capacity) to
compensate phase and frequency shifts.
The Path Overhead (POH) is the first column of the SPE. Various additional
"envelopes" were defined to support every type of payload, e.g. DS1, DS3, E1, E3,
E4, ..., ATM, etc. For this reason the service signals are carried in so-called Virtual
Tributaries (Virtual Containers) which have a defined size to fit smoothly into an
SPE.
Uni- and Bi-directional Routing
Only working traffic is shown
Path or line switching for protection
[Figure: two rings with nodes A to F, each showing the connections A-C and C-A:
a uni-directional ring (1 fiber) and a bi-directional ring (2 fibers)]

Bidirectional rings provide much better performance than unidirectional rings.
Note that light signals are typically sent in only one direction through a fiber
for technical simplicity.

Add-drop Provisioning
Transport connections over a SONET infrastructure are created by add-drop
provisioning
  A path is built up hop-by-hop by specifying which channels should be added to a
  ring and which channels should be dropped from the ring
Add-drop provisioning is typically done by the network management system
  There is no signaling protocol !!!

The most important node for SONET/SDH is the ADM. An ADM allows flexible
configurations because it is able to add or drop lower rate signals to or from a
higher rate signal.
Note that SONET/SDH networks are still relatively static. These backbones are
used to establish paths over long distances that remain active for several
months or years. Typically the establishment requires weeks and is manually
controlled. There is no signaling protocol (although recently some vendor-specific
solutions have appeared).
Add and Drop Example
Example: OC-12 ring
  Consists of 4 x OC-3c channels
  Uni-directional routing
  2 channels occupied
[Figure: OC-12 ring with ADM 1 to ADM 4. Channels are added and dropped at the ADMs
(labels: Add 1-2, 3; Add 3-4; Add 4-2; Drop; Drop & Continue at ADM 2)]
The picture above illustrates the capabilities of ADMs.
Uni- and Bi-directional Routing
[Figure: two rings, each with ADM 1 to ADM 4, one using uni-directional routing and
one using bi-directional routing]
The picture above illustrates the capabilities of ADMs together with unidirectional
and bidirectional routing.

Operations
Protection
  Circuit recovery in milliseconds
Restoration
  Circuit recovery in seconds or minutes
Provisioning
  Allocation of capacity to preferred routes
Consolidation
  Moving traffic from unfilled bearers onto fewer bearers to reduce waste trunk
  capacity
Grooming
  Sorting of different traffic types from mixed payloads into separate destinations
  for each type of traffic

SONET/SDH topologies are designed to provide flexible and reliable
transport for the required paths. Capacity planning and bandwidth provisioning are
still research issues. Redundancy and automatic fail-over are provided within 20 ms.
Delay and jitter are controlled through control signals.
Typical topology concepts:
Point-to-point links (with protection) and DCS/MUX allow arbitrarily complex
topologies to be built.
Interconnected protected rings with ADM/DCS allow minimum resource usage
(physical media) while avoiding single points of failure.

SONET/SDH and the OSI Model
SONET/SDH covers
  Physical, Data Link, and Network layers
However, in data networking it is used mostly as a transparent bit stream pipe
Therefore SONET/SDH is regarded as a Physical layer, although it is more
Functions might be repeated many times in the overall protocol stack
  Worst case: IP over LANE over ATM over SONET

Note that SONET/SDH layers cannot be easily compared with OSI layers.
Actually SONET/SDH links are often used as the "physical layer" for several
OSI-compliant protocols or even the Internet protocol.
Unfortunately, optical switching is a very immature technology, and therefore a
number of adaptation layers are needed to transport IP over SONET/SDH. Typical
configurations consist of IP over LANE (LAN Emulation) over ATM over
SONET (over DWDM). Current research efforts focus on direct "IP over optical"
techniques.

Summary
Telecommunication backbones must be very reliable and backward compatible
PDH is still an important backbone technology
Recently moving to optical backbones using SONET/SDH
Traffic volume of voice services will decrease relative to general IP traffic

1
2005/03/11 {C} Herbert Haas
PPP
The point-to-point protocol

PPP versus SLIP
PPP
  Where is PPP used
  What is the task of LCP
  What is the task of NCP
SLIP
  Serial Line IP
  Predecessor of PPP
  We don't even think of it today



Introduction (1)
Goal of PPP
  Convey datagrams over a serial link
  Both synchronous and asynchronous serial links are supported
  Both bit- and byte-oriented transmissions are supported
Basically, PPP consists of
  One Link Control Protocol (LCP)
  Several Network Control Protocols (NCPs)

The Point-to-Point Protocol (PPP) provides a standard method for transporting
multi-protocol datagrams over point-to-point links. PPP comprises three
main components:
1. A method for encapsulating multi-protocol datagrams.
2. A Link Control Protocol (LCP) for establishing, configuring, and testing the
data-link connection.
3. A family of Network Control Protocols (NCPs) for establishing and
configuring different network-layer protocols.

Introduction (2)
HDLC is basis for encapsulation
  Only framing and error detection necessary
  Only simple unnumbered information frames (UI)
PPP supports full-duplex links only (!)
PPP Frame = Datagram + 2-8 bytes extra header
  Extra header consists of HDLC header and PPP header
Byte Stuffing: Data dependent overhead!

Overhead
Only 8 additional octets are necessary to form the encapsulation when used with
the default HDLC framing. In environments where bandwidth is at a premium,
the encapsulation and framing may be shortened to 2 or 4 octets.
Byte Stuffing
If the flag byte (126) or the escape byte (125) occurs in the data field, it has to
be escaped using the escape byte 125: byte 126 is transmitted as the two-byte
sequence (125, 94), and the escape byte itself is transmitted as (125, 93).
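The byte stuffing rule above can be sketched in a few lines. The function names `stuff` and `unstuff` are hypothetical. The sketch covers only the flag and escape bytes described in the text; additional characters that a real link may negotiate to escape are not handled. XOR with 0x20 is used because it reproduces exactly the pairs (125, 94) and (125, 93).

```python
FLAG = 0x7E    # 126
ESCAPE = 0x7D  # 125


def stuff(data: bytes) -> bytes:
    """Escape flag and escape bytes so they never appear inside the data field."""
    out = bytearray()
    for byte in data:
        if byte in (FLAG, ESCAPE):
            out += bytes([ESCAPE, byte ^ 0x20])  # 126 -> 125, 94 and 125 -> 125, 93
        else:
            out.append(byte)
    return bytes(out)


def unstuff(data: bytes) -> bytes:
    """Reverse stuff(); assumes the input does not end with a lone escape byte."""
    out = bytearray()
    stream = iter(data)
    for byte in stream:
        if byte == ESCAPE:
            out.append(next(stream) ^ 0x20)
        else:
            out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    payload = bytes([1, 126, 125, 2])
    print(list(stuff(payload)))
```

The example also shows why the overhead is data dependent: every occurrence of 126 or 125 in the payload costs one extra byte on the wire.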

LCP
Link Control Protocol (LCP)
  Setup, configure, test and terminate PPP connection
  Supports various environments
LCP negotiates
  Encapsulation format options
  Maximum packet sizes
  Identification and authentication of peers (!)
  Determination of proper link functionality

In order to be sufficiently versatile to be portable to a wide variety of
environments, PPP provides a Link Control Protocol (LCP). The LCP is used to
automatically agree upon the encapsulation format options, handle varying limits
on sizes of packets, authenticate the identity of its peer on the link, determine
when a link is functioning properly and when it is defunct, detect a looped-back
link and other common misconfiguration errors, and terminate the link.

NCPs
Network Control Protocols (NCPs)
  Helper to establish various network protocols
  IP uses "IPCP"
Typical tasks
  Assignment and management of IP addresses
  Compression and authentication

Point-to-point links tend to exacerbate many problems with the current family of
network protocols. For instance, assignment and management of IP addresses,
which is a problem even in LAN environments, is especially difficult over
circuit-switched point-to-point links (such as dial-up modem servers). These problems
are handled by a family of Network Control Protocols (NCPs), each of which
manages the specific needs of its respective network-layer protocol.
NCPs have been developed for all important network layer protocols such as IP,
which uses the IP Control Protocol (IPCP).
There are also NCPs designed to enable compression and authentication.

Data Link Layer: HDLC
Address 11111111 means "all stations"
    PPP does not assign individual station addresses
Only the control field 00000011 is used
    Unnumbered Information (UI) command
Protocol field identifies datagram
    Already part of PPP, not HDLC (!)

Field       Content              Decimal
Flag        01111110             (126)
Address     11111111             (255)
Control     00000011             (003)
Protocol    16 bits
Data        up to 1500 bytes
FCS         16-bit CRC
Flag        01111110             (126)

Protocol: The True PPP Field
The most important field is the two-octet protocol field. Its value identifies
the datagram encapsulated in the Information field of the packet.
PPP Header Compression
If protocol field compression is enabled, the protocol field is reduced from 2 to 1
byte. Since the first two bytes are always constant, that is, the address byte
(always 255) and the control byte (always 003), PPP also supports
address-and-control-field compression, which omits these bytes.

Protocol Field
Range          Meaning
0xxx - 3xxx    L3 protocol type
4xxx - 7xxx    L3 protocol type without associated NCPs
8xxx - bxxx    Associated NCPs for protocols in range 0xxx - 3xxx
cxxx - fxxx    LCP, PAP, CHAP, ...
Important Examples
0021    IP
002b    Novell IPX
002d    Van Jacobson Compressed TCP/IP
002f    Van Jacobson Uncompressed TCP/IP
8021    IP-NCP (IPCP)
802b    IPX-NCP (IPXCP)
c021    Link Control Protocol (LCP)
c023    Password Auth. Protocol (PAP)
c025    Link Quality Report
c223    Challenge Handshake Auth. Protocol (CHAP)

Protocol Field Values
Protocol field values in the "0***" to "3***" range identify the network-layer
protocol of specific packets, and values in the "8***" to "b***" range identify
packets belonging to the associated Network Control Protocols (NCPs), if any.
Protocol field values in the "4***" to "7***" range are used for protocols with
low-volume traffic which have no associated NCP. Protocol field values in the
"c***" to "f***" range identify packets as link-layer Control Protocols (such as
LCP).
All these numbers are controlled by the IANA (see RFC-1060).
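The four ranges can be told apart by the most significant hex digit of the protocol field. The following is a minimal sketch; the function names and the returned strings are invented for this illustration. The helper ncp_for() relies on the pattern visible in the examples above (0021 and 8021, 002b and 802b), where the NCP number is the network protocol number plus 0x8000.

```python
def classify_ppp_protocol(value: int) -> str:
    """Map a 16-bit PPP protocol field value to its range."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("protocol field is 16 bits wide")
    top_digit = value >> 12
    if top_digit <= 0x3:
        return "network-layer protocol"
    if top_digit <= 0x7:
        return "low-volume protocol without NCP"
    if top_digit <= 0xB:
        return "network control protocol (NCP)"
    return "link-layer control protocol"


def ncp_for(network_protocol: int) -> int:
    """Return the NCP number associated with a 0xxx-3xxx protocol number."""
    if classify_ppp_protocol(network_protocol) != "network-layer protocol":
        raise ValueError("only values in 0xxx-3xxx have an associated NCP")
    return network_protocol | 0x8000
```

For example, classify_ppp_protocol(0xC223) places CHAP among the link-layer control protocols, and ncp_for(0x0021) yields 0x8021 (IPCP).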

CHAP - The Challenge Handshake Authentication Protocol
Supports 1-way and 2-way authentication
Periodically verifies the identity of the remote node using a three-way handshake
Relies on MD5 hash (regarded as weak today)
    Offline dictionary attacks possible!
Still widely used
Message exchange (2-way authentication between LEFT and RIGHT):
    1. Request to login, User="LEFT", Challenge_1
    2. User="RIGHT", MD5_hash(Challenge_1, KEY), Challenge_2
    3. MD5_hash(Challenge_2, KEY)
Microsoft's MSCHAPv2 is even worse.
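How a response is derived from a challenge can be sketched as follows. The slide abbreviates the computation as MD5_hash(Challenge, KEY); this sketch assumes the layout from RFC 1994, where the hash covers the one-byte message identifier, the shared secret, and the challenge, in that order. The function names are illustrative only.

```python
import hashlib
import hmac
import os


def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
    """MD5 over identifier || secret || challenge (RFC 1994 layout, assumed)."""
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()


def chap_verify(identifier: int, secret: bytes, challenge: bytes,
                response: bytes) -> bool:
    """Authenticator side: recompute the hash and compare in constant time."""
    expected = chap_response(identifier, secret, challenge)
    return hmac.compare_digest(expected, response)


# One direction of the handshake: the authenticator picks a random challenge,
# the peer answers with the hash, the authenticator checks it.
challenge_1 = os.urandom(16)
answer = chap_response(1, b"KEY", challenge_1)
accepted = chap_verify(1, b"KEY", challenge_1, answer)
```

Because the secret never crosses the link, an eavesdropper only sees the challenge and the hash. That is exactly the material needed for the offline dictionary attack mentioned on the slide.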


PPP today
Is still a usual choice when carrying IP packets over high-speed serial lines
Several flavors for different media
    PPPoE (over Ethernet)
    PPPoA (over ATM)
    PPTP (tunnels PPP through an IP network)
    POS - Packet over SONET/SDH
See RFC 1661, 1662

1
2005/03/11 {C} Herbert Haas
Ethernet
The LAN Killer

"Ethernet works in practice but not in theory."
Robert Metcalfe

Yeah, ... Robert Metcalfe was the inventor of Ethernet.

History (1)
Late 1960s: Aloha protocol, University of Hawaii
Late 1972: Robert Metcalfe developed first Ethernet system based on CSMA/CD
    Xerox Palo Alto Research Center (PARC)
    Exponential Backoff Algorithm was key to success (compared with Aloha)
    2.94 Mbit/s

Original Ethernet Frame (field widths in bits):
    leading field (unlabeled in the figure): 1
    Destination Address: 8
    Source Address: 8
    Data: about 4000
    CRC: 16

The Aloha protocol was fairly simple: send whenever you like, but wait for an
acknowledgement. If there is no acknowledgement, then a collision is assumed
and the station has to retransmit after a random time. "Pure Aloha" achieved a
maximum channel utilization of 18 percent. "Slotted Aloha" used a centralized
clock so that senders start transmitting only at the beginning of a time slot,
thereby increasing the maximum utilization to about 37 percent. Robert Metcalfe
perceived the problem: another backoff algorithm was needed, but also "listen
before talk". Metcalfe created Carrier Sense Multiple Access with Collision
Detection (CSMA/CD) and a truncated exponential backoff algorithm, which allows
a 100 percent load.
Robert Metcalfe's first Ethernet system used a transmission rate of 2.94 Mbit/s,
which was the system clock of the Xerox Alto workstations at that time.
Originally, in 1972, Metcalfe called his system Alto Aloha Network, but one
year later he renamed it "Ethernet" in order to emphasize that this
networking system could support any computer, not just Altos, and of course
to clarify the difference from traditional Aloha!
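The 18 and 37 percent figures follow from the classic textbook throughput model for Aloha, which is not derived in these notes. The sketch below assumes that model (Poisson traffic with an offered load of G frames per frame time); the function names are chosen for this example.

```python
import math


def pure_aloha_throughput(offered_load: float) -> float:
    """S = G * exp(-2G): a frame is vulnerable for two frame times."""
    return offered_load * math.exp(-2.0 * offered_load)


def slotted_aloha_throughput(offered_load: float) -> float:
    """S = G * exp(-G): slots halve the vulnerable period."""
    return offered_load * math.exp(-offered_load)


# The maxima are reached at G = 0.5 (pure) and G = 1.0 (slotted).
best_pure = pure_aloha_throughput(0.5)        # 1 / (2e), about 0.184
best_slotted = slotted_aloha_throughput(1.0)  # 1 / e, about 0.368
```

Under this model neither variant can do better than these maxima, which is why a different access scheme was needed for higher loads.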

History (2)
1976: Robert Metcalfe released the famous paper:
"Ethernet: Distributed Packet Switching for Local Computer Networks"
[Figure: original sketch]

The press has often stated that Ethernet was invented on May 22, 1973, when
Robert Metcalfe wrote a memo to his bosses describing Ethernet's potential,
but Metcalfe claims Ethernet was actually invented very
gradually over a period of several years. In 1976, Robert Metcalfe and David
Boggs (Metcalfe's assistant) published a paper titled "Ethernet: Distributed
Packet-Switching For Local Computer Networks."
Metcalfe left Xerox in 1979 to promote the use of personal computers and local
area networks (LANs). He successfully convinced Digital Equipment, Intel,
and Xerox Corporations to work together to promote Ethernet as a standard.
Now an international computer industry standard, Ethernet is the most widely
installed LAN protocol.

History (3)
1978: Patent for Ethernet repeater
1980: DEC, Intel, Xerox (DIX) published the 10 Mbit/s Ethernet standard
    "Ethernet II" was latest release (DIX V2.0)
Feb 1980: IEEE founded workgroup 802
1985: The LAN standard IEEE 802.3 was released

The first Ethernet standard was entitled "The Ethernet, A Local Area Network:
Data Link Layer and Physical Layer Specifications" and focused on thick
coaxial cable only.

The IEEE Working Groups
802.1 Higher Layer LAN Protocols
802.2 Logical Link Control
802.3 Ethernet
802.4 Token Bus
802.5 Token Ring
802.6 Metropolitan Area Network
802.7 Broadband TAG
802.8 Fiber Optic TAG
802.9 Isochronous LAN
802.10 Security
802.11 Wireless LAN
802.12 Demand Priority
802.13 Not Used (Superstition?)
802.14 Cable Modem
802.15 Wireless Personal Area Network
802.16 Broadband Wireless Access
802.17 Resilient Packet Ring

On this slide you can see a summary of the most important IEEE standards so
far. The Ethernet system is covered by the standards 802.1, 802.2 and 802.3.
The 802.1 standard describes management and optional functions inside the
Ethernet technology, like the Spanning Tree (SPT) process, Ethernet bridging,
VLAN systems, etc.
The 802.2 standard describes the Logical Link Control (LLC) function, which,
within Ethernet, is used only with 802.3 frames, and which allows the use of
Ethernet in connection-oriented or connection-less mode.
The 802.3 standard describes the physical layer of the Ethernet system plus the
media access that is controlled by the CSMA/CD procedure.

IEEE 802 Layer Model
802.1 Management, Bridging (802.1D), QoS, VLAN, ...
Link Layer
    802.2 - Logical Link Control (LLC)
    Media Access Control (MAC), one variant per technology:
        802.3 CSMA/CD, 802.4 Token Bus, 802.5 Token Ring, 802.6 DQDB,
        802.12 Demand Priority, 802.11 Wireless
Physical Layer
    One PHY below each MAC variant
    Ethernet PHY sublayer stacks shown in the figure (top to bottom):
        PLS, AUI, PMA (MAU), MDI, Medium
        Reconciliation, MII, PLS, AUI, PMA, MDI, Medium
        Reconciliation, MII, PCS, PMA, PMD, MDI, Medium
        Reconciliation, GMII, PCS, PMA, PMD, MDI, Medium

The physical layer is responsible for the speed of the transmission. Currently
there are four different speeds available: 10, 100, 1000, and 10000 Mbit/s. In
the graphic, the physical interface structure of the 10, 100, and 1000 Mbit/s
systems is shown.
The interface function between the physical layer and the Ethernet data-link
layer is performed by the CSMA/CD algorithm.
The Medium Access Control layer is responsible for addressing, and it controls
whether or not a data frame is picked up from the wire and loaded into the
buffer of the Ethernet card.
The Logical Link Control layer is responsible for the interface function
between the Ethernet layer and higher layers on top of Ethernet, plus the
support of connection-less or connection-oriented mode.
The 802.1 Management cannot be seen as a separate Ethernet layer, but it
describes additional optional Ethernet functions like bridging, QoS, flow
control, SPT, etc.

IEEE 802.3/Ethernet
Since 1984 the IEEE also maintains the DIX Ethernet standard
Both frame types are supported by "Ethernet NICs"
    Network Interface Cards

Remember that in the early days of Ethernet there were two competing
organizations: the IEEE committee responsible for the 802.X standards, and the
companies Digital, Intel and Xerox, which were responsible for the Ethernet 2
DIX standard.
In 1984 the DIX committee disappeared and the IEEE took over the
responsibility to maintain and adapt the DIX standard for new upcoming
Ethernet technologies.
Today all Ethernet interface cards support both frame types, the 802.3 and the
ETH 2 frame.

CSMA/CD
Carrier Sense Multiple Access / Collision Detection
    Improvement of ALOHA
    "Listen before talk" plus "Listen while talk"
Fast and low-overhead way to resolve any simultaneous transmissions
1) Listen if a station is currently sending
2) If wire is empty, send frame
3) Listen during sending if collision occurs
4) Upon collision stop sending
5) Wait a random time before retry

Ethernet is a shared media technology, so a procedure had to be found to
control the access onto the physical media. This procedure was called
Carrier Sense Multiple Access with Collision Detection (CSMA/CD).
The way it works is quite simple: every station that wants to send needs to do a
Carrier Sense to check whether the media is already occupied.
If the media is available, the station is allowed to perform a Media Access and
may start sending data.
If two stations access the media at almost the same time, a
collision will happen. Recognizing and resolving a collision is the task of the
Collision Detect circuit.
Every station listens to its own data while sending. In the case of a collision, the
currently sending stations recognize the collision by the superimposition of the
electrical waves on the wire. A jamming signal will be sent out to make sure all
involved stations recognize the occurrence of a collision.
All stations involved in the collision stop sending and start a random timer.
When the random timer expires, the station may try to access the media
again.

Slot Time
Minimum frame length has to be defined in order to safely detect collisions
Each frame sent must stay on wire for a RTT duration - at least
This duration is called "slot time" and has been standardized to be 512 bit-times
    51.2 µs for 10 Mbit/s

There is a very basic Ethernet rule that says a collision must be detected while
a station is transmitting data. Therefore a station needs to keep on sending for
at least the duration of the RTT of the Ethernet system. The maximum allowed
RTT is standardized and is called the slot time. The slot time for 10 Mbit/s
Ethernet systems is set to 51.2 µs.
If collisions occur after expiration of the slot time, we talk about "late
collisions", which may cause malfunctions in the network.
For example, if a station transmits a frame and no collision was detected, the
station assumes correct delivery of the frame. Now the station removes the
frame from the transmit buffer, leaving no chance to retransmit the frame in the
case of a late collision.
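The slot time in seconds follows directly from the 512 bit-times and the data rate. A small sketch of that arithmetic, with names chosen for this example:

```python
SLOT_BITS = 512  # standardized slot time, in bit-times


def slot_time_us(rate_mbit_s: float) -> float:
    """Slot time in microseconds: bits divided by Mbit/s gives µs."""
    return SLOT_BITS / rate_mbit_s


def min_frame_bytes() -> int:
    """A frame must last at least one slot time, so 512 bits = 64 bytes."""
    return SLOT_BITS // 8
```

At 10 Mbit/s this gives 51.2 µs. The same 512 bit-times last only a tenth of that at 100 Mbit/s, which is why faster shared-media systems tolerate a much shorter round trip.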

Slot Time Consequences
So minimum frame length is 512 bits (64 bytes)
With signal speed of 0.6c the RTT of 512 bit times allows a network diameter of
    2500 meters with 10 Mbit/s
    250 meters with 100 Mbit/s
    25 meters with 1000 Mbit/s (!)
NOTE: Only valid on shared media (!)

The minimum frame length in Ethernet systems is set to 64 bytes or 512 bits.
This minimum frame length, plus the slot time in combination with the speed of
electrical signals on a wire (~ 180,000 km/s), determines the maximum
extent of an Ethernet system.
Therefore we end up with a maximum extent of 2500 meters for 10 Mbit/s
Ethernet systems. The maximum extent of faster Ethernet systems is
directly related to their shorter slot times, because of the higher speed.
These distance limitations must only be taken into account in shared media
environments like Ethernet bus and hub systems. In more modern switched
environments using full-duplex communication, these distance limitations can
be neglected.
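The scaling of the diameter with the data rate can be checked with a short calculation. This is a sketch under stated assumptions: propagation_bound_m() considers cable propagation only, so it gives an upper bound (about 4.6 km at 10 Mbit/s); the 2500 m on the slide is lower, presumably because a real delay budget also has to cover repeaters and transceivers. scaled_diameter_m() simply scales the slide's 2500 m with the rate. Both function names are made up for this example.

```python
SLOT_BITS = 512
SIGNAL_SPEED_M_S = 1.8e8  # about 0.6 c, as stated on the slide


def propagation_bound_m(rate_mbit_s: float) -> float:
    """Largest one-way distance whose round trip fits into one slot time."""
    slot_s = SLOT_BITS / (rate_mbit_s * 1e6)
    return slot_s * SIGNAL_SPEED_M_S / 2


def scaled_diameter_m(rate_mbit_s: float) -> float:
    """Scale the 2500 m figure for 10 Mbit/s inversely with the data rate."""
    return 2500 * 10 / rate_mbit_s
```

scaled_diameter_m() reproduces the 2500, 250 and 25 meter values from the slide for 10, 100 and 1000 Mbit/s.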

Exponential Backoff (1)
Most important idea of Ethernet!
Provides maximal utilization of bandwidth
    After collision, set basic delay = slot time (512 bit times)
    Total delay = basic delay * rand
    0 <= rand < 2^k
    k = min (number of transm. attempts, 10)
Allows channel utilization

The retransmission in case of collisions is controlled by the exponential
backoff algorithm.
The retransmission is delayed by a basic delay, which is the slot time (51.2
microseconds for 10 Mbit/s Ethernet), times a random integer factor. The range
from which the random factor is selected grows with the number of
retransmission attempts.
Repeated collisions indicate a busy medium; therefore the station tries to adjust
to the medium load by progressively increasing the time delay between
repeated retransmission attempts.

Exponential Backoff (2)
After 16 successive collisions
    Frame is discarded
    Error message to higher layer
    Next frame is processed, if any
Truncated Backoff (k<=10)
    1024 potential "slots" for a station
    Thus maximum 1024 stations allowed on half-duplex Ethernet

The retransmission of a frame is attempted up to a defined maximum number
of retries, typically known as the attempt limit. The attempt limit is set to a
maximum of 16 retries by the standard.
After 16 retries the frame is discarded and an error message is sent to higher
layers. Then the station continues to process the next frame.
Due to the truncated backoff algorithm, a maximum of 1024 potential time slots
are available for a station. So the maximum number of stations attached to
half-duplex Ethernet systems should not exceed 1024.
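The backoff rules from the last two slides can be put together in a short sketch. It assumes the delay is a whole number of slot times, with the slot time of 10 Mbit/s Ethernet. The names backoff_delay_us, transmit and try_send are hypothetical, and the actual waiting is left out so that the example stays testable.

```python
import random

SLOT_TIME_US = 51.2  # slot time of 10 Mbit/s Ethernet
BACKOFF_LIMIT = 10   # k is truncated at 10
ATTEMPT_LIMIT = 16   # frame is discarded after 16 collisions


def backoff_delay_us(attempt: int, rng=random) -> float:
    """Delay after the given collision count: rand * slot time, 0 <= rand < 2^k."""
    if not 1 <= attempt <= ATTEMPT_LIMIT:
        raise ValueError("attempt must be in 1..16")
    k = min(attempt, BACKOFF_LIMIT)
    return rng.randrange(2 ** k) * SLOT_TIME_US


def transmit(try_send, rng=random) -> int:
    """Call try_send() until it reports success; return the attempt number.

    try_send is a caller-supplied function returning True on success and
    False on a collision.
    """
    for attempt in range(1, ATTEMPT_LIMIT + 1):
        if try_send():
            return attempt
        if attempt < ATTEMPT_LIMIT:
            backoff_delay_us(attempt, rng)  # a real station would wait this long
    raise RuntimeError("excessive collisions: frame discarded")
```

After the tenth collision the random factor is drawn from 0..1023, which is where the 1024 potential slots come from.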

Channel Capture
Short-term unfairness on very high network loads
Stations with lower collision counter tend to continue winning
10 times harder to occur on 100 Mbit/s Ethernet
Rare phenomenon, so no solution against it
But would I choose Ethernet for mission-critical realtime applications...?

In the case of very high network loads, Ethernet tends to prefer stations with
lower collision counters, because they try to access the media in shorter time
intervals than stations with a higher collision counter.
This is a phenomenon that was never solved in Ethernet systems, but it can be
disregarded for today's Ethernet networks, because most of them are switched
networks where collisions play no role or only a minor one.

Collision Detection
10Base2, 10Base5
    Manchester with -40 mA DC level
    "high" = 0 mA, "low" = -80 mA
10BaseT
    Manchester with no DC offset
    Collisions are detected by the hub, which sends a "Jam" signal back
    Similarly at 100BaseT and 1000BaseT

The method of collision detection is different for every physical layer.
In coaxial Ethernet, transceivers send their Manchester code using the DC
offset method. A "high" value is nominally zero current; a "low" value is
nominally -80 mA. This results in a DC component to the signal of -40 mA,
which creates a voltage of -1 VDC (the transceiver sees a 25 ohm load from the
two 50 ohm cables going "left and right" away from the transceiver). When two
transceivers send at the same time, their currents add, increasing the DC
component of the combined signal to -2 VDC. Thus, we can detect collisions by
looking for DC signals in excess of the maximum that could possibly be
generated by a single transmitter.
In 10BASE-T, the Manchester code is sent symmetrically, with no DC offset.
Collisions are detected in the repeater hub, which can observe when two or
more devices are transmitting at the same time. Normally, the hub does not
repeat a station's own signal back to the station on its receive cable pair.
However, when a collision is noted, the hub does send a signal (the so-called
"collision enforcement", or "jam") to the transmitting stations. The stations
detect collisions by noting when they see a signal on their receive pair at the
same time that they are transmitting on their transmit pair.
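The coaxial case is plain Ohm's law and can be checked numerically. In this sketch the current and load values come from the text above, while the detection threshold of -1.5 V is only an illustrative assumption placed between the one-sender and two-sender levels, not a value taken from a standard.

```python
HIGH_MA = 0.0     # nominal current for a Manchester "high"
LOW_MA = -80.0    # nominal current for a Manchester "low"
LOAD_OHMS = 25.0  # two 50 ohm cable ends in parallel


def dc_voltage(transmitters: int) -> float:
    """DC level on the coax when the given number of stations send at once."""
    average_ma = (HIGH_MA + LOW_MA) / 2  # -40 mA per transmitter
    return transmitters * (average_ma / 1000.0) * LOAD_OHMS


def collision_detected(v_dc: float, threshold_v: float = -1.5) -> bool:
    """Flag a collision when the DC level exceeds what one sender can cause."""
    return v_dc < threshold_v
```

One transmitter yields -1 V and two yield -2 V, so any threshold between those two levels separates normal operation from a collision.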

6 Byte MAC Addresses
Individual/Group (I/G)
    I/G=0 is a unicast address
    I/G=1 is a group (multicast or broadcast) address
Universal/Local (U/L)
    U/L=0 is a global, IEEE administered address
    U/L=1 is a locally administered address
[Figure: bit layout of the 48-bit MAC address with the I/G and U/L bit positions marked]

The addresses used in Ethernet systems are called MAC addresses. A MAC
address is 6 bytes or 48 bits long and is typically written in hexadecimal
notation. Each Ethernet network card has one burnt-in MAC address. Network
cards of some vendors even support the use of programmable, locally
administered MAC addresses.
Ethernet uses a canonical address format, which defines the order in which bits
from the transmission buffer are put onto the medium. In Ethernet systems the
least significant bit of each byte is put on the medium first, followed by the
more significant bits.
The first two bits of a MAC address put on the medium have a special meaning.
The first bit (I/G) specifies whether the MAC address is a unicast address (0)
or a broadcast/multicast address (1). The second bit (U/L) specifies whether
it's a global and unique MAC address, or a locally programmed and administered
address.

MAC Address Structure
Each vendor of networking components can apply for a unique vendor code
Administered by IEEE

byte 0 - byte 2: Organizationally Unique Identifier (OUI)
byte 3 - byte 5: serial number

The MAC addresses are globally administered by the Institute of Electrical
and Electronics Engineers (IEEE) standardization organization.
Each vendor of networking components can apply for a globally unique
vendor code. The vendor code costs $1000 and occupies the first three bytes of
the MAC address.
The remaining three bytes of the MAC address may be used by the vendor to
address its components.
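Because the least significant bit of each byte goes onto the medium first, the I/G bit is the lowest bit of byte 0 and the U/L bit is the next one. A minimal sketch of splitting an address into these parts; the function name and the dictionary keys are invented for this example.

```python
def parse_mac(text: str) -> dict:
    """Split a MAC address such as '00:00:0c:12:34:56' into its parts."""
    octets = bytes(int(part, 16) for part in text.replace("-", ":").split(":"))
    if len(octets) != 6:
        raise ValueError("a MAC address has exactly 6 bytes")
    first = octets[0]
    return {
        "group": bool(first & 0x01),  # I/G bit: first bit on the medium
        "local": bool(first & 0x02),  # U/L bit: second bit on the medium
        "oui": octets[:3].hex(),      # vendor code, bytes 0..2
        "serial": octets[3:].hex(),   # vendor-assigned part, bytes 3..5
    }
```

For the broadcast address ff:ff:ff:ff:ff:ff both flag bits are set, while a burnt-in unicast address has both cleared.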

Ethernet Frames
Due to different development branches, there are two different frame types
    IEEE type: consists of MAC and LLC
    DIX type: consists of a Type field
Why use both?
    Different applications have been defined for either IEEE or DIX

Due to the historical development of Ethernet, there are two different types of
Ethernet frames: the DIX type, commonly called the Ethernet 2 frame, and the
IEEE type, known as the 802.3 frame.
The IEEE frame type consists of a MAC part and an LLC (802.2) part, and it uses
the Destination/Source Service Access Points (DSAP, SSAP) to interface with
higher layers.
The DIX frame type consists of a MAC part and a Type field used to interface
with higher layers.
Which frame type is used depends on the higher-layer protocol; e.g. for the
transport of IP packets the DIX type is specified.

IEEE 802.2 (LLC)
Every IEEE LAN/MAN protocol carries the Logical Link Control header
    HDLC heritage
Basic frame format of every IEEE protocol:
    MAC Header | DSAP | SSAP | Ctrl | data | MAC Trailer
    (DSAP, SSAP and Ctrl form the layer 2 LLC header)
    DSAP: Which is my destination layer?
    SSAP: Which is my source layer?
    Ctrl: HDLC functionality

The LLC (802.2) is part of every basic frame format that is specified by the
IEEE, e.g. Token Ring, Token Bus, Ethernet, etc.
The DSAP and SSAP fields are both eight bits in length and are used to address
layer 3 processes. The SSAP specifies the layer 2-3 interface used at the
source, while the DSAP specifies the layer 2-3 interface at the destination.
Typically the SSAP value does not differ from the DSAP value, because only
layer 3 processes of the same kind are able to communicate
with each other. So IP-to-IP communication would use an SSAP and DSAP
value of 0x06.
The Control field inside the LLC can be used for connection-oriented or
connection-less communication, and it works basically the same way as in
HDLC.

LLC Details
According to the sophisticated HDLC functionality, 4 LLC classes are defined
    Class 1 is most important (UI, no ACKs)
Either 1 or 2 bytes for control field
    DSAP | SSAP | Ctrl (1 byte): simple UI frames
    DSAP | SSAP | Ctrl (2 bytes): Information and Supervisory frames, carrying sequence numbers (!)

The LLC functionality is divided into four classes:
Class 1 - connection-less unacknowledged service
Class 2 - connection-oriented service
Class 3 - Class 1 plus connection-less acknowledged service
Class 4 - Class 2 plus connection-less acknowledged service
Class 1 offers best-effort service only, while Class 2 works connection-oriented
with error recovery and flow control support.
The most important service class is the Class 1 connection-less service,
because the tasks of error recovery and flow control are typically performed by
higher-layer processes, e.g. TCP.
Only protocols like Microsoft's NetBEUI or IBM's SNA need Class 2
connection-oriented service, because error recovery and flow control are not
supported by their protocol stacks.

SAP Identifiers
128 possible values for protocol identifiers
Examples:
    0x42 Spanning Tree Protocol 802.1d
    0xAA SNAP
    0xE0 Novell
    0xF0 NetBIOS
DSAP: I/G bit (Individual or Group), U bit (User: IEEE or Vendor); 63 IEEE defined, 63 vendor defined
SSAP: C/R bit (Command or Response), U bit (User: IEEE or Vendor); 63 IEEE defined, 63 vendor defined
Ctrl field follows DSAP and SSAP

The DSAP and the SSAP are both 8 bits in length. The least significant bit in
the DSAP is reserved to indicate whether it's an individual or group access
point. In the SSAP this bit is the command/response bit and is not used in
Ethernet systems. The U bit is used to specify whether it's an IEEE or vendor
specific access point.
Hex E0 .......... Novell (U=0)
Hex Fy .......... reserved for IBM (U=0)
Hex F0 .......... NetBIOS (U=0)
Hex F4 .......... IBM LAN manager individual (U=0)
Hex F5 .......... IBM LAN manager group (U=0, I/G=1)
Hex F8 .......... remote program load (U=0)
Hex 04 .......... SNA path control individual (U=0)
Hex 05 .......... SNA path control group (U=0, I/G=1)
The range Hex 8y to 9C (with U=0) is reserved for free usage except y = xx1x (binary notation); U=1
Hex 00 ......... Null SAP
A station with running LLC software always responds to a frame destined to the Null SAP, so an LLC ping can be
implemented.
Hex 03 ......... LLC sub-layer management (U=1)
Hex 06 ......... DoD IP (U=1)
Hex 42 ......... 802.1d Spanning Tree Protocol (U=1)
Hex AA ......... TCP/IP SNAP (U=1)
Hex FE ......... ISO Network Layer (U=1)
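The flag bits can be read straight off the DSAP value. This sketch assumes, consistent with the table above, that the I/G bit is the least significant bit and the U bit is the next one, with U=1 marking an IEEE-defined access point. The function name and keys are chosen for this example.

```python
def decode_dsap(value: int) -> dict:
    """Extract the I/G and U flags from an 8-bit DSAP value."""
    if not 0 <= value <= 0xFF:
        raise ValueError("DSAP is 8 bits wide")
    return {
        "group": bool(value & 0x01),         # I/G bit
        "ieee_defined": bool(value & 0x02),  # U bit, 1 = IEEE defined
    }
```

Checking a few table entries: 0x06 (DoD IP) decodes as individual and IEEE defined, 0xF5 (IBM LAN manager group) as group and vendor defined.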

DIX Type field
2-byte Type field to identify payload (protocols carried)
    Most important: IP type 0x0800
No length field
"THE" Ethernet Frame:
    Preamble | DA | SA | Type (2 bytes) | Data | FCS

The Type field used by the DIX Eth2 frame format is 16 bits in length and
therefore allows up to 65,536 different layer 3 processes to be addressed. The
Type field only allows the addressing of the destination service access point.
The indication of the source service access point is not supported by the DIX
frame format. Typically only layer 3 processes of the same kind are able to
communicate with each other.
Some Type field examples:
Hex 0800 ........ IP
Hex 0806 ........ ARP
Hex 8035 ........ RARP
Hex 814C ........ SNMP
Hex 6001/2 ...... DEC MOP
Hex 6004 ........ DEC LAT
Hex 6007 ........ DEC LAVC
Hex 8038 ........ DEC Spanning Tree
Hex 8138 ........ Novell

SNAP
Demand for carrying type field in 802.4, 802.5, 802.6, ... also!
Subnetwork Access Protocol (SNAP) header introduced
    If DSAP=SSAP=0xAA and Ctrl=0x03, then a 5-byte SNAP header follows
    Containing 3 bytes organizational code plus 2-byte DIX type field

The IEEE had problems addressing all necessary layer 3 processes, due to the
short (8-bit) DSAP and SSAP fields in the IEEE header. So they introduced a
new frame format, which was called Subnetwork Access Protocol (SNAP). The
SNAP format simply imported the DIX Type field through the back door. This
new header format was then also used for technologies like Token Ring, Token
Bus, DQDB, etc.
In the SNAP format the DSAP and the SSAP are set to the hex value AA.
This indicates a five-byte extension to the standard 802.2 header, which is
made up of a three-byte field called Organizationally Unique Identifier (OUI)
and the two-byte Type field.


Frame Types Summary
802.3 with 802.2 (SAP):
    Preamble | DA | SA | Length | DSAP | SSAP | Ctrl | data | FCS
Ethernet Version 2 ("Ethernet II"):
    Preamble | DA | SA | Type | data | FCS
802.3 with 802.2 (SNAP):
    Preamble | DA | SA | Length | AA | AA | 03 | SNAP (org. code, type) | data | FCS
DSAP, SSAP and Ctrl (or AA AA 03) form the layer 2 (LLC) header
Length field: 46-1500; Type field: values of 1536 (0x0600) and above

So we end up with three different frame formats used in Ethernet systems: the
802.3 without SNAP, the DIX Eth2 format, and the 802.3 with SNAP.
The DIX Eth 2 frame format is mainly used for the data transport of protocols
that have the functionality of error recovery and flow control implemented in
their protocol stack, e.g. IP.
The 802.3 without SNAP frame format is used for protocols that need the
functions of error recovery and flow control on layer 2, e.g. NetBEUI, SNA.
The 802.3 with SNAP frame format is used by vendors to implement
proprietary protocols, for example Cisco's CDP, VTP, CGMP, etc.
For such purposes the OUI field is used to indicate the vendor, and the type
field value is chosen by the vendor.
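A receiver can tell the three formats apart from the two bytes after the source address and, for 802.3 frames, from the LLC header. The sketch below assumes the frame is given without preamble, starting at the destination address, and uses the conventional boundary of 0x0600 (1536) between length and type values. The function name and return strings are invented for this example.

```python
def classify_frame(frame: bytes) -> str:
    """Classify an Ethernet frame that starts at the destination address."""
    if len(frame) < 17:
        raise ValueError("need at least DA, SA, length/type and 3 LLC bytes")
    length_or_type = int.from_bytes(frame[12:14], "big")
    if length_or_type >= 0x0600:
        return "Ethernet II"          # value is a Type, e.g. 0x0800 for IP
    if length_or_type > 1500:
        return "invalid"              # neither a valid length nor a type
    dsap, ssap, ctrl = frame[14], frame[15], frame[16]
    if dsap == 0xAA and ssap == 0xAA and ctrl == 0x03:
        return "802.3 with SNAP"      # 5-byte SNAP header follows
    return "802.3 with LLC"
```

This is why both frame types can coexist on the same wire: no valid length can be mistaken for a type value.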

PHY Variants
10Base2 (10 Mbit/s, about 200 meters; 185 m per segment)
10Base5 (500 meters)
10BaseT (star-like cabling, hub needed)
10BaseF (fiber)
10Broad36 (broadband cable)
100BaseT
1000BaseT
1000BaseX

This slide gives an overview of the currently available physical layers of the
Ethernet system.
The 10Base5, 10Broad36 and 10Base2 Ethernet bus systems can be seen as
historic and might only be found in existing older installations.
10BaseT was the first Ethernet system that allowed star-shaped networks to be
built with the help of hubs and CAT 3 Unshielded Twisted Pair (UTP)
cables. A fiber interface, 10BaseF, also exists for the 10 Mbit/s Ethernet
system, but it is very rarely used because of the higher costs compared to
copper interfaces.
100BaseT uses a cabling infrastructure of CAT 5 UTP cables and a 4B/5B
coding scheme. This encoding scheme adds a fifth bit for every four bits of
user data, to allow enough changes in the signal for synchronization purposes.
10BaseT as well as 100BaseT use only pins 1, 2, 3 and 6 of the
eight-pin RJ45 connector.
1000BaseT is a copper interface that allows the transport of Gigabit Ethernet
on CAT 5e UTP cables by the use of all four pairs, a 5-level PAM code and echo
cancellation. 1000BaseT is backward compatible with 10BaseT and 100BaseT.
1000BaseX can be used in combination with fiber interfaces or shielded
balanced copper cables with 8B/10B coding.
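The cost of the block codes mentioned above is easy to quantify: every group of data bits is sent as a longer group of code bits, so the rate on the medium rises by the same ratio. A small sketch with an illustrative function name:

```python
def line_rate_mbaud(data_rate_mbit_s: float, data_bits: int, code_bits: int) -> float:
    """Symbol rate on the medium for a block code such as 4B/5B or 8B/10B."""
    return data_rate_mbit_s * code_bits / data_bits


fast_ethernet = line_rate_mbaud(100, 4, 5)   # 4B/5B
gigabit_x = line_rate_mbaud(1000, 8, 10)     # 8B/10B
```

Both codes add 25 percent on top of the user data rate, giving 125 and 1250 Mbaud respectively.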

Twisted Pair Cabling
Category X cables
    Cat 3 (Voice grade)
    Cat 4
    Cat 5
    Cat 5e (1000BaseT, unshielded)
    Cat 6
    Cat 7
Category depends on twisting cycles per length unit, insulation, and shielding

The cable types used in networking are divided into different categories, which
determine the capability of a cable, e.g. max. frequency, impedance,
attenuation, crosstalk, etc.
CAT 3, 4, 5, 5e, and 6 are specified by the TIA/EIA-568-B standard published by
the Electronic Industries Association and Telecommunications Industry Association
(EIA/TIA).
CAT 7 cables are currently not covered by the standard, but it is assumed that
they will provide a bandwidth capacity of up to 400 MHz.
CAT 3 ..... 16 MHz
CAT 4 ..... 20 MHz
CAT 5 ..... 100 MHz
CAT 5e .... 100 MHz
CAT 6 ..... 250 MHz
Category 5e (CAT5e), or Enhanced Category 5, was ratified in 1999. It's
an incremental improvement designed to enable cabling to support full-duplex
Fast Ethernet operation and Gigabit Ethernet.
Like CAT5, CAT5e is a 100-MHz standard, but has stricter specifications for
crosstalk, attenuation and return loss.


Typical NIC Design
[Figure: Computer I/O bus -> MAC -> either an on-board PHY (internal transceiver) with MDI and RJ45 connector, or an AUI/MII/GMII cable to an external PHY (e.g. 100BaseFX transceiver) with MDI and connector (e.g. fiber MIC connector)]
AUI     Attachment Unit Interface
MII     Media Independent Interface
GMII    Gigabit MII
MDI     Medium Dependent Interface
PHY     Physical Layer Device
MAC     Media Access Control Unit

This graphic shows the principal design of a network interface card.
The MAC layer is located directly on the Ethernet card and is
responsible for the interaction between the physical and the data-link layer.
There is also a physical interface located directly on the Ethernet card itself,
equipped with an RJ45 connector.
The AUI/MII/GMII connector represents a bus system for 10/100/1000
Ethernet systems, used for media conversion with the help of a transceiver.

Summary
Successful because simple
Two frames: DIX (Ethernet2) and IEEE (802.3)
Shared medium has consequences
    Collisions -> Slot time -> Network diameter
    Unpredictable, bad for realtime
Increased data rate until today
10 GE already available (!)

Quiz
What is a hub?
List typical properties:
    Half/full-duplex?
    Different data rates?
    Collision behavior?
What is the canonical addressing format?
What is a jam signal?
What is 802.3u and 802.3z?
What is a runt? What is the opposite?

1
2005/03/11 {C} Herbert Haas
Transparent Bridging
and VLAN
Plug and Play Networking

I think that I shall never see
a graph more lovely than a tree
a graph whose crucial property
is loop-free connectivity.
A tree which must be sure to span
so packets can reach every lan.
first the root must be selected
by ID it is elected.
least cost paths to root are traced,
and in the tree these paths are placed.
mesh is made by folks like me,
bridges find a spanning tree.
Algorhyme
Radia Perlman

Radia Perlman, PhD computer science 1988, MIT * MS math 1976, MIT * BA
math 1973, MIT
Radia Perlman specializes in network and security protocols. She is the
inventor of the spanning tree algorithm used by bridges, and of the mechanisms
that make modern link state protocols efficient and robust. She is the author of
two textbooks, and has a PhD from MIT in computer science.
Her thesis on routing in the presence of malicious failures remains the most
important work in routing security. She has made contributions in diverse areas.
In network security, these include credentials download, strong password
protocols, analysis and redesign of IPsec's IKE protocols, PKI models, efficient
certificate revocation, and distributed authorization. In routing, her
contributions include making link state protocols robust and scalable,
simplifying the IP multicast model, and routing with policies.

Bridge History
Bridges came after routers!
First bridge designed by Radia Perlman
    Ethernet has size limitations
    Routers were single protocol and expensive
Spanning Tree because Ethernet had no hop count
IEEE 802.1D

Bridging is a fundamental part of the IEEE LAN standard. Actually, bridges
were invented relatively late; routers were invented a bit earlier. Radia
Perlman, a pioneer in data communication, designed the first bridge. The main
reason was to extend the total network diameter of Ethernet and to provide a
transport technique which supports multiple layer 3 technologies. She also
invented the Spanning Tree Protocol (STP) because Ethernet had no hop count;
thus any store-and-forward technology would suffer from broadcast storms
when broadcast destination addresses are used. But this issue is discussed in
more detail later in this chapter.
The IEEE standard 802.1D specifies bridging and spanning tree (and more).

What is Bridging?
Layer 2 packet forwarding principle
Separate two (or more) shared-media LAN segments with a bridge
  Only frames destined to the other LAN segment are forwarded
  Number of collisions reduced (!)
Different bridging principles
  Ethernet: Transparent Bridging
  Token Ring: Source Route Bridging
Bridges forward layer 2 packets (frames) according to their destination address.
Frames whose destination is not reachable on another port of the bridge are
filtered. This filtering capability significantly enhances the total
performance of a LAN, as it is divided into multiple segments, that is, multiple
collision domains: the number of collisions is reduced!
IEEE defined bridges for all kinds of LAN technologies. For example, a Token
Ring network relies on so-called source route bridging, while Ethernet uses
"Transparent Bridging".
This chapter only discusses Transparent Bridging.

Bridging vs Routing
Bridging works on OSI layer 2
  Forwarding of frames
  Use MAC addresses only
  Termination of physical layer (!)
Routing works on OSI layer 3
  Forwarding of packets
  Use routable addresses only (e.g. IP)
  Termination of both layer 1 and 2
There are many differences between bridging and routing! The only thing in
common is the store-and-forward principle, based on some sort of
destination address.
But a bridge forwards layer 2 frames while a router forwards layer 3 packets.
Layer 2 frames use simple MAC addresses, having no logical structure, while
layer 3 packets use structured addresses, revealing topology information. Only
layer 3 addresses are routable. In order to understand the latter statement, it is
important to understand the principles of routing and how a routing table
works. We will discuss this soon.
Bridges terminate physical links. Thus, one port of a bridge might
support optical fiber transmission and another port of the same bridge might
support twisted pair copper cabling.
Routers, on the other hand, terminate layer 2 links. That is, one interface might
utilize Ethernet as link layer technology, another interface Frame Relay, and a
third interface might run ATM. A router only forwards the packet (the layer 3
information) carried inside a frame.

OSI Comparison
MAC addresses not routable
  NetBios over NetBEUI not routable (no L3)
Bridge supports different physical media on each port
  E.g. 10 Mbit/s to 100 Mbit/s
Router supports different layer-2 technologies
  E.g. Ethernet to Frame Relay
[Figure: seven-layer OSI stacks (Application, Presentation, Session, Transport,
Network, Data Link, Physical) of two end systems connected through a bridge,
and of two end systems connected through a router.]
It is very important to understand the differences between bridges and routers.
There are many implications related to the operating layer these devices
support. As a rule of thumb, any device is able to terminate all layers below the
highest layer it implements.

How does it work?
Transparent bridging is like "plug & play"
Upon startup a bridge knows nothing
Bridge is in learning mode
[Figure: a bridge with stations A and B attached to port 1 and stations C and D
attached to port 2.]
The main advantage of transparent bridging over source route bridging (Token
Ring) is the transparency or "plug & play" capability. No end station notices the
presence of bridges.
In order to be invisible, bridges must somehow learn where end stations are
located. Upon startup, a bridge knows nothing and the bridging table is empty.
At this time the bridge is in learning mode.

Learning
Once stations send frames the bridge notices the source MAC address
  Entered in bridging table
Frames for unknown destinations are flooded
  Forwarded on all ports
[Figure: station A on port 1 sends a frame with SA=A, DA=D. The bridge enters
"A: Port 1" in its bridging table, does not know where D is, and therefore
floods the frame to port 2.]
Assume we have a bridge with only two ports, each attached to one Ethernet
segment. Assume the left station "A" sends one frame to "D" on the right side.
Obviously the bridge learns the location of A but has no idea where D is. Thus
the MAC address of A is entered in the bridging table together with the port
number "1", on which A is reachable. Since the location of D is unknown, the
bridge floods this frame over all other ports, in our case only to port two (as
there are no other ports).
This way, connectivity is ensured even if there is no entry in the bridging table.

Learning Table Filling
If the destination address matches a bridging table entry, this frame can be
actively
  forwarded if reachable via other port
  filtered if reachable on same port
[Figure: bridging table with "A: Port 1" and "D: Port 2". Station D replies with
a frame SA=D, DA=A; the bridge knows that A is reachable via port 1 and
forwards the frame there.]
Now assume D replies to the message which has been received from A. The
bridge already knows the port number over which A can be reached and
forwards the frame accordingly. If A were located on the same port as D,
this frame would be filtered.

Learning Table Filling
After some time the location of every station is known - simply by listening!
Now only forwarding and filtering of frames
[Figure: bridging table with "A: Port 1", "D: Port 2", "B: Port 1", "C: Port 2".
Frames SA=B, DA=C and SA=C, DA=B are forwarded, because the bridge knows that B
is reachable via port 1 and C via port 2.]
After observing traffic for some time, the bridging table contains all host
locations (addresses and port numbers). At this time the bridge enters the
forwarding and filtering mode.

Forwarding and Filtering
Frames whose source and destination address are reachable over the same
bridge port are filtered
LAN separated into two collision domains
5 minutes aging timer (default)
[Figure: bridging table with "A: Port 1", "D: Port 2", "B: Port 1", "C: Port 2".
A frame SA=D, DA=C stays on port 2; it must be filtered (not forwarded).]
Since a frame is forwarded to another port only if its destination is really
located there, the LAN is separated into as many collision domains as there are
ports (attached to a LAN segment).
What if a host is removed from its location and attached at another place in the
LAN? Obviously frames could be forwarded to the wrong port. Therefore
each entry in the bridging table ages out after some time. The default aging
time is 300 seconds or 5 minutes.
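The learning, flooding, forwarding, filtering and aging behavior described on the last few slides can be sketched in a few lines of Python. This is only an illustrative model, not how a real bridge is built: the class name `LearningBridge`, its method `receive` and the injectable `clock` are made up for this example, and MAC addresses are shortened to single letters as in the figures.

```python
import time


class LearningBridge:
    """Toy transparent bridge: learn source addresses, then forward, filter or flood."""

    def __init__(self, ports, aging_time=300.0, clock=time.monotonic):
        self.ports = set(ports)
        self.aging_time = aging_time  # seconds; 300 s is the default named in the text
        self.clock = clock            # injectable, so aging can be shown without waiting
        self.table = {}               # MAC address -> (port, time last seen)

    def receive(self, in_port, src, dst):
        """Process one frame and return the set of ports it is sent out on."""
        now = self.clock()
        # Learning: remember (or refresh) the port on which the source was seen.
        self.table[src] = (in_port, now)
        # Aging: drop entries that have not been refreshed within the aging time.
        self.table = {mac: (port, seen)
                      for mac, (port, seen) in self.table.items()
                      if now - seen <= self.aging_time}
        entry = self.table.get(dst)
        if entry is None:
            return self.ports - {in_port}  # unknown destination: flood
        out_port = entry[0]
        if out_port == in_port:
            return set()                   # destination on the same port: filter
        return {out_port}                  # known destination: forward


now = [0.0]  # fake clock so that the example runs instantly
bridge = LearningBridge(ports=[1, 2], clock=lambda: now[0])
print(bridge.receive(1, "A", "D"))  # {2}   D unknown, frame is flooded
print(bridge.receive(2, "D", "A"))  # {1}   A was learned on port 1
print(bridge.receive(2, "C", "D"))  # set() C and D share port 2, frame is filtered
now[0] = 301.0                      # more than 5 minutes later
print(bridge.receive(1, "B", "D"))  # {2}   entry for D has aged out, flooded again
```

A broadcast address never appears as a source address, so it is never learned and such frames are always flooded in this model.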

Most Important!
Bridge separates LAN into multiple collision domains!
A bridged network is still one broadcast domain!
  Broadcast frames are always flooded
A router separates the whole LAN into multiple broadcast domains
It is very important to understand the basic message which is given here:
The use of bridges separates the LAN into multiple collision domains.
Still we have one single broadcast domain! That is, broadcast frames
are always flooded throughout the network.
Only the use of routers separates the LAN into multiple (layer 2) broadcast
domains, or the use of VLANs, which will be discussed soon in this chapter.

What is a Switch?
A switch is basically a bridge, the only differences are:
  Faster because implemented in HW
  Multiple ports
  Improved functionality
Don't confuse it with WAN Switching!
  Completely different!
  Connection oriented (stateful) VCs
LAN Switch
Now what is the difference between a bridge and a switch? Logically there is
no difference. Technically there are major differences, leading marketing folks
to define a new term: the switch. Switches typically employ more than two
ports, and the bridging functionality is implemented in hardware. Additionally,
other features are added, depending on the vendor. These will be discussed
next.
Note: Don't confuse LAN switching with WAN switching. Unfortunately,
modern bridging is called switching, but logically it is still bridging. The term
"bridging" was originally defined to differentiate this technique strictly from
WAN switching. The main characteristic of WAN switching is its connection
oriented behavior; WAN switches are never transparent! In order to connect
to a WAN switch, the end system must comply with some specific User to
Network Interface (UNI).

In Principle (Logically)
Bridge = Switch
Since we use only switches today, let's talk about them.

Modern Switching Features
Different data rates supported simultaneously
  10, 100, 1000, 10000 Mbit/s depending on switch
Full duplex operation
QoS
  Queuing mechanisms
  Flow control
Security features
  Restricted static mappings (DA associated with source port)
  Port secure (limited number of predefined users per port)
Different forwarding
  Store & Forward
  Cut-through
  Fragment-Free
VLAN support (Trunking)
Spanning Tree
Today most switches support different data rates at each interface or at selected
interfaces. Full duplex operation is also standard today. QoS might be
supported by using sophisticated queuing techniques, 802.1p priority tags, and
flow control features, such as the pause MAC control frame.
Security is provided by statically entered switching tables and port locking
(port secure), that is, only a limited number of predefined users are allowed at
some designated ports.
Forwarding of frames can be significantly enhanced using cut-through
switching: the processor immediately forwards the frame when the destination
is determined. The switching latency is constant and very short for all packet
lengths, but the CRC is not checked. In the Fragment-Free switching mode, the
switch waits for the collision window (64 bytes) to pass before forwarding. If a
packet has an error, or better explained, a collision, it almost always occurs
within the first 64 bytes. Fragment-Free mode provides better error checking
than the cut-through mode with practically no increase in latency. The store
and forward mode is the classical forwarding mode.
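To make the latency difference between the three forwarding modes concrete, the following sketch computes how long a switch has to wait before it can start forwarding a frame. It is a simplified calculation under stated assumptions: the preamble and internal processing time are ignored, cut-through is assumed to decide as soon as the 6-byte destination MAC address has arrived, and the function name `forwarding_delay_us` is made up for this example.

```python
def forwarding_delay_us(mode, frame_bytes, rate_mbit):
    """Time a switch waits before it can start forwarding, in microseconds."""
    wait_bytes = {
        "cut-through": 6,                   # assumption: only the destination MAC is needed
        "fragment-free": 64,                # collision window named in the text
        "store-and-forward": frame_bytes,   # whole frame, so that the CRC can be checked
    }[mode]
    return wait_bytes * 8 / rate_mbit       # bits / (Mbit/s) = microseconds


for mode in ("cut-through", "fragment-free", "store-and-forward"):
    # A maximum-size (1518 byte) and a minimum-size (64 byte) frame at 100 Mbit/s
    print(mode,
          forwarding_delay_us(mode, 1518, 100),
          forwarding_delay_us(mode, 64, 100))
```

Only the store-and-forward delay depends on the frame length; the other two modes give the same value for short and long frames, which is the "constant latency" mentioned above.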
VLAN support makes it possible to separate the whole LAN into multiple
broadcast domains, thereby improving performance and security.
The spanning tree protocol (STP) avoids broadcast storms in a LAN. It is
described on the next slides.

Bridging Problems
Redundant paths lead to
  Broadcast storms
  Endless cycling
  Continuous table rewriting
No load sharing possible
No ability to select best path
Frame may be stored for 4 seconds (!)
  Although rare cases
  But only little acceptance for realtime and isochronous traffic - might change!
You might have noticed that bridges do not really learn the network topology.
They only learn a simple destination to port association! Because of this there
is no means to determine the best path, and furthermore frames might be
caught in a loop.
Broadcast frames in particular have no defined destination and would be
forwarded over all parallel paths, endlessly! This results in endless circling of
frames or, more dangerous, in a so-called "broadcast storm".
Also a continuous table rewriting might occur (this is not so widely known but
is also explained on the next pages).
Most people are not aware that frames might be stored up to 4 seconds inside
the buffer of a switch, and it still complies with the IEEE standard. Although
this would happen only in rare cases of congestion, transparent bridging is not
suitable for hard realtime applications. Today the situation has changed: QoS
features are included to assure bounded delays.


Endless Circling
[Figure: two LAN segments connected by two parallel bridges; steps 1 to 5 follow
a frame with DA = broadcast address or non-existent host address as it circles
between the segments. For simplicity we only follow one path.]
The picture above illustrates the endless circling phenomenon. Assume a
network with parallel paths between two LAN segments, realized by two
bridges. Any frame with a broadcast destination address would be forwarded
by both bridges to the other segment, and back and forth, and so on.
Obviously endless circling leads to congestion problems and is not desired.
Remember that there is no hop count or time-to-live number within the
Ethernet header.
But endless circling is not the main problem... (see next slide)

Broadcast Storm (1)
[Figure: two LAN segments connected by three parallel bridges (the
"amplification element"); steps 1 to 5 follow a frame with DA = broadcast
address or non-existent host address. For simplicity we only follow one path.]
The most feared issue with bridging is the broadcast storm. Broadcast storms
can be considered as a dramatically "enhanced" endless circling problem.
Broadcast storms appear when there is an "amplification" element within the
network, such as the threefold parallel paths in the diagram above.
Within a very short time (e.g. 1 second) the whole LAN is overloaded with
broadcast frames and nobody can transmit any useful frame anymore.
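The difference between endless circling and a broadcast storm can be illustrated with a small counting model. It is a deliberately simplified sketch and rests on assumptions that are not in the slides: there are exactly two segments, every bridge connects both of them, no frame is lost or collides, and a copy sent by one bridge is picked up and forwarded by all the other bridges. The function name `copies_per_round` is made up for this example.

```python
def copies_per_round(parallel_bridges, rounds):
    """Copies of one broadcast frame in flight after each segment crossing."""
    copies = 1  # the original frame sent by a host
    history = []
    for r in range(rounds):
        # The host's frame is forwarded by every bridge. A copy sent by one
        # bridge is received and forwarded by all the *other* bridges.
        forwarders = parallel_bridges if r == 0 else parallel_bridges - 1
        copies *= forwarders
        history.append(copies)
    return history


print(copies_per_round(2, 5))  # [2, 2, 2, 2, 2]    endless circling, constant load
print(copies_per_round(3, 5))  # [3, 6, 12, 24, 48] amplification: a broadcast storm
```

With two parallel bridges the number of circling copies stays constant, whereas a third parallel bridge doubles it on every crossing in this model.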

Broadcast Storm (2)
[Figure: continuation of the previous diagram, steps 5 to 9; at every pass
through the "amplification element" the number of frame copies grows. For
simplicity we only follow one path.]
The picture above shows the amplification effect mentioned on the previous
page.

Mutual Table Rewriting
[Figure: hosts MAC A and MAC B on two segments connected by two parallel
bridges, each with ports 1 and 2. A unicast frame with SA = A, DA = B makes the
table entry for A change from "A: Port 1" to "A: Port 2" and back to
"A: Port 1" (steps 1 to 3). For simplicity only one path is described.]
A relatively little-known problem is the mutual table rewriting phenomenon.
This problem occurs with unicast frames!
Assume that host A sends a unicast frame to destination B. Both bridges have
learned the locations of host A and host B, but suddenly B is detached. However,
both bridges keep the entry for B for five minutes.
During this time the following happens:
1) After the bridges forward the frame from the top segment to the bottom
segment, this frame is not consumed by any host B, and therefore the
bridges forward this frame back to the top segment.
2) At this moment the bridges rewrite their tables, as host A appears to be
located on the bottom segment.
3) Again the bridges forward the frame to the bottom segment, thereby
rewriting the port entry for this source address... ad infinitum!

Spanning Tree
Invented by Radia Perlman as general "mesh-to-tree" algorithm
A must in bridged networks with redundant paths
Only one purpose: cut off redundant paths with highest costs
Now we have learned that active parallel paths lead to severe problems in a
switched (i.e. bridged) network. We can only overcome this problem
by deactivating any redundant path. This should be performed automatically, so
that Ethernet bridging can still be called "Transparent" bridging.
The inventor of bridging, Radia Perlman, also created an easy solution for the
redundancy problem: the Spanning Tree Protocol (STP).
The STP is implemented in bridges only (not in hosts) and has only one
purpose: to determine any redundant paths and cut them off! Cost
values are considered for each path in order to maintain the best paths.

STP Ingredients
Special STP frames: "Bridge Protocol Data Units" (BPDUs)
A Bridge-ID for each bridge
  Priority value (16 bit, default 32768)
  (Lowest) MAC address
A Port Cost for each port
  Default: 1000 / bandwidth in Mbit/s (can be changed)
  E.g. 10 Mbit/s: C = 100
What do we need for STP to work? First of all, this protocol needs a special
messaging means, realized in so-called Bridge Protocol Data Units (BPDUs).
BPDUs are simple messages carried in Ethernet frames and contain several
parameters described below.
Each bridge is assigned one unique Bridge-ID, which is a combination of a 16
bit priority number and the lowest MAC address found on any port of this
bridge. The Bridge-ID is determined automatically using the default priority
32768.
Each port is assigned a Port Cost. Again this value is determined
automatically using the simple formula Port Cost = 1000 / BW, where BW is
the bandwidth in Mbit/s. Of course the Port Cost can be configured manually.
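The two STP parameters just described can be written down in a few lines. This is a minimal sketch under assumptions: the function names `bridge_id` and `default_port_cost` and the MAC addresses are invented for illustration, and clamping the cost to a minimum of 1 for very fast links is my own choice, not something stated in the text.

```python
def bridge_id(mac, priority=32768):
    """8-byte Bridge-ID: 2-byte priority followed by the 6-byte MAC address."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    if len(mac_bytes) != 6 or not 0 <= priority <= 0xFFFF:
        raise ValueError("need a 6-byte MAC address and a 16-bit priority")
    return priority.to_bytes(2, "big") + mac_bytes


def default_port_cost(bandwidth_mbit):
    """Default from the text: Port Cost = 1000 / bandwidth in Mbit/s."""
    # Assumption: never return less than 1, so that every hop adds some cost.
    return max(1, round(1000 / bandwidth_mbit))


a = bridge_id("00:00:0c:aa:aa:aa")                 # hypothetical MAC addresses
b = bridge_id("00:00:0c:bb:bb:bb")
c = bridge_id("00:00:0c:cc:cc:cc", priority=4096)  # manually lowered priority

print(min(a, b) == a)      # True: same priority, so the lower MAC address wins
print(min(a, b, c) == c)   # True: the priority is compared before the MAC address
print(default_port_cost(10), default_port_cost(100), default_port_cost(1000))  # 100 10 1
```

Because the priority occupies the most significant bytes, an administrator can force a particular bridge to win simply by lowering its priority, whatever its MAC address is.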

STP Principle
First a Root Bridge is determined
  Initially every bridge assumes itself as root
  The bridge with lowest Bridge-ID wins
Then the root bridge triggers BPDU sending (hello time intervals)
  Received at "Root Ports" by other bridges
  Every bridge adds its own port cost to the advertised cost and forwards the BPDU
On each LAN segment one bridge becomes Designated Bridge
  Having lowest total root path cost
  Other bridges set redundant ports in blocking state
[Figure: Root Bridge (Bridge-ID = 5) connected to bridges with Bridge-ID = 10
and Bridge-ID = 20; root ports with Port Cost = 10 and Port Cost = 100, and a
further port with Port Cost = 100.]
We give only a basic explanation here of how the STP works. First a Root
Bridge is determined by choosing the bridge with the lowest Bridge-ID. This
is simply done by sending BPDUs containing the presumed Root Bridge. At
first each bridge assumes that it is the Root Bridge itself. After every bridge
has sent its "opinion", the root bridge is determined.
Then the Root Ports are determined by each bridge. The Root Bridge sends
BPDUs periodically (every 2 seconds by default) "downstream" to the "leaves"
of the tree which is currently being created. Each bridge adds its own port cost
to the Root Path Cost parameter in the BPDU and forwards this BPDU over all
other ports. This way each bridge learns the best path to the root.
Finally, on each LAN segment the bridge with the lowest root path cost becomes
the Designated Bridge. Its port on this LAN segment is called the Designated
Port (DP). Root Ports and Designated Ports are in a forwarding state. All other
ports are in a blocking state.
But the best (and shortest) description comes from Radia Perlman's poem:
First the root must be selected,
by ID it is elected.
Least cost paths to root are traced,
and in the tree these paths are placed.
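The three steps (elect the root, trace least cost paths, pick designated bridges and block the rest) can be sketched as a small centralized calculation. Note that this is not the real protocol, which reaches the same result in a distributed way by exchanging BPDUs. The function `spanning_tree`, the segment names and the triangle topology are assumptions for illustration; only the Bridge-IDs 5, 10, 20 and the port costs 10 and 100 are taken loosely from the slide, and plain integers stand in for full Bridge-IDs.

```python
import math


def spanning_tree(bridge_ids, segments):
    """bridge_ids: {bridge: id}; segments: {segment: {bridge: port cost}}."""
    root = min(bridge_ids, key=bridge_ids.get)  # lowest Bridge-ID wins
    cost = {b: math.inf for b in bridge_ids}    # root path cost per bridge
    cost[root] = 0
    for _ in bridge_ids:                        # enough rounds to converge
        for ports in segments.values():
            advertised = min(cost[b] for b in ports)
            for b, port_cost in ports.items():
                if b != root:                   # add own port cost to advertised cost
                    cost[b] = min(cost[b], advertised + port_cost)
    # Designated bridge per segment: lowest root path cost, Bridge-ID breaks ties.
    designated = {seg: min(ports, key=lambda b: (cost[b], bridge_ids[b]))
                  for seg, ports in segments.items()}
    # Root port per bridge: the attachment offering the cheapest path to the root.
    root_port = {}
    for b in bridge_ids:
        if b == root:
            continue
        candidates = [(cost[designated[seg]] + ports[b], bridge_ids[designated[seg]], seg)
                      for seg, ports in segments.items()
                      if b in ports and designated[seg] != b]
        root_port[b] = min(candidates)[2]
    # Every port that is neither a root port nor a designated port is blocked.
    blocked = sorted((b, seg) for seg, ports in segments.items() for b in ports
                     if designated[seg] != b and root_port.get(b) != seg)
    return root, cost, root_port, designated, blocked


ids = {"B5": 5, "B10": 10, "B20": 20}
lans = {"LAN1": {"B5": 10, "B10": 10},      # hypothetical triangle of three segments
        "LAN2": {"B5": 100, "B20": 100},
        "LAN3": {"B10": 100, "B20": 100}}
root, cost, root_port, designated, blocked = spanning_tree(ids, lans)
print(root)       # B5
print(root_port)  # {'B10': 'LAN1', 'B20': 'LAN2'}
print(blocked)    # [('B20', 'LAN3')]
```

In this example the redundant third side of the triangle is cut by blocking the port of bridge B20 on LAN3, while B10 keeps its designated port on that segment. The sketch assumes a connected network; an unreachable bridge would make the root port selection fail.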

BPDU Format
Each bridge periodically sends BPDUs carried in Ethernet multicast frames
  Hello time default: 2 seconds
Contains all information necessary for building Spanning Tree

Field               Size      Meaning
Protocol ID         2 bytes
Protocol Version    1 byte
BPDU Type           1 byte
Flags               1 byte
Root ID             8 bytes   The bridge I regard as root
Root Path Cost      4 bytes   The total cost I see toward the root
Bridge ID           8 bytes   My own ID
Port ID             2 bytes
Message Age         2 bytes
Max Age             2 bytes
Hello Time          2 bytes
Forward Delay       2 bytes

Just for your interest, the above picture shows the structure of BPDUs. You
see, there is no magic in here, and the protocol is very simple. There are no
complicated protocol procedures. BPDUs are sent periodically and contain all
involved parameters. Each bridge enters its own "opinion" there or adds its
root path cost to the appropriate field. Note that some parameters are transient
and others are not.
The other parameters, which are not explained here, are not so important for
understanding the basic principle.
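The field layout shown on this slide maps directly onto a fixed 35-byte structure. The sketch below packs and unpacks such a structure with Python's `struct` module. Several details are assumptions that go beyond the slide: protocol ID, version, type and flags are all set to 0 (as in a plain configuration BPDU), the timers are encoded in units of 1/256 second, the defaults of 20 s for Max Age and 15 s for Forward Delay are commonly cited values, and the function name and sample IDs are invented.

```python
import struct

# Field widths as on the slide: 2+1+1+1+8+4+8+2+2+2+2+2 = 35 bytes, big-endian.
BPDU = struct.Struct(">HBBB8sI8sHHHHH")


def pack_config_bpdu(root_id, root_path_cost, bridge_id, port_id,
                     message_age=0.0, max_age=20.0, hello_time=2.0,
                     forward_delay=15.0):
    """Build the BPDU body (without Ethernet and LLC headers)."""
    def ticks(seconds):
        return int(seconds * 256)  # assumption: timers in units of 1/256 second

    return BPDU.pack(0, 0, 0, 0,   # protocol ID, version, BPDU type, flags
                     root_id, root_path_cost, bridge_id, port_id,
                     ticks(message_age), ticks(max_age),
                     ticks(hello_time), ticks(forward_delay))


# Hypothetical IDs: 2-byte priority 32768 (0x8000) followed by a 6-byte MAC address
root = bytes.fromhex("8000" "00000c000005")
me = bytes.fromhex("8000" "00000c000010")
body = pack_config_bpdu(root, 10, me, port_id=0x8001)

fields = BPDU.unpack(body)
print(len(body))         # 35
print(fields[5])         # 10   root path cost
print(fields[10] / 256)  # 2.0  hello time in seconds
```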

Note
Redundant links remain in active stand-by mode
  If root port fails, other root port becomes active
Low-price switches might not support STP
  Don't use them in meshed configurations
Only 7 bridges per path allowed according to the standard (!)
Still it is reasonable to establish parallel paths in a switched network in order to
utilize this redundancy in the event of a failure. The STP automatically activates
redundant paths if the active path is broken. Note that BPDUs are still
received on blocking ports, which is how a bridge notices that the active path
has failed.
Note that (very) low-price switches might not support the STP and should not
be used in high performance and redundant configurations.
For performance reasons the IEEE standard 802.1D only allows 7 bridges for
each path. Some vendors allow this value to be changed.
Only for your interest, here are the Ethernet parameters for BPDUs:
Multicast address: 0180 C200 0000 hex
LLC DSAP = SSAP = 42 hex

Bridging versus Routing

Bridging                                                Routing
Depends on MAC addresses only                           Requires structured addresses (must be configured)
Invisible for end systems;                              End system must know its default router
  transparent for higher layers
Must process every frame                                Processes only frames addressed to it
Number of table entries = number of all                 Number of table entries = number of subnets only
  devices in the whole network
Spanning Tree eliminates redundant lines;               Redundant lines and load balance possible
  no load balance
No flow control                                         Flow control is possible (router is seen by end systems)

The list shown above summarizes the pros and cons of bridging (switching) and
routing.

Bridging versus Routing

Bridging                                                Routing
No LAN/WAN coupling because of high traffic             Does not stress WAN with subnet's broadcasts or
  (broadcast domain!)                                     multicasts; commonly used as "gateway"
Paths selected by STP may not match communication       Router knows best way for each frame
  behaviour/needs of end systems
Faster, because implemented in HW;                      Slower, because usually implemented in SW;
  no address resolution                                   address resolution (ARP) necessary
Location change of an end system does not               Location change of an end system requires
  require updating any addresses                          adjustment of layer 3 address
Spanning tree necessary against endless circling        Routing protocols necessary to determine
  of frames and broadcast storms                          network topology

The list shown above summarizes the pros and cons of bridging (switching) and
routing (continued from previous slide).

Virtual LANs
Separate LAN into multiple broadcast domains
  No global broadcasts anymore
  For security reasons
Assign users to "VLANs"
[Figure: three VLANs. Red VLAN: Sales People; Yellow VLAN: Technicians; Green
VLAN: Guests.]
Since most organizations consist of multiple "working groups", it is reasonable
to confine the traffic they produce somehow. This is achieved using Virtual LANs
(VLANs). Switches configured for VLANs logically consist of multiple
virtual switches inside.
Users are assigned to dedicated VLANs and there is no communication
possible between different VLANs: even broadcasts are blocked! This
significantly enhances security.
On a switch each VLAN is identified by a number and (optionally) a name, but
in our example we also use colors to differentiate them.

Host to VLAN Assignment
Different solutions
  Port based assignment
  Source address assignment
  Protocol based
  Complex rule based
Bridges are interconnected via VLAN trunks
  IEEE 802.1q (New: 802.1w, 802.1s)
  ISL (Cisco)
There are different ways to assign hosts (users) to VLANs. The most common
is the port-based assignment, meaning that each port has been configured to be
a member of a VLAN. Simply attach a host there and its user belongs to the
specified VLAN.
Hosts can also be assigned to VLANs by their MAC address. Also special
protocols can be assigned to dedicated VLANs, for example management
traffic. Furthermore, some devices allow complex rules to be defined for
VLAN assignment, for example a combination of address, protocol, etc.
Of course VLANs should span several bridges. This is supported by
special VLAN trunking protocols, which are only used on the trunk between
two switches. Two important protocols are commonly used: the IEEE 802.1q
protocol and the Cisco Inter-Switch Link (ISL) protocol. Both protocols
basically attach a "tag" to each frame which is sent over the trunk.
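For 802.1q, "attaching a tag" means inserting four bytes between the source MAC address and the EtherType field: a tag protocol identifier (0x8100) followed by 16 bits holding a 3-bit priority, one further bit and the 12-bit VLAN number. The sketch below shows this for a raw frame given as bytes. It is an illustration only: the function names and the sample frame are invented, the frame check sequence (which a real switch must recalculate) is left out, and ISL, which wraps the frame differently, is not covered.

```python
import struct

TPID_8021Q = 0x8100  # tag protocol identifier of an 802.1q tag


def add_vlan_tag(frame, vlan_id, priority=0):
    """Insert an 802.1q tag after the two 6-byte MAC addresses."""
    if not 1 <= vlan_id <= 4094 or not 0 <= priority <= 7:
        raise ValueError("VLAN ID must be 1..4094 and priority 0..7")
    tci = (priority << 13) | vlan_id  # 3 bits priority, 1 bit left at 0, 12 bits VLAN
    return frame[:12] + struct.pack(">HH", TPID_8021Q, tci) + frame[12:]


def strip_vlan_tag(frame):
    """Return (vlan_id, untagged frame) for a tagged frame."""
    tpid, tci = struct.unpack(">HH", frame[12:16])
    if tpid != TPID_8021Q:
        raise ValueError("frame carries no 802.1q tag")
    return tci & 0x0FFF, frame[:12] + frame[16:]


# Hypothetical frame: destination MAC, source MAC, EtherType 0x0800, payload
frame = bytes.fromhex("00000c00000d" "00000c00000a" "0800") + b"Information for D"
tagged = add_vlan_tag(frame, vlan_id=5)   # sent over the trunk as member of VLAN 5
print(tagged[12:16].hex())                # 81000005
print(strip_vlan_tag(tagged) == (5, frame))  # True
```

The receiving switch reads the VLAN number from the tag, removes the tag again and delivers the frame only to ports of that VLAN.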

VLAN Trunking Example
Inter-VLAN communication not possible
Packets across the VLAN trunk are tagged
  Either using 802.1q or ISL tag
  So next bridge is able to constrain frame to same VLAN as the source
[Figure: two switches connected by a VLAN trunk (typically Fast Ethernet or
more). Stations A and D are in VLAN 5, stations B and C in VLAN 2. A frame
SA=A, DA=D "Information for D" crosses the trunk with tag 5; the tag identifies
the VLAN membership.]
By using VLAN tagging the "next" bridge knows whether the source address is
also a member of the same VLAN.

Inter-VLAN Traffic
Router can forward inter-VLAN traffic
  Terminates Ethernet links
  Requirement: Each VLAN in other IP subnet!
Two possibilities
  Router is member of every VLAN with one link each
  Router attached on VLAN trunk port ("Router on a stick")
[Figure: switches with ports in VLAN 2 and VLAN 5 and an attached router.
Router on a stick: changes the tag of every received frame and returns the
frame again.]
Now we admit the whole truth: of course it is possible to communicate
between different VLANs, using a router! A router terminates layer 2 and is
not interested in VLAN constraints. Of course this requires that each VLAN
uses a different IP subnet, since the router needs to make a routing
decision.
There are two possible configurations. The straightforward solution is to
attach a router to several ports on one or more switches, provided that each port
is a member of a different VLAN.
Another method is the "Router on a stick" configuration, employing only a
single attachment to a trunk port of a switch. This method saves ports (and
cables) but requires trunking functionality on the router. Here the router simply
changes the tag of each frame (after making a routing decision) and sends the
frame back to the switch.

Summary
Ethernet Bridging is "Transparent Bridging"
  Hosts do not "see" bridges
  Plug & Play
1 Collision domain ≠ 1 Broadcast domain
  Switches increase network performance!
Redundant paths are dangerous
  Broadcast storm is most feared
  Solution: Spanning Tree Protocol
VLANs create separated broadcast domains
  Port based or address based VLANing
  Routers allow inter-VLAN traffic

Quiz
Can I bridge from Ethernet to Token Ring?
How is flow control implemented?
Which bridge should be root bridge?
What are main differences between 802.1q and ISL?
What are Layer-3, Layer-4, and Layer-7 switches?
Q1: Yes, using translational bridges; problems are the different MAC address
styles and increased delay due to higher processing demand. Forget it!
Q2: Half duplex: backpressure (preamble jamming) or reduced interframe gap;
full duplex: pause frame (special MAC control frame)
Q3: Root bridge should be a point of high load
Q4: 802.1q only allows one Spanning Tree for all VLANs, ISL allows
multiple.
Q5: HW routers, QoS support, application awareness (server load)

1
2005/03/11 {C} Herbert Haas http://www.perihel.at
The Spanning Tree
802.1D (2004)
RSTP
MSTP

Problem Description
We want redundant links in bridged networks
But transparent bridging cannot deal with redundancy
  Broadcast storms and other problems (see later)
Solution: the spanning tree protocol
  Allows for redundant paths
  Ensures non-redundant active paths



Standard STP
A short repetition of why and how

Bridging Problems
Redundant paths lead to
  Broadcast storms
  Endless cycling
  Continuous table rewriting
No load sharing possible
No ability to select best path
You might have noticed that bridges do not really learn the network topology.
They only learn a simple destination to port association! Because of this there is
no means to determine the best path, and furthermore frames might be caught in a
loop.
Broadcast frames in particular have no defined destination and would be forwarded
over all parallel paths, endlessly! This results in endless circling of frames or,
more dangerous, in a so-called "broadcast storm".
Also a continuous table rewriting might occur (this is not so widely known but
is also explained on the next pages).
Most people are not aware that frames might be stored up to 4 seconds inside the
buffer of a switch, and it still complies with the IEEE standard. Although this
would happen only in rare cases of congestion, transparent bridging is not suitable
for hard realtime applications. Today the situation has changed: QoS features are
included to assure bounded delays.


Endless Circling
[Figure: two LAN segments connected by two parallel bridges; steps 1 to 5 follow
a frame with DA = broadcast address or non-existent host address as it circles
between the segments. For simplicity we only follow one path.]
The picture above illustrates the endless circling phenomenon. Assume a network
with parallel paths between two LAN segments, realized by two bridges. Any
frame with a broadcast destination address would be forwarded by both bridges to
the other segment, and back and forth, and so on.
Obviously endless circling leads to congestion problems and is not desired.
Remember that there is no hop count or time-to-live number within the Ethernet
header.
But endless circling is not the main problem... (see next slide)

Broadcast Storm (1)
[Figure: two LAN segments connected by three parallel bridges (the
"amplification element"); steps 1 to 5 follow a frame with DA = broadcast
address or non-existent host address. For simplicity we only follow one path.]
The most feared issue with bridging is the broadcast storm. Broadcast storms can
be considered as a dramatically "enhanced" endless circling problem. Broadcast
storms appear when there is an "amplification" element within the network, such
as the threefold parallel paths in the diagram above.
Within a very short time (e.g. 1 second) the whole LAN is overloaded with
broadcast frames and nobody can transmit any useful frame anymore.

Broadcast Storm (2)
[Figure: continuation of the previous diagram, steps 5 to 9; at every pass
through the "amplification element" the number of frame copies grows. For
simplicity we only follow one path.]
The picture above shows the amplification effect mentioned on the previous page.

Mutual Table Rewriting
[Figure: hosts MAC A and MAC B on two segments connected by two parallel
bridges, each with ports 1 and 2. A unicast frame with SA = A, DA = B makes the
table entry for A change from "A: Port 1" to "A: Port 2" and back to
"A: Port 1" (steps 1 to 3). For simplicity only one path is described.]
A relatively little-known problem is the mutual table rewriting phenomenon.
This problem occurs with unicast frames!
Assume that host A sends a unicast frame to destination B. Both bridges have
learned the locations of host A and host B, but suddenly B is detached. However,
both bridges keep the entry for B for five minutes.
During this time the following happens:
1) After the bridges forward the frame from the top segment to the bottom
segment, this frame is not consumed by any host B, and therefore the bridges
forward this frame back to the top segment.
2) At this moment the bridges rewrite their tables, as host A appears to be located
on the bottom segment.
3) Again the bridges forward the frame to the bottom segment, thereby rewriting
the port entry for this source address... ad infinitum!

The Spanning Tree
IEEE 802.1D-2004

Spanning Tree
Invented by Radia Perlman as general "mesh-to-tree" algorithm
A must in bridged networks with redundant paths
Only one purpose: cut off redundant paths with highest costs
Special STP frames: Bridge Protocol Data Units (BPDUs)
Now we have learned that active parallel paths lead to severe problems in a
switched (i.e. bridged) network. We can only overcome this problem
by deactivating any redundant path. This should be performed automatically, so
that Ethernet bridging can still be called "Transparent" bridging.
The inventor of bridging, Radia Perlman, also created an easy solution for the
redundancy problem: the Spanning Tree Protocol (STP).
The STP is implemented in bridges only (not in hosts) and has only one purpose:
to determine any redundant paths and cut them off! Cost values are
considered for each path in order to maintain the best paths.

Three STP Parameters
8 byte Bridge-ID for each bridge
    Consists of 2 byte Priority value (default 32768) and 6 byte (lowest) MAC address
    Used to determine root bridge and as tie-breaker when determining designated port
4 byte Port Cost for each port
    Old (still used) standard method: 1000 / Port_BW_in_Mbits
    E.g. 10 Mbit/s: Cost = 100
    Used to calculate Root Path Cost to determine root port and designated port
2 byte Port-ID for each port
    Consists of 1 byte Priority value (default 128) and 1 byte port number
    Only used as tie-breaker if the same Bridge-ID and the same Path Cost is received on multiple ports
What do we need for STP to work? First of all this protocol needs a special means of messaging, realized in so-called Bridge Protocol Data Units (BPDUs). BPDUs are simple messages carried in Ethernet frames, containing several parameters described below.
Each bridge is assigned one unique Bridge-ID, which is a combination of a 16 bit priority number and the lowest MAC address found on any port of this bridge. The Bridge-ID is determined automatically using the default priority 32768.
Each port is assigned a Port Cost. Again this value is determined automatically using the simple formula Port Cost = 1000 / BW, where BW is the bandwidth in Mbit/s. Of course the Port Cost can also be configured manually.
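The two automatically derived parameters can be sketched in Python as follows. This is only an illustration: the function names bridge_id and old_port_cost are made up for this example, and rounding down with a minimum cost of 1 is an assumption about how the 1000 / BW rule is applied to fast links.

```python
def bridge_id(priority: int, mac: str) -> bytes:
    """Build the 8-byte Bridge-ID: 2-byte priority followed by the 6-byte MAC."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", "").replace(".", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must have 6 bytes")
    if not 0 <= priority <= 0xFFFF:
        raise ValueError("priority must fit into 2 bytes")
    return priority.to_bytes(2, "big") + mac_bytes


def old_port_cost(bandwidth_mbit: int) -> int:
    """Old default Port Cost = 1000 / BW (integer division, minimum 1: assumptions)."""
    return max(1, 1000 // bandwidth_mbit)


# The lower Bridge-ID wins. bytes objects compare byte by byte, so the priority
# is compared first and the MAC address only breaks ties.
a = bridge_id(32768, "00:08:21:99:2b:c0")
b = bridge_id(32768, "00:08:21:99:2b:c1")
lowest = min(a, b)
print(lowest.hex(), old_port_cost(10))
```

Because the priority occupies the most significant bytes, lowering the priority of a bridge is enough to make it win against any MAC address.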

STP Basic Principle
First the Root Bridge is determined
    Initially every bridge assumes itself as root
    The bridge with lowest Bridge-ID wins
Then the root bridge triggers transmissions of BPDUs
    In hello time intervals (2 s)
    Received at "Root Ports" by other bridges
    Every bridge adds its own port cost to the advertised path cost and forwards the BPDU
On each LAN segment one bridge becomes Designated Bridge
    Having lowest root path cost
    Other bridges set their (redundant) ports in blocking state
[Figure: Root Bridge (Bridge-ID = 5) with two Designated Ports, each advertising Path Cost = 0, and two further bridges (Bridge-ID = 10 and Bridge-ID = 20) whose Root Ports have Port Cost = 10 and Port Cost = 100, giving Path Cost = 10 and Path Cost = 100.]
We give only a basic explanation here of how the STP works. First a Root Bridge is determined by choosing the bridge with the lowest Bridge-ID. This is simply done by sending BPDUs containing the presumed Root Bridge. At first each bridge assumes that it is the Root Bridge itself. After every bridge has sent its "opinion", the root bridge is determined.
Then the Root Ports are determined by each bridge. The Root Bridge sends BPDUs periodically (every 2 seconds by default) "downstream" to the "leaves" of the tree which is currently being created. Each bridge adds its own port cost to the Root Path Cost parameter in the BPDU and forwards this BPDU over all other ports. This way each bridge learns the best path to the root.
Finally, on each LAN segment the bridge offering the lowest Root Path Cost becomes the Designated Bridge. Its port on this LAN segment is called the Designated Port (DP).
Root Ports and Designated Ports are in a forwarding state. All other ports are in a blocking state.
But the best (and shortest) description comes from Radia Perlman's poem:
First the Root must be selected
By ID it is elected.
Least cost paths from Root are traced
In the tree these paths are placed.
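The outcome of these steps can be sketched centrally in a few lines of Python. Note that this is not how bridges actually compute it: real bridges only exchange BPDUs with their neighbours, whereas this sketch uses Dijkstra's shortest-path algorithm on a full view of the topology as a stand-in that yields the same Root Path Costs. The topology, the integer Bridge-IDs and the assumption that a link has the same port cost at both ends are purely illustrative.

```python
import heapq


def elect_root(bridge_ids):
    """The bridge with the lowest Bridge-ID becomes the Root Bridge."""
    return min(bridge_ids)


def root_path_costs(links, root):
    """links: (bridge_a, bridge_b, port_cost) triples.
    Returns the lowest cumulative path cost to the root for every bridge."""
    neighbours = {}
    for a, b, cost in links:
        neighbours.setdefault(a, []).append((b, cost))
        neighbours.setdefault(b, []).append((a, cost))
    best = {root: 0}
    queue = [(0, root)]
    while queue:
        cost, bridge = heapq.heappop(queue)
        if cost > best.get(bridge, float("inf")):
            continue  # outdated queue entry
        for neighbour, port_cost in neighbours.get(bridge, []):
            new_cost = cost + port_cost
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return best


# Three bridges; the link between 10 and 20 is the redundant one.
links = [(5, 10, 10), (5, 20, 100), (10, 20, 100)]
root = elect_root([5, 10, 20])
costs = root_path_costs(links, root)
print(root, costs)
```

Bridge 20 reaches the root directly at cost 100, because the detour via bridge 10 would cost 110; its port toward bridge 10 is therefore a candidate for blocking.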

Final situation
Root switch
    Has only Designated Ports
    All in forwarding state
Other switches have
    Exactly one Root Port (upstream)
    Zero or more Designated Ports (downstream)
    Zero or more Nondesignated Ports (blocked)

Port States
At each time, a port is in one of the following states: Blocking, Listening, Learning, Forwarding, or Disabled
Only Blocking or Forwarding are final states (for enabled ports)
Transition states
    15 s Listening state is used to converge STP
    15 s Learning state is used to learn MAC addresses for the new topology
Therefore it takes 30 seconds until a port is placed in forwarding state
[Figure: Blocking (start here, topology changed) => Listening (give STP time to converge) => Learning (populate bridging table for that new topology) => Forwarding]
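The timing of the two transition states can be made explicit with a small sketch. The function name port_timeline is hypothetical; the 15 s default corresponds to the duration of each transition state given on the slide.

```python
def port_timeline(forward_delay: int = 15):
    """Return (seconds, state) pairs for a port on its way to Forwarding."""
    return [
        (0, "Listening"),                   # leaves Blocking after a topology change
        (forward_delay, "Learning"),        # STP has had time to converge
        (2 * forward_delay, "Forwarding"),  # bridging table populated for the new topology
    ]


for seconds, state in port_timeline():
    print(f"t = {seconds:2d} s: {state}")
```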

Note
Redundant links remain in active stand-by mode
    If root port fails, other root port becomes active
Only 7 bridges per path allowed according to standard (!)
    Because of 15 seconds listening state and 2 seconds hello timers
Still it is reasonable to establish parallel paths in a switched network in order to utilize this redundancy in the event of a failure. The STP automatically activates redundant paths if the active path is broken. Note that BPDUs are still received on blocking ports.
Note that (very) low-price switches might not support the STP and should not be used in high-performance and redundant configurations.
For performance reasons the IEEE standard 802.1D only allows 7 bridges for each path. Some vendors allow this value to be changed.
Only for your interest, here are the Ethernet parameters for BPDUs:
Multicast address 0180 C200 0000 hex
LLC DSAP = SSAP = 42 hex

Usage for a Port-ID
The Port-ID is only used as last tie-breaker
Typical situation in highly redundant topologies: Multiple links between each two switches
    Same BID and Costs announced on each link
    Only local Port-ID can choose a single link
[Figure: the Root Bridge announces BID=00-00:00-ca-fe-ba-be-77, Root Path Cost = 0 on both parallel links (gi0/1 and gi0/2). The other switch: "Both links are identical but gi0/1 has a lower Port-ID so I will use that link".]

BPDU Format
Each bridge sends periodically BPDUs carried in Ethernet multicast frames
    Hello time default: 2 seconds
Contains all information necessary for building Spanning Tree

Field             Size     Remark
Prot. ID          2 Byte
Prot. Vers.       1 Byte
BPDU Type         1 Byte
Flags             1 Byte
Root ID           8 Byte   The Bridge I regard as root
Root Path Costs   4 Byte   The total cost I see toward the root
Bridge ID         8 Byte   My own ID
Port ID           2 Byte
Msg Age           2 Byte
Max Age           2 Byte
Hello Time        2 Byte
Fwd. Delay        2 Byte
Just for your interest, the above picture shows the structure of BPDUs. You see, there is no magic in here, and the protocol is very simple. There are no complicated protocol procedures. BPDUs are sent periodically and contain all involved parameters. Each bridge enters its own "opinion" there or adds its root path costs to the appropriate field. Note that some parameters are transient and others are not.
The other parameters not explained here are not essential for understanding the basic principle.

Importance of details
Many people think STP is a simple thing - until they encounter practical problems in real networks
Important Details
    STP State Machine
    BPDU format details
    TCN mechanism
    RSTP
    MSTP

Note: STP is a port-based algorithm
Only the root-bridge election is done on the bridge-level
All other processing is port-based
    To establish the spanning tree, each enabled port is either forwarding or blocking
    Additionally two transition states have been defined

Another Example
Three steps to create spanning tree:
1. Elect Root Bridge (Each L2-network has exactly one Root Bridge)
2. Elect Root Ports (Each non-root bridge has exactly one Root Port)
3. Elect Designated Ports (Each segment has exactly one Designated Port)
To determine root port and designated port:
1. Determine lowest (cumulative) Path Cost to Root Bridge
2. Determine lowest Bridge ID
3. Determine lowest Port ID
[Figure: bridges A (BID=1:MAC_A), B (BID=100:MAC_B) and C (BID=200:MAC_C) connected by Fast Ethernet links (FE: Cost=19). A advertises Cost=0 on its Designated Ports; B and C each see Cost=19 on their Root Ports and Cost=38 on the alternative path. On the segment between B and C, B has a lower Bridge-ID than C, therefore B becomes Designated Bridge (i.e. has the Designated Port for this segment); the other port on this segment is a Nondesignated Port.]
Each segment has exactly one Designated Port. This simple rule actually breaks any loops.
A nondesignated port receives a more useful BPDU than the one it would send out on its segment. Therefore it remains in the so-called blocking state.
Port ID - Contains a unique value for every port. Port 1/1 contains the value 0x8001, whereas Port 1/2 contains 0x8002. (Or in decimal: 128.1, 128.2, ...)
From the 802.1D-1998 standard:
Each Configuration BPDU contains, among other parameters, the unique identifier of the Bridge that the transmitting Bridge believes to be the Root, the cost of the path to the Root from the transmitting Port, the identifier of the transmitting Bridge, and the identifier of the transmitting Port. This information is sufficient to allow a receiving Bridge to determine whether the transmitting Port has a better claim to be the Designated Port on the LAN on which the Configuration BPDU was received than the Port currently believed to be the Designated Port, and to determine whether the receiving Port should become the Root Port for the Bridge if it is not already.
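The comparison described in this excerpt amounts to a lexicographic comparison of four values. The following sketch, which assumes integer stand-ins for the Bridge-IDs and a hypothetical Bpdu tuple, shows why B wins the Designated Port election against C in the example of this slide:

```python
from typing import NamedTuple


class Bpdu(NamedTuple):
    """The four values compared in order; lower is better at every position."""
    root_id: int         # who is believed to be the Root
    root_path_cost: int  # cost from the transmitting port to the Root
    bridge_id: int       # transmitting bridge
    port_id: int         # transmitting port


def is_better(candidate: Bpdu, current: Bpdu) -> bool:
    """True if candidate has the better claim (tuples compare field by field)."""
    return candidate < current


# Bridges B (BID 100) and C (BID 200) both reach Root A (BID 1) at cost 19.
from_b = Bpdu(root_id=1, root_path_cost=19, bridge_id=100, port_id=0x8001)
from_c = Bpdu(root_id=1, root_path_cost=19, bridge_id=200, port_id=0x8001)
designated = min(from_b, from_c)
print(designated)
```

The Port-ID in the last position only matters when the first three values are identical, which matches its role as the last tie-breaker.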

Components of the Bridge-ID
The recent 802.1D-2004 standard requires only 4 bits for priority and 12 bits to distinguish multiple STP instances
    Typically used for MSTP, where each set of VLANs has its own STP topology
Therefore, ascending priority values are 0, 4096, 8192, ...
    Typically still configured as 0, 1, 2, 3, ...
Old: Priority (2 Bytes, default: 32768) | Lowest MAC Address (6 Bytes, typically derived from Backplane or Supervisor module)
New: Priority (4 Bits) | Extended System ID (12 Bits, to allow distinct BIDs per VLAN as used by MSTP) | Lowest MAC Address (6 Bytes)
With the 802.1T spanning-tree extensions, some of the bits previously used for the switch priority are now used for the extended system ID (VLAN identifier for the per-VLAN spanning-tree plus [PVST+] and for rapid PVST+, or an instance identifier for the multiple spanning tree [MST]).
Before this, spanning tree used one MAC address per VLAN to make the bridge ID unique for each VLAN.
Extended system IDs are VLAN IDs between 1025 and 4096. Releases 12.1(14)E1 and later releases support a 12-bit extended system ID field as part of the bridge ID.
Switch(config)# spanning-tree extend system-id
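As a sketch of how the 16-bit priority field is split, the following Python snippet combines a 4-bit priority (in steps of 4096) with a 12-bit extended system ID. The helper names are hypothetical. With priority 49152 and VLAN 200 the result is 49352, which is the kind of value the "show spanning-tree" output prints as "priority 49152 sys-id-ext 200".

```python
def make_priority_field(priority: int, sys_id_ext: int) -> int:
    """Combine the 4-bit priority (multiple of 4096) with the 12-bit extended system ID."""
    if priority % 4096 != 0 or not 0 <= priority <= 61440:
        raise ValueError("priority must be 0, 4096, 8192, ..., 61440")
    if not 0 <= sys_id_ext <= 0x0FFF:
        raise ValueError("extended system ID must fit into 12 bits")
    return priority | sys_id_ext


def split_priority_field(field: int):
    """Return (priority, extended system ID) from the 16-bit field."""
    return field & 0xF000, field & 0x0FFF


field = make_priority_field(49152, 200)
print(field, split_priority_field(field))
```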

STP Port Cost
Also different cost values might be used
    See recommendations in the IEEE 802.1D-2004 standard to comply with RSTP and MSTP

Speed [Mbit/s]   Old Cost (1000/Speed)   New Cost   802.1T
10               100                     100        2,000,000
100              10                      19         200,000
155              6                       14         (129032 ?)
622              1                       6          (32154 ?)
1000             1                       4          20,000
10000            1                       2          2,000
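The 802.1T column appears to follow a simple rule: 20,000,000,000 divided by the link speed in kbit/s, which is the same as 20,000,000 divided by the speed in Mbit/s. Treat this formula and the rounding down as assumptions of the sketch below (the function names are made up). It reproduces the table, including the two values marked with a question mark.

```python
def old_cost(speed_mbit: int) -> int:
    """Old method: 1000 / speed, rounded down, but never below 1 (assumption)."""
    return max(1, 1000 // speed_mbit)


def cost_8021t(speed_mbit: int) -> int:
    """Assumed 802.1T rule: 20,000,000 / speed in Mbit/s, rounded down."""
    return 20_000_000 // speed_mbit


for speed in (10, 100, 155, 622, 1000, 10000):
    print(f"{speed:>6} Mbit/s  old={old_cost(speed):>3}  802.1T={cost_8021t(speed):>9,}")
```

The "New Cost" column (100, 19, 14, 6, 4, 2) is a hand-picked table and cannot be derived from a formula like this.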

802.1T Excerpt

Detailed BPDU Format
BPDUs are sent in 802.3 frames
    DA = 01-80-C2-00-00-00
    LLC has DSAP = SSAP = 0x42 ("the answer")
Configuration BPDUs
    Originated by Root Bridge periodically (2 sec Hello Time), flow downstream

Field                Bytes   Meaning
Protocol ID          2       Always zero
Version              1       Always zero
Message Type         1       Configuration (0x00) or TCN BPDU (0x80)
Flags                1       LSB = Topology change flag (TC), MSB = TC Ack flag (TCA)
Root ID              8       Who is Root Bridge? (When first booted, Root-ID == BID)
Root Path Cost       4       How far away is Root Bridge? (If value increases, then the originating bridge lost connectivity to Root Bridge)
Bridge ID            8       ID of bridge that sent this BPDU
Port ID              2       Port-ID of sending bridge (unique: Port1/1=0x8001, 1/2=0x8002, ...)
Message Age          2       Time since Root generated this BPDU
Maximum Age = 20     2       BPDU is discarded if older than this value (default: 20 seconds)
Hello Time = 2       2       Broadcast interval of BPDUs (default: 2 seconds)
Forward Delay = 15   2       Time spent in each of the listening and learning states (default: 15 seconds)

Maximum Age, Hello Time and Forward Delay are predetermined by the root bridge; they affect convergence time, and misconfigurations cause loops
A TCN-BPDU only consists of the first 3 fields (Protocol ID, Version, Message Type)!
In normal stable operation, the regular transmission of Configuration Messages by the Root ensures that topology information is not timed out. To allow for reconfiguration of the Bridged LAN when components are removed or when management changes are made to parameters determining the topology, the topology information propagated throughout the Bridged LAN has a limited lifetime. This is effected by transmitting the age of the information conveyed (the time elapsed since the Configuration Message originated from the Root) in each Configuration BPDU. Every Bridge stores the information from the Designated Port on each of the LANs to which its Ports are connected, and monitors the age of that information.
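To make the field layout concrete, here is a minimal Python sketch that unpacks the 35-byte body of a Configuration BPDU with the standard struct module. It assumes network byte order and that the four timer fields are encoded in units of 1/256 second; the function name and the dictionary keys are illustrative, and the Ethernet and LLC headers are expected to be stripped already.

```python
import struct

# 2+1+1+1+8+4+8+2+2+2+2+2 = 35 bytes, big-endian, no padding
CONFIG_BPDU = struct.Struct(">HBBB8sI8sHHHHH")


def parse_config_bpdu(data: bytes) -> dict:
    """Decode the fixed fields of a Configuration BPDU into a dictionary."""
    if len(data) < CONFIG_BPDU.size:
        raise ValueError("Configuration BPDU needs at least 35 bytes")
    (protocol_id, version, bpdu_type, flags, root_id, root_path_cost,
     bridge_id, port_id, message_age, max_age, hello_time,
     forward_delay) = CONFIG_BPDU.unpack(data[:CONFIG_BPDU.size])
    return {
        "protocol_id": protocol_id,
        "version": version,
        "type": bpdu_type,
        "tc": bool(flags & 0x01),   # LSB = Topology Change
        "tca": bool(flags & 0x80),  # MSB = Topology Change Acknowledgment
        "root_id": root_id.hex(),
        "root_path_cost": root_path_cost,
        "bridge_id": bridge_id.hex(),
        "port_id": port_id,
        "message_age_s": message_age / 256,
        "max_age_s": max_age / 256,
        "hello_time_s": hello_time / 256,
        "forward_delay_s": forward_delay / 256,
    }


# Round trip with made-up values: build a BPDU and parse it again.
example = CONFIG_BPDU.pack(
    0, 0, 0x00, 0x01,
    bytes.fromhex("8000000821992bc0"), 19,
    bytes.fromhex("8000000821992bc1"), 0x8001,
    1 * 256, 20 * 256, 2 * 256, 15 * 256,
)
parsed = parse_config_bpdu(example)
print(parsed)
```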

Topology Change Notification (TCN)
Special BPDUs, used as alert by any bridge
    Flow upstream (through Root Port)
    Only consists of the first three standard header fields!
Sent upon
    Transition of a port into Forwarding state and at least one Designated Port exists
    Transition of a port into Blocking state (from either Forwarding or Learning state)
Sent until acknowledged by TC Acknowledge (TCA)

Topology Change Notification (TCN)
Only the Designated Ports of upstream bridges process TCN-BPDUs and send TC-Ack (TCA) downstream
Finally the Root Bridge receives the TC and sends Configuration BPDUs with the TC flag set to 1 (=TCA) downstream for (Forward Delay + Max Age = 35) seconds
    This instructs all bridges to reduce the default bridging table aging (300 s) to the current Forward Delay value (15 s)
    Thus bridging tables can adapt to the new topology
Main idea: To avoid the 5 minute age timer upon a topology change! Some destinations may not be reachable any more!
Normally, all Configuration BPDUs are (periodically) sent by the root bridge. Other bridges never send out a BPDU toward the root bridge!
Therefore dedicated TCN messages have been defined to allow a non-root bridge to announce topology changes.
TCN BPDUs are sent on the root port until acknowledged by the upstream bridge (BPDU with the topology change acknowledgement (TCA) bit set).
The TCN is sent every hello time, which is a locally configured value (not the hello time specified in configuration BPDUs).
Reasons to send TCNs:
1. When a port changes from "Forwarding" to any other state
2. When a port transitions to forwarding and the bridge has a designated port (that is, the bridge is not standalone).
Then a TCN is sent upstream to the root bridge (i.e. only sent through the root port), which "broadcasts" this information downstream to all other bridges.
o These downstream TCNs are not acknowledged
o The TC bit is set by the root for a period of max age + forward delay seconds, which is 20 + 15 = 35 seconds by default.
o Every bridge now reduces the aging time of every existing bridging table entry to 15 seconds (more precisely: the actual value of forward delay). This is done (also for new entries) for the duration of 35 seconds (more precisely: max age + forward delay).
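The timer arithmetic of the TCN mechanism is small enough to write down directly. The following sketch uses hypothetical function names and the default values quoted in these notes:

```python
def tc_flag_duration(max_age: int = 20, forward_delay: int = 15) -> int:
    """Seconds for which the root keeps the TC flag set in its Configuration BPDUs."""
    return max_age + forward_delay


def bridging_table_aging(tc_flag_seen: bool, forward_delay: int = 15,
                         default_aging: int = 300) -> int:
    """Aging time a bridge applies to its table entries, in seconds."""
    return forward_delay if tc_flag_seen else default_aging


print(tc_flag_duration(), bridging_table_aging(True), bridging_table_aging(False))
```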

Configuration on Cisco switches
Enable STP on a specific VLAN:
Switch(config)# spanning-tree vlan 200
Enforcing Root Bridge:
Switch(config)# spanning-tree vlan 200 priority 0
Manipulate Port Costs:
Switch(config-if)# spanning-tree cost 18
Manipulate Port Costs for a specific VLAN:
Switch(config-if)# spanning-tree vlan 200 cost 15
Switch# show spanning-tree vlan 200
VLAN0200
Spanning tree enabled protocol ieee
Root ID Priority 49352
Address 0008.2199.2bc0
This bridge is the root
Hello Time 2 sec Max Age 20 sec Forward Delay 15 sec
Bridge ID Priority 49352 (priority 49152 sys-id-ext 200)
Address 0008.2199.2bc0
Hello Time 2 sec Max Age 20 sec Forward Delay 15 sec
Aging Time 300
Uplinkfast enabled
Interface Port ID Designated Port ID
Name Prio.Nbr Cost Sts Cost Bridge ID Prio.Nbr
---------------- -------- --------- --- --------- -------------------- --------
Fa0/1 128.1 3019 LIS 0 49352 0008.2199.2bc0 128.1
Fa0/2 128.2 3019 LIS 0 49352 0008.2199.2bc0 128.2
Enable spanning tree on a per-VLAN basis.
Old commands:
set spantree priority
set spantree root
show spantree

STP Optimizations
Port Fast
Uplink Fast
Backbone Fast

Port Fast
Optimizes switch ports connected to end-station devices
    Usually, if PC boots, NIC establishes L2-link, and switch port goes from Disabled=>Blocking=>Listening=>Learning=>Forwarding state ...30 seconds!!!
Port Fast allows a port to immediately enter the Forwarding state
    STP is NOT disabled on that port!
Any connectivity problems after cold booting a PC in the morning but NOT after warm-booting during the day?

Port Fast
Port Fast only works once after link comes up!
    If port is then forced into Blocking state and later returns into Forwarding state, then the normal transition takes place!
Ignored on trunk ports
Alternatives:
    Disable STP (often a bad idea)
    Use a hub in between => switch port is always active

PortFast Configuration
Enables PortFast on an interface:
Switch(config-if)# spanning-tree portfast
Verify PortFast:
Switch# show running-config interface fastethernet 5/8
Building configuration...
Current configuration:
!
interface FastEthernet5/8
no ip address
switchport
switchport access vlan 200
switchport mode access
spanning-tree portfast
end


Uplink Fast
Accelerates STP to converge within 1-3 seconds
    Cisco patent
    Marks some blocking ports as backup uplink
Typically used on access layer switches
    Only works on non-root bridges
    Requires some blocked ports
    Enabled for entire switch (and not for individual VLANs)
UplinkFast is actually a root port optimization.
The standard Cisco multicast address 01-00-0C-CC-CC-CC, which is used for CDP, VTP, DTP, and DISL, cannot be used, because all Cisco devices are programmed not to flood these frames (but rather to consume them).
Note that only MACs not learned over the uplinks are flooded.
show spantree uplinkfast

Problem
When link to root bridge fails, STP requires (at least) 30 seconds until alternate root port becomes active
[Figure: Root and Backup root above an access switch; the access switch has a Root Port and a blocked port (labels g0/1), and BPDUs arrive on the links.]

Idea of Uplink Fast
When a port receives a BPDU, we know that it has a path to the root bridge
    Put all root port candidates to a so-called "Uplink Group"
Upon uplink failure, immediately put best port of Uplink group into forwarding state
    There cannot be a loop because previous uplink is still down
[Figure: Root and Backup root above an access switch with Uplink Fast; after the Root Port fails, g0/1 is immediately placed in forwarding state.]
The UplinkFast feature is based on the definition of an uplink group. On a given switch, the uplink group consists of the root port and all the ports that provide an alternate connection to the root bridge. If the root port fails, that is, if the primary uplink fails, the port with the next lowest cost from the uplink group is selected to immediately replace it.
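The selection rule for the uplink group can be sketched as follows. The data structure (a dictionary from port name to root path cost), the function name and the tie-break by port name are assumptions made for this illustration; a real switch would break ties with the Port-ID.

```python
def next_root_port(uplink_group, failed_port):
    """Pick the replacement root port: lowest cost among the remaining uplinks."""
    remaining = {port: cost for port, cost in uplink_group.items()
                 if port != failed_port}
    if not remaining:
        return None  # no alternate uplink available
    # Sort key: cost first, then port name as a stand-in tie-breaker.
    return min(remaining, key=lambda port: (remaining[port], port))


uplinks = {"g0/1": 4, "g0/2": 19, "g0/3": 19}
replacement = next_root_port(uplinks, failed_port="g0/1")
print(replacement)
```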

Incorrect Bridging Tables
But upstream bridges still require 30 s to learn new topology
Bridging table entries in upstream bridges may be incorrect
[Figure: the access switch has put g0/1 into forwarding state, but an upstream bridge (ports g1/3 and g3/17) still holds the entry "MAC B is at g1/3" and forwards a packet for MAC B accordingly.]

Actively correct tables
Uplink Fast corrects the bridging tables of upstream bridges
Sends 15 multicast frames (one every 100 ms) for each MAC address in its bridging table (i.e. for each downstream host)
    Using SA=MAC: All other bridges quickly reconfigure their tables; dead links are no longer used
    DA=01-00-0C-CD-CD-CD, flooded throughout the network
[Figure: frames with DA=01-00-0C-CD-CD-CD, SA=MAC B are flooded upstream; the upstream bridge (ports g1/3 and g3/17) now records "MAC B is at g3/17" and forwards a packet for MAC B accordingly.]

Additional Details
When broken link becomes up again, Uplink Fast waits until traffic is seen
    That is, 30 seconds plus 5 seconds to allow other protocols to converge (e.g. EtherChannel, DTP, ...)
Flapping links would trigger Uplink Fast too often, which causes too much additional traffic
    Therefore the port is "held down" for another 35 seconds before the Uplink Fast mechanism is available for that port again
Several STP parameters are modified automatically
    Bridge Priority = 49152 (don't want to be root)
    All Port Costs += 3000 (don't want to be designated port)
1100xxxx xxxxxxxx: 49152 = 2^15 + 2^14

UplinkFast - Configuration
Switch(config)# spanning-tree uplinkfast [max-update-rate max_update_rate]
Switch# show spanning-tree uplinkfast
UplinkFast is enabled
Station update rate set to 150 packets/sec.
UplinkFast statistics
-----------------------
Number of transitions via uplinkFast (all VLANs) :9
Number of proxy multicast addresses transmitted (all VLANs) :5308
Name Interface List
-------------------- ------------------------------------
VLAN1 Fa6/9(fwd), Gi5/7
VLAN2 Gi5/7(fwd)
VLAN3 Gi5/7(fwd)
VLAN4
VLAN5
VLAN1002 Gi5/7(fwd)
VLAN1003 Gi5/7(fwd)
VLAN1004 Gi5/7(fwd)
VLAN1005 Gi5/7(fwd)


Backbone Fast
Complementary to Uplink Fast
Saves 20 seconds when recovering from indirect link failures in core area
    Issues Max Age timer expiration
    Reduces failover time from 50 to 30 seconds
    Cannot eliminate Forwarding Delay
Should be enabled on every switch!
BackboneFast is actually a Max Age optimization.
Upon Root Port failure, a switch assumes the Root role and generates its own Configuration BPDUs, which are treated as "inferior" BPDUs, because most switches might still receive the BPDUs from the original Root Bridge.
The request/response mechanism involves a so-called Root Link Query (RLQ) protocol, that is, RLQ-requests are sent to upstream bridges to check whether their connection to the Root Bridge is stable. Upstream bridges reply with RLQ-responses. If the upstream bridge does not know about any problems, it forwards the RLQ-request further upwards, until the problem is solved. If the RLQ-response is received by the downstream bridge on a non-Root Port, then this bridge knows that it has lost its connection to the Root Bridge and can immediately expire the Max Age timer.

Problem
Consider initial situation
Note that blocked port (g0/1) always remembers "best seen" BPDU - which has best (=lowest) Root-BID
[Figure: Root (BID=R), Backup root (BID=B) and bridge A (BID=A) with its Root Port and port g0/1; all BPDUs announce "Root has BID=R".]

Problem (cont.)
Now backup-root bridge loses connectivity to root bridge and assumes root role
Port g0/1 does not see the BPDUs from the original root bridge any more
But for MaxAge=20 seconds, any inferior BPDU is ignored
[Figure: Backup root (BID=B) now sends "BPDU: Root has BID=B"; bridge A (BID=A) still receives "BPDU: Root has BID=R" on its Root Port and reacts on g0/1 with "No, I remember a better BPDU".]
Note that the key problem is this:
1) Direct link failures would immediately set the bridge in listening mode (i.e. all of its ports).
2) But indirect link failures always involve the max-age timer (20 s) before entering the listening state.

Problem (cont.)
Only after 20 seconds port g0/1 enters listening state again
Finally, bridge A unblocks g0/1 and forwards the better BPDUs to bridge B
Total process lasts 20+15+15 seconds
[Figure: Root (BID=R), Backup root (BID=B) and bridge A (BID=A); "BPDU: Root has BID=R" is now forwarded over g0/1 as well.]
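The 20+15+15 second total, and the 30 seconds that remain once the Max Age wait is skipped, can be expressed as a tiny calculation. The function and parameter names are hypothetical; the defaults are the timer values used throughout these slides.

```python
def failover_seconds(indirect_failure: bool, backbone_fast: bool = False,
                     max_age: int = 20, forward_delay: int = 15) -> int:
    """Time until the replacement port forwards: optional Max Age wait
    plus the listening and learning states."""
    waits_for_max_age = indirect_failure and not backbone_fast
    return (max_age if waits_for_max_age else 0) + 2 * forward_delay


print(failover_seconds(indirect_failure=True))                      # plain STP
print(failover_seconds(indirect_failure=True, backbone_fast=True))  # Max Age expired at once
```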

Solution
If an inferior BPDU is originated from the local segment's Designated Bridge, then this probably indicates an indirect failure
    (Bridge B was Designated Bridge in our example)
To be sure, we ask other Designated Bridges (over our other blocked ports and the root port) which bridge they think the root is
    Using Root Link Query (RLQ) BPDU
If at least one reply contains the "old" root bridge, we know that an indirect link failure occurred
    Immediately expire Max Age timer and enter Listening state

BackboneFast - Configuration
Switch(config)# spanning-tree backbonefast
Switch# show spanning-tree backbonefast
BackboneFast is enabled

BackboneFast statistics
-----------------------
Number of transition via backboneFast (all VLANs) : 0
Number of inferior BPDUs received (all VLANs) : 0
Number of RLQ request PDUs received (all VLANs) : 0
Number of RLQ response PDUs received (all VLANs) : 0
Number of RLQ request PDUs sent (all VLANs) : 0
Number of RLQ response PDUs sent (all VLANs) : 0

Other STP Tuning Options
BPDU Guard
    Shuts down PortFast-configured interfaces that receive BPDUs, preventing a potential bridging loop
Root Guard
    Forces an interface to become a designated port to prevent surrounding switches from becoming the root switch
BPDU Filter
BPDU Skew Detection
    Report late BPDUs via Syslog
    Indicate STP stability issues, usually due to CPU problems
Unidirectional Link Detection (UDLD)
    Detects and shuts down unidirectional links
Loop Guard

Rapid Spanning Tree (RSTP)
IEEE 802.1D - 2004
(Formerly known as 802.1w)

Introduction
RSTP is now an add-on to the IEEE 802.1D-2004 standard
    Contains contributions from Cisco
Computation of the Spanning Tree is identical between STP and RSTP
    Conf-BPDU and TCN-BPDU still remain
    New BPDU type "RSTP" has been added
    Version=2, type=2
RSTP BPDUs can be used to negotiate port roles on a particular link
    Only done if neighbor bridge supports RSTP (otherwise only Conf-BPDUs are sent)
    Using a Proposal/Agreement handshake



Major Features
BPDUs are no longer triggered by root bridge
    Instead, each bridge can generate BPDUs independently and immediately (on-demand)
Much faster convergence
    Few seconds
Better scalability
    No network diameter limit



Compatibility
RSTP is designed to be compatible and interoperable with the traditional STP - without additional management requirements!
If an RSTP-enabled bridge is connected to an STP bridge, only Configuration-BPDUs and Topology-Change BPDUs are sent
    (No port role negotiation)
Memory requirements per bridge port independent of number of bridges
An RSTP Bridge Port automatically adjusts to provide interoperability, if it is attached to the same LAN as an STP Bridge. Protocol operation on other ports is unchanged. Configuration and Topology Change Notification BPDUs are transmitted instead of RST BPDUs, which are not recognized by STP Bridges. Port state transition timer values are increased to ensure that temporary loops are not created through the STP Bridge. Topology changes are propagated for longer to support the different Filtering Database flushing paradigm used by STP. It is possible that RSTP's rapid state transitions will increase rates of frame duplication and misordering.
BPDUs convey Configuration and Topology Change Notification (TCN) Messages. A Configuration Message can be encoded and transmitted as a Configuration BPDU or as an RST BPDU. A TCN Message can be encoded as a TCN BPDU or as an RST BPDU with the TC flag set. The Port Protocol Migration state machine determines the BPDU types used.

Basic Parameters
Bridge-ID (the lesser the better), bytes B1 to B8: priority | System ID Extension | System-ID (System ID Extension and System-ID together: 60 bits total)
Port-ID (the lesser the better), bytes B1 and B2: priority | unique identifier (not zero!)
Unit time value: 1/256 s
Bridge-ID:
The 12-bit System-ID Extension allows a different BID for every VLAN (MST, 802.1Q). For backwards compatibility, old STP implementations could use a 16-bit priority value but may only set the 4 most significant bits; the remaining 12 must be zero:
MSByte 1            2           3 ...
MSB                 LSB
xxxx 0000 0000 0000
Allowed values: 0, 4096, 8192, ... , 61440, but I think the little Endian interpretation 0..15 will be used(?)
Port-ID:
In the old standard, 8 bits priority + 8 bits unique identifier were used.
Unit time value:
For all timer values (2 bytes) the unit is 1/256 second, which allows a range from 0 to 65535 * 1/256 = approx. 256 seconds.

BPDU Types (Old and New)
Configuration BPDU and RST BPDU (byte positions):

Bytes   Field              Remark
1-2     Protocol ID        all set to zero means RSTP but also STP!
3       Protocol Version   RSTP BPDU: 0000 0010
4       BPDU Type          RSTP BPDU: 0000 0010
5       Flags              RST BPDU: TCAck, agree, fwd, learn, Port Role, prop, TCN
6-13    Root Bridge ID     ID of bridge believed to be the root by the transmitter
14-17   Root Path Cost
18-25   Bridge ID          of transmitting bridge
26-27   Port ID            of the Port through which the message was transmitted
28-29   Message Age        must be less than Max Age
30-31   Maximum Age        20 seconds
32-33   Hello Time         2 seconds
34-35   Forward Delay      15 seconds
36      Version 1 Length   RST BPDU only; 0000 0000 indicates that there is no Version 1 protocol information present

Port Role (in Flags):
    0 0 = Unknown
    0 1 = Alternate or Backup
    1 0 = Root
    1 1 = Designated
Topology Change BPDU: Protocol ID, Protocol Version, BPDU Type (1 byte) = 1000 0000
NOTE: The RST BPDU replaces the Configuration BPDU and the Topology Change BPDU
Flags:
TCN (bit 1)
Proposal (bit 2)
Port Role (bits 3, 4)
Learning (bit 5)
Forwarding (bit 6)
Agreement (bit 7)
Topology Change Acknowledgment (bit 8)
Note: A Configuration BPDU has the same structure as an RST BPDU with the following exceptions:
1) A Configuration BPDU is only 35 bytes long, that is, there is no "Version 1 Length" field
2) A Configuration BPDU only uses two flags, that is, TCAck (bit 8) and TCN (bit 1)
NOTE: If the Unknown value of the Port Role parameter is received, the state machines will effectively treat the RST BPDU as if it were a Configuration BPDU.
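A flags byte of an RST BPDU can be decoded with a few bit masks. The sketch below counts bit 1 as the least significant bit (as in the flag list above) and reads the two Port Role bits as a number from 0 to 3; the function name and dictionary keys are illustrative.

```python
PORT_ROLES = {0: "Unknown", 1: "Alternate or Backup", 2: "Root", 3: "Designated"}


def decode_rst_flags(flags: int) -> dict:
    """Split the 1-byte Flags field of an RST BPDU into its components."""
    return {
        "topology_change": bool(flags & 0x01),      # bit 1
        "proposal": bool(flags & 0x02),             # bit 2
        "port_role": PORT_ROLES[(flags >> 2) & 3],  # bits 3 and 4
        "learning": bool(flags & 0x10),             # bit 5
        "forwarding": bool(flags & 0x20),           # bit 6
        "agreement": bool(flags & 0x40),            # bit 7
        "topology_change_ack": bool(flags & 0x80),  # bit 8
    }


# 0x0E = 0000 1110: Designated port sending a Proposal, not yet learning or forwarding.
print(decode_rst_flags(0x0E))
# 0x3C = 0011 1100: Designated port in Learning and Forwarding.
print(decode_rst_flags(0x3C))
```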

Same simple basic rules
Bridge with lowest BID becomes Root Bridge
    Has only Designated Ports
Every other bridge has exactly one Root Port
    Providing a least cost path to the Root Bridge
    Local tie-breaker is the Port Identifier
A Designated Bridge provides the lowest Root Path Cost for a LAN
    Tie-breaker between multiple bridges is BID
    Local tie-breaker is the Port Identifier


Every Bridge has a Root Path Cost associated with it. For the Root Bridge this is zero. For all other Bridges, it is the sum of the Port Path Costs on the least-cost path to the Root Bridge.
If a Bridge has two or more ports with the same Root Path Cost, then the port with the best Port Identifier is selected as the Root Port.
The Bridge providing the lowest Root Path Cost for a LAN is called the Designated Bridge for that LAN. If there are two or more Bridges with the same Root Path Cost, then the Bridge with the best priority (lowest numerical value) is selected as the Designated Bridge.
Since each Bridge provides connectivity between its Root Port and its Designated Ports, the resulting active topology connects all LANs (is "spanning") and will be loop-free (is a "tree").
Any operational Bridge Port that is not a Root or Designated Port is a Backup Port if that Bridge is the Designated Bridge for the attached LAN, and an Alternate Port otherwise. Backup Ports exist only where there are two or more connections from a given Bridge to a given LAN.
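The election rules above boil down to "lowest value wins, with a tie-breaker", which maps naturally onto tuple comparison. The following is a minimal sketch under simplifying assumptions: bridge and port identifiers are plain integers, and the names PortInfo, elect_root_bridge, select_root_port, and select_designated_bridge are hypothetical.

```python
from typing import NamedTuple


class PortInfo(NamedTuple):
    root_path_cost: int  # cost of reaching the Root Bridge through this port
    port_id: int         # local Port Identifier, used as tie-breaker


def elect_root_bridge(bridge_ids):
    """The bridge with the lowest BID becomes the Root Bridge."""
    return min(bridge_ids)


def select_root_port(ports):
    """Lowest Root Path Cost wins; equal costs are decided by the Port Identifier."""
    return min(ports)  # tuples compare field by field: cost first, then port_id


def select_designated_bridge(candidates):
    """candidates: (root_path_cost, bid) pairs of the bridges attached to one LAN."""
    return min(candidates)[1]
```

With two ports of equal cost 19 and Port Identifiers 0x8002 and 0x8001, select_root_port picks the port with identifier 0x8001 (the cost values here are just sample numbers).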

Backup and Alternate Ports
If a port is neither Root Port nor Designated Port
  It is a Backup Port if this bridge is a Designated Bridge for that LAN
  Or an Alternate Port otherwise
[Figure: Backup and Alternate Ports. Example topology with port labels DP, RP, BP, and AP.]

Port Types
Shared Ports
  Are not supported (ambiguous negotiations)
  Standard STP is used here
Point-to-point ports
  Usual and required port type
  Support the proposal-agreement process
Edge Ports
  Hosts reside here
  Transition directly to the Forwarding port state, since there is no possibility of them participating in a loop
  May change their role as soon as a BPDU is seen

Algorithm Overview
Designated Ports transmit Configuration BPDUs periodically to detect and repair failures
  Blocking (aka Discarding) ports send Conf-BPDUs only upon topology change
Every Bridge accepts "better" BPDUs from any Bridge on a LAN, or revised information from the prior Designated Bridge for that LAN
To ensure that old information does not endlessly circulate through redundant paths in the network and prevent propagation of new information, each Configuration Message includes a message age and a maximum age
Transition to Forwarding is now confirmed by the downstream bridge, therefore no Forward Delay is necessary!
On a given port, if hellos are not received three consecutive times, protocol information can be immediately aged out (or if max age expires). Because of the previously mentioned protocol modification, BPDUs are now used as a keep-alive mechanism between bridges. A bridge considers that it has lost connectivity to its direct neighbor root or designated bridge if it misses three BPDUs in a row. This fast aging of the information allows quick failure detection. If a bridge fails to receive BPDUs from a neighbor, it is certain that the connection to that neighbor is lost. This is opposed to 802.1D, where the problem might have been anywhere on the path to the root.
Rapid transition is the most important feature introduced by 802.1w. The legacy STA passively waited for the network to converge before it turned a port into the forwarding state. Achieving faster convergence was a matter of tuning the conservative default parameters (forward delay and max age timers) and often put the stability of the network at stake. The new rapid STP is able to actively confirm that a port can safely transition to the forwarding state without having to rely on any timer configuration. There is now a real feedback mechanism that takes place between RSTP-compliant bridges. In order to achieve fast convergence on a port, the protocol relies upon two new variables: edge ports and link type.
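The keep-alive behavior described above can be sketched as a small timer check. This is an illustration only: it assumes the default hello time of 2 seconds, takes timestamps as plain numbers supplied by the caller, and the class name NeighborInfo is invented for this example.

```python
HELLO_TIME = 2.0   # seconds, default hello time
MISSED_HELLOS = 3  # neighbor information is aged out after three missed BPDUs


class NeighborInfo:
    """Tracks when the last BPDU was received from a directly attached neighbor."""

    def __init__(self, now: float, hello_time: float = HELLO_TIME):
        self.hello_time = hello_time
        self.last_bpdu = now

    def on_bpdu(self, now: float) -> None:
        # Every received BPDU acts as a keep-alive and restarts the aging window.
        self.last_bpdu = now

    def is_aged_out(self, now: float) -> bool:
        # Three hello intervals without a BPDU: connectivity to the neighbor is lost.
        return now - self.last_bpdu >= MISSED_HELLOS * self.hello_time
```

With the default values, a neighbor that stays silent is declared lost after 6 seconds, compared with the 20-second max age of legacy STP.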

Main Differences to STP (1)
The three 802.1d states disabled, blocking, and listening have been merged into a unique 802.1w discarding state
Non-designated ports on a LAN segment are split into alternate ports and backup ports
  A backup port receives better BPDUs from the same switch
  An alternate port receives better BPDUs from another switch
In most cases, RSTP performs better than Cisco's proprietary extensions without any additional configuration. 802.1w is also capable of reverting back to 802.1d in order to interoperate with legacy bridges (thus dropping the benefits it introduces) on a per-port basis.
There is no difference between a port in blocking state and a port in listening state; they both discard frames and do not learn MAC addresses. The real difference lies in the role the spanning tree assigns to the port. It can safely be assumed that a listening port will be either a designated or a root port and is on its way to the forwarding state. Unfortunately, once in forwarding state, there is no way to infer from the port state whether the port is root or designated, which contributes to demonstrating the failure of this state-based terminology. RSTP addresses this by decoupling the role and the state of a port.
The role is now a variable assigned to a given port. The root port and designated port roles remain, while the blocking port role is now split into the backup and alternate port roles.
A non-designated port is a blocked port that receives a more useful BPDU than the one it would send out on its segment. The "more useful BPDU" can be received from the same switch (on another port on the same LAN segment) or from another switch (also on the same LAN segment). The first is called a backup port, the latter an alternate port.
The name blocking is used for the discarding state in Cisco's implementation.

Main Differences to STP (2)
BPDUs are sent every hello time, and not simply relayed anymore
  Immediate aging if three consecutive BPDUs are missing
When a bridge receives inferior information ("I am root") from its DB, it immediately accepts it and replaces the one previously stored
  But if the RB is still alive, this bridge will notify the other via BPDUs
[Figure: BackboneFast-like behavior. A bridge announces "I am root" in a BPDU; the receiving bridge answers "No, you are not! (see this BPDU)". Labels: Root, DP, RP.]

Rapid Transition Details
Basic Principle
The new rapid STP is able to actively confirm that a port can safely transition to forwarding without relying on any timer configuration
  Feedback mechanism
Edge Ports connect hosts
  Cannot create bridging loops
  Immediate transition to forwarding possible
  No more Edge Port upon receiving a BPDU
Rapid transition is only possible if the Link Type is point-to-point
  No half-duplex (= shared media)
Details
Legacy STP:
  Upon receiving a (better) BPDU on a blocked or previously disabled port, 15+15 seconds transition time are needed until the forwarding state is reached
  But received BPDUs are propagated immediately downstream: some bridges below may detect a new Root Port candidate and also require 15+15 seconds transition time
  The network in between is unreachable for 30 seconds!
NEW: Sync Operation
  Not the Root Port candidates are blocked, but the designated ports downstream; this avoids potential loops, too!
  A bridge explicitly authorizes the upstream bridge to put its Designated Port in forwarding state (sync)
  Then the sync procedure propagates downstream
[Figure: A new link below the Root Bridge creates Candidate RPs on the bridges downstream; with legacy STP the network in between is unreachable for 30 seconds.]
More Details
1) A new link is created between the root and Switch A.
2) Both ports on this link are put in a designated blocking state until they receive a BPDU from their counterpart.
3) Port p0 of the root bridge sets the "proposal bit" in the BPDU (step 1).
4) Switch A then starts a sync to ensure that all of its ports are in sync with this new information (only blocking and edge ports are currently in sync). Switch A just needs to block port p3, assigning it the discarding state (step 2).
5) Switch A can now unblock its newly selected root port p1 and reply to the root by sending an agreement message (step 3, same BPDU with agreement bit set).
6) Once p0 receives that agreement, it can immediately transition to forwarding.
7) Now port p3 will send a proposal downwards, and the same procedure repeats.
The edge port concept is already well known from Cisco's PortFast feature. Neither edge ports nor PortFast-enabled ports generate topology changes when the link toggles. Unlike PortFast, an edge port that receives a BPDU immediately loses its edge port status and becomes a normal spanning tree port.
Note: Cisco's implementation keeps the PortFast keyword for edge port configuration, thus making the transition to RSTP simpler.
RSTP can only achieve rapid transition to forwarding on edge ports and on point-to-point links. A port operating in full duplex will be assumed to be point-to-point, while a half-duplex port will be considered a shared port by default.
Sync Operation: The final network topology is reached just in the time necessary for the new BPDUs to travel down the tree. No timer has been involved in this quick convergence. The only new mechanism introduced by RSTP is the acknowledgment that a switch can send on its new root port in order to authorize immediate transition to forwarding, bypassing the twice-the-forward-delay long listening and learning stages.
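The sync step on the switch that receives a proposal can be sketched as follows. This is a heavily simplified model, not the 802.1w state machines: ports are plain dictionaries, the function name handle_proposal is hypothetical, and the port names p2 and p4 in the usage note are invented for illustration.

```python
def handle_proposal(ports: dict, root_port: str) -> str:
    """React to a proposal received on the newly selected root port.

    ports maps a port name to {"role": str, "state": str, "edge": bool}.
    Returns the message to send back upstream.
    """
    for name, port in ports.items():
        if name == root_port or port["edge"]:
            continue  # edge ports are always in sync and keep forwarding
        if port["state"] != "discarding":
            # Sync: block every non-edge port that could otherwise form a loop.
            port["state"] = "discarding"
    # All ports are in sync now, so the root port may forward immediately.
    ports[root_port]["role"] = "root"
    ports[root_port]["state"] = "forwarding"
    return "agreement"
```

For a switch with a new root port p1, an already discarding alternate port p2, a forwarding designated port p3, and an edge port p4, the call blocks only p3 and returns "agreement"; p3 would then start its own proposal towards the next switch downstream.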

Topology Change
802.1d: When a bridge detects a topology change
  A TCN is sent towards the root
  The root sends Conf-BPDUs with the TC bit set downstream (for 10 BPDUs)
  All other bridges can receive it and will reduce their bridging-table aging time to forward_delay seconds, ensuring a relatively quick flushing of stale information
RSTP: Only non-edge ports moving to the forwarding state cause a TCN
  Loss of connectivity is NOT regarded as a topology change any more
  The TCN is immediately flooded throughout the whole domain
  Every bridge flushes MAC addresses and sends the TCN upstream (RP) and downstream (DPs)
  Other bridges do the same: now the TCN process is a one-step procedure, as the TCNs do not need to reach the root first and do not require the root for re-origination downstream
[Figure: 802.1d behavior versus 802.1w behavior after a topology change (new link). In 802.1d the BPDU with the TC bit set (green) must first reach the root, which will redistribute this information through the whole network (black).]
There is no need to wait for the root bridge to be notified and then maintain the topology change state for the whole network for <max age plus forward delay> seconds. In just a few seconds (a small multiple of hello times), most of the entries in the CAM tables of the entire network (VLAN) are flushed. This approach results in potentially more temporary flooding, but on the other hand it clears potential stale information that prevents rapid connectivity restitution.
RSTP is able to interoperate with legacy STP protocols. However, it is important to note that 802.1w's inherent fast convergence benefits are lost when interacting with legacy bridges. Each port maintains a variable defining the protocol to run on the corresponding segment. A migration delay timer of three seconds is also started when the port comes up. While this timer is running, the current (STP or RSTP) mode associated with the port is locked. As soon as the migration delay has expired, the port will adapt to the mode corresponding to the next BPDU it receives. If the port changes its operating mode as a result of receiving a BPDU, the migration delay is restarted, limiting the possible mode change frequency.

RSTP Summary
IEEE 802.1w is an improvement of 802.1d
  Vendor-independent (Cisco's UplinkFast, BackboneFast, and PortFast are proprietary)
The three 802.1d states disabled, blocking, and listening have been merged into a unique 802.1w discarding state
Non-designated ports on a LAN segment are split into alternate ports and backup ports
  A backup port receives better BPDUs from the same switch
  An alternate port receives better BPDUs from another switch
Other changes:
  BPDUs are sent every hello time, and not simply relayed anymore.
  Immediate aging if three consecutive BPDUs are missing
  When a bridge receives inferior information ("I am root") from its DB, it immediately accepts it and replaces the one previously stored. If the RB is still alive, this bridge will notify the other via BPDUs.
BPDU format:
Field                  Bytes
Protocol ID            2
Version                1
Message Type           1
Flags                  1
Root ID                8
Root Path Cost         4
Bridge ID              8
Port ID                2
Message Age            2
Maximum Age = 20       2
Hello Time = 2         2
Forward Delay = 15     2
Flags (bits 0 to 7):
Bit 0       TC
Bit 1       Proposal
Bits 2-3    Port Role: 0 0 = Unknown, 0 1 = Alternate/Backup, 1 0 = Root, 1 1 = Designated
Bit 4       Learning
Bit 5       Forwarding
Bit 6       Agreement
Bit 7       TCA
Proposal, Port Role, Learning, Forwarding, and Agreement are new flags for 802.1w.
[Figure: Backup and Alternate Ports. Example topology with port labels DP, RP, BP, and AP.]
[Figure: BackboneFast-like behavior. A bridge announces "I am root" in a BPDU; the receiving bridge answers "No, you are not! (see this BPDU)". Labels: Root, DP, RP.]

Other
There is no 15-second forward delay anymore
  The TCN ensures that all tables are immediately flushed
Protection against misordering and duplication
  Port state transitions to Learning and Forwarding are delayed
  Ports can temporarily transition to the Discarding state
RSTP provides rapid recovery to minimize frame loss

Note
A bridge must first receive a BPDU from the Root Bridge before BPDUs from non-root bridges can be forwarded
Every bridge sends BPDUs periodically (by default every 2 seconds), and the neighbor bridge is declared dead when three subsequent BPDUs are missing
Upon a topology change (e.g. neighbor dead) the bridge sends BPDUs with the Proposal bit set, which triggers a recalculation of the spanning tree

Cisco Extensions: PVST(+)
Per-VLAN Spanning Tree

About
In over 70% of all enterprise networks you will encounter Cisco switches
Cisco extended STP and RSTP with a per-VLAN approach: "Per-VLAN Spanning Tree"
Advantages:
  Better (per-VLAN) topologies possible
  STP attacks only affect the current VLAN
Disadvantages:
  Interoperability problems might occur
  Resource consumption (800 VLANs means 800 STP instances)

Example
Remember that the root bridge should be at the center of the LAN
  Attracts all traffic
  Typically servers or Internet connectivity reside there
Different VLANs might have different cores
PVST+ allows for different topologies
  The admin should at least configure the ideal root bridge BID manually
[Figure: Different root bridges per VLAN: Root for VLAN 1, Root for VLAN 5, Root for VLAN 8.]

Scalability Problem
Typically the number of VLANs is much larger than the number of switches
Results in many identical topologies
In the example shown we have 400 VLANs but only three different logical topologies
  400 Spanning Tree instances
  400 times more BPDUs running over the network
[Figure: Root for VLANs 1-200, Root for VLANs 201-300, Root for VLANs 301-400.]

PVST (Classical, OLD!)
Cisco proprietary (of course)
Interoperability problems when standard CST is also used in the network (different trunking requirements)
Provides a dedicated STP for every VLAN
Requires ISL
  Inter-Switch Link (Cisco's alternative to 802.1Q)

PVST+
Today standard in Cisco switches
  Default mode
  Interoperable with CST
The PVST BPDUs are also called SSTP BPDUs
The messages are identical to the 802.1d BPDU but use SNAP instead of LLC, plus a special TLV at the end

PVST+ Protocol Details
For the native VLAN on a trunk, normal (untagged) 802.1d BPDUs are sent
  Also to the IEEE destination address 0180.c200.0000
For tagged VLANs, PVST+ BPDUs use
  SNAP, OUI = 00:00:0C, and EtherType 0x010B
  Destination address 01-00-0c-cc-cc-cd
  Plus 802.1Q tag
Additionally a "PVID" TLV field is added at the end of the frame
  This PVID TLV identifies the VLAN ID of the source port
  The TLV has the format:
    type (2 bytes) = 0x00 0x34
    length (2 bytes) = 0x00 0x02
    VLAN ID (2 bytes)
  Usually some padding is appended as well
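The TLV layout listed above (2-byte type, 2-byte length, 2-byte VLAN ID) can be packed and unpacked with Python's struct module. This is only a sketch: the default type value is copied from the slide as-is and has not been verified against captured frames, network byte order is assumed, and the function names are hypothetical.

```python
import struct

# Type value as given on the slide (0x00 0x34); treat it as an unverified assumption.
PVID_TLV_TYPE = 0x0034
PVID_TLV_LENGTH = 2  # the value part is a 2-byte VLAN ID


def build_pvid_tlv(vlan_id: int, tlv_type: int = PVID_TLV_TYPE) -> bytes:
    """Encode type, length, and VLAN ID as three big-endian 16-bit fields."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be in the range 1..4094")
    return struct.pack("!HHH", tlv_type, PVID_TLV_LENGTH, vlan_id)


def parse_pvid_tlv(data: bytes) -> tuple:
    """Return (type, vlan_id); trailing padding after the first six bytes is ignored."""
    tlv_type, length, vlan_id = struct.unpack("!HHH", data[:6])
    if length != PVID_TLV_LENGTH:
        raise ValueError("unexpected TLV length")
    return tlv_type, vlan_id
```

A receiver could compare the parsed VLAN ID with the VLAN on which the frame arrived in order to spot a native VLAN mismatch; that use is an interpretation and is not stated on the slide.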

PVST+ Compatibility Issues
PVST+ switches can act as translators between groups of Cisco PVST switches (using ISL) and groups of CST switches
  (Sent untagged over the native 802.1Q VLAN)
  BPDUs of PVST-based VLANs are practically "tunneled" over the CST-based switches using a special multicast address (the CST-based switches will forward but not interpret these frames)
Not important anymore.

MSTP

Overview
The MSTP standard also contains contributions from Cisco
Solves the cardinality mismatch between the number of VLANs and the number of useful topologies
Switches are organized in Regions
In each Region, sets of VLANs can be independently assigned to one out of 16 Spanning Tree Instances
Each Instance has its own Spanning Tree topology

Example
Compared to PVST+, only three Spanning Tree topologies (= Instances) are required
Each STP instance has 200 VLANs assigned
  Each VLAN can only be a member of one instance, of course
[Figure: Root for VLANs 1-199, Root for VLANs 200-299, Root for VLANs 300-400.]

MSTP Details
Each switch maintains its own MSTP configuration, which contains the following mandatory attributes:
  The configuration name (32 chars),
  The revision number (0..65535),
  The element table which specifies the VLAN-to-Instance mapping
All switches in a Region must have the same attributes

Regions
The bridges check attribute equivalence via a digest contained in the BPDUs
  Note that the attributes must be configured manually and are NOT communicated via the BPDUs
If the digest does not match, then we have a region boundary port
Regions are only interconnected by the Common Spanning Tree (CST)
  Instance 0
  Uses traditional 802.1d STP
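The idea of comparing a digest instead of the full configuration can be sketched as below. This is a simplified illustration, not the digest defined by the standard: it hashes all three attributes together with plain MD5, assumes unmapped VLANs belong to instance 0, and the names region_digest and is_region_boundary are hypothetical. An actual MSTP implementation defines its own digest computation, which this sketch does not reproduce.

```python
import hashlib


def region_digest(name: str, revision: int, vlan_to_instance: dict) -> str:
    """Condense configuration name, revision number, and VLAN mapping into one digest."""
    if len(name) > 32:
        raise ValueError("configuration name is limited to 32 characters")
    if not 0 <= revision <= 65535:
        raise ValueError("revision number must be in the range 0..65535")
    # One byte per VLAN ID 0..4095; VLANs without an entry stay in instance 0.
    table = bytes(vlan_to_instance.get(vlan, 0) for vlan in range(4096))
    h = hashlib.md5()
    h.update(name.encode("ascii").ljust(32, b"\x00"))
    h.update(revision.to_bytes(2, "big"))
    h.update(table)
    return h.hexdigest()


def is_region_boundary(local_digest: str, received_digest: str) -> bool:
    """A mismatch means the neighbor is in another region (region boundary port)."""
    return local_digest != received_digest
```

Two switches configured with the same name, revision, and mapping produce the same digest; changing any single attribute, for example bumping the revision number, yields a different digest and therefore a region boundary.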



Region Example
Only the logical STP topologies are shown (not the physical links)
Each region has internal STP instances (red and blue)
One CST instance interconnects all regions (black)
[Figure: Regions A, B, and C interconnected by the CST; one bridge is marked as Root Bridge for the CST (i.e. for the whole region).]

Note
When enabling MSTP, by default the CST (instance zero) has all VLANs assigned
Each region must be MSTP-aware
  Since only a subset of VLANs is assigned to the CST
  Old-STP switches always create a general (all-VLAN) topology
  Don't let an MSTP-unaware switch become root bridge

Any Questions?

THE ANSWER IS ... FORTY-TWO!
From Rich Seifert's Switch Book
The choice of 0x42 as the LLC SAP value for BPDUs has an interesting history. First, the chair and editor of the IEEE 802.1D Task Force (Mick Seaman) was British, and 42 is "The Answer to the Ultimate Question of Life, the Universe, and Everything" in The Hitchhiker's Guide to the Galaxy, a popular British book, radio, and television series [by Douglas Adams] at the time of the development of the original standard.
Even in the United States, the series was so popular that the original Digital Equipment Corp. bridge architecture specification was titled eXtended LAN Interface Interconnect, or XLII, the Roman representation of 42.
Rich Seifert also continues:
"Finally, 0x42 is a palindrome; it has the same binary pattern regardless of whether one transmits the most-significant bit first or the least-significant bit first (01000010). This eliminates any confusion regarding bit ordering of the field when transmitted on Little Endian (e.g. Ethernet) versus Big Endian (e.g. Token Ring) networks, although this side benefit was not recognized until after the value was assigned."
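The palindrome property is easy to check by reversing the bit order of the byte. A minimal sketch (the helper name reverse_bits is made up for this example):

```python
def reverse_bits(value: int, width: int = 8) -> int:
    """Return value with the order of its lowest `width` bits reversed."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)  # append the current lowest bit
        value >>= 1
    return result


def is_bit_palindrome(value: int, width: int = 8) -> bool:
    """True if the bit pattern reads the same MSB-first and LSB-first."""
    return reverse_bits(value, width) == value
```

is_bit_palindrome(0x42) holds because 01000010 reads the same in both directions, whereas a value such as 0xAA (10101010) turns into 0x55 when the bit order is reversed.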

2010/02/15 {C} Herbert Haas
The Ethernet Evolution
The 180 Degree Turn

"Use common sense in routing cable. Avoid wrapping coax around sources of strong electric or magnetic fields. Do not wrap the cable around fluorescent light ballasts or cyclotrons, for example."
Ethernet Headstart Product Information and Installation Guide, Bell Technologies, pg. 11

History: Initial Idea
Shared media: CSMA/CD as access algorithm
Coax cables
Half-duplex communication
Low latency: no networking nodes (except repeaters)
One collision domain and also one broadcast domain
[Figure: 10 Mbit/s shared by 5 hosts: 2 Mbit/s each!]
The initial idea of Ethernet was completely different from what is used today under the term "Ethernet". The original new concept of Ethernet was the use of a shared medium and an Aloha-based access algorithm, called Carrier Sense Multiple Access with Collision Detection (CSMA/CD). Coaxial cables were used as the shared medium, allowing a simple coupling of stations to a bus-like topology.
Coax cables were used in baseband mode, thus allowing only one transmission at a time. Therefore, CSMA/CD was used to let Ethernet operate in the presence of frequent collisions.
Another important point: no intermediate network devices should be used, in order to keep latency as small as possible. Soon repeaters were invented and remained the only exception for a while.
An Ethernet segment is a coax cable, possibly extended by repeaters. The segment constitutes one collision domain (only one station may send at a time) and one broadcast domain (every station receives the frame currently sent). Therefore, the total bandwidth is shared among the devices attached to the segment. For example, 10 devices attached means that each device can send 1 Mbit/s of data on average.
Ethernet technologies at that time (1975 to the 1980s): 10Base2 and 10Base5

History: Multiport Repeaters
Demand for structured cabling (voice-grade twisted pair)
  10BaseT (Cat3, Cat4, ...)
Multiport repeater ("Hub") created
Still one collision domain ("CSMA/CD in a box")
Later, Ethernet devices supporting structured cabling were created in order to reuse the voice-grade twisted-pair cables already installed in buildings. 10BaseT was specified to support Cat3 cables (voice grade) or better, for example Cat4 (and today Cat5, Cat6, and Cat7).
Hub devices were necessary to interconnect several stations. These hub devices were basically multiport repeaters, simulating the half-duplex coax cable, which is known as "CSMA/CD in a box". Logically, nothing has changed: we still have one single collision and broadcast domain.
Note that the Ethernet topology became star-shaped.

History: Bridges
Store and forward according to the destination MAC address
Separated collision domains
Improved network performance
Still one broadcast domain
[Figure: Three collision domains in this example!]
Bridges were invented for performance reasons. It seemed impractical that each additional station reduces the average per-station bandwidth, which is only 1/n of the total for n stations. On the other hand, the benefit of sharing a medium for communication should still be maintained (which was expressed by Metcalfe's law).
Bridges are store-and-forward devices (introducing significant delay) that can filter traffic based on the destination MAC addresses to avoid unnecessary flooding of frames to certain segments. Thus, bridges segment the LAN into several collision domains. Broadcasts are still forwarded to allow layer 3 connectivity (ARP etc.), so the bridged network is still a single broadcast domain.

History: Switches
Switch = multiport bridge with HW acceleration
Full duplex: collision-free Ethernet, no CSMA/CD necessary anymore
Different data rates supported at the same time
  Autonegotiation
VLANs split the LAN into several broadcast domains
[Figure: A switch with ports running at 10 Mbit/s, 100 Mbit/s, 100 Mbit/s, and 1000 Mbit/s. Collision-free, plug and play, scalable Ethernet!]
Several vendors built advanced bridges, which are partly or fully implemented in hardware. The introduced latency could be dramatically lowered, and furthermore other features were introduced, for example full-duplex communication on twisted-pair cables, different data rates on different ports, special forwarding techniques (e.g. cut-through or fragment-free), Content Addressable Memory (CAM) tables, and much more. Of course marketing rules demand another designation for this machine: the switch was born.
Suddenly, a collision-free plug and play Ethernet was available. Simply use twisted-pair cabling only and enable autonegotiation to automatically determine the line speed on each port (of course manual configuration would also do). This way, switched Ethernet became very scalable.
Furthermore, Virtual LANs (VLANs) were invented to split the LAN into several broadcast domains. VLANs improve security and utilization, and allow for logical borders between workgroups.

Today
No collisions: no distance limitations!
Gigabit Ethernet becomes a WAN technology!
  Over 100 km link span already
Combine several links to "Etherchannels"
  Link Aggregation Control Protocol (LACP, IEEE 802.3ad)
  Cisco proprietary: Port Aggregation Protocol (PAgP)
  HP: Mesh (like L2 routing over 5-8 hops)
[Figure: An Etherchannel between two switches and Ethernet as WAN technology over a 1 Gbit/s or even 10 Gbit/s long-reach connection. Note: Spanning Tree regards the Etherchannel as one logical link => load balancing!]
Today, Gigabit and even 10 Gigabit Ethernet is available. Only twisted-pair and, more and more, fiber cables are used between switches, allowing full-duplex collision-free connections. Since collisions cannot occur anymore, there is no need for a collision window anymore! From this it follows that there is virtually no distance limit between any two Ethernet devices.
Recent experiments demonstrated the interconnection of two Ethernet switches over a span of more than 100 km! Thus Ethernet became a WAN technology! Today, many carriers use Ethernet instead of ATM/SONET/SDH or other rather expensive technologies. GE and 10GE are relatively cheap and much simpler to deploy. Furthermore, they easily integrate into existing low-rate Ethernet environments, allowing a homogeneous interconnection between multiple Ethernet LAN sites. Basically, the deployment is plug and play.
If the link speed is still too slow, so-called "Etherchannels" can be configured between any two switches by combining several ports into one logical connection. Note that it is not possible to deploy parallel connections between two switches without an Etherchannel configuration, because the Spanning Tree Protocol (STP) would cut off all redundant links.
Depending on the vendor, up to eight ports can be combined to constitute one "Etherchannel".

What About Gigabit Hubs?
Would limit the network diameter to 20-25 meters (Gigabit Ethernet)
Solutions
  Frame Bursting
  Carrier Extension
No GE hubs available on the market today: forget it!
No CSMA/CD defined for 10GE (!)
Remember: Hubs simulate a half-duplex coaxial cable inside, hence limiting the total network diameter. For Gigabit Ethernet this limitation would be about 25 meters, which is rather impractical for professional usage. Although some countermeasures have been specified in the standard, such as frame bursting and carrier extension, no vendor has developed a GE hub as of today. Thus: forget GE hubs!
The original 10 GE specification (IEEE 802.3ae) considers neither copper connections nor hubs; it runs over fiber only.
At this point please remember the initial idea in the mid 1970s: bus, CSMA/CD, short distances, no network nodes.
Today: structured cabling (point-to-point or star), never CSMA/CD, WAN capabilities, sophisticated switching devices in between.

MAC Control Frames
Additional functionality easily integrated
  Currently only the Pause frame is supported
Frame format:
Field                  Bytes
Preamble               8
DA                     6
SA                     6
8808h (type)           2
MAC-ctrl opcode        2
MAC-ctrl parameters    44
FCS                    4
Always 64 bytes (DA through FCS)
MAC-ctrl opcode ........... Defines the function of the control frame
MAC-ctrl parameters .... Control parameter data (always filled up to 44 bytes)
Different data rates between switches (and different performance levels) often lead to congestion conditions, full buffers, and frame drops. Traditional Ethernet flow control was only supported on half-duplex links, by forcing collisions to occur and thereby triggering the truncated exponential backoff algorithm. Just let a collision occur and the aggressive sender will be silent for a while.
A much finer method is to send some dummy frames just before the backoff timer allows sending. This way the other station never gets a chance to send again. Both methods are considered ugly and only work on half-duplex lines.
Therefore the MAC Control frames were specified, allowing for active flow control. Now the receiver sends this special frame, notifying the sender to be silent for N slot times.
The MAC Control frame originates in a new Ethernet layer, the MAC Control Layer, and will also support other functionalities, but currently only the "Pause" frame has been specified.
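Building such a Pause frame by hand shows how the fields add up to the fixed frame size. This is a sketch under a few assumptions that are not taken from the slide: the destination address 01-80-C2-00-00-01 and the opcode 0x0001 are the values commonly documented for IEEE 802.3x Pause frames, the pause time is a 16-bit count in the first two parameter bytes, and the function name build_pause_frame is hypothetical. Preamble and FCS are left to the hardware.

```python
import struct

PAUSE_DA = bytes.fromhex("0180c2000001")  # reserved multicast address for Pause frames
MAC_CONTROL_ETHERTYPE = 0x8808            # type field of every MAC Control frame
PAUSE_OPCODE = 0x0001                     # the only opcode discussed here


def build_pause_frame(src_mac: bytes, pause_time: int) -> bytes:
    """Return DA + SA + type + opcode + 44 parameter bytes (FCS not included)."""
    if len(src_mac) != 6:
        raise ValueError("source MAC address must be 6 bytes long")
    if not 0 <= pause_time <= 0xFFFF:
        raise ValueError("pause time must fit into 16 bits")
    # The parameter field is always filled up to 44 bytes with zeros.
    params = struct.pack("!H", pause_time).ljust(44, b"\x00")
    header = PAUSE_DA + src_mac + struct.pack("!HH", MAC_CONTROL_ETHERTYPE, PAUSE_OPCODE)
    return header + params
```

The result is 60 bytes long; together with the 4-byte FCS appended by the NIC this gives the 64 bytes stated above. A pause time of 0 is commonly used to tell the sender that it may resume immediately.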

Auto Negotiation
EnabIes each two Ethernet devices to
exchange information about their
capabiIities

SignaI rate, CSMA/CD, haIf- or fuII-dupIex


Using Link-Integrity-Test-PuIse-Sequence

NormaI-Link-PuIse (NLP) technique is used


in 10BaseT to check the Iink state (green LED)

10 Mbit/s LAN devices send every 16.8 ms a


100ns Iasting NLP, no signaI on the wire
means disconnected
Several Ethernet operating modes had been deIined, which are incompatible to
each other, including diIIerent data rates (10, 100, 1000 Mbit/s), halI or Iull
duplex operation, MAC control Irames capabilities, etc.
Original Ethernet utilized so-called Normal Link Pulses (NLPs) to veriIy layer
2 connectivity. NLPs are single pulses which must be received periodically
between regular Irames. II NLPs are received, the green LED on the NIC is
turned on.
Newer Ethernet cards realize auto negotiation by sending a sequence oI NLPs,
which is called a Fast Link Pulse (FLP) sequence.

Fast Link Pulses
Modern Ethernet NICs send bursts of Fast-Link-Pulses (FLP) consisting of
17-33 NLPs for autonegotiation signalling
Each representing a 16 bit word
GE sends several "pages"
A series of FLPs constitutes an autonegotiation frame. The whole frame
consists of 33 timeslots, where each odd-numbered timeslot contains a real
NLP and each even-numbered timeslot either contains an NLP or is empty,
representing 1 or 0. Thus, each FLP sequence carries a 16-bit word.
Note that Gigabit Ethernet sends several such "pages".
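The timeslot scheme can be illustrated with a small Python sketch. It only models which of the 33 slots carry a pulse; timing is ignored, and sending the least significant bit first is an assumption made for this example. The function names are hypothetical.

```python
def flp_burst(word: int) -> list[int]:
    """Map a 16-bit word onto 33 timeslots: 1 = pulse present, 0 = slot empty."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("word must fit in 16 bits")
    slots = []
    for bit in range(16):
        slots.append(1)                   # odd-numbered slot: always a real NLP
        slots.append((word >> bit) & 1)   # even-numbered slot: data bit (LSB first, assumed)
    slots.append(1)                       # slot 33: final pulse
    return slots

def decode_flp(slots: list[int]) -> int:
    """Recover the 16-bit word from the even-numbered timeslots."""
    return sum(bit << i for i, bit in enumerate(slots[1::2]))

burst = flp_burst(0x8001)
print(len(burst), sum(burst))  # number of timeslots, number of pulses
```

An all-zero word yields 17 pulses and an all-ones word 33, which matches the "17-33 NLPs" range on the slide.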

100 Mbit Ethernet Overview
[Diagram: IEEE 802.3u signaling schemes ("100BaseT"): Fast Ethernet 100BaseX
signaling (100BaseTX, 100BaseFX) and Fast Ethernet 100Base4T+ signaling
(100BaseT4, half duplex). IEEE 802.12 demand priority: 100VG-AnyLAN, HP and
AT&T invention for real-time applications.]
The diagram above gives an overview of 100 Mbit/s Ethernet technologies,
which are divided into the IEEE 802.3u and IEEE 802.12 standards. IEEE 802.3u
defines the widely used Fast Ethernet variants, most importantly those
utilizing the 100BaseX signaling scheme. 100BaseX signaling involves several
details, but basically it uses 4B5B block coding over only two pairs of
regular Cat 5 twisted pair cable or over two strands of 50/125 or
62.5/125 µm multimode fiber-optic cable.
100Base4T+ signaling was specified to support 100 Mbit/s over Cat 3
cables. This mode allows half-duplex operation only and uses an 8B6T code
over 4 pairs of wires: one pair for collision detection, three pairs for data
transmission. One unidirectional pair is used for sending only and two
bidirectional pairs for both sending and receiving.
The 100VG-AnyLAN technology was created by HP and AT&T in 1992
to support deterministic medium access for real-time applications. This
technology was standardized by the IEEE 802.12 working group. The access
method is called "demand priority". 100VG-AnyLAN supports voice grade
cables (VG) but requires special hub hardware. The 802.12 working group is
no longer active.

4B/5B Coding
[Diagram: 4B/5B encoder/decoder in the PCS, between MII and PMA. MII side:
4 x 25 Mbit/s, 16 code groups (bit pattern 1 0 0 0 shown). PMA side:
125 MBaud, 32 code groups (bit pattern 0 1 0 1 0 shown).]
The diagram above shows the basic principle of 4B5B block coding, which is
used by 802.3u and also by FDDI. The basic idea is to transform any arbitrary
4-bit word into a (relatively) balanced 5-bit word. This is done by a fast
table lookup.
Balancing the code has many advantages: better bandwidth utilization, better
laser efficiency (constant temperature), better bit synchronization (PLL), etc.
Note that the signaling overhead is 5/4, that is 25 %: 125 MBaud are needed
on the line to carry 100 Mbit/s.
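The table lookup can be sketched as follows. The sixteen data code groups below are the ones commonly documented for FDDI and 100BaseX; they are not printed on the slide, so treat the table as an assumption and check it against the standard before relying on it. The nibble order (low nibble first) is likewise an assumption, and `encode_4b5b` is a hypothetical helper name.

```python
# Index = 4-bit data word, value = 5-bit code group (assumed table, see lead-in).
FOUR_B_FIVE_B = [
    "11110", "01001", "10100", "10101",
    "01010", "01011", "01110", "01111",
    "10010", "10011", "10110", "10111",
    "11010", "11011", "11100", "11101",
]

def encode_4b5b(data: bytes) -> str:
    """Encode each byte as two 5-bit code groups, returned as a bit string."""
    out = []
    for byte in data:
        out.append(FOUR_B_FIVE_B[byte & 0x0F])  # low nibble first (assumed order)
        out.append(FOUR_B_FIVE_B[byte >> 4])
    return "".join(out)

line_bits = encode_4b5b(b"\x00\xff")
print(line_bits, len(line_bits))  # 16 data bits become 20 line bits
```

Every code group in this table contains at least two ones, so even a long run of zero data bytes still produces transitions for the receiver's PLL.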

Gigabit Ethernet
[Diagram: Media Access Control (MAC) on top of the Gigabit Media Independent
Interface (GMII). IEEE 802.3z physical layer: 1000Base-X 8B/10B
encoder/decoder, serving 1000Base-LX (LWL fiber optic), 1000Base-SX (SWL
fiber optic), and 1000Base-CX (shielded balanced copper). IEEE 802.3ab
physical layer: 1000Base-T encoder/decoder, serving 1000Base-T (UTP Cat 5e).]
Gigabit Ethernet was defined in March 1996 by the working group IEEE
802.3z. The GMII represents an abstract interface between the common
Ethernet layer 2 and the different signaling layers below. Two important
signaling techniques were defined: the 802.3z standard defines 1000Base-X
signaling, which uses 8B10B block coding, and the 802.3ab standard uses
1000Base-T signaling. The latter is only used over twisted pair cables (UTP
Cat 5 or better), while 1000BaseX is only used over fiber, with one exception:
the twinax cable (1000BaseCX), which is basically a shielded twisted pair
cable.

GE Signaling
[Diagram: the IEEE 802.3 Ethernet stack (802.2 LLC, 802.3 CSMA/CD, 802.3 PHY)
and the ANSI X3T11 Fibre Channel stack (FC-4 upper layer mapping, FC-3 common
services, FC-2 signalling, FC-1 encoder/decoder, FC-0 interface and media)
are combined into the IEEE 802.3z Gigabit Ethernet stack: IEEE 802.2 LLC,
CSMA/CD or full duplex MAC, Reconciliation Sublayer, PHY (PCS, PMA, PMD).]
Gigabit Ethernet layers were defined by adapting the LLC and MAC
layers of classical Ethernet and the physical layers of the ANSI Fibre Channel
technology. A so-called reconciliation layer is used in between for seamless
interoperation. The physical layer of the Fibre Channel technology uses
8B10B block coding.

GE 8B/10B Coding
[Diagram: 8B/10B encoder/decoder in the PCS, between GMII and PMA, only used
by 1000BaseX. GMII side: 8 x 125 Mbit/s, 256 code groups. PMA side:
1250 Mbaud, 125 million code groups per second, 1024 code groups.]
8B10B block coding is very similar to 4B5B block coding but aims at fully
balanced 10-bit codewords. Actually, there are not enough balanced 10-bit
codewords available. Note that there are 256 8-bit codewords which need to be
mapped onto 1024 10-bit codewords. Instead of using a fully balanced 10-bit
codeword for each 8-bit codeword, some 8-bit codewords are represented
by two 10-bit codewords, which are sent in an alternating manner. That is, both
associated 10-bit words are bit-complementary.
Again, the signaling overhead is 10/8, that is 25 %: 1250 Mbaud are necessary
to transmit a bit stream of 1000 Mbit/s.
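The claim that there are not enough balanced codewords is easy to check with a short calculation. The sketch below only counts 10-bit words with exactly five ones; the real code table also rejects some words for other reasons (for example long runs), which is not modeled here.

```python
from math import comb

# Number of 10-bit words with exactly five ones and five zeros.
balanced = comb(10, 5)
data_words = 2 ** 8

def disparity(codeword: str) -> int:
    """Ones minus zeros in a codeword given as a bit string."""
    return codeword.count("1") - codeword.count("0")

# Line rate needed for 1000 Mbit/s when every 8 data bits become 10 line bits.
line_rate_mbaud = 1000 * 10 // 8

print(balanced, data_words, line_rate_mbaud)
```

Because fewer balanced words exist than data words, some data words must use a pair of complementary unbalanced codewords, as described above; the disparity of such a pair cancels when the two are alternated.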

1000BaseX
Two different wavelengths supported
Full duplex only
1000Base-SX: short wave, 850 nm MMF
1000Base-LX: long wave, 1300 nm MMF or SMF
1000Base-CX:
Twinax cable (high quality 150 Ohm balanced shielded copper cable)
About 25 m distance limit, DB-9 or the newer HSSDC connector
Gigabit Ethernet can be transmitted over various types of fiber. Currently (at
least) two types are specified, short and long wave transmission, using 850 nm
and 1300 nm respectively. The long wave can be used with both single-mode
(SMF) and multimode fibers (MMF). Only SMF can be used for WAN
transmission because of its much lower dispersion.
Note that there are several other implementations offered by different vendors,
such as using very long wavelengths at 1550 nm together with DWDM
configurations.
The twinax cable is basically a shielded twisted pair cable.

1000BaseT
Defined by 802.3ab task force
UTP
Uses all 4 line pairs simultaneously for duplex transmission! (echo cancellation)
5 level PAM coding
4 levels encode 2 bits + extra level used for Forward Error Correction (FEC)
Signal rate: 4 x 125 Mbaud = 4 x 250 Mbit/s data rate
Cat. 5 links, max 100 m; all 4 pairs, cable must conform to the requirements
of ANSI/TIA/EIA-568-A
Only 1 CSMA/CD repeater allowed in a collision domain
It is very difficult to transmit Gigabit speeds over unshielded twisted pair
cables. Only a mix of multiple transmission techniques ensures that this high
data rate can be transmitted over a UTP Cat 5 cable. For example, all 4 pairs
are used together for both directions. Echo cancellation ensures that the
transmitted signal does not interfere with the received signal. 5-level PAM is
used for encoding instead of 8B10B because of its much lower symbol rate. Now
we have only 125 Mbaud x 4 instead of 1250 Mbaud.
The interface design is very complicated and therefore relatively expensive.
Using Cat 6 or Cat 7 cables allows 500 Mbaud x 2 pairs, that is, 2 pairs are
designated for TX and the other 2 pairs are used for RX. This dramatically
reduces the price but requires better cables, which are not really expensive but
slightly thicker. Legacy cable ducts might be too small in diameter.

Several Physical Media Supported
[Diagram: four PHY variants below the Data Link Layer (Logical Link Control
LLC, optional MAC Control, Media Access Control MAC).
1-10 Mbit/s: PLS, AUI, PMA (MAU), MDI, medium.
10 Mbit/s: Reconciliation, MII, PLS, AUI, PMA, MDI, medium.
100 Mbit/s: Reconciliation, MII, PCS, PMA, PMD, MDI, medium.
1000 Mbit/s: Reconciliation, GMII, PCS, PMA, PMD, MDI, medium.]
AUI Attachment Unit Interface, PLS Physical Layer Signaling, MDI Medium Dependent Interface,
PCS Physical Coding Sublayer, MII Media Independent Interface, GMII Gigabit Media Independent
Interface, PMA Physical Medium Attachment, MAU Medium Attachment Unit, PMD Physical Medium
Dependent
The diagram above shows various physical media designs supported by the
official GE standard. Each modern GE card could theoretically support the old
10 Mbit/s standard as well. However, many vendors create GE NICs that only
support GE, or GE and FE. Who would connect a precious GE interface to
another interface that is 100 times slower?

10 Gigabit Ethernet / IEEE 802.3ae
Only optical support
850 nm (MM) / 1310 nm / 1550 nm (SM only)
No copper PHY anymore!
Different implementations at the moment - standardization not finished!
8B/10B (IBM), SONET/SDH support, ...
XAUI ("Zowie") instead of GMII
10 GE only supports optical links. Note that GE is actually a synchronous
protocol! There is no statistical multiplexing done at the physical layer
anymore, because optical switching at that bit rate only allows synchronous
transmission.
The GMII has been replaced (or enhanced) by the so-called XAUI, known as
"Zowie".
Note: At the time of writing this module, the 10 GE standard was not fully
finished. However, some vendors already offer 10 GE interface cards for their
switches.
These interfaces are very expensive, but the investment ensures backward
compatibility with lower Ethernet rates and at the same time provides a very
high speed WAN interface.
An alternative technology would be OC192, which requires a very expensive
and complex SONET/SDH environment.

Note
GE and 10GE use synchronous physical sublayer!!!
Recommendation: Don't use GE over copper wires
Radiation/EMI
Grounding problems
High BER
Thick cable bundles (especially Cat-7)
Both GE and 10GE are synchronous physical technologies on fiber. It is not
recommended to use GE over copper wires anymore, although 802.3ab
specifies it. This is because the whole electrical hardware (cables and
connectors) is reused from older Ethernet technologies and was not
designed to support such high frequencies.
For example, the RJ45 connector is not HF-proof. Furthermore, shielded twisted
pair cables require very good grounding, which is seldom found in reality. The
Bit Error Rate (BER) is typically so high that the effective data rate is much
lower than GE, for example 30 % only.

Summary
Ethernet evolved in the opposite direction:
Collision free
WAN qualified
Switched
Several coding styles, complex PHY architecture
Plug & play through autonegotiation
Much simpler than ATM but no BISDN solution - might change!

Quiz
Why does high-speed Ethernet tend toward a synchronous PHY?
Can I attach a 100 Mbit/s port to a 1000 Mbit/s port via fiber?
What is the idea of Etherchannels? (Maximum bit rate, difference to
multiple parallel links)
Q1: On fiber it is difficult to deal with asynchronous transmission; photons
cannot be buffered easily (store-and-forward problems).
Q2: No, autonegotiation on fiber does not cover data rates.
Q3: "Normal" parallel links would be disabled by STP; Etherchannel supports
up to 8 links.

1
2005/03/11 {C} Herbert Haas
The Internet Protocol (IP)
The Blood of the Internet

"Information Superhighway is really an
acronym for 'Interactive Network For
Organizing, Retrieving, Manipulating,
Accessing And Transferring Information
On National Systems, Unleashing Practically
Every Rebellious Human
Intelligence, Gratifying Hackers, Wiseacres,
And Yahoos'."
Keven Kwaku

The Internet Protocol (IP)
Introduction
IP Addressing
IP Header
IP Address Format
Address Classes
Class A - E
Subnetting, VLSM
IP Fragmentation
In this chapter we talk about the Internet Protocol (IP), especially about IP
version 4. IPv4 was standardized in September 1981 in RFC 791.
IP is a packet-switching technology on OSI layer 3. IP is connectionless and an
overlay technique. In this module we discuss fundamental questions around the IP
protocol, such as: What other (helper) protocols are necessary? What is an IP
address? What are subnetting and VLSM?

Need of an Inter-Net Protocol (1)
Different Data-Link Layer
Different frames
Different protocol handling
Different Physical Layer
Different hardware
Different signals
No interconnection possible!!!
[Diagram: three separate networks, each with its own Host 1, Host 2, and Host 3.]
Why do we need an Inter-Net Protocol? Different networks have different
data-link layers. Every network runs a different protocol. Some networks use
proprietary link layer protocols or X.25, other networks have Ethernet or HDLC.
You see, every network has its own hardware, signals, and frames. As long as they
do not want to communicate with each other, there is no problem...

Need of an Inter-Net Protocol (2)
Common internetworking layer
One packet type
Gateways terminate layer 1 and 2
Layer 3 addresses identify
Not only Host
But also Network
[Diagram: Network 1 (hosts 1.1 to 1.3), Network 2 (hosts 2.1 to 2.4), and
Network 3 (hosts 3.1 to 3.4), interconnected by two gateways.]
If we want to interconnect these networks, we need a common
internetworking layer. Network interconnections are realized with dedicated hosts
called "gateways", which include at least two different network interface cards
(NICs), each with an appropriate physical and link layer. These gateways
transport the common Inter-Net protocol (encapsulated in layer 2) and terminate
layer 1 and layer 2 on each side. In the late 1970s the IP protocol became widely
used as the Inter-Net protocol. It works on layer 3 and identifies the host and
the network using dedicated addresses.

IP Introduction (1)
Packet switching technology
Packet switch = router = "gateway" (IETF terminology)
End system is called IP host
Layer 3 address (structured)
Datagram Service
Connectionless
Best effort delivery
IP can be described by two facts: First, IP is "just another" packet
switching technology on layer 3, and the most important thing here is the
structured IP address, identifying the network and the host. Second, the type of
packet switching is connectionless, that is, there is no need to establish a
connection prior to sending packets. We call this "best effort" or "datagram"
delivery. There is no guarantee that all packets are delivered reliably.

IP Introduction (2)
Shared responsibility
Both network and hosts must take care of delivery (!)
Routers deliver datagrams to remote hosts based on IP address
Hosts responsible for end-to-end control
End-to-end control relies on TCP
Layer 4
End-to-end control is implemented in the upper layers of the IP host, by
TCP (Transmission Control Protocol, a layer 4 protocol).
TCP is a connection-oriented protocol. It takes care of flow control,
sequencing, windowing, and error recovery.

IP Introduction (4)
IP over anything: Overlay Technique
IP can be easily integrated upon layer 2 technologies
Open development quickly adapts to new transport and switching methods
End-to-end principle
Only hosts must be intelligent (TCP)
Routers remain simple
One reason for IP's success is its ability to adapt to all types of layer 2
technologies. On the one hand, the IP developers were very quick to design
convergence ("helper") protocols, for example to resolve L2/L3 addresses on
multipoint connections, or encapsulation headers for delineation on dialup or
serial links, such as PPP. On the other hand, IP is a relatively simple protocol,
and because of this it has been integrated into many different operating systems,
most importantly UNIX.
Note: IP's simplicity is based on the end-to-end philosophy. That is, the network
itself does not care about reliable transmission; only the end systems care about
error recovery. This way, the network can be kept simple.

IP Introduction (5)
TCP cares for reliability
Connection oriented
Error recovery
Flow control
Sequencing
IP is the router's language
No idea about applications
Best effort delivery
IP knows nothing about the end system applications; it only cares about networks
and host addresses. TCP carries the port number. The port number is necessary
for the host: with the port number the host knows which datagram belongs to which
application. TCP also takes care of the end-to-end issues (error recovery, flow
control, sequencing, ...).

IP Introduction (6)
Request for Comments (RFCs)
De facto standards for the Internet
Initially posted by snail mail
IETF (Internet Engineering Task Force) reviews and confirms them
RFCs are numbered in sequence of publishing
Everybody may write an RFC (!)
All ideas and standards of the Internet developers are maintained in so-called
"Request for Comments" (RFC) documents. The RFCs are freely available and
can be downloaded from several sites, for example http://www.ietf.org or
http://www.rfc-editor.org. Of course, they can also be ordered from the Network
Information Center (NIC).

Internet Organizations
[Diagram: ISOC (Internet Society), IAB, IETF, IRTF, and RARE (Reseaux Associes
pour la Recherche Europeen).]
The Internet Society (ISOC) provides leadership in addressing issues that
confront the future of the Internet, and is the organizational home for the groups
responsible for Internet infrastructure standards, including the Internet
Engineering Task Force (IETF) and the Internet Architecture Board (IAB).
The Reseaux Associes pour la Recherche Europeen (RARE) was founded in
1986 to build and maintain a European high speed data network infrastructure.
RARE is also a member of ISOC and ETSI (European Telecommunications
Standards Institute). EBONE was initiated by RARE, and RARE cooperates
closely with RIPE (Reseaux IP Europeens).
The Internet Architecture Board (IAB) is responsible for the technical direction,
coordination, and standardization of the TCP/IP technology. It was formerly
known as the Internet Activities Board, is the highest authority, and controls the
IETF and IRTF.
The Internet Engineering Task Force (IETF) is "actually" the most important
technical organization for the Internet working groups and is organized in several
areas. Area managers and the IETF chairman form the IESG (Internet Engineering
Steering Group). The IETF is also responsible for maintaining the RFCs.
The Internet Research Task Force (IRTF) coordinates and prioritizes research
groups that are controlled by the IRSG (Internet Research Steering Group).

IP Address Classes
Net-ID? Host-ID?
5 Classes defined!
A (1-127)
B (128-191)
C (192-223)
D (224-239, Multicast)
E (240-254, Experimental)
Classes define number of address bits for net-id
In the beginning of the Internet, five address classes were defined. Classes A,
B, and C were created to provide different network address ranges.
Additionally, class D is the range of IP multicast addresses, which have no
topological structure. Finally, class E was reserved for research experiments
and is not used in the Internet.
The idea of classes helps a router to decide how many bits of a given IP address
identify the network number and how many bits are therefore available for host
numbering. The usage of classes has a long tradition in the Internet and was a
main reason for IP address depletion.
The first byte (or "octet") of an IP address identifies the class. For example, the
address 205.176.253.5 is a class C address.
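As a small illustration, the class of an address can be derived from its first octet using the ranges from the slide. The function names are hypothetical; the first octets 0 and 255, which the slide leaves out, simply fall into the lowest and highest branch here.

```python
def address_class(address: str) -> str:
    """Return the classful address class (A to E) from the first octet."""
    first_octet = int(address.split(".")[0])
    if not 0 <= first_octet <= 255:
        raise ValueError("first octet out of range")
    if first_octet < 128:
        return "A"
    if first_octet < 192:
        return "B"
    if first_octet < 224:
        return "C"
    if first_octet < 240:
        return "D"
    return "E"

# Number of net-id bits implied by the class (classes D and E have no net-id).
NET_ID_BITS = {"A": 8, "B": 16, "C": 24}

def classful_prefix(address: str) -> int:
    """Net-id length in bits for a class A, B, or C address."""
    return NET_ID_BITS[address_class(address)]

print(address_class("205.176.253.5"), classful_prefix("205.176.253.5"))
```

This is exactly the decision a classful router makes: no mask is carried with the address, so the first octet alone determines where the net-id ends.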

Broadcasts and Networks
All ones in the host part represents "network broadcast"
(10.255.255.255)
All ones in the net part and host part represents "limited broadcast
in this network" (255.255.255.255)
All zeros in the host part represents the "network address" (10.0.0.0)
A network broadcast is used to send a broadcast packet to a dedicated network.
The IETF strongly discourages the use of network broadcasts, and they are not
defined for IPv6.
If a destination IP address consists of "all 1s", which can be represented in
decimal notation as "255.255.255.255", then this is recognized as a "local" or
"limited" broadcast. A limited broadcast is never forwarded by routers; otherwise
the whole Internet would be congested by "broadcast storms". Note that
broadcast addresses must not be used as source addresses.
A network is described using the "network address", which is simply its IP
address with the host part set to zero. Network addresses are used in routing
entries and routing protocols, since a router only deals with networks and does
not care about host addresses.
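The two special addresses follow directly from the bit pattern of the host part. A minimal sketch, using Python's standard `ipaddress` module only for conversion between dotted notation and integers (the function name is hypothetical):

```python
import ipaddress

def network_and_broadcast(address: str, prefix_len: int) -> tuple[str, str]:
    """Return (network address, network broadcast) for an address and net-id length."""
    addr = int(ipaddress.IPv4Address(address))
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    network = addr & mask                       # host part all zeros
    broadcast = network | (~mask & 0xFFFFFFFF)  # host part all ones
    return str(ipaddress.IPv4Address(network)), str(ipaddress.IPv4Address(broadcast))

print(network_and_broadcast("10.1.2.3", 8))
```

For the class A network 10 this reproduces the addresses on the slide: 10.0.0.0 as the network address and 10.255.255.255 as the network broadcast.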

Reserved Addresses
Address range for private use
10.0.0.0 - 10.255.255.255
172.16.0.0 - 172.31.255.255
192.168.0.0 - 192.168.255.255
RFC 1918
Network 127.x.x.x is reserved for "Loopback"
So-called RFC 1918 addresses are class A, B, and C address blocks which can
be used for internal purposes. Such addresses must not be used in the Internet.
All gateways connected to the Internet should filter packets that contain these
private addresses. Furthermore, these addresses must not be used in Internet
routing updates.
Because of those rigid filter policies, it is relatively safe to use RFC 1918
addresses in local networks: everybody in the Internet knows which addresses
must be filtered.
Each operating system provides a virtual IP interface, called the loopback
interface. By default the IP addresses 127.x.x.x are reserved for this purpose.
Initially, the idea came from the UNIX world, as IP is only one of several means
to achieve inter-process communication on a UNIX workstation. Other
methods are named/unnamed pipes, shared memory, or message queues, for
example.
When using IP for inter-process communication, the involved client/server
processes can be distributed across different servers in a network, without
any modification of the source code!
By default, a modern operating system assigns the IP address 127.0.0.1 to the
local loopback interface.
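A gateway filter for these ranges could be sketched as follows. The three address ranges from the slide are written as prefixes here (/8, /12, and /16 cover exactly those ranges); the function names are hypothetical.

```python
import ipaddress

# The three private ranges from the slide, expressed as prefixes.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),      # 10.0.0.0 - 10.255.255.255
    ipaddress.ip_network("172.16.0.0/12"),   # 172.16.0.0 - 172.31.255.255
    ipaddress.ip_network("192.168.0.0/16"),  # 192.168.0.0 - 192.168.255.255
]
LOOPBACK = ipaddress.ip_network("127.0.0.0/8")

def is_rfc1918(address: str) -> bool:
    """True if the address lies in one of the private ranges."""
    addr = ipaddress.ip_address(address)
    return any(addr in block for block in RFC1918_BLOCKS)

def is_loopback(address: str) -> bool:
    """True if the address lies in network 127.x.x.x."""
    return ipaddress.ip_address(address) in LOOPBACK

for candidate in ("10.1.2.3", "172.32.0.1", "127.0.0.1"):
    print(candidate, is_rfc1918(candidate), is_loopback(candidate))
```

The explicit block list is used instead of the module's own `is_private` attribute because that attribute covers more ranges than the three named in RFC 1918.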

Addressing Example
[Diagram: routers interconnected via Ethernet (E0, E1) and serial (S0, S1)
interfaces. Networks shown: 10.0.0.0, 172.16.0.0, 172.20.0.0, 192.168.1.0,
192.168.2.0, 192.168.3.0, and 192.168.4.0. Addresses shown include
10.0.0.1, 10.0.0.2, 10.0.0.254, 172.16.0.1, 172.16.0.2, 172.20.0.1,
172.20.0.2, 172.20.0.254, 192.168.1.1 to 192.168.1.3, 192.168.1.253,
192.168.1.254, 192.168.2.1, 192.168.2.2, 192.168.3.1, 192.168.3.2,
192.168.4.1, and 192.168.4.2.]

Classful Address Waste
Two-level hierarchy was sufficient in the early days of the Internet
The growing sizes of LANs demanded a third hierarchical level
"Subnetting" allows some bits of the host-ID to be interpreted as "Subnet"

           Total      Allocated   Allocated %
Class A    126        48          54%
Class B    16383      7006        43%
Class C    2097151    40724       2%
Network Number Statistics, April 1992 (Source: RFC 1335)
The "classful" method of identifying the network-ID of a given IP address is
inflexible and led to address space depletion. The table above shows how the
total address space had been allocated by April 1992, according to RFC 1335.
Note that only 2 % of more than 2 million class C network numbers had been
assigned. Class C networks are too small for most organizations, but class A and
B are too large. Of course many companies tried to grab a class A network number
because of the huge address space: they would never need another IP network
number again.
LANs were getting bigger and bigger, and a logical separation of an organization's
network (e.g. of a class A network number) would be a great help. Until then,
multiple network numbers had been assigned to single companies, which caused
two problems: waste of IP address space and growing Internet routing tables.
As early as 1985, RFC 950 defined a standard procedure to support subnetting of a
single class A, B, or C network number into smaller pieces. Now organizations
can deploy additional subnets without needing to obtain a new network number
from the Internet.


Subnet Zero / Subnet Broadcast
Consider network 10.0.0.0
Is it a class A net "10"?
Or do we have a subnet "10.0"?
Consider broadcast 10.255.255.255
Is it a directed broadcast for the whole net 10?
Or only for the subnet 10.255?
Subnet zero and subnet broadcast can be ambiguous!
The older routing protocols, such as RIP, relayed routes as a single 32-bit address. The high-order
bits allowed each address to be split into its network and host fields.
A simple convention was then followed. If the host field contained all 0 bits, then the address was
a network route that matched every address within that classful network, the equivalent of a /8,
/16, or /24 prefix, depending on the address class.
Any 1 bits in the host field caused it to be interpreted as a host route, matching only the exact
address specified, the equivalent of a /32 prefix. This is why the all-zeros address is reserved: it
was used by the routing protocols to match the entire classful network.
The advent of subnetting undermined this scheme, but the designers of subnetting decided against
any changes to the format of the routing protocols. This meant that there was still only a single
32-bit address to work with, though its interpretation became much more complex.
Addresses in foreign networks (classful networks not directly attached to the router processing the
information) were interpreted as before.
Addresses in local networks were processed using the subnet mask programmed into the router.
The address was first split into its three fields. If both subnet and host fields were all 0s, it was a
network route, as before. An address with 1 bits in the subnet field, but all 0 bits in the host field,
was a subnet route, matching all addresses within that subnet. Finally, addresses with 1 bits in the
host field were interpreted as host routes, as before.
This led to more reserved addresses: both the all-0s subnet and the all-0s host in each subnet
were reserved.

Subnet Example 1
"Use the class A network 10.0.0.0 and 8 bit subnetting"
1) That is: 10.0.0.0 with 255.255.0.0 (pseudo class B) or 10.0.0.0/16
2) Resulting subnetworks:
10.0.0.0        Subnet zero
10.1.0.0
10.1.0.1        First IP host in network 10.1.0.0
10.1.0.2        Second IP host in network 10.1.0.0
...
10.1.255.254    Last IP host in network 10.1.0.0
10.1.255.255    Directed broadcast for network 10.1.0.0
10.2.0.0
10.3.0.0
...
10.254.0.0
10.255.0.0      Subnet broadcast
The example above shows how to subnet a class A network, in our case network
10. Here we use a 16-bit subnet mask, allowing us to define 2^8 - 2 = 254 subnets,
because the natural subnet mask of a class A network is 8 bits in length.
The diagram above shows the total range of subnetworks including the
"forbidden" ones, that is, subnet zero and the subnet broadcast.

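The numbers in Subnet Example 1 can be reproduced with Python's standard `ipaddress` module. This is only a cross-check of the table, not part of the original material; the variable names are chosen for this sketch.

```python
import ipaddress

class_a = ipaddress.ip_network("10.0.0.0/8")
subnets = list(class_a.subnets(new_prefix=16))  # 8 subnet bits on top of the 8-bit net-id

total = len(subnets)   # all bit combinations of the subnet field
usable = total - 2     # without subnet zero and the subnet broadcast

example = subnets[1]   # 10.1.0.0/16
first_host = example.network_address + 1
last_host = example.broadcast_address - 1

print(total, usable)
print(subnets[0], subnets[-1])
print(example, first_host, last_host, example.broadcast_address)
```

The first and last entries of `subnets` correspond to the "forbidden" subnet zero (10.0.0.0) and subnet broadcast (10.255.0.0) from the table.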
Subnet Example 2
"Use the class B network 175.32.0.0 and 4 bit subnetting"
1) That is: 175.32.0.0 with 255.255.240.0 or 175.32.0.0/20
2) Resulting subnetworks:
175.32.0.0        Subnet zero
175.32.16.0
175.32.16.1       First IP host in network 175.32.16.0
175.32.16.2       Second IP host in network 175.32.16.0
...
175.32.31.254     Last IP host in network 175.32.16.0
175.32.31.255     Directed broadcast for network 175.32.16.0
175.32.32.0
175.32.48.0
...
175.32.224.0
175.32.240.0      Subnet broadcast

Variable Length Subnetting (VLSM)
Remember:
IP routing is only possible between different "IP networks"
Every link must have an IP net-ID
Today IP addresses are rare!
The assignment of IP addresses must be as efficient as possible!
[Diagram: LAN A (20 hosts) and LAN B (50 hosts) are attached via E0 to Router A
and Router B, which are connected via S0 over a WAN link. Networks shown:
192.168.1.64/26, 192.168.1.4/30, 192.168.1.32/27.]
VLSM was created in 1987. RFC 1009 defined how a subnetted network could
use more than one subnet mask. With the earlier limitation, an organization was
locked into a fixed number of fixed-size subnets. VLSM supports more efficient
use of an organization's IP address space.
A short address design history:
1981    Classful Addressing    RFC 791
1985    Subnetting             RFC 950
1987    VLSM                   RFC 1009
1993    CIDR                   RFC 1517 - 1520
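The address plan from the VLSM slide can be checked mechanically. The sketch below assumes, based on the host counts alone, that the /26 belongs to LAN B, the /27 to LAN A, and the /30 to the WAN link; the extracted diagram does not state this mapping explicitly.

```python
import ipaddress
from itertools import combinations

# (network, hosts needed); the assignment to LAN A / LAN B / WAN is an assumption.
plan = {
    "LAN B": (ipaddress.ip_network("192.168.1.64/26"), 50),
    "LAN A": (ipaddress.ip_network("192.168.1.32/27"), 20),
    "WAN":   (ipaddress.ip_network("192.168.1.4/30"), 2),
}

def usable_hosts(network: ipaddress.IPv4Network) -> int:
    """Addresses left after removing the network address and the broadcast."""
    return network.num_addresses - 2

fits = {name: usable_hosts(net) >= needed for name, (net, needed) in plan.items()}
overlap = any(a.overlaps(b) for (a, _), (b, _) in combinations(plan.values(), 2))

for name, (net, needed) in plan.items():
    print(name, net, usable_hosts(net), needed)
print(fits, overlap)
```

With a single fixed mask, the 50-host LAN would force a /26 everywhere, and the point-to-point WAN link alone would consume 64 addresses instead of 4.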

IP Fragmentation (4)
Reassembly is done at the destination
Buffer space has to be provided at the receiver
The first arriving fragment starts a reassembly timer
Provided that MF=1 and/or Offset <> 0
The reassembly timer limits the lifetime of an incomplete datagram and allows
better use of buffer resources
Because fragments can take different paths, reassembly is done at the destination.
If the reassembly timer expires before the packet has been reconstructed, all
fragments are discarded and the buffer is freed.
That is: fragmentation can be a resource- and time-consuming matter. Because
of this, packets are typically sent with the lowest MTU size that may occur
somewhere in the network. An (older) RFC recommendation specifies 576 bytes
to be used as minimum MTU, but in the age of Ethernet most people use 1500
bytes to gain more efficiency. IP version 6 routers do not fragment anymore;
Path MTU discovery is used instead.
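To make the MF flag and the fragment offset concrete, here is a minimal sketch of how a datagram's payload would be split for a given MTU. It assumes a 20-byte IP header without options and relies on the IPv4 rule that the offset is counted in units of 8 bytes; the function names are hypothetical, and only lengths are modeled, not real packets.

```python
def fragment(payload_len: int, mtu: int, header_len: int = 20) -> list[dict]:
    """Split a payload into fragments; offsets are in 8-byte units."""
    max_data = (mtu - header_len) // 8 * 8  # data per fragment, multiple of 8
    if max_data <= 0:
        raise ValueError("MTU too small for any payload")
    fragments = []
    offset = 0
    while offset < payload_len:
        size = min(max_data, payload_len - offset)
        more = offset + size < payload_len   # MF=1 on all but the last fragment
        fragments.append({"offset": offset // 8, "length": size, "mf": int(more)})
        offset += size
    return fragments

def is_fragment(mf: int, offset: int) -> bool:
    """The receiver's test from the slide: MF=1 and/or Offset <> 0."""
    return mf == 1 or offset != 0

for frag in fragment(4000, 1500):
    print(frag, is_fragment(frag["mf"], frag["offset"]))
```

Note that the last fragment has MF=0 but a nonzero offset, which is why the receiver has to test both fields before starting the reassembly timer.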

Summary
The Internet Protocol
Is an "open" (RFC defined) standard
An IP address is a 32 bit value but structured
To define net-ID and host-ID
Classes A, B, C
Subnetting and VLSM allow the address space to be used much more efficiently

Quiz
Why is there also a source address in the IP header?
Why is there no field for the subnet mask in the IP header?
Is subnet zero used in "real life"?
Do routers today really care about IP classes?
Is VLSM still important? (why / why not)


2005/03/11 {C} Herbert Haas
Address Resolution
ARP, RARP, Proxy ARP
Agenda
Address Resolution Protocol (ARP)
IP Routing Basics
IP Forwarding and ARP
RARP
Proxy ARP
ICMP
IP Forwarding and ICMP




Address Resolution Protocol
Why ARP?
On a multipoint network every station needs a layer-2 address
When IP packets should be sent to a local destination the sender must first
determine the corresponding layer-2 address
The layer-2 address could be a MAC address, a DLCI (Frame Relay) or similar
In this chapter we only focus on Ethernet




Routing Differences
Routing = finding a path to a destination address
Direct delivery performed by host
Destination network = local network
Indirect delivery performed by router
Destination network ≠ local network
Packet is forwarded to default gateway




Direct Delivery
IP host checks if packet's destination network is identical with local network
By applying the configured subnet mask of the host's interface
If destination network = local network then the L2 address of the destination
is discovered using ARP
Remember: not necessary for point-to-point connections
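The host's decision between direct and indirect delivery can be sketched as a single comparison. The function name `arp_target` is hypothetical; it returns the IP address for which the host would have to resolve a MAC address via ARP. The example addresses are borrowed from the Indirect Delivery slides of this chapter.

```python
import ipaddress

def arp_target(interface: str, default_gateway: str, destination: str) -> str:
    """Return the IP address the host must resolve with ARP.

    interface is the host's own address with its subnet mask, e.g. "1.0.0.1/8".
    """
    local = ipaddress.ip_interface(interface)
    dest = ipaddress.ip_address(destination)
    if dest in local.network:
        return str(dest)        # direct delivery: destination is on the local network
    return default_gateway      # indirect delivery: hand the packet to the default gateway

print(arp_target("1.0.0.1/8", "1.0.0.9", "1.0.0.2"))
print(arp_target("1.0.0.1/8", "1.0.0.9", "3.0.0.2"))
```

In both cases the destination IP address in the packet stays the same; only the layer-2 destination differs.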


IP Host Facts
Learned MAC addresses are stored in an ARP cache
  Aging timer: 20 minutes
IP hosts also have routing tables!
  But typically only a static route to the default gateway is entered
  Default gateway for indirect delivery
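
A toy model of an ARP cache with an aging timer may make the idea concrete. This is a sketch, not how any particular IP stack implements it: the class name `ArpCache` and its methods are hypothetical, and the 20-minute default is simply the value from the slide.

```python
import time

class ArpCache:
    """Toy ARP cache: maps an IP address to (MAC address, time learned)."""

    def __init__(self, max_age_s=20 * 60, clock=time.monotonic):
        self.max_age_s = max_age_s
        self.clock = clock          # injectable so the aging can be tested
        self.entries = {}

    def learn(self, ip, mac):
        self.entries[ip] = (mac, self.clock())

    def lookup(self, ip):
        entry = self.entries.get(ip)
        if entry is None:
            return None             # unknown: an ARP request is needed
        mac, learned_at = entry
        if self.clock() - learned_at > self.max_age_s:
            del self.entries[ip]    # aged out: a new ARP request is needed
            return None
        return mac
```

Aging keeps the cache from holding stale mappings, for example after a network card has been replaced.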




Using the Default Gateway
Default gateway delivers the packet on behalf of its host using a routing table
Host must determine the MAC address of the default gateway using ARP
IP datagram is handed over to the default gateway


Indirect Delivery (2)
[Figure: three networks 1.0.0.0/8, 2.0.0.0/8 and 3.0.0.0/8 with routers R1, R2, R3 and R4]
Router interfaces: IP 1.0.0.9 (MAC A), IP 2.0.0.9 (MAC B), IP 3.0.0.9 (MAC C)
Hosts:
  MAC U, IP 1.0.0.1, Def.Gwy 1.0.0.9
  MAC V, IP 1.0.0.2, Def.Gwy 1.0.0.9
  MAC W, IP 2.0.0.1, Def.Gwy 2.0.0.9
  MAC X, IP 2.0.0.2, Def.Gwy 2.0.0.9
  MAC Y, IP 3.0.0.1, Def.Gwy 3.0.0.9
  MAC Z, IP 3.0.0.2, Def.Gwy 3.0.0.9
Host wants to send an IP packet to 3.0.0.2: net-ID unequal, use default gateway R1

Indirect Delivery (3)
[Figure: same topology as in Indirect Delivery (2)]
ARP Request: need MAC address of IP 1.0.0.9


Indirect Delivery (4)
[Figure: same topology as in Indirect Delivery (2)]
ARP Response: IP 1.0.0.9, MAC A




Indirect Delivery (9)
[Figure: same topology as in Indirect Delivery (2)]
ARP Request: need MAC address of IP 3.0.0.2


Indirect Delivery (10)
[Figure: same topology as in Indirect Delivery (2)]
ARP Response: IP 3.0.0.2, MAC Z

Indirect Delivery (END)
[Figure: same topology as in Indirect Delivery (2)]
Frame delivered on network 3.0.0.0/8:
  MAC SA: C
  MAC DA: Z
  IP SA: 3.0.0.9
  IP DA: 3.0.0.2


Reverse ARP

Reverse ARP (RARP)
ARP assumes that an IP station knows its IP address (stored in NVRAM, on hard
disk, in a config file etc.).
Diskless machines usually don't have such means, so they must retrieve an IP
address for network booting.
RARP (Reverse ARP) provides IP addresses for unconfigured stations.
RFC 903


Reverse ARP (RARP)
A station sends a RARP request broadcast.
One station, the RARP server, looks up the IP address for that MAC address in
a database and replies.
Newer methods:
  BOOTP
  DHCP


Proxy ARP
"The ARP Hack"


Proxy ARP (1)
Routers connect only networks with different net-IDs
Routers with Proxy ARP enabled also connect networks with the same net-ID
  Router replies to ARP requests on behalf of a station in another segment
  Security or performance reasons
"Proxy" simply means "instead of"

Proxy ARP (2)
Using Proxy ARP on routers, hosts do not need a default gateway or routing
entries to reach other subnets
Default router's address = own interface address
  Forces ARP for every destination address
If the local router is configured for Proxy ARP, it replies with an ARP
response claiming to be the destination host
  Then accepts and forwards the IP packet
Cisco routers have Proxy ARP enabled by default




Rules (1)
Proxy ARP only allowed to hide subnets, not networks!
  Proxy ARP GW should not be used to bypass normal GWs
Multiple Proxy ARP GWs
  Requesting host will use the first ARP response it receives
  Simple load balancing service


Rules (2)
Proxy ARP GWs must not reply if the destination is reachable through the
same interface
  Either the destination is in the same segment
  Or another Proxy ARP GW will reply, knowing a better route


Disadvantages
Much ARP traffic
  Forwarded by bridges! (Broadcasts)
Hosts need larger ARP caches
Address spoofing possible
  Station claims to be another station


ICMP


The Internet Control Message Protocol
If the network cannot deliver packets, the sender must be informed somehow!
  Reasons: no route, TTL expired, ...
ICMP enhances network reliability and performance by carrying error and
diagnostic messages
ICMP must be supported by every IP station
  Implementation differences!

Simple Operation
Any station (host or router) detecting transmission problems sends an ICMP
error message back to the originator
ICMP gives feedback
ICMP messages are carried within IP packets
  Protocol field = 1
  ICMP header and code in the IP data area
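
To make the "ICMP header in the IP data area" concrete, here is a minimal sketch that builds the bytes of an ICMP Echo Request (type 8, code 0) with the standard Internet checksum. It only constructs the ICMP message; wrapping it in an IP packet with protocol field 1 and sending it is left out. The function names are hypothetical.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of all 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                                # pad to a whole number of words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)       # fold carries back in
    return ~total & 0xFFFF

def build_echo_request(identifier: int, sequence: int, payload: bytes = b"") -> bytes:
    # Header layout: type (8 bit), code (8 bit), checksum (16 bit),
    # identifier (16 bit), sequence number (16 bit).
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload
```

A receiver can verify the message by running the same checksum over the whole ICMP message; a correct message yields 0.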




Important Rule
If an IP packet carrying an ICMP message cannot be delivered
  No additional ICMP error message is generated, to avoid an ICMP avalanche
  "ICMP must not invoke ICMP"
Exception: PING command
  Echo request and echo response
  Microsoft's tracert expects "TTL expired" upon "Echo request"


Type Field Values
(0) - Echo Reply ("PING")
(3) - Destination Unreachable
(4) - Source Quench (decrease data rate of sender)
(5) - Redirect (use different router)
(8) - Echo Request ("PING")
(11) - Time Exceeded (TTL = 0 or reassembly timer expired)
(12) - Parameter Problem (IP header)
(13) - Time Stamp Request
(14) - Time Stamp Reply
(15/16) - Information Request/Reply (finding the net-ID of the network; e.g. SLIP)
(17/18) - Address Mask Request/Reply

Example: Codes for Type 3
(0) - Network unreachable: no path to network known or network down;
generated by intermediate or far-end router.
(1) - Host unreachable: host-ID can't be resolved or host not responding;
generated by far-end router.
(2) - Protocol unreachable: protocol specified in IP header not available;
generated by end system.
(3) - Port unreachable: port (service) specified in layer 4 not available;
generated by end system.
(4) - Fragmentation needed and do not fragment bit set: DF bit = 1 but the
packet is too big for the network (MTU); generated by router.
(5) - Source route failed: path in IP options couldn't be followed;
generated by intermediate or far-end router.


IP Forwarding and ICMP (1)
[Figure: same topology as in Indirect Delivery (2)]
Host wants to send an IP packet to 4.0.0.1: net-ID unequal, use default gateway R1


IP Forwarding and ICMP (1)
[Figure: same topology as in Indirect Delivery (2)]
"Let's send back an ICMP message..."
R1: ICMP message to IP 1.0.0.1, "network unreachable"

IP Forwarding and ICMP (2)
[Figure: same topology as in Indirect Delivery (2)]
Host wants to send an IP packet to 3.0.0.5: net-ID unequal, use default gateway R1




IP Forwarding and ICMP (2)
[Figure: same topology as in Indirect Delivery (2)]
ARP Request: need MAC address of IP 3.0.0.5
...did not get an ARP response back
"Let's send back an ICMP message..."


IP Forwarding and ICMP (2)
[Figure: same topology as in Indirect Delivery (2)]
R4: ICMP message to 1.0.0.1, "host unreachable"


Rules
The interface on which the packet comes into the router is the same interface
on which the packet gets routed out
The subnet/network of the source IP address is the same subnet/network as that
of the next-hop IP address of the routed packet
The datagram is not source-routed
The kernel is configured to send redirects
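
The four rules above read like the usual preconditions for a router to send an ICMP Redirect (type 5); that reading is an assumption, since the slide itself does not name the message. Under that assumption the check could be sketched as follows. The record `ForwardingDecision` and all of its field names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass

@dataclass
class ForwardingDecision:
    in_interface: str        # interface the packet arrived on
    out_interface: str       # interface the packet is routed out on
    source_ip: str           # IP source address of the packet
    next_hop_ip: str         # next hop chosen from the routing table
    local_network: str       # network attached to in_interface, e.g. "1.0.0.0/8"
    source_routed: bool      # packet carries a source-route option
    redirects_enabled: bool  # router is configured to send redirects

def should_send_redirect(d: ForwardingDecision) -> bool:
    net = ipaddress.ip_network(d.local_network)
    return (
        d.in_interface == d.out_interface
        and ipaddress.ip_address(d.source_ip) in net
        and ipaddress.ip_address(d.next_hop_ip) in net
        and not d.source_routed
        and d.redirects_enabled
    )
```

All four conditions must hold at once; if any one fails, the packet is forwarded without a redirect.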

Summary
On Layer 3, IP addresses are used to route packets
  On Layer 2 different addresses are used (e.g. MAC address)
  Mapping/resolution needed -> ARP
ARP is mostly dynamic (static entries are possible)
The other way round: RARP (BootP, DHCP)
ICMP is used to inform the originating IP host about what happened to its IP packet
  IP stacks do not necessarily listen to ICMP messages
  Could be one way to implement flow control (ICMP source quench)


Quiz
Why is ARP not needed on serial lines?
Why do ARP cache entries time out?
Why should you use DHCP instead of RARP?
What happens if a router discards an ICMP message?
Ever heard of "Inverse ARP"?

1
2005/03/11 {C} Herbert Haas
BootP and DHCP
Flexible and Scalable Host
Configuration

Shortcomings of RARP
Reverse Address Resolution Protocol
  Only IP address distribution
  No subnet mask
  Using hardware address for identification
New methods needed: BOOTP, DHCP
RARP was one of the first protocols to offer an IP address automatically to
a newly connected client. But RARP is an old protocol with many disadvantages.
It can only distribute an IP address, without a subnet mask. RARP uses the
hardware address for identification. This makes it impossible to connect new
clients to the network without some administrative work.

3
Bootstrap Protocol (BOOTP)
A static solution with many parameters

Goal
Clients request IP address and other parameters from server
  Subnet mask, configuration filename, ...
IP addresses are predefined in a list
  Fixed mapping MAC address -> IP address
Defined in RFC 951 and RFC 1048
The Bootstrap Protocol can offer many important parameters to the client. The
most important parameters are the subnet mask and the configuration filename.
With the configuration filename it is possible to connect a diskless client.
BOOTP also uses a fixed mapping via the hardware address (Ethernet MAC
address).

Bootstrap
[Figure: BOOTP client, BOOTP server and TFTP server]
Client: "Here is MAC A, I need an IP address, and something to boot!"
BOOTP request sent by the client:
  Eth2: DA = FFFF.FFFF.FFFF
  IP: DA = 255.255.255.255, SA = 0.0.0.0
  UDP: DPort = 67, SPort = 68
  BOOTP: Request-ID = 77, Client IP = 0.0.0.0, MAC = A, Your IP = ?, Server IP = ?, Image File = ?
In the picture above you see the classic bootstrap principle. There are two important
servers: the TFTP server with the configuration file and the BOOTP server.
After a new client connects to a network, it needs an IP address and something to
boot. It sends out a request via an IP broadcast (BOOTP works with UDP, ports 67
and 68).

Bootstrap
[Figure: BOOTP client 192.60.30.10 ("Thank you!"), BOOTP server 192.60.30.100, TFTP server 192.60.30.20]
BOOTP reply sent by the server:
  Eth2: DA = FFFF.FFFF.FFFF
  IP: DA = 255.255.255.255, SA = 192.60.30.100
  UDP: DPort = 68, SPort = 67
  BOOTP: Request-ID = 77, Client IP = 0.0.0.0, MAC = A, Your IP = 192.60.30.10, Server IP = 192.60.30.20, Image File = /tftpboot/dl.img
After the BOOTP server has received the request from the BOOTP client, it uses its
fixed mapping (MAC address A -> IP address) to offer the client an
IP address. The BOOTP server also sends the client information about the TFTP
server and the name of the configuration file.

Principles
Separation of the boot task into a BOOTP part and a TFTP part
BOOTP server only needs to maintain a small database!
Image and configuration files can be stored on another machine
BOOTP client is responsible for error detection
After an error is detected (timeout), the request is retransmitted. The timeout is
selected randomly from an interval that is increased as long as the errors persist,
which avoids network overload. For error detection, the UDP checksum is
used. Also the IP datagram has the "Do Not Fragment" bit set to 1.
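
The randomized, growing retransmission timeout can be sketched as below. The concrete numbers (first interval of 4 seconds, doubling, cap at 60 seconds) and the function name are assumptions chosen for illustration, not values stated on the slide.

```python
import random

def retransmission_delays(attempts, first_interval=4.0, max_interval=60.0, rng=random):
    """Return one randomized wait time (in seconds) per retransmission attempt."""
    delays = []
    interval = first_interval
    for _ in range(attempts):
        delays.append(rng.uniform(0.0, interval))    # random point inside the current interval
        interval = min(interval * 2, max_interval)   # widen the interval while errors last
    return delays
```

The randomization keeps many clients that boot at the same moment (for example after a power failure) from retransmitting in lockstep.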

BOOTP - Message Format
  OP | HTYPE | HLEN | HOPS
  TRANSACTION ID
  SECONDS | RESERVED
  CLIENT IP ADDRESS
  YOUR IP ADDRESS
  SERVER IP ADDRESS
  ROUTER IP ADDRESS
  CLIENT HARDWARE ADDRESS (16 Octets)
  SERVER HOST NAME (64 Octets)
  BOOTFILENAME (128 Octets)
  VENDOR SPECIFIC AREA (64 Octets)
In the picture above you see the BOOTP message format. One line is 4 bytes
long. Note the 64-octet vendor specific area at the bottom of the frame. This
space can be used for various additional messages and will be extended by
DHCP.
In the middle part (red) the most important information is carried, which is the
assigned IP address, the IP address of a server from which this client can boot,
and an optional router IP address if this server is located on another subnet.
The detailed meaning of each field will be explained in the following slides.
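
As a rough illustration of this layout, the fields can be packed with Python's `struct` module. The format string follows the field sizes listed above (four 1-byte fields, a 4-byte transaction ID, two 2-byte fields, four IP addresses, then 16, 64, 128 and 64 octets); the function name and the example MAC address are hypothetical.

```python
import struct

# OP, HTYPE, HLEN, HOPS, TRANSACTION ID, SECONDS, RESERVED,
# client/your/server/router IP, client HW address, server name, boot file, vendor area
BOOTP_FORMAT = "!BBBBIHH4s4s4s4s16s64s128s64s"

def build_boot_request(transaction_id: int, client_mac: bytes, seconds: int = 0) -> bytes:
    zero_ip = bytes(4)              # 0.0.0.0: address not yet known
    return struct.pack(
        BOOTP_FORMAT,
        1,                          # OP: 1 = boot request
        1,                          # HTYPE: 1 = Ethernet
        6,                          # HLEN: 6 octets for an Ethernet MAC address
        0,                          # HOPS: set to 0 by the client
        transaction_id,
        seconds,
        0,                          # reserved
        zero_ip, zero_ip, zero_ip, zero_ip,
        client_mac,                 # struct pads the 6 octets to 16 with zeros
        b"", b"", b"",              # server host name, boot filename, vendor area left empty
    )

request = build_boot_request(77, bytes.fromhex("0200000000aa"))
print(len(request))                 # fixed part of the message, in octets
```

The fixed-size layout is what lets a diskless client implement BOOTP in very little boot ROM code.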

BootP - Message Fields
Operation Code (OP)
  Message type
Hardware Address Type (HTYPE)
Hardware Address Length (HLEN)
Hops
  Broadcast loop/storm avoidance
  Increased/checked by routers
The Hops field is important to avoid broadcast loops in a network. Every time a
BOOTP packet is checked by a router, the router increases the hops field by 1.
Operation Code (OP)
  1 = Boot request
  2 = Boot reply
Hardware Address Type (HTYPE)
  Network type (1 = Ethernet 10 Mbit)
Hardware Address Length (HLEN)
  6 = Ethernet

BootP - Message Fields
Transaction ID
  Used for identification (random number)
Seconds
  Seconds elapsed since client started trying to boot
Client IP address
  Filled in by client in boot request if known
Your IP address
  Filled in by server if client doesn't know its own address
The Transaction ID consists of a random number and ensures that a client
identifies, among others, the correct reply packet associated with its request. That is,
both the request and the associated reply have the same Transaction ID.
Seconds is set to the number of seconds that have elapsed since the client
started booting. According to RFC 951: "This will let the servers know how long
a client has been trying. As the number gets larger, certain servers may feel more
'sympathetic' towards a client they don't normally service. If a client lacks a
suitable clock, it could construct a rough estimate using a loop timer. Or it could
choose to simply send this field as always a fixed value, say 100 seconds."
If a router is configured to forward BOOTP requests (broadcasts), then it might
also wait until a certain value for "Seconds" has been exceeded. This measure
would mitigate broadcast storms.
The client might fill in its own IP address if it is already known and only other
parameters are requested.
Mostly "Your IP address" is used, which contains the IP address assigned to the
client.

BootP - Message Fields
Server IP address
  Returned in boot reply by server
Router IP address
  Server is part of another subnet
  IP address of the BootP relay
Client hardware address
  MAC address of client
The "Server IP address" contains the IP address of an optional boot server.
If a gateway does decide to forward the request, it should look at the 'giaddr'
(gateway IP address) field. If zero, it should plug its own IP address (on the
receiving cable) into this field. It may also use the 'hops' field to optionally
control how far the packet is reforwarded. Hops should be incremented on each
forwarding. For example, if hops passes '3', the packet should probably be
discarded.
The client's HW address is needed to find an entry in the address table at the
BOOTP server.
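
The relay behaviour quoted above (fill in 'giaddr' if it is zero, increment 'hops', discard once hops passes a limit) can be sketched with a plain dictionary standing in for the message. The function name, the dictionary keys and the default limit of 3 are illustrative assumptions.

```python
def relay_boot_request(request: dict, receiving_interface_ip: str, max_hops: int = 3):
    """Return the request as it would be forwarded, or None if it is discarded."""
    hops = request["hops"] + 1            # incremented on each forwarding
    if hops > max_hops:
        return None                       # hops passed the limit: discard
    forwarded = dict(request, hops=hops)  # copy, leave the original untouched
    if forwarded["giaddr"] == "0.0.0.0":
        # First relay on the path: record the interface the request arrived on,
        # so the server knows which subnet the client sits in.
        forwarded["giaddr"] = receiving_interface_ip
    return forwarded
```

Only the first relay writes 'giaddr'; later relays leave it alone, so the server always sees the client's own subnet.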

BootP - Message Fields
Server Host Name
  Optional server host name
Bootfilename
  Contains directory path and filename of the bootfile
Vendor Specific Area
  Optionally contains vendor information of the BootP server
  RFC 1048: also possible to mention the subnet mask, hostname, domain name, DNS, etc.
Optionally, the server's domain name can be specified. This field is limited to 64
bytes.
The "Bootfilename" contains the directory path and filename of the bootfile,
which is located at the server specified above.

13
13 {C} Herbert Haas 2005/03/11
Dynamic Host Configuration Protocol (DHCP)
A dynamic solution with even more parameters

Principles
Nearly identical to BOOTP
  Only slightly extended messages
  More parameters
Uses UDP communication
  Client side: port 68
  Server side: port 67
Based on a leasing idea!
  Dynamic configuration
RFC 2131 and RFC 2132
The Dynamic Host Configuration Protocol works nearly identically to BOOTP.
DHCP uses the same message format with only slight changes.
DHCP is based on a leasing idea. The IP address is leased from the server to
the client for a certain time; after this time has expired, the client needs to send
its request again.

Flexible Configurations
Automatic: Host gets permanent address
Dynamic: Address has expiration date/time (leasing)!
Manual: Fixed mapping MAC -> IP
In the slide above you see the three different kinds of configuration methods.
BOOTP uses a manual configuration, a fixed mapping (MAC -> IP). DHCP has
a dynamic configuration. The IP address offered by the server expires
after a certain time (leasing idea).

Parameters
IP address
Subnet mask
DNS server
NetBIOS name server
List of default gateways
Ethernet encapsulation
Router Discovery (RFC 1256)
Path MTU Discovery (RFC 1191)
etc...
In this slide you see some configuration parameters which can be sent with DHCP.
It is also possible to transfer information about the maximum fragment size, ARP cache
timeout, TCP keepalive, default TTL, source routing options and MTU.

How Does It Work - 1
[Figure: DHCP client, DHCP Server 1 and DHCP Server 2]
Client: "Here is MAC A. I need an IP address!"
1. IP LEASE REQUEST [DHCPDISCOVER]
2. IP LEASE OFFER [DHCPOFFER]
In the slide above you see the basic principle of DHCP. In a bigger
network there may be more than one DHCP server. The DHCP client connects to
the network and starts by sending out an IP LEASE REQUEST [DHCPDISCOVER]
(via broadcast, like BOOTP). Every DHCP server in the network receives this
message. Every DHCP server has its own address pool. If a server has addresses
left in this pool, it sends back an IP LEASE OFFER [DHCPOFFER] to the client
(this offer contains the IP address for the client).

How Does It Work - 1 (detailed)
[Figure: DHCP client (offered address 10.1.0.99), DHCP servers 10.1.0.10 and 10.1.0.20]
1. DHCPDISCOVER
  Source IP Address: 0.0.0.0
  Dest. IP Address: 255.255.255.255
  HW Address: MAC A
2. DHCPOFFER
  Source IP Address: 10.1.0.20
  Dest. IP Address: 255.255.255.255
  Offered IP Address: 10.1.0.99
  Client HW Address: MAC A
  Subnet mask: 255.255.255.0
  Lease length: 48h
  Server ID: 10.1.0.20
This picture shows the same as the last one, but in more detail. The client sends
out its DHCPDISCOVER message and both servers receive it. Then server
10.1.0.20 sends back its DHCPOFFER. This offer contains the IP address for
the client (Offered IP Address), the subnet mask, the server ID and the lease length.

How Does It Work - 2
[Figure: DHCP client, DHCP Server 1 and DHCP Server 2]
3. IP LEASE SELECTION [DHCPREQUEST]
4. IP LEASE ACKNOWLEDGMENT [DHCPACK]
Client: "Thank you server 2 for the IP address! Listen everybody: I use the
information from this server, stop offering!"
After the client gets an offer from one server, it sends out an IP LEASE
SELECTION [DHCPREQUEST] to tell the other servers that it accepts the
offer from server 2 and that they can stop sending offers. The
DHCPREQUEST is also a broadcast.

How Does It Work - 2 (detailed)
[Figure: DHCP client 10.1.0.99, DHCP servers 10.1.0.10 and 10.1.0.20]
3. DHCPREQUEST
  Source IP Address: 0.0.0.0
  Dest. IP Address: 255.255.255.255
  HW Address: MAC A
  Req. IP Address: 10.1.0.99
  Server ID: 10.1.0.20
4. DHCPACK
  Source IP Address: 10.1.0.20
  Dest. IP Address: 255.255.255.255
  Offered IP Address: 10.1.0.99
  Client HW Address: MAC A
  Subnet mask: 255.255.255.0
  Lease length: 48h
  Server ID: 10.1.0.20
One important thing is the server ID in the DHCPREQUEST. This server ID tells
the server from which the client got its IP address that the client will take the
offered address. After server 2 has received the DHCPREQUEST, it sends back the
DHCPACK to acknowledge this lease.

Bound
DHCPACK (success) is sent by the server whose offer was accepted
Client receives the DHCPACK
Client enters the BOUND state
TCP/IP is completely initialized
After the client has received the DHCPACK (if all was successful), it enters the
BOUND state. Once the client is BOUND, TCP/IP is completely initialized and the
client is ready for data transfer.

DHCPNACK
DHCPNACK (no success; named DHCPNAK in RFC 2131) will be sent if
  Client tries to lease the previous IP address, but this address is no longer
  available
  Client's IP address is invalid
  Client may have been moved to another subnet
If the client receives a DHCPNACK message from the server, something went
wrong. A connection failure, an invalid IP address, a client move to another subnet, etc.
can all lead to a negative acknowledgment. If the client receives this kind of
message, it needs to start again from the beginning (sending out a
DHCPDISCOVER).
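
The client side of this exchange can be summarized as a small state machine. The state names INIT, SELECTING, REQUESTING and BOUND are those used in RFC 2131; the event names and the table-driven form are simplifications made up for this sketch.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("INIT", "DISCOVER_SENT"): "SELECTING",        # client broadcasts DHCPDISCOVER
    ("SELECTING", "OFFER_SELECTED"): "REQUESTING",  # client picks an offer, sends DHCPREQUEST
    ("REQUESTING", "ACK"): "BOUND",                 # server confirmed the lease
    ("REQUESTING", "NAK"): "INIT",                  # negative acknowledgment: start over
}

def run(events, state="INIT"):
    """Feed a sequence of events to the state machine and return the final state."""
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state

print(run(["DISCOVER_SENT", "OFFER_SELECTED", "ACK"]))
print(run(["DISCOVER_SENT", "OFFER_SELECTED", "NAK"]))
```

The second run shows the effect of a negative acknowledgment: the client falls back to INIT and has to send a new DHCPDISCOVER.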

DHCP - Message Format
  OP | HTYPE | HLEN | HOPS
  TRANSACTION ID
  SECONDS | FLAGS FIELD
  CLIENT IP ADDRESS
  YOUR IP ADDRESS
  SERVER IP ADDRESS
  ROUTER IP ADDRESS
  CLIENT HARDWARE ADDRESS (16 Octets)
  SERVER HOST NAME (64 Octets)
  BOOTFILENAME (128 Octets)
  OPTIONS (312 Octets): DHCP MESSAGES!
This picture shows the DHCP message format. It is almost the same
as the BOOTP message format. The only difference is the OPTIONS field
(DHCP MESSAGES), which contains the DHCPREQUEST, DHCPOFFER,
DHCPDISCOVER, etc.

DHCP-specific Message Fields
DHCPDISCOVER
  Client broadcast to find DHCP server
DHCPOFFER
  Response to a DHCPDISCOVER
  Offering an IP address
DHCPREQUEST
  Request the parameters offered by one server
DHCPINFORM
  Client asks for more information
The DHCPINFORM message is used by the client if it needs more
information than normal.

DHCP-specific Message Fields
DHCPACK
  Acknowledgement from server to client
DHCPNACK
  Negative ACK from server to client
DHCPDECLINE
  Message from client to server indicating an error (the offered address is
  already in use)
DHCPRELEASE
  Message from client to server canceling a lease and relinquishing the
  network address

Timer
After DHCPACK, the beginning of the lease period is registered
Located in the DHCPACK message
  Lease time
  T1 (renewal attempt)
  T2 (sub renewal attempt)
T1 and T2 are configured at the DHCP server
  T1 = 0.5 x lease time
  T2 = 0.875 x lease time
DHCP relies on a leasing idea. The offered IP address expires after a certain
time. There are three times: a lease time, a T1 and a T2. T1 and T2 are based
on the lease time (T1 = 0.5 x lease time; T2 = 0.875 x lease time). These
multipliers are configured at the DHCP server.
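
As a small worked example, the two timers can be computed from the lease time with the multipliers given above. The function name is hypothetical; the 48-hour lease is the value from the DHCPOFFER example earlier in this chapter.

```python
def lease_timers(lease_time_s: float, t1_factor: float = 0.5, t2_factor: float = 0.875):
    """Return (T1, T2) in seconds for a given lease time."""
    return lease_time_s * t1_factor, lease_time_s * t2_factor

t1, t2 = lease_timers(48 * 3600)
print(t1 / 3600, t2 / 3600)   # T1 and T2 in hours
```

For the 48-hour lease this gives T1 = 24 hours and T2 = 42 hours, so the client has two chances to extend the lease well before it expires.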

Timer
T1 and T2 start when the client is bound
Client RENEWs the lease when T1 has expired
  Client enters RENEWING state and sends a DHCPREQUEST to the server
  If the server accepts, a DHCPACK contains a new lease time
After the client enters the BOUND state, both timers start. If the client is still in the
network after T1 has expired, it sends out a DHCPREQUEST message,
because it wants to renew the lease.

Timer
If the lease could not be RENEWED after T1, the client makes another try
after T2
  Client tries to contact other DHCP servers
DHCP server can answer with
  DHCPACK, RENEWING the lease
  DHCPNACK, to force the client to reinitialize
T2 is only a second try. If something goes wrong the first time, the client still has
the chance to renew its lease after T2 has expired. In this try it also contacts other
DHCP servers.
If the client receives a DHCPACK, its lease is renewed. If the client gets a
DHCPNACK message, the lease has expired and the client starts from the beginning
(it sends out a DHCPDISCOVER to all DHCP servers).

Subnets
DHCP is related to BootP
DHCP messages are broadcast based
  Not forwarded by routers
  Or routers are configured as BOOTP Relay Agent
DHCP and BOOTP send out their packets via IP broadcast. But routers do not
forward broadcasts, since this would lead to broadcast storms in the whole network.
There is, however, a special function on routers called "BOOTP Relay Agent" which
allows the routers to forward these special BOOTP/DHCP messages.

1
2005/03/11 {C} Herbert Haas
Introducing TCP & UDP
Internet Transport Layers

TCP Facts (1)
Connection-oriented layer 4 protocol
Carried within IP payload
Provides a reliable end-to-end transport of data between computer processes of
different end systems
  Error detection and recovery
  Sequencing and duplication detection
  Flow control
RFC 793
In this chapter we talk about TCP. TCP is a connection-oriented layer 4 protocol
and only works between the hosts. It synchronizes (connects) the hosts with each
other via the "3-way handshake" before the real transmission begins. After this,
a reliable end-to-end transmission is established. TCP was standardized in
September 1981 in RFC 793. (Remember: IP was standardized in September
1981 too, RFC 791.) TCP is always used with IP and it also protects the IP
packet, as its checksum spans (almost) the whole IP packet.
TCP provides error recovery, flow control and sequencing. The most important
thing with TCP is the port number, which we will discuss later.

TCP Facts (2)
Application's data is regarded as a continuous byte stream
TCP ensures a reliable transmission of segments of this byte stream
Handover to Layer 7 at "Ports"
  OSI-speak: Service Access Point
Every IP packet which is sent along with TCP will be acknowledged (error
recovery). From the TCP perspective we call each packet a segment.
TCP hides the details of the network layer from the higher layers and frees them
from the tasks of transmitting data through a specific network. TCP provides its
service to higher layers through ports (OSI: Service Access Points).

Port Numbers
Using port numbers, TCP (and UDP) can multiplex different layer-7 byte streams
Server processes are identified by well-known port numbers: 0..1023
  Controlled by IANA
Client processes use arbitrary port numbers > 1023
  Better > 8000 because of registered ports
Each communicating computer process is assigned a locally unique port number.
Using port numbers, TCP can service multiple processes such as a web browser or
an e-mail client simultaneously through a single IP address. In summary, TCP
works like a stream multiplexer and demultiplexer.

Registered Ports
For proprietary server applications
Not controlled by IANA, only listed in RFC 1700
Examples
  1433 Microsoft-SQL-Server
  1439 Eicon X25/SNA Gateway
  1527 Oracle
  1986 Cisco License Manager
  1998 Cisco X.25 service (XOT)
  6000-6063 X Window System
Only the well-known ports are reserved for common applications and services,
such as Telnet, WWW, FTP etc. They are in the range from 0 to 1023. These are
controlled by the Internet Assigned Numbers Authority (IANA).
There are also many registered ports which start at 1024 (e.g. Lotus Notes,
Cisco XOT, Oracle, license managers etc.). They are not controlled by the IANA,
only listed in RFC 1700.
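
The port ranges can be captured in a small helper. The boundary at 1023 comes from the text above; the upper end of the registered range (49151) is not mentioned in these notes and is taken from current IANA practice, so treat it as an addition. The function name is hypothetical.

```python
def port_category(port: int) -> str:
    """Classify a TCP/UDP port number by range."""
    if not 0 <= port <= 65535:
        raise ValueError("port numbers are 16 bit: 0..65535")
    if port <= 1023:
        return "well-known"
    if port <= 49151:           # upper bound per IANA, not stated in the slides
        return "registered"
    return "dynamic/private"

for p in (80, 1433, 6000, 50000):
    print(p, port_category(p))
```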

Sockets
Server process multiplexes streams with the same source port numbers
according to the source IP address
(PortNr, SA) = Socket
Each stream ("flow") is uniquely identified by a socket pair
In a client-server environment a communicating server process has to maintain
several sessions (and also connections) to different targets at the same time.
Therefore, a single port has to multiplex several virtual connections. These
connections are distinguished through sockets. The combination of IP address and
port number is called a "socket".
For example: 10.1.1.2:80 [IP address : port number]
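
How a socket pair separates flows can be illustrated with a dictionary keyed by (client socket, server socket). This is a toy model of demultiplexing, not an actual TCP implementation; the function name and the client addresses are made up for the example.

```python
from collections import defaultdict

# Received bytes per flow, keyed by the socket pair (client socket, server socket).
flows = defaultdict(bytearray)

def deliver(src_ip, src_port, dst_ip, dst_port, data: bytes) -> None:
    socket_pair = ((src_ip, src_port), (dst_ip, dst_port))
    flows[socket_pair].extend(data)

# Two clients that happen to use the same source port talk to the same server socket.
deliver("10.1.1.7", 50001, "10.1.1.2", 80, b"GET /a")
deliver("10.1.1.8", 50001, "10.1.1.2", 80, b"GET /b")
print(len(flows))   # number of distinct flows
```

Although both clients use source port 50001 and the same server socket 10.1.1.2:80, the differing source IP addresses yield two distinct socket pairs and therefore two separate streams.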

TCP Header (1)
Source and Destination Port
  16-bit port number for source and destination process
Header Length
  Multiple of 4 bytes
  Variable header length because of (optional) options
The Source and Destination Port fields are 16 bits each and are used by the application.
The Header Length indicates where the data begins. The TCP header (even one
including options) is an integral number of 32-bit words long.
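
A short parsing sketch shows how the header length field locates the start of the data. It decodes the fixed 20-byte part of the TCP header with `struct`; the function name and the dictionary keys are hypothetical, and options are skipped rather than interpreted.

```python
import struct

def parse_tcp_header(segment: bytes) -> dict:
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    header_length = (offset_flags >> 12) * 4    # header length field counts 32-bit words
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_length": header_length,
        "syn": bool(offset_flags & 0x02),
        "ack_flag": bool(offset_flags & 0x10),
        "window": window,
        "payload": segment[header_length:],     # data begins right after the header
    }

# A SYN segment without options (header length 5 words = 20 bytes) plus 2 data bytes.
demo = struct.pack("!HHIIHHHH", 50001, 80, 1000, 0, (5 << 12) | 0x02, 8192, 0, 0) + b"xy"
print(parse_tcp_header(demo)["header_length"])
```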

12
TCP Header (2)
Sequence Number (32 bit)
  Number of first byte of this segment
  Wraps around to 0 after reaching 2^32 - 1
Acknowledge Number (32 bit)
  Number of next byte expected by receiver
  Confirms correct reception of all bytes up to and including the byte with
  number AckNr-1
Sequence Number: 32 bit. Number of the first byte of this segment. If SYN is
present, the sequence number is the initial sequence number (ISN) and the first
data octet is ISN+1.
Acknowledge Number: 32 bit. If the ACK control bit is set, this field contains
the value of the next sequence number the sender of the segment is expecting to
receive. Once a connection is established, this is always sent.
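
The wrap-around of the 32-bit sequence space is simply arithmetic modulo 2^32. The helper names below are hypothetical; the sketch only shows which acknowledge number confirms a given segment.

```python
SEQ_MODULUS = 2 ** 32

def seq_add(seq: int, n: int) -> int:
    """Advance a sequence number by n bytes, wrapping around after 2^32 - 1."""
    return (seq + n) % SEQ_MODULUS

def ack_for(seq: int, payload_length: int) -> int:
    """Acknowledge number that confirms a segment: number of the next expected byte."""
    return seq_add(seq, payload_length)

print(seq_add(2 ** 32 - 1, 1))      # wraps around to 0
print(ack_for(4294967290, 10))      # segment straddles the wrap-around point
```

In the second call the segment carries bytes 4294967290 to 4294967295 and then 0 to 3, so the next expected byte is number 4.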

TCP Header (3)
URG flag
  Indicates urgent data
  If set, the 16-bit "Urgent Pointer" field is valid and points to the last
  octet of urgent data
  There is no way to indicate the beginning of urgent data (!)
  Applications switch into the "urgent mode"
  Used for quasi-outband signaling
URG flag: 1 bit. Control bit.
Sequence number of last urgent octet = actual segment sequence number + urgent
pointer
RFC 793 and several implementations assume the urgent pointer to point to the
first octet following the urgent data. However, the "Host Requirements" RFC 1122 states
this as a mistake! When a TCP receives a segment with the URG flag set, it
notifies the application, which switches into the "urgent mode" until the last octet of
urgent data is received. Examples of use: interrupt key in Telnet, Rlogin, or FTP.

TCP Header (4)
PSH Flag
  TCP should push the segment immediately to the application without buffering
  To provide low-latency connections
  Often ignored
PSH-Flag: 1 bit. Control Bit.
A TCP instance can decide on its own when to send data to the next instance.
One strategy could be to collect data in a buffer and forward the data when the
buffer exceeds a certain size. To provide a low-latency connection, the PSH
flag is sometimes set to 1. Then TCP should push the segment immediately to the
application without buffering. But typically the PSH flag is ignored.

TCP Header (5)
SYN Flag
  Indicates a connection request
  Sequence number synchronization
ACK Flag
  Acknowledge number is valid
  Always set, except in the very first segment
SYN-Flag: 1 bit. Control Bit.
If the SYN bit is set to 1, the receiver knows that the other host wants to
establish a connection with it. The flag is also used to synchronize the
sequence numbers. Most firewalls throw away packets with SYN=1 if the host
wants to establish a connection to a server application that is not allowed
(security reasons).
ACK-Flag: 1 bit. Control Bit.
Acknowledgment bit. If set, the Acknowledge Number field is valid.

TCP Header (6)
FIN Flag
  Indicates that this segment is the last
  Other side must also finish the conversation
RST Flag
  Immediately kill the conversation
  Used to refuse a connection attempt
FIN-Flag: 1 bit. Control Bit.
The FIN flag is used in the "disconnect" process. It indicates that this
segment is the last one. After the other side has also sent a segment with
FIN=1, the connection is closed.
RST-Flag: 1 bit. Control Bit.
Resets the connection immediately.

TCP Header (7)
Window (16 bit)
  Adjusts the send-window size of the other side
  Used with every segment
  Receiver-based flow control
  SeqNr of last octet = AckNr + window - 1
Window Size: 16 bit. The number of data octets, beginning with the one
indicated in the acknowledgment field, which the sender of this segment is
willing to accept. See the slides "TCP Sliding Window".

TCP Header (8)
Checksum
  Calculated over TCP header, payload and a 12-byte pseudo IP header
  Pseudo IP header consists of source and destination IP address, IP protocol
  type, and the TCP segment length
  Complete socket information is protected
  Thus TCP can also detect IP errors
TCP Checksum: 16 bit. The checksum covers the TCP header and data area plus a
12-byte pseudo IP header. It is the 16-bit one's complement of the one's
complement sum of all 16-bit words. The pseudo IP header contains the source
and destination IP address, the IP protocol type and the TCP segment length
(header plus data).
This guarantees that not only the port but the complete socket is included in
the checksum.
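The checksum procedure can be sketched in Python as follows. This is a minimal illustration for IPv4, not a reference implementation: the function names are invented for the example, the 12-byte pseudo header layout (source address, destination address, zero byte, protocol 6, TCP length) is taken from RFC 793, and the checksum field inside the segment is assumed to be zero while computing.

```python
import socket
import struct

TCP_PROTOCOL = 6  # IP protocol number of TCP

def ones_complement_sum(data: bytes) -> int:
    """One's complement sum of all 16-bit words (odd length is zero-padded)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)   # end-around carry
    return total

def pseudo_header(src_ip: str, dst_ip: str, tcp_length: int) -> bytes:
    """12-byte pseudo IP header that ties the checksum to the complete socket."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, TCP_PROTOCOL, tcp_length))

def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over pseudo header, TCP header and payload."""
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ~ones_complement_sum(data) & 0xFFFF
```

A receiver can verify a segment by summing the pseudo header and the received segment including its checksum field; the result is 0xFFFF if no error is detected.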

TCP Header (9)
Urgent Pointer
  Points to the last octet of urgent data
Options
  Only MSS (Maximum Segment Size) is used
  Other options are defined in RFC 1146, RFC 1323 and RFC 1693
Pad
  Ensures 32-bit alignment
Urgent Pointer: 16 bits. As defined in RFC 793, the urgent pointer points to
the sequence number of the octet following the urgent data; RFC 1122 corrects
this to the last octet of urgent data. This field is only interpreted in
segments with the URG control bit set.
Options: Variable length. Options may occupy space at the end of the TCP
header and are a multiple of 8 bits in length. Only the Maximum Segment Size
(MSS) is used. All options are included in the checksum.
Padding: Variable length. The TCP header padding is used to ensure that the
TCP header ends and data begins on a 32-bit boundary. The padding is composed
of zeros.

Sequence Number
RFC 793 suggests picking a random number at boot time (e.g. derived from
system start-up time) and incrementing it every 4 µs
Every new connection will increment SeqNr by 1
To avoid interference of spurious packets
Old "half-open" connections are deleted with the RST flag
RFC 793 suggests picking random starting sequence numbers and an explicit
negotiation of starting sequence numbers to make a TCP connection immune
against spurious packets.
Disturbing segments (e.g. delayed TCP segments from old sessions) and old
"half-open" connections are also deleted with the RST flag.

TCP Data Transfer
Acknowledgements are generated for all octets which arrived in sequence
without errors (positive acknowledgement)
Duplicates are also acknowledged (!)
  Receiver cannot know why the duplicate has been sent; maybe because of a
  lost acknowledgement
The acknowledge number indicates the sequence number of the next byte to be
received
Acknowledgements are cumulative: Ack(N) confirms all bytes with sequence
numbers up to N-1
  Therefore lost acknowledgements are no problem
The acknowledge number is equal to the sequence number of the next octet to be
received.

TCP Timeout
Timeout will initiate a retransmission of unacknowledged data
  Value of retransmission timeout influences performance (timeout should be in
  relation to round trip delay)
  High timeout results in long idle times if an error occurs
  Low timeout results in unnecessary retransmissions
Adaptive timeout
  Karn's algorithm uses a backoff method to adapt to the actual round trip
  delay
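An adaptive timeout of this kind might look like the following Python sketch. The class name, the initial timeout of 3 seconds, the smoothing factor alpha, the multiplier beta and the upper bound are assumptions chosen for illustration; only the two rules attributed to Karn's algorithm are essential: round-trip samples of retransmitted segments are ignored, and the timeout is doubled (backoff) whenever it expires.

```python
class RetransmissionTimer:
    """Adaptive retransmission timeout with Karn-style backoff (illustrative sketch)."""

    def __init__(self, initial_rto=3.0, alpha=0.125, beta=2.0, max_rto=64.0):
        self.srtt = None            # smoothed round trip time, unknown at start
        self.rto = initial_rto      # current retransmission timeout in seconds
        self.alpha, self.beta, self.max_rto = alpha, beta, max_rto

    def on_timeout(self) -> float:
        """Timer expired: retransmit and back off by doubling the timeout."""
        self.rto = min(self.rto * 2, self.max_rto)
        return self.rto

    def on_ack(self, rtt_sample: float, retransmitted: bool) -> float:
        """Update the timeout from a measured round trip time."""
        if retransmitted:
            # Ambiguous sample: the ACK may belong to the original or to the
            # retransmission, so it is not used and the backed-off RTO is kept.
            return self.rto
        if self.srtt is None:
            self.srtt = rtt_sample
        else:
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt_sample
        self.rto = min(self.beta * self.srtt, self.max_rto)
        return self.rto
```

With these assumed parameters, a timeout raises the RTO from 3 s to 6 s; the next unambiguous sample of 0.5 s brings it down to 1 s.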

TCP Sliding Window
TCP flow control is done with dynamic windowing using the sliding window
protocol
The receiver advertises the current amount of octets it is able to receive
  Using the window field of the TCP header
  Values 0 through 65535
Sequence number of the last octet a sender may send = received ack-number - 1
+ window size
  The starting size of the window is negotiated during the connect phase
The receiving process can influence the advertised window, thereby affecting
the TCP performance

TCP Sliding Window
[Figure: segments exchanged between HOST A and HOST B:
  [SYN] S=44 A=? W=8
  [SYN, ACK] S=72 A=45 W=4
  [ACK] S=45 A=73 W=8
  [ACK] S=45 A=73 W=8
Below, the send buffer with the bytes written by the application process; the
advertised window (set by the receiver) spans from the first byte that can be
sent to the last byte that can be sent.]

TCP Sliding Window
During the transmission the sliding window moves from left to right, as the
receiver acknowledges data
The relative motion of the two ends of the window opens or closes the window
  The window closes when data is sent and acknowledged (the left edge advances
  to the right)
  The window opens when the receiving process on the other end reads
  acknowledged data and frees up TCP buffer space (the right edge moves to the
  right)
If the left edge reaches the right edge, the sender stops transmitting data:
zero window
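The movement of the two window edges can be modelled with a small Python class. This is a simplified sketch from the sender's point of view: the class and method names are invented, sequence number wrap-around is ignored, and a shrinking right edge is not modelled. The numbers in the usage note assume that data starts at sequence number 45 and the receiver advertised a window of 4.

```python
class SendWindow:
    """Sender-side view of the sliding window (illustrative sketch)."""

    def __init__(self, first_seq: int, window: int):
        self.left = first_seq              # oldest unacknowledged octet
        self.next_seq = first_seq          # next octet to be sent
        self.right = first_seq + window    # first octet that may NOT be sent

    def last_allowed(self) -> int:
        """Last octet the sender may send = ack-number - 1 + window size."""
        return self.right - 1

    def usable(self) -> int:
        return self.right - self.next_seq

    def zero_window(self) -> bool:
        return self.left == self.right

    def send(self, nbytes: int) -> int:
        """Send at most nbytes, limited by the usable window; returns bytes sent."""
        n = min(nbytes, self.usable())
        self.next_seq += n
        return n

    def on_ack(self, ack: int, window: int) -> None:
        """ACK closes the window from the left; freed buffer opens it on the right."""
        self.left = max(self.left, ack)
        self.right = max(self.right, ack + window)
```

With SendWindow(45, 4) the last octet that may be sent is 48. An acknowledgement 47 with window 2 only moves the left edge (the window closes); an acknowledgement 49 with window 4 moves the right edge to 53 (the window opens).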

TCP Enhancements
So far, only the very basic TCP procedures have been mentioned
But TCP has many more "magic" built-in algorithms which are essential for
operation in today's IP networks:
  "Slow Start" and "Congestion Avoidance"
  "Fast Retransmit" and "Fast Recovery"
  "Delayed Acknowledgements"
  "The Nagle Algorithm"
  Selective Ack (SACK), Window Scaling
  ....
Additionally, there are different implementations (Reno, Vegas, ...)
"Slow Start" and "Congestion Avoidance" are mechanisms that control the
segment rate (per RTT).
"Fast Retransmit" and "Fast Recovery" are mechanisms to avoid waiting for the
timeout in case of retransmission and to avoid slow start after a fast
retransmission.
Delayed Acknowledgements are typically used with applications like Telnet:
here each client keystroke triggers a single packet with one byte of payload,
and the server must respond with both an echo and a TCP acknowledgement. Note
that this server echo must also be acknowledged by the client. Therefore,
layer 4 delays the acknowledgements because layer 7 might want to send some
bytes as well.
The Nagle algorithm tries to make WAN connections more efficient. We simply
delay the segment transmission in order to collect more bytes from layer 7.
Selective Acks enhance the traditional positive-ack mechanism and allow the
receiver to selectively acknowledge correctly received segments within a
larger corrupted block.
Window Scaling deals with the problem of a jumping window in case the
RTT-BW product is greater than 65535 (the classical max window size). This TCP
option allows the window value to be left-shifted (each bit shift is like
multiplying by two).
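The effect of the shift can be checked with a short Python sketch. The function names are hypothetical; the maximum shift count of 14 comes from RFC 1323, and the bandwidth and round trip values in the usage note are arbitrary examples.

```python
MAX_SHIFT = 14        # largest shift count permitted by RFC 1323
MAX_FIELD = 0xFFFF    # largest value of the 16-bit window field

def effective_window(window_field: int, shift: int) -> int:
    """Window in bytes after left-shifting the 16-bit header value."""
    if not 0 <= window_field <= MAX_FIELD:
        raise ValueError("window field must fit into 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return window_field << shift

def min_shift_for(bandwidth_bps: float, rtt_seconds: float) -> int:
    """Smallest shift whose maximum window covers the RTT-BW product."""
    bdp_bytes = bandwidth_bps * rtt_seconds / 8
    shift = 0
    while effective_window(MAX_FIELD, shift) < bdp_bytes and shift < MAX_SHIFT:
        shift += 1
    return shift
```

For a path with 100 Mbit/s and 100 ms round trip time the RTT-BW product is 1,250,000 bytes, so min_shift_for(100e6, 0.1) returns 5: a shift of 4 allows only 1,048,560 bytes, a shift of 5 allows 2,097,120 bytes.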

TCP Disconnect
A TCP session is disconnected in a way similar to the three-way handshake
The FIN flag marks the sequence number to be the last one; the other station
acknowledges and terminates the connection in this direction
The exchange of FIN and ACK flags ensures that both parties have received all
octets
The RST flag can be used if an error occurs during the disconnect phase

UDP
UDP is a connectionless layer 4 service (datagram service)
Layer 3 functions are extended by port addressing and a checksum to ensure
integrity
UDP uses the same port numbers as TCP (if applicable)
UDP is used where the overhead of a connection-oriented service is undesirable
or where the implementation has to be small
  DNS request/reply, SNMP get/set, booting by TFTP
Less complex than TCP, easier to implement
UDP is connectionless and supports no error recovery or flow control. Therefore
a UDP stack is extremely lightweight compared to TCP.
Typically, applications that do not require error recovery but rely on speed
use UDP, such as multimedia protocols.

UDP Header
Bit 0                      16                       32
    | Source Port Number    | Destination Port Number |
    | UDP Length            | UDP Checksum            |
    | PAYLOAD ...                                     |
The picture above shows the 8-byte UDP header. Note that the checksum is often
not calculated, so UDP basically carries only the port numbers.
I personally think that the length field is just for fun (or to align with 4
octets).
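Because the header consists of just four 16-bit fields, decoding it takes one unpack call. The following Python sketch uses an invented function name and return format; the field order follows the picture, and a checksum value of zero is treated as "not calculated", as defined for UDP over IPv4 in RFC 768.

```python
import struct

UDP_HEADER_LEN = 8  # four 16-bit fields

def parse_udp(datagram: bytes) -> dict:
    """Decode a UDP datagram (hypothetical helper for illustration)."""
    src_port, dst_port, length, checksum = struct.unpack(
        "!HHHH", datagram[:UDP_HEADER_LEN])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "length": length,                    # header plus data, in bytes
        "checksum_used": checksum != 0,      # 0 means the sender did not compute it
        "payload": datagram[UDP_HEADER_LEN:length],
    }
```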

UDP
Source and Destination Port
  Port number for addressing the process (application)
  Well-known port numbers defined in RFC 1700
UDP Length
  Length of the UDP datagram (header plus data)
UDP Checksum
  Checksum includes pseudo IP header (IP src/dst addr., protocol field), UDP
  header and user data; one's complement of the one's complement sum of all
  16-bit words
Compared to the TCP header, the UDP header is very small (8 bytes versus 20
bytes) because UDP performs no error recovery or flow control.

Summary
TCP & UDP are Layer 4 (Transport) protocols above IP
TCP is "connection oriented"
UDP is "connectionless"
TCP implements "fault tolerance" using "positive acknowledgement"
TCP implements "flow control" using dynamic window sizes
The combination of IP address and TCP/UDP port is called a "socket"

Quiz
What are advantages of TCP over UDP?
What are advantages of UDP over TCP?
What are important values to define the optimal window size?
Would you use TCP or UDP for real-time traffic? (VoIP, Video...)
When you download something from the Internet the download rate is first small
and increases afterwards - WHY?

2005/03/11 {C} Herbert Haas
Network Address Translation
All you want to know about
In this chapter we discuss the idea of Network Address Translation and special
issues associated with it. Invented in 1994, NAT became a quite popular
technique to save official network addresses and to hide one's own network
topology from the Internet.
Note:
In this chapter the Cisco IOS syntax is used for configuration examples.
IOS is a trademark of Cisco Systems Inc.
See http://www.cisco.com for further information.

Reasons for NAT
Mitigate Internet address depletion
Save global addresses (and money)
Conserve internal address plan
TCP load sharing
Hide internal topology
NAT allows a router to swap packet addresses. The initial idea was to mitigate
IP address depletion by masquerading internal IP addresses with (perhaps a
smaller number of) official addresses. We will discuss this later on.
The first and the second point reflect the same thing, but the first statement
comes from the ISP while the second point is an argument for the customer.
The third point means that the customer does not need to change her address
plan when she switches to another ISP.
As stated in the fourth point, NAT additionally allows for TCP load sharing.
Assume a bunch of servers represented by a single IP address to the outside.
Finally, NAT improves network security by hiding the actual host addresses.
Frequently NAT boxes are combined with proxy and firewalling functions.

Credits: The Creators of NAT
[Photos: Paul Francis, Kjeld Borch Egevang]
NAT was invented in May 1994 by Paul Francis and Kjeld Borch Egevang.
Paul Francis is currently chief scientist at Tahoe-Networks. K. Egevang works
at Intel Denmark. They have written RFCs about NAT, most importantly
RFC 1631, "The IP Network Address Translator (NAT)".

Terms (1)
[Figure: inside hosts 193.99.99.1, 193.99.99.2, 193.99.99.3 and 193.99.99.4
use global addresses and reach the outside directly (NAT not necessary)]
To understand standard documents such as RFCs or vendor documents such as
Cisco white papers, it is very important to understand four terms.
Firstly we have to distinguish the inside from the outside world. Inside is our
own network (which we want to hide using a NAT-enabled router later on).
Outside is the rest of the world, especially the Internet.
Secondly, suppose we do not use NAT. Therefore we use global addresses.
That is, we use addresses that are registered by the NIC and can be seen from
outside.

Terms (2)
[Figure: inside hosts 10.1.1.1, 10.1.1.2, 10.1.1.3 and 10.1.1.4 use local
addresses; a NAT router separates inside from outside]
Using a NAT-enabled router we can use inside local addresses which are not
unique in the world. These addresses are not registered and must be translated
to global addresses.
Note that we can already distinguish between inside-local and inside-global
addresses.

Terms (3)
This NAT table is maintained inside the router:
Inside local IP address   Inside global IP address
10.1.1.1                  193.99.99.1
10.1.1.2                  193.99.99.2
10.1.1.3                  193.99.99.3
10.1.1.4                  193.99.99.4
Remember the terms inside local versus inside global.
Of course the inside global address is basically seen from the outside. But
these addresses belong to our hosts, so we call them inside.
Simple NAT translates between these two types of addresses.

Terms (4)
Local versus global address
  Reflects realm of usage (inside or outside)
Inside versus outside world
  Reflects origin
Since these terms are so important and many people and some documents
confuse them, we give a summary here.
Note that local addresses have local meaning. That is: inside devices can only
deal with packets having local addresses. The NAT router is responsible for
translating global addresses to local ones and vice versa, if necessary! (If
you later understand the last two words, "if necessary", then you got it.)
Note that outside does not mean another (foreign) NAT domain. Outside
means simply the Internet or everything beyond the NAT router.

Terms Summary
Packet direction     In the inside network          In the outside network
Inside to outside    SA Inside Local                SA Inside Global
                     DA Outside Local               DA Outside Global
Outside to inside    SA Outside Local               SA Outside Global
                     DA Inside Local                DA Inside Global
This slide summarizes all terms by showing packets flowing from inside to
outside and from outside to inside. Local is what we can use inside our
network. Inside local source addresses are always private addresses, otherwise
we would not use NAT.
Outside local addresses can be either private or registered. Mostly they are
registered, but in certain cases we might want to present official registered
addresses in incoming packets as being private addresses. See the slide
"Outside Address Translation" for this special case. Typically the outside
local address is identical to the outside global address.
The inside global address is the official address of our hosts as seen in the
Internet. What people mostly expect from NAT is to translate an inside local
address to an inside global address. Both addresses belong to a host inside our
network.
The outside global address is the official registered IP address of an Internet
host. Mostly it is identical to the outside local address we use as destination
address for outgoing packets. See the slide "Outside Address Translation" for
exceptions.

Basic Principle (1a)
[Figure: inside hosts 10.1.1.1 and 10.1.1.2, NAT router with outside address
193.9.9.99, outside host 198.5.5.55]
Simple NAT Table
Inside Local IP   Inside Global IP
10.1.1.1          193.9.9.1
10.1.1.2          193.9.9.2
....              ....
1) Suppose the user at host 10.1.1.1 opens a connection to host 198.5.5.55.
2) The first packet that the router receives from host 10.1.1.1 causes the
router to check its NAT table.
3) The router replaces the source address with the inside global address found
in the NAT table. If no translation entry exists, the router determines that
the source address must be translated dynamically, selects a legal global
address from the predefined dynamic address pool and creates a translation
entry.
Note: static versus dynamic entries.
Example for a static configuration:
ip nat inside source static 10.1.1.1 193.9.9.1
interface ethernet 0
ip address 10.1.1.99 255.0.0.0
ip nat inside
interface serial 0
ip address 193.9.9.99 255.255.255.0
ip nat outside

Basic Principle (1b)
[Figure: packet from host 10.1.1.1 to host 198.5.5.55 passing the NAT router
(outside address 193.9.9.99):
  Inside:  SA 10.1.1.1   DA 198.5.5.55
  Outside: SA 193.9.9.1  DA 198.5.5.55]
Simple NAT Table
Inside Local IP   Inside Global IP
10.1.1.1          193.9.9.1
10.1.1.2          193.9.9.2
....              ....
In many NAT implementations the host portion of an IP address remains
unchanged. Only the prefix is translated.
Example for a dynamic configuration:
ip nat pool mynatconf 193.9.9.1 193.9.9.254 netmask 255.255.255.0
ip nat inside source list 1 pool mynatconf
!
interface ethernet 0
ip address 10.1.1.99 255.0.0.0
ip nat inside
!
interface serial 0
ip address 193.9.9.99 255.255.255.0
ip nat outside
!
access-list 1 permit 10.0.0.0 0.255.255.255

Basic Principle (1c)
[Figure: reply from host 198.5.5.55 to host 10.1.1.1 passing the NAT router
(outside address 193.9.9.99):
  Outside: SA 198.5.5.55  DA 193.9.9.1
  Inside:  SA 198.5.5.55  DA 10.1.1.1]
Simple NAT Table
Inside Local IP   Inside Global IP
10.1.1.1          193.9.9.1
10.1.1.2          193.9.9.2
....              ....
1) Host 198.5.5.55 responds to host 10.1.1.1 by using the inside global
address 193.9.9.1 as destination address.
2) When the router receives a packet with the inside global address 193.9.9.1
it performs a NAT table lookup to determine the associated inside local
address.
3) The router translates 193.9.9.1 to 10.1.1.1 and forwards the packet to host
10.1.1.1.
FYI:
Inside-to-outside translation occurs after routing
Outside-to-inside translation occurs before routing
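The table lookups for both directions can be sketched in Python. This is only a model of the principle, not of any router implementation: the class name, the method names and the representation of a packet as an (SA, DA) tuple are assumptions for this example, and translation timeouts are left out.

```python
class SimpleNat:
    """One-to-one NAT with static entries and a dynamic address pool (sketch)."""

    def __init__(self, static=None, pool=()):
        self.local_to_global = dict(static or {})   # inside local -> inside global
        self.global_to_local = {g: l for l, g in self.local_to_global.items()}
        # Pool addresses that are not already used by a static entry
        self.free = [a for a in pool if a not in self.global_to_local]

    def outbound(self, sa: str, da: str):
        """Inside-to-outside: replace the inside local source address."""
        if sa not in self.local_to_global:
            if not self.free:
                raise RuntimeError("dynamic address pool exhausted")
            inside_global = self.free.pop(0)        # create a dynamic entry
            self.local_to_global[sa] = inside_global
            self.global_to_local[inside_global] = sa
        return self.local_to_global[sa], da

    def inbound(self, sa: str, da: str):
        """Outside-to-inside: replace the inside global destination address."""
        inside_local = self.global_to_local.get(da)
        if inside_local is None:
            return None                             # no entry: cannot be delivered
        return sa, inside_local
```

With the static entry 10.1.1.1 to 193.9.9.1, the packet (10.1.1.1, 198.5.5.55) leaves as (193.9.9.1, 198.5.5.55), and the reply (198.5.5.55, 193.9.9.1) is delivered as (198.5.5.55, 10.1.1.1).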

Basic Principle (2a)
[Figure: two networks, each behind a NAT router (193.9.9.99 on the left,
198.5.5.55 on the right). The left host 10.1.1.1 has global address 193.9.9.1;
the right host 10.1.1.1 has global address 198.5.5.1.]
In this example we assume that the PC in the left network wants to send an IP
packet to the PC in the right network. Note that both networks use NAT.
Outside is everything between the two NAT-enabled routers.
By accident they use the same inside-local addresses. But this does not matter
anyway. You can also imagine using two completely different inside-local
addresses.

Basic Principle (2b)
[Figure: packet from the left host 10.1.1.1 to the right host 10.1.1.1 passing
the NAT routers 193.9.9.99 and 198.5.5.55:
  Left network:  SA 10.1.1.1   DA 198.5.5.1
  Internet:      SA 193.9.9.1  DA 198.5.5.1
  Right network: SA 193.9.9.1  DA 10.1.1.1]
Observe these translations as depicted above:
1) The left host (10.1.1.1) sends a packet to the right host (also 10.1.1.1).
Of course the right host is known by its outside-local address (198.5.5.1),
which is used as destination address.
2) The left NAT-enabled router translates only the source address (which was
an inside-local address) to an inside-global address (193.9.9.1). The
destination address (which is an outside-local address) remains unchanged
and is now called outside-global while the packet traverses the Internet.
3) The right NAT-enabled router only changes the destination address (which
it regards as inside-global) by translating it to an inside-local one. The
source address is regarded as outside-global and remains unchanged but is
now called outside-local.

Overloading (PAT)
Common problem:
  Many hosts inside
  But only one or a few inside-global addresses available
Solution:
  Many-to-one translation
  Aka "Overloading Inside Global Addresses"
  Aka "PAT"
Many-to-one translation is accomplished by identifying each traffic flow by
its source port number. This method is commonly known as Port Address
Translation (PAT). In the IETF documents you will also see the abbreviation
NAPT. In the Linux world it is known as masquerading.
When N inside hosts use the same source port number, the PAT router will
change N-1 of these identical source port numbers to the next free values.

Overloading Example (1)
[Figure: inside hosts 10.1.1.1 and 10.1.1.2 connect through the PAT router to
the outside host 65.38.12.9:
  Inside:  SA 10.1.1.1:1034   DA 65.38.12.9:80
  Outside: SA 173.3.8.1:1034  DA 65.38.12.9:80
  Inside:  SA 10.1.1.2:2138   DA 65.38.12.9:80
  Outside: SA 173.3.8.1:2138  DA 65.38.12.9:80]
Extended Translation Table
Prot.  Inside Local    Inside Global    Outside Local   Outside Global
TCP    10.1.1.1:1034   173.3.8.1:1034   65.38.12.9:80   65.38.12.9:80
TCP    10.1.1.2:2138   173.3.8.1:2138   65.38.12.9:80   65.38.12.9:80
The port number is the differentiator. Note that the TCP and UDP port number
range allows up to 65,536 numbers per IP address. This number is the upper
limit for simultaneous transmissions per inside-global IP address.
If the port numbers run out, PAT will move to the next IP address and try to
allocate the original source port again. This continues until all available
ports and IP addresses are utilized. If a PAT router runs out of addresses, it
drops the packet and sends an ICMP Host Unreachable message.
Generally, NAT/PAT is only practical when relatively few hosts in a stub
domain communicate outside of the domain at the same time. In this case, only
a small subset of the IP addresses in the own domain must be translated into
globally unique IP addresses.
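The allocation strategy described above (keep the original source port if possible, otherwise take the next free value, and move on to the next inside-global address only when one address is exhausted) can be modelled in Python. The class and method names, the default port range starting at 1024 and the wrap-around search order are assumptions of this sketch; timeouts and the ICMP error message are omitted.

```python
class Pat:
    """Many-to-one translation keyed by protocol, address and port (sketch)."""

    def __init__(self, global_ips, first_port=1024, last_port=65535):
        self.global_ips = list(global_ips)
        self.first_port, self.last_port = first_port, last_port
        self.out = {}    # (proto, local_ip, local_port) -> (global_ip, global_port)
        self.back = {}   # (proto, global_ip, global_port) -> (local_ip, local_port)

    def _candidates(self, port):
        # Original source port first, then the next higher values, then wrap around
        in_range = self.first_port <= port <= self.last_port
        start = port if in_range else self.first_port
        yield from range(start, self.last_port + 1)
        yield from range(self.first_port, start)

    def outbound(self, proto, local_ip, local_port):
        """Return the inside-global (ip, port), or None if everything is in use."""
        key = (proto, local_ip, local_port)
        if key in self.out:
            return self.out[key]
        for gip in self.global_ips:          # next address only if this one is full
            for gport in self._candidates(local_port):
                if (proto, gip, gport) not in self.back:
                    self.out[key] = (gip, gport)
                    self.back[(proto, gip, gport)] = (local_ip, local_port)
                    return gip, gport
        return None                          # pool exhausted: the packet is dropped

    def inbound(self, proto, global_ip, global_port):
        """Return the inside-local (ip, port) for a reply, or None if unknown."""
        return self.back.get((proto, global_ip, global_port))
```

With a single inside-global address 173.3.8.1, the flows 10.1.1.1:1034 and 10.1.1.2:2138 keep their port numbers, while a third host that also uses source port 1034 is mapped to 173.3.8.1:1035.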

Overloading Example (2)
[Figure and extended translation table: same scenario as in Overloading
Example (1)]
Extended Translation Table
Prot.  Inside Local    Inside Global    Outside Local   Outside Global
TCP    10.1.1.1:1034   173.3.8.1:1034   65.38.12.9:80   65.38.12.9:80
TCP    10.1.1.2:2138   173.3.8.1:2138   65.38.12.9:80   65.38.12.9:80
In this example both inside hosts (10.1.1.1 and 10.1.1.2) connect to the same
outside web server. The outside local addresses are mostly identical to the
outside global addresses, but in some situations we might want to translate
them also (see next slides for examples).
The dynamic translation table (or translation matrix) ages out after some time.
The default timeouts are:
Non-DNS UDP    5 minutes   (ip nat translation udp-timeout <seconds>)
DNS            1 minute    (ip nat translation dns-timeout <seconds>)
TCP            24 hours    (ip nat translation tcp-timeout <seconds>)
TCP FIN/RST    1 minute    (ip nat translation finrst-timeout <seconds>)
If overloading is not configured, the timeout period is 24 hours by default
(ip nat translation timeout <seconds>).
Configuration for the above example:
ip nat pool mypool 173.3.8.1 173.3.8.5 netmask 255.255.255.0
ip nat inside source list 1 pool mypool overload
interface ethernet 0
ip address 10.1.1.99 255.0.0.0
ip nat inside
interface serial 0
ip address 173.3.8.9 255.255.255.0
ip nat outside
access-list 1 permit 10.0.0.0 0.255.255.255

Overlapping Networks
= Same addresses are used locally and globally
What can happen?
Overlapping networks occur if we use non-legal (not officially assigned) IP
addresses that officially belong to another network. We can do that if we use
NAT to translate our internal addresses into global ones. However, if we want
to communicate with the other network (that uses our inside-local addresses as
global ones) we must consider some special issues...

Outside Address Translation
[Figure: inside host 9.3.1.2 in the hidden 9.0.0.0 network, NAT router, and
outside host 9.3.1.8 in the "true" 9.0.0.0 network:
  Outgoing packet, outside: SA 193.9.9.2  DA x.x.x.x
  Incoming packet, outside: SA 9.3.1.8    DA 193.9.9.2  (packet came from the
                                                         "true" 9.0.0.0 network)
  Incoming packet, inside:  SA 10.0.0.8   DA 9.3.1.2]
First we examine the simple case. Suppose we used a class A network 9.0.0.0
for several years and now we want to give it back to the world (thereby earning
a lot of money from our ISP).
Now we will present our network through NAT to the outside world.
Obviously the class A range we had given away will be used by other
customers, so incoming packets might have the same source addresses as we
still use for our devices. Clearly we should renumber our hosts with RFC 1918
private addresses.
But if we had a large number of hosts we might not want to renumber all
devices. Instead we translate the source addresses of incoming packets if
they come from the true class A network 9.0.0.0. By changing the source address
to an outside-local address, replies to these packets can be routed outside.

DNS Problem (1)
[Figure: inside host 5.1.2.3 in the hidden 5.1.2.0/24 network, NAT router, DNS
server 195.44.33.11, and outside host "Yahoo" 5.1.2.10 in the legal 5.1.2.0/24
network.
DNS request for host "Yahoo": SA=5.1.2.3 / DA=195.44.33.11]
This is a trickier issue. Usually we do not know the IP addresses of outside
hosts; rather we ask a DNS server for name resolution.

DNS Problem (2)
[Figure: same topology as before (inside host 5.1.2.3, DNS server
195.44.33.11, outside host "Yahoo" 5.1.2.10). After the NAT router:
DNS request for host "Yahoo": SA=178.12.99.3 / DA=195.44.33.11]

DNS Problem (3)
[Figure: same topology as before.
DNS reply: host "Yahoo" is 5.1.2.10, SA=195.44.33.11 / DA=178.12.99.3]
!OVERLAPPING ALERT!
We cannot tell our hosts that "Yahoo" has IP address 5.1.2.10...
They would think that Yahoo is inside and would try a direct delivery...!!!
But what if the DNS server replies with an IP address which is supposed to be
inside our own network? In this case the NAT router must manipulate the
layer-7 DNS information and translate the outside-global addresses.

DNS Problem (4)
[Figure: same topology as before. After the NAT router:
DNS reply: host "Yahoo" is 9.9.9.9, SA=195.44.33.11 / DA=5.1.2.3
NAT router: "Now my hosts must ask me where 9.9.9.9 is..."]
The router examines every DNS reply, ensuring that the resolved address is not
used inside. In such overlapping situations the router will translate the
address.
Note:
Cisco NAT is able to inspect and perform address translation on A (Address)
and PTR (Pointer) DNS Resource Records.

DNS Problem (5)
[Figure: same topology as before.
Message for host "Yahoo": SA=5.1.2.3 / DA=9.9.9.9
NAT router: "DA=9.9.9.9...? Must be translated"]
Of course, if the destination address of an outgoing packet matches a
previously introduced outside-local address, it must be translated into an
outside-global address.
The same procedure applies in the converse situation where the DNS server is
inside and a DNS request is sent by an outside host. If the name resolution
results in an inside local address, the NAT router has to translate this
address.
NOTE: Cisco IOS does not translate addresses inside DNS zone transfers.

DNS Problem (6)
[Figure: same topology as before. After the NAT router:
Message for host "Yahoo": SA=195.44.33.11 / DA=5.1.2.10]
NAT Table
Inside Local   Inside Global   Outside Global   Outside Local
5.1.2.3        195.44.33.11    5.1.2.10         9.9.9.9
To prepare our router for overlapping addresses we use either a static or a
dynamic configuration.
Static: (rest is similar to the previous examples)
ip nat outside source static 5.1.2.10 9.9.9.9
Dynamic:
ip nat pool insidepool 195.44.33.11 195.44.33.13 netmask 255.255.255.0
ip nat pool outsidepool 9.9.9.1 9.9.9.255 prefix-length 24
ip nat inside source list 1 pool insidepool
ip nat outside source list 1 pool outsidepool
!
interface ethernet0
ip address 5.1.2.99 255.0.0.0
ip nat inside
!
interface serial0
ip address 195.44.33.99 255.255.255.0
ip nat outside
!
access-list 1 permit 5.1.2.0 0.0.0.255

TCP Load Sharing (1)
Multiple servers represented by a single inside-global IP address
  Virtual host address
New TCP session requests to the virtual host are forwarded to one of a group
of real hosts
  Rotary group
TCP load sharing is an enhanced NAT feature and is used inside the intranet
because it has nothing to do with private address translation. If we want to
offer a highly loaded specific service to users, we can employ a NAT router to
map a single inside-global address (the virtual host address which is known to
the users) to multiple inside-local addresses, each assigned to a real host.
Every time a user connects to the virtual host and wants to establish a
session, this session is mapped to one of the real hosts in a round-robin
manner. That is why the group of real hosts is called "rotary group".
Note that the NAT router has no idea of the load distribution. Nor does the
router know whether the service is available!
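The round-robin mapping can be sketched in Python as follows. The class name, the method name and the example addresses (virtual host 10.1.1.127, real hosts 10.1.1.1 to 10.1.1.3) are hypothetical; a session is identified here by the client's address and port, and, as noted above, neither load nor availability of the real hosts is taken into account.

```python
import itertools

class RotaryGroup:
    """Maps new TCP sessions for a virtual host to real hosts in turn (sketch)."""

    def __init__(self, virtual_ip, real_hosts):
        self.virtual_ip = virtual_ip
        self._rotation = itertools.cycle(list(real_hosts))
        self.sessions = {}   # (client_ip, client_port) -> real host

    def destination_for(self, client, dst_ip, syn=False):
        """Return the destination address to use for a packet from client."""
        if dst_ip != self.virtual_ip:
            return dst_ip                                # not for the virtual host
        if syn or client not in self.sessions:
            self.sessions[client] = next(self._rotation)  # new session: next host
        return self.sessions[client]
```

Four new sessions to the virtual host are thus sent to the first, second, third and again the first real host, while later segments of an existing session stay on the host chosen for it.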

NAT and FTP
FTP control session negotiates port numbers
  PORT and PASV parameters must be processed by the NAT router when doing
  overloading (ASCII coded!!!)
Non-standard FTP port numbers are mostly supported today
  Cisco: ip nat service command
IP addresses and port numbers are carried by the FTP PORT and PASV
parameters. The problem here is that these addresses are in human-readable
ASCII format and the address length is variable! This affects the TCP segment
length and the SEQ and ACK numbers. These parameters must be transformed
for the duration of the connection.
Configuration of a non-standard port number on a Cisco IOS NAT router:
ip nat service list <acl> ftp tcp port <port-nr>
The access list specifies the inside local address of the FTP server. The port
number specifies the non-standard FTP control port.
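Why the ASCII encoding affects the sequence numbers can be seen in a small Python sketch. The function names are invented for this example, and only the PORT command in its RFC 959 form "PORT h1,h2,h3,h4,p1,p2" is handled; a real NAT router would also have to treat PASV replies and segments that do not contain exactly one command.

```python
import re

PORT_RE = re.compile(rb"PORT (\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\r\n")

def rewrite_port(payload: bytes, global_ip: str, global_port: int):
    """Replace address and port in an FTP PORT command.

    Returns the new payload and the change in length (delta) in bytes.
    """
    if PORT_RE.fullmatch(payload) is None:
        return payload, 0                     # not a PORT command: leave untouched
    p_high, p_low = divmod(global_port, 256)  # port is sent as two decimal bytes
    fields = global_ip.split(".") + [str(p_high), str(p_low)]
    new_payload = ("PORT " + ",".join(fields) + "\r\n").encode("ascii")
    return new_payload, len(new_payload) - len(payload)

def adjust_seq(seq: int, delta: int) -> int:
    """Shift a 32-bit sequence or acknowledge number by delta (with wrap-around)."""
    return (seq + delta) % 2 ** 32
```

Translating "PORT 10,1,1,1,4,10" (10.1.1.1, port 1034) to the address 173.3.8.1 makes the command one byte longer. From then on the router has to add 1 to the sequence numbers of the following segments in this direction and subtract 1 from the acknowledge numbers coming back, for the rest of the connection.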

NAT and ICMP
Many ICMP payloads contain IP headers
  NAT must translate both addresses and checksum
PING
  Echo request and echo reply are matched by the ICMP identifier
  Used by NAT instead of port numbers (overloading)
  If fragmented, only fragment 0 contains this identifier
  NAT tracks the IP identifier for the following fragments
Consider a NAT configuration with overloaded addresses (using an extended
translation table). Overloading requires some sort of identifier in order to
distribute incoming packets to the corresponding internal hosts. Since ICMP is
carried directly within IP, NAT cannot utilize port numbers, but the ICMP
identifier is used instead. This is only important for query messages such as
PING, which uses echo request and echo reply ICMP messages. Both ICMP message
types contain a 16-bit identifier field and a 16-bit sequence number field
(according to RFC 792 both are only optional but indeed commonly used). Only
fragment 0 creates the translation entry. If a fragment N with N > 0 arrives
first at the router, it gets dropped.

NAT and ...
H.323: TCP/UDP session bundles, ASN.1 encoded IP addresses in payload
NetBIOS over TCP/IP (NBT): packet header information at inconsistent offsets
SNMP: dynamic NAT makes it impossible to track hosts (traps) over longer
periods of time

Security (1)
Usually PAT can be detected
  Typical translation signatures
Local topology cannot be seen outside
  Typically SYN-ACKs from outside are blocked
Some ISPs do not allow customers to use PAT. The employment of PAT can
be detected by looking for translation hints. For example, a Linux box
typically performs Port Address Translation (PAT) using ports between 61000
and 64000, others use ranges starting above 32000, while a TCP/UDP end system
would start at 1024 for each socket. Additionally, most NAT routers decrement
the TTL as packets are routed.
Thus an ISP will see that the source ports are at very high values and the TTL
is one less than expected. The TTL problem can be solved with most operating
systems but requires some administration skills. In Windows the default TTL
can be adjusted by modifying the registry, while Linux needs a recompilation.
Most hardware NAT routers (such as Cisco routers) will not change the source
port where possible. If such a router does have to change the source port, it
chooses the next free port and so on. In such cases no ISP would suspect
NAT...


Security (2)
Typically prevents attacks like SMURF and WinNuke
    NAT cannot protect against all DoS attacks
Security requires additional software
    Mail filters etc.
Encrypted L3 payload must not contain address/port information

Some NAT routers perform stateful packet inspection (SPI), which allows
NAT devices to filter harmful packets such as SYN floods. SPI is merely a
marketing term meaning enhanced firewalling features.
NAT cannot translate payload address information if the payload is encrypted.
Secure Socket Layer (SSL) and Secure Shell (SSH) are implemented as
encrypted TCP payload, but the TCP header is not encrypted. Thus, NAT can
deal with SSL and SSH without problems. On the other hand, problems may
occur with Kerberos, X-Windows, Session Initiation Protocol (SIP), remote
shell (RSH), and other NAT-sensitive protocols.

Drawbacks of NAT
Translation is resource intensive (delays)
Encrypted protocols cannot be translated
Increased probability of mis-addressing
Might not support all applications
Hiding hosts might be a negative effect
Problems with SNMP, DNS, ...

Resource demand means that the traffic matrix requires lots of RAM, while
augmented protocol handling requires CPU power.
Each NAT session consumes about 160 bytes of DRAM (using Cisco IOS).
From this we conclude that 10,000 translations would consume 1.6 MB.
Mis-addressing occurs because the administrator is responsible for a proper
NAT configuration.
SNMP traffic is not supported by Cisco IOS NAT because of the MIB-dependent
style of SNMP packets.

Configuration Commands (1)
Declare interfaces to be inside/outside
ip nat { inside | outside }
Define a pool of addresses (global)
ip nat pool <name> <start-ip> <end-ip> { netmask <netmask> | prefix-length <prefix-length> } [ type { rotary } ]

Note that a pool of addresses needs to be defined only for dynamic translation.
If you plan to employ static translation only, you can skip the second command.

Configuration Commands (2)
Enable translation of inside source addresses
ip nat inside source { list <acl> pool <name> [overload] | static <local-ip> <global-ip> }
Enable translation of inside destination addresses
ip nat inside destination { list <acl> pool <name> | static <global-ip> <local-ip> }
Enable translation of outside source addresses
ip nat outside source { list <acl> pool <name> | static <global-ip> <local-ip> }

Packets from addresses that match those on the simple access list are
translated dynamically using the previously defined address pool. The keyword
[overload] enables PAT. The access list must permit only those addresses that
are to be translated.
Inside destination address translation should use addresses from a previously
defined rotary pool. A destination address (of an incoming packet) matching
the access list will be replaced with an address of the rotary pool in a
round-robin manner. See the previous section about TCP load sharing.
Outside source address translation is necessary for overlapping networks. See
the corresponding previous section.

Clearing Commands
Clear all dynamic NAT table entries
clear ip nat translation *
Clear a simple dynamic inside or inside+outside translation entry
clear ip nat translation inside <global-ip> <local-ip> [outside <local-ip> <global-ip>]
Clear a simple dynamic outside translation entry
clear ip nat translation outside <local-ip> <global-ip>
Clear an extended dynamic translation entry
clear ip nat translation <protocol> inside <global-ip> <global-port> <local-ip> <local-port> [outside <local-ip> <local-port> <global-ip> <global-port>]

Further Information
RFC 1631 (NAT)
RFC 3022 (Traditional NAT)
RFC 2694 (DNS ALG)
RFC 2766 (IPv4 to IPv6 Translation)
NAT Friendly Application Design Guidelines (Draft)


Summary
NAT hides inside from outside
Important to know terms inside/outside versus local/global
NAT devices must also be able to process L4-L7 headers
Some protocols might never be supported (SNMP, NBT, ...)
Simple TCP load sharing possible
NAT processing is resource intensive

TODO
RFC 2766 (IPv4-IPv6 NAT-Protocol Translation)
NAT with ISP multihoming and routing
Special NAT situations by example, case studies
DEBUG commands
IPSec Tunnel and NAT
IP Multicast and NAT
...will be covered in future releases!

2005/03/11 {C} Herbert Haas
Routing Introduction
Direct vs. Indirect Delivery
Static vs. Dynamic Routing
Distance Vector vs. Link State

"The most simple way to
accelerate a Router
is at 9.8 m/sec/sec."
Seen on Usenet

Routing Basics
Routing Introduction
    Direct Delivery
    Indirect Delivery
    Static Routing
    Default Routing
Dynamic Routing
    Distance Vector Routing
    Link State Routing

This chapter covers routing basics: first the two ways to deliver a packet,
direct and indirect, and then the three kinds of routing: static routing,
default routing, and dynamic routing, which is the most important kind today.

What is routing?
Finding a path to a destination address
Direct delivery performed by host
    Destination network = local network
Indirect delivery performed by router
    Destination network ≠ local network
    Packet is forwarded to default gateway

There are two ways to deliver a packet: direct delivery and indirect delivery.
With direct delivery (destination network = local network), the host resolves
the destination's layer-2 address, for example with an ARP request on
Ethernet, and then delivers the packet to the destination host itself. With
indirect delivery (destination network ≠ local network), the host forwards the
packet to its default gateway, a router that takes over the delivery.

Direct Delivery
IP host checks if packet's destination network is identical with local network
    By applying the configured subnet mask of the host's interface
If destination network = local network then the L2 address of the destination is discovered using ARP
    Not necessary on point-to-point connections

Before the IP host sends out a packet, it checks whether the destination
network of the packet is identical to its local network by applying the subnet
mask. If destination network = local network, the IP host needs a layer-2
address to deliver the packet correctly. To obtain it, the host sends out an
ARP request. With the information in the reply, the host can send the packet
to the right host in its local network.
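The subnet-mask check a host performs before choosing direct or indirect delivery can be sketched with Python's standard `ipaddress` module. The function name `is_direct_delivery` is made up for this illustration; the example addresses are borrowed from the dial-in example later in this chapter.

```python
import ipaddress


def is_direct_delivery(local_ip, netmask, dest_ip):
    """Return True if dest_ip lies in the network of the host's own interface."""
    interface = ipaddress.ip_interface(f"{local_ip}/{netmask}")
    return ipaddress.ip_address(dest_ip) in interface.network


# Same network: the host would ARP for the destination and deliver directly.
same_net = is_direct_delivery("195.54.190.220", "255.255.255.0", "195.54.190.12")
# Different network: the packet would be handed to the default gateway.
other_net = is_direct_delivery("195.54.190.220", "255.255.255.0", "10.0.0.1")
```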

IP Host Facts
IP hosts also have routing tables!
    But typically only a static route to the default gateway is entered
ARP cache aging timer: 20 minutes

Note that simple workstations and PCs also maintain routing tables, not for
routing pass-through packets but so that locally originated packets are sent
to the most reasonable next hop. Typically, the routing table consists of only
a single entry, which is the default gateway for this host.
Additional entries can also be made, indicating other gateways for some
dedicated routes.
Additionally, a host must maintain an ARP cache. The ARP cache stores layer-2
MAC addresses and the associated IP addresses of interfaces with which
communication has occurred recently. Any ARP result is stored in this cache,
so subsequent packets to the same destination do not invoke ARP each time. By
default the ARP cache is flushed after 20 minutes. Of course this value can be
configured individually, even by DHCP.

Indirect Delivery
Default gateway delivers packet on behalf of its host using a routing table
Routing table components
    Destination network (+ subnet mask)
    Next hop (+ outgoing interface)
    Metric (+ Administrative Distance)

Every router has its own routing table. This table contains information such
as the destination network with its subnet mask, the next hop, and the metric.
If the destination network of a packet ≠ local network, the host sends the
packet to the router. The router compares the destination address with its
routing table and makes a forwarding decision. Routers in small networks
usually have a so-called default gateway of their own. This gateway is used,
for example, to forward packets to a router that is connected to the Internet.

Router
Initially Unix workstations with several network interface cards
Today specialized hardware
[Figure: Cisco 3600 Router]

The picture above shows one of the most widely used routers today, the Cisco
3600 platform, which supports various Ethernet and serial interfaces.
Update: Today (2008) the most commonly used Cisco router series are the
"Integrated Services Routers" 800 (SOHO), 1800, 2800, and 3800, while in larger
networks the 7200 or 7600 series routers are found.

Routing Table Example
Gateway of last resort is 175.18.1.2 to network 0.0.0.0
10.0.0.0 255.255.0.0 is subnetted, 4 subnets
C 10.1.0.0 is directly connected, Ethernet1
R 10.2.0.0 [120/1] via 10.4.0.1, 00:00:05, Ethernet0
R 10.3.0.0 [120/5] via 10.4.0.1, 00:00:05, Ethernet0
C 10.4.0.0 is directly connected, Ethernet0
R 192.168.12.0 [120/3] via 10.1.0.5, 00:00:08, Ethernet1
S 194.30.222.0 [1/0] via 10.4.0.1
S 194.30.223.0 [1/0] via 10.1.0.5
C 175.18.1.0 255.255.255.0 is directly connected, Serial0
S* 0.0.0.0 0.0.0.0 [1/0] via 175.18.1.2
The picture above shows an example of a routing table. The entry 0.0.0.0 is
used for the default gateway. The letter at the beginning of each entry
indicates how the route was learned: "C" corresponds to "directly connected",
"R" means "learned by RIP", "S" means "static route", and so on. The numbers in
the brackets denote the administrative distance and the metric. For example,
[120/5] means AD = 120, metric = 5.

IP Routing Basics
[Figure: routers interconnecting the networks 10.0.0.0, 172.16.0.0, 172.20.0.0, and 192.168.1.0 via the transit networks 192.168.2.0, 192.168.3.0, and 192.168.4.0]
Routing Table
Net-ID / Mask      Next-Hop     Metric  Port
10.0.0.0 / 8       local        0       e0
172.16.0.0 / 16    192.168.3.2  1       s1
172.20.0.0 / 16    192.168.2.2  2       s0
192.168.1.0 / 24   192.168.2.2  1       s0
192.168.2.0 / 24   local        0       s0
192.168.3.0 / 24   local        0       s1
192.168.4.0 / 24   192.168.3.2  1       s1

The picture above shows a small network and an example of a routing table.
Suppose a host in network 10.0.0.0 wants to send a packet to a host in network
192.168.1.0.
The destination network ≠ local network, so the router must make a forwarding
decision. The router compares the destination address with its routing table
and finds the matching entry (192.168.1.0 / 24, next hop 192.168.2.2, metric 1,
port s0). It sends the packet out via port s0 to the next hop, the router with
the IP address 192.168.2.2. That router is directly connected to network
192.168.1.0. After an ARP request, it delivers the packet to the destination
host.
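The forwarding decision in this example can be sketched in Python using the routing table from the slide. The names `ROUTES` and `lookup` are made up for the illustration. The sketch prefers the longest matching prefix, which is the usual IP forwarding rule but is an addition here: the slide's table has no overlapping entries, so it never comes into play.

```python
import ipaddress

# (prefix, next hop, metric, port), copied from the slide's routing table
ROUTES = [
    ("10.0.0.0/8", "local", 0, "e0"),
    ("172.16.0.0/16", "192.168.3.2", 1, "s1"),
    ("172.20.0.0/16", "192.168.2.2", 2, "s0"),
    ("192.168.1.0/24", "192.168.2.2", 1, "s0"),
    ("192.168.2.0/24", "local", 0, "s0"),
    ("192.168.3.0/24", "local", 0, "s1"),
    ("192.168.4.0/24", "192.168.3.2", 1, "s1"),
]


def lookup(dest_ip, routes=ROUTES):
    """Return (next hop, metric, port) of the best matching entry, or None."""
    address = ipaddress.ip_address(dest_ip)
    best = None
    for prefix, next_hop, metric, port in routes:
        network = ipaddress.ip_network(prefix)
        if address in network:
            # Assumption: with several matches the longest prefix wins.
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, next_hop, metric, port)
    return None if best is None else best[1:]


decision = lookup("192.168.1.7")
```

Here `decision` names the next hop 192.168.2.2 on port s0, as in the walk-through. A destination that matches no entry returns `None`, which is the case a default route would cover.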

Static or Dynamic
Static routing entries are configured manually
    Override routes learned via dynamic routing
    Can be set as permanent (will not be removed if interface goes down)
    Only way for certain technologies (DDR)
Dynamic routing entries are learned by routing protocols
    Adapts to topology changes
    But additional routing-traffic overhead

The only difference between static and dynamic routing is that static routing
entries are configured manually, whereas dynamic routing entries are learned by
routing protocols. Static routes can be set as permanent, which means that such
entries are not removed when the interface goes down, and they are not
overwritten by routing protocols.

Reasons for Static Routing
Very low bandwidth links (e.g. dialup links)
Administrator needs control over the link
Backup links
Link is the only path to a stub network
Router has very limited resources and cannot run a routing protocol
ip route prefix mask {ip-address | interface-type interface-number} [distance] [tag tag] [permanent]
tag: tag value that can be used as a "match" value for controlling redistribution via route maps
permanent: specifies that the route will not be removed, even if the interface shuts down

(If you don't understand the tag keyword, please wait for the follow-up
lectures.)

Routing Paradigm
Destination Based Routing
    Source address is not taken into account for the forward decision
Hop by Hop Routing
    IP datagrams follow the signposts given by routing table entries
    Network's routing state must be loop-free and consistent
Least Cost Routing
    Typically only the best path is entered into routing table

The IP routing paradigm is fundamental to IP routing. Firstly, IP routing is
"destination based routing", which means the source IP address is never
examined during the routing process. Secondly, IP routing is "hop-by-hop",
which emphasizes the difference from virtual circuit principles. The routing
table in every router within the autonomous system must be both accurate and up
to date so that datagrams can be directed across the network to their
destination.
In IP the path of a packet is not pre-defined and not connection oriented;
rather, each single router performs a routing decision for each packet.
Thirdly, IP routing is "least cost" in that only the path with the lowest
metric is selected in case of multiple redundant paths to the same destination.
Note that several vendors extend these rules by providing additional features,
but the routing paradigm generally holds for most of the routers in the
Internet, at least for the basic routing processes.

Default Routing
Special static route
    Traffic to unknown destinations is forwarded to default router ("Gateway of Last Resort")
Routing table entry "0.0.0.0 0.0.0.0"
Hopefully, default gateway knows more destination networks
Advantage: Smaller routing tables!

The so-called default gateway keeps routing tables small. When a router
receives a packet and cannot find the destination network of the packet in its
routing table, it forwards the packet to its default gateway, hoping that the
next router knows more.

Default Routing (3)
Default Routes to the Internet
[Figure: a dial-in host connected over interface S0 to a router (IP address 195.54.190.12) that leads to the Internet]
Host Route:
195.54.190.220/32 - S0
C:> ipconfig
IP Address. . . . . : 195.54.190.220
Subnet Mask . . . . : 255.255.255.0
Default Gateway . . : 195.54.190.12
C:> route print
Network Netmask Gateway Interface Metric
0.0.0.0 0.0.0.0 195.54.190.12 195.54.190.220 1

Your home PC also uses a default gateway.
Once the host dials in, the router assigns an IP address (195.54.190.220) and a
default gateway (195.54.190.12) to that host and also creates a (dynamic)
"host route" that points to that host. The host takes the default gateway
information and creates a default route pointing to its local interface.

On Demand Routing (ODR)
Efficient for hub-and-spoke topologies
    Same configuration at each router
Uses CDP to send the prefixes of attached networks from the spokes, or stub networks, to the hub or core router
    CDP does this automatically (!)
The hub router sends its interface address of the shared link as the default route for the stub router
Note:
    Don't enable routing protocols on spoke routers
    CDP must be enabled (don't forget e.g. ATM interfaces)
Every 60 sec a CDP message is sent by default (change with "cdp timer" command)
(config)# router odr ! Only on hub router

ODR has the advantage of sending minimal information, namely the prefix, the
mask, and a metric of one, every 60 seconds by default. This information
populates the routing table of the hub router and can be redistributed into a
routing protocol. Because the mask is sent in the update, VLSM can be used.

Dynamic Routing
Each router can run one or more routing protocols
Routing protocols are information sources to create routing table
Routing protocols differ in convergence time, loop avoidance, network size, complexity

In contrast to static routing, where every route must be configured manually,
dynamic routing works with one or more routing protocols. These protocols
inform the router and create the routing table automatically. Dynamic routing
is widely used in the Internet.

Routing Protocol Comparison
Routing Protocol  Complexity    Max. Size                   Convergence Time  Reliability               Protocol Traffic
RIP               very simple   16 Hops                     Up to 480 secs    Not absolutely loop-safe  High
RIPv2             very simple   16 Hops                     Up to 480 secs    Not absolutely loop-safe  High
IGRP              simple        x                           x                 medium                    medium
EIGRP             complex       x                           x                 x                         x
OSPF              very complex  Thousands of Routers        Fast              High                      low/depends
IS-IS             complex       Thousands of Routers        Fast              High                      x
BGP-4             complex       more than 100,000 networks  Fast              Very High                 x

The table above gives a rough comparison of the most important routing
protocols used today. Note that some values cannot be determined easily and
are left open (x) for this reason.

Metric
Routing protocols typically find out more than one route to the destination
Metrics help to decide which path to use
    Hop count
    Cost (reciprocal value of bandwidth)
    Load, Reliability, Delay, MTU

Often a router finds more than one path for forwarding a packet to a given
destination. The metric helps the router find the "best" way. Note that there
are several types of metrics used in modern routing protocols. Typically they
cannot be compared with each other. For example, a simple hop count is no
measure of speed (bandwidth).

Administrative Distance
Several routing protocols independently find out different routes to same destination
    Which one to choose?
"Administrative Distance" is a trustworthiness value associated to each routing protocol
    The lower the better
    Can be changed

If several different routing protocols suggest different paths to the same
destination at the same time, the router makes a trust decision based on the
"Administrative Distance", which is a Cisco feature. Each routing protocol is
assigned a static AD value indicating its trustworthiness: the lower the
better. Of course these values can be manipulated for special purposes.

Administrative Distances Chart
Route source                    AD
RIP                             120
OSPF                            110
IGRP                            100
I-EIGRP                         90
E-BGP                           20
I-BGP                           200
E-EIGRP                         170
EGP                             140
IS-IS                           115
EIGRP Summary Route             5
Static route to next hop        1
Static route through interface  0
Directly Connected              0
Unknown                         255

Note the difference between static routes: a static route that points to a
next-hop address has AD = 1, whereas a static route configured through an
interface is treated as directly connected (AD = 0).
AD also tells the router that E-BGP updates are more trustworthy than I-BGP
messages.

Remember
1) Using the METRIC one routing protocol determines the best path to a destination.
2) A router running multiple routing protocols might be told about multiple possible paths to one destination.
3) Here the METRIC cannot help for decisions because different types of METRICS cannot be compared with each other.
4) A router chooses the route which is proposed by the routing protocol with the lowest ADMINISTRATIVE DISTANCE.
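The four points above can be condensed into a small Python sketch. The AD values are taken from the Administrative Distances chart; the dictionary keys, the function name `select_route`, and the candidate routes are illustrative assumptions.

```python
# Default administrative distances from the chart (lower = more trusted)
ADMIN_DISTANCE = {
    "connected": 0,
    "static": 1,
    "e-bgp": 20,
    "i-eigrp": 90,
    "igrp": 100,
    "ospf": 110,
    "is-is": 115,
    "rip": 120,
    "i-bgp": 200,
}


def select_route(candidates):
    """candidates: list of (protocol, metric, next_hop) for ONE destination.

    Sorting by (AD, metric) compares metrics only between routes of the same
    protocol, because routes of different protocols already differ in AD.
    """
    return min(candidates, key=lambda c: (ADMIN_DISTANCE[c[0]], c[1]))


chosen = select_route([
    ("rip", 3, "10.1.0.5"),     # small metric, but RIP is trusted least
    ("ospf", 74, "10.4.0.1"),
    ("ospf", 20, "10.4.0.2"),   # best OSPF path, and OSPF beats RIP on AD
])
```

Note that the RIP metric of 3 never competes with the OSPF costs of 20 and 74; only the AD decides between the two protocols.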

AD with Static Routes
Each static route can be given a different administrative distance
This way fall-back routes can be configured
[Figure: three paths to the same destination with AD = 5, AD = 10, and AD = 20; link labels: Dialup, ISDN]

In the example above, there are several static routes to the same destination.
The three paths differ in quality (more or fewer hops, bandwidth, ...), so every
path is assigned a different AD. If there are problems with the main path
(AD 5), the router automatically changes to the next path (AD 10), and so on.
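The fall-back behaviour can be sketched as follows. The route list mirrors the three AD values from the slide; the interface names and the function name `active_route` are hypothetical, and a real router decides from its own interface state rather than from a set passed in by hand.

```python
def active_route(static_routes, up_interfaces):
    """Pick the static route with the lowest AD whose interface is still up."""
    usable = [r for r in static_routes if r["interface"] in up_interfaces]
    if not usable:
        return None
    return min(usable, key=lambda r: r["ad"])


# Hypothetical interface names for the three paths shown on the slide
routes = [
    {"interface": "Serial0", "ad": 5},    # main path
    {"interface": "BRI0", "ad": 10},      # first fall-back
    {"interface": "Async1", "ad": 20},    # last fall-back
]

normal = active_route(routes, {"Serial0", "BRI0", "Async1"})
main_down = active_route(routes, {"BRI0", "Async1"})
```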

Classification
Depending on age:
    Classful (no subnet masks)
    Classless (VLSM/CIDR supported)
Depending on scope:
    IGP (Inside an Autonomous System)
    EGP (Between Autonomous Systems)
Depending on algorithm:
    Distance Vector (Signpost principle)
    Link State (Roadmap principle)

All routing protocols can be classified in three ways. If a routing protocol is
able to carry a subnet mask for each route we call it "classless", otherwise
"classful". Today, most modern routing protocols are classless and therefore
support VLSM and CIDR. A routing protocol used inside an autonomous system is
called an "Interior Gateway Protocol (IGP)", while only "Exterior Gateway
Protocols (EGPs)" are used between autonomous systems. Technically, all routing
protocols use one of two possible algorithms: "Distance Vector" protocols rely
on the signpost principle, while "Link State" protocols maintain a road map of
the whole network.

Distance Vector (1)
After powering up, each router only knows about directly attached networks
Routing table is sent periodically to all neighbor routers
Received updates are examined, changes are adopted in own routing table
Metric information (originally) is number of hops
"Bellman-Ford" algorithm

Distance vector protocols work according to the signpost principle. A part of
the router's own routing table is sent periodically to all neighbor routers
(e.g. RIP: every 30 seconds).
A signpost carries the destination network, the hop count (metric, "distance"),
and the next hop.
After a router receives an update, it extracts the new information. Known
routes with a worse metric are ignored.
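How a router might process one received signpost update is sketched below. The table layout (network mapped to hop count and next hop, with 0 hops and no next hop for directly attached networks) and the function name `process_update` are assumptions for the illustration. The sender is assumed to have already added one hop, as in the RIP examples of the next chapter.

```python
def process_update(table, sender, update):
    """Merge a neighbor's signposts into the local routing table.

    table:  dict network -> (hops, next_hop)
    update: dict network -> hops as advertised by `sender`
    Returns True if anything in the table changed.
    """
    changed = False
    for network, hops in update.items():
        known = table.get(network)
        # Adopt unknown networks and strictly better metrics; ignore the rest.
        if known is None or hops < known[0]:
            table[network] = (hops, sender)   # next hop is always the sender
            changed = True
    return changed


router_b = {"2.0.0.0": (0, None), "12.0.0.0": (0, None)}
changed = process_update(router_b, "A", {"1.0.0.0": 1, "12.0.0.0": 1})
```

After the call, router B has learned 1.0.0.0 via A with one hop, while the advertised route to 12.0.0.0 is ignored because B is directly attached to that network.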

Distance Vector (2)
Next hop is always originating router
    Topology behind next hop unknown
    Signpost principle
Loops can occur!
Additional mechanisms needed:
    Maximum hop count
    Split horizon (with poison reverse)
    Triggered update
    Hold down
Examples: RIP, RIPv2, IGRP (Cisco)

Routing loops are a big problem with distance vector protocols. Because of the
simple principle of distance vector protocols, routing loops cannot be
prevented entirely. Access lists, disconnections and reconnections, router
malfunctions, etc. can always lead to them; there is no 100% solution.

Link State (1)
Each two neighbored routers establish adjacency
Routers learn real topology information
    Through "Link State Advertisements"
    Stored in database (Roadmap principle)
Updates only upon topology changes
    Propagated by flooding (very fast convergence)

Link-state routing protocols were designed for large networks. These protocols
are more reliable and converge fast.
The smallest topological unit is simply the information: ROUTER-LINK-ROUTER

Link State (2)
Routing table entries are calculated by applying the Shortest Path First (SPF) algorithm on the database
    Loop-safe
    Alternative paths immediately known
    CPU and memory greedy
Large networks can be split into areas
Examples: OSPF, Integrated IS-IS

By applying the SPF algorithm to the link state database, each router can
create its routing table entries on its own.
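A minimal sketch of the SPF calculation, written as Dijkstra's algorithm over a toy link state database, may make this concrete. The database layout (router mapped to its neighbors and link costs), the router names, and the costs are assumptions for illustration. Real protocols such as OSPF store far richer link state records.

```python
import heapq


def spf(lsdb, root):
    """Return dict destination -> (total cost, first hop) as seen from root."""
    best = {root: (0, None)}
    heap = [(0, root)]
    done = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in done:
            continue                      # stale heap entry, already settled
        done.add(node)
        first_hop = best[node][1]
        for neighbor, link_cost in lsdb.get(node, {}).items():
            new_cost = cost + link_cost
            if neighbor not in best or new_cost < best[neighbor][0]:
                # Directly attached neighbors become their own first hop.
                hop = neighbor if node == root else first_hop
                best[neighbor] = (new_cost, hop)
                heapq.heappush(heap, (new_cost, neighbor))
    return best


# Toy database: the direct A-C link is expensive, the detour via B is cheap.
lsdb = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1},
    "C": {"A": 5, "B": 1},
}
routes_from_a = spf(lsdb, "A")
```

Because every router runs the same calculation on the same database, the resulting next hops are consistent across the network, which is why the slide calls the approach loop-safe.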

Summary
Routing is the "art" of finding the best way to a given destination
Can be static or dynamic
    Static means: YOU are defining the way packets are going
    Dynamic means: A routing protocol is "trying" to find the best way to a given destination
Two important principles:
    Distance Vector
    Link State


Quiz
What are advantages of static routing?
What are advantages of dynamic routing?
Why are default routes used to access the Internet?
Why is the convergence time lower with link-state routing protocols?

2005/03/11 {C} Herbert Haas
RIP
Signpost Routing, Version 1

Routing Information Protocol
Interior Gateway Protocol (IGP)
Distance-Vector Routing Protocol
    Bellman Ford Algorithm
    RFC 1058 released in 1988
Classful
    No subnet masks carried
Distributed through BSD UNIX 4.2 in 1982 (routed)

RIP is a so-called distance vector routing protocol: its routing updates are
like "signposts" pointing to the shortest-hop path to known destination
networks. The algorithm was developed by R. E. Bellman, L. R. Ford, and D. R.
Fulkerson and was first implemented in the ARPANET in 1969. In the mid-1970s,
Xerox created the "Gateway Information Protocol" (GWINFO) to route the Palo
Alto Research Center (PARC) Universal Protocol, also known as "PUP". PUP
became the Xerox Network Systems (XNS) protocol suite and GWINFO became
XNS RIP. XNS RIP in turn was the basis for Novell's IPX RIP, AppleTalk's
Routing Table Maintenance Protocol (RTMP), and IP RIP. We will only discuss
IP RIP here.
RIP is an Interior Gateway Protocol (IGP), that is, RIP is only used inside an
Autonomous System. Further explanations are given in the BGP modules.
RIP is a classful routing protocol, because RIP (version 1) does not bind
subnet masks to the routes. So RIP (version 1) assumes classful addressing.
Subnet masks can be used as long as discontiguous subnetting is avoided.
Typically, every UNIX variant includes routed (pronounced "route-dee", for
routing daemon) as part of the operating system, so UNIX workstations can be
configured to discover each RIP router in the network, and hence a
default-route entry would not be necessary.

RIP Basics
Signpost principle
    Own routing table is sent periodically (every 30 seconds)
Receiver of update extracts new information
    Known routes with worse metric are ignored
What is a signpost made of?
    Destination network
    Hop Count (metric, "distance")
    Next Hop ("vector", given implicitly by sender's address!)

The whole distance vector philosophy is based upon the signpost principle: each
router periodically sends a copy of its own routing table to each neighbor.
Upon receiving such a routing update, a router extracts unknown routes and
routes whose metric has improved. For RIP the update period is 30 seconds.
Using this principle, each router learns how to reach destinations only via
signposts; the details of the path are unknown. The routing update (signpost)
basically consists of a list of destination networks and the hop counts
("distances") associated with them. For all these destinations there is only
one next hop: the sending router's address.

"Routing By Rumour"
Good news propagates quickly
    30 seconds per network
Bad news is ignored
    Except when sent by routers from which these routes had been learned initially
    But better news from ANY router will be preferred
Unreachable messages propagate slowly
    180 seconds per network

Bad news (network reachability with a worse metric) is only accepted if the
message has been sent by the router from which we previously learned about that
route.
Since RIP should discover the best routes to each destination, any routing
update is accepted that contains a better route than previously learned.
A route is declared unreachable if it has not been refreshed by routing updates
for 180 seconds.
In the worst case "bad news" propagates very slowly through the network.
Special unreachable messages have been introduced in order to improve the
convergence time. Unreachable messages are normal routing updates but with the
metric set to "infinity".
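The acceptance rules and the 180-second timeout can be sketched as follows. The value 16 for "infinity" comes from RFC 1058; the table layout, the function names `accept_advert` and `expire_routes`, and the use of plain numbers as timestamps are assumptions made for the illustration.

```python
INFINITY = 16    # RIP's "infinity" metric according to RFC 1058
TIMEOUT = 180    # seconds without refresh before a route is unreachable


def accept_advert(table, network, hops, sender, now):
    """Apply one advertised route; return True if it was accepted."""
    entry = table.get(network)
    if entry is None:
        if hops >= INFINITY:
            return False                      # nothing to learn from "unreachable"
    elif hops >= entry["hops"] and entry["next_hop"] != sender:
        return False                          # bad news from a third party: ignore
    # New route, better news from anyone, or any news from the current next hop
    table[network] = {"hops": min(hops, INFINITY), "next_hop": sender, "seen": now}
    return True


def expire_routes(table, now):
    """Mark learned routes that were not refreshed in time as unreachable."""
    for entry in table.values():
        if now - entry["seen"] >= TIMEOUT:
            entry["hops"] = INFINITY


table = {}
accept_advert(table, "2.0.0.0", 1, "B", now=0)             # learned from B
ignored = accept_advert(table, "2.0.0.0", 3, "C", now=10)  # worse, other router
worse = accept_advert(table, "2.0.0.0", 4, "B", now=20)    # worse, but from B
```

The second call is ignored while the third is accepted, even though it carries the worse metric, because it comes from the router the route was originally learned from.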

Without Split Horizon (1)
[Figure: Router A (e0: 1.0.0.0, s0: 12.0.0.0) and Router B (s0: 12.0.0.0, e0: 2.0.0.0) connected over network 12.0.0.0]
Router A routing table
NET       Hops    IF
1.0.0.0   direct  e0
12.0.0.0  direct  s0
2.0.0.0   1       s0
Router B routing table
NET       Hops    IF
2.0.0.0   direct  e0
12.0.0.0  direct  s0
1.0.0.0   1       s0
Routing update (DA=, SA=A)
NET       Hops
1.0.0.0   1
2.0.0.0   2
12.0.0.0  1
Routing update (DA=, SA=B)
NET       Hops
2.0.0.0   1
1.0.0.0   2
12.0.0.0  1

This is the basic principle of RIP (without split horizon). Every 30 seconds a
router sends its whole routing table to every neighbor router, increasing each
hop count by 1.
A router that receives this data adds the new information to its routing
table. If the router already knows a better path, for example a direct
connection to a network, it ignores this information.

Without Split Horizon (2)
[Figure: same topology as before; steps 1 and 2: network 2.0.0.0 fails at Router B]
Router A routing table
NET       Hops    IF
1.0.0.0   direct  e0
12.0.0.0  direct  s0
2.0.0.0   1       s0
Router B routing table
NET       Hops    IF
2.0.0.0   ???     ??
12.0.0.0  direct  s0
1.0.0.0   1       s0

In this example we see what happens if network 2 crashes. Immediately,
router B has no more information about this network. What would happen if
router A sends a routing update now?

Without Split Horizon (3)
[Figure: same topology without network 2.0.0.0; steps 3 to 5: A's update reaches B, B installs the route and advertises it back]
Router A routing table
NET       Hops    IF
1.0.0.0   direct  e0
12.0.0.0  direct  s0
2.0.0.0   1       s0
Router B routing table
NET       Hops    IF
2.0.0.0   2       s0
12.0.0.0  direct  s0
1.0.0.0   1       s0
Routing update (DA=, SA=A)
NET       Hops
1.0.0.0   1
2.0.0.0   2
12.0.0.0  1
Routing update (DA=, SA=B)
NET       Hops
2.0.0.0   3
1.0.0.0   2
12.0.0.0  1

Now router B receives a routing update from router A that includes
reachability information about network 2. Because router B has no information
about network 2, it adds this information to its routing table and continues
sending its normal routing updates to router A, thereby increasing the hop
count by 1.

Without Split Horizon (4)
[Figure: same topology; steps 6 to 9: the updates between A and B keep raising the hop count for 2.0.0.0]
Router A routing table
NET       Hops    IF
1.0.0.0   direct  e0
12.0.0.0  direct  s0
2.0.0.0   3       s0
Router B routing table
NET       Hops    IF
2.0.0.0   4       s0
12.0.0.0  direct  s0
1.0.0.0   1       s0
Routing update (DA=, SA=A)
NET       Hops
1.0.0.0   1
2.0.0.0   4
12.0.0.0  1
Routing update (DA=, SA=B)
NET       Hops
2.0.0.0   5
1.0.0.0   2
12.0.0.0  1
...Count to Infinity...
During count to infinity packets to network 2.0.0.0 are caught in a routing loop

Both router A and router B now hold stale information about network 2, and
each increases the hop count by 1 with every routing update. Count to infinity
occurs. Packets to network 2.0.0.0 are now caught in a routing loop.
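The exchange on the last three slides can be replayed with a few lines of Python. This is a simplified simulation under stated assumptions: only the route to network 2.0.0.0 is tracked, updates strictly alternate between A and B, and counting stops at 16, the RIP "infinity" from RFC 1058. The function name is made up.

```python
INFINITY = 16   # RIP's "infinity" metric according to RFC 1058


def count_to_infinity(a_hops=1):
    """Replay the A/B exchange after B has lost network 2.0.0.0.

    a_hops is A's stale hop count for 2.0.0.0 (learned via B).
    Returns a list of (B's hops, A's hops) after each exchange of updates.
    """
    history = []
    while True:
        # A advertises its stale route; B has no route and installs it.
        b_hops = min(a_hops + 1, INFINITY)
        # B advertises it back; A accepts because B is its next hop.
        a_hops = min(b_hops + 1, INFINITY)
        history.append((b_hops, a_hops))
        if a_hops >= INFINITY and b_hops >= INFINITY:
            return history


rounds = count_to_infinity()
```

The first entries of `rounds` are (2, 3) and (4, 5), matching the routing tables and updates on the slides. The counting only ends because of the maximum hop count.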

Split Horizon
A router will not send information about routes through an interface over which the router has learned about those routes
    Exactly THIS is split horizon
Idea: "Don't tell neighbor of routes that you learned from this neighbor"
    That's what humans (almost) always do: Don't tell me what I've told you!
Cannot 100% avoid routing loops!

Nowadays all routers work with split horizon; there is no RIP network without
it. The principle of split horizon is simple: "Don't tell a neighbor about
routes that you learned from it."
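Split horizon only changes how an outgoing update is assembled. A minimal sketch follows; the table layout (network mapped to hop count and interface, with 0 hops for directly attached networks) and the function name `build_update` are assumptions for the illustration.

```python
def build_update(table, out_interface, split_horizon=True):
    """Build the update sent out of one interface.

    table: dict network -> (hops, interface); hops is 0 for direct networks.
    Routes reached through out_interface are left out when split horizon is on.
    """
    update = {}
    for network, (hops, interface) in table.items():
        if split_horizon and interface == out_interface:
            continue                     # "don't tell me what I've told you"
        update[network] = hops + 1       # the sender adds its own hop
    return update


# Router A from the "RIP At Work" example: e0 = LAN, s0 = to B, s1 = to C
router_a = {
    "1.0.0.0": (0, "e0"),
    "12.0.0.0": (0, "s0"),
    "31.0.0.0": (0, "s1"),
}
to_b = build_update(router_a, "s0")
to_c = build_update(router_a, "s1")
```

With split horizon switched off, the same call would also advertise the network attached to the outgoing interface, which is the behaviour shown in the "Without Split Horizon" slides.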

RIP At Work (A)
[Figure: Routers A, B, and C connected in a triangle. LANs: 1.0.0.0 (A, e0), 2.0.0.0 (B, e0), 3.0.0.0 (C, e0). Serial links: 12.0.0.0 (A-B), 23.0.0.0 (B-C), 31.0.0.0 (C-A).]
Router A routing table
NET       Hops    IF
1.0.0.0   direct  e0
12.0.0.0  direct  s0
31.0.0.0  direct  s1
Router B routing table
NET       Hops    IF
2.0.0.0   direct  e0
12.0.0.0  direct  s0
1.0.0.0   1       s0
23.0.0.0  direct  s1
31.0.0.0  1       s0
Router C routing table
NET       Hops    IF
3.0.0.0   direct  e0
31.0.0.0  direct  s0
1.0.0.0   1       s0
23.0.0.0  direct  s1
12.0.0.0  1       s0
Routing update (DA=, SA=A)
NET       Hops
1.0.0.0   1
31.0.0.0  1
Routing update (DA=, SA=A)
NET       Hops
1.0.0.0   1
12.0.0.0  1

Split horizon at work: router A does not tell router B about network 12 and
does not tell router C about network 31, because router A knows that router B
must have a direct connection to network 12 and that router C must have a
direct connection to network 31.

RIP At Work (B)
[Figure: same triangle topology of Routers A, B, and C]
Router A routing table
NET       Hops    IF
1.0.0.0   direct  e0
12.0.0.0  direct  s0
31.0.0.0  direct  s1
2.0.0.0   1       s0
23.0.0.0  1       s0
Router B routing table
NET       Hops    IF
2.0.0.0   direct  e0
12.0.0.0  direct  s0
1.0.0.0   1       s0
23.0.0.0  direct  s1
31.0.0.0  1       s0
Router C routing table
NET       Hops    IF
3.0.0.0   direct  e0
31.0.0.0  direct  s0
1.0.0.0   1       s0
23.0.0.0  direct  s1
12.0.0.0  1       s0
2.0.0.0   1       s1
Routing update (DA=, SA=B)
NET       Hops
2.0.0.0   1
23.0.0.0  1
Routing update (DA=, SA=B)
NET       Hops
2.0.0.0   1
12.0.0.0  1

In the same way, router B tells router A only about networks 2 and 23, and
router C only about networks 2 and 12.

RIP At Work (C)
[Figure: same triangle topology as in RIP At Work (A).]
Router A routing table (NET, Hops, IF): 1.0.0.0 direct e0; 12.0.0.0 direct s0; 31.0.0.0 direct s1; 2.0.0.0 1 s0; 23.0.0.0 1 s0; 3.0.0.0 1 s1
Router B routing table (NET, Hops, IF): 2.0.0.0 direct e0; 12.0.0.0 direct s0; 1.0.0.0 1 s0; 23.0.0.0 direct s1; 31.0.0.0 1 s0; 3.0.0.0 1 s1
Router C routing table (NET, Hops, IF): 3.0.0.0 direct e0; 31.0.0.0 direct s0; 1.0.0.0 1 s0; 23.0.0.0 direct s1; 12.0.0.0 1 s0; 2.0.0.0 1 s1
Update from C to A (SA=C; NET, Hops): 3.0.0.0 1; 23.0.0.0 1
Update from C to B (SA=C; NET, Hops): 3.0.0.0 1; 31.0.0.0 1
Router C does the same. In the end, every router knows a route to every network.

Count To Infinity
Main problem with distance vector
protocols
Unforeseeable situations can lead to
count to infinity
Access lists
Disconnection and connections
Router malfunctions
....
During that time, routing loops occur!
Because of the simple principle of RIP (a distance vector protocol), we cannot
prevent count to infinity. Access lists, disconnections and connections, router
malfunctions, etc. can always lead to it; there is no 100% solution.
We need a more general approach to deal with that: a maximum hop count is
the only failsafe solution.

Count To Infinity (1)
[Figure: routers A, B, C, and D with networks 1.0.0.0, 2.0.0.0, 3.0.0.0, and 4.0.0.0 (router D, e0). Routing table entry for 2.0.0.0 after the failure (NET, Hops, IF): 2.0.0.0 ??? ?. Router D routing table (NET, Hops, IF): 4.0.0.0 direct e0; 2.0.0.0 2 s0.]
Let us look at another example where count to infinity occurs, although
Split Horizon is implemented!
We have a network with 4 routers; suddenly net 2 crashes.

Count To Infinity (2)
[Figure: same topology without network 2.0.0.0, plus a new link between router B (s2) and router D. Update from D (SA=D; NET, Hops): 2.0.0.0 3. Resulting routing table entry (NET, Hops, IF): 2.0.0.0 3 s2. Router D routing table: 4.0.0.0 direct e0; 2.0.0.0 2 s0.]
A new connection is established between router B and router D. Now a normal
routing update is sent from router D to router B (with information about net 2, of
course).

Count To Infinity (3)
[Figure: same topology. Updates from B (SA=B; NET, Hops): 2.0.0.0 4. Update from C (SA=C; NET, Hops): 2.0.0.0 5. Routing table entry learned over s2: 2.0.0.0 3 s2. Router D routing table: 4.0.0.0 direct e0; 2.0.0.0 5 s0.]
Router B doesn't know where network 2 has gone. So it sends information about
network 2 (increasing the hop count by 1) to every neighbor router.

Count To Infinity (4)
[Figure: same topology. Update from D (SA=D; NET, Hops): 2.0.0.0 6. Routing table entry learned over s2: 2.0.0.0 6 s2. Router D routing table: 4.0.0.0 direct e0; 2.0.0.0 5 s0.]
Count to infinity situations cannot be avoided in
all situations (drawback of the signpost principle)
Basic solution: Maximum Hop Count = 16
Count to infinity occurs. Only the maximum hop count, the basic solution, can
stop this problem.
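The effect of the maximum hop count can be sketched with a toy model. The function name `count_to_infinity` is hypothetical, and the model assumes that a stale route is simply re-learned with the hop count increased by 1 in every round until it reaches 16.

```python
INFINITY = 16  # RIP's maximum hop count, meaning "unreachable"

def count_to_infinity(start_hops):
    """Return the hop counts a stale route passes through until it hits 16.

    Toy model (an assumption, not a full RIP implementation): in each
    round the dead route is re-learned with the hop count increased by 1.
    """
    history = [start_hops]
    hops = start_hops
    while hops < INFINITY:
        hops += 1
        history.append(hops)
    return history

# A stale route that starts at 3 hops, as in the example slides.
steps = count_to_infinity(3)
print(steps)
```

The loop always terminates because the metric is capped at 16; without that cap the counter would grow without bound.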

Maximum Hop Count = 16
[Figure: routers A, B, C, and D in numbered steps 1 to 4. A routing table marks 2.0.0.0 with hop count 16 and no interface; updates from B (SA=B; NET, Hops) carry 2.0.0.0 16; the receiving routers enter 2.0.0.0 16 with no interface in their routing tables. Router D additionally holds 4.0.0.0 direct e0.]
Upon network failure, the route is marked as INVALID (hop count 16) and propagated.
With a hop count of 16, net 2 is now marked as invalid.
Of course, this unreachability information would be propagated deeper into the
network if there were additional routers.

Maximum Hop Count
Defining a maximum hop count of 16
provides a basic safety factor
But restricts the maximum network
diameter
Routing loops might still exist during
480 seconds (16 x 30 s)
Therefore several other measures are
necessary
The maximum hop count is a basic safety factor, but it is also the main drawback
of RIP. It restricts the maximum network diameter, and routing loops can exist for
up to 480 seconds. During count to infinity, routing is bad and the network must
deal with unnecessary traffic. So we need other measures like Poison Reverse.

Additional Measures
Split Horizon
Suppressing information that the other
side should know better
Used during normal operation but
cannot prevent routing loops!!!
Split Horizon with Poison Reverse
Declare learned routes as unreachable
"Bad news is better than no news at all"
Stops potential loops due to corrupted
routing updates

Split Horizon With Poison Reverse
[Figure: router A (LAN 1.0.0.0 on e0) and router B (LAN 2.0.0.0 on e0) connected via network 12.0.0.0 (s0 on both routers).]
Router A routing table (NET, Hops, IF): 1.0.0.0 direct e0; 12.0.0.0 direct s0; 2.0.0.0 1 s0
Router B routing table (NET, Hops, IF): 2.0.0.0 direct e0; 12.0.0.0 direct s0; 1.0.0.0 1 s0
Update from A (SA=A; NET, Hops): 1.0.0.0 1; 2.0.0.0 16; 12.0.0.0 1
Update from B (SA=B; NET, Hops): 2.0.0.0 1; 1.0.0.0 16; 12.0.0.0 1
Note: poison reverse overrides split horizon when a network is lost
Split horizon with poisoned reverse also includes reverse routes in updates, but
sets their metrics to infinity. This is safer than simple split horizon: if two
gateways have routes pointing at each other, advertising reverse routes with a
metric of 16 will break the loop immediately.
Note: Split Horizon with Poison Reverse is not used by Cisco routers (however,
poison updates are indeed used when, e.g., an interface goes down).
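The difference between simple split horizon and split horizon with poison reverse can be sketched as a function that builds the update for one interface. This is a minimal sketch under assumptions: the name `build_update`, the table layout (net mapped to hops and interface, with 0 hops standing for "direct"), and the treatment of directly connected networks are illustrative choices modeled on the slides, not a vendor implementation.

```python
INFINITY = 16  # metric meaning "unreachable"

def build_update(table, out_if, poison_reverse=False):
    """Build the (net, hops) list advertised on interface out_if.

    table maps net -> (hops, interface); hops == 0 stands for "direct".
    """
    update = []
    for net, (hops, iface) in table.items():
        if iface == out_if:
            if not poison_reverse:
                continue  # simple split horizon: stay silent
            if hops > 0:
                # Learned over this interface: advertise as unreachable.
                update.append((net, INFINITY))
                continue
        update.append((net, min(hops + 1, INFINITY)))
    return update

# Router A's table from the poison reverse slide.
table_a = {"1.0.0.0": (0, "e0"), "12.0.0.0": (0, "s0"), "2.0.0.0": (1, "s0")}
print(build_update(table_a, "s0"))
print(build_update(table_a, "s0", poison_reverse=True))
```

With poison reverse enabled, the route to 2.0.0.0 is sent back to router B with metric 16 instead of being left out.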

Additional Measures
Remember: good news overwrites bad
news
Unreachable information could be overwritten
by uninformed routers
(which are beyond the scope of split horizon)
Hold Down
Guarantees propagation of bad news
throughout the network
Routers in hold down state ignore good news
for 180 seconds
RIP needs a long time to send bad news across the whole network (remember the 480
seconds). To guarantee that the bad news is sent throughout the network, the hold
down measure is implemented. After a router receives "bad news", it will ignore
all "good news" about the same route for 180 seconds.
Note: Hold-down timers are not explicitly required by RFC 1058. However, most
vendors (including Cisco) have implemented them.
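The hold-down rule can be sketched as follows. The function `process_update` and the dictionary layout of a route entry are hypothetical, and the sketch deliberately ignores who sent the update; it only shows that good news is rejected while the hold-down timer runs.

```python
INFINITY = 16
HOLDDOWN = 180  # seconds, as stated on the slide

def process_update(entry, new_hops, now):
    """Apply one received metric to a route entry; return True if accepted.

    entry is a dict with the keys 'hops' and 'holddown_until'
    (an illustrative layout, not taken from any real implementation).
    """
    if new_hops >= INFINITY:
        # Bad news: mark the route invalid and start hold-down.
        if entry["hops"] < INFINITY:
            entry["hops"] = INFINITY
            entry["holddown_until"] = now + HOLDDOWN
        return True
    if now < entry["holddown_until"]:
        return False  # good news is ignored during hold-down
    if new_hops < entry["hops"]:
        entry["hops"] = new_hops
        return True
    return False

route = {"hops": 3, "holddown_until": 0}
print(process_update(route, 16, now=0))    # bad news
print(process_update(route, 4, now=100))   # good news during hold-down
print(process_update(route, 4, now=180))   # good news after hold-down
```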

Hold Down (1)
[Figure: routers A, B, C, D, and E with networks 3.0.0.0 and 4.0.0.0 (router D, e0), in numbered steps 1 to 5. Updates carrying 4.0.0.0 16 are sent (first SA=D, later SA=B), and the receiving routers enter 4.0.0.0 16 with no interface in their routing tables. One routing table still shows 4.0.0.0 3 s0.]
Router C receives unreachable message (4.0.0.0, 16) from router D
Router C declares 4.0.0.0 as invalid (16) and enters hold-down state
In this example we see how Hold Down works. After net 4 crashes, router
D sends this information to router C. Router C records this information and
activates "hold down". After this, it sends this information to its neighbor routers,
which do the same after they receive the information about net 4.

Hold Down (2)
[Figure: same topology. Several routing tables show 4.0.0.0 16 with no interface; one still shows 4.0.0.0 3 s0. An update from router E (SA=E; NET, Hops) carries 4.0.0.0 4. A router comments: "I'll ignore that, I'm in Hold Down".]
Information about network 4.0.0.0 with better metric is ignored for 180
seconds
Router E has not yet received the information that net 4 crashed, so it sends its
normal routing update. But the information from router E cannot overwrite the routing
information of router B or router A, because these routers are in the "hold down"
state and ignore these update messages.

Hold Down (3)
[Figure: same topology. An update from router A (SA=A; NET, Hops) carries 4.0.0.0 16; all routing tables now show 4.0.0.0 with hop count 16.]
Time enough to propagate the unreachability of network 4.0.0.0
Soon every router knows that network 4 is unreachable.

Triggered Update
To reduce convergence time, routing
updates are sent immediately upon
events (changes)
On receiving a different routing
update a router should also send
an update immediately
Called triggered update
To speed up convergence, the "triggered update" has been introduced.
After a router notices a network failure, it immediately sends a routing update to
indicate this failure. So the router does not wait for the 30 seconds to expire.
Triggered updates can be used with all events (e.g. a new link is established).

RIP Timers Summary
UPDATE (30 seconds)
Period to send routing update
INVALID (180 seconds)
Aging time before declaring a route invalid
("16") in the routing table
HOLDDOWN (180 seconds)
After a route has been invalidated, how long a
router will wait before accepting an update
with better metric
FLUSH (240 seconds)
Time before a non-refreshed routing table
entry is removed
The FLUSH timer is also known as the "Garbage Collection Timer", and RFC 1058
suggests an additional 120 seconds after the INVALID timer expires.
HOLDDOWN timers are not explicitly required by RFC 1058; however, they are
supported by most implementations today, e.g. by Cisco IOS. Note that the
FLUSH timer expires before the HOLDDOWN timer.

RIP Messages
Request (command = 1)
Ask neighbor to send response containing
all or part of the routing table
Typically used at startup only
Response (command = 2)
THE Routing Update
Typically sent every 30 seconds without
explicit request
Note that if a request is for specific entries (i.e. not for the whole table), the
requested information is returned in any case; that is, no split horizon is performed
and even subnets are returned if requested. If there is exactly one entry in the
request, with an address family identifier of zero and a metric of infinity (16), this
is a request to send the entire routing table.
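The special "send me everything" request described above can be sketched with Python's `struct` module. The function name `whole_table_request` is hypothetical; the byte layout (4-byte header, 20-byte entry) follows the RIPv1 message format of RFC 1058.

```python
import struct

def whole_table_request(version=1):
    """Build a RIP request asking for the entire routing table."""
    # Header: command=1 (request), version, two zero bytes.
    header = struct.pack("!BBH", 1, version, 0)
    # Single entry: address family 0, all address fields zero, metric 16.
    entry = struct.pack("!HHIIII", 0, 0, 0, 0, 0, 16)
    return header + entry

message = whole_table_request()
print(len(message), message.hex())
```

Such a message would be sent to UDP port 520; the sending itself is left out of this sketch.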

Details
RIP message is sent within UDP payload
UDP Port 520, both source and destination port
Maximum message size is 512 bytes
L2 Broadcast + IP Broadcast
Because we do not know neighbor router
addresses
On shared media one update is sufficient
Version = 1
Address family for IP is 2
If RIP messages are generated from any port other than 520, even "silent"
processes must respond. This is an old RFC requirement; don't expect everything
to work that way...

Timer Synchronization
In case of many routers on a single
network
Processing load might affect update timer
Routers might get synchronized
Collisions occur more often
Therefore either use
External timer
Or add a small random time to the update
timer
(30 seconds + RIP_JITTER = 25...35 seconds)
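The jitter rule can be sketched in a few lines. The function `next_update_delay` and its parameters are hypothetical names; the sketch assumes a uniformly distributed offset of up to 5 seconds in either direction, which yields the 25 to 35 second range given on the slide.

```python
import random

def next_update_delay(base=30.0, jitter=5.0):
    """Return the delay in seconds until the next periodic update."""
    # A random offset keeps routers on the same network from synchronizing.
    return base + random.uniform(-jitter, jitter)

delays = [next_update_delay() for _ in range(5)]
print(delays)
```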

RIP Disadvantages
Big routing traffic overhead
Contains nearly entire routing table
WAN links (!)
Slow convergence
Small network diameter
No discontiguous subnetting
Only equal-cost load balancing
supported
(if you are lucky)
RIP is an old protocol and is only used in small networks.

Summary
First important distance vector
implementation (not only for IP)
Main problem: Count to infinity
Maximum Hop Count
Split Horizon
Poison Reverse
Hold Down
Classful, Slow, Simple


Quiz
How could slower gateways/links be
considered for route calculation?
Wouldn't TCP be more reliable than
UDP?
Does maximum hop-count mean that
I can only have 15 net-IDs?

1
2005/03/11 {C} Herbert Haas
RIP Version 2
The Classless Brother

Why RIPv2
Need for subnet information and VLSM
Need for Next Hop addresses for each
route entry
Need for external route tags
Need for multicast route updates
RFC 2453
Because subnetting and VLSM became more important, RIPv2 was created. RIPv2
was introduced in RFC 1388, "RIP Version 2 Carrying Additional Information",
January 1993. This RFC was obsoleted in 1994 by RFC 1723, and finally RFC
2453 is the final document about RIPv2.
Compared with RIPv1, the new RIPv2 also supports several new features such
as routing domains, route advertisements via EGP protocols, and authentication.

Multicast Updates
RIPv1 used DA=broadcast
Seen by each IP host
Slows down other IP stations
RIPv2 uses DA=224.0.0.9
Only RIPv2 routers will receive it
RIPv2 uses the IP address 224.0.0.9 to transfer its routing updates. The
advantage is that only RIPv2 routers see these messages, so they do not slow down
the other stations (as RIPv1 does with broadcast addresses).
RIPv2 is also an alternative choice to OSPF.

Message Format
Command | Version | Unused or Routing Domain
Address Family Identifier | Route Tag
IP Address
Subnet Mask
Next Hop
Metric
. . . (further route entries with the same fields)
Up to 25 route entries
RIPv2 utilizes the unused fields of the RIPv1 message format. New fields are the
"route tag", "subnet mask", and "next hop".
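The message format can be sketched by packing a header and route entries with `struct`. The helper names `pack_rip2_entry` and `pack_rip2_response` are hypothetical; the field order follows the format shown above, with 2-byte address family and route tag fields and 4-byte fields for the rest, as in RFC 2453.

```python
import socket
import struct

def pack_rip2_entry(net, mask, next_hop="0.0.0.0", metric=1, route_tag=0):
    """Pack one 20-byte RIPv2 route entry (address family 2 = IP)."""
    return struct.pack(
        "!HH4s4s4sI",
        2, route_tag,
        socket.inet_aton(net),
        socket.inet_aton(mask),
        socket.inet_aton(next_hop),
        metric,
    )

def pack_rip2_response(entries):
    """Pack a response (command=2, version=2) with up to 25 entries."""
    if len(entries) > 25:
        raise ValueError("a RIP message carries at most 25 route entries")
    return struct.pack("!BBH", 2, 2, 0) + b"".join(entries)

entry = pack_rip2_entry("22.22.22.0", "255.255.255.0", "10.0.0.5", 1, 65502)
print(len(entry), len(pack_rip2_response([entry])))
```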

Version and Routing Domain
RIPv1 used version "1"
RIPv2 uses version "2" (*surprise*)
According to the RFC the next two bytes
are unused
However, some implementations
carry the routing domain here
Simply a process number
The routing domain indicates the routing process for which the routing update is
destined. Routers can thus support several domains within the same subnet.

Subnet Mask
RIPv2 is a classless routing protocol
For each route a subnet mask is
carried
Discontiguous subnetting and VLSM
are supported
Remember that RIPv1 is a classful routing protocol, because RIPv1 does not bind
subnet masks to the routes. So RIPv1 assumes classful addressing.
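What "assumes classful addressing" means can be sketched as the default mask a classful protocol derives from the first octet. The function name `classful_mask` is hypothetical.

```python
def classful_mask(address):
    """Return the default (classful) mask implied by the first octet."""
    first = int(address.split(".")[0])
    if first < 128:
        return "255.0.0.0"       # class A
    if first < 192:
        return "255.255.0.0"     # class B
    if first < 224:
        return "255.255.255.0"   # class C
    raise ValueError("class D/E addresses have no default mask")

print(classful_mask("10.1.2.0"), classful_mask("172.16.5.0"))
```

RIPv2 does not need this guess, because every route entry carries its own subnet mask.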

Next Hop
Identifies a better next hop address
than implicitly given (SA)
Only if one exists (better metric)
0.0.0.0 if the sender is next hop
Especially useful on broadcast multi-access
networks for peering
Indirect routing on a broadcast segment
would be ...silly.
With the "next hop" field, a router announces which networks can be reached via other
routers.
Note that the next-hop router must be located in the same subnet as the sender of
the routing update.

Route Tag
To distinguish between internal
routes (learned via RIP) and external
routes (learned from other protocols)
Typically AS number is used
Not used by RIPv2 process
External routing protocols may use the
route tag to exchange information
across a RIP domain
The route tag contains the autonomous system number for EGP and BGP. When a
router receives a routing update with a route tag not equal to zero, the associated
path must be distributed to other routers. In that way, interior routers notice the
existence of exterior networks (tagging exterior routes).
For example, if routes were redistributed from EGP into RIPv2, these routes can
be tagged.

Next Hop and Route Tag
[Figure: a shared LAN with routers 10.0.0.1/24 to 10.0.0.6/24. AS 65501 runs RIPv2, AS 65502 runs BGP, and one router runs BGP + RIPv2. External networks: 22.22.22.0/24 and 77.77.77.0/24.]
RIPv2 update (command 2, version 2), route entries (address family, route tag, IP address, subnet mask, next hop, metric):
2, 65502, 22.22.22.0, 255.255.255.0, 10.0.0.5, 1
2, 65502, 77.77.77.0, 255.255.255.0, 10.0.0.6, 3
In the picture above there are two different autonomous systems on the same
LAN. The routers in the first AS use RIPv2; the second AS uses BGP. Each entry is
assigned an AS number (65501/65502). The left AS could apply policies to these
special (external) routes or redistribute them with BGP to some other ASs. Note
that only 10.0.0.4 speaks RIPv2, so for efficiency only this one advertises the
external routes (22.22.22.0/77.77.77.0), but it indicates the true next hops. This
is an important special rule on shared media (true next hops must be
indicated)!

Authentication
Hackers might send invalid routing updates
RIPv2 introduces password protection as
authentication
Initially only Authentication Type 2 defined
16 plaintext characters (!)
RFC 2082 proposes keyed MD5 authentication
(Type 3)
Multiple keys can be defined, updates contain a key-id
and an unsigned 32-bit sequence number to prevent
replay attacks
Cisco IOS supports MD5 authentication (Type 3,
128 bit hash)
Routing updates received without valid authentication are ignored by
the receiving router, because only trusted routers are accepted.
When using MD5 authentication, the first but also the last routing entry space is
used for authentication purposes. The MD5 hash is calculated using the routing
update plus a password. Thus, authentication and message integrity are provided.
The "Authentication Type" is Keyed Message Digest Algorithm, indicated by the
value 3 (1 and 2 indicate "IP Route" and "Password", respectively).

Authentication
Command | Version | Unused or Routing Domain
0xFFFF | Authentication Type
Password (four rows, 16 octets in total)
Address Family Identifier | Route Tag
IP Address
Subnet Mask
Next Hop
Metric
. . .
Up to 24 route entries
The picture above shows a RIPv2 message which contains an authentication entry.
The password is plain text only. If the password is shorter than 16 octets, it must be
left-justified and padded to the right with nulls.
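The padding rule for the plaintext password can be sketched as follows. The function name `pad_password` is hypothetical, and the sketch assumes an ASCII password.

```python
def pad_password(password):
    """Left-justify a plaintext password in a 16-octet field, null padded."""
    raw = password.encode("ascii")
    if len(raw) > 16:
        raise ValueError("password must not exceed 16 octets")
    return raw.ljust(16, b"\x00")

field = pad_password("secret")
print(len(field), field)
```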

Key Chain
Cisco's implementation offers key
chains
Multiple keys (MD5 or plaintext)
Each key is assigned a lifetime
(date, time and duration)
Can be used for migration
Key management should rely on
Network Time Protocol (NTP)
Several independent routing domains may run RIPv2 with different process
numbers ("routing domain"). Using key chains, these domains can work
together (synchronize) at a specific time or date.

RIPv1 Inheritance (1)
All timers are the same
UPDATE
INVALID
HOLDDOWN
FLUSH
Same convergence protections
Split Horizon
Poison Reverse
Hold Down
Maximum Hop Count (also 16 !!!)
RIPv1 uses several timers to regulate its performance. These timers are the same in
RIPv2. The routing update timer is set to 30 seconds, with a small random
amount of time added whenever the timer is reset. A route is declared invalid
if it has not been refreshed by routing updates for 180 seconds. The "holddown"
state lasts 180 seconds; during this time a router ignores update messages about the
affected network. After 240 seconds (flush timer) a non-refreshed routing table
entry is removed.
RIPv2 also uses the same convergence protections such as Split Horizon, Hold
Down, etc. Note that the maximum hop count is still 16 for backwards
compatibility.

RIPv1 Inheritance (2)
Same UDP port 520
Also maximum 25 routes per update
Equally 512 byte payloads
RIPv2 also inherits the bad consequences of these small routing updates.
What happens if we want to advertise MANY routes with many single updates?
There will be a big overhead (IP, UDP, and RIP headers).
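The overhead can be estimated with a little arithmetic. The function `update_overhead` is hypothetical; the sketch assumes a 20-byte IP header without options, an 8-byte UDP header, and a 4-byte RIP header per message.

```python
import math

IP_HEADER = 20    # bytes, assuming no IP options
UDP_HEADER = 8    # bytes
RIP_HEADER = 4    # bytes
MAX_ENTRIES = 25  # route entries per RIP message

def update_overhead(routes):
    """Return (messages, header_bytes) needed to advertise the routes."""
    messages = math.ceil(routes / MAX_ENTRIES)
    header_bytes = messages * (IP_HEADER + UDP_HEADER + RIP_HEADER)
    return messages, header_bytes

print(update_overhead(1000))
```

Under these assumptions, 1000 routes need 40 messages and 1280 bytes of headers in every update cycle.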

RIPv1 Compatibility
RIPv1 Compatibility Mode
RIPv2 router uses broadcast addresses
RIPv1 routers will ignore header extensions
RIPv2 performs route summarization on
address class boundaries
Disable: (config-router)# no auto-summary
RIPv1 Mode
RIPv2 sends RIPv1 messages
RIPv2 Mode
Send genuine RIPv2 messages
RIPv2 is totally backwards compatible with existing RIP implementations.
There is also a compatibility switch, which allows choosing between three
different settings:
1. RIP-1 mode. Only RIP-1 packets are sent.
2. RIP-1 compatibility mode. RIP-2 packets are broadcast.
3. RIP-2 mode. RIP-2 packets are multicast.
The recommended default for this switch is RIP-1 compatibility.

Summary
Most important: RIPv2 is classless
Subnet masks are carried for each route
Multicasts and next hop field
increase performance
But still not powerful enough for
large networks

Quiz
What is a routing domain?
Why is "infinity" still 16?

1
2005/03/11 {C} Herbert Haas
OSPF - Introduction
The IETF Routing Master
Part 1

"Dijkstra probably hates me"
Linus Torvalds in kernel/sched.c

"Open Shortest Path First"
Official (IETF) successor of RIP
RIP is slow
RIP is unreliable
RIP produces too much routing traffic
RIP only allows 15 hop routes
OSPF is a link-state routing protocol
Inherently fast convergence
Designed for large networks
Designed to be reliable
Yes, RIP is bad Voodoo...
OSPF was developed by the IETF to replace RIP. In general, link-state routing
protocols have some advantages over distance vector protocols, such as faster
convergence and support for larger networks.
Some other features of OSPF include the usage of areas, which makes
hierarchical network topologies possible, and classless behavior, so there is no
such problem with discontiguous subnets as in RIP. OSPF also supports VLSM
and authentication.

OSPF Background
OSPF is the IGP recommended by the IETF
"Open" means "not proprietary"
Dijkstra's Shortest Path First algorithm is
used to find the best path
OSPF's father: John Moy
Version 1: RFC 1131
Version 2: RFC 2328 (244 pages !!!)
And a lot of additional OSPF related RFCs
available...
The Internet Engineering Task Force (IETF) strongly recommends using OSPF for
interior gateway routing (i.e. within an AS) instead of RIP or other protocols.
Integrated IS-IS is an alternative routing protocol but is not explicitly recommended
by the IETF. Note that IS-IS has been standardized by the ISO world.
Both (Integrated) IS-IS and OSPF use Dijkstra's famous Shortest Path First (SPF)
algorithm to determine all best paths for a given topology.
OSPF version 2 has been specified in RFC 2328. Note that there are lots of
additional RFCs around OSPF. Use http://www.rfc-editor.org/rfcsearch.html to
find them all.

Dijkstra's SPF Algorithm
Used in graph theory
Very efficient
Calculates all paths
to all destinations at
once
Creates a (loop-free)
tree with local router
as source
See SPF section for
more details
Edsger W. Dijkstra
(1930-2002)
Dijkstra's SPF algorithm is generally used in graph theory and was not
invented especially for IP routing. The most interesting point about the SPF
algorithm is its efficiency. SPF is capable of calculating all paths to all destinations
at once. The result of the SPF algorithm is a loop-free tree with the local router
as source.

OSPF Ideas
Metric: "Cost" = 10^8 / BW (in bit/s)
Therefore easily configurable per interface
OSPF routers exchange real topology
information
Stored in dedicated topology databases
Now routers have a "roadmap"
Instead of signposts (RIP)
Incremental updates
NO updates when there is NO topology change
In the Cisco IOS implementation starting with 11.2, the cost is calculated
automatically by the simple formula 100,000,000/BW.
Here the bandwidth parameter of a router's interface is used, thus it is
especially important to configure it on the serial interfaces.
In other OSPF implementations the cost must be configured manually for each of the
interfaces.
OSPF (and other link state protocols) exchange true topology information,
which is stored in a dedicated database by each router. This database acts like a
"roadmap" and allows a router to determine all best routes.
Note that once OSPF has the topology database, there is no need to exchange
further routing traffic unless the topology changes. In this case only
incremental updates are made.
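The cost formula can be sketched as a one-line function. The name `ospf_cost` and the `reference_bps` parameter are hypothetical; the sketch assumes integer truncation and a minimum cost of 1, so every link at or above the reference bandwidth gets the same cost.

```python
def ospf_cost(bandwidth_bps, reference_bps=10**8):
    """Return the interface cost 10^8 / BW, truncated, but at least 1."""
    return max(1, reference_bps // bandwidth_bps)

for bw in (64_000, 10_000_000, 100_000_000, 1_000_000_000):
    print(bw, ospf_cost(bw))
```

The last two values illustrate why the formula cannot distinguish between links faster than 100 Mbit/s unless the reference value is changed.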

What is Topology Information?
The smallest topological unit is
simply the information element
ROUTER-LINK-ROUTER
So the question is: Which router is
linked to which other routers?
[Figure: graph of routers R1 to R5 with the links listed below.]
Link Database:
R1 - R2
R1 - R5
R2 - R3
R2 - R4
R4 - R5
The Link Database
exactly describes
the roadmap
Obviously the dots are routers and the links between the routers are actually
networks. The basic idea of OSPF and the topology table is that simple.
OSPF is actually much more complicated. There are 5 types of networks defined in OSPF:
point-to-point networks, broadcast networks, non-broadcast multi-access networks,
point-to-multipoint networks, and virtual links. Furthermore, it is reasonable to divide the
topology into multiple "areas" to increase performance ("divide and conquer"). These are the
reasons why OSPF is a rather complex protocol. This is explained later.
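How a router turns such a link database into best paths can be sketched with Dijkstra's algorithm. The function `spf` is a hypothetical, simplified sketch: it uses the five links from the slide and assumes a cost of 1 per link, since the slide gives no costs.

```python
import heapq

# Link database from the slide; each link is assumed to cost 1.
LINKS = [("R1", "R2"), ("R1", "R5"), ("R2", "R3"), ("R2", "R4"), ("R4", "R5")]

def spf(links, source, cost=1):
    """Return the shortest distance from source to every reachable router."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # outdated heap entry
        for neighbor in graph[node]:
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

print(spf(LINKS, "R1"))
```

A real OSPF router would also record the first hop of each path in order to fill its routing table.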

OSPF Routing Updates
The routing updates are actually
link state updates
Parts of link state database are
exchanged
Instead of parts of routing table (RIP)
Applying the SPF algorithm on the
link state database, each router can
create routing table entries on its
own
Link state information is exchanged as Link State Advertisements (LSAs), which
are carried in Link State Update (LSU) packets. There are several types of LSAs,
depending on what kind of information is sent and which router originated it.

OSPF Protocol
All OSPF messages are carried
within the IP payload ("raw IP")
Protocol number 89
Error recovery and session
management is covered by OSPF
itself
Multicast address 224.0.0.5
"All OSPF routers"
LSUs are encapsulated directly in IP packets, unlike RIP, where we have an
additional UDP overhead. IP is not reliable by itself, but OSPF updates are
transmitted reliably using Link State Acknowledgements (LSAck). There are 2
multicast addresses which are reserved for OSPF: 224.0.0.5 for all OSPF
routers and 224.0.0.6 for designated and backup designated OSPF routers.

LSA Flooding
LSAs are small packets, forwarded by
each router without much modification
through the whole OSPF area (!)
Much faster than RIP updates
RIP must receive, examine, create, and send
Convergence time
Detection time + LSA flooding + 5 seconds
before computing the topology table = "a few
seconds"
When a router gets new information in its link state database, it should send
this information to all adjacent routers (flood). The packets are small; only the
changes are sent and not the whole database. All other routers do the same:
receive new information, update the link state database, flood changes to others.

OSPF Overview
Large networks: "Divide and conquer" into areas
LSA procedures inside each area
But distance-vector updates between areas
Additional complexity because of performance
optimizations
Limit number of adjacencies in a multi-access network
Limit scope of flooding through "Areas"
Deal with stub areas efficiently
Learn external routes efficiently
Realized through different LSA types
Fast convergence, almost no routing traffic in
absence of topology changes
Performance is very important with OSPF. Running the SPF algorithm requires CPU
resources, and storing a link state database requires additional memory; compared to RIP
we need much more router resources. Some additional improvements were
made to OSPF in order to improve performance. Areas were introduced to limit
the flooding of LSAs, and stub areas to minimize the link state database and routing
tables.
Several types of LSAs were implemented:
Type 1 Router LSA
Type 2 Network LSA
Type 3 Network Summary LSA
Type 4 ASBR Summary LSA
Type 5 AS External LSA
Type 6 Group Membership LSA
Type 7 NSSA External LSA
and others

1
2005/03/11 {C} Herbert Haas
OSPF - Link State Establishment
The IETF Routing Master
Part 2

Basic Principle (1)
Consider two routers, happily
integrated in their own networks...
The routers on the slide have 2 stable networks; there are no periodic link state
updates, just hello messages.

Basic Principle (2)
Suddenly, some brave administrator connects
them via a serial cable...
Both interfaces are still in the "Down state"
Routers: "What do we have here...?"
Administrator: "Let's make a link there!"
After the link is connected, the routers detect a new network (OSPF is configured
on the interfaces and the interfaces are enabled).

Basic Principle (3)
Init state:
Friendly as routers are, they welcome each
other using the "Hello protocol".
OSPF routers send Hello packets out of all OSPF-enabled interfaces to the multicast
address 224.0.0.5. Then the router waits for a reply (another hello from the other
side), which must arrive within 4 x hello interval; otherwise the router falls back
to the down state again. That is, the init state lasts only up to 4 times the hello
interval.

Basic Principle (4)
Two-way state:
Each Hello packet contains a list of all neighbors (IDs)
Even the two routers themselves are now listed (=> 2-way state
condition)
Both routers are going to establish the new link in their database...
If two routers share a common link and agree on certain parameters in
their respective Hello packets, they will become neighbors.

Basic Principle (5)
Exstart state:
Determination of master (highest IP address) and slave
Needed for loading state later
Exchange state:
Both routers start to offer a short version of their own roadmap, using "Database
Description Packets" (DDPs)
DDPs contain partial LSAs, which summarize the links of every router in the
neighbor's topology table.
Note:
Networks are called "links".
DDPs contain links and associated router-IDs of
the originators of the corresponding LSAs.
After neighborship is established, the routers enter the "exstart state" and
determine which of them is master and which is slave. This will be needed later, as
the master will begin to send LS-Request packets. The rule is simple: the router
with the highest IP address (of the two involved interfaces on that link) is master.
Then, both routers enter the exchange state and exchange database description
packets (DDPs), which contain partial LSAs and therefore can be regarded as a
summary of their topology database.
Note: typically a series of DDPs is sent from each side. Each advertised link is
identified by an OSPF router ID, which represents the originator of that
information.
Both routers send out a series of database description packets containing the
networks held in the topology database. These networks are referred to as "links".
Most of the information about the links has been received from other routers (via
LSAs). The router ID refers to the source of the link information.
Each link will have an interface ID for the outgoing interface, a link ID, and a
metric to state the value of the path. The database description packet will not
contain all the necessary information, but just a summary (enough for the
receiving router to determine whether more information is required or whether it
already contains that entry in its database).
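How a router decides from the summaries which links to request can be sketched as a dictionary comparison. This is a strong simplification under assumptions: the function `lsas_to_request` is hypothetical, and each LSA is reduced to an identifier mapped to a sequence number, where a higher number means newer information.

```python
def lsas_to_request(own_db, neighbor_summary):
    """Return the LSA ids that are missing locally or older than the neighbor's.

    Both arguments map an LSA id to a sequence number (illustrative layout).
    """
    return sorted(
        lsa_id
        for lsa_id, seq in neighbor_summary.items()
        if lsa_id not in own_db or own_db[lsa_id] < seq
    )

own = {"R1": 5, "R2": 3}
summary = {"R1": 5, "R2": 4, "R3": 1}
print(lsas_to_request(own, summary))
```

The resulting list corresponds to what the router would ask for in the loading state.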

Basic Principle (6)
Loading State:
One router (here the right one) recognizes some
missing links and asks for detailed information using a
"Link State Request" (LSR) packet...
The receiver checks its database, sees that the information is new, and requests
detailed information with a Link State Request (LSR) packet.

8
8 {C} Herbert Haas 2005/03/11
Basic PrincipIe (7)
The Ieft router repIies immediateIy with the
requested Iink information, using a
"Link State Update" (LSU) packet ...
LS Update
As a reply the leIt router sends a Link State Update packet LSU which contains
detailed inIormation about requested links.

9
9 {C} Herbert Haas 2005/03/11
Basic PrincipIe (8)
The right router is very thankfuI, and
returns a "Link State AcknowIedgement"...
LS Ack
Link State Acknowledgement LSAck is used to make sure that the inIormation is
recieved.

10
10 {C} Herbert Haas 2005/03/11
Basic PrincipIe (9)
Then the Ieft router recognizes some
unknown Iinks and asks for further
detaiIs...
LS Request
LSR is sent in the other direction asking Ior detailed inIormation.

11
11 {C} Herbert Haas 2005/03/11
Basic PrincipIe (10)
The right router sends detaiIed
information for the requested unknown
Iinks...
LS Update
Then a LSU is sent back.

12
12 {C} Herbert Haas 2005/03/11
Basic PrincipIe (11)
The Ieft router repIies with a Iink state acknowIedgement -
a new adjacency has been estabIished...
Neighbors are "fuIIy adjacent" and reached the "fuII state"
LS Ack
LSAck saying thanks Ior inIo.

Basic Principle (12)
Both routers tell all other routers about all local
adjacencies by flooding link state advertisements
(LSAs)
Both routers now see their own IDs listed in the
periodically sent Hello packets
[Figure: LSAs are flooded from both routers to all other routers]
These are so-called
"Router LSAs".
Other LSA types will
be explained soon...
Both routers now have new information in their databases. This information
is flooded to all other adjacent routers as a Router LSA (LSA type 1), in which the
router sends information about its own links.

Database Inconsistency
When connecting two networks, LSA flooding
only distributes information of the local links
of the involved neighbors (!)
This can happen when you connect two existing networks: some routers may
miss new information.

Solutions
Every router sends its LSAs every 30
minutes (!)
Long inconsistency times
Optionally flash updates configured
Upon receiving an LSA a router not only
forwards this LSA but also immediately
sends its own LSAs
Cisco default (can be turned off)
According to the RFC, each router sends a so-called refresh LSA every 30 minutes,
which eventually solves the problem.

Finally: Convergence!
When LSAs are flooded, OSPF is
quiet (at least for 30 minutes)
Only Hellos are sent out on every
interface to check adjacencies
Topology changes are quickly detected
Default Hello interval: 10 seconds (LAN,
60 sec WAN)
Hellos are terminated by neighbors
After flooding, the routers recalculate their routing tables using the SPF
algorithm. There are no periodic updates as in RIP. Only Hello packets are sent,
every 10 seconds by default. If a router does not get a Hello from a neighbor for
40 seconds, it declares the neighbor dead. This is the dead interval, which is 4
times the hello interval by default.

2005/03/11 (C) Herbert Haas
OSPF - Multiaccess Networks
The IETF Routing Master
Part 3

Broadcast Multi-Access Media (1)
When several OSPF routers have access
to the same Ethernet segment they would
create n(n-1)/2 adjacencies
Furthermore, the SPF algorithm requires
representing a fully meshed network as a tree
Consider the flooding process after the establishment of each adjacency! The
formation of an adjacency between every pair of attached routers would create a lot of
unnecessary LSAs. A router would flood an LSA to all its adjacent neighbors,
creating many copies of the same LSA on the same network.

Broadcast Multi-Access Media (2)
Solution: Elect one "Designated Router" (DR) to represent the
whole LAN segment
Election uses the Hello protocol
DR sends Network LSA
List of all local routers
Ensures that every router on the link has the same topology
database
Also contains subnet mask (!)
Each other router establishes an adjacency only to the DR
Using "All DR" multicast address 224.0.0.6
DR
To prevent the problems described on the previous slide, a Designated Router
(DR) is elected on a multi-access network. The DR is responsible for representing
the multi-access network and all the routers on it to the rest of the network, and
for managing the flooding process on the multi-access network. The network itself
becomes a "pseudonode" in the graph. The pseudonode is represented by the DR.
All other routers peer with the DR, which informs them of any changes on the
segment.
For LAN segments, the Router LSA does NOT contain the subnet mask.
The subnet mask for the LAN segment is carried inside the Network LSA instead.

Broadcast Multi-Access Media (3)
Only the DR will send LSAs to the rest of the
network
For backup purposes also a Backup DR is
elected (BDR)
All routers also establish adjacencies to the BDR
BDR itself also establishes adjacency to DR
DR BDR
Each multi-access interface has a "Router Priority" ranging from 0 to 255 (default
1). Routers with a priority of 0 cannot become DR or BDR. The election process
is performed with Hello packets, which carry the priority. If several routers have the
same priority, the one with the numerically highest Router ID wins. If the DR fails
(its Hellos stay out), the BDR becomes active immediately and a new election for the
BDR is started.
Note: After the election of DR and BDR, adding a new router with a higher priority
will not replace them. The first two routers immediately become DR and BDR.
The only way to control the election is to set the priority of all other routers
("DROTHER") to zero, so they cannot become DR or BDR.
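To see how much the DR/BDR scheme saves, the adjacency counts can be compared directly. The following sketch assumes a segment with one DR and one BDR as described above; the function names are made up for this illustration.

```python
def full_mesh_adjacencies(n: int) -> int:
    """Adjacencies if every router on the segment peers with every other one."""
    return n * (n - 1) // 2


def dr_bdr_adjacencies(n: int) -> int:
    """Adjacencies with one DR and one BDR on the segment.

    Each of the n-2 other routers is adjacent to both the DR and the BDR,
    and the BDR is adjacent to the DR.
    """
    if n < 2:
        return 0
    return 2 * (n - 2) + 1


for n in (2, 5, 10, 50):
    print(n, full_mesh_adjacencies(n), dr_bdr_adjacencies(n))
```

The full mesh grows quadratically, while the DR/BDR count grows linearly with the number of routers.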

Router ID
Each router is a node in the graph (link
state database) and identified by a Router
ID
Automatically selected via hello process
Choose numerically highest IP address of all
loopback interfaces
If no loopback interfaces then choose highest
IP address of physical interfaces
Optionally, on Cisco routers, a priority value
can be configured (0 = no DR/BDR, 255 = max
chance to win, 1 = default)
Hello packet contains DR
Note that loopback interfaces are more stable than any physical interface.
Furthermore, it is easier for an administrator to manage the network using loopback
addresses as Router IDs.
If more than one router on the segment has the same priority level, the
election process picks the router with the highest router ID. The default priority
on a Cisco router is 1.
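The priority and tie-break rule can be illustrated with a short sketch. It is a deliberate simplification: it assumes all routers start at the same time, it ignores the non-preemptive behavior described above, and the `Candidate` class and `elect` function are hypothetical names, not part of any OSPF implementation.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    router_id: str
    priority: int = 1  # default priority as stated in the notes


def elect(candidates: List[Candidate]) -> Tuple[Optional[str], Optional[str]]:
    """Return (DR, BDR) router IDs for routers that come up simultaneously."""
    # Priority 0 means the router can never become DR or BDR.
    eligible = [c for c in candidates if c.priority > 0]
    # Highest priority wins; ties are broken by the numerically highest router ID.
    ranked = sorted(eligible,
                    key=lambda c: (c.priority, int(IPv4Address(c.router_id))),
                    reverse=True)
    dr = ranked[0].router_id if ranked else None
    bdr = ranked[1].router_id if len(ranked) > 1 else None
    return dr, bdr


routers = [Candidate("1.1.1.1"), Candidate("10.0.0.1"),
           Candidate("9.9.9.9", priority=0), Candidate("2.2.2.2", priority=5)]
print(elect(routers))
```

Router IDs are compared as 32-bit numbers rather than as strings, so "10.0.0.1" correctly ranks above "9.9.9.9" when priorities are equal.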

DR/BDR Election Process
Election process starts if no DR/BDR is listed in the
hello packets during the init state (i.e. when two
routers begin to establish an adjacency)
Note: if a DR/BDR has already been chosen, any new router in the
LAN would not change anything!
Therefore, the power-on order of routers is critical!
Always configure a loopback interface in order to
"name" your routers
Loopback interface never goes down
Ensures stability
Simple to manage
In OSPF it is recommended to use a loopback interface for the router ID. You
should configure the loopback interface first and then start the OSPF process;
otherwise the highest IP address of a physical interface will be taken.
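The selection rule for the Router ID can be written down as a small function. This is a sketch of the rule as stated on the slides, with a hypothetical function name and plain address lists as input; a real router would also honor a manually configured router ID.

```python
from ipaddress import IPv4Address
from typing import Sequence


def select_router_id(loopbacks: Sequence[str], physical: Sequence[str]) -> str:
    """Pick the highest loopback address, else the highest physical address."""
    pool = loopbacks if loopbacks else physical
    if not pool:
        raise ValueError("no IP interface available for a router ID")
    # Compare numerically, not as strings.
    return str(max(IPv4Address(a) for a in pool))


print(select_router_id(["1.1.1.1", "3.3.3.3"], ["192.168.1.1"]))
print(select_router_id([], ["10.0.0.1", "192.168.1.1"]))
```

Note that a loopback address wins even when a physical interface has a numerically higher address.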

2005/03/11 (C) Herbert Haas
OSPF - Areas
Why OSPF is Complicated
Part 2

"An algorithm
must be seen
to be believed"
Donald E. Knuth

OSPF Areas
To improve performance divide the
whole OSPF domain into multiple
Areas
Restrict Router LSA and Network
LSA within these Areas
All areas must be connected to the
so-called "Backbone Area"
"Area 0"
As each link is identified by a router LSA in the OSPF database, the total OSPF
routing traffic increases with the number of links and thus with the size of the
network. The number of network LSAs also increases in larger networks. To
overcome these limitations, OSPF partitions the whole OSPF
domain into smaller "areas". The basic idea is to filter router LSAs and network
LSAs at the borders between areas. Reachability of networks outside an area is
advertised through other LSA types. These details are discussed next.

ABR
Area Border Router (ABR):
Terminates Router LSAs
and Network LSAs
Forwards Network Summary LSAs
[Figure: Area 0 connected through three ABRs to Area 1, Area 2, and Area 5. Legend: LSA 1 = Router LSA, LSA 2 = Network LSA, LSA 3 = Network Summary LSA.]
Note:
Network Summary LSAs
are Distance Vector
updates !!!
Traffic from one area to another flows through dedicated routers only, the so-called
Area Border Routers (ABRs). The ABRs filter Router LSAs and Network
LSAs. Network destinations in other areas are advertised by so-called "Network
Summary LSAs", which carry simple distance-vector information, i.e. which
networks can be reached via which ABR.
We will deal with the following OSPF router types:
Internal Routers (IR): have all interfaces inside one area
Backbone Routers (BR): have at least one interface in the backbone area
Area Border Routers (ABR): have interfaces in at least two areas
Autonomous System Boundary Routers (ASBR): have at least one interface in a
non-OSPF domain; redistribute external routes into the OSPF domain
ASBRs are discussed next.

ASBR
Autonomous System
Border Router (ASBR)
Imports foreign routes via
AS External LSA
When an ABR receives an
AS External LSA it emits
ASBR Summary LSAs
to all routers
[Figure: Area 0 connected through three ABRs to Area 1, Area 2, and Area 5, plus an ASBR. Legend: LSA 1 = Router LSA, LSA 2 = Network LSA, LSA 3 = Network Summary LSA, LSA 4 = ASBR Summary LSA, LSA 5 = AS External LSA.]
An Autonomous System Border Router (ASBR) sends summary
information about foreign networks into the OSPF network, using LSA type 5. On
an ASBR you have to run two routing processes: OSPF and some other routing
protocol. The router redistributes routing information between OSPF and the other
routing process.

Stub Area
AS External LSA and
ASBR Summary LSA
are not sent into a
Stub Area
[Figure: Area 0 connected through three ABRs to Area 1, Area 2, and Area 5, plus an ASBR; one area is marked as a stub area and receives only LSA 3 from its ABR. Legend: LSA 1 = Router LSA, LSA 2 = Network LSA, LSA 3 = Network Summary LSA, LSA 4 = ASBR Summary LSA, LSA 5 = AS External LSA.]
An ASBR can send a lot of external routes, which are flooded into the OSPF
network. ABRs propagate this information into other OSPF areas, so each router in
an area knows all external links and stores them in its link state database. To
reach an external destination, the router still needs to send the packet to the
ABR. We can make the database of an internal router smaller if we create a stub area.
In a stub area, the ABR does not send external LSAs into the area; instead, the
ABR advertises a default route (0.0.0.0).

Totally Stubby Area
No external or
summary LSA
are sent into a
Totally Stubby Area
Cisco Specific
[Figure: Area 0 connected through three ABRs to Area 1, Area 2, and Area 5, plus an ASBR; one area is marked as a totally stubby area. Legend: LSA 1 = Router LSA, LSA 2 = Network LSA, LSA 3 = Network Summary LSA, LSA 4 = ASBR Summary LSA, LSA 5 = AS External LSA.]
This is a Cisco proprietary extension to the stub area. As with a stub area, the ABR
does not advertise external LSAs. In addition, the ABR does not send summary
LSAs from other areas; instead, a default route is injected into the Totally Stubby
Area.

Not So Stubby Area (NSSA)
ASBR advertises routes
of another routing
domain via NSSA
External LSA
ABR will translate the Type 7
LSA into a Type 5 LSA only
if the Type 7 LSA has
the P-bit set to 1
[Figure: Area 0 connected through three ABRs to Area 1, Area 2, and Area 5, plus ASBRs; one area is marked as NSSA and contains an ASBR that originates LSA 7. Legend: LSA 1 = Router LSA, LSA 2 = Network LSA, LSA 3 = Network Summary LSA, LSA 4 = ASBR Summary LSA, LSA 5 = AS External LSA, LSA 7 = NSSA External LSA.]
The NSSA ASBR has the option of setting or clearing the P-bit in the NSSA
External LSA. If the P-bit is set, the ABR translates this LSA into an AS
External LSA (Type 5).
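The area types discussed so far differ only in which LSA types may appear inside them. The following sketch summarizes the slides as a lookup table. The area type names and function names are chosen for this illustration, and the default route that an ABR injects into stub and totally stubby areas is not modeled.

```python
from typing import Dict, Optional, Set

# LSA types that may appear inside each kind of area, as shown on the slides.
LSA_TYPES_BY_AREA: Dict[str, Set[int]] = {
    "normal": {1, 2, 3, 4, 5},
    "stub": {1, 2, 3},           # no ASBR Summary (4) and no AS External (5)
    "totally_stubby": {1, 2},    # additionally no Network Summary (3)
    "nssa": {1, 2, 3, 7},        # externals travel as Type 7 inside the NSSA
}


def lsa_allowed(area_type: str, lsa_type: int) -> bool:
    """Return True if an LSA of this type may appear in an area of this kind."""
    return lsa_type in LSA_TYPES_BY_AREA[area_type]


def translate_type7(p_bit: bool) -> Optional[int]:
    """Return the LSA type the NSSA ABR emits for a Type 7 LSA, or None."""
    return 5 if p_bit else None


print(lsa_allowed("stub", 5), lsa_allowed("nssa", 7), translate_type7(True))
```

A table like this is also a convenient checklist when reading a link state database: an LSA type that should not be present in an area points to a misconfigured area type.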

Summarization
Efficient OSPF address design requires
hierarchical addressing
Address plan should support
summarization at ABRs
[Figure: Area 0 with Area 10, Area 20, and Area 30 attached; each ABR advertises one summary into Area 0]
20.1.0.0/16 ... 20.254.0.0/16 is summarized as 20/8
21.1.0.0/16 ... 21.254.0.0/16 is summarized as 21/8
22.1.0.0/16 ... 22.254.0.0/16 is summarized as 22/8
Summarization is another way to keep a router's database smaller. Instead of
sending each single subnet of the area, the ABR creates a summary route and
advertises it into the other area. Note that summarization is turned off by default
(i.e. it must be explicitly turned on).
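Whether an address plan really supports summarization can be checked with Python's standard `ipaddress` module. The sketch below uses the 20.x.0.0/16 range from the slide; the helper name `covered_by` is an assumption made for this example.

```python
from ipaddress import ip_network
from typing import Iterable


def covered_by(summary: str, subnets: Iterable[str]) -> bool:
    """Return True if every subnet lies inside the summary prefix."""
    summary_net = ip_network(summary)
    return all(ip_network(s).subnet_of(summary_net) for s in subnets)


# The subnets 20.1.0.0/16 ... 20.254.0.0/16 from the slide.
area_subnets = [f"20.{i}.0.0/16" for i in range(1, 255)]

print(len(area_subnets), "subnets, one summary:",
      covered_by("20.0.0.0/8", area_subnets))
print(covered_by("20.0.0.0/8", ["21.1.0.0/16"]))
```

With summarization turned on, the ABR would advertise the single /8 prefix instead of 254 individual entries.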

Virtual Links
Another way to
connect to area 0
using a point-to-point
unicast tunnel
Transit area must
have full routing
information
Must not be a stub area
Bad Design!
[Figure: Area 2 is attached to Area 1 only; a virtual link between two ABRs across Area 1 connects it to Area 0]
An OSPF design requires that all areas are contiguous and
connected to the backbone area. If this is not the case, as on the slide,
you have to use a Virtual Link in order to connect area 2 to area 0. A
virtual link is considered part of area 0; thus its area ID is 0.0.0.0.

Virtual Link Example
Now router 3.3.3.3 has
an interface in area 0
Thus router 3.3.3.3
becomes an ABR
Generates summary
LSA for network
7.0.0.0/8 into area 1 and
area 0
Also summary LSAs in
area 2 for all the
information it learned
from areas 0 and 1
[Figure: Router 1.1.1.1 (4.0.0.1 in Area 0, 5.0.0.1 in Area 1), Router 2.2.2.2 (5.0.0.2 and 6.0.0.2 in Area 1), Router 3.3.3.3 (6.0.0.3 in Area 1, 7.0.0.3 in Area 2)]
Router 3.3.3.3 is now "directly" connected to area 0 and, like a normal ABR,
generates summary LSAs in both directions.

Virtual Link Configuration Example
[Figure: Router 1.1.1.1 (4.0.0.1 in Area 0, 5.0.0.1 in Area 1), Router 2.2.2.2 (5.0.0.2 and 6.0.0.2 in Area 1), Router 3.3.3.3 (6.0.0.3 in Area 1, 7.0.0.3 in Area 2)]
! Router 1.1.1.1
router ospf 5
 network 4.0.0.0 0.255.255.255 area 0
 network 5.0.0.0 0.255.255.255 area 1
 area 1 virtual-link 3.3.3.3
! Router 3.3.3.3
router ospf 5
 network 7.0.0.0 0.255.255.255 area 2
 network 6.0.0.0 0.255.255.255 area 1
 area 1 virtual-link 1.1.1.1
Note that a virtual link points to the router ID at the other end, not to an IP address of an
interface.

GRE instead of Virtual Link
Alternative solution
Good: Transit area can also be a
stub area
Bad: All traffic is encapsulated
Not only routing traffic
Increased overhead
In some cases it is not possible to use a virtual link. As a possible solution, an IP
tunnel can be implemented.

Summary
Area concept supports large
networks
Keeps topology table small
Reduces routing traffic
But additional LSA types necessary
Inter-Area Routing is Distance Vector
Originally OSPF designed for ToS
routing - too resource greedy!

Quiz
When should we split the OSPF
domain into areas?
What about Areas and addressing
plans?
Why must all areas be connected to
the backbone area?

2005/03/11 (C) Herbert Haas
OSPF - LSAs
Why there is a dirty dozen of them
Part 3

LSA Sequence Number
In order to stop flooding, each LSA
carries a sequence number
Only increased if LSA has changed
So each router can check if a particular
LSA had already been forwarded
To avoid LSA storms
32 bit number
[Figure: copies of "LSA Nr. 44" reach a router over several paths; the slower copy is discarded]
When the end of the 32 bit sequence number space is reached, the originating router
waits for an hour so that this LSA ages out of each link state database. Then the
router resets the sequence number (lowest usable negative number, i.e. MSB = 1,
0x80000001) and continues to flood this LSA.
Note:
Radia Perlman proposed a "Lollipop" sequence number space, but today a linear
space is used as described above.
Since signed integers are used for sequence numbers, 0x80000001 represents
the lowest usable negative number in hexadecimal format. To verify this, the 2-
complement of this number must be calculated. This can be done in two steps.
First calculate the 1-complement by simply inverting the binary number, that is,
the most significant byte "0x80", which is "1000 0000", is transformed to "0111
1111", the least significant byte "0x01", which is "0000 0001", is transformed to
"1111 1110", and all bytes in between become "1111 1111". Second, in
order to obtain the 2-complement, "1" must be added. The final result is
"0111 1111 1111 1111 1111 1111 1111 1111", which is the absolute value
(without sign).
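The two-step calculation above is easy to verify in Python. This is only a checking aid; the helper names are invented for this sketch.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(raw: int) -> int:
    """Interpret a 32-bit pattern as a signed two's-complement integer."""
    return raw - (1 << 32) if raw & 0x80000000 else raw


def magnitude_via_twos_complement(raw: int) -> int:
    """Invert all 32 bits (1-complement), then add 1 (2-complement)."""
    ones_complement = ~raw & MASK32
    return (ones_complement + 1) & MASK32


initial = 0x80000001
print(to_signed32(initial))
print(format(magnitude_via_twos_complement(initial), "032b"))
```

The first line printed is the signed value of the initial sequence number; the second is the bit pattern of its absolute value, a zero followed by 31 ones.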

Detailed Flooding Decisions
LSA is identified by
its
LS type
Link State ID
Advertising Router
The most recent one
of two instances of
the same LSA is
determined by:
LS sequence number
LS checksum
LS age
MaxAgeDiff (15 min)
as tolerance value
On comparing two LSAs, the most recent one is that with:
1. Greater SeqNr
2. Same SeqNr: greater Checksum
3. Same Checksum and one LSA has MaxAge: the one with MaxAge
4. AgeDiff > MaxAgeDiff: smaller Age
5. AgeDiff < MaxAgeDiff: both are considered to be identical
Each LSA also carries a 16 bit age value, which is set to zero when the LSA is originated and
increased by every router during flooding. LSAs are also aged while they are held in
each router's database. If the sequence numbers (and checksums) are the same, the router
compares the ages (the younger the better), but only if the age difference between the two
instances is greater than MaxAgeDiff; otherwise both LSAs are considered to
be identical.
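The comparison rules can be expressed as a small function. The sketch below follows the decision order on the slide (sequence number, checksum, MaxAge, age difference); the class and function names are assumptions, and the sequence number is assumed to be already converted to a signed integer.

```python
from dataclasses import dataclass

MAX_AGE = 3600       # 60 minutes, in seconds
MAX_AGE_DIFF = 900   # 15 minutes, in seconds


@dataclass
class LsaInstance:
    seq: int       # signed sequence number
    checksum: int
    age: int       # seconds


def compare(a: LsaInstance, b: LsaInstance) -> int:
    """Return 1 if a is more recent, -1 if b is more recent, 0 if identical."""
    if a.seq != b.seq:
        return 1 if a.seq > b.seq else -1
    if a.checksum != b.checksum:
        return 1 if a.checksum > b.checksum else -1
    # Exactly one instance has reached MaxAge: that one counts as most recent.
    if (a.age == MAX_AGE) != (b.age == MAX_AGE):
        return 1 if a.age == MAX_AGE else -1
    if abs(a.age - b.age) > MAX_AGE_DIFF:
        return 1 if a.age < b.age else -1
    return 0


print(compare(LsaInstance(5, 1, 100), LsaInstance(5, 1, 1200)))
print(compare(LsaInstance(5, 1, 100), LsaInstance(5, 1, 200)))
```

The first call prints 1 because the ages differ by more than MaxAgeDiff and the first instance is younger; the second prints 0 because the age difference is within the tolerance.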


LS Age
Originating router sets LS age = 0 seconds
Increased during flooding by InfTransDelay by
every router
Also increased while stored in database
Age is never incremented past MaxAge (60 min)
LSAs having MaxAge:
Are not used in routing table calculation anymore
Are reflooded immediately
Are always considered as most recent
Thus quickly flushed from routing domain
Responsible router maintains LSRefreshTime (30
min) to refresh LSAs periodically

Router LSA - Type 1
Router ID (Highest IP address)
Number of Links
Link Descriptions
Link type (P2P, Stub, ...)
Neighboring router ID
Router interface address
ToS (typically not supported today)
Metrics

Network LSA - Type 2
DR's IP address
One subnet mask for this broadcast
segment
List of Router-IDs of all routers in the
broadcast segment

Network Summary LSA - Type 3
Originated by ABRs only
Each LSA Type 3 contains a number of
Destination networks + Subnet masks
Metric for each destination network
This is basically distance-vector
routing information (!)

ASBR Summary LSA - Type 4
Originated by ABRs
Advertise routes to ASBRs
Nearly identical to Type 3
Except destination is ASBR not a
network
Each LSA Type 4 contains
Router IDs of ASBRs
Mask 0.0.0.0 (host route)
Metric

AS External LSA - Type 5
Originated by ASBRs
External type 1
External type 2 (default)
Advertises
External routes
Default route
Contains
External Net-ID + Mask
Metric
Next hop (external, not ASBR)

NSSA External LSA - Type 7
Originated by ASBRs within NSSAs
Almost identical to Type 5
But only flooded within NSSA
RFC 1587

Other LSAs
Group Membership LSA (6)
For MOSPF
External Attribute LSA (8)
Alternative to IBGP
Should transport BGP information within an
OSPF domain
Not yet implemented, no RFC yet (?)
Opaque LSA (9)
Application specific information
Link local scope
Opaque LSA (10)
Application specific information
Area-local scope
Opaque LSA (11)
Application specific information
AS scope
Opaque LSAs are used, for example, as load indication messages with MPLS.

General OSPF Packet Structure
Carried directly in IP (protocol number 89)
All OSPF packets begin with a 24-byte OSPF
packet header
Header layout (32 bits per row, 24 bytes in total):
Version = 2 | Type | OSPF Packet Length
Router ID of originating router
Area ID of originating area
Checksum | Authentication Type
Authentication
Authentication
Packet Data
(Hello, Database Description, LS Request, LSU, LS Ack)
Packet types:
1. Hello
2. Database Description
3. Link State Request
4. Link State Update
5. Link State ACK
The OSPF version we use today is version 2. The packet type identifies the
actual OSPF message type that is carried in the packet data area at the bottom.
The OSPF packet length gives the number of bytes of the OSPF packet
including the OSPF header. Router and Area IDs identify the originator of this
packet. If a packet is sent over a virtual link, the Area ID is 0.0.0.0, because
virtual links are considered part of the backbone area. The checksum is calculated
over the entire packet including the header.
Three authentication types have been defined:
0 No authentication
1 Simple clear text password authentication
2 MD5 Checksum
If the Authentication Type is 1, a 64 bit clear text password is carried in the
authentication fields. If the Authentication Type is 2, the authentication
fields contain a key-ID, the length of the message digest, and a nondecreasing
cryptographic sequence number to prevent replay attacks. The actual message
digest is appended at the end of the packet.
The efficiency of routing updates also depends on the maximum transfer unit
(MTU) defined. Cisco defined an MTU of 1500 bytes for OSPF.
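The 24-byte header maps directly onto a fixed binary layout, which the following sketch decodes with Python's `struct` module. The field order follows the header described above; the function name, the dictionary keys, and the sample packet are made up for this illustration, and no checksum or authentication verification is attempted.

```python
import socket
import struct

# Version (1), Type (1), Length (2), Router ID (4), Area ID (4),
# Checksum (2), Authentication Type (2), Authentication (8) = 24 bytes.
OSPF_HEADER = struct.Struct("!BBH4s4sHH8s")

PACKET_TYPES = {1: "Hello", 2: "Database Description", 3: "Link State Request",
                4: "Link State Update", 5: "Link State ACK"}


def parse_ospf_header(data: bytes) -> dict:
    """Decode the common OSPF packet header from the start of a packet."""
    (version, ptype, length, router_id, area_id,
     checksum, au_type, auth) = OSPF_HEADER.unpack_from(data)
    return {
        "version": version,
        "type": PACKET_TYPES.get(ptype, "unknown"),
        "length": length,
        "router_id": socket.inet_ntoa(router_id),
        "area_id": socket.inet_ntoa(area_id),
        "checksum": checksum,
        "au_type": au_type,
        "authentication": auth,
    }


# A hand-built sample header: a Hello from router 1.1.1.1 in area 0.0.0.0.
sample = OSPF_HEADER.pack(2, 1, 44, socket.inet_aton("1.1.1.1"),
                          socket.inet_aton("0.0.0.0"), 0, 0, bytes(8))
print(parse_ospf_header(sample))
```

The `!` prefix in the format string selects network byte order, which is what OSPF uses on the wire.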

Hello Packet
Type 1
Fields:
Network Mask of originating interface: must match with receiving interface
Hello Interval: in seconds. Must match! (10 secs on LAN, 30 secs on non-broadcast networks)
Options: to ensure compatibility
Router Priority: used to elect DR and BDR "manually" (0-255)
Router Dead Interval: seconds before neighbor is declared dead. Must match! (4 x hello interval)
Designated Router: IP address of interface of DR
Backup Designated Router: IP address of interface of BDR
Neighbor #1 ... Neighbor #n
Options bits:
DC: OSPF Demand Circuits supported
EA: External Attributes LSAs supported
N: NSSA External LSAs supported
P: Translate LSA7 to LSA5 (carried in LSA7 only)
MC: MOSPF supported
E: AS External LSAs supported
T: ToS supported
The network mask must match the mask on the receiving interface, ensuring that
both routers share a segment and network.
The Options field is also used by other message types. If the Router Priority is set
to zero, this router cannot become DR or BDR.
Note that the fields "Designated Router" and "Backup Designated Router" contain only
the interface IP address of the DR or BDR on that network, not the router
ID!
If these addresses are unknown or not necessary (other network type), these
fields are set to 0.0.0.0.
It is important to know that neighbors must be configured with identical Hello and
Dead Intervals.
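The "must match" rules can be summarized as a simple check on a received Hello. This sketch covers only the three parameters marked as "must match" on the slide; the `HelloParams` class and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HelloParams:
    network_mask: str
    hello_interval: int            # seconds
    dead_interval: int             # seconds


def hello_accepted(local: HelloParams, received: HelloParams) -> bool:
    """Accept the Hello only if mask, Hello interval, and Dead interval match."""
    return (local.network_mask == received.network_mask
            and local.hello_interval == received.hello_interval
            and local.dead_interval == received.dead_interval)


lan_default = HelloParams("255.255.255.0", hello_interval=10, dead_interval=40)
tuned = HelloParams("255.255.255.0", hello_interval=5, dead_interval=20)

print(hello_accepted(lan_default, lan_default))
print(hello_accepted(lan_default, tuned))
```

The second call fails even though both routers keep the 4:1 ratio between dead and hello interval, because the absolute values differ.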

Database Description Packet
Type 2
Also called "DDP"
Fields:
Interface MTU: size of the largest IP packet that can be sent without fragmentation
Options: same definition as for the Hello Packet
Flags: 0 0 0 0 0 I M MS
I: marks the initial packet of a series of DD packets
M: more DD packets will follow
MS: Master=1, Slave=0
DD Sequence Number: to ensure that the full sequence of DD packets is received
LSA Headers
The DD sequence number is set by the master to some unique value in the first
DD packet. This number is incremented in subsequent packets.

Link State Request Packet
Type 3
Repeated for each requested LSA:
Link State Type: which type of LSA is requested (Router LSA, Network LSA, ...)
Link State ID: usage depends on the LSA type
Advertising Router: Router ID of originator of this LSA
Note that the Link State Request Packet uniquely identifies the LSA by the Type, ID,
and Advertising Router fields of its header. It does not include the sequence
number, checksum, and age, because the requestor is not interested in a specific
instance of the LSA but in the most recent instance.


Link State Update Packet
Type 4
Fields:
Number of LSAs
LSAs
LSUs contain one or more LSAs (limited by MTU)
Used for flooding and response to LS requests
LSUs are carried hop-by-hop

Link State ACK Packet
Type 5
Fields:
LSA Headers
Each LSA received must be explicitly acknowledged:
reliable flooding!
Acknowledged LSA is identified by LSA header
Single Link State ACK packet can acknowledge
multiple LSAs
The LS ACK packet consists
only of a list of LSA headers
(and an OSPF header of course)

The LSAs
LSA Header (32 bits per row):
Age | Options | LSA Type
Link State ID
Router ID of Advertising Router
Sequence Number
Checksum | Length
followed by the LSA Body
Field annotations:
Age: time in seconds since this LSA was originated. Incremented at each router.
Options: same definition as for the Hello Packet
LSA Type, Link State ID, Advertising Router: these three fields uniquely identify every LSA
Link State ID: usage depends on LSA Type
Sequence Number: incremented each time a new instance of the LSA is originated
Checksum: calculated over whole LSA except Age field
Length: number of bytes of LSA header + body
All LSAs have the LSA header at the beginning. This LSA header is also used in
Database Description and Link State Acknowledgement packets.
The Age is incremented by InfTransDelay seconds at each router interface this
LSA exits. The Age is also incremented in seconds while the LSA resides in a link state
database.
The Options field describes optional capabilities supported by the topological
portion described by this LSA.
The LSA Type describes which information is carried in the LSA Body. This is where
the structural differences between Router LSAs, Network LSAs, etc. are
identified.
The Link State ID is used differently by the LSA types. Basically this field
contains some information identifying the topological portion described by this
LSA. For example, a Router ID or an interface address is used here. The
following slides explain this field for each LSA type.
The Router ID identifies the originating router of this LSA.
The Sequence Number helps routers to identify the most recent instance of this
LSA.
The Checksum is a so-called 8 bit Fletcher checksum, providing more protection
than traditional checksum methods such as the one used for TCP. The first eight bits
contain the 1's complement sum of all octets, while the second eight bits contain a
high-order sum of the running sums. See RFC 1146 for more details.
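The idea of "a sum of all octets plus a sum of the running sums" can be illustrated with the generic Fletcher-16 algorithm. The sketch below is the textbook modulo-255 variant, shown only to make the two running sums concrete. It is not the exact procedure OSPF uses to place check bytes inside an LSA, and the function name is an assumption.

```python
def fletcher16(data: bytes) -> int:
    """Generic Fletcher-16: two 8-bit running sums, each taken modulo 255."""
    sum1 = 0  # sum of all octets
    sum2 = 0  # sum of the running values of sum1
    for octet in data:
        sum1 = (sum1 + octet) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1


print(hex(fletcher16(b"abcde")))
```

Because the second sum accumulates the running values of the first, swapping two octets changes the result, which a plain sum of octets would not detect.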

Router LSA
LSA Header:
Age | Options | LSA Type = 1
Link State ID = Advertising Router ID
Router ID of Advertising Router (identical to the Link State ID)
Sequence Number
Checksum | Length
LSA Body:
0 0 0 0 0 V E B | 0 0 0 0 0 0 0 0 | Number of Links
1st Link:
Link ID
Link Data
Link Type | Number of ToS | Metric
ToS | 0 0 0 0 0 0 0 0 | ToS Metric (one ToS metric for each ToS)
...
2nd Link:
Link ID
Link Data
Link Type | Number of ToS | Metric
ToS | 0 0 0 0 0 0 0 0 | ToS Metric
ToS | 0 0 0 0 0 0 0 0 | ToS Metric
Router LSAs are generated by all OSPF routers and must describe all links of the
originating router!
The V-bit (Virtual Link Endpoint) is set to one if the originating router is a
virtual link endpoint and this area is a transit area. The E-bit (External) is set if
the originating router is an ASBR. The B-bit (Border) is set if the originating
router is an ABR.
The Link ID and Link Data depend on the Link Type field, which describes the
general type of connection the link provides.
Link Type 1 is a point-to-point link. The Link ID is the neighbor's Router ID and the Link
Data field contains the IP address of the originating router's interface to the network.
Link Type 2 is a link to a transit network. The Link ID is the interface address of the
Designated Router and the Link Data field contains the IP address of the originating router's
interface to the network.
Link Type 3 is a link to a stub network. The Link ID is the IP network number or subnet
address and the Link Data field contains the network's address mask (subnet mask).
Link Type 4 is a virtual link. The Link ID is the neighboring router's Router ID and the Link
Data contains the MIB-II ifIndex value for the originating router's interface.
Number of ToS specifies the number of ToS Metrics listed for this link. For
each ToS an additional line is appended to this link state section. Generally, ToS
is no longer used today and the Number of ToS field is set to all-zero.
Metric is the cost of the interface that established this link.

Network LSA
LSA Header:
Age | Options | LSA Type = 2
Link State ID = IP address of DR's interface to this network
Router ID of Advertising Router
Sequence Number
Checksum | Length
LSA Body:
Network Mask
Attached Router
Attached Router
...
Network LSAs are originated by DRs and describe the multi-access network and
all routers attached to it, including the DR.

Network Summary LSA
LSA Header:
Age | Options | LSA Type = 3
Link State ID = IP address of advertised network
Router ID of Advertising Router
Sequence Number
Checksum | Length
LSA Body:
Network Mask
0 0 0 0 0 0 0 0 | Metric
ToS | ToS Metric (optional)
...
If a default route is advertised, both the Link State ID
and the Network Mask fields will be 0.0.0.0
Also used for route summarization
Note: Cisco only supports ToS=0
A Network Summary LSA is originated by an ABR and advertises networks
external to an area.

ASBR Summary LSA
LSA Header:
Age | Options | LSA Type = 4
Link State ID = Router ID of ASBR being advertised
Router ID of Advertising Router
Sequence Number
Checksum | Length
LSA Body:
0.0.0.0
0 0 0 0 0 0 0 0 | Metric
ToS | ToS Metric (optional)
...
Note: Cisco only supports ToS=0
An ASBR Summary LSA is originated by an ABR and advertises ASBRs external
to an area.


Autonomous System External LSA
LSA Header:
Age | Options | LSA Type = 5
Link State ID = IP address of destination
Router ID of Advertising Router
Sequence Number
Checksum | Length
LSA Body:
Network Mask
E | 0 0 0 0 0 0 0 | Metric
Forwarding Address
External Route Tag
E | ToS | ToS Metric (optional)
Forwarding Address (optional)
External Route Tag (optional)
...
Field annotations:
E: metric types E1 and E2
Forwarding Address: next hop (0.0.0.0 if ASBR is next hop)
External Route Tag: not used by OSPF
When describing a default route, both the Link State
ID and the Network Mask are set to 0.0.0.0.

NSSA External LSA
Same structure as AS External LSA
Forwarding address is
Next hop address for the network
between NSSA and adjacent AS, if this
network is advertised as internal route
Router ID of NSSA-ASBR otherwise



2005/03/11 (C) Herbert Haas
Shortest Path First
Dijkstra's Famous Algorithm

"The question of whether
computers can think is
like the question of whether
submarines can swim"
Edsger Wybe Dijkstra

Dijkstra's SP Algorithm
Famous paper "A note on two
problems in connection with graphs"
(1959)
Single source SP problem in a
directed graph
Important applications include
Network routing protocols (OSPF, IS-IS)
Traveller's route planner
Single source SP algorithms find the shortest paths to all vertices at once. The
only difference from single-pair SP algorithms is the termination condition.

Terms
Graph G(V,E) consists of vertices V and edges E
Edges are assigned costs c
"Length" of graph c(G) = sum of all costs
  Assumed to be positive ("Distance Graph")
"Distance" between two vertices d(v,v') = min{c(p)}, where p is a path from v to v'
  Can be infinite
p with c(p) = d(v,v') is called shortest path sp(v,v')
SPs are easier to calculate for distance graphs, where all costs are positive.


Definitions
Select start vertex s
Three sets of vertices:
  Selected (sp already calculated)
  Boundary (currently subject of calculation)
  Outside (not yet examined)
[Figure: the Selected, Boundary, and Outside sets around start vertex s]
Each vertex is assigned
1. a predecessor
2. a distance
3. a boolean "selected"


The Algorithm
Initialize vertices
  v.predecessor = none
  v.distance = infinity
  v.selected = false
Select s
  s.predecessor = s
  s.distance = 0
  s.selected = true
Add neighbors of s to boundary
Select v with lowest distance from boundary
Add neighbors of v to boundary
For these neighbors calculate distance using v as predecessor
Previous vertices might get better total distance

Example
[Figure: example graph with nine vertices and edge costs; start vertex s = 1]
Entries are (vertex number, distance, predecessor).
Selected      Boundary
(1, 0, 1)     (2, 2, 1) (6, 9, 1) (7, 15, 1)
(2, 2, 1)     (6, 9, 1) (7, 8, 2) (3, 6, 2)
(3, 6, 2)     (6, 9, 1) (7, 8, 2) (4, 8, 3) (9, 21, 3)
(7, 8, 2)     (6, 9, 1) (4, 8, 3) (9, 10, 7) (8, 23, 7)
(4, 8, 3)     (6, 9, 1) (8, 23, 7) (9, 9, 4) (5, 9, 4)
(6, 9, 1)     (9, 9, 4) (5, 9, 4) (8, 20, 6)
(9, 9, 4)     (5, 9, 4) (8, 13, 9)
(5, 9, 4)     (8, 12, 5)
(8, 12, 5)
Note that the left list ("Selected") is sometimes called the PATH list, and the right
list ("Boundary") is sometimes called the TENT list (from tentative). It's got
nothing to do with a beer tent.
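The steps of the algorithm can be sketched in a few lines of Python. This is a minimal sketch, not the author's implementation: the boundary (TENT list) is kept in a binary heap from the standard `heapq` module, and the edge costs in `GRAPH` are inferred from the (vertex, distance, predecessor) entries of the example. The direction of each edge is an assumption, and the names `GRAPH` and `dijkstra` are illustrative only.

```python
import heapq

# Edge costs inferred from the boundary entries of the example.
# Assumption: each edge is directed from the selected vertex to its neighbor.
GRAPH = {
    1: {2: 2, 6: 9, 7: 15},
    2: {3: 4, 7: 6},
    3: {4: 2, 9: 15},
    4: {5: 1, 9: 1},
    5: {8: 3},
    6: {8: 11},
    7: {8: 15, 9: 2},
    8: {},
    9: {8: 4},
}


def dijkstra(graph, start):
    """Return {vertex: (distance, predecessor)} for every reachable vertex."""
    selected = {}                    # the "Selected" (PATH) list
    boundary = [(0, start, start)]   # the "Boundary" (TENT) list as a min-heap
    while boundary:
        distance, vertex, predecessor = heapq.heappop(boundary)
        if vertex in selected:
            continue  # stale entry: a shorter path was selected earlier
        selected[vertex] = (distance, predecessor)
        for neighbor, cost in graph[vertex].items():
            if neighbor not in selected:
                # A vertex may be pushed several times; the cheapest entry wins.
                heapq.heappush(boundary, (distance + cost, neighbor, vertex))
    return selected


result = dijkstra(GRAPH, 1)
for vertex, (distance, predecessor) in result.items():
    print(vertex, distance, predecessor)
```

Vertices with equal distance may be selected in a different order than on the slide, but the final distances and predecessors are the same.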


Result
Single source SP
Minimal length
Complete
[Figure: the example graph with the selected shortest paths from start vertex s]

Performance
Greedy algorithm
Most critical: Implementation of boundary data structure
  No explicit structure: O(|V|^2)
  Fibonacci heap: O(|E| + |V| log |V|)
Alternatives
  Bellman-Ford (RIP) algorithm
  Floyd-Warshall algorithm
  A* algorithm
    Extends SPF with an estimation function to enhance
    performance in certain situations
The SPF algorithm is of the "greedy" type. Dijkstra originally proposed to treat the
boundary vertices like outside vertices, therefore no explicit data structure is
needed for the boundary vertices. This implementation is efficient for graphs
with lots of edges but not efficient with so-called "thin" graphs. One of the best
implementations uses Fibonacci heaps for boundary representation.
Alternative algorithms are for example the Bellman-Ford or the Floyd-Warshall
algorithm, which are based on Bellman's optimization principle ("if the shortest path
from A to C runs over B, then the partial path AB must also be the shortest
possible").

About E. W. Dijkstra
Edsger W. Dijkstra (1930-2002)
Born in 1930 in Rotterdam
Degrees in mathematics and theoretical physics from the University of Leyden
and a Ph.D. in computing science from the University of Amsterdam
Programmer at the Mathematisch Centrum, Amsterdam, 1952-62
Professor of mathematics, Eindhoven University of Technology, 1962-1984
Burroughs Corporation research fellow, 1973-1984
Schlumberger Centennial Chair in Computing Sciences at the University of
Texas at Austin, 1984-1999
Retired as Professor Emeritus in 1999
1972 recipient of the ACM Turing Award, often viewed as the Nobel Prize for
computing
Died 6 August 2002
Member of the Netherlands Royal Academy of Arts and Sciences, a member of
the American Academy of Arts and Sciences, and a Distinguished Fellow of the
British Computer Society. He received the 1974 AFIPS Harry Goode Award, the
1982 IEEE Computer Pioneer Award, and the 1989 ACM SIGCSE Award for
Outstanding Contributions to Computer Science Education. Athens University of
Economics awarded him an honorary doctorate in 2001. In 2002, the C&C
Foundation of Japan recognized Dijkstra "for his pioneering contributions to the
establishment of the scientific basis for computer software through creative
research in basic software theory, algorithm theory, structured programming, and
semaphores".
Dijkstra enriched the language of computing with many concepts and phrases,
such as structured programming, separation of concerns, synchronization, deadly
embrace, dining philosophers, weakest precondition, guarded command, the
excluded miracle, and the famous "semaphores" for controlling computer
processes. The Oxford English Dictionary cites his use of the words "vector" and
"stack" in a computing context.
(Source: http://www.cs.utexas.edu)

CIDR
The Life Belt of the Internet

Early IP Addressing
Before 1981 only class A addresses were used
  Original Internet addresses comprised 32 bits (8 bit net-id = 256 networks)
In 1981 RFC 791 (IP) was finished and classes were introduced
  7 bit net-id for class A networks
  14 bit net-id for class B networks
  21 bit net-id for class C networks

IP is an old protocol which was born with several design flaws. Of course this
happened basically because IP was originally not supposed to run over the
whole world.
The classful addressing scheme led to a big waste of the 32 bit address space.
A short address design history:
1981 Classful Addressing RFC 791
1985 Subnetting RFC 950
1987 VLSM RFC 1009
1993 CIDR RFC 1517 - 1520
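As a rough illustration of the classful scheme, the class of an address follows from its leading bits, and the net-id widths above fix how many networks each class can hold. The sketch below uses only the Python standard library; the function name `address_class` and the table `NET_ID_BITS` are made up for this example.

```python
import ipaddress

# Net-id widths per class as listed on the slide.
NET_ID_BITS = {"A": 7, "B": 14, "C": 21}


def address_class(address):
    """Return the classful address class derived from the first octet."""
    first_octet = int(ipaddress.IPv4Address(address)) >> 24
    if first_octet < 128:    # leading bit  0
        return "A"
    if first_octet < 192:    # leading bits 10
        return "B"
    if first_octet < 224:    # leading bits 110
        return "C"
    return "D/E"             # multicast and reserved space


for name, bits in NET_ID_BITS.items():
    # Upper bound on network numbers; a few values are reserved in practice.
    print(name, 2 ** bits)

print(address_class("194.20.1.0"))
```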

Address Classes
From 1981-1993 the Internet was Classful (!)
Early 80s: Jon Postel volunteered to maintain assigned network addresses
  Paper notebook
Internet Registry (IR) became part of IANA
Postel passed his task to SRI International
  Menlo Park, California
  Called Network Information Center (NIC)

Until 1993 the Internet used classful routing. All organizations were assigned
either class A, B, or C network numbers. In the early 1980s, one of the inventors
of the Internet, Jon Postel, volunteered to maintain all assigned network addresses
simply using a paper notebook!
Later the Internet Registry (IR) became part of the IANA and Jon Postel's task
was passed to the Network Information Center, which is represented by SRI
International.
FYI: See http://www.iana.org

Classful - Drawbacks
"Three sizes don't fit all" !!!
  Demand to assign as little as possible
  Demand to aggregate as many as possible
Assigning a whole network number
  Reduces routing table size
  But wastes address space
Class B supports 65534 host addresses,
while class C supports 254...
But typical organizations
require 300-1000 !!!
With only the full address classes available, it was difficult to match all needs.

Subnetting
Subnetting introduced in 1984
  Net + Subnet (=another level)
  RFC 950 (1985)
  Initially only statically configured
Classes A, B, C still used for global routing !
  Destination Net might be subnetted
  Smaller routing tables

With the introduction of subnetting (RFC 950) a network number could be divided into
several subnets. Thus large organizations that needed multiple network numbers
were assigned a single network number, which they subnetted further themselves.
This way, subnetting greatly reduced the Internet routing table sizes and saved
IP address space.
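A small sketch of the idea, using Python's standard `ipaddress` module: one class B network number is announced globally, while the organization divides it internally. The network 172.20.0.0/16 and the choice of 8 subnet bits are arbitrary assumptions made for illustration.

```python
import ipaddress

# Hypothetical class B network number assigned to one organization.
network = ipaddress.ip_network("172.20.0.0/16")

# Borrow 8 host bits for the subnet field: 256 subnets with a /24 mask each.
subnets = list(network.subnets(new_prefix=24))

print(len(subnets))                    # subnets visible only inside the organization
print(subnets[0], subnets[-1])         # first and last subnet
print(subnets[0].num_addresses - 2)    # usable host addresses per subnet
```

From the outside, only the single class B route is needed, which is why global routing tables stay small.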

Routing Table Growth (88-92)
MM/YY  ROUTES          MM/YY  ROUTES
       ADVERTISED             ADVERTISED
------------------------ -----------------------
Feb-92 4775 Apr-90 1525
Jan-92 4526 Mar-90 1038
Dec-91 4305 Feb-90 997
Nov-91 3751 Jan-90 927
Oct-91 3556 Dec-89 897
Sep-91 3389 Nov-89 837
Aug-91 3258 Oct-89 809
Jul-91 3086 Sep-89 745
Jun-91 2982 Aug-89 650
May-91 2763 Jul-89 603
Apr-91 2622 Jun-89 564
Mar-91 2501 May-89 516
Feb-91 2417 Apr-89 467
Jan-91 2338 Mar-89 410
Dec-90 2190 Feb-89 384
Nov-90 2125 Jan-89 346
Oct-90 2063 Dec-88 334
Sep-90 1988 Nov-88 313
Aug-90 1894 Oct-88 291
Jul-90 1727 Sep-88 244
Jun-90 1639 Aug-88 217
May-90 1580 Jul-88 173
Growth in routing table size, total numbers
Source for the routing table size data is MERIT
The list above shows the growth of the routing tables from 1988 until 1992 in
total numbers.

Network Number Statistics, April 1992
          Total     Allocated   Allocated %
Class A   126       48          54%
Class B   16383     7006        43%
Class C   2097151   40724       2%
Source: RFC 1335
Only 2% of more than 2 million Class C addresses assigned !!!
The table above shows statistics for the assignment of IP addresses in April
1992. Obviously, class A and B addresses had been allocated more quickly than class
C addresses. In the following years the utilization of class C addresses increased
rapidly while class A and B addresses were spared.
VLSM and NAT (invented 1994) in particular supported the utilization of class C
addresses.

Supernetting (RFC 1338)
In 1992: RFC 1338 stated scaling problem:
  Class B exhaustion
  No class for typical organizations available
  Unbearable growth of routing table
Use subnetting technique also in the Internet !
  Do hierarchical IP address assignment !
Aggregation = "Supernetting"
(Smaller netmask than natural netmask)
Source: www.cisco.com
RFC 1338 introduced Supernetting: an Address Assignment and Aggregation
Strategy, now obsoleted by RFC 1519.

Classful Routing Update
194.20.1.0/24
194.20.2.0/24
.
.
.
194.20.30.0/24
194.20.31.0/24
194.20.1.0
194.20.2.0
194.20.3.0
.
.
.
194.20.30.0
194.20.31.0
BGP-3
BGP-3 was a classful routing protocol, sending information about major
class A, B, and C networks only.

Now Classless and Supernetting
194.20.0.0/19
194.20.1.0/24
194.20.2.0/24
.
.
.
194.20.30.0/24
194.20.31.0/24
BGP-4
BGP-4 is classless; it can aggregate a range of class C networks into one supernet.
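The aggregation can be reproduced with Python's standard `ipaddress` module. This is only a sketch of the arithmetic, not of BGP-4 itself. Note that a complete /19 needs all 32 class C networks from 194.20.0.0/24 to 194.20.31.0/24; the list on the slide starts at 194.20.1.0/24, so the example also shows what happens when the first network is missing.

```python
import ipaddress

# The 32 class C networks 194.20.0.0/24 ... 194.20.31.0/24.
class_c_networks = [ipaddress.ip_network(f"194.20.{i}.0/24") for i in range(32)]

# collapse_addresses merges adjacent networks into the fewest covering prefixes.
supernets = list(ipaddress.collapse_addresses(class_c_networks))
print(supernets)

# Without 194.20.0.0/24 the block is no longer one complete /19.
partial = list(ipaddress.collapse_addresses(class_c_networks[1:]))
print(partial)
```

One /19 announcement replaces 32 classful routes, which is exactly the routing table saving that supernetting aims for.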

CIDR
September 1993, RFC 1519:
Classless Inter-Domain Routing (CIDR)
Requires classless routing protocols
  BGP-3 upgraded to BGP-4
  New BGP-4 capabilities were drawn on a napkin, with all implementors of
  significant routing protocols present (legend)
  RFC 1654
RFC 1519 introduced Classless Inter-Domain Routing (CIDR): an Address
Assignment and Aggregation Strategy
RFC 1654 a draft standard for BGP-4
RFC 1771 a standard for BGP-4

Address Management
ISPs assign
contiguous blocks of
contiguous blocks of
contiguous blocks ...
of addresses to their customers
Aggregation at borders possible !
Tier I providers filter routes with prefix lengths larger than /19
  But more and more exceptions today...

To minimize the sizes of the routing tables, ISPs use aggregation, giving their
customers contiguous blocks of networks or subnets. Most ISPs would
not accept routes from other ISPs if the prefix is longer than /19.

International Address Assignment
August 1990, RFC 1174 (by IAB) proposed regionally distributed registry model
  Regionally means continental ;-)
Regional Internet Registries (RIRs)
  RIPE NCC
  APNIC
  ARIN
RFC 1174: IAB Recommended Policy on Distributing Internet Identifier
Assignment.
This RFC represents the official view of the Internet Activities Board (IAB), and
describes the recommended policies and procedures on distributing Internet
identifier assignments and dropping the connected status requirement.

RIRs
RIPE NCC (1992)
  Réseaux IP Européens (RIPE) founded the Network Coordination Centre (NCC)
APNIC (1993)
  Asia Pacific Network Information Centre
ARIN (1997)
  American Registry for Internet Numbers
AfriNIC
  Africa
LACNIC
  Latin America and Caribbean

RIPE NCC is located in Amsterdam and serves 109 countries including Europe,
Middle-East, Central Asia, and African countries located north of the equator.
The RIPE NCC currently consists of more than 2700 members.
APNIC was relocated to Brisbane (Australia) in 1998. Currently there are 700
member organizations. Within the APNIC there are also five National Internet
Registries (NIRs) in Japan, China, Korea, Indonesia, and Taiwan, representing
more than 500 additional organizations.
AfriNIC and LACNIC are relatively new RIRs (2002?).

ICANN, RIRs, and LIRs
[Figure: organizational hierarchy]
ICANN
  ASO (IP Policies), DNSO (Names), PSO (Parameters)
IANA
RIRs: APNIC, ARIN, RIPE NCC, LACNIC, AfriNIC
LIRs: Council, Chello, ACONET, AT-Net, ...
After the foundation of ICANN, the Internet Assigned Numbers Authority
(IANA) is only responsible for IP address allocation to RIRs.
Other sub-organizations of ICANN:
Address Supporting Organization (ASO), which was founded by
APNIC, ARIN, and RIPE NCC, and should oversee the recommendations
of IP policies
Domain Name Supporting Organization (DNSO) is responsible for
maintaining the DNS
Protocol Supporting Organization (PSO) is responsible for registration
of various protocol numbers and parameters used by RFC protocols
Originally, all tasks of these sub-organizations were performed by the IANA alone.
Today the IANA only takes care of address assignment to the RIRs.
The slide above shows a few of the long list of LIRs in Austria. These LIRs are
widely known by Internet users as "Internet Service Providers".

CIDR Concepts Summary
Coordinated address allocation
Classless routing
Supernetting

RFC 1366 Address Blocks
192.0.0.0 - 193.255.255.255 ... Multiregional
194.0.0.0 - 195.255.255.255 ... Europe
198.0.0.0 - 199.255.255.255 ... North America
200.0.0.0 - 201.255.255.255 ... Central/South America
202.0.0.0 - 203.255.255.255 ... Pacific Rim
RFC 1366, Guidelines for Management of IP Address Space, was obsoleted by
RFC 1466 in 1993; RFC 2050 followed in 1996.

Class A Assignment
IANA responsibility
  RFC 1366 states: "There are only approximately
  77 Class A network numbers which are unassigned, and
  these 77 network numbers represent about 30% of the
  total network number space."
64.0.0.0 - 127.0.0.0 were reserved for the end of (IPv4) days ?
  Recent assignments (check IANA website)
The assignment of Class A addresses is controlled by the IANA.

Class B Assignment
IANA and RIRs requirements
  Subnetting plan which documents more than 32 subnets within its
  organizational network
  More than 4096 hosts
RFC 1366 recommends using multiple Class Cs wherever possible
In order to receive a class B address, an organization must fulfill strict
requirements such as having more than 4096 hosts and more than 32 subnets.

Class C Assignment
If an organization requires more than a single Class C, it will be assigned a
bit-wise contiguous block from the Class C space
Up to 16 contiguous Class C networks per subscriber
(= one /20 prefix, i.e. 12 host bits)
Organization                              Assignment
1) requires fewer than 256 addresses      1 class C network
2) requires fewer than 512 addresses      2 contiguous class C networks
3) requires fewer than 1024 addresses     4 contiguous class C networks
4) requires fewer than 2048 addresses     8 contiguous class C networks
5) requires fewer than 4096 addresses     16 contiguous class C networks
Example (RFC 1366) for Class C assignment:
For instance, a European organization which requires fewer than 2048 unique
IP addresses and more than 1024 would be assigned 8 contiguous class C
network numbers from the number space reserved for European networks,
194.0.0.0 - 195.255.255.255. If an organization from Central America required
fewer than 512 unique IP addresses and more than 256, it would receive 2
contiguous class C network numbers from the number space reserved for
Central/South American networks, 200.0.0.0 - 201.255.255.255.
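The assignment table follows a simple power-of-two rule, which can be sketched as below. The function name `class_c_assignment` is hypothetical, and the returned prefix length assumes that the block is bit-wise contiguous and suitably aligned, as the slide requires.

```python
def class_c_assignment(required_addresses):
    """Return (number of contiguous class C networks, prefix length of the block)."""
    if not 0 < required_addresses < 4096:
        raise ValueError("outside the range covered by the class C assignment table")
    blocks = 1
    # Double the block until the request is strictly smaller than its size.
    while blocks * 256 <= required_addresses:
        blocks *= 2
    # One class C is a /24; every doubling shortens the prefix by one bit.
    prefix_length = 24 - (blocks.bit_length() - 1)
    return blocks, prefix_length


print(class_c_assignment(1500))   # the European example: between 1024 and 2048
print(class_c_assignment(300))    # the Central American example: between 256 and 512
```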

RFC 1918 - Private Addresses
In order to prevent address space depletion, RFC 1918 defined three
private address blocks
  10.0.0.0 - 10.255.255.255 (prefix: 10/8)
  172.16.0.0 - 172.31.255.255 (prefix: 172.16/12)
  192.168.0.0 - 192.168.255.255 (prefix: 192.168/16)
Connectivity to global space via Network Address Translation (NAT)
RFC 1918 defines an "Address Allocation for Private Internets", that is, three
address spaces which should only be used in private networks.
Any route to these networks must be filtered in the Internet! Any router in the
Internet must not keep any RFC 1918 address in its routing table!
Together with these addresses, Network Address Translation (NAT) is needed if
private networks should be connected to the Internet.
This solution greatly reduces the number of allocated IP addresses and also the
routing table size because now class C networks can be assigned very efficiently,
using a prefix up to /30.
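A filter for the three private blocks might look like the following sketch. It checks membership against the prefixes listed above using the standard `ipaddress` module; the name `is_rfc1918` is an assumption made for this example.

```python
import ipaddress

# The three private address blocks defined by RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """Return True if the address lies in one of the RFC 1918 blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in RFC1918_BLOCKS)


for candidate in ("10.0.0.4", "172.31.255.255", "172.32.0.1", "194.10.20.4"):
    print(candidate, is_rfc1918(candidate))
```

The explicit list is used instead of the module's `is_private` attribute, because that attribute also covers other special-purpose ranges.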

NAT Example
[Figure: hosts 10.0.0.1/8, 10.0.0.2/8, 10.0.0.3/8, and 10.0.0.4/8 on the inside local
network 10.0.0.0/8; inside global network 194.10.20.0/24]
Packet with inside local address:   DA=X.X.X.X  SA=10.0.0.4     DATA
Packet with inside global address:  DA=X.X.X.X  SA=194.10.20.4  DATA
Network Address Translation (NAT): in order to communicate with the
Internet, private addresses (inside local) have to be translated into official
addresses assigned by an ISP (inside global).
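The translation in the example can be sketched as a static one-to-one mapping. The slide only shows that 10.0.0.4 becomes 194.10.20.4; mapping by host offset, the table `nat_table`, and both function names are assumptions of this sketch, not a description of any particular NAT implementation.

```python
import ipaddress

INSIDE_LOCAL = ipaddress.ip_network("10.0.0.0/8")
INSIDE_GLOBAL = ipaddress.ip_network("194.10.20.0/24")

nat_table = {}  # inside local address -> inside global address


def translate_outbound(source):
    """Rewrite an inside local source address to an inside global one."""
    local = ipaddress.ip_address(source)
    if local not in INSIDE_LOCAL:
        raise ValueError("not an inside local address")
    # Assumption: static mapping that keeps the host offset within the block.
    offset = int(local) - int(INSIDE_LOCAL.network_address)
    if not 0 < offset < INSIDE_GLOBAL.num_addresses - 1:
        raise ValueError("no inside global address available")
    nat_table[local] = INSIDE_GLOBAL.network_address + offset
    return str(nat_table[local])


def translate_inbound(destination):
    """Rewrite an inside global destination back to the inside local address."""
    target = ipaddress.ip_address(destination)
    for local, global_address in nat_table.items():
        if global_address == target:
            return str(local)
    raise KeyError("no translation entry")


print(translate_outbound("10.0.0.4"))
print(translate_inbound("194.10.20.4"))
```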

But...
Source: www.cisco.com
But this is not really the end of the story. The growth rate of the Internet was and
is generally exponential, that is exp(k*x). Soon after the introduction of CIDR
the progressive factor k increased dramatically, thus even CIDR could only
reduce k, but not the general exponential character.
It is interesting to question how long the (also exponential) growth rate of silicon
memory and processing power together with CIDR and NAT can mitigate the
effects of the Internet growth.
As of today, the only solution to deal with this problem in the long run is to
introduce IPv6 and a more hierarchical routing strategy.

BGP
Introduction and Basic
Procedures

Border Gateway Protocol (BGP)
BGP-3
  Was classful
  Central AS needed (didn't scale well)
  Not further discussed here!
  RFC 1267
BGP-4
  Classless
  Meshed AS topologies possible
  Used today - discussed in the following sections!!!
  RFC 1771
BGP is a distance vector protocol. This means that it will announce to its
neighbors those IP networks that it can reach itself. The receivers of that
information will say "if that AS can reach those networks, then I can reach them
via it".
If two different paths are available to reach one and the same IP subnet, then the
shortest path is used. This requires a means of measuring the distance, a metric.
All distance vector protocols have such means. BGP is doing this in a very
sophisticated way by using attributes attached to the reachable IP subnet.
BGP sends routing updates to its neighbors by using a reliable transport. This
means that the sender of the information always knows that the receiver has
actually received it. So there is no need for periodic updates or routing
information refreshes. Only information that has changed is transmitted.
The reliable information exchange, combined with the batching of routing
updates also performed by BGP, allows BGP to scale to Internet-sized networks.

BGP-4 at a Glance
Carried within TCP
  Manually configured neighbor-routers
  Therefore reliable transport (port 179)
Neighbor routers establish link-state
  Hello protocol (60 sec interval)
Incremental Updates upon topology changes
  New routes are updated
  Lost routes are withdrawn
Each route is assigned a policy and an AS-Path leading to that network
  Using attributes
A router which has received reachability information from a BGP peer must be
sure that the peer router is still there. Otherwise traffic could be routed towards a
next-hop router that is no longer available, causing the IP packets to be lost in a
black hole.
TCP does not provide the service to signal that the TCP peer is lost, unless some
application data is actually transmitted between the peers. In an idle state, where
there is no need for BGP to update its peer, the peer could be gone without TCP
detecting it.
Therefore, BGP takes care of detecting its neighbors' presence by periodically
sending small BGP keepalive packets to them. These packets are considered
application data by TCP and must therefore be transmitted reliably. The peer
router must also, according to the BGP specification, reply with a BGP keepalive
packet.

Path Vector Protocol
Metric: Number of AS-Hops
All traversed ASs are carried in the AS-Path attribute
  BGP is a "Path Vector protocol"
  Better than Distance Vector because of inherent topology information
  No loops or count to infinity possible

Each BGP update consists of one or more IP subnets and a set of attributes
attached to them. The intrinsic metric is the number of AS hops. Note that this
metric is given implicitly by an AS path attribute, which is a vector of all ASs
traversed.
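The path vector idea can be illustrated with three small helper functions. This is a simplified sketch rather than real BGP behavior: AS numbers are plain integers, the function names are made up, and policy attributes are ignored.

```python
def advertise(as_path, own_as):
    """Prepend the own AS number before passing a route to a neighbor AS."""
    return [own_as] + as_path


def accept(as_path, own_as):
    """Reject routes whose path already contains the own AS (loop prevention)."""
    return own_as not in as_path


def metric(as_path):
    """The intrinsic metric is the number of AS hops."""
    return len(as_path)


# AS 1 originates a route, AS 2 passes it on to AS 3.
path_from_as1 = advertise([], 1)
path_from_as2 = advertise(path_from_as1, 2)

print(path_from_as2, metric(path_from_as2))
print(accept(path_from_as2, 3))   # AS 3 has not seen this route before
print(accept(path_from_as2, 1))   # AS 1 finds itself in the path: a loop
```

Because every AS can see the whole path, a looping update is discarded immediately instead of counting to infinity.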

BGP Database
BGP routers also maintain a BGP Database
  Roadmap information through path vectors
  Attributes
Routing Table calculated from BGP Database
CPU/Memory resources needed
The designers of the BGP protocol have succeeded in creating a highly scalable
routing protocol, which can forward reachability information between
Autonomous Systems, also known as Routing Domains. They had to consider an
environment with an enormous number of reachable networks and complex
routing policies driven by commercial rather than technical considerations.
TCP, a well-known and widely proven protocol, was chosen as the transport
mechanism. That decision kept the BGP protocol simple, but it put an extra load
on the CPU of the routers running BGP. The point-to-point nature of TCP might
also introduce a slight increase in network traffic, as any update that should be
sent to many receivers has to be multiplied into several copies, which are then
transmitted on individual TCP sessions to the receivers.
Whenever there was a design choice between fast convergence and scalability,
scalability was the top priority. Batching of updates and the relatively low
frequency of keepalive packets are examples where convergence time has been
second to scalability.

Some Interesting Numbers
Today's Internet BGP Backbone Routers are burdened
  About 100,000 routes (!)
  About 10,000 Autonomous Systems
Despite excessive CIDR, NAT, and Default Routes
Collapse expected
  Looking for new solutions

Internet routers do a hard job. The number of networks has been increasing
exponentially since the early 1990s and the only way to overcome routing table
exhaustion is to apply excessive supernetting (CIDR), NAT, and default routing.
In 2001 about 100,000 routes were counted in a typical BGP Internet router.
Moreover, 10,000 ASs had been registered.
Although these techniques significantly reduce table growth, a collapse is
expected in the near future unless other techniques are explored.

Basic Idea of BGP is Easy !
1) BGP notifies other Autonomous Systems
about reachabilities of networks
2) Each single route has attributes
associated to it
3) Routers can apply policies for each
route based on these attributes
(e.g. filtering routes)
The text above summarizes the basic BGP-4 functionality. As can be seen, it is
not as complicated as many people think.

BGP Limitations
Destination based routing
  No policies for source address
Hop-by-hop routing
  Leads to hop-by-hop policies
  Connectionless nature of IP
  Mitigated through
    Community attribute
    Peer groups
There are still some limitations in BGP. It is impossible to implement source
address-based policies with BGP (unless supported by vendor specific
techniques). Furthermore BGP is still hop-by-hop routing, that is, the
connectionless nature of IP makes it impossible to foresee what the next routers
will do with the route.

Neighborship Establishment
Open Message
  BGP Version (4)
  AS number
  BGP Router-ID (IP address)
  Hold Time
Problems are indicated with Notification message
[Figure: AS 1 (Net 11, Net 12) and AS 2 (Net 48, Net 49) exchange Open messages]
The BGP protocol is carried in a TCP session, which must be opened from one
router to the other. In order to do so, the router attempting to open the session
must be configured to know to which IP address to direct its attempts.

NLRI Update
After open message, all known routes are exchanged using update messages
Contains network layer reachability information (NLRI)
  List of prefix and length
[Figure: AS 1 (Net 11, Net 12) and AS 2 (Net 48, Net 49) exchange Update messages;
afterwards both ASs know Net 11, Net 12, Net 48, and Net 49]
Once the BGP session is established, routing updates start to arrive. Each BGP
routing update consists of one or more entries (routes). Each route is described by
the IP address and subnet mask along with any number of attributes. The next-
hop, AS-path and origin attributes must always be present. Other BGP attributes
are optionally present.

Steady State
After Open/Update procedure, BGP is nearly quiet - No periodic updates !
Only keepalive messages are sent
  19 Bytes
  Per default every 60s
[Figure: AS 1 and AS 2 exchange Keepalive messages; both ASs know Net 11, Net 12,
Net 48, and Net 49]
After the update process has finished, no periodic updates are sent, just keepalives,
by default every 60 seconds.

Topology Change:
Incremental Updates upon topology or attribute changes
Withdraw message upon loss of network
[Figure: AS 1 and AS 2; a withdraw message is sent for Net 48]
If there is a topology change, only information about the changes is transmitted.

RIB
BGP routing information is stored in RIBs
RIBs might be combined (vendor specific)
Only best paths are forwarded to the neighboring ASs
Alternative paths remain in the BGP table
  "Feasible routes" in Adj-RIB-In
  Are used if the original path is withdrawn



BGP Routing Information Bases
Adj-RIB-In: inbound updates are stored here
Input Policy Engine: filter routes according to policy applied on attributes
BGP Decision Process: choose preferred route according to attributes
Local RIB: "best" paths to destinations plus attributes
IP Routing Table: "best" routes to destinations
Output Policy Engine: filter routes according to policy before sending
with update message
Adj-RIB-Out: outbound updates are stored here
The Adj-RIB-In also maintains feasible routes, whereas only the best route is kept
in the Local RIB. In case of a withdraw message for this single best route, the
best feasible route becomes active.
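The fallback to a feasible route can be sketched with a dictionary standing in for the Adj-RIB-In. This is a deliberately reduced model: the decision process here uses only the AS path length, whereas real BGP evaluates several attributes, and the peer names, AS numbers, and the function `decide` are hypothetical.

```python
# Adj-RIB-In: routes as received, per peer. Values are AS paths.
adj_rib_in = {
    "peer_a": {"48.0.0.0/8": [2, 1]},
    "peer_b": {"48.0.0.0/8": [3, 2, 1]},
}


def decide(rib_in, prefix):
    """Pick the route for the Local RIB: shortest AS path wins (simplification)."""
    candidates = [
        (len(routes[prefix]), peer)
        for peer, routes in rib_in.items()
        if prefix in routes
    ]
    if not candidates:
        return None  # prefix is unreachable
    _, best_peer = min(candidates)
    return best_peer, rib_in[best_peer][prefix]


print(decide(adj_rib_in, "48.0.0.0/8"))   # best path via peer_a

# peer_a withdraws the route: the remaining feasible route becomes active.
del adj_rib_in["peer_a"]["48.0.0.0/8"]
print(decide(adj_rib_in, "48.0.0.0/8"))
```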

Quiz
How many routes are maintained by BGP today?
How many AS-numbers have been defined already?
How long is the typical BGP convergence time?

BGP
Message Structure

BGP Header Format
FYI
Marker (16 Bytes)
  Optionally contains authentication (e.g. MD5)
  Used to detect loss of synchronization
Length (2 Bytes)
Type (1 Byte)
The smallest BGP message is 19 Bytes (no data field, e.g. keepalive)
The maximum length is 4,096 Bytes (also including header)
This is the basic BGP header format. This and the following slides marked with an
"FYI" at the upper left of the slide are only given "for your interest". It is usually
not necessary to know these details by heart unless you plan to go deeper into BGP.
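The fixed header maps directly onto a 19-byte structure, which the following sketch packs and unpacks with Python's standard `struct` module. The message type codes and the all-ones marker for unauthenticated sessions are taken from RFC 1771 and do not appear on the slide; the function names are illustrative.

```python
import struct

# Marker (16 bytes), Length (2 bytes), Type (1 byte), network byte order.
HEADER = struct.Struct("!16sHB")

# Type codes as defined in RFC 1771 (assumption: not shown on the slide).
MESSAGE_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}


def build_keepalive():
    """A keepalive is a header without a data field, so its length is 19."""
    return HEADER.pack(b"\xff" * 16, HEADER.size, 4)


def parse_header(data):
    """Return (marker, length, type name) after basic length checks."""
    if len(data) < HEADER.size:
        raise ValueError("truncated BGP header")
    marker, length, message_type = HEADER.unpack_from(data)
    if not HEADER.size <= length <= 4096:
        raise ValueError("invalid BGP message length")
    return marker, length, MESSAGE_TYPES.get(message_type, "UNKNOWN")


message = build_keepalive()
print(len(message), parse_header(message)[1:])
```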

Open Message
FYI
Version: 4 for BGP-4
My Autonomous System
Hold Time: maximum amount of time in seconds between keepalive
messages. Peers choose the smaller one. Hold time 0 means:
connection is always up!
BGP Identifier: highest IP address of router (including loopback)
Optional Param. Len.: can be 0
Optional Parameters: each parameter is a triplet {type, length, value}.
Example: authentication parameter

Notification Message
FYI
Fields: Error, Error Subcode, Data
Notification is always sent when an error is detected.
After that, the connection is closed.

Keepalive Message
Consists of header only (19 bytes)
Must be sent before hold time expires
Recommended keepalive rate = 1/3 of hold time
Not necessary if update message is sent
Keepalive messages are sent periodically, by default at 60-second intervals.
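The timer rules can be written down in a few lines. The sketch below assumes that both peers propose a hold time in their open messages and that the smaller value is used; the hold time of 180 seconds in the example is an assumption chosen because one third of it gives the 60-second default mentioned above.

```python
def negotiate_hold_time(local_hold_time, remote_hold_time):
    """Peers use the smaller of the two proposed hold times (in seconds)."""
    return min(local_hold_time, remote_hold_time)


def keepalive_interval(hold_time):
    """Recommended keepalive interval: one third of the hold time."""
    if hold_time == 0:
        return None  # hold time 0: connection is always up, no keepalives
    return hold_time / 3


hold_time = negotiate_hold_time(180, 180)
print(hold_time, keepalive_interval(hold_time))
print(keepalive_interval(negotiate_hold_time(180, 90)))
```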

The Update Message
1. Dead routes
   Unfeasible Routes Length (2 Bytes)
   Withdrawn Routes (variable)
2. Attributes
   Total Path Attribute Length (2 Bytes)
   Path Attributes (variable)
3. Associated routes
   NLRI (variable)
   All NLRIs that apply for the attributes mentioned above!
   NLRIs with different attribute list are sent in another message!
The picture above shows the most important message within BGP: the update.
Note that the update message consists of three parts. The first part contains all
unfeasible routes (also known as "withdrawn" routes). The second part contains a
consistent set of attributes for the regular routes listed in the third part
of the message.
Note that another update message has to be sent if a route (NLRI) should be
advertised with a different set of attributes.

Withdrawn Routes
FYI
...How destinations are specified within an update
Length (1 Byte)
  Length in bits of the IP address prefix.
  A length of zero indicates a prefix that matches all IP addresses.
Prefix (Variable)
  Padded for byte-alignment (padding bits irrelevant)
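The length/prefix encoding can be sketched as follows. Only as many prefix bytes are sent as the length in bits requires, and padding bits are ignored on decoding. The function names are hypothetical and the sketch covers IPv4 only.

```python
import ipaddress


def encode_prefix(prefix):
    """Encode a prefix as one length byte plus the minimum number of prefix bytes."""
    network = ipaddress.ip_network(prefix)
    byte_count = (network.prefixlen + 7) // 8
    return bytes([network.prefixlen]) + network.network_address.packed[:byte_count]


def decode_prefix(data):
    """Decode one prefix; return (prefix string, number of bytes consumed)."""
    length = data[0]
    byte_count = (length + 7) // 8
    padded = data[1:1 + byte_count] + bytes(4 - byte_count)
    address = int.from_bytes(padded, "big")
    # strict=False masks out the irrelevant padding bits.
    network = ipaddress.ip_network((address, length), strict=False)
    return str(network), 1 + byte_count


print(encode_prefix("194.20.0.0/19").hex())
print(encode_prefix("0.0.0.0/0").hex())      # matches all IP addresses
print(decode_prefix(bytes([19, 194, 20, 31])))
```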

BGP
Attributes

Attribute Types
Well-known
  Mandatory: ORIGIN (1), AS_PATH (2), NEXT_HOP (3)
  Discretionary: LOCAL_PREFERENCE (5), ATOMIC_AGGREGATE (6)
Optional
  Non-Transitive: MULTI_EXIT_DISC (4), ORIGINATOR_ID (9), CLUSTER_LIST (10)
  Transitive: AGGREGATOR (7), COMMUNITY (8)
    Complete or Partial (consistency)
Each BGP update consists of one or more IP subnets and a set of attributes
attached to them. Some of the attributes are required to be recognized by all BGP
implementations. Those attributes are called well-known BGP attributes.
Attributes that are not well known are called optional. These could be attributes
specified in a later extension of the BGP protocol or even private vendor
extensions not documented in a standard document.

Path Attributes
Each attribute consists of the triplet {Type, Length, Value}
Attribute Type (2 Bytes) = Attribute Flags + Attribute Type Code
Attribute Length (1 or 2 Bytes)
Attribute Value (variable)
Attribute Flags
  Well-known (0) / Optional (1)
  Non-transitive (0) / Transitive (1)
  Complete (0) / Partial (1)
  1 Byte Attribute Length (0) / 2 Byte Attribute Length (1)
  Remaining bits unused
Each attribute is a so-called TLV: Type, Length, Value.
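A decoder for one such TLV might look like the sketch below. The bit positions of the four flags (0x80, 0x40, 0x20, 0x10, counted from the high-order bit) follow RFC 1771 and are an assumption with respect to the slide, which does not give them in a recoverable form; the function names are illustrative.

```python
def decode_flags(flags):
    """Split the Attribute Flags byte into its four defined bits."""
    return {
        "optional": bool(flags & 0x80),         # 0 = well-known
        "transitive": bool(flags & 0x40),       # 0 = non-transitive
        "partial": bool(flags & 0x20),          # 0 = complete
        "extended_length": bool(flags & 0x10),  # 0 = 1 byte length, 1 = 2 bytes
    }


def parse_attribute(data):
    """Parse one path attribute; return (flags, type code, value, bytes consumed)."""
    flags, type_code = data[0], data[1]
    if flags & 0x10:
        length = int.from_bytes(data[2:4], "big")
        start = 4
    else:
        length = data[2]
        start = 3
    value = data[start:start + length]
    return decode_flags(flags), type_code, value, start + length


# ORIGIN (type code 1): well-known, transitive, value 0 = IGP.
origin = bytes([0x40, 1, 1, 0])
print(parse_attribute(origin))
```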

Well-known Mandatory
AS_Path contains all ASs traversed for this route
Next_Hop indicates the last EBGP router leading to this route
  Not necessarily the physical next hop
Origin indicates how this route was learned
There is a small set of three specific well-known attributes that are required to be
present on every update. These three are the AS-path, next-hop and origin
attributes. They are referred to as well-known mandatory attributes.
Other well-known attributes may or may not be present depending on the
circumstances under which the updates are sent and the desired routing policy.
The well-known attributes that could be present, but are not required to be
present, are called well-known discretionary attributes.

Path Vector Protocol (1)
[Figure: routers R1-R4 in AS 1, AS 2, and AS 3. The networks 48.0.0.0/8 and
49.0.0.0/8 are advertised with AS_Path=(AS1) and Next_Hop=R1]

Path Vector Protocol (2)
[Figure: same topology (R1 to R4 across AS 1, AS 2, and AS 3). Networks
48.0.0.0/8 and 49.0.0.0/8, received with AS_Path=(AS1), Next_Hop=R1, are
redistributed into the IGP (e.g. OSPF, LSA-5). Note: Next Hop is still R1!]

Path Vector Protocol (3)
[Figure: same topology (R1 to R4 across AS 1, AS 2, and AS 3). Networks
48.0.0.0/8 and 49.0.0.0/8 are advertised onward with AS_Path=(AS2, AS1),
Next_Hop=R3.]

ORIGIN
Well-known Mandatory, type code 1
Value 0: IGP
  Routes learned via network statement (NLRI is member of originating AS)
Value 1: EGP
  Learned via redistribution from EGP to BGP
Value 2: INCOMPLETE
  Learned via redistribution from IGP to BGP
  Example: redistribute static (Cisco)
The origin attribute is set when the route is first injected into BGP. If
information about an IP subnet is injected using the network command or via
aggregation (route summarization within BGP), the origin attribute is set to IGP.
If the IP subnet is injected using redistribution, the origin attribute is set to
unknown or incomplete (these two words have the same meaning). The origin
code EGP was used when the Internet was migrating from EGP to BGP and is
now obsolete.

AS_PATH
Well-known Mandatory, type code 2
Composed of a sequence of AS path segments
An AS path segment is represented by a triple
  Path segment type (1 byte)
    1 = AS_Set (unordered set of ASs)
    2 = AS_Sequence (ordered set of ASs)
  Path segment length (1 byte)
  Path segment value (variable, 2 bytes per AS)
The AS-path attribute is modified each time the information about a particular IP
subnet passes over an AS border. When the route is first injected into BGP the
AS-path is empty.
Each time the route crosses an AS boundary the transmitting AS prepends its own
AS number so that it appears first in the AS-path. The sequence of ASes through
which the route has passed can therefore be tracked using the AS-path attribute.
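The prepending rule and the segment triple can be sketched in a few lines. The function names are hypothetical, and the encoder only covers a single AS_Sequence segment with 2-byte AS numbers as described on the slide.

```python
import struct

AS_SET = 1
AS_SEQUENCE = 2


def prepend_own_as(as_path: list, own_as: int) -> list:
    """Return the AS path as it is sent across an AS boundary."""
    return [own_as] + list(as_path)


def encode_as_sequence(as_path: list) -> bytes:
    """Encode one AS_Sequence segment: type, length (number of ASs), values."""
    header = struct.pack("!BB", AS_SEQUENCE, len(as_path))
    return header + b"".join(struct.pack("!H", asn) for asn in as_path)


path = []                       # empty when the route is first injected
path = prepend_own_as(path, 1)  # AS 1 sends the route to AS 2
path = prepend_own_as(path, 2)  # AS 2 sends the route to AS 3
encoded = encode_as_sequence(path)
```

After the second hop the path reads (AS2, AS1), matching the Path Vector Protocol figures earlier in this chapter.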

Who is NEXT_HOP?
Well-known Mandatory, type code 3
The boundary router that advertised the route in this AS is the next hop
  Recursive routing table lookup might be necessary to determine the true
  physical next hop
Exception:
  On multi-access media (Ethernet, FDDI) the physical next hop must always
  be indicated
[Figure: AS 1 and AS 2 with routers R1, R2, and R3; Net 30 is advertised
"via R3". R1 and R2 have a BGP session established, R3 speaks IGP only.
R2 advertises R3 as next hop to Net 30 because R3 is on the same physical media.]
The next-hop attribute is also modified as the route passes through the network. It
is used to indicate the IP address of the next-hop router, that is, the router to
which the receiving router should forward IP packets toward the destination
advertised in the routing update.

MULTI_EXIT_DISC
Optional Non-transitive, type code 4
To discriminate multiple exit or entry points
Must not be forwarded to other neighbor AS
[Figure: AS 7 and AS 8 connected by two links; Net 11 is advertised with
MED 50 over one link and with MED 100 over the other.]
One of the non-transitive optional attributes is the Multi-Exit-Discriminator
(MED) attribute, which is also used in the route selection process. Whenever there
are several links between two adjacent ASes, the multi-exit-discriminator may be
used by one AS to tell the other AS to prefer one of the links over the other for
specific destinations.

LOCAL_PREF
Well-known Discretionary, type code 5
Routers prefer the route with the highest local preference
Only attached to locally originated routes and those received from external
neighbors (default value: 100)
Local Preference is sent with IBGP updates only (not to external routers)
[Figure: AS 7, AS 8, and AS 9. Net 88 is received with Local Pref. 200 on one
path and Local Pref. 100 on the other; the IBGP update carries "Net 88: LP=200".]
Local Preference is used in the route selection process. The attribute is carried
within an AS only. A route with a high local preference is preferred over a route
with a low value. By default, routes received from a peer AS are tagged with the
local preference set to the value 100 before they are entered into the local AS. If
this value is changed through BGP configuration, the BGP selection process is
influenced. Since all routers within the AS get the attribute along with the route, a
consistent routing decision is made throughout the AS.

ATOMIC_AGGREGATE
Well-known Discretionary, type code 6
Optionally the Atomic_Aggregate attribute indicates that some BGP router made
an AS aggregation
  When selecting the less specific route on overlapping routes (rejecting the
  more specific route)
Length 0
The Atomic Aggregate attribute is attached to a route that is created as a result of
route summarization (called aggregation in BGP). It signals that information that
was present in the original routing updates may have been lost when the updates
were summarized into a single entry.

AGGREGATOR
Optional Transitive, type code 7
Contains the AS number and IP address of the BGP speaker that formed the
aggregate route
Useful for troubleshooting
Aggregator identifies the AS and the router within that AS that created a route
summarization (aggregate).

COMMUNITY
Optional Transitive, type code 8
Group of destinations that share a common policy
  Each destination could be member of multiple communities
  Carried across ASs
Community strings are simple policy labels
  Any BGP router can tag routes in incoming and outgoing routing updates or
  when doing redistribution
  Any BGP router can filter routes in incoming or outgoing updates or select
  preferred routes based on communities
A Community is a numerical value that can be attached to certain routes as they
pass a specific point in the network. The community value can then be checked at
other points in the network for filtering or route selection purposes. BGP
configuration may cause routes with a specific community value to be treated
differently than others.

Community Example (1)
Assume AS 100 wants AS 300 to use the 155 Mbit/s link to reach its own networks
  MED: not possible (non-transitive)
  Local Preference: will the admin of AS 300 set it?
Best and easiest: Use community!
[Figure: AS 100, AS 200, and AS 300 connected by a 155 Mbit/s link and a
64 kbit/s link; the default traffic flow and the desired traffic flow are marked.]
The picture above gives an example where the community could be implemented.

Community Example (2)
Receiving a community string means "apply the predefined policy"
In our example 300:67 means: "set local preference to 50"
[Figure: AS 100, AS 200, and AS 300 connected by a 155 Mbit/s link and a
64 kbit/s link; the update "NLRIs, 300:67" is sent, and the default and desired
traffic flows are marked.]
The picture above gives an example where the community could be implemented
(continued from previous slide).

Defining Communities
More than one BGP community per route allowed
  By default, communities are stripped in outgoing BGP updates
Private range: 0x00010000 - 0xFFFEFFFF
Common practice
  High order 16 bit: AS number
  Low order 16 bit: Local significance



Well-known Communities
Reserved ranges: 0x00000000 - 0x0000FFFF and 0xFFFF0000 - 0xFFFFFFFF
0xFFFFFF01 means: NO_EXPORT
  Routes received carrying this value should not be advertised to EBGP peers,
  except ASs of a confederation
0xFFFFFF02 means: NO_ADVERTISE
  Routes received carrying this value should not be advertised at all (both
  IBGP and EBGP peers)
0xFFFFFF03 means: NO_EXPORT_SUBCONFED
  Routes received carrying this value should not be advertised to EBGP peers,
  including members of a confederation (Cisco: LOCAL_AS)
Easy to memorize: Values of all-zeroes and all-ones in the high-order 16 bits are
reserved.
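The "AS number in the high 16 bits, local value in the low 16 bits" convention and the reserved ranges can be made concrete with a short sketch. The helper names are hypothetical; the constants are the three well-known values listed on the slide.

```python
NO_EXPORT = 0xFFFFFF01
NO_ADVERTISE = 0xFFFFFF02
NO_EXPORT_SUBCONFED = 0xFFFFFF03


def encode_community(asn: int, value: int) -> int:
    """Pack 'asn:value' into the 32-bit community number."""
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError("both parts must fit into 16 bits")
    return (asn << 16) | value


def decode_community(community: int) -> tuple:
    """Split a 32-bit community into (high 16 bits, low 16 bits)."""
    return community >> 16, community & 0xFFFF


def is_reserved(community: int) -> bool:
    """True if the high-order 16 bits are all zeroes or all ones."""
    return (community >> 16) in (0x0000, 0xFFFF)


c = encode_community(300, 67)  # the 300:67 community from the example slides
```

Writing communities as `AS:value` is therefore only a display convention for one 32-bit number.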

Administrative Weight (Cisco)
No attribute, just a local parameter
Applies only to routes within an individual router
Number between 0 and 65535
  The higher the weight the more preferable the route
Initially invented to translate public routing policies (EGP)
Note that the Administrative Weight is a Cisco-specific parameter.

Decision Hierarchy
1. Prefer highest weight (Cisco)
2. Prefer highest local preference
3. Prefer locally originated routes
4. Prefer shortest AS-Path
5. Prefer lowest origin code
6. Prefer lowest MED
7. Prefer EBGP path over IBGP path
8. Lowest IGP metric to next hop
9. Prefer oldest route for EBGP paths
10. Prefer path with lowest neighbor BGP router ID
If routes have the same local preference, the route that was locally originated is
preferred. As a last resort, the BGP router ID is used as the tie-breaker.
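The ten steps amount to a lexicographic comparison, which can be sketched as a sort key. This is a simplified illustration, not a vendor implementation: the `Route` fields are hypothetical names, MED is compared regardless of the neighboring AS, and route age is applied to all paths rather than to EBGP paths only.

```python
from dataclasses import dataclass


@dataclass
class Route:
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: tuple = ()
    origin: int = 0                  # 0 = IGP, 1 = EGP, 2 = INCOMPLETE
    med: int = 0
    ebgp: bool = True
    igp_metric: int = 0
    age_seconds: int = 0
    neighbor_router_id: tuple = (0, 0, 0, 0)


def preference_key(r: Route) -> tuple:
    """Smaller key = more preferred; tuple order follows steps 1 to 10."""
    return (
        -r.weight,                 # 1. highest weight
        -r.local_pref,             # 2. highest local preference
        not r.locally_originated,  # 3. locally originated first
        len(r.as_path),            # 4. shortest AS path
        r.origin,                  # 5. lowest origin code
        r.med,                     # 6. lowest MED
        not r.ebgp,                # 7. EBGP over IBGP
        r.igp_metric,              # 8. lowest IGP metric to next hop
        -r.age_seconds,            # 9. oldest route
        r.neighbor_router_id,      # 10. lowest neighbor router ID
    )


def best_route(routes: list) -> Route:
    return min(routes, key=preference_key)


long_but_preferred = Route(local_pref=200, as_path=(2, 1, 5))
short_path = Route(local_pref=100, as_path=(3,))
winner = best_route([short_path, long_but_preferred])
```

Because local preference is compared before AS-path length, the route with the longer path wins in this example.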

1
2005/03/11 {C} Herbert Haas
BGP
Internal and External BGP

EBGP and IBGP
[Figure: an autonomous system whose border routers have EBGP sessions to
neighboring ASs and IBGP sessions among themselves.]
Interior BGP or "IBGP" allows edge routers to share NLRI and associated
attributes, in order to enforce an AS-wide routing policy.
IBGP is responsible for assuring connectivity to the "outside world", i.e. to other
autonomous systems. That is, all packets that enter this AS and are not blocked
by policies should reach the proper exit BGP router. All transit routers inside the
autonomous system should have a consistent view of the routing topology.
Furthermore, IBGP routers must assure "synchronization" with the IGP, because
packets cannot be forwarded continuously if the IGP routers have no idea about
the route. Thus, IBGP routers must wait for the IGP convergence time inside the
AS. Obviously this aspect assumes that BGP routes are injected into transit IGP
routers by redistribution. Synchronization is explained a few slides later.

Internal and External BGP
EBGP messages are exchanged between peers of different ASs
  EBGP peers should be directly connected
Inside an AS this information is forwarded via IBGP to the next BGP router
  IBGP messages have the same structure as EBGP messages
Administrative Distance
  IBGP: 200
  EBGP: 20 (preferred over all IGPs)
Some vendors including Cisco also allow EBGP peers to be logically linked over
other hops in between. This "Multi-Hop" feature might introduce BGP
inconsistency and weakens reliability because the BGP TCP sessions cross other
routers, so in practice direct peering should be used.
Routing information learned by IBGP messages has a much higher administrative
distance than information learned by EBGP. Because of this, routes that do not
cross the local autonomous system are preferred.

Loop Detection
Update is only forwarded if own AS number is not already contained in AS_Path
Thus, routing loops are avoided easily
But this principle doesn't work with IBGP updates (!)
Therefore IBGP speaking routers must be fully meshed !!!
For EBGP sessions a loop-free topology is guaranteed by checking the AS-Path,
but this is not the case for IBGP sessions.
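The AS_Path check itself is a one-line membership test, sketched here with a hypothetical function name. The second call illustrates why the same check is useless inside an AS: the AS path is not modified on IBGP sessions, so the own AS number never shows up in it.

```python
def accept_update(as_path: list, own_as: int) -> bool:
    """Accept an update only if our own AS is not already in the AS path."""
    return own_as not in as_path


# AS 2 receives its own route back via AS 3: rejected, loop detected
looped = accept_update([3, 2, 1], own_as=2)

# Inside AS 2 the path is passed unchanged between IBGP peers, so the
# check always succeeds and cannot detect a loop among IBGP speakers
inside_as = accept_update([1], own_as=2)
```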

BGP IGP Redistribution
Only routes learned via EBGP are redistributed into IGP
  To assure optimal load distribution
  Cisco-IOS default filter behavior
[Figure: a border router learns Net X via EBGP, passes it on via IBGP, and
redistributes it into the IGP; "IGP: Net X" updates then spread to all IGP
routers inside the AS.]
Routes learned via IBGP are never redistributed into IGP. This is the Cisco IOS
"default filter" behavior. Obviously, if a router learned a route via IBGP, it is not
an external (direct) peer for this route.

Synchronization With IGP
Routes learned via IBGP may only be propagated via EBGP if the same
information has also been learned via IGP
  That is, the same routes are also found in the routing table (= are really
  reachable)
Without this "IGP-Synchronization" black holes might occur
[Figure: steps 1 to 6 of Net X propagating through an AS: received via EBGP,
passed on via IBGP between the border routers, and forwarded hop by hop via IGP
before it is advertised via EBGP on the other side.]
When a BGP router learns about an exterior network via an IBGP session, this
router neither enters this route into its routing table nor propagates it via
EBGP, because the IGP transit routers might not be aware of this route and
therefore convergence has not occurred yet. The BGP router should not
propagate the learned route until this route has been entered into its routing table
by IGP.
To understand this issue remember that BGP routing information is transported
almost instantaneously between two BGP peers, while IGP updates might need
quite a long time to reach the other side of the AS. As illustrated in step 2
in the picture above, the IBGP message has already been received by the BGP peer
on the right border, while the first IGP update (advertising the same network X)
was injected by the left BGP peer and has only reached the next IGP router at this
time.
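The synchronization rule can be written as a small predicate. This is a sketch under stated assumptions: the function name, the set of IGP-learned prefixes, and the `synchronization` switch are hypothetical stand-ins for router state.

```python
def may_advertise_via_ebgp(prefix: str,
                           learned_via_ibgp: bool,
                           igp_routes: set,
                           synchronization: bool = True) -> bool:
    """Decide whether a route may be propagated to EBGP peers."""
    if not learned_via_ibgp or not synchronization:
        return True
    # With synchronization on, wait until the IGP also knows the prefix
    return prefix in igp_routes


# Step 2 of the figure: IBGP has delivered Net X, the IGP has not yet
before = may_advertise_via_ebgp("48.0.0.0/8", True, igp_routes=set())

# Later: the IGP update has arrived at the border router
after = may_advertise_via_ebgp("48.0.0.0/8", True, igp_routes={"48.0.0.0/8"})
```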

Avoid Synchronization
Synchronization with IGP means injecting thousands of routes into IGP
  IGP might get overloaded
  Synchronization dramatically affects BGP's convergence time
Alternatives
  Set default routes leading to BGP routers (might lead to suboptimal routing)
  Use only BGP routers inside the AS!
But then, these routers must be fully meshed?
Synchronization is an old idea and leads to unwanted effects. First of all, most
IGPs are not designed to carry the huge number of routes needed in the Internet.
Thus IGPs might get overloaded when tens of thousands of external routes have
to be propagated in addition to the interior routes.
Furthermore, external routes are not needed inside an AS, and typically a default
route pointing to a BGP border router is sufficient (however, this might lead to
suboptimal routes as the default route might not be the best route). And finally,
the consistency of the global BGP routing map would depend on the convergence
of several (lots of) IGP routers, a situation that should be avoided!
Note that BGP injection into IGP and the required BGP synchronization are not
necessary if the AS is a transit AS only, such as many ISP networks. ISP
networks typically have BGP routers only and thus need no synchronization.
Fortunately many routers today (including Cisco routers) support the option to
turn off synchronization.

Fully Meshed IBGP Routers
Does not scale
  n(n-1)/2 links
Resource and configuration challenge
Solutions:
  Route Reflectors
  Confederations
Note: These are logical IBGP connections! The physical topology might look
different!
Every BGP router maintains IBGP sessions with all other internal BGP routers of
an AS. Obviously, this fully meshed approach does not scale. In particular, it
becomes a resource and manageability problem if the number of BGP sessions in
one router exceeds 100.
Remember that each BGP session corresponds to a TCP connection, which
requires a lot of system resources. Additionally, BGP sessions must be manually
established, so a fully meshed environment is also a configuration problem. This
is also the reason why BGP cannot replace traditional IGPs in "normal"
autonomous systems. ISPs demand fast BGP convergence and do not need an
IGP in general.
Generally, there are two solutions to circumvent this problem: Route Reflectors
and Confederations. Both techniques are discussed in the next slides.

Route Reflector
RR mirrors BGP messages for "clients"
RR and clients belong to a "cluster"
Only RR must be configured
  Clients are not aware of the RR
[Figure: one route reflector (RR) with five clients; each client has an IBGP
session to the RR, and EBGP sessions lead out of the AS.]
Note: Although these are logical IBGP connections, the physical topology should
be the main indicator for an efficient cluster design (which router becomes RR)
Route reflectors are dedicated BGP routers that act like a mirror for IBGP
messages. All BGP routers that peer with an RR are called "clients" and belong to
a "cluster". Clients are normal BGP routers and have no special configuration;
they have no awareness of an RR.
Using RRs there are only n-1 links.
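The difference between the two session counts is easy to see numerically. A minimal sketch, assuming a single cluster of n routers in which one router acts as the RR; the function names are illustrative only.

```python
def full_mesh_sessions(n: int) -> int:
    """IBGP sessions needed when all n routers peer with each other."""
    return n * (n - 1) // 2


def route_reflector_sessions(n: int) -> int:
    """IBGP sessions when one of the n routers is the RR for all others."""
    return n - 1


for n in (5, 20, 100):
    print(n, full_mesh_sessions(n), route_reflector_sessions(n))
```

The full mesh grows quadratically with the number of routers, while the single-RR design grows linearly.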

RR Clusters
Only RRs are fully meshed
Special attributes care for loop avoidance
"Non-clients" must be fully meshed with RRs
  And with other non-clients
[Figure: Cluster 1, Cluster 2, and Cluster 3, each with its own RR, plus one
non-client router.]
Clients are considered as such because the RR lists them as clients.

RR Issues
RRs do not change IBGP behavior or attributes
RRs only propagate best routes
Special attributes to avoid routing updates reentering the cluster (routing loops)
  ORIGINATOR_ID
    Contains router-id of the route's originator in the local AS; attached by RR
    (Optional, Non-Transitive)
  CLUSTER_LIST
    Sequence of cluster-ids; RR appends own cluster-id when route is sent to
    non-clients outside the cluster (Optional, Non-Transitive)
It is important to know that RRs preserve IBGP attributes. Even the NEXT_HOP
remains the same, otherwise routing loops might occur. Imagine two clusters
whose RRs are logically interconnected via IBGP but physically via clients. If
one of these RRs learns about an NLRI from the other RR, this RR would reflect
that information to its clients, including the client that forwarded this NLRI
information to this RR.
Obviously the NEXT_HOP attribute must remain the same, that is, pointing to the
RR of the other cluster and not to the local RR, because there is no physical
connection between the RRs.
If an RR learns the same NLRI from multiple client peers, only one path will be
propagated to other peers. Therefore, when RRs are used, the number of paths
available to reach a given destination might be lower than with a fully meshed
approach. Thus, suboptimal routing can only be avoided if the logical topology
maps the physical topology as closely as possible.
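How the two attributes stop a reflected update from looping can be sketched as follows. This is an illustration under assumptions, not a router implementation: routes are plain dictionaries, and the names `reflect`, `sender_router_id`, and the dictionary keys are hypothetical. The position of the cluster-id in the list does not matter for the membership check used here.

```python
from typing import Optional


def reflect(route: dict, rr_router_id: str, rr_cluster_id: str,
            sender_router_id: str) -> Optional[dict]:
    """Return the route as the RR would reflect it, or None if it is dropped."""
    cluster_list = list(route.get("cluster_list", []))
    if rr_cluster_id in cluster_list:
        return None  # update has already passed through this cluster
    if route.get("originator_id") == rr_router_id:
        return None  # update originated at this router
    reflected = dict(route)  # NEXT_HOP and all other attributes stay unchanged
    reflected.setdefault("originator_id", sender_router_id)
    reflected["cluster_list"] = cluster_list + [rr_cluster_id]
    return reflected


update = {"nlri": "48.0.0.0/8", "next_hop": "10.0.0.1"}
first_pass = reflect(update, "1.1.1.1", "cluster-1", sender_router_id="2.2.2.2")
second_pass = reflect(first_pass, "1.1.1.1", "cluster-1", sender_router_id="3.3.3.3")
```

The first pass adds both attributes; when the same update returns to the cluster, the second pass drops it.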

Redundant RRs
RR is single point of failure
  Other than fully meshed approach
Redundant RRs can be configured
Clients attached to several RRs
[Figure: Cluster 1 and Cluster 2, each with two RRs.]
Clients are considered as such because the RR lists them as clients.

Confederations
Alternative to route reflectors
Idea: AS can be broken into multiple sub-ASs
Loop-avoidance based on AS_Path
All BGP routers inside a sub-AS must be fully meshed
EBGP is used between sub-ASs
Sub-ASs invisible from outside !!! (Private AS numbers are removed from AS_PATH)
[Figure: Confederation 200 (AS 200) split into sub-ASs AS 65070 and AS 65080;
IBGP inside each sub-AS, EBGP between the sub-ASs and to the outside.]
Sub-ASs should utilize the private range of AS numbers (64512-65534).

RRs versus Confederations
RRs are more popular
  Simple migration (only RRs need to be configured accordingly)
  Best scalability
Confederations drawbacks
  Introducing confederations requires complete AS renumbering inside an AS
    Major change in logical topology
    Suboptimal routing (Sub-ASs do not influence external AS_PATH length)
Confederations benefits
  Can be used with RRs
  Policies could be applied to route traffic between sub-ASs


1
2005/03/11 {C} Herbert Haas
IP Multicast
Compendium
Table of Contents:
Introduction
Realtime Protocols
Multicast Addresses
IGMP
Layer 2 Multicast
Session Information
Multicast Routing Basics
Multicast Routing Protocols
- DVMRP
- MOSPF
- CBT
- PIM-DM
- PIM-SM
- Interdomain Multicast: MBGP and MSDP
Reliable Multicast

Introduction

New IP Applications
Corporate Broadcasts
Distance Learning/Training
Video Conferencing
Whiteboard/Collaboration
Multicast File Transfer
Multicast Data and File Replication
Real-Time Data Delivery for Financial Applications
Video-On-Demand
Live TV and Radio Broadcast to the Desktop
Multicast Games
Real-time applications include games, live broadcasts, financial data delivery,
whiteboard collaboration, and video conferencing. Non-real-time applications
include file transfer, data and file replication, and video on demand (VoD).

Multicast Models
One-to-many
  One host is multicast source, other hosts are receivers
  Simplest and most important type
  Might only be jitter sensitive (voice/video)
Many-to-many
  Hosts are both senders and receivers
  All hosts are in same multicast group
  Might be delay sensitive (bidirectional communication forbids more than
  0.5 sec delays)
Flexible variants
  Many-to-one (implosion problem!)
The many-to-many multicast concept supports several new applications such
as collaboration systems, concurrent processing, and distributed interactive
simulations.
Other models include the many-to-one model, where many receivers may send
data back to one sender (similar to few-to-many). These models are typically
used in financial applications and networks. Consider auctions, for example. Here
any number of receivers might send data back to a source (via unicast or
multicast). Note the "implosion problem": a response storm might occur
when responses arrive simultaneously.
Modern solutions to these problems involve bidirectional trees and other
mechanisms. For example, responses could be sent "out-of-band". However,
most implementations require modifications of the applications.

Unicast vs. Multicast
Perfect bandwidth utilization for "simulcasts" required:
Audio streaming, Video streaming, Conferencing, Data Distribution
[Figure: Source S sends over a Multicast Distribution Tree (MDT) to Group G;
unicast addresses might be unknown. Minimize load!]
If several (if not thousands of) users should receive a certain service, then there
are two choices of implementation: either sending multiple unicast packets or a
single multicast packet. The latter solution requires a specially configured
network which supports forwarding of multicast packets. The forwarding
structure is called a Multicast Distribution Tree (MDT). Only an MDT allows the
simultaneous delivery of data to multiple receivers (simulcast).
Note that sending multiple unicast packets might significantly impact some
local links, especially those close to the source. Using IP multicast just
requires the source to send one packet at a time. We denote the sender as
"Source S" and the receivers as "Group G". Both S and G are identified by IP
addresses. Multicast packets use a class D destination address, which
corresponds to the group G.
Typical applications for IP multicast: audio and video streaming,
conferencing, and other traffic distribution applications ("warehousing").
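Two points from the notes can be checked with a few lines of Python: whether a destination is a class D (group) address, and how many copies of each packet the source has to send. The function names are hypothetical; `ipaddress` is the Python standard library module, and 224.0.0.0/4 is the class D range.

```python
import ipaddress

CLASS_D = ipaddress.ip_network("224.0.0.0/4")


def is_group_address(addr: str) -> bool:
    """True if addr is a class D (multicast group) address."""
    return ipaddress.ip_address(addr) in CLASS_D


def copies_sent_by_source(receivers: int, multicast: bool) -> int:
    """Packets the source emits per data unit: one per receiver with
    unicast, a single packet with multicast."""
    return 1 if multicast else receivers


group_ok = is_group_address("239.1.2.3")
unicast_load = copies_sent_by_source(1000, multicast=False)
multicast_load = copies_sent_by_source(1000, multicast=True)
```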

Facts
Developed in the late 1980s
  First used 1992 during IETF Conference
Building block for QoS
  RSVP and RTP
UDP based
  No Congestion Avoidance!
  Packet drops occur!
Classification based on distribution trees
  Shortest Path Trees
  Shared Trees
IP Multicast routing was developed in the late 1980s and has had a great
impact on QoS research in the Internet. RSVP and RTP serve as helper
protocols for IP Multicast, which is fully UDP based and therefore lacks
congestion avoidance and error recovery.
All multicast methods can be classified according to their type of distribution
tree. Either "Shortest Path Trees" (SPT) or "Shared Trees" are used. These
are explained next.
It might be interesting to know that the first notable use of IP multicast was
during the IETF conference in 1992, where the whole conference (video and
audio) was multicast.

How IP Multicast Works...
Sources don't care at all!
  Simply send multicast packets to the first-hop router
First-hop router
  Forwards multicast packets into the multicast tree
Intermediate routers
  Determine upstream interface (to first-hop router) and downstream
  interfaces (RPF check)
Last-hop routers
  Are leaves of this tree
  Receive user registrations via IGMP
  Communicate group membership to upstream routers
RFC 1112 defines "Host Extensions for IP Multicasting". The very basic
idea is that members join and leave multicast groups and the routers must
manage this!

The Mbone
World-wide multicast backbone
  Based on tunnels
  Playground for experiments
Rich Mbone toolset
  Session Directory (SDR)
  Visual Audio Tool (VAT)
  Robust Audio Tool (RAT)
  Video Conferencing Tool (VIC)
  Whiteboarding Tool (WB)
The Mbone has been developed since 1992 and has become a world-wide
overlay network with dedicated multicast routers at important nodes.
The Session Directory (SDR) tool allows multicast group members to view
advertised multicast sessions and launch appropriate multicast applications to
join an existing session. SDR is based on SD (Session Directory), but they are
not compatible, because SDR implements a later version of the Session
Description Protocol (SDP).
The Visual Audio Tool (VAT) supports audio conferencing and allows
multiple participants to share audio interactively. VAT is based on RTP.
VAT (and RAT) supports various codecs such as PCM, GSM, LPC4, etc.
The Video Conferencing tool (VIC) allows video conferencing among
multiple participants. VIC utilizes the H.261 video compression codec.
The Whiteboarding tool (WB) allows multiple participants to collaborate
interactively in a text and graphical environment. Documents may be either in
plain ASCII text or PostScript. WB relies on a reliable multicast protocol such
as Scalable Reliable Multicast (SRM).

MBone Map (2000)
The picture above shows a map of the Mbone as of the year 2000. Just look at
the basic structure of this network. Nodes and areas of higher and lower density
can be seen. DVMRP is used between them, which is explained later on.
Mrouted is an implementation of the Distance-Vector Multicast Routing
Protocol (DVMRP), an earlier version of which is specified in RFC 1075. In
order to support multicasting among subnets that are separated by (unicast)
routers that do not support IP multicasting, mrouted includes support for
"tunnels", which are virtual point-to-point links between pairs of mrouteds
located anywhere in the Internet.

Integrated Multicast
[Figure: layered multicast architecture. Real-time side: Audio (G.7xx) and
Video (H.261, MPEG, ...) with H.323, SIP, ... over RTP/RTCP. Reliable side:
Whiteboard, Data Distribution (or sync), ... over Reliable Transport (SRM,
MFTP, PGM, ...). Everything runs over UDP over IP. Alongside: MDT protocols
(DVMRP, MOSPF, CBT, PIM-DM, PIM-SM, ...).]
The diagram above shows the basic layer structure of a fully featured multicast
infrastructure.
All data is sent over UDP over IP, which reflects the inherently connectionless
nature of multicast communication.
The yellow area (left half) shows real-time applications which need a
presentation layer (codecs), a session layer (H.323, SIP, ...), and some
real-time transport protocols (RTP, RTCP).
The green area (right half) shows applications that demand reliable data
transmission and are typically non-realtime. Special protocols (SRM, MFTP,
PGM, ...) are needed that provide feedback on whether sent data has been
delivered or not.
But any multicast environment relies on protocols that establish and maintain a
Multicast Distribution Tree (MDT). Such protocols, often called multicast
routing protocols, are for example PIM, DVMRP, MOSPF, CBT, ..., which
are all explained soon. This important functionality is depicted by the red area
on the left. SRM stands for Scalable Reliable Multicast, but there are lots of
other protocols such as MFTP and PGM; all these are explained later.

Realtime Protocols

Audio and Video
Are typically transported by RTP/RTCP
Feedback mechanism very important
  For maintaining multicast distribution tree (MDT)
  For applications to switch codecs when bandwidth becomes scarce
Practically all modern realtime applications transport their data via RTP/RTCP.
RTCP provides the feedback mechanism which allows sources to react to
congestion problems.
If congestion occurs, most sources can change the codec. The congestion is
learned via RTCP.

Realtime Transmission
Real Time Transport Protocol (RTP)
  Connectionless environment
  Payload type identification and sequence numbering
  Time-stamping and delivery monitoring
RTP Control Protocol (RTCP)
  Provides feedback on current network conditions
  Helps with lip synchronization and QoS management, etc.
Packet layout: IP (20 Byte) | UDP (8 Byte) | RTP (12 Byte) | Payload (20-160 Bytes)
The Real Time Transport Protocol (RTP) provides fast UDP delivery plus payload
type identification and sequence numbers. Additionally a time stamp is used to
verify delivery delays.
The 16 bit sequence number increments by one for each RTP data packet sent, and may be
used by the receiver to detect packet loss and to restore packet sequence. The initial value of
the sequence number should be random (unpredictable) to make known-plaintext attacks on
encryption more difficult, even if the source itself does not encrypt, because the packets may
flow through a translator that does.
The 32 bit timestamp reflects the sampling instant of the first octet in the RTP data packet. The
sampling instant must be derived from a clock that increments monotonically and linearly in
time to allow synchronization and jitter calculations. The resolution of the clock must be
sufficient for the desired synchronization accuracy and for measuring packet arrival jitter (one
tick per video frame is typically not sufficient).
The RTP control protocol (RTCP) is based on the periodic transmission of
control packets to all participants in the session, using the same distribution
mechanism as the data packets. The underlying protocol must provide
multiplexing of the data and control packets, for example using separate port
numbers with UDP.
The primary function is to provide feedback on the quality of the data distribution. This is an integral part
of RTP's role as a transport protocol and is related to the flow and congestion control functions of other
transport protocols. The feedback may be directly useful for control of adaptive encodings. Second,
RTCP carries a persistent transport-level identifier for an RTP source called the canonical name or
CNAME. Third, if all participants send RTCP packets, the rate must be controlled in order for
RTP to scale up to a large number of participants. By having each participant send its control packets to all
the others, each can independently observe the number of participants. This number is used to calculate the
rate at which the packets are sent. A fourth, optional function is to convey minimal session control
information, for example participant identification to be displayed in the user interface.
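The 12-byte RTP header with its 16-bit sequence number and 32-bit timestamp can be decoded with the standard `struct` module. This is a minimal sketch that ignores header extensions and CSRC entries; the function names and the dictionary keys are hypothetical.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte RTP header at the start of packet."""
    if len(packet) < 12:
        raise ValueError("RTP header needs at least 12 bytes")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "marker": bool(b1 >> 7),
        "payload_type": b1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


def packets_lost_between(prev_seq: int, cur_seq: int) -> int:
    """Packets missing between two received sequence numbers,
    taking the 16-bit wrap-around into account."""
    return ((cur_seq - prev_seq) & 0xFFFF) - 1


header = struct.pack("!BBHII", 0x80, 0, 1000, 160, 0x12345678)
fields = parse_rtp_header(header + b"\x00" * 20)
gap = packets_lost_between(65535, 1)  # sequence number 0 never arrived
```

The masking with `0xFFFF` is what lets a receiver keep counting correctly when the sequence number wraps from 65535 back to 0.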

RTP Facts
RTP does NOT provide:
  Reliable packet delivery
  QoS
  Prevention of out-of-order delivery
RTP uses mixers
  Special relays to combine separate video streams into one video stream
  Also care for synchronization
  Optionally re-encode an original stream to meet link-specific bandwidth
  requirements
Note that the applications themselves must re-sequence any packets that were
delivered out of order. This can be done using the sequence number and the
timestamp.

15
15 {C} Herbert Haas 2005/03/11
RTCP Facts
Sent by RTP receivers

RTCP provides feedback for RTP senders and


other receivers!

Sent to same muIticast group!


RTP sender (=muIticast source) uses
RTCP information to

Log group activity

Measure QoS conditions


Other RTP receivers Iearn totaI RTCP
utiIization

Try to keep totaI utiIization beIow 5% of


network bandwidth
All multicast receivers periodically send RTCP control packets to the same
multicast group address which is used for RTP delivery. This provides a
feedback loop to both the sender and receivers.
Therefore, all receivers use the RTCP packets from their partners to limit their
own RTCP rate and keep the RTCP-based network utilization below 5% of
the available bandwidth, thus making RTCP very scalable!
When the sender receives an RTCP packet, it may adapt to changes in the
network (available bandwidth situation and congestion conditions) and keep
track of the receivers.
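The rate limiting described above can be sketched as a simple calculation. This is a simplified illustration, not the full RTCP timer algorithm: the function name, the average RTCP packet size of 100 bytes and the 5-second minimum interval are assumptions chosen for the example; only the 5% share comes from the text.

```python
def rtcp_report_interval(participants, session_bw_bps,
                         avg_rtcp_packet_bytes=100,  # assumed average packet size
                         rtcp_fraction=0.05,         # 5% share, as stated in the text
                         min_interval_s=5.0):        # assumed lower bound
    """Seconds a participant waits between two of its own RTCP packets."""
    # Bandwidth reserved for all RTCP traffic together, in bytes per second.
    rtcp_bw_bytes_per_s = session_bw_bps * rtcp_fraction / 8
    # All participants share that budget, so the interval grows with the group size.
    interval = participants * avg_rtcp_packet_bytes / rtcp_bw_bytes_per_s
    return max(interval, min_interval_s)


if __name__ == "__main__":
    for n in (10, 1000, 100000):
        print(n, "participants ->", rtcp_report_interval(n, 1_000_000), "s")
```

The point of the sketch is that the total RTCP load stays constant: each participant counts the others from their RTCP packets and slows down its own reports as the group grows.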

RTP Compression
Simple substitution principle
Only point-to-point!
Not CPU intensive!
Might be memory greedy
Uncompressed: IP (20 Byte) | UDP (8 Byte) | RTP (12 Byte) | Payload (20-160 Bytes)
RTP Header Compression and No UDP Checksum: H (1 Byte) | Payload (20-160 Bytes)
RTP Header Compression and UDP Checksum: H (4 Byte) | Payload (20-160 Bytes)
RTP compression uses a simple substitution principle. It works only on
point-to-point links and requires the terminating devices to maintain a substitution
table. Each IP/UDP/RTP header combination is replaced by a one-byte or
four-byte (with UDP checksum) label.
Obviously, this compression method is not CPU intensive but might be
memory greedy if a router must deal with lots of RTP connections (multicast
environments or similar).
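To get a feeling for what the substitution saves, the header share of a packet can be computed for the payload sizes named on the slide. This is a minimal sketch; the function name is made up for the example, and the compressed header sizes (1 and 4 bytes) are simply the values given in the text above.

```python
UNCOMPRESSED_HEADER = 20 + 8 + 12  # IP + UDP + RTP header bytes


def header_share(payload_bytes, header_bytes):
    """Fraction of the packet that is header overhead."""
    return header_bytes / (header_bytes + payload_bytes)


if __name__ == "__main__":
    variants = (
        ("uncompressed", UNCOMPRESSED_HEADER),
        ("compressed, with UDP checksum", 4),   # value from the text
        ("compressed, no UDP checksum", 1),     # value from the text
    )
    for payload in (20, 160):
        for label, header in variants:
            share = header_share(payload, header)
            print(f"payload {payload:3d} B, {label}: {share:.1%} header")
```

For small voice payloads the uncompressed header dominates the packet, which is why the method is mainly interesting on slow point-to-point links.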
Before Cisco IOS Release 12.0(7)T, if compression of TCP or Real-Time Transport
Protocol (RTP) headers was enabled, compression was performed in the process
switching path. That meant that packets traversing interfaces that had TCP or RTP
header compression enabled were queued and passed up to the process to be switched.
This procedure slowed down transmission of the packet, and therefore some users
preferred to fast switch uncompressed TCP and RTP packets.
Now, if TCP or RTP header compression is enabled, it occurs by default in the
fast-switched path or the Cisco Express Forwarding-switched (CEF-switched) path,
depending on which switching method is enabled on the interface. Furthermore, the
number of TCP and RTP header compression connections was increased to 1000
connections each.
If neither fast switching nor CEF switching is enabled, then if TCP or RTP header
compression is enabled, it will occur in the process-switched path as before.
Prerequisite requirements:
  CEF switching or fast switching must be enabled on the interface.
  HDLC, PPP, or Frame Relay encapsulation must be configured.
  TCP header compression or RTP header compression or both must be enabled.
TCP and RTP header compression is performed in the CEF-switched path or
fast-switched path automatically. No configuration tasks are required.

Realtime Streaming Protocol
RTSP = "Internet VCR remote control protocol"
Efficient delivery of streamed multimedia over IP networks
  Client-server based
  Large-scale audio/video on demand
  VCR-style control functionality
Also uses RTP for delivery
RFC 2326
Unlike Microsoft's Active Streaming Format (ASF), which is used to stream
the content of a file system over a network, RTSP is client-server based.
It is designed to address the needs for efficient delivery of streamed multimedia
over IP networks and works well both for large audiences and for
single-viewer media-on-demand. RealNetworks, Netscape Communications and
Columbia University jointly developed RTSP within the
MMUSIC working group of the Internet Engineering Task Force (IETF). In
April 1998, it was published as a Proposed Standard by the IETF.
H.323 and RTSP are complementary in function. H.323 is useful for setting up
audio/video conferences in moderately sized peer-to-peer groups, whereas
RTSP is useful for large-scale broadcasts and audio/video-on-demand
streaming. One could think of H.323 as offering services equivalent to a
telephone with three-way calling, while RTSP offers services like a video store
with delivery services, a VCR or cable television. RTSP provides "VCR-style"
control functionality such as pause, fast forward, reverse, and absolute
positioning, which is beyond the scope of H.323 and RTP.
Both H.323 and RTSP use RTP as their standard means of actually delivering
the multimedia data. This data-level compatibility makes efficient gateways
between the protocols possible, since only control messages need to be
translated.

Multicast Addresses
This section covers the IP class D address range (224.0.0.0 to 239.255.255.255).

Reserved Class D Addresses
IANA reserved range 224.0.0.0 to 224.0.0.255 to be local scope:
  224.0.0.1 = all multicast systems on subnet
  224.0.0.2 = all routers on subnet
  224.0.0.4 = all DVMRP routers
  224.0.0.5 = all OSPF routers
  224.0.0.6 = all OSPF designated routers
  224.0.0.9 = all RIPv2 routers
  224.0.0.10 = all (E)IGRP routers
  224.0.0.13 = all PIMv2 routers


Multicasts in this IANA-defined range are never forwarded beyond the local IP
network regardless of the actual TTL value (which is typically set to 1).
ftp://ftp.isi.edu/in-notes/iana/assignments/multicast-addresses is the
authoritative source for reserved multicast addresses.
Note that the address range 224.0.0.0 to 224.0.0.255 is regarded to be local
scope. The above listing shows only some reserved "well-known" addresses
from this range.

Other Class D Addresses
Global scope: 224.0.1.0 to 238.255.255.255
  Internet-wide dynamically allocated multicast applications
  Typically Mbone applications
Administratively scoped: 239.0.0.0 to 239.255.255.255
  Locally administrated multicast addresses (like RFC 1918 addresses)
  Organization-local scope: 239.192.0.0/14
  Site-local scope: 239.255.0.0/16


Administratively scoped multicast addresses are "private" addresses (similar
to RFC 1918 unicast addresses) and must not be used within the Internet. The
administratively scoped multicast address space consists of a local scope range
and an organization-local scope.
The IPv4 Local Scope may grow downward from 239.255.0.0/16 into the
reserved ranges 239.254.0.0/16 and 239.253.0.0/16. However, these ranges
should not be utilized until the 239.255.0.0/16 space is no longer sufficient.
The IPv4 Organization Local Scope 239.192.0.0/14 is the space from which
an organization should allocate sub-ranges when defining scopes for private
use. The ranges 239.0.0.0/10, 239.64.0.0/10 and 239.128.0.0/10 are unassigned
and available for expansion of this space. These ranges should be left
unassigned until the 239.192.0.0/14 space is no longer sufficient. This is to
allow for the possibility that future revisions of RFC 2365 may define
additional scopes on a scale larger than organizations.
See RFC 2365 for further information.

Static Group Address Assignment for Interdomain Multicast
Temporary method to allow Internet content providers to assign static multicast addresses
  For inter-domain purposes
Group range 233.x.x.0 to 233.x.x.255
  x.x contains AS number
  Remaining low-order octet used for group assignment within AS
One of the methods for static address allocation for multicast groups is defined
in RFC 2770, titled "GLOP Addressing in 233/8".
Until Multicast Address Set-Claim (MASC) has been fully specified and
deployed, many content providers of the Internet require something at the very
least to begin address allocation. This necessity is being addressed with a
temporary method of static multicast address allocation.
See IETF draft "draft-ietf-mboned-glop-addressing-xx.txt" and RFC 2770.
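The mapping of an AS number into the two middle octets can be illustrated with a few lines of Python. This is a minimal sketch; the function name is hypothetical and AS 5662 is just an example value.

```python
import ipaddress


def glop_range(as_number):
    """Return the 233.x.x.0/24 block that belongs to a 16-bit AS number."""
    if not 0 <= as_number <= 0xFFFF:
        raise ValueError("GLOP addressing only covers 16-bit AS numbers")
    # High byte of the AS number goes into octet 2, low byte into octet 3.
    high, low = divmod(as_number, 256)
    return ipaddress.ip_network(f"233.{high}.{low}.0/24")


if __name__ == "__main__":
    block = glop_range(5662)
    print(block)                # 233.22.30.0/24
    print(block.num_addresses)  # 256
```

Only the low-order octet remains for group assignment inside the AS, so each AS gets exactly one /24.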

SSM Addressing
For globally known sources and source-specific distribution trees
  Across domains
Group range: 232.0.0.0/8
  232.0.0.0 to 232.255.255.255
The increasing demand for interdomain multicast routing led to some interim
solutions such as GLOP. But GLOP addressing is restricted to the last byte,
which leaves only 256 group addresses per AS.
When the sources (senders) have to be globally known, a special range of
multicast addresses can be used for those servers. Additionally the specialized
multicast protocol called "Source Specific Multicast (SSM)" can be used,
which supports building the distribution tree at the source for any group
address from the range 232.0.0.1 to 232.255.255.255.
Defined in IETF draft "draft-holbrook-ssm-00.txt".
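The class D ranges discussed so far can be summarized in a small lookup. The following is a sketch only: the function name and the scope labels are chosen for this example, while the prefixes are the ones listed in the text (local scope, SSM, GLOP and the administratively scoped ranges).

```python
import ipaddress

# Most specific prefixes first, so the first match wins.
SCOPES = [
    ("local scope (reserved)", "224.0.0.0/24"),
    ("site-local scope", "239.255.0.0/16"),
    ("organization-local scope", "239.192.0.0/14"),
    ("administratively scoped", "239.0.0.0/8"),
    ("source-specific multicast", "232.0.0.0/8"),
    ("GLOP static assignment", "233.0.0.0/8"),
    ("global scope", "224.0.0.0/4"),
]


def classify(address):
    """Return the name of the multicast range an IPv4 address falls into."""
    ip = ipaddress.ip_address(address)
    for name, prefix in SCOPES:
        if ip in ipaddress.ip_network(prefix):
            return name
    return "not multicast"


if __name__ == "__main__":
    for addr in ("224.0.0.5", "224.2.127.254", "232.1.2.3",
                 "233.22.30.7", "239.192.0.1", "239.255.1.1", "10.0.0.1"):
        print(addr, "->", classify(addr))
```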

Dynamic Multicast Addressing
Method of SDR (Mbone)
  Sessions announced over well-known multicast groups (e.g. 224.2.127.254)
  Address collisions detected and resolved at session creation time via lookup into an SDR cache
  Not scalable
Multicast Address Set-Claim (MASC)
  Hierarchical concept
  Extremely complex garbage-collection problem
  Under development
The Session Directory (SDR) is an important application for the Mbone. SDR
detects collisions when creating new sessions and switches to an unused address.
This method was sufficient in the old Mbone, but the increasing number
of sessions has shown that it does not scale well.
MASC is a new proposal for dynamic multicast address allocation that is
being developed by the Multicast-Address Allocation (malloc) Working Group
of the Internet Engineering Task Force (IETF).
MASC requires domains to lease IP multicast group address space from their
parent domain. These leases are good for only a set period. It is possible that
the parent domain may grant a completely different range at lease renewal time
because of the need to reclaim address space for use elsewhere in the Internet.
This task is indeed very complex!
MASC is part of the hierarchical Multicast Address Allocation Architecture
(MAAA) and represents the top level of this architecture. When a certain range
of multicast addresses is allocated at the top level, the underlying hierarchies
use additional protocols for address assignment. Within a domain (AS or
service provider) the Address Allocation Protocol (AAP) is used. The
Multicast Address Dynamic Client Allocation Protocol (MADCAP) is
merely a modified DHCP and allows address assignment at leaf segments for
the multicast sources. Servers for address allocation within the MAAA
architecture are called Multicast Address Allocation Servers (MAAS).
See "draft-ietf-malloc-masc-01.txt" for detailed MASC principles.

IGMP

Internet Group Management Protocol
Used (mainly) by hosts
  To tell designated routers about desired group membership
  Supported by nearly all operating systems
IGMP Version 1
  "I want to receive (*, G)"
  Silly: Leaving group only by being silent...
  Specified in RFC 1112 (old)
IGMP Version 2
  Also: "I do not want to receive this any longer"
  Specified in RFC 2236 (current)
IGMP Version 3
  "I want to receive (S, G)"
  DR can directly contact source
  Still under development
The Internet Group Management Protocol (IGMP) is primarily used by hosts to
tell the DR about their desire to receive multicast traffic. Upon receiving
IGMP messages the DR may retrieve the specified multicast by joining the
MDT.
IGMP is carried directly within IP using protocol number 2.
The initial specification for IGMP (now considered as v1) was documented
in RFC 1112 ("Host Extensions for IP Multicasting", August 1989, Stanford
University). Soon several shortcomings of IGMPv1 were discovered (e.g.
hosts leave a group by not responding) and this led to the development of
IGMPv2.
To tell the whole truth: IGMP Version 0 had been specified in RFC 988 and
was obsoleted by RFC 1112.
Using IGMPv2, hosts can send a leave message to the router. The router
immediately sends a query in order to check if there is really no host wanting to
be a member of this group. If there is no answer within three seconds (!) the
group is pruned from the multicast tree. IGMPv2 was ratified in November
1997 in RFC 2236 ("Internet Group Management Protocol, Version 2" by
Xerox PARC).
IGMPv3 is still under development. Please check out
draft-ietf-idmr-igmp-v3-??.txt ...as things change quickly...

IGMP
DR sends Host Membership Queries to 224.0.0.1 every 60-120 s
  Telling all active groups to local receivers
Interested hosts send IGMP "report"
  With destination address = group address!
  Countdown-based, TTL=1
[Figure: several hosts, each labelled 224.1.1.1. The DR sends a periodic "Host Membership Query" to 224.0.0.1 ("All Hosts"); only one member replies with a "report" message.]
The basic principle is this:
The designated router periodically sends a "Host Membership Query" using the
destination address of 224.0.0.1 ("all hosts"). Note: The TTL is set to 1.
Upon receiving a "Host Membership Query" from the router each host starts a
countdown for each group it is a member of. The countdown is initialized with a
random value (IGMP v1: something between 0 and 10 seconds).
The host whose countdown reaches zero first sends a "Host Membership Report
Message". Again the TTL is set to 1. Any other host of this group can
immediately cancel its countdown and does not need to reply. This method saves
bandwidth and processing by the hosts.
Using IGMPv1, hosts leave a group simply by not responding. The DR sends
three query messages (one every 60 seconds) and if no host replies this subnet is
pruned from the multicast tree. This is indeed silly because for 3 minutes the
whole LAN is flooded with unwanted multicast traffic.
Using IGMPv2, hosts can send a leave message to the router. The router
immediately sends a query in order to check if there is really no host wanting to
be a member of this group. If there is no answer within 3 seconds (!) the group is
pruned from the multicast tree.
Note: Join messages can also be sent immediately without being queried by the
DR in advance ("asynchronous joins").
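The countdown-based report suppression can be sketched as a tiny simulation. It assumes an idealized LAN where every member hears the first report instantly; the function name and the host names are made up for the example.

```python
import random


def simulate_query(members, max_delay_s=10.0, rng=None):
    """Simulate one query round for a single group.

    Returns (reporter, delay, suppressed): the host that answers, the value of
    its countdown, and the hosts that cancel their countdown on hearing it.
    """
    rng = rng or random.Random()
    # Every member starts a random countdown when the query arrives.
    timers = {host: rng.uniform(0, max_delay_s) for host in members}
    # The host whose countdown expires first sends the report ...
    reporter = min(timers, key=timers.get)
    # ... and all other members of the group stay silent.
    suppressed = [host for host in members if host != reporter]
    return reporter, timers[reporter], suppressed


if __name__ == "__main__":
    hosts = ["A", "B", "C", "D", "E"]
    reporter, delay, suppressed = simulate_query(hosts, rng=random.Random(1))
    print(f"{reporter} reports after {delay:.2f} s; silent: {suppressed}")
```

Regardless of the group size, the router sees a single report per group and query, which is all it needs to keep forwarding.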

Other Important Differences
IGMPv1
  Does not elect designated query router
    Task for multicast routing protocol (different mechanisms implemented)
    Often results in multiple queriers on a single multiaccess network
  Makes general queries only
    Contain listing of all active groups
IGMPv2 (backwards compatible with IGMPv1)
  Router with lowest IP address becomes IGMP querier on this LAN segment
  General queries specify "Max Response Time"
    Maximum time within which a host must respond
  Allows for group-specific query
    To determine membership of a single group
IGMPv2 can do group-specific queries to query membership only in a single
group instead of all groups. This is a much more efficient way to determine any
remaining members of a group without asking all groups for a report. This
group-specific query is not sent to 224.0.0.1 but to the group's address G.
Initially all IGMPv2 routers think they are queriers but must give up
immediately when a query from a lower IP address is noticed on the same LAN
segment.
Each time a host leaves the group (by sending the IGMPv2 leave message to
the all-routers address 224.0.0.2) the designated router sends a group-specific
query to check whether this was the last host leaving the group.
When CGMP is used in the LAN, the IGMPv2 leave message mechanism also
helps the router to better manage the CGMP state in the switch.
If an IGMPv2 host is present in an IGMPv1 environment (including DR)
this host must always send IGMPv1 reports and may suppress IGMPv2 leave
messages.
If an IGMPv1 host is present in an IGMPv2 environment the DR must wait
for the IGMPv1 timeout to be sure whether this v1-host wants to enter a group
because this host cannot deal with the advanced IGMPv2 query response
intervals. Furthermore, the router must ignore v2 leave messages for all groups
the v1-host is part of (until the 3-minute timer expires for this host).

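IGMP Protocol Details
IP Protocol Number = 2
IGMPv1 (bit offsets 0 4 8 16 31):
  Version (bits 0-3) | Type (bits 4-7) | Unused (bits 8-15) | Checksum (bits 16-31)
  Group Address (32 bits)
  Type: 1 = Host Membership Query, 2 = Host Membership Report
IGMPv2 (bit offsets 0 8 16 31):
  Type (bits 0-7) | Max Response Time (bits 8-15) | Checksum (bits 16-31)
  Group Address (32 bits)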
Like ICMP, IGMP is an integral part of IP. It is required to be implemented by
all hosts wishing to receive IP multicasts. IGMP messages are encapsulated in
IP datagrams, with an IP protocol number of 2. All IGMP messages are sent
with IP TTL 1, and contain the IP Router Alert option (RFC 2113) in their IP
header. The unused field in version 1 is zeroed when sent, ignored when
received.
IGMPv2 Type field:
  0x11 Membership Query. There are two sub-types of Membership Query
  messages: The General Query, which is used to learn which groups have
  members on an attached network, and the Group-Specific Query, which is used
  to learn if a particular group has any members on an attached network. These
  two messages are differentiated by the Group Address.
  0x12 Version 1 Membership Report. This is defined for backwards
  compatibility with IGMPv1.
  0x16 Version 2 Membership Report
  0x17 Leave Group
IGMPv2 Max Response Time field: This field is meaningful only in
Membership Query messages, and specifies the maximum allowed time before
sending a responding report in units of 1/10 second. In all other messages, it is
set to zero by the sender and ignored by receivers.
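The 8-byte IGMPv2 message described above can be built and parsed with the Python standard library. This is a sketch for illustration only: the helper names are hypothetical, and the code only handles the message itself, not the surrounding IP header with TTL 1 and the Router Alert option. The checksum is assumed to be the usual 16-bit one's complement Internet checksum over the IGMP message.

```python
import socket
import struct

MEMBERSHIP_QUERY = 0x11
V1_MEMBERSHIP_REPORT = 0x12
V2_MEMBERSHIP_REPORT = 0x16
LEAVE_GROUP = 0x17


def internet_checksum(data):
    """16-bit one's complement of the one's complement sum of the data."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_igmpv2(msg_type, group="0.0.0.0", max_resp_tenths=0):
    """Type (8 bits) | Max Response Time (8 bits) | Checksum (16) | Group (32)."""
    group_bytes = socket.inet_aton(group)
    # The checksum is computed with the checksum field itself set to zero.
    draft = struct.pack("!BBH4s", msg_type, max_resp_tenths, 0, group_bytes)
    checksum = internet_checksum(draft)
    return struct.pack("!BBH4s", msg_type, max_resp_tenths, checksum, group_bytes)


def parse_igmpv2(packet):
    msg_type, max_resp, checksum, group_bytes = struct.unpack("!BBH4s", packet[:8])
    return {
        "type": msg_type,
        "max_resp_s": max_resp / 10,  # field is in units of 1/10 second
        "group": socket.inet_ntoa(group_bytes),
        # A message with a correct checksum sums to zero when re-checked.
        "checksum_ok": internet_checksum(packet[:8]) == 0,
    }


if __name__ == "__main__":
    general_query = build_igmpv2(MEMBERSHIP_QUERY, max_resp_tenths=100)
    report = build_igmpv2(V2_MEMBERSHIP_REPORT, group="224.1.1.1")
    print(parse_igmpv2(general_query))
    print(parse_igmpv2(report))
```

A General Query carries the group address 0.0.0.0, while a Group-Specific Query would carry the address of the group being queried; this is the only difference between the two sub-types in the message.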

IGMPv3
Hosts could even send a list of sources
  Either (S, G) or [(S1, S2, ..., Sn), G]
Advantages:
  Leaf routers can build a source distribution tree without RPs
  LAN switches, which would do IGMP snooping
Using IGMPv3 a host can directly say "I want to receive traffic to group G
from source S". Such a host could directly connect to the source!
IGMP v3lite is a Cisco-specific SSM transition solution toward IGMPv3 for
application developers and users that have to rely on host operating systems
that do not yet support IGMPv3. Using IGMP v3lite, application developers
can support IGMP v3 for SSM (Source Specific Multicast) before the host
supports IGMP v3 itself in the operating system. IGMP v3lite works together
with Cisco IOS routers in the network.

Layer 2 Multicast

L2/L3 Address Mapping
Switches should also perform L2 multicast for efficient multicast delivery
  Address mapping required
Strange solution standardized:
  23 low-order bits of multicast IP address are mapped into 23 low-order MAC address bits
  MAC prefix is always "01-00-5e"
  5 bits of IP address are lost!!!


The loss of 5 address bits when mapping L3 multicast addresses to L2 MAC
addresses was not originally intended. One of the inventors, Dr. Steve Deering,
asked for 16 OUIs to map all 28 bits of the Layer 3 IP multicast address into
unique Layer 2 MAC addresses.
Unfortunately, the IEEE charged $1000 for each OUI assigned, which meant
that Dr. Deering actually requested that his advisor spend $16,000 to continue
his research. Because of budget constraints, the advisor agreed to purchase a
single OUI for Dr. Deering. However, the advisor also chose to reserve half of
the MAC addresses in this OUI for other graduate research projects and granted
Dr. Deering the other half.
This action resulted in Dr. Deering having only 23 bits of MAC address space
with which to map 28 bits of IP multicast addresses. It is unfortunate that it
was not known then how popular IP Multicast would become. If they had
anticipated such popularity, Dr. Deering might have been able to collect
sufficient funds from interested parties to purchase all 16 OUIs.
Dr. Steve Deering recently joined Cisco Systems, where he is working on the development of
very high-speed internet routers. Prior to that, he spent six years at Xerox's Palo Alto Research
Center, engaged in research on advanced internet technologies, including multicast routing,
mobile internetworking, scalable addressing, and support for multimedia applications over the
Internet. He is present or past chair of numerous IETF Working Groups, inventor of IP
multicast and co-founder of the Internet Multicast Backbone (the MBone), and the lead
designer of the new version of the Internet Protocol, IPv6. He received his Ph.D. from Stanford
University.

Address Mapping to Ethernet
MAC prefix "01-00-5e" indicates L3-L2 mapping
Only 23 bits had been reserved for Ethernet: 32:1 overlapping!
32 Bit Multicast IP Address 224.0.1.1:
  11100000 00000000 00000001 00000001 (224 0 1 1)
  First 4 bits fixed, next 5 bits lost, low-order 23 bits mapped
48 Bit Multicast MAC Address 01-00-5e-00-01-01:
  00000001 00000000 01011110 00000000 00000001 00000001
  Prefix 01 00 5e fixed, followed by the 23 mapped bits
After IP multicast packets have been routed to the last hop router, i.e. the
router which has receivers directly attached to it, the multicast method should
be continued on layer 2 if a broadcast-capable medium is used. If Ethernet is
used, the 0x01005e prefix has been reserved for mapping L3 IPmc addresses
into L2 MAC addresses.
Unfortunately, only 23 bits of the IP address can be mapped into MAC
addresses, which leads to a 32:1 overlap of L3 addresses to L2 addresses.
That is, several L3 addresses can map to the same L2 multicast address! This is
also valid for FDDI. Token Ring addresses have a different bit order, which leads to
much bigger problems, but this is not considered here.
For example, all of the following IPmc addresses map to the same L2 multicast
address 01-00-5e-0a-00-01:
224.10.0.1, 225.10.0.1, 226.10.0.1, 227.10.0.1
228.10.0.1, 229.10.0.1, 230.10.0.1, 231.10.0.1
232.10.0.1, 233.10.0.1, 234.10.0.1, 235.10.0.1
236.10.0.1, 237.10.0.1, 238.10.0.1, 239.10.0.1
224.138.0.1, 225.138.0.1, 226.138.0.1, 227.138.0.1
228.138.0.1, 229.138.0.1, 230.138.0.1, 231.138.0.1
232.138.0.1, 233.138.0.1, 234.138.0.1, 235.138.0.1
236.138.0.1, 237.138.0.1, 238.138.0.1, 239.138.0.1
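The mapping and the resulting 32:1 overlap can be checked with a short Python sketch. The function name is chosen for this example; the logic simply copies the 23 low-order bits of the group address behind the fixed 01-00-5e prefix.

```python
import ipaddress


def multicast_mac(group):
    """Map an IPv4 multicast group address to its Ethernet multicast MAC."""
    ip = ipaddress.IPv4Address(group)
    if not ip.is_multicast:
        raise ValueError(f"{group} is not a class D address")
    low23 = int(ip) & 0x7FFFFF       # keep only the 23 low-order bits
    mac = 0x01005E000000 | low23     # fixed prefix 01-00-5e, then a 0 bit
    return "-".join(f"{byte:02x}" for byte in mac.to_bytes(6, "big"))


if __name__ == "__main__":
    print(multicast_mac("224.0.1.1"))    # 01-00-5e-00-01-01
    print(multicast_mac("224.10.0.1"))   # 01-00-5e-0a-00-01
    print(multicast_mac("239.138.0.1"))  # 01-00-5e-0a-00-01

    # All 32 addresses that collide with 224.10.0.1:
    overlap = [f"{first}.{second}.0.1"
               for second in (10, 138) for first in range(224, 240)]
    assert len({multicast_mac(a) for a in overlap}) == 1
```

The 5 lost bits are the low 4 bits of the first octet and the top bit of the second octet, which is why both the first octet (224 to 239) and the second octet (10 or 138) may vary without changing the MAC address.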

Multicast Switching
Normal switches flood multicast frames through every port
  No entries in CAM table (how to learn?)
  Waste of LAN capacity
Some switches allow configuring dedicated multicast ports
  Not satisfying because users want to join and leave dynamically over any port
Some switches (including Cisco Catalysts) allow dedicated multicast ports to be
configured. But this solution does not scale well as users change groups
frequently.

Multicast Switching Solutions
Cisco Group Management Protocol (CGMP)
  Simple but proprietary
  For routers and switches
IGMP snooping
  Complex but standardized
  Also proprietary implementations available
  For switches only
GARP Multicast Registration Protocol (GMRP)
  Standardized but not widely available
  For switches and hosts
Router-port Group Management Protocol (RGMP)
  Simple but Cisco-proprietary
  For routers and switches
CGMP was created by Cisco Systems and is still a proprietary protocol
which can be used by a router to tell a switch the content of IGMP messages
which have been sent by hosts.
A switch can apply IGMP snooping and thereby intercept IGMP messages
from the host to the DR in order to learn the MAC addresses. Switches should
be L3 aware, otherwise the performance will degrade.
GMRP stands for Generic Attribute Registration Protocol (GARP) Multicast
Registration Protocol and uses GARP to register and propagate multicast
membership information in a switching domain. GARP is a Layer 2 transport
mechanism which allows switches and end systems to communicate various
information throughout the switching domain.
RGMP is also a Cisco proprietary solution and requires an environment where
one switch is connected to routers only.

CGMP
Sent by routers to switches
  Destination address 0100.0cdd.dddd
Message contains
  Type field (join or leave)
  MAC address of IGMP client (host)
  Multicast MAC address of group
Now switch can create multicast table
  Low CPU overhead
CGMP message layout (bit offsets 0 4 8 16 31):
  Version | Type | Count | Reserved
  GDA
  GDA | USA
  USA
The router translates IGMP membership messages into CGMP join messages
and forwards them to switches.
The CGMP messages contain the Unicast Source Address ("USA", the MAC
address of the client) and the Group Destination Address ("GDA", the
multicast MAC address that maps a multicast group IP address).
The switches use the CGMP information to populate the CAM tables with the
correct multicast entries. The dedicated address 0100.0cdd.dddd is used to
address the Network Management Processor (NMP) inside the switches.
In the CGMP type field the value 0 denotes a "join" and 1 means "leave".
A "leave" message with a nonzero GDA and an all-zeros USA is used to
globally delete the group in all switches. This is necessary after the last
member has left the group. A "leave" message with all zeros in both the GDA
and the USA fields means that all groups must be deleted in all switches. This
occurs when CGMP is disabled on the router or the command "clear ip cgmp" is
executed on a router interface.
Ethereal (see http://www.ethereal.com/) is a good GPL-based sniffer (or
politically correct: "protocol analyzer") which is also able to decode CGMP.
Note: CGMP does not work in combination with IGMP snooping.


CGMP - Notes (HIDDEN)
Supported by wide range of routers and switches
Conflicts with IGMP snooping
How to learn about all receivers in spite of the report suppression mechanism?
  Good question...

IGMP Snooping
Switches must decode IGMP
  Which traffic should be forwarded to which ports?
  Read IGMP membership reports and leave messages
  Either by NMP (slow) or by special ASICs
The CAM table must allow multiple port entries per MAC address!
  Also the CPU port (e.g. 0) must be added!
  Upon high mc-traffic load the CPU gets overloaded!
  Special ASICs might differentiate IGMP from data traffic to improve performance
Before the first host joins a group G, there is no entry in the CAM table for this
group's associated MAC address. Therefore the first IGMP group membership
report is flooded to all ports including the switch CPU and the port to the DR.
Now the CPU can enter the MAC address for this group G in the CAM table
together with this host's associated port, the port to the DR, and the CPU
port (Cisco: 0). Thus, three ports are entered after the first IGMP membership
message.
It is essential to include the CPU port in this CAM table entry. Otherwise, the switching engine
could not forward any further IGMP message (to this group G) to the CPU. The CPU needs to
see all IGMP packets (for this group G) for further IGMP snooping. Remember that normally
(without multicast-capable CAM tables) any multicast would be flooded through all ports
automatically. But now measures must be taken to ensure that the CPU gets all packets!
If another host of the same group sends an IGMP membership message, the
switch simply forwards this message to all listed ports for this group and adds
the corresponding port to the CAM table.
Now consider high-rate multicast traffic (e.g. a 10 Mbit/s video stream)
addressed to the group G. Since the CPU port is entered in the CAM table,
every packet to group G is also forwarded to the CPU, which scans it for IGMP
information. Clearly, at a certain point, the CPU is overloaded. Therefore IGMP
snooping is not an elegant solution for high-end switches.
If an additional ASIC is implemented which is able to scan L3 information,
this ASIC could separate IGMP and bulk data packets. Then the CPU only gets
IGMP packets. But it is still not elegant!
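The CAM table handling described above can be modelled in a few lines. This is a toy model under stated assumptions, not a switch implementation: the class and method names are invented for the example, and only the port bookkeeping from the text (host port, port to the DR, CPU port 0) is represented.

```python
CPU_PORT = 0  # the text names 0 as the CPU port on Cisco switches


class SnoopingSwitch:
    """Toy model of the multicast CAM entries built by IGMP snooping."""

    def __init__(self, router_port, all_ports):
        self.router_port = router_port
        self.all_ports = set(all_ports) | {CPU_PORT}
        self.cam = {}  # group MAC address -> set of output ports

    def on_membership_report(self, group_mac, host_port):
        # First report for a group: enter the CPU port and the port to the DR.
        ports = self.cam.setdefault(group_mac, {CPU_PORT, self.router_port})
        # Every report adds the port of the reporting host.
        ports.add(host_port)
        return ports

    def output_ports(self, group_mac, ingress_port):
        # Unknown group: flood, as a switch without multicast entries would.
        ports = self.cam.get(group_mac, self.all_ports)
        return ports - {ingress_port}


if __name__ == "__main__":
    switch = SnoopingSwitch(router_port=1, all_ports=range(1, 9))
    group = "01-00-5e-0a-00-01"
    print(sorted(switch.on_membership_report(group, host_port=3)))  # [0, 1, 3]
    print(sorted(switch.on_membership_report(group, host_port=5)))  # [0, 1, 3, 5]
    # A video stream arriving from the router still reaches the CPU port 0:
    print(sorted(switch.output_ports(group, ingress_port=1)))       # [0, 3, 5]
```

The last line shows the problem named in the text: because the CPU port is part of the entry, every data packet for the group is copied to the CPU as well.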

GARP Multicast Registration Protocol
IEEE 802.1p GARP (Generic Attribute Registration Protocol) extended for IP multicast
  Runs on hosts and switches
Pro-active processing:
  Hosts must also join to switch using GMRP
  Switch configures CAM table and notifies other switches
Incoming mc-traffic can be efficiently switched
GMRP and GARP are part of the IEEE 802.1p standard and must be supported
by operating systems in the hosts and switches. This standard is not yet
finished.
Using GMRP a host must also send a "join" to the neighboring switch.
This switch configures its CAM table (i.e. sets a filter) and exchanges
membership information with neighboring switches.
Any incoming multicast traffic is simply switched according to the CAM table.
There is no need to snoop for IGMP packets.

Switch/Router Problems
Any switch connected to multiple routers must forward all multicast traffic to all routers!
  Since routers don't send IGMP membership reports
  Routers might get lots of unneeded packets!
Using RGMP a router can tell a switch all multicast groups the router manages
  Router-only switched topologies only!


Router-Port Group Management Protocol (RGMP) is a proprietary Cisco
protocol that restricts the multicast traffic that switches send to router
ports.
RGMP may be used only with sparse mode protocols because they are based
on an explicit join to the multicast group.

RGMP Details
Routers periodically send hello messages to the switch
  Switch learns about existence of routers
Routers send RGMP (*, G) joins for groups they belong to
  Well-known address 224.0.0.25
Restrictions:
  Not all routers need to support RGMP
  No directly connected sources allowed
[Figure: the router sends "Hello" and "Join (*, G)" messages to the switch.]
RGMP supplements IGMP snooping in that a router also sends report-like
messages. Using IGMP only, a switch can learn about receivers via report
messages but routers usually do not send reports.
Routers can use RGMP (*, G) join and leave messages to tell the switch about
multicast groups that they want to receive. After a switch receives RGMP
(*, G) joins from a router, it only forwards multicast traffic for joined groups.
RGMP does not allow any directly connected sources but it supports directly
connected receivers. Traffic to directly connected receivers is restricted using
IGMP snooping (which must be turned on).
RGMP-enabled switches will not forward any RGMP packet but non-RGMP
switches will, because a 224.0.0.X (X = 25) multicast address is used.
Therefore it is strongly recommended to use non-RGMP switches only as
leaf switches (with an upstream link to RGMP switches).
See RFC 3488 (Informational).

Session Information

Session Information
Potential receivers must be informed about multicast sessions
  Sessions are available before receiver launches application
  Might be announced via well-known multicast group address
  Or via publicly available directory services
  Or via web-page or even E-Mail


There are several ways to tell potential receivers about multicast sessions.
Either specialized protocols are used or well-defined data structures (MIME,
XML) which are distributed via a web-page or E-Mail.

SDR (1)
Mbone session description protocol and transport mechanism
Used by sources for assigning new multicast addresses
  Checks sdr cache to avoid conflicts
  Creates a session and sends its description via sdr announcements (224.2.127.254)
  Anybody can announce a session
  Source is part of the session description
Announcement frequency
  Ratio number of sessions / available BW = const
  Typically 5-10 minutes
  Late join latency problem avoided by caching
At the receiver side, sdr is used to learn about available groups and sessions.
At the sender side, sdr is used to create new sessions and to avoid address
conflicts. During session creation the senders consult their sdr caches (note that
senders are also receivers) and choose one of the unused multicast addresses.
After creating the session, the senders start to announce it using all the
information needed by receivers to successfully join the session, including session
schedule, codecs, multicast group address and port numbers, contact information,
etc.
SDR announcements are typically sent every 5-10 minutes. This announcement
frequency depends on the number of sessions to be announced and on the
bandwidth of the outgoing interface through which the announcement will be
sent. Each SDR application tries to keep this ratio at a constant value.
This might lead to the late join latency problem for potential receivers that miss
the last announcement. Newly enabled receivers must wait for the next
announcement.
Caching mechanisms are used to avoid this late join latency problem. Regardless of
whether a multicast application is running or not, the operating system (or any
other low-level multicast management process) caches all announcements locally.
When a multicast application is started, it first scans this cache. Note that
inactive sessions are also announced and cached.
Cisco routers can also cache the SDR information but only to create more
descriptive outputs. This allows a router, for example, to use a descriptive
session name instead of the multicast group address.

SDR (2)
RFC 2327 only specifies variables but no transport mechanism
Session Announcement Protocol (SAP, RFC 2974)
Session Initiation Protocol (SIP, RFC 2543)
Real Time Streaming Protocol (RTSP, RFC 2326)
E-mail (MIME/SDR) and also web pages

RFC 2327 only defines the standard set of variables that describe multicast
sessions but does not specify a transport mechanism for these variables.
Several transport mechanisms can be used:
The Session Announcement Protocol (SAP) can be used to carry the session
info.
The Session Initiation Protocol (SIP) is primarily a signaling protocol for
Internet conferencing, Internet telephony, event notification, and instant
messaging.
The Real Time Streaming Protocol (RTSP) is basically a control protocol in
a multimedia environment, typically used together with RTP/RTCP. RTSP
provides VCR-like functions but can also carry information about a multicast
session.
Finally, using a special MIME definition, even e-mail can carry session
variables. Session information can also be stored on web pages
because HTTP is also MIME aware.

Security
Receiver identification
Generally not needed except for security and feedback mechanisms (QoS)
Provided by RTCP
Applications might use unicast return messages
Multicast flows from the sender and from receivers may be encrypted for security reasons
If receivers are not known to the sender, the encryption may be done only one way

When RTP is used, the co-protocol RTCP can transport
identification information from the receivers to the senders either via unicast or
multicast. This is very important in enhanced security environments such as
a conferencing environment.

Multicast Routing Basics

Multicast Routing Basics
Opposite function to traditional unicast routing:
Unicast routing calculates the path to the destination of the packet
Multicast routing calculates the path to the origin of the packet
Basic algorithm: Reverse Path Forwarding (RPF)
Prevents forwarding loops
Ensures shortest path from source to receivers

The very basic message is: Multicast routing tries to find the best interface
to the source! Traditional unicast routing, on the other hand, wants to
determine the best interface to the destination.
The closest interface to the source is necessary in order to check whether a
multicast packet indeed arrived on the upstream interface, that is, an interface
which belongs to the MDT. This check is called "Reverse Path Forwarding" (RPF),
which is explained next.

In Other Words...
Multicast routing:
Which is the best path to the source?
Prevent multicast storms: Tree!
Routers do "Reverse Path Forwarding" (RPF)
Forwards a multicast packet only if received on the upstream interface to the source
Check source IP address in the packet against routing table to determine upstream interface

Unicast routing is busy determining the best path to any destination. Multicast
routing works backwards, looking for the best path to the source. The
problem with IP multicast packets is that forwarding might lead to the same
problem as in bridged Ethernet LANs with redundant links: Broadcast Storms
(although we should call them "Multicast Storms" here).
But routers have one big advantage over bridges: They know the topology of
the network. Hence Multicast Storms can be easily avoided by applying a
simple algorithm known as "Reverse Path Forwarding" (RPF). Using RPF,
each router that receives a multicast packet checks (using its routing table)
whether the receiving interface is actually the closest to the source. If this is
the case then this interface is an upstream interface and the packet can be
forwarded through all other interfaces.
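The RPF decision can be sketched in a few lines. This is a minimal illustration, assuming the unicast routing table is a plain dictionary from prefix to interface name and that the best route is the longest matching prefix; the function names and the data layout are hypothetical and not taken from any router implementation.

```python
import ipaddress

def rpf_interface(unicast_routes, source):
    """Interface of the longest-prefix route toward source, or None."""
    src = ipaddress.ip_address(source)
    best = None
    for prefix, iface in unicast_routes.items():
        net = ipaddress.ip_network(prefix)
        if src in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, iface)
    return best[1] if best else None

def rpf_forward(unicast_routes, oil, source, in_iface):
    """Interfaces to forward on; an empty list means the RPF check failed."""
    if rpf_interface(unicast_routes, source) != in_iface:
        return []  # arrived on a non-upstream interface: discard silently
    return [iface for iface in oil if iface != in_iface]
```

A packet from 20.0.0.1 that arrives on the interface the routing table uses to reach 20.0.0.0/8 passes the check; the same packet arriving on any other interface is dropped.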

RPF Check
Router forwards multicast packet only if it was received on the upstream interface to the source
Then this packet is already on the distribution tree
Utilizes unicast routing table to determine the nearest interface to the source
RPF check fails: packet is silently discarded
RPF check succeeds: packet is forwarded according to the OIL

The RPF check succeeds when the multicast packet arrived on the interface that
the unicast routing table specifies for reaching the source. Otherwise, the RPF
check fails.
Thus, RPF ensures that multicast packets will follow the shortest path from
the source to the receivers and ensures that there are no loops on that path.
The determined RPF interface for a specific source is used until the next RPF
check is performed. Note that the unicast topology might change between RPF
check events. Therefore the RPF check interval should not be too large.
On Cisco routers the RPF check is performed every five seconds by default.
Each multicast router must maintain an Outgoing Interface List (OIL or
oilist) which contains all downstream interfaces. In the OIL each interface is
associated with a TTL threshold.

RPF Check
RPF Check prevents duplicate forwarding
Look one step ahead
Determine if outgoing link is on upstream path for the next router
Avoids any duplicates
[Figure labels: 20.0.0.1, 224.0.0.1, RPF Check failed]

RPF can be enhanced by looking "one step ahead". That is, duplicate
packets can be avoided if packets are only forwarded on links which are
upstream to the next router. This can easily be calculated using a link state
protocol.
Cisco routers perform an RPF check every 5 seconds by default.
A so-called "Outgoing Interface List" (OIL) contains interfaces pointing to
Multicast neighbors
Receivers
Administratively pre-configured interfaces
Using the OIL allows a quick decision on which interfaces the packet
should be forwarded. If the OIL is empty, then a "Prune" message is
sent to the upstream router.

Multicast Scoping using TTL
Packet's TTL is decremented by 1 when packet arrives at incoming interface
Then the packet is forwarded according to the OIL, which also contains TTL thresholds per interface
May be configured to limit the forwarding of multicast packets with TTL>threshold
Default threshold = 0 (no threshold)
[Figure: Company Network (TTL=64) with the domains Management (TTL=16), Engineering (TTL=16) and Marketing (TTL=8); TTL thresholds shown: 64, 16, 8]

Setting thresholds on multicast routers makes it possible to define boundaries
for certain multicast traffic. This is useful, for example, to keep external
multicast traffic out of one's own network, or to prevent private multicast
packets from leaving one's own domain.
TTL thresholds may be set on each interface. A zero TTL threshold means no
threshold is set. When a multicast packet arrives at a router's interface, its TTL
is decremented by one. If the resulting TTL is less than or equal to zero, this
packet is dropped. (This rule is exactly correct on Cisco routers. Other vendors
might define other TTL handling rules.) Then the router determines the
outgoing interfaces (using RPF check and OIL).
If a specific interface has a nonzero TTL threshold, then the packet's
TTL is checked against this TTL threshold. Only if the packet's TTL is greater
than or equal to the specified threshold is the packet forwarded out of this
interface.
In the example above there are three autonomously managed domains
(management, engineering, and marketing) which have their own TTL
threshold set on their respective boundaries. For example, multicast packets in
the engineering domain will be originated with a TTL of 15 and cannot leave
this domain. Additionally, company-wide multicast packets might be sourced
using a TTL of 63.
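The TTL handling described in the notes can be sketched as follows. This is a minimal sketch that follows the notes' rule (decrement first, then forward if the TTL is greater than or equal to a nonzero threshold); the function name and the dictionary layout of the OIL are assumptions made for illustration.

```python
def ttl_scoped_interfaces(packet_ttl, oil_thresholds):
    """Apply TTL scoping to one incoming multicast packet.

    oil_thresholds maps outgoing interface name -> TTL threshold,
    where 0 means that no threshold is configured.
    Returns (new_ttl, list of interfaces the packet may leave on).
    """
    ttl = packet_ttl - 1          # decremented on arrival
    if ttl <= 0:
        return ttl, []            # packet is dropped
    allowed = [iface for iface, threshold in oil_thresholds.items()
               if threshold == 0 or ttl >= threshold]
    return ttl, allowed
```

With a boundary threshold of 16, a packet originated with TTL 15 stays inside the engineering domain, while a company-wide packet sourced with TTL 63 still crosses that boundary.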

Multicast Scoping using Addresses
Scoping via TTL thresholds relies on the TTL configurations
Might be unknown or unpredictable
Administrative boundaries can be created using address scoping
Traffic which does not match the ACL cannot pass this interface
In both directions!

When TTL scoping is used together with broadcast and prune multicast
protocols, any router discarding multicast packets cannot prune any upstream
source anymore. Additionally, TTL-based multicast scoping does not support
overlapping zones.
Address scoping makes it possible to establish "administrative" multicast
boundaries based on the group address. This method is much more flexible than TTL
scoping. Any multicast packet that does not match an ACL (which must be
specified) is dropped, no matter from which direction the packet came.
Overlapping zones can now be implemented, but this requires
different address spaces within those zones. However, this might result in a
complex administration task.
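An administrative boundary can be pictured as a filter on the group address that ignores the packet's direction. The sketch below assumes an ordered ACL of (action, prefix) pairs with an implicit deny at the end, which mirrors a common router ACL convention; the function name and the ACL representation are hypothetical.

```python
import ipaddress

def boundary_permits(acl, group):
    """True if a packet for this group may cross the boundary interface.

    acl is an ordered list of ("permit" | "deny", prefix) pairs; the
    first matching entry decides, and anything unmatched is denied.
    The same answer applies to inbound and outbound packets.
    """
    address = ipaddress.ip_address(group)
    for action, prefix in acl:
        if address in ipaddress.ip_network(prefix):
            return action == "permit"
    return False  # implicit deny
```

For example, an ACL that denies 239.1.0.0/16 and permits the rest of 224.0.0.0/4 blocks all 239.1.x.x groups at the interface while letting other groups pass.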

Administrative Boundaries
[Figure, top: Serial0 is an administrative boundary for all 239.1.0.0/16 packets; 239.1.x.x traffic on both sides]
[Figure, bottom: Company Network (239.200.x.x) with Management (239.195.x.x), Engineering (239.195.x.x) and Marketing (239.196.x.x); boundaries 239.192.0.0/10, 239.195.0.0/16 and 239.196.0.0/16]

As shown in the picture at the top, multicast packets for a specified group
address (or range, as ACLs are used) cannot pass this interface in either
direction.
The bottom example shows how three administrative domains can be
multicast-isolated from each other while it is still possible to receive the
company's multicast 239.200.x.x anywhere within its boundaries.
Additionally it can be seen (look at the management and engineering clouds)
that several zones can use the same address boundaries.


Shared Tree
(*, G) = (*, 224.1.1.1) and (*, 224.2.2.2)
[Figure labels: 30.0.0.3, 20.0.0.2, Rendezvous Point (RP), Shared Tree, groups 224.1.1.1 and 224.2.2.2]

Shared trees utilize a so-called "Rendezvous Point" (RP), which distributes
multicast traffic to its attached receivers. The idea is similar to the
supermarket principle: "Customers should not have to visit every manufacturer
but rather buy everything at the shop around the corner."
In this sense, the RP acts as a supermarket and offers multicast traffic from
several sources. Typically, each RP is a leaf of an SPT, which is rooted at a
source. That is, the shared tree principle is mostly used in combination with an
SPT.
Shared trees consume memory of order O(G) but might result in sub-optimal
paths from the source to all receivers. Furthermore they may introduce extra
delay. Thus, only a clever combination of both SPT and shared trees might be
most efficient. As explained later in this chapter, this led to the development
of "PIM-SM".

Multicast Routing Protocols

By now, the student should have noticed that multicast-enabled routers
maintain so-called (S, G) and (*, G) entries in their mroute table.
(S,G) entries: For this particular source S sending to this particular group G,
traffic is forwarded via the shortest path from the source.
(*,G) entries: For any (*) source sending to this group G, traffic is forwarded
via a meeting point for this group.
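The two entry types can be pictured as keys in a lookup table. The following sketch assumes the mroute table is a dictionary keyed by (source, group), with "*" standing for any source, and that a specific (S, G) entry is preferred over the (*, G) entry when both exist; the function name and the table layout are illustrative assumptions.

```python
def lookup_mroute(mroutes, source, group):
    """Return the forwarding state for a packet from source to group.

    A source-specific (S, G) entry wins over the shared (*, G) entry.
    None means the router holds no state for this group.
    """
    if (source, group) in mroutes:
        return mroutes[(source, group)]
    return mroutes.get(("*", group))
```

A packet from a source without its own (S, G) entry falls back to the (*, G) entry and is therefore forwarded along the shared tree.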

Multicast Protocol Types
Dense Mode: Push method
Initial traffic is flooded through whole network
Branches without receivers are pruned (for a limited time period only)
Sparse Mode: Pull method
Explicit join messages
Last-hop routers pull the traffic from the RP or directly from the source

Multicast routing protocols are either dense mode or sparse mode.
The dense mode principle uses a "push" method to create the distribution tree.
Multicast packets are flooded throughout the network and each router creates
its OIL using the RPF check and "prune" messages to cut off unnecessary
branches of the tree. That is, after the initial flood, branches without receivers
are pruned. But after a timeout, traffic is flooded throughout the network again.
Typically a flood and prune occurs every 3 minutes.
The sparse mode principle uses the opposite method in that routers which
want to be part of the tree must send explicit "join" messages. Thus, the sparse
mode supports a "pull" method for tree establishment. Note: Branches
without receivers never get any multicast traffic!

Multicast Protocols Overview
DVMRP    Distance Vector Multicast Routing Protocol
MOSPF    Multicast OSPF
PIM-DM   Protocol Independent Multicast - Dense Mode
PIM-SM   Protocol Independent Multicast - Sparse Mode
CBT      Core Based Trees
...and others...

DVMRP: Version 1 (RFC 1075) was used in the early MBONE and is
obsolete and unused today. DVMRPv2 is the current implementation and is
used throughout the MBONE, although it is only an "Internet-Draft". Version
3 is under development.
Although MOSPF (RFC 1584) is the only multicast routing protocol which is
found on the RFC standards track, most experts (even John Moy) think that
more research is needed. Actually MOSPF is not really implemented
anywhere.
PIM-DM (Internet-Draft) is useful for small networks only, as it is not really
scalable.
PIM-SM (RFC 2362, v2) is the most sophisticated and usable multicast
routing protocol, supporting any underlying unicast routing protocol. Cisco
recommends PIM-SM for today's multicast applications.
Other proposals include CBT (RFC 2189), OCBT, QOSMIC, SM, etc., and
are mostly of academic interest. New technologies might be expected in the
coming years.

What is what?
DVMRP
MOSPF
PIM-DM
PIM-SM
CBT
Dense Mode
Sparse Mode

Dense mode operation has been used because of the ease with which all multicast
data reaches the end users. With sparse mode operation each user has to explicitly
request the needed data; if these messages are lost, the rest of the network is
not affected. But special measures have to be taken to assure the delivery of
multicast data to all users.
When interconnecting sparse and dense mode domains there is still the
problem of legacy dense mode receivers. No data is forwarded from the SM
to the DM domain because routers in the DM domain do not have any
forwarding states for any (*) sources, groups and interfaces.
Even if the receiver knows the group and wants to join, the IGMP join message
is discarded by the nearest router as it does not have any state for the group.
For Cisco-based dense mode domains there is a workaround using an ip
igmp helper-address command. For mrouted domains the only possibility
is to start sending data to the group: data is flooded over the DM
domain including the border router, forwarding states are created and the SM
domain learns about receivers wanting that data.
Actually this problem applies to SDR announcements only. All the other
widely used MBONE tools run RTCP, which always sends some control
messages to the multicast group address; that is, you are always a sender
whenever you join any group using these tools. Here the problem of DM
non-pruners is solved completely and democratically enough.

DVMRP - Facts
Dense mode protocol (Prune and Graft)
Distance Vector announcements of networks
Similar to RIP but classless
Infinity = 32 hops
Creates Truncated Broadcast Trees (TBTs)
Each source network in the DVMRP cloud produces its own TBT
Source Tree principle

DVMRP is quite similar to RIP but it also carries subnet masks for each
network and allows for 31 hops; 32 hops is considered "unreachable".
DVMRP is basically a dense mode protocol and therefore immediately floods
the whole network with traffic, while routers create a tree using RPF. Soon,
prune messages cut down the tree to a necessary size. Therefore this tree is
called a "Truncated Broadcast Tree" (TBT).
Note that prune messages do not destroy the tree, they just stop the traffic.
Prunes periodically time out and therefore cause a reflooding of packets.
DVMRP routing information is carried inside IGMP (IP protocol 2)
packets. The IGMP type code for DVMRP is 0x13. That is, analyzing
DVMRP packets requires sniffing of IGMP packets and further decoding.
However, Ethereal can do that easily...
If two routers share the same Ethernet segment, then the router with the lower
IP address on that segment will forward multicast traffic. This is determined
through routing updates between the routers.
DVMRP is similar to PIM DM because both protocols use the broadcast and
prune mechanism and a unicast routing table for RPF checks. DVMRP
builds its own unicast routing table while PIM DM utilizes an underlying
unicast routing protocol to build a multicast routing table.

DVMRP - Flood
[Figure: routers between 50.0.0.1/50.0.0.2 and 30.0.0.1/30.0.0.2 exchanging DVMRP updates with metrics 1, 2, 3 and poison reverse metrics 33, 34, 35]
Special Poison Reverse message is sent to the upstream neighbor to indicate that this router is downstream
DVMRP updates create truncated broadcast tree (TBT)
In case of same metrics, the lower IP address wins

This picture shows the creation of an SPT using DVMRP.
A TBT is built for each source subnet. The source subnet interface is a
dedicated interface.
Each router simply announces the distance to any source network, like any
other routing protocol. But note that some routers receiving such updates reply
with a special poison reverse message, indicating that they are indeed
"downstream" in the tree.
The poison reverse message contains a distance of 32 plus the distance received
in the neighbor's previous announcement.
If DVMRP updates are received on two different interfaces, only the interface
closer to the source is considered as "upstream". If there is more than one
upstream interface, the IP address of the sender (connected to the announced
source network) is used as tie breaker: the lowest address wins.
Also, if two (or more) routers are attached to the same LAN segment and
announce the same distance to each other, the router having the higher address
will stop sending.
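The poison reverse metric and the upstream tie breaker can be sketched as follows. This is a minimal sketch of the rules stated in the notes (infinity of 32, lowest metric wins, lowest IP address breaks ties); the function names and the list-of-tuples input are assumptions made for illustration.

```python
import ipaddress

INFINITY = 32  # DVMRP "unreachable" metric

def poison_reverse_metric(received_metric):
    """Metric sent back upstream to signal 'I am downstream of you'."""
    return INFINITY + received_metric

def choose_upstream(adverts):
    """Pick the upstream neighbor for one source network.

    adverts is a list of (neighbor_ip, metric) pairs. The lowest metric
    wins; on equal metrics the numerically lowest IP address wins.
    Returns None if every advertised metric is unreachable.
    """
    reachable = [(metric, int(ipaddress.ip_address(ip)), ip)
                 for ip, metric in adverts if metric < INFINITY]
    if not reachable:
        return None
    return min(reachable)[2]
```

A router that learned a source network with metric 2 would answer with metric 34, which matches the 33/34/35 values visible in the figure.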

DVMRP - Source Tree
[Figure labels: 50.0.0.2, 50.0.0.1, 30.0.0.2, 30.0.0.1]
Source tree established. Traffic is multicast.

This picture shows how the multicast SPT has been established and traffic is
forwarded downstream.
Side-note: The old MBONE currently uses dense mode operation, where data
is flooded everywhere and each end user has to explicitly refuse to accept
it (pruning). If the prune message is lost or not sent for any reason, the whole
network suffers from a continuous, unneeded data flow.

DVMRP - Prune
[Figure: Prune messages sent upstream; labels 50.0.0.2, 50.0.0.1, 30.0.0.2, 30.0.0.1]
Some routers are leaf nodes (have no receivers) and send a "(S,G) prune" message

Now, some routers notice that there are no receivers attached to them.
Therefore they send a "Prune" message upstream, in order to truncate the tree.
Note that a first-hop router (which is directly connected to a source) will never
send a prune message upstream (i.e. to the source).

DVMRP - TBT
[Figure labels: 50.0.0.2, 50.0.0.1, 30.0.0.2, 30.0.0.1]
Source tree remains established but traffic is pruned

After the pruning, the tree is truncated to its smallest usable size.
But note: although we call it a "Truncated Broadcast Tree" (TBT), the
tree is not really destroyed! Only the forwarding of the traffic has been
turned off. The tree is still there!
Each upstream router which received a prune message maintains a "state" for a
certain period (3 minutes by default). After it expires, the traffic is
flooded again and another prune message might be sent.

DVMRP - Graft
[Figure: Graft messages sent upstream; labels 50.0.0.2, 50.0.0.1, 30.0.0.2, 30.0.0.1]
If some hosts again belong to a group, they notify their router and the pruned state is removed by a "graft (S,G)" message

If some group appears again at a router which recently sent a prune message,
then the router can remove the "state" on its upstream neighbor by sending a
"Graft" message, which specifies (S, G).
A graft message is acknowledged by the router and then forwarded to the next
router. Soon the multicast traffic flows down the tree again.
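The life cycle of a prune state (set by a prune, cleared by a graft or by the timeout) can be sketched with a small class. This is an illustration under stated assumptions: the class and method names are hypothetical, time is passed in explicitly as seconds, and the 180-second lifetime is the 3-minute default mentioned in the notes.

```python
class PruneState:
    """Per-interface prune state kept by an upstream router for one (S, G)."""

    def __init__(self, lifetime_s=180):
        self.lifetime_s = lifetime_s
        self.pruned_at = {}  # interface -> time the prune was received

    def prune(self, iface, now):
        """A downstream router asked to stop the traffic on iface."""
        self.pruned_at[iface] = now

    def graft(self, iface):
        """A downstream router wants the traffic again: drop the state."""
        self.pruned_at.pop(iface, None)

    def is_forwarding(self, iface, now):
        """Traffic flows unless an unexpired prune state exists."""
        pruned = self.pruned_at.get(iface)
        return pruned is None or now - pruned >= self.lifetime_s
```

The tree itself never disappears in this model; only the forwarding decision per interface changes, which is the point made about the TBT.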

DVMRP Facts
Significant scaling problems
Slow Convergence (RIP-like)
Significant amount of multicast routing state information stored in routers
No support for shared trees
Maximum number of hops < 32
Used in the MBONE
Today worldwide available and accessible
Virtual network through IP tunnels

Every router has to store the (S,G) information, which is very memory
demanding. Therefore DVMRP does not scale well. Furthermore, shared
trees are not supported at all, and the maximum path length is limited to fewer
than 32 hops. DVMRP has been used in the MBONE.
The MBONE (multicast backbone) is used to transmit conference
proceedings and for desktop video conferencing. Multicast routing and
forwarding is provided by tunnels between dedicated devices. The MBONE
caused significant disruption to the Internet when popular events were active.


MOSPF
Useful only in OSPF domains
Include multicast information in OSPF link states
Group Membership LSAs flooded throughout OSPF routing domain
Each router knows complete network topology!
MOSPF Area Border Routers (MABR) would improve performance
Significant scaling problems
Dijkstra algorithm run for EVERY multicast (SNet, G) pair!
Only a few (S,G) should be active
No shared tree support
Not used

MOSPF denotes the Multicast Extensions to OSPF and is described in RFC
1584. It only works in OSPF domains and suffers from significant scaling
problems. A rerun of Dijkstra's SPF algorithm is necessary on flapping links
and on changes of group membership.
MOSPF is not supported by any vendors today (as far as the author knows).

PIM-DM
Protocol Independent
Utilizes any underlying unicast routing protocol
Similar to DVMRP but
No TBT because no dedicated multicast protocol in use
Instead: RPF, flood and prune is performed
For small networks only
Every router maintains (S, G) states
Initial flooding causes duplicate packets on some links
Easy to configure
Two command lines
Useful for small trial networks

Protocol Independent Multicast - Dense Mode (PIM-DM) supports any
underlying unicast routing protocol, including static, RIP, IGRP, EIGRP, IS-IS,
BGP, and OSPF.
When a PIM-DM router receives multicast traffic via its (upstream) RPF
interface it forwards the multicast traffic to all of its PIM-DM neighbors.
But then, the next-hop routers might receive packets also on non-RPF
interfaces! Clearly this silly method results in duplicate packets on some
links. These non-RPF flows are normal for the initial flooding of data and will
be corrected by a PIM DM pruning mechanism.
Special "assert" messages are used to prune another router's interface. Flood
and prune is performed every 3 minutes. If the metric is equal, then the
highest IP address on an interface wins.
Note: PIM-DM can be used together with DVMRP.
PIM-DM is easy to configure; only two commands are necessary.
PIM-DM is an Internet-Draft.

PIM-DM: Initial Flooding
Duplicate packets!!!
(S, G) state in each router

The example above shows some routers which receive packets on non-RPF
interfaces. The routers will discard these packets because only packets received
through the upstream interface are considered good packets.
Duplicate packets can occur on some links during the initial flooding of data
and will be removed by a PIM DM pruning mechanism, which follows in the next
step.
Also note that each router must maintain an (S, G) state.

PIM-DM: Pruning
[Figure labels: Prune, Prune (Assert)]
Still (S, G) state in each router!
Pruned because unwanted traffic!
Pruned because duplicate packets on LAN segment!

Pruning occurs after the initial flooding (which is done every 3 minutes by
default) and serves two purposes: First, branches can be cut off when there
are no further receivers downstream; secondly, a router can use a so-called
assert message to stop another router from sending packets to its non-upstream
(i.e. non-RPF) interface. The latter method forces the other router to prune its
own interface.
Again: The prune state lasts three minutes by default. Then a new flooding
occurs over all links!
Note that each router must still maintain the (S, G) state.

PIM-DM: Assert Mechanism
Each router receives the same (S, G) packet through an interface listed in the oilist
Only one router should continue sending
Both routers send "PIM assert" messages
To compare administrative distance and metric to source
If assert values are equal, the highest IP address wins
[Figure: Packets are received on multi-access oilist interfaces. One router sends "Assert 120:3", the other "Assert 120:2". "Okay, you won! I will prune my interface..." / "Sweet! I will serve this LAN segment..."]

The PIM assert mechanism is used to eliminate duplicate flows on the same
multi-access segment. Unlike DVMRP (which establishes a TBT in
advance using a dedicated multicast routing protocol), the assert mechanism is
only performed when duplicate packets appear on this link.
When a router receives an (S, G) packet via a multi-access interface which is
listed in the (S, G) oilist, then it will send an assert message, telling the other
router a so-called assert value.
The assert value contains both the administrative distance of this router and
the metric toward the source. The administrative distance is the
high-order part of this assert value. The other router also sends an
assert message.
Now both routers compare these values to determine who has the best path
(i.e. lowest value) to the source. If both values are the same, the highest IP
address is used as tiebreaker. Losing routers prune their interface, whereas the
winning router continues to forward multicast traffic onto the LAN segment.
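The assert comparison can be written down compactly. The sketch below follows the rules from the notes: the administrative distance is the high-order part, the metric the low-order part, the lowest value wins, and the highest IP address breaks a tie. The function name and the tuple layout are assumptions made for illustration.

```python
import ipaddress

def assert_winner(router_a, router_b):
    """Return the IP address of the router that keeps forwarding.

    Each router is a tuple (ip, admin_distance, metric).
    """
    def sort_key(router):
        ip, admin_distance, metric = router
        # Negate the address so that min() prefers the highest IP on a tie.
        return (admin_distance, metric, -int(ipaddress.ip_address(ip)))
    return min(router_a, router_b, key=sort_key)[0]
```

In the slide's example, "Assert 120:2" beats "Assert 120:3", so the router with metric 2 serves the LAN segment and the other one prunes its interface.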

PIM-SM
Protocol Independent
Utilizes any underlying unicast routing protocol
Supports both source and shared trees
Uses a Rendezvous Point (RP)
Sources are registered at RP by their first-hop router
Groups are joined by their local designated router (DR) to the shared tree, which is rooted at the RP
Best solution today
Optimal solution regardless of size and membership density
Variants
Bidirectional mode (PIM-bidir)
Source Specific Multicast (SSM)

Protocol Independent Multicast Sparse Mode (PIM-SM) has been
defined in RFC 2362 and is the most useful multicast protocol today. PIM-SM
relies on an explicit pull concept. Traffic is only forwarded to receivers that ask
for it (i.e. send a join message).
PIM-SM utilizes a Rendezvous Point (RP), at which a shared tree to the
groups is rooted. The groups are joined by their local designated router (DR) to this
shared tree. Basically, PIM SM uses shared distribution trees, but it may also
switch to the source-rooted distribution tree.
Sources are registered at RPs by so-called register packets, which are created
by the first-hop routers, closest to the source. A single copy of the multicast
packet is sent through the RP to the registered receivers. Group members are
joined to the shared tree by their local designated router. A shared tree that is
built this way is always rooted at the RP.
By the way: PIM-SM is the only solution recommended by Cisco.
The bidirectional PIM mode (PIM-bidir) has been designed for many-to-many
applications such as conferencing and whiteboarding.
Source Specific Multicast (SSM) is a variant of PIM-SM that only builds
source-specific shortest path trees. This solution does not need an active RP
and uses the source-specific group address range 232/8.

PIM-SM / User becomes active
[Figure: receiver joins group "G"; the DR knows the RP and sends Join (*,G) toward the RP]

User joins group: The picture above shows how a receiver tells its
designated router (DR) that it has become active and wants to listen to group G.
This is done using IGMP on the local LAN segment.
The DR sends a Join (*, G) to the RP. Obviously the DR must know the IP
address of the RP. Obviously the DR does not need to know the IP address of
the source. Obviously the human receiver should at least know what he wants
to listen to.

PIM-SM / Create Shared Tree
[Figure: Join (*,G) messages travel hop by hop to the RP]
Join message creates branch of shared tree

Shared tree to RP: This (*, G) join message is forwarded hop-by-hop toward
the RP and thereby a branch of the shared tree is established. Now multicast
traffic for group G may flow down the shared tree to the receiver.

PIM-SM / Register Source
Source sends multicast traffic
Designated router encapsulates multicast traffic in unicast "register" packets
RP decapsulates register packets and forwards them down the shared tree

DR registers at RP: The source (for G) becomes active and sends multicast
packets, which are encapsulated by the first router (DR) into unicast packets.
These "register" packets are sent to the RP. Obviously this DR must also
know the IP address of the RP.
The RP decapsulates these packets and forwards the multicast packets (which
had been carried inside the register packets) downstream to the group G.

PIM-SM / Create Source Tree
[Figure: RP sends Join (S, G) toward the source]

RP joins SPT: Now the RP creates a shortest-path tree (SPT) by sending an
(S, G) join toward the source. (S, G) states are now created in all routers along
this new SPT path.
Note: The RP must also maintain an (S, G) state.

PIM-SM / Create Source Tree
[Figure: RP sends Register Stop (S, G); source tree (S, G) established]

RP stops registering: As soon as native multicast packets arrive at the RP
(over the newly established SPT) the RP sends a "Register Stop (S, G)"
message to the first-hop router, in order to stop the sending of unnecessary
register packets.

PIM-SM / Switchover
[Figure: last-hop router sends Join (S, G) directly toward the source, bypassing the RP]

Shortcut: PIM-SM is able to switch over to the shortest connection to the
source. That is, last-hop routers (i.e. routers with directly connected members)
can switch to the Shortest-Path Tree and bypass the RP if the traffic rate is
above a configured threshold called the "SPT-Threshold".
Note: The default value of the SPT-Threshold in Cisco routers is zero.
Therefore the default behaviour for Cisco PIM-SM leaf routers is to
immediately join the SPT to the source as soon as the first packet arrives via
the (*,G) shared tree.
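The switchover decision of a last-hop router can be sketched as a simple predicate. This is a minimal sketch, assuming the router tracks how many packets arrived via the shared tree and an estimate of the traffic rate in kbit/s; the function and parameter names are hypothetical.

```python
def should_join_spt(packets_seen, rate_kbps, spt_threshold_kbps=0):
    """Decide whether a last-hop router leaves the shared tree for the SPT.

    A threshold of zero means: join the SPT as soon as the first packet
    arrives via the (*, G) shared tree. Otherwise the measured rate has
    to exceed the configured threshold.
    """
    if packets_seen == 0:
        return False
    if spt_threshold_kbps == 0:
        return True
    return rate_kbps > spt_threshold_kbps
```

With the default of zero the first packet already triggers the (S, G) join, which matches the Cisco default behaviour described above.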

PIM-SM / Pruning
[Figure: Prune (S, G) sent up the shared tree toward the RP]

Disconnect from RP: Now, special (S, G) RP-bit prune messages are sent
up the shared tree to prune only the (S, G) traffic from this shared tree. This
prune is important to avoid duplicate packets.
RP may disconnect from source DR: When the (S, G) prune (with RP-bit
set) arrives at the RP, the RP sends (S, G) prune messages back toward the
source to stop the unnecessary (S, G) traffic.
Note: Of course the RP may only do this if the RP has received an (S, G)
RP-bit prune via all branches, i.e. no receiver on the shared tree wants to
receive the (S, G) traffic from the RP anymore.
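The condition in the note boils down to a set comparison. The sketch below assumes the RP keeps a list of its shared-tree branches (outgoing interfaces) for the group and a list of branches on which an (S, G) RP-bit prune has arrived; both names are illustrative.

```python
def rp_may_prune_source(shared_tree_branches, rp_bit_prunes_received):
    """True if every shared-tree branch has pruned the (S, G) traffic.

    Only then may the RP send its own (S, G) prune toward the source.
    """
    return set(shared_tree_branches) <= set(rp_bit_prunes_received)
```

As long as a single branch still wants the source's traffic via the shared tree, the RP has to stay joined to the SPT.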

PIM-SM Summary
Now we learned:
PIM-SM can also create SPT (S, G) trees
But in a much more economical way than PIM-DM (fewer forwarding states)
PIM-SM is:
Efficient, even for large scale multicast domains
Independent of underlying unicast routing protocols
Basis for inter-domain multicast routing used with MBGP and MSDP

Please consider the following issues:
PIM-SM can be used efficiently for both sparse and dense distribution of
multicast receivers.
There is no need to flood multicast traffic at any time.
On the other hand, an RP is needed at least for the initial setup of an MDT.
PIM-SM can also work together with DVMRP.

Addendum: Bidir-PIM
Fewer router states
Only one (*, G) for multiple sources
No (S, G)
Same tree for traffic from sources toward RP and from RP to receivers
Trees may scale to an arbitrary number of sources
Now bidirectional groups
Coexist with traditional unidirectional groups
All routers must recognize them (via PIMv2 flags)
Dedicated bidir RP required
Designated Forwarder (DF) required
No register packets anymore
Knows best unicast route to RP
DF needed on any link between participant and RP
Bidir-PIM was introduced with Cisco IOS version 12.1(2)T (5/00). Traditional PIM-SM
is unidirectional, that is, the traffic from sources to the RP is encapsulated in register
packets. But this encapsulation and decapsulation consumes a significant amount of
CPU power. Additionally, the SPT which is built between RP and source (initiated by
the RP) requires (*, G) and (S, G) entries on routers between RP and source.
Using a many-to-many multicast model (where each participant is both receiver and
sender) the (*, G) and (S, G) entries appear everywhere along the path between participants
and the associated RP. This results in a significant RAM and CPU overhead and may
become a significant issue, for example with stock trading applications where thousands
of stock market traders perform trades via a multicast group.
Bidirectional PIM avoids both encapsulation and (S, G) states. The trick is to ensure
that the path taken by packets flowing from the participant (source or receiver) to the RP
and the reverse path are the same, so only (*, G) states are necessary!
Note: Regular PIM-SM groups may coexist with bidirectional groups.
A Designated Forwarder (DF) is needed on every link and knows the best unicast route
to the RP. The DF forwards both downstream and upstream traffic (from link to RP).
Like in normal PIM-SM the receivers send (*, G) Joins which are forwarded by the
last-hop DR toward the RP which is serving the group. But now the DF acts as DR. When a
router receives a join message for a bidirectional group the router must determine if it is
the DF for this link and for this group. The router inspects either the (*, G) state or, when
there is no (*, G) entry, the RP DF election information. The shared tree is established
between the receiver segments and the RP.
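
To get a feeling for the savings, the following sketch counts forwarding entries on one router. It assumes the worst case for PIM-SM, namely that the router lies on the path of every source and therefore holds one (S, G) entry per source in addition to the (*, G) entry; the function names are made up for this example.

```python
def pim_sm_entries(groups: int, sources_per_group: int) -> int:
    """Worst case: one (*, G) plus one (S, G) per source, for every group."""
    return groups * (1 + sources_per_group)


def bidir_pim_entries(groups: int) -> int:
    """Bidir-PIM keeps a single (*, G) per group, however many sources send."""
    return groups


# One trading group with 1000 participants that all act as sources.
print(pim_sm_entries(1, 1000))   # 1001
print(bidir_pim_entries(1))      # 1
```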

Addendum: PIM-SS
Source-Specific Multicast (SSM)
Much simpler when sources are well known
Immediate shortcut receiver to source
No need to create shared tree
DR sends (S, G) join directly to source
No MSDP needed for finding sources
IGMPv3 needed!
Or IGMPv3 lite
Or URL Rendezvous Directory (URD)
PIM-SS provides all benefits of PIM-SM but avoids shared trees. Instead,
source-specific shortest-path trees (SPTs) are built immediately upon receiving a
group membership report for a specified source. SSM is particularly recommended in
cases where there is a single source sending to a given group (one-to-many
applications).
Note that there is no need for RPs for SSM groups because the discovery of sources
is done via some other method (for example a web-based directory).
Before SSM, it was necessary to acquire a unique IP multicast group address for any
service a source would provide. This was necessary to ensure that different sessions
would not collide with each other on the same shared tree.
But when using SSM, traffic from each source is uniquely forwarded only via an SPT
and different sources may use the same SSM multicast group addresses.
Receivers must have IGMPv3 implemented in order to send an (S, G) report to the DR,
which forwards an appropriate PIM join message toward the source. IGMPv3 lite is a
lightweight interim solution to implement SSM.
The URL Rendezvous Directory (URD) communicates (S, G) information via
HTTP redirect messages (TCP port 659). That is, the browser of the receiver host is
redirected by a website to the well-known port 659 with the multicast group and
source address as parameters. The DR scans the traffic for this port and thereby
learns the address information.
PIM-SS is still a draft proposal (draft-bhaskar-pim-ss-00.txt).

SSM - Notes
Take care that no shared tree uses the same group address
SSM protocols cannot avoid address collisions
Register/Join packets to 232/8 should be filtered
The dedicated address range 232/8 has been reserved exclusively for SSM
SPTs. That is, no router may build a shared tree for any group having an
address from this range ("global well-known sources").
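
A filter for the 232/8 range is easy to express with Python's standard ipaddress module. The sketch below only illustrates the address test; the function names are assumptions and do not correspond to any router command.

```python
import ipaddress

# 232/8 written out as a full prefix.
SSM_RANGE = ipaddress.ip_network("232.0.0.0/8")


def is_ssm_group(group: str) -> bool:
    """True if the group address lies in the range reserved for SSM."""
    return ipaddress.ip_address(group) in SSM_RANGE


def allow_shared_tree_message(group: str) -> bool:
    """Hypothetical filter rule: drop Register and (*, G) Join messages for SSM groups."""
    return not is_ssm_group(group)


print(is_ssm_group("232.1.2.3"))
print(allow_shared_tree_message("225.5.5.5"))
```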

Inter-domain Multicast Routing
The GEANT network will provide multicast on all production routers. On the
backbone the multicasting is entirely native and sparse mode using PIM-SM.
Multicast runs on the same physical infrastructure together with unicast data.
Most of the connected National Research Networks also have native
connections. All connections between all participants are done via
PIM-SM/MSDP/MBGP. The previous TEN-155 Multicasting topology is adapted to
the GEANT topology in the backbone.

BGP Mcast Extensions
Border Gateway Multicast Protocol (BGMP)
Supports global, scalable inter-domain multicast
Only disadvantage: Far from completion!
MBGP/MSDP as intermediate solution
MBGP communicates multicast RPF information between AS's
MSDP distributes active source information between PIM-SM domains
The Border Gateway Multicast Protocol (BGMP) is far from completion
because of its complexity. Today, if really needed, the combination
MBGP/MSDP is used as an intermediate solution.
MBGP allows multicast RPF information to be exchanged between Autonomous
Systems (AS). The only difference from ordinary BGP-4 is a different NLRI
code in BGP messages, allowed by RFC-2283, for the so-called MBGP
multicast NLRIs. In other words, MBGP is only an extension to BGP. Since
MBGP cannot build multicast distribution trees, an additional protocol is used:
MSDP.
MSDP is utilized by a PIM-SM domain to tell another PIM-SM domain that
active sources exist. Then the routers of the other PIM-SM domain can send
(S, G) joins to interconnect sources and receivers in distant domains via
inter-domain branches of the SPT.

Note
ISPs often want to use a separate multicast topology
But PIM relies on underlying unicast routing protocol
Reverse path might be different
MBGP creates multicast database
Filled with multicast NLRIs=(S, G)
PIM-SM supposes one (closed) administrative multicast domain
MSDP sessions between RPs to interconnect multiple domains
Similar to eBGP (TCP)
Routers which communicate with each other using the RFC-2283 BGP extensions
("MBGP") can exchange routing information of several protocols (similar to
IS-IS, when configured to carry IP routing information). This functionality of
BGP can be used to exchange reachability information of multicast sources.
In the current DVMRP Mbone, tunnels are used to bypass routers in the
Internet that are not multicast capable. MBGP instead creates separate multicast
routing tables. Therefore, different unicast and multicast topologies may
exist: Some parts of the network can be used by unicast only, some by
multicast only.
When a router is sending unicast data it never looks into the multicast table.
But when the router wants to perform an RPF check for multicast packets it can
use both tables. When the multicast and unicast topologies are identical, the
multicast table is indeed useless, but if the topologies are different this second
(multicast) table is used to solve the RPF problem.
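
An RPF check over two tables could look roughly like this. The sketch assumes that the multicast table is consulted first and the unicast table serves as fallback; this lookup order, the table layout (prefix to interface name) and all names are assumptions made for illustration.

```python
import ipaddress


def longest_match(table: dict, source: str):
    """Return the interface of the most specific prefix covering the source, or None."""
    addr = ipaddress.ip_address(source)
    best_net, best_iface = None, None
    for prefix, iface in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_iface = net, iface
    return best_iface


def rpf_check(source: str, arrival_iface: str, multicast_table: dict, unicast_table: dict) -> bool:
    """Accept a multicast packet only if it arrived on the interface leading back to the source."""
    expected = longest_match(multicast_table, source)
    if expected is None:  # no multicast route: fall back to the unicast table
        expected = longest_match(unicast_table, source)
    return expected is not None and expected == arrival_iface


unicast = {"194.1.1.0/24": "eth0"}
multicast = {"194.1.1.0/24": "eth1"}   # the multicast topology differs from the unicast one
print(rpf_check("194.1.1.1", "eth1", multicast, unicast))
print(rpf_check("194.1.1.1", "eth0", multicast, unicast))
```

When the multicast table is empty the check degenerates to the ordinary unicast-based RPF check.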
PIM-SM supposes the existence of one administrative domain having one RP
where receivers can request multicast data. But in the real Internet everybody
wants to control their own domain. Therefore a new protocol was proposed:
MSDP, which allows RPs from different domains to inform each other about
active multicast sources.
MSDP runs over TCP, similar to eBGP. There is no need for a full mesh of TCP
connections.

MSDP
MSDP peering from source RP to
Border routers
Other AS's RP
If MSDP peer is an RP and has a (*, G) entry
This means there exists some interested receiver
Then an (S, G) entry is created and a shortcut to the source is made
Furthermore the receiver itself might switch over to the source
In the PIM Sparse mode model, multicast sources and receivers must register
with their local Rendezvous Point (RP). Actually, the closest router to the
sources or receivers registers with the RP, but the point is that the RP knows
about all the sources and receivers for any particular group. RPs in other
domains have no way of knowing about sources located in other domains.
MSDP is an elegant way to solve this problem. MSDP is a mechanism that
connects PIM-SM domains and allows RPs to share information about active
sources. When RPs in remote domains know about active sources they can pass
on that information to their local receivers and multicast data can be forwarded
between the domains.
The RP in each domain establishes an MSDP peering session using a TCP
connection with the RPs in other domains or with border routers leading to the
other domains. When the RP learns about a new multicast source within its
own domain (through the normal PIM register mechanism), the RP
encapsulates the first data packet in a Source Active (SA) message and sends
the SA to all MSDP peers. The SA is forwarded by each receiving peer using a
modified RPF check, until it reaches every MSDP router in the interconnected
networks, theoretically the entire multicast internet.
If the receiving MSDP peer is an RP, and the RP has a (*,G) entry for the
group in the SA (there is an interested receiver), the RP will create (S,G) state
for the source and join the shortest path tree for the source. The
encapsulated data is decapsulated and forwarded down that RP's shared tree.
When the packet is received by a receiver's last-hop router, the last-hop router may
also join the shortest path tree to the source. The source's RP periodically sends
SAs, which include all sources within that RP's own domain.
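
The reaction of a receiving RP to an SA message can be sketched as follows. The multicast routing table is modelled as a plain dictionary keyed by (source, group) tuples with "*" as wildcard; this data layout and the function name are assumptions for the example, not MSDP data structures.

```python
def handle_source_active(mroute: dict, source: str, group: str) -> bool:
    """Process one SA message at an RP; return True if the RP joins the SPT to the source."""
    if ("*", group) not in mroute:
        return False  # no interested receiver in this domain, nothing to do
    # An interested receiver exists: create (S, G) state and join toward the source.
    mroute[(source, group)] = {"joined_spt": True}
    return True


mroute = {("*", "225.5.5.5"): {}}
print(handle_source_active(mroute, "194.1.1.1", "225.5.5.5"))
print(handle_source_active(mroute, "194.1.1.1", "226.6.6.6"))
```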

MBGP/MSDP (1)
ASs establish multicast peering using MBGP
Via special Multicast RPF NLRI types
Used by PIM-SM to send (S, G) joins
MSDP tells all RPs about active sources
Using Source Active (SA) messages
Containing (S, G) information
[Figure: four autonomous systems (AS 1 to AS 4), each with an RP, connected via MBGP.
Labels: Register (194.1.1.1, 225.5.5.5); SA: 194.1.1.1, 225.5.5.5; Join (*, 225.5.5.5)]
Routers on the borders of domains establish MBGP peering and exchange
multicast RPF NLRI which is used by PIM-SM to determine which way to send
(S, G) joins.
RP routers establish MSDP peering and exchange information on active
sources and groups.
Note: Since BGP now has to deal with both inter-domain unicast NLRI and
inter-domain multicast NLRI, the resulting inter-domain unicast traffic paths
may differ from inter-domain multicast traffic paths.

MBGP/MSDP (2)
Receiver joined local RP
Via (*, G) message
Local RP joins source directly
Via (S, G) message
[Figure: AS 1 to AS 4, each with an RP. Label: Join (194.1.1.1, 225.5.5.5)]
The source sent register packets, thereby declaring (S, G), and the receiver
sent (*, G) join packets to the local RP.
Since the local RP learned about (S, G) via the previous SA messages, the RP
can directly join the SPT which is rooted at the remote DR by sending an
(S, G) join.

MBGP/MSDP (3)
Multicast traffic flows directly from the source to the receiver
Along an SPT downstream (to perhaps multiple receivers)
Note: DRs and intermediate routers are omitted for simplicity!
[Figure: AS 1 to AS 4, each with an RP]
Now the DR of the source can send the multicast traffic directly to the
receiver's RP over an SPT.

Reliable Multicast

What is this? Who needs it?
Reliable transmission means: no single bit gets lost over the MDT!
Traditional multicast can't guarantee that, and doesn't need to!
Audio and video do not care
But important for data-based applications
Whiteboarding
Efficient Usenet updates
Database synchronization
etc.
Also real-time demands
Financial data delivery
Traditional multicast was designed to distribute bulky voice and video
streams most efficiently. Clearly those are real-time protocols which
additionally must be transmitted isochronously.
Isochronous means: a piece of voice or video data (typically some 30-200
bytes) is only relevant at one instant of time, but a few seconds later it is
useless. This time relates to the playback buffers within the multicast
applications, and in case of interactive applications this time also depends on
the affordable response time.
For example during a Voice over IP (VoIP) conference, it is important that the
total round-trip time (RTT) is no longer than 0.5-1 seconds, otherwise the
users would get angry, as they cannot debate in a reasonable way.
But note that it is absolutely irrelevant if, let's say, 0.01% of all transmitted
bytes get damaged due to higher sun spot activity. You won't notice!
But pure data applications would also like to utilize IP multicast technology.
For example Usenet ("News") data could be efficiently updated to all servers
that belong to this world-wide network. Here, it is very important that each
single bit is transmitted reliably without being corrupted.
Soon we'll see that this is accomplished not by deploying all-optical networks
and quantum entangled photons (joke alert!) but with a simple but specialized
reliable multicast protocol.

Reliable Multicast (1)
Remember: IP multicast is UDP based!
No guaranteed packet delivery!
No congestion control
Not intended for data transactions!
RTP/RTCP only cares for
Duplicates
Sequence
Reliable multicast requires UDP-based acknowledgements
TCP cannot do multicast by nature (too much overhead, state variables, buffers, timers, ...)
Security issues for financial data delivery etc.!
Best-effort delivery results in occasional packet drops. Many real-time
multicast applications such as video and audio streaming may be affected by
these losses. On the other hand it is clearly useless to request a retransmission
for every lost piece of data.
However, some compression algorithms may be severely affected by even low
drop rates; this causes the picture to become jerky or to freeze for several
seconds while the decompression algorithm recovers.
Duplicate packets may occasionally be generated as multicast network
topologies change.

Reliable Multicast (2)
Guaranteed data delivery is provided by reliable multicast protocols
Still UDP based of course
But ACKs are additionally implemented:
Feedback loop
Data recovery mechanisms
Congestion control mechanisms
Remember that TCP cannot help to implement a reliable multicast
protocol as TCP supports only unicast transmissions.
This is because TCP maintains peer-specific timers and buffers in order
to process a very complex algorithm that supports reliability, high
performance, and network fairness.
Therefore reliable multicast must also be UDP based. But now additional
higher-layer functionality is introduced. The most important function is
the feedback loop, which is fundamental for data recovery mechanisms.
Optional (but recommended) are congestion control mechanisms,
which significantly enhance the performance of reliable multicast
implementations. Note that packets are typically dropped when
congestion occurs.

Feedback Loop
Either performed by the source
End-to-end feedback loop (latency!)
Intermediate devices don't need to be multicast aware
Receivers send NACKs back to source
Or locally
Hop-by-hop feedback loop
Intermediate "repair servers" cache packets for retransmissions
Nearest upstream server performs retransmission upon NACK
If not possible, NACK is sent to next upstream server
There are two basic methods used to implement reliable multicasting:
The first method requires all receivers to send a negative acknowledgement
(NACK) back to the source. Thus the source alone is responsible for
retransmissions. The intermediate routers do not need to be aware of reliable
multicast. This method simply employs an end-to-end feedback loop. Obviously,
this can lead to significant repair delays when the path between source and
receiver is long.
Note: Unlike a normal acknowledgement (ACK), a NACK is only sent
when a packet is missing. Remember that normal ACKs are sent (e.g. with
TCP) for each packet that arrives properly. But this would not scale! Imagine
thousands of receivers sending thousands of ACKs back to the single poor
source... this would kill it! Instead only NACKs are sent when packets are
missing.
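
A quick back-of-the-envelope calculation shows the difference. The numbers below are purely hypothetical, and the model assumes independent losses and no NACK suppression at all.

```python
def feedback_messages(receivers: int, packets: int, loss_rate: float):
    """Return (ACK count, NACK count) arriving at the source for one transmission."""
    acks = receivers * packets                       # one ACK per packet per receiver
    nacks = round(receivers * packets * loss_rate)   # one NACK per lost packet per receiver
    return acks, nacks


# 1000 receivers, 1000 packets, 1 % loss (made-up values).
print(feedback_messages(1000, 1000, 0.01))
```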
The second method employs special servers for retransmission
within the path between source and receivers. These "repair servers" are also
multicast receivers and copy each packet into their cache. When the cache is full
the next packets overwrite the oldest ones, following the FIFO principle. If a receiver
sends a NACK upstream, the first server which finds the missing packet in its cache
will perform the retransmission. Otherwise the NACK is forwarded upstream
to the next server. Hence the feedback loop is provided on a hop-by-hop basis.
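
The hop-by-hop idea can be sketched with a small class. This is a toy model under simplifying assumptions (packets identified by a sequence number, a fixed cache capacity, a single upstream server); the class and method names are invented for the example.

```python
from collections import OrderedDict


class RepairServer:
    """Toy repair server: FIFO packet cache plus NACK forwarding to the next upstream server."""

    def __init__(self, capacity: int, upstream=None):
        self.capacity = capacity
        self.upstream = upstream
        self.cache = OrderedDict()  # sequence number -> payload, oldest first

    def store(self, seq: int, payload: str) -> None:
        if seq not in self.cache and len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # cache full: overwrite the oldest packet
        self.cache[seq] = payload

    def handle_nack(self, seq: int):
        if seq in self.cache:
            return self.cache[seq]                 # local retransmission
        if self.upstream is not None:
            return self.upstream.handle_nack(seq)  # pass the NACK upstream
        return None                                # nobody has the packet anymore


far = RepairServer(capacity=4)
near = RepairServer(capacity=2, upstream=far)
for seq in range(1, 5):
    far.store(seq, f"p{seq}")
    near.store(seq, f"p{seq}")

print(near.handle_nack(4))  # served from the near cache
print(near.handle_nack(1))  # already overwritten near the receiver, served upstream
```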

Optimizing Recovery
One lost packet typically leads to a "NACK storm"
Sender must collapse all associated NACKs and retransmit only once
On a LAN only one receiver needs to send a NACK (NACK suppression algorithm)
Congestion-controlled retransmissions
Congestion is often cause of missing packets
Sender should retransmit when congestion is over
Unidirectional links (e.g. satellite)
FEC against interferences
Redundant transmission against buffer overflows
Congestion control CRITICAL
When a packet is dropped at a certain point in the MDT, every receiver residing
along the downstream path will send a NACK. The source will be overwhelmed by
NACKs, but all NACKs request retransmission of the same packet! That is, the source
must logically collapse all corresponding NACKs and retransmit only this single
(unique, precious) packet.
On a LAN the so-called NACK suppression algorithm could be implemented. It
works similarly to IGMP report suppression: Upon missing a packet every
concerned station starts a countdown which is initialized with a random value. The
station whose countdown expires first explodes... no... sends the NACK of
course! All other stations see this NACK and cancel their own.
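
The countdown can be simulated in a few lines. The sketch assumes that the NACK is visible to every station on the LAN and that propagation delay is negligible; the function name, the timer range and the station names are hypothetical.

```python
import random


def nack_suppression(stations, max_delay_ms: float = 100.0, rng=None):
    """Pick the station whose random countdown expires first; all others stay silent."""
    rng = rng or random.Random()
    timers = {station: rng.uniform(0, max_delay_ms) for station in stations}
    sender = min(timers, key=timers.get)                   # first countdown to expire
    suppressed = [s for s in stations if s != sender]      # they see the NACK and cancel
    return sender, suppressed


sender, suppressed = nack_suppression(["A", "B", "C", "D"])
print(f"{sender} sends the NACK, {len(suppressed)} stations are suppressed")
```

However many stations missed the packet, exactly one NACK leaves the LAN.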
Nevertheless, sources should be able to perform congestion control. In most cases
buffer congestion on some helpless router is the real cause for a lost packet. It
makes the problem even worse if the retransmission occurs during the congestion.
Finally, a feedback loop makes no sense in some cases (asymmetric bandwidth) or
is even impossible with unidirectional (e.g. satellite) links. Here the source must
introduce sufficient redundancy into the multicast traffic so that the receiver can
restore the missing information by itself.
Two possibilities: Forward Error Correction (FEC) methods (such as Hamming
Codes or Reed-Solomon Codes) can be used to mitigate sporadic bit errors caused
by interference or similar. If whole packets are dropped because of buffer overflows
then it might help to send the same packets again and again after some
typical congestion period. But without effective congestion control, the multicast
transmission is permanently endangered.
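
To illustrate how redundancy lets a receiver repair a loss without any feedback, here is a deliberately simple XOR parity scheme. Note that this is neither a Hamming code nor a Reed-Solomon code; it is a much simpler stand-in that can restore exactly one lost packet per group, and it assumes that all packets have the same length.

```python
def xor_parity(packets):
    """Byte-wise XOR of equally long packets."""
    parity = bytearray(len(packets[0]))
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(surviving_packets, parity):
    """Rebuild the single lost packet from the survivors and the parity packet."""
    return xor_parity(list(surviving_packets) + [parity])


group = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(group)                  # sent in addition to the data packets
# The second packet gets lost on the way; the receiver repairs it locally.
print(recover_missing([group[0], group[2]], parity))
```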

Protocol Overview
Reliable Multicast Protocol (RMP)
Token rotating scheme
Reliable Multicast Transfer Protocol 2 (RMTP-2)
Relies upon "Top Node"
Multicast File Transfer Protocol (MFTP)
Repair cycles
Scalable Reliable Multicast (SRM)
Straight and simple
Pragmatic General Multicast (PGM)
"Receivers self-help association"
The Reliable Multicast Protocol (RMP) relies on a rotating-token scheme to
ensure reliability and message order. Missing packets are signaled using
NACKs via multicast to all receivers. Only the station holding the unique token
is allowed to multicast an acknowledgment for the recently received packets.
RMP is comparable with SRM.
The Reliable Multicast Transfer Protocol 2 (RMTP-2) requires a trusted
"top node" available for each sender. This top node issues a permission and
control parameters for the sender and provides a single point of control and
monitoring for network managers.
The Multicast File Transfer Protocol (MFTP) provides reliable bulk data
transfer without real-time demands. At first a source transmits the whole data volume and
then collects all NACK packets from all receivers. By applying a logical OR
operation on the NACK packets the source determines the collective need for
repairs (NACK collapse). Then the source starts a summary retransmission, and
so on.
The Scalable Reliable Multicast (SRM) enables reliable multicast delivery of
data packets but without any sequence or delay guarantees.
The Pragmatic General Multicast (PGM) ensures reliable multicast delivery
and guarantees correct packet sequence and no duplicates. When a packet is
lost the affected receiver continuously sends NACKs until the next upstream
router replies with a NACK Confirmation Message (NCF). Since this NCF is
sent via multicast downstream, all other receivers see it and may perform a
local recovery by sending the missing packet. PGM is one of the most
promising solutions.

RMP
Useful for real-time, collaborative applications
NACKs are sent to multicast address
Assures NACK suppression
Allows any member to perform retransmission
Token rotation scheme
Owner of token may send ACK referring to recently received packets
Allows late-joined members to learn about missing packets
Retransmission to multicast group
RMP has been built for online collaboration applications with (soft) real-time
demands. NACKs (and all other packets) are always sent to a multicast group
address in order to prevent other members from sending the same NACK (NACK
suppression) and to invite any member to perform the retransmission.
Furthermore, a token rotation scheme has been introduced to provide
additional reliability, especially for late joiners.
Every time the token is passed to a member, this member sends an ACK to all
other members (again addressing this multicast group) which refers to all
recently received packets. Thus, late-joined members can figure out which
packets they have missed and can request retransmission (as soon as they get
the token). Obviously those late joiners can only be served as long as the
requested packets are available in a cache.
The retransmissions are also sent to the multicast group address, which might
cause duplicate packets. Therefore, the receivers must be able to detect and
eliminate duplicates.
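
Duplicate elimination at a receiver can be as simple as remembering what has already been delivered. The sketch assumes that each packet carries a sender identifier and a sequence number; the class name is invented, and a real implementation would bound the memory with a sliding window.

```python
class DuplicateFilter:
    """Remember (sender, sequence number) pairs and reject packets seen before."""

    def __init__(self):
        self.seen = set()

    def accept(self, sender: str, seq: int) -> bool:
        key = (sender, seq)
        if key in self.seen:
            return False      # duplicate, e.g. a retransmission requested by someone else
        self.seen.add(key)
        return True


dup_filter = DuplicateFilter()
print(dup_filter.accept("S1", 7))   # first copy is delivered
print(dup_filter.accept("S1", 7))   # multicast retransmission is dropped
```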

RMTP
Useful for bulk data distribution
Hierarchically structured
Periodic status messages:
Sent by leaf receivers to their designated receivers (DR)
Relayed via higher layer Designated Receivers up to the Sender
Local retransmission and late joins possible
Caching mechanisms in Designated Receivers
The Reliable Multicast Transport/Transfer Protocol (RMTP) actually represents
a whole family of similar protocols which are mainly used for reliable
bulk data distribution such as file transfers.
An RMTP environment is hierarchically organized, where each "layer"
relays status messages vertically from receivers toward the source. Those status
messages are periodically sent by the receivers upwards to the next-level
designated receiver(s) and so on. This way it is assumed that some nearby
receiver can perform the desired retransmission based on its cache.
Again, late joiners can be served. RMTP is somewhat similar to
PGM.

MFTP - 1. What is it?
Useful for bulk data distribution without real-time demands only
Developed by StarBurst Communications and Cisco Systems
Internet-draft February 1997
Also includes diagnostic tools
Multicast ping (senders learn group population)
Good scalability (thousands...)
Flexible transport
Unicast, multicast, or broadcast dependent on number of receivers and medium
MFTP was developed by StarBurst Communications
(www.starburstcom.com, acquired by Adero, Inc. in March 2000) and
improved by Cisco. MFTP was developed to transport bulk data such as
file transfers in an efficient and reliable way, but not too fast. Similar to
traditional FTP, MFTP also consists of a Multicast Control Protocol (MCP)
and a Multicast Data Protocol (MDP).
Although MFTP is not the fastest protocol it is scalable to thousands of
receivers over one-hop networks such as satellite links. Furthermore MFTP is
flexible regarding the underlying medium. Depending on the number of
receivers a sender may also choose unicast instead of multicast. If the
underlying network does not support multicast the sender may also choose
broadcast.

MFTP - 2. How does it work?
Server announces transmission and waits for receiver registration
Hereby learning population
Announcement contains filename and size
Well-known multicast group address for announcements
Registration suppression on LANs
Then data is sent and NACKs collected
NACKs are collapsed, retransmission afterwards
Several retransmissions if necessary (slow but reliable)
The source announces any transmission in advance to a well-known multicast
group address by specifying the filename, the file size and other common file
parameters. Note that the draft standard does not specify the multicast address
itself (it is just well-known to users) but only the UDP port 5402.
During a given registration time all interested receivers will register and the
source learns the population size. If multiple receivers reside on the same
LAN segment a registration suppression is performed as in other protocols
already mentioned.
Note the simple basic principle: The announcements are sent to a (well-known)
public group address to which everybody listens so that interested receivers
may join, but the actual data is then sent to a private group address.
During transmission the source collects NACKs but the actual retransmission
is done after the complete file has been transmitted once. The source patiently
repeats the retransmissions until all packets are received correctly by all
receivers.
Again, the source collapses all NACKs by a logical OR operation.
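
The NACK collapse is literally a bitwise OR. In the following sketch each NACK is modelled as an integer bitmap in which bit i set means "unit i is missing"; this encoding and the function names are assumptions for illustration and not the wire format of the draft.

```python
from functools import reduce


def collapse_nacks(bitmaps):
    """OR all NACK bitmaps together: a unit is retransmitted if anybody misses it."""
    return reduce(lambda a, b: a | b, bitmaps, 0)


def missing_units(bitmap: int):
    """List the unit numbers whose bit is set in the collapsed bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Three receivers report their losses; units 0, 2 and 3 must be sent again, each only once.
collapsed = collapse_nacks([0b0101, 0b0100, 0b1000])
print(bin(collapsed), missing_units(collapsed))
```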

MFTP - 3. Protocol Details
File is sent in blocks
Some 1000 packets per block
Consists of Data Transmission Units (DTUs)
Source sends status request message after each block
NACKs are sent after each block
Containing bit-map indicating bad DTUs
Unicast
ACKs could be sent but are typically turned off to reduce traffic
Only one ACK at the session end is required
All data of a file is sent in blocks of the same length. Each block is further
subdivided into one or more Data Transmission Units (DTUs), each of which consists of
several IP packets. After each block the source sends a status request message
indicating that the whole block has been sent. At this time every receiver checks if
data is missing and if so a NACK can be sent which contains a bit map
indicating block/DTU/packet numbers of the missing data.
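
On the receiver side, building such a bit map after the status request might look like this. The sketch numbers the units of one block from zero and sets one bit per missing unit; the real bitmap layout of the draft is not reproduced here, and the function name is hypothetical.

```python
def build_nack_bitmap(units_in_block: int, received: set) -> int:
    """Return a bitmap with bit i set for every unit i of the block that did not arrive."""
    bitmap = 0
    for unit in range(units_in_block):
        if unit not in received:
            bitmap |= 1 << unit
    return bitmap


# Block of 8 units, unit 2 was lost.
print(bin(build_nack_bitmap(8, {0, 1, 3, 4, 5, 6, 7})))
# Nothing missing: the bitmap is zero and no NACK needs to be sent.
print(build_nack_bitmap(4, {0, 1, 2, 3}))
```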
The source does not wait for the NACKs but continues to send the next block
and so on. Only at the end of the file the source starts to retransmit all
requested packets, as already mentioned.
ACK messages are also defined but the source does not expect them. Typically
they are not sent in order to reduce the amount of traffic. Only at the end of the
transmission, all receivers must send one ACK to signal that the whole
transmission succeeded.

MFTP - 4. Three Group Models
Closed groups
Members are known by source
Only those members may register
Open limited groups
Unknown members
Source expects registration
Unlimited groups
No registration expected
MFTP allows the multicast service provider to define three different types of
groups.
All members of a closed group must be known by the source. This model
allows for dedicated authorization and is typically applied only for a small
number of receivers. Here the source specifies the receivers within the
announcement.
Members of an open limited group are not specified by the announcement.
Any receiver may join the source but must register with the source. The number
of receivers is typically limited.
Members of an unlimited group do not need to register and the source
sends no announcements at all. There are no limits on group size.

SRM
For whiteboarding (wb) in Mbone and general data distribution
Does not care for ordered packet delivery
NACKs are sent to group
Both NACK and retransmission suppression
Two models: ALF and LWS
Application Level Framing (ALF)
Data is uniquely identified by Source-ID and Page-ID
Time stamp, Sequence Number
Application must re-sequence
Light-Weight Sessions (LWS)
Additional session messages as feedback loop
Ideal for conferencing applications
SRM is used by the whiteboarding tool (wb) of the Mbone toolset and some other
distributed interactive applications such as simulations or distributed computing
environments.
NACKs are sent to the group (and not directly to the source) to invite any receiver
to start the retransmission. Both repair requests (NACKs) and retransmissions are not
sent immediately but only after a suppression timer has expired.
Two SRM models have been defined, Application Level Framing (ALF) and
Light-Weight Sessions (LWS).
ALF applies an identifier, a sequence number and a time stamp to each packet
which allows a receiver to easily identify and NACK lost data.
LWS simply establishes a session between source and receivers and provides
special session control messages (which are exchanged between source and
receivers). These session messages are used by the receivers to tell the source which
packets have been received and the source uses them to check the receiver states.
Note that NACK messages are used independently! When a NACK gets lost the
source still notices outstanding packets by tracking the session messages. Of course a
source does not track back to the very beginning of a session; rather, only a recent
time frame is considered. Thus late joiners can also be served depending on the
time frame used by the caches.
Optionally nodes can estimate the distance to a sender using session messages. The
average bandwidth utilization (typically below 5%) of the session messages is either
preset by a reservation protocol (such as RSVP) or adaptively controlled by a
congestion control algorithm.
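
Since SRM does not deliver in order, the application has to do the sorting itself. A minimal sketch of ALF-style re-sequencing follows; it assumes packets of one source and page arrive as (sequence number, payload) tuples, which is an illustrative data layout and not the SRM packet format.

```python
def resequence(packets):
    """Sort packets by sequence number and report the gaps that should be NACKed."""
    ordered = sorted(packets, key=lambda p: p[0])
    if not ordered:
        return [], []
    seen = {seq for seq, _ in ordered}
    first, last = ordered[0][0], ordered[-1][0]
    missing = [seq for seq in range(first, last + 1) if seq not in seen]
    return ordered, missing


# Packets arrive out of order and number 2 never shows up.
ordered, missing = resequence([(3, "c"), (1, "a"), (4, "d")])
print(ordered)
print(missing)
```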

PGM
Best known solution (Cisco)
Duplicate-free, ordered delivery
Several application-friendly features
Multiple senders and receivers
Independent of layer 3
Internet-Draft, January 1998
Routers support local feedback loops
"PGM Assist features"
The Pragmatic General Multicast (PGM) is one of the most flexible and
scalable solutions, and it is implemented on Cisco routers!
On Cisco devices the "PGM Router Assist" feature must be enabled. Those
routers will not perform any retransmissions by themselves but provide great
assistance in efficient NACK forwarding/filtering and in finding an
appropriate retransmitter, which is either the source itself or another receiver
which has enough cached data.
Note that PGM is basically layer-3 independent but Cisco IOS only supports
PGM over IP.

PGM - Basic Principle (1)
Source sends ordered data (ODATA) containing
Transport session identifier (TSI)
Sequence number (SQN)
Source also sends Source Path Messages (SPM)
Interleaved with ordered multicast data
Provides an upstream path
Not shown in the picture
What the hell...?
The source sends ordered data (called ODATA by the draft) which is identified
by a Transport Session Identifier (TSI) and a sequence number (SQN). By
identifying each session by a TSI label, any number of sources can be handled
by PGM. The SQN then further identifies a packet within the TSI-labeled
session.
Note: Any NACK will refer to a TSI/SQN pair.
The receiver learns the information about the next upstream hop from the
Source Path Message (SPM), which is periodically interleaved with the data.
Each PGM network element inserts the IP address of the interface through which
the SPM message is sent downstream (and removes the previous address).
Therefore, each downstream PGM network element can maintain state
information which can be used to send unicast NAKs upstream to the source.
SPM messages have a separate sequence number and must also be NAK-ed if
not received.
Note: There must be at least one PGM-enabled router between any source and
any receiver.

PGM - Basic Principle (2)
Upon failure: NACK is sent to upstream PGM router
Unicast to the address indicated in SPM
Upstream PGM router sends NACK Confirmation (NCF)
To multicast group downstream
Enables NACK suppression
Upstream PGM router creates TSI/SQN retransmission state and forwards NACK upstream to source
[Figure: a receiver unicasts NACK 2/7 to its upstream PGM router; the router multicasts NCF 2/7 downstream to suppress further NACKs from other receivers; each PGM router on the path keeps state 2/7 until the source repeats the data.]
NACKs are sent unicast to the upstream router, whose address was learned from
the SPM. Each NACK contains TSI/SQN information.
When a router receives a NACK, it replies with a NACK
Confirmation message (NCF), which is sent as multicast through the same
interface on which the NACK was received, so that other receivers can suppress their
NACKs. PGM-enabled routers never propagate NCFs.
Then this router forwards the NACK upstream and creates a state for the
TSI/SQN pair. This state allows the router to filter any additional NACKs and
to forward any retransmission. The router keeps sending the NACK
upstream toward the sender until it receives an NCF from an upstream
PGM router. This NACK/NCF process is repeated between each pair of PGM-enabled
routers until the source itself receives the NACK.
PGM also supports local recovery: Any local receiver may respond with an
"NCF-Redirect" option and hence become a Designated Local
Retransmitter (DLR) which retransmits the requested data from its own
cache. The router forwards all subsequent NACKs directly to this DLR and not
upstream.
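The NACK handling described above can be sketched as a small state machine. This is only a toy model, not router code: the class name, the interface names and the `on_repair` method are assumptions made for illustration, and timers, NCF waiting and DLR redirection are left out.

```python
class PgmRouter:
    """Toy model of a PGM network element handling NACKs for (TSI, SQN) pairs."""

    def __init__(self, name):
        self.name = name
        self.state = {}   # (tsi, sqn) -> set of interfaces waiting for the repair
        self.log = []     # actions taken, in order

    def on_nack(self, tsi, sqn, iface):
        """Handle a NACK; return True if it is forwarded upstream."""
        key = (tsi, sqn)
        # Always confirm on the interface the NACK arrived on, so that
        # other receivers behind it can suppress their own NACKs.
        self.log.append(("NCF", iface, key))
        if key in self.state:
            # Retransmission state already exists: filter the duplicate NACK.
            self.state[key].add(iface)
            return False
        self.state[key] = {iface}
        self.log.append(("NACK-upstream", key))
        return True

    def on_repair(self, tsi, sqn):
        """Return the interfaces the retransmission is forwarded to; drop the state."""
        return sorted(self.state.pop((tsi, sqn), set()))
```

With this model, a second NACK for the same TSI/SQN pair is answered with an NCF but not forwarded upstream again, and the retransmission only goes to interfaces that asked for it.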

PGM - Options
Late joining
Sources indicate whether lately joined receivers may request all missing data
Time stamps
Receivers tell urgency of retransmissions
Reception quality reports
Sent by receivers for congestion control
Fragmentation
To conform to MTU
FEC
To reduce selective retransmissions
PGM also supports some application-friendly options such as late joining,
time stamps, reception quality reports, and others.
The late joining option allows a source to tell late-joining receivers whether
or not they may request all missing packets.
Additionally, time stamps can be used in NACKs to allow receivers to tell any
PGM device how urgently the missing data must be retransmitted.
Reception quality reports may be used in NACKs to support congestion
control. They are inserted by the receivers and used by the source.
PGM supports data fragmentation in order to conform to the maximum
transmission unit (MTU) supported by the network layer.
Furthermore, forward error correction (FEC) can be enabled to reduce the number of
selective retransmissions.

Summary
Multicast routing requires creation of spanning trees
Avoid multiple packets
Avoid multicast storms
Source-based and Shared trees
Push and Pull methods
IGMP to announce group membership
Current favourite: PIM-SM
Also reliable multicast solutions available
PGM is most important

2005/03/11 {C} Herbert Haas
MPLS Introduction
MPLS

Terminology
LSR - Label Switch Router
LER - Label Edge Router
FEC - Forwarding Equivalence Class
LSP - Label Switched Path
FIB - Forwarding Information Base
LIB - Label Information Base
LFIB - Label Forwarding Information Base
TIB - Tag Information Base
PHP - Penultimate Hop Popping
LDP - Label Distribution Protocol
TDP - Tag Distribution Protocol
RSVP - Resource Reservation Protocol
CR-LDP - Constraint-based Routing LDP
This slide lists a few of the thousand important abbreviations.

Why MPLS?
Once upon a time...
Computer science:
A study akin to numerology and astrology, but lacking the
precision of the former and the success of the latter.
Networking science:
The costly enumeration of the obvious.
MPLS:
No science at all.

Drawbacks of IP Networks
IP uses structured addresses for both:
Routing
Forwarding
In other words: The "IP Routing Paradigm"
Hop-by-hop routing (slow)
Destination based routing (large routing tables)
Least cost routing (no load balancing)
ATM: Layer 2 and 3 topologies often different (hub & spoke)
Manual VC establishment necessary
[Figure: ATM switch versus IP router, with the open questions TE? QoS? VPN? Transport?]
Destination-based least-cost IP routing does not support load balancing.
Although policy-based routing is supported by most vendors, this solution does
not scale. There are also no satisfying solutions available for Traffic
Engineering (TE) and Quality of Service (QoS). There are some working
IP VPN solutions (e.g. IPsec based), but scalability remains an issue.

MPLS Idea
MPLS is a provider technology
Application: Transport network!
Inside versus border versus outside domains:
Core routers
Provider Edge routers (PE-routers)
Customer Edge routers (CE-routers)
Also ATM switches can run MPLS
Know L3 topology
[Figure: a service provider network with core routers in the middle and PE routers at the border; Customers A, B and C each attach through a CE router to a PE router.]
MPLS rests on one basic concept: the distinction between a border and
a core network.
What if the border MPLS routers are somehow clever, determine where the
packet has to go (perform the whole routing process) and add a simple but
significant label to the packet, so that all subsequent MPLS routers
(somehow) know what to do with it?
Actually the whole principle was stolen from the ATM world, but it has
also been improved. ATM can only swap two labels (the VPI and the VCI, but
mostly they are swapped together).
Wouldn't there be much greater flexibility if we had more labels per packet?

MPLS Building Blocks
MPLS Transport: you always need this!
"Advanced Features" you can choose from:
MPLS (Advanced) VPN
MPLS Multicast
MPLS AToM (Any Transport over MPLS)
MPLS TE
MPLS QoS
Carrier supporting Carrier (between several ASs)
The MPLS technology supports different types of so-called MPLS Applications,
like the ones shown in the graphic above.
MPLS Transport is the base MPLS Application which needs to be
configured if you want to use other MPLS Applications like MPLS
VPN, MPLS TE etc. MPLS Transport can be used to replace pure layer
3 IP forwarding with label switching.
MPLS VPN can be used to build closed user groups on top of the
MPLS Transport system.
MPLS Multicast is needed if multicast transport through an MPLS
cloud is desired.
MPLS AToM allows you to tunnel Ethernet, Frame Relay and ATM
traffic through an MPLS domain.
MPLS TE can be used to overcome load-balancing limitations of IP
routing protocols by the use of traffic engineering tunnels.
MPLS QoS is used if you want to support different traffic classes
inside your MPLS network.

MPLS Transport
The most fundamental feature...
If you understand MPLS Transport then you will follow the rest of it...

MPLS at a Glance
IP does destination based routing
Hop-by-hop routing efforts
Each hop must know all routes (100,000)
MPLS replaces the global IP destination address by a locally used label
Label can identify many things: FEC
VPN-ID, TE tunnels, QoS, multicast groups, ...
MPLS was formerly known as "tag switching" and was invented by Cisco.
Today it is standardized as Multiprotocol Label Switching by the IETF.
The major difference between the IP and MPLS forwarding plane is:
An IP router uses the longest match routing rule when it scans through the IP
forwarding table. This means the subnet mask information stored in the IP
forwarding table determines how many bits of the incoming packet's IP address
must match the entry in the forwarding table. If more than one
match is found, the longest match wins.
The MPLS forwarding engine does not use the longest match routing rule.
MPLS always requires an exact match between the incoming label and the
label forwarding table.
A label in MPLS can also identify things other than a destination; it
can be used to signal a QoS group, a multicast group, an MPLS TE tunnel etc.
Therefore we assign a label to a Forwarding Equivalence Class (FEC), which has
a common meaning. The FEC simply tells what the label stands for (e.g. a
VPN, a next hop, a QoS class, ...).
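The difference between the two lookup rules can be made concrete with a short sketch. The tables below are invented for illustration (the prefixes, the routers R7 and R9 and the label values are assumptions), and a real router would use a trie rather than a linear scan.

```python
import ipaddress

def longest_match(fib, dst):
    """Return the next hop of the most specific prefix containing dst, or None."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, next_hop in fib.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best[1] if best else None

def label_lookup(lfib, label):
    """Exact match: the incoming label is simply a key into the table."""
    return lfib.get(label)

# Hypothetical example tables
fib = {"20.0.0.0/8": "R3", "20.1.0.0/16": "R7", "0.0.0.0/0": "R9"}
lfib = {89: (22, "R4")}   # incoming label -> (outgoing label, next hop)
```

Here `longest_match(fib, "20.1.2.3")` has three candidate prefixes and picks the /16, while `label_lookup(lfib, 89)` either finds exactly one entry or nothing at all.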

MPLS Header
"Layer 2.5" can be used over Ethernet, 802.3 or PPP links
Frame mode
MPLS over ATM is different than over packet interface
Cell mode
ATM can only swap VPI/VCI, no stacking!
ATM encapsulates MPLS-IP packet inside AAL5
One 4-byte MPLS header, placed between the layer 2 header (Ethernet, PPP) and the IP header:
Field: Label | Prec. | S | TTL
Bits:  20    | 3     | 1 | 8
Label stack: Layer 2 | MPLS Header 3 | MPLS Header 2 | MPLS Header 1 | IP
The MPLS header is made up of four bytes and is located between the layer
two header and the layer three header. The existence of an MPLS header is
indicated by the layer two type field entry 0x8847 (unicast) or 0x8848 (multicast).
The MPLS header is made up of a:
20 bit label field used for forwarding,
3 Experimental bits typically used to carry IP Precedence
settings,
1 bit bottom of stack (1 indicates the last label in the stack, 0
indicates that more labels follow below this one),
8 bit TTL field into which, by default, the IP TTL value is copied
when a label is inserted.
If MPLS is used on top of ATM, the VPI/VCI field of the standard ATM cell
header is used to carry the label information. There is no additional MPLS
header involved because this would require hardware changes if you want to
migrate existing ATM devices to support MPLS.
Note: The labels 0 to 15 are reserved. Therefore the lowest usable label number
is 16 and the highest possible label is 1,048,575 (which is actually 2^20-1).
Only four out of the 16 reserved labels have been defined by RFC 3032, which
are: 0 "IPv4 Explicit Null Label", 1 "Router Alert Label", 2 "IPv6 Explicit Null
Label", 3 "Implicit Null Label".
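The field layout can be checked with a few lines of bit arithmetic. The sketch below packs and unpacks one 4-byte label stack entry, following the RFC 3032 convention that the S bit is 1 on the bottom entry; the function names are chosen for illustration only.

```python
import struct

def pack_entry(label, exp, bottom, ttl):
    """Pack one 4-byte entry: 20-bit label, 3-bit Exp, 1-bit S, 8-bit TTL."""
    if not (0 <= label < 2**20 and 0 <= exp < 8 and 0 <= ttl < 256):
        raise ValueError("field out of range")
    word = (label << 12) | (exp << 9) | (int(bottom) << 8) | ttl
    return struct.pack("!I", word)   # network byte order

def unpack_entry(data):
    """Return (label, exp, bottom_of_stack, ttl) from the first 4 bytes."""
    (word,) = struct.unpack("!I", data[:4])
    return word >> 12, (word >> 9) & 0x7, bool((word >> 8) & 0x1), word & 0xFF
```

For example, the lowest usable label 16 with S set and TTL 255 encodes to the bytes `00 01 01 ff`.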

Label Switch Routers (LSRs)
Any Cisco IOS 12.0 based router can do MPLS
Performs standard operations:
Insert (impose) a label
Remove (pop) a label
Swap labels during forwarding
Multiple labels occur for example:
MPLS VPNs (egress router/VPN)
MPLS TE (tunnel/destination)
MPLS is basically a software solution. With Cisco IOS version 12.0, routers
are able to perform CEF switching (explained soon in detail), which is the
basis for MPLS. That is, nearly any Cisco router (except the smallest home
office devices) is able to do MPLS.
MPLS routers are also called "Label Switch Routers" (LSRs) and must be
able to perform the following basic operations: insert (or "impose") a label
(this is essential for edge routers), remove (or "pop") a label (this is essential
for last hop routers), and swap labels (this is always done during packet
forwarding).
Several reasons lead to a label stack. For example, with MPLS VPNs, the top
label identifies the egress router while a second label identifies the VPN itself.
Thus the egress router can (as soon as the packet has arrived) pop the outermost
label and forward the packet to the right interface according to the inner label.
Another example is MPLS Traffic Engineering (TE), where the outer label
points to the TE tunnel endpoint and the inner label to the final destination
itself.
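The three operations and the two-level stack of the VPN example can be sketched with a plain list, where index 0 is the outermost label. All label values below are hypothetical.

```python
def impose(stack, label):
    """Push a new outermost label (index 0 is the top of the stack)."""
    return [label] + stack

def swap(stack, label):
    """Replace the outermost label; inner labels are left untouched."""
    return [label] + stack[1:]

def pop(stack):
    """Remove the outermost label and expose the next one (or the IP packet)."""
    return stack[1:]

# Ingress PE: first the VPN label (70), then the label toward the egress router (40)
stack = impose(impose([], 70), 40)
# A core LSR swaps only the outer label
stack = swap(stack, 9)
# Popping the outer label leaves the VPN label for the egress router
stack = pop(stack)
```

The core routers never look at the inner label, which is why they need no knowledge of the VPNs.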

Important Concepts
LDP (RFC) or TDP (Cisco)
CEF is required (Cisco Patent)
Routing table is 256-way "mtrie"
Better than Fast Switching: Also 1st packet fast!
DCEF = per interface
MPLS applications only differ in the usage of the control plane
VPN, TE, QoS, ...
All use data plane equivalently
[Figure: Control plane: the IGP (OSPF, IS-IS, RIPv2, ...) builds the IP routing table; the LIB collects all LDP or TDP information. Data plane (forwarding plane): the FIB forwards IP packets; the LFIB holds Label-IN, Label-OUT and L2 information, using the best label according to the routing metric.]
MPLS needs different types of tables which interact to provide MPLS
forwarding functionality.
The IP routing table is a common routing table which is built by the
IGP in use.
The FIB table is derived from the information held in the routing
table plus all necessary layer 2 information and label information
needed for packet forwarding. All incoming IP packets are forwarded
according to the information kept in the FIB table.
The LIB table holds all label-to-IP-destination
bindings. The LIB is built using either LDP or TDP updates. Both
protocols distribute label-to-IP-prefix bindings. The LIB is a database
of all possible labels.
The LFIB only holds the best labels out of the LIB and is actually
used to forward MPLS packets. Which label in the LIB is best is
determined by the next hop information supplied by the local IGP.
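How the LFIB is derived from the LIB can be sketched as a simple selection step. The data layout (plain dictionaries) and the function name are assumptions for illustration; the label values are illustrative as well.

```python
def build_lfib(local_labels, lib, next_hops):
    """Pick, per prefix, the label learned from the IGP next hop.

    local_labels: prefix -> label this LSR advertised (its incoming label)
    lib:          prefix -> {neighbor: label advertised by that neighbor}
    next_hops:    prefix -> neighbor chosen by the IGP
    """
    lfib = {}
    for prefix, in_label in local_labels.items():
        nh = next_hops.get(prefix)
        out_label = lib.get(prefix, {}).get(nh)
        if out_label is not None:
            lfib[in_label] = (out_label, nh)
    return lfib

# Illustrative values: two neighbors advertised a label for 20/8
lib = {"20/8": {"R3": 5, "R6": 12}}
lfib = build_lfib({"20/8": 41}, lib, {"20/8": "R6"})
```

The LIB keeps both bindings, but only the one from the IGP next hop (R6, label 12) ends up in the LFIB. If the IGP later prefers R3, the other binding is already available.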

Important Databases
FIB
This is the CEF database
Contains L2/L3 headers, IP addresses, labels, next hop, metric
The routing table is only a subset of the FIB
LIB
Contains all labels and associated destinations
LFIB
Contains selected labels used for forwarding
Selection based on FIB
This slide summarizes the three important databases which were
introduced with MPLS.

MPLS Applications
[Figure: different control planes on top of one data plane (forwarding plane) consisting of FIB and LFIB]
Unicast forwarding: any IGP, IP RT, LDP/TDP
Multicast forwarding: PIMv2, M-RT
MPLS TE: OSPF/ISIS, IP RT, LDP, RSVP
MPLS QoS: any IGP, IP RT, LDP/TDP
MPLS VPN: any IGP, IP RT, LDP, BGP
The diagram above illustrates how different MPLS applications use different
control planes. It is in fact the control plane which determines the FECs; in
other words, what label-based forwarding is good for.
But all applications use the same (primitive) data plane.
Note that there are different types of MPLS-based multicast. MPLS Multicast
is discussed in another chapter, soon...

Label Switching (1)
Both routing updates and LDP/TDP distribute reachability information
[Figure: chain of routers R1 to R6 with destination network 20/8 attached at the R6 end. Routing updates for 20/8 are passed on hop by hop ("20/8 via R6", "via R5", "via R4", "via R3", "via R2"); label advertisements travel in the same direction ("20/8 use 41" from R5, "20/8 use 22" from R4, "20/8 use 89" from R3).]
Router | FIB                  | LFIB In | LFIB Out
R2     | 20/8 via R3 use 89   | -       | 89
R3     | 20/8 via R4 use 22   | 89      | 22
R4     | 20/8 via R5 use 41   | 22      | 41
R5     | 20/8 via R6 no label | 41      | -
(R5 routing table: 20/8 via R6)
The picture above shows how a label-switched path is established from left
(near the destination network 20/8) to the right. Both routing updates and
the label distribution protocol (LDP or TDP) distribute reachability information for
this destination network.

Label Switching (2)
R5 must perform double lookup:
LFIB tells "remove the label"
FIB tells "use next hop R6"
Label should be removed one hop earlier (by R4)!
[Figure: same topology and tables as in Label Switching (1). A packet to 20.0.0.1 enters at R1 unlabelled, is carried with label 89 from R2 to R3, with label 22 from R3 to R4, with label 41 from R4 to R5, and is unlabelled again from R5 to R6.]
The picture above shows how packets can now be sent using an MPLS header.
Label switching is performed on each hop (LSR) inside the provider domain
(R2, R3, R4, R5). The LFIB tables are used to perform a fast lookup.
But R5 cannot find any outgoing label in its LFIB. After this unsuccessful
lookup, R5 looks into the FIB and determines the next hop. Note that this
double lookup would be done for every packet! Therefore it would be
reasonable to remove the label one hop earlier (at the penultimate hop, R4)
in order to leave R5's LFIB empty.
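The hop-by-hop swapping can be replayed with the label values from the slide. The sketch below is a simplification: the dictionary layout and the function name are assumptions, and the extra FIB lookup R5 needs once the label is gone is not modelled.

```python
# LFIB per router: incoming label -> (outgoing label or None, next hop)
LFIB = {
    "R3": {89: (22, "R4")},
    "R4": {22: (41, "R5")},
    "R5": {41: (None, "R6")},   # no outgoing label: packet leaves as plain IP
}

def forward(first_hop, label):
    """Follow a labelled packet hop by hop.

    Returns the list of (router, incoming label) pairs and the router
    that finally receives the unlabelled packet.
    """
    trace, router = [], first_hop
    while label is not None and router in LFIB:
        trace.append((router, label))
        label, router = LFIB[router][label]
    return trace, router

# R2 imposes label 89 and sends the packet to R3
trace, last = forward("R3", 89)
```

Every LSR on the path performs exactly one exact-match lookup on the incoming label; only the label value changes from hop to hop.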

PHP (1)
Last hop router (R5) tells penultimate router (R4) to remove label
"Penultimate Hop Popping" (PHP)
Also called "Implicit Null Label"
[Figure: same topology as before. R5 advertises "20/8 do POP" to R4, R4 advertises "20/8 use 22" to R3, R3 advertises "20/8 use 89" to R2.]
Router | FIB                  | LFIB In | LFIB Out
R2     | 20/8 via R3 use 89   | -       | 89
R3     | 20/8 via R4 use 22   | 89      | 22
R4     | 20/8 via R5 do POP   | 22      | POP
R5     | 20/8 via R6 no label | -       | -
(R5 routing table: 20/8 via R6)
This scenario illustrates "Penultimate Hop Popping" (PHP). Now R5 does
not allocate an incoming label for this destination but rather tells R4 to
use an "implicit null" label. In other words, R4 should perform the "POP"
operation. The label number 3 has been reserved to represent the "do POP"
command.
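How a router might turn a received advertisement into an LFIB entry can be sketched as follows. The tuple layout and the function name are assumptions for illustration; only the reserved value 3 comes from the text.

```python
IMPLICIT_NULL = 3  # reserved label value, advertised to request a pop

def lfib_entry(in_label, advertised_label, next_hop):
    """Build an LFIB entry; an advertised implicit null becomes a POP action."""
    if advertised_label == IMPLICIT_NULL:
        return in_label, ("POP", None, next_hop)
    return in_label, ("SWAP", advertised_label, next_hop)

# R4 learned "20/8 do POP" (label 3) from R5; R3 learned "20/8 use 22" from R4
r4_entry = lfib_entry(22, IMPLICIT_NULL, "R5")
r3_entry = lfib_entry(89, 22, "R4")
```

Label 3 therefore never appears in a packet on the wire; it only exists in the advertisement.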

PHP (2)
R5 only performs single lookup in FIB
Note: PHP does not work with ATM
VPI/VCI cannot be removed
[Figure: same topology and tables as in PHP (1). A packet to 20.0.0.1 is carried with label 89 from R2 to R3 and with label 22 from R3 to R4; R4 pops the label, so the packet travels unlabelled from R4 to R5 and on to R6.]
Note that some router in between (e.g. R3) can be configured as an aggregation
point. That is, this router may aggregate several prefixes using a shorter prefix
(e.g. 20/6) and a dedicated label. In this case the label-switched path is broken
into two segments. The penultimate router (just before the aggregation router)
already performs "POP", and the aggregation router therefore must perform a
routing table lookup (this is necessary especially when the destination is more
specific than the announced aggregate, because there might be different downstream
paths from the aggregation point).
Note: ATM LSRs must not aggregate because they cannot forward IP packets.
Also, aggregation must not be used in applications where an end-to-end tunnel
is required, such as in MPLS VPNs.

1 - Routing Updates
[Figure: MPLS cloud with OSPF, routers R1 to R6, destination network 20/8. OSPF updates announce 20/8 hop by hop ("20/8 via R2", "via R3", "via R4", "via R5", "via R6"). One LSR is shown in detail: its OSPF table lists 20/8 via R3 and 20/8 via R6, and its LFIB is still empty (In -, Out -).]
The first table that needs to be available is the routing table, which is built
in this example with the help of the OSPF link state routing protocol.
If only the MPLS Transport system is in use, any IGP can be used. Only MPLS
Transport in combination with MPLS TE requires a link state routing protocol
like OSPF or IS-IS.

2 - LDP or TDP
LDP/TDP is also performed in reverse direction
But no IGP information about these reverse label-paths, so normally not used!
Best route from OSPF table determines best label in FIB and is used in LFIB
[Figure: MPLS cloud with routers R1 to R6 and destination network 20/8. Every LSR advertises a label for 20/8 to all its neighbors (for example "20/8 use 5", "20/8 use 12", "20/8 use 41", "20/8 do POP"). One LSR is shown in detail: OSPF lists 20/8 via R3 and 20/8 via R6; its FIB holds "20/8 via R3 use 5" and "20/8 via R6 use 12"; its LFIB maps incoming label 41 to outgoing label 12, and it advertises "20/8 use 41" to its neighbors.]
Allocated labels are advertised to all neighbor LSRs regardless of whether they
are upstream or downstream.
Per-platform label allocation: typically an LFIB contains no incoming
interface, so the same destination (next hop) can be associated with the same
label for all interfaces. The LSR simply advertises the same label for the same
destination through all interfaces. An LSR announces a label to an adjacent LSR only
once, even if there are parallel links between them. Advantage: quicker label
exchange, small LFIB. Drawback: it is insecure, because a third-party router can send
packets to the LSR even though the label was not announced to it.
Per-interface label space: the LFIB contains the incoming interface. A label can be
reused per interface with different meanings.
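The two label spaces differ only in what the LFIB uses as its lookup key. The sketch below makes that explicit; interface names, labels and next hops are hypothetical.

```python
def lookup_per_platform(lfib, in_iface, label):
    """Per-platform space: the incoming interface is not part of the key."""
    return lfib.get(label)

def lookup_per_interface(lfib, in_iface, label):
    """Per-interface space: the same label may mean different things per interface."""
    return lfib.get((in_iface, label))

platform_lfib = {41: "R6"}
interface_lfib = {("s0/0", 41): "R6", ("s0/1", 41): "R3"}
```

With the per-platform table, label 41 is accepted on any interface, including one it was never announced on. With the per-interface table, a packet carrying label 41 on an unexpected interface finds no entry.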
POP (implicit null) removes the outermost label.
PHP does not work on ATM because the VPI/VCI cannot be removed. POP or
"implicit null label" uses the value 3 when being advertised to a neighbor.

Cisco Express Forwarding (CEF)
Requirement for MPLS
Forwarding information (L2-headers, addresses, labels) is maintained in FIB for each destination
Newest and fastest IOS switching method
Critical in environments with frequent route changes and large RTs: The Internet backbone!
Invented to overcome Fast Switching problems:
No overlapping cache entries
Any change of RT or ARP cache invalidates route cache
First packet is always process-switched to build route cache entry
Inefficient load balancing when "many hosts to one server"
Many route changes occur in the Internet backbone, causing cache entries to be
invalidated frequently. Therefore, with Fast Switching a significant percentage of
Internet traffic is process switched. CEF was first tested with IOS "ISP Geek images"
under extreme conditions. Now CEF is the default switching mode in Cisco IOS Release 12.0
and the only switching mode on Cisco 12000 routers and Catalyst 8500.
Cisco IOS 12.0 knows several switching methods: Process Switching, Fast
Switching, Autonomous Switching, Silicon Switching Engine (SSE)
Switching, Optimum Switching, Distributed Fast Switching, CEF, Distributed
CEF (dCEF).
Process Switching was the first switching method implemented in IOS. It is
simple (brute-force), slow, CPU demanding and non-optimized, but at least
platform independent.
Fast Switching: cached subset of the routing table and MAC address tables.
During Process Switching (which is still done for the first packet), the
information learned is stored in a fast cache. This information contains route
(next hop), interface and MAC header combinations. In order to avoid
collisions in the fast cache, beginning with IOS 12.0, radix trees instead of hash
tables are used.
Compared to process switching and fast switching, CEF supports
packet manipulation on the fly. This means the FIB table lookup also provides
some additional information (e.g. precedence settings, label information etc.)
which is applied to the outgoing data packet.

How CEF Works
CEF "Fast Cache" consists of
CEF table: Stripped-down version of the RT (256-way mtrie)
Adjacency table: Actual forwarding information (MAC, interfaces, ...)
CEF cache is pre-built before any packets are switched
No packet needs to be process switched
CEF entries never age out
Any RT or ARP changes are immediately mapped into CEF cache
[Figure: CEF table as a four-level 256-way mtrie. Example: looking up "10.20.5.16" walks root -> 10.0.0.0 -> 10.20.0.0 -> 10.20.5.0 -> 10.20.5.16, and the leaf points into the adjacency table (00E3.C10F.8B11, interface s0/0).]
The CEF (FIB) table holds all the information needed to rewrite the
layer 2 and layer 3 headers of a forwarded data packet. Changes in the routing table
have to be reflected in the CEF table immediately.
mtrie: tree of pointers; data is stored elsewhere.
Display CEF table information using show ip cef summary.
Display adjacency table information: show adjacency.
dCEF: very high performance boost. Each interface holds its own CEF table
and is able to forward packets autonomously. Available on GSR and Cisco 7500
routers.
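The idea of a 256-way multiway trie can be illustrated with nested dictionaries, one level per octet. This is a toy sketch under strong assumptions: it stores only complete host addresses, whereas a real CEF table also handles prefixes of different lengths, and the class and method names are invented.

```python
class Mtrie:
    """Toy 256-way multiway trie: one level per octet of an IPv4 address."""

    def __init__(self):
        self.root = {}

    def insert(self, address, adjacency):
        """Store a pointer to the adjacency under the four octets of address."""
        node = self.root
        octets = [int(o) for o in address.split(".")]
        for octet in octets[:-1]:
            node = node.setdefault(octet, {})
        node[octets[-1]] = adjacency   # the leaf only points at the adjacency

    def lookup(self, address):
        """Walk at most four levels; return the adjacency or None."""
        node = self.root
        for octet in (int(o) for o in address.split(".")):
            if not isinstance(node, dict) or octet not in node:
                return None
            node = node[octet]
        return node

table = Mtrie()
table.insert("10.20.5.16", ("00E3.C10F.8B11", "s0/0"))
```

A lookup never takes more than four steps, regardless of how many entries the table holds, which is the property that makes the structure attractive for forwarding.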

IOS Standard Behavior
Routers with packet interfaces
Per-platform label space!
Unsolicited label distribution
Liberal label retention!
Independent control
Routers with ATM interfaces
Per-interface label space
On-demand label distribution
Conservative or liberal label retention
Independent control
ATM switches
Per-interface label space
On-demand distribution
Conservative label retention
Ordered control
This slide summarizes the main differences.
Note that routers perform a per-platform label allocation. That is, the LFIB
does not contain any incoming interface, so the label must be unique on the
entire router for a given destination. In other words, the same label can be used
for a packet on any interface and will be forwarded to the same destination;
that is the positive way to put it.
Which label distribution and retention behavior is used depends on the
interface type in use.
Unsolicited label distribution means that labels are advertised automatically
without being asked...
Liberal label retention: All advertised labels are accepted, even from LSRs
which are not the next hop to the destination.
Conservative label retention: Advertised labels are only accepted from LSRs
which are next hop LSRs for a given destination.
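The two retention modes can be expressed as a filter over the bindings received for one destination. The function name, the mode strings and the sample values are assumptions for illustration.

```python
def retain(advertisements, next_hop, mode):
    """Filter received label bindings for one destination.

    advertisements: {neighbor: label}; next_hop: neighbor chosen by the IGP.
    """
    if mode == "liberal":
        return dict(advertisements)          # keep everything
    if mode == "conservative":
        return {n: l for n, l in advertisements.items() if n == next_hop}
    raise ValueError(f"unknown retention mode: {mode}")

received = {"R3": 5, "R6": 12}
```

Liberal retention costs memory but lets the router switch to the binding from R3 immediately after a reroute; conservative retention keeps the table small, which matters where labels are a scarce resource such as VPI/VCI values.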

TDP Key Facts
Tag Distribution Protocol (TDP) invented by Cisco for distributing <label, prefix> bindings
Enabled by default
Session establishment: UDP/TCP port 711
Hello messages via UDP, destination 224.0.0.2 (all subnet routers)
Session via TCP, incremental updates
Not compatible with LDP
But can co-exist as long as two peers use same protocol
TDP was developed by Cisco and is used to distribute label-prefix
bindings between adjacent LSRs. Only in the case of MPLS TE are TDP
updates also exchanged between non-adjacent LSRs, through so-called
tunnel interfaces.
TDP uses both UDP and TCP at the transport layer. The TDP
server process is addressed by port number 711, and Hello messages are sent
to the well-known all-routers multicast address 224.0.0.2.
UDP is used in combination with a Hello procedure to detect neighboring
LSRs.
TCP is used to transport label binding information reliably.
TDP is incompatible with LDP, so neighboring LSRs need to use the same
protocol to allow a TDP/LDP session to come up.

LDP Key Facts
IETF standard, descendant of Cisco's proprietary TDP
Same concept but port 646
Also to destination 224.0.0.2
6-byte TLV ("LDP-ID") identifies
Router (4 bytes)
Label space (2 bytes)
Per-platform label space is set to zero
LDP is the standard protocol specified by the IETF. It works the
same way as TDP, but the two are incompatible, as you can see just from the
port numbers in use.
Reference: draft-ietf-mpls-ldp-07.txt
A combination of frame-mode and cell-mode (or multiple cell-mode) links results
in multiple LDP sessions.
An LDP session is established by the router with the higher IP address.
Non-adjacent neighbors are discovered by unicast messages.
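The 6-byte LDP identifier and the "higher address initiates" rule can be sketched as follows. The textual form `router:space` and the function names are assumptions for illustration; the sketch does not build real LDP messages.

```python
import ipaddress
import struct

def pack_ldp_id(text):
    """Encode 'router-id:label-space' (e.g. '10.0.0.1:0') into 6 bytes."""
    router, _, space = text.partition(":")
    return ipaddress.IPv4Address(router).packed + struct.pack("!H", int(space))

def unpack_ldp_id(data):
    """Decode 6 bytes back into the 'router-id:label-space' form."""
    router = ipaddress.IPv4Address(data[:4])
    (space,) = struct.unpack("!H", data[4:6])
    return f"{router}:{space}"

def initiates_session(local_addr, peer_addr):
    """The router with the higher address opens the TCP session."""
    return ipaddress.IPv4Address(local_addr) > ipaddress.IPv4Address(peer_addr)
```

A label space of 0 denotes the per-platform label space, so `10.0.0.1:0` is the identifier a router with packet interfaces would typically announce.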

LDP Details
One session per LDP identifier
Per-platform label space: 1 identifier for all links
TCP session initiated from router with highest address
[Figure: Hello message layout: IP | UDP | LDP Identifier (router's address plus 2-byte label space, 0 = per-platform; example 10.0.0.1:0) | optional TLV: Transport Address (session interface; example 20.0.0.1).]
Non-adjacent LDP or TDP sessions can also be established. In this case
unicast addresses are used instead of multicast (for the hello packets).
Note: MPLS is enabled per interface. TDP is used by default on Cisco routers.
If the router works in a mixed environment, enable both LDP and TDP for best
interoperability.

BGP Standard Behavior
Good style: Use loopback addresses and next hop self
BUT: Full mesh IBGP!
BUT: Each router has full routing table!
IGP is used to propagate loopback addresses
1.1.1.1/32, 1.1.1.2/32, 1.1.1.3/32, and 1.1.1.4/32
Note: Sync Off
Otherwise IBGP routes would never be copied into the routing table
IBGP updates would only be propagated by PE-router if this network is reachable via IGP
[Figure: AS 10 with routers R1 to R4 (loopbacks 1.1.1.1/32 to 1.1.1.4/32); R5 in AS 5 announces 12/8 with next hop R5 via EBGP. The AS 10 border router re-announces 12/8 via IBGP to R1, R2 and R3 with next hop 1.1.1.4/32. Its IBGP configuration: neighbor R3, R2, R1; next-hop self; update source loopback 0.]
Note: Sync is on by default (Cisco). "Update source loopback" makes IBGP
use the loopback address as the source address of update messages.
Note: The loopback addresses are specified as neighbor addresses.
Note: Next-hop self is necessary for the PE routers because BGP otherwise
assumes R5 to be the next hop, and there is no label to R5 if the IGP was not
started on the external link.
Do not summarize PE loopback addresses as this would break the label-switched
path. Therefore it is good practice to use host-route loopback addresses with
subnet masks of 32 bits. Likewise, do not use next-hop-self on
confederation boundaries as this would also break the label-switched path.

MPLS and BGP
FEC = Next Hop
Only PE routers must learn all external routes
Only the PE routers must be powerful
IBGP sessions only between PE-routers
[Figure: AS 10 with routers R1 to R4 (loopbacks 1.1.1.1/32 to 1.1.1.4/32) and R5 in AS 5. Thousands of routes arrive via EBGP and are passed on via IBGP with next hop 1.1.1.4/32. Inside AS 10 only labels for this next hop are distributed: "1.1.1.4/32 do POP", "1.1.1.4/32 use 9", "1.1.1.4/32 use 40".]
For IGP-derived routes a FEC represents an IP destination network.
For BGP-derived routes a FEC represents the BGP Next Hop attribute.
This means that all routes which are imported by an EBGP peer into an
autonomous system are reachable via one and the same label, which points
toward the EBGP peer's loopback address, provided that next-hop-self is
used on the EBGP peer.
Therefore P routers don't need to run BGP because they are able to forward
packets for external locations using the label information derived from the
EBGP peer's loopback address.
Advantages summary:
The BGP topology has been much simplified: only the AS edge routers
need to run BGP with full Internet routing.
Core routers do not require much memory. The Internet routing table
(by 2002) comprises about 100,000 routes, which may require more than
50 MB of memory (for the BGP table, IP routing table, CEF's FIB
table and distributed FIB tables).
Changes in the Internet do not impact core routers!
Private (RFC 1918) addresses can be used inside the core. Note that in
this case TTL propagation must be disabled; otherwise a
traceroute would show private addresses.
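Why one label is enough for thousands of external routes can be shown with two small tables. The sketch is illustrative only: the second prefix and the label value are assumptions, and real routers resolve the next hop recursively through the FIB.

```python
def label_for(prefix, bgp_routes, igp_labels):
    """Resolve an external prefix to the label of its BGP next hop."""
    next_hop = bgp_routes[prefix]
    return igp_labels[next_hop]

# On the ingress PE: many BGP routes, all with the same next hop
bgp_routes = {"12.0.0.0/8": "1.1.1.4/32", "13.0.0.0/8": "1.1.1.4/32"}
# Learned via LDP/TDP: a single binding for the next hop's loopback
igp_labels = {"1.1.1.4/32": 40}
```

Both prefixes map to label 40. A core router only needs the binding for 1.1.1.4/32, which is why it can forward traffic for external destinations without running BGP.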

Note
LSRs announce only one label (per destination) to adjacent LSRs
Even if there are parallel links between them
Insecure: Any neighbor can abuse label!
After a link failure
All labels (and related information) are removed from the FIB/LFIB/LIB
After routing convergence FIB (RT) knows another path
New label is provided by LIB
When broken link comes back again
LIB had already lost the label
Path broken!
LDP/TDP sessions are between routers, not between interfaces; that's why
label announcements are only sent once, even if there are parallel links between
them. Therefore the LFIB is smaller and forwarding quicker.
On the other hand, as the label is not bound to any interface, any neighbor
can abuse the label and send a packet with this label to the router. The router
does not (cannot) check whether the packet was received on the right
interface for the given label.
Note that the label for a given destination is lost when a link breaks and
comes back again. MPLS TE provides some measures against this.

Normal TTL Usage
Loop detection
LDP and TDP basically rely on IGP loop detection
Additionally a TTL field in the MPLS header prevents endless routing
TTL Propagation: IP TTL is copied into MPLS header
Enabled by default on Cisco routers
[Figure: IP TTL and MPLS TTL values along the path; the IP TTL is copied into the MPLS TTL at ingress, only the MPLS TTL is decremented inside the MPLS domain, and it is copied back into the IP TTL at egress.]
IGP protocols typically provide strong mechanisms to avoid routing loops.
Nevertheless, the MPLS header carries a TTL field which provides additional
protection against endless looping, for example caused by misconfigured
static routes.
TTL Propagation: This mechanism is enabled by default (at least on Cisco
routers) and ensures that the IP TTL value is also processed inside the MPLS
domain. Actually, the IP TTL value is copied into the MPLS header. Within
the MPLS domain only the MPLS TTL value is decremented.
Upon ingress, the IP TTL is copied to the MPLS header; upon egress the MPLS
TTL is copied back to the IP header.
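The copy-in, decrement, copy-back behaviour can be traced with a few lines. This is a simplified model: exactly which router decrements at ingress and egress is an assumption made here, and the function name is invented.

```python
def ttl_trace(ip_ttl, lsr_hops):
    """Return (ip_ttl, mpls_ttl) after each hop with TTL propagation enabled."""
    ip_ttl -= 1                       # the ingress router decrements as usual
    mpls_ttl = ip_ttl                 # ... and copies the IP TTL into the label
    trace = [(ip_ttl, mpls_ttl)]
    for _ in range(lsr_hops):
        mpls_ttl -= 1                 # inside the domain only the MPLS TTL changes
        trace.append((ip_ttl, mpls_ttl))
    ip_ttl = mpls_ttl                 # egress copies the MPLS TTL back
    trace.append((ip_ttl, None))      # None: the packet is unlabelled again
    return trace
```

A packet entering with IP TTL 9 and crossing three further LSRs leaves the domain with IP TTL 5, so a traceroute from outside sees every hop inside the MPLS domain.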


MPLS VPN
Where the complexity begins...

36
36 {C} Herbert Haas 2005/03/11
Two Major VPN Paradigms
Overlay VPNs: Transparent P2P links

Well-known technology

Provider does not care about customer routing

Best customer isolation

Peer VPNs: Participation in C-routing

Optimum routing

Simple provision of additional VPN

Problems with address space


VPN services can be offered based on two major paradigms:
Overlay VPNs require service providers to provide virtual point-to-point
links between customer sites. The service provider does not see customer
routes and is responsible only for providing point-to-point transport of
customer data. All routing protocols run directly between customer routers.
Layer 1 solutions: Classical TDM technologies such as E1, ISDN,
SONET/SDH.
Layer 2 solutions: FR, ATM, X.25.
Layer 3 solutions: IPsec and GRE, whereas access (dialup) environments use
L2TP, PPTP or L2F.
Peer-to-Peer VPNs require service providers to participate in customer
routing.
The isolation of the customers is realized via packet filters on PE routers at
the PE-CE interfaces.
An alternative is to implement controlled route distribution where each
customer has a dedicated PE router which only knows about this customer's
routes.
Peer VPNs allow a much simpler provision of additional VPNs because only
the sites are provisioned, not the links between them.
Note: All customers share the same (provider-assigned or public) address
space.

MPLS VPN - Best of Both Worlds
PE routers participate in C-routing

Hence optimum routing between sites

Easy provisioning (sites only)

PE routers allow route isolation

By using Virtual Routing and Forwarding Tables (VRF)

Allows overlapping address spaces

Overlapping VPNs possible

By a simple (?) attribute syntax


The MPLS VPN solution combines the best of both worlds (overlay and
peer VPN).
Here the PE routers participate in C-routing, which allows for easy
provisioning and optimum site connections. The core routers do not need
to carry much routing information; only the PE routers need some processing power.
Site isolation is provided by Virtual Routing and Forwarding Tables
(VRFs), which are explained soon. This method allows for overlapping address
space or overlapping VPNs (but not both together).
The main task is to specify which routes should be imported into which VRF.
This is accomplished by special attributes during the configuration. The
principle is easy (as you will see) but the attribute syntax looks...strange (as
you will see).

MPLS VPN - Principles
Requires MPLS Transport
Requires MP-BGP

Supports IPv4/v6, VPNv4, multicast

Default behavior: BGP-4

VPNv4 uses 96 bit addresses

64 bit Route Distinguisher (RD)

32 bit IP address
Every router uses one VRF for each VPN

Virtual Routing and Forwarding Table (VRF)


For MPLS VPN services it is mandatory to have a properly working MPLS
transport system already in place. Furthermore, MP-BGP needs to be set up to
allow the exchange of VPNv4 updates and VPN label information.
A VPNv4 address is made up of a 64 bit Route Distinguisher (RD) and a 32
bit IPv4 address. This VPNv4 address is needed to allow overlapping address
spaces inside different VPNs. Every PE router holds several VRFs, each of which
holds address information for one or more VPNs, depending on whether simple
VPNs or overlapping VPNs are in use.
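To make the 96 bit arithmetic concrete, the following sketch builds a VPNv4 address from an RD and an IPv4 address. It assumes the common "ASN:number" RD layout (2-byte type field set to 0, 2-byte AS number, 4-byte assigned number); the function name vpnv4_address is made up for this illustration.

```python
import ipaddress
import struct


def vpnv4_address(asn: int, assigned: int, ipv4: str) -> bytes:
    # Assumed RD layout: 2-byte type (0), 2-byte AS number,
    # 4-byte assigned number = 8 bytes (64 bit) in total.
    rd = struct.pack("!HHI", 0, asn, assigned)
    # The 4-byte (32 bit) IPv4 address is appended to the RD.
    return rd + ipaddress.IPv4Address(ipv4).packed


# Two customers using the same IPv4 network but different RDs.
a = vpnv4_address(10, 100, "10.2.0.0")
b = vpnv4_address(10, 200, "10.2.0.0")
```

Both results are 12 bytes (96 bit) long and differ only in the RD part, which is what keeps overlapping customer address spaces apart inside MP-BGP.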

MPLS VPN
[Figure: MPLS VPN example. R1, R2 and R3 form the provider network (AS 10) running LDP/TDP, OSPF, MP-BGP and MPLS. For the loopback 1.1.1.1/32, R2 is told "do POP" and R3 "use 36". On each side of the provider network a CE1 and a CE2 router run RIPv2; one side uses network 10.2.0.0, the other 10.3.0.0. VRF CE2: R 10.2/16 via CE2, RD 10:100, RTi 100:200, RTe 100:200. VRF CE1: R 10.2/16 via CE1, RD 10:200, RTi 100:300, RTe 100:300. Both RIPv2 routes are redistributed into the VPNv4 table as B 10:100 10.2/16 and B 10:200 10.2/16.]
Each interface is exclusively a member of either the global routing process or one
VRF. The RD, RTi, and RTe are manually configured by the administrator.
Each VRF has exactly one RD configured, but can have one or more RTi and
RTe values. The RD identifies each VPN (unless overlapping VPNs are configured).
Routes for a VPN are learned via a standard routing process running between
the PE and the CE router, such as RIPv2, OSPF, EIGRP or EBGP.
RIPv2, EIGRP or EBGP are good choices because a link state protocol such as
OSPF would be limited to approx. 28 processes (theoretically a total of 32
routing processes). RIPv2, EIGRP and EBGP on the other hand can maintain
many sub-processes, consuming only one process number.
Bidirectional redistribution needs to be configured between MP-BGP and
OSPF, RIPv2 or EIGRP, which copies the IGP information into the MP-BGP
VPNv4 table and vice versa. Redistribution is not needed when EBGP is used
as the PE-CE routing protocol.
Learned routes and the preconfigured RD are redistributed from the VRF tables
into the MP-BGP VPNv4 table, and since BGP sends triggered updates, this
information is sent to the peers.
Note: the MP-BGP VPNv4 table does not show the RTe, but the RTe is
copied into the BGP database during the redistribution process.

MPLS VPN
[Figure: MPLS VPN example, continued. Two MP-IBGP updates are sent across the provider network (AS 10). Update 1: NLRI 10:200 10.2/16, NH 1.1.1.1/32, RTe 100:300, VPN label 77. Update 2: NLRI 10:100 10.2/16, NH 1.1.1.1/32, RTe 100:200, VPN label 52. The LFIB of the sending PE maps label 52 to interface S0.1. The IGP metric is carried in the MED attribute.]
The RD together with the IPv4 address makes up the VPNv4 address, which is
propagated via MP-BGP updates. These VPNv4 addresses are now used in the
NLRI fields of the BGP update instead of traditional IPv4 addresses. The
RTe is carried with this update using extended community attributes, as is
the VPN label information.
The received MP-IBGP update is then imported into all VRFs which hold a
matching RTi and optionally redistributed towards the connected CE routers.
During the import from the VPNv4 table to the VRF the RD is removed,
resulting in a standard IPv4 address.
The IGP metric (i.e. the RIPv2 hop count) is copied into the BGP MED attribute
in order to carry this information to the other side.

MPLS VPN
[Figure: MPLS VPN example, continued. The receiving PE (R3) imports the two MP-IBGP updates (NLRI 10:200 10.2/16 with RTe 100:300 and VPN label 77; NLRI 10:100 10.2/16 with RTe 100:200 and VPN label 52; NH 1.1.1.1/32 in both) into the VRFs with a matching RTi, which gives the entry "B 10.2/16 via 1.1.1.1/32" in VRF CE1 and in VRF CE2. R3 reaches 1.1.1.1/32 via R2 (OSPF). The routes are redistributed into RIPv2, the MED becoming the IGP metric, so the CE routers attached to R3 (C 10.3/16) learn R 10.2/16. In the opposite direction the CE routers with C 10.2/16 learn R 10.3/16.]
The RTi (import) is used locally by a VRF instance to determine which routes
will be imported into the VRF table and which will not.
Routes are only copied into the VRF if the RTe matches the RTi. Such a route
must then be redistributed into the RIPv2 process.
In addition, an MPLS label for this VPN is communicated via IBGP and is directly
copied into the CEF table (FIB) of the peer PE router.
The MED attribute is copied into the hop-count field of the RIPv2 update.
Thus, CE1 and CE2 on the right side learn about the metric which was
specified on the other edge of the provider. The MPLS network is fully
transparent to RIPv2 and only increases the IGP metric by one.
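The import rule (copy a route into a VRF only if its RTe matches one of the VRF's RTi values, and drop the RD on the way) can be sketched as follows. The dictionary layout and the function name import_into_vrfs are assumptions made for this illustration; the values are the ones used in the example (RD 10:100 with RT 100:200 and label 52, RD 10:200 with RT 100:300 and label 77).

```python
def import_into_vrfs(updates, vrfs):
    """Return, per VRF, the plain IPv4 routes it imports from the updates."""
    tables = {name: [] for name in vrfs}
    for update in updates:
        _rd, prefix = update["nlri"]  # the RD is removed during import
        for name, vrf in vrfs.items():
            # Import only if at least one exported RT matches an import RT.
            if update["rt_export"] & vrf["rt_import"]:
                tables[name].append(
                    (prefix, update["next_hop"], update["vpn_label"])
                )
    return tables


updates = [
    {"nlri": ("10:200", "10.2.0.0/16"), "next_hop": "1.1.1.1",
     "rt_export": {"100:300"}, "vpn_label": 77},
    {"nlri": ("10:100", "10.2.0.0/16"), "next_hop": "1.1.1.1",
     "rt_export": {"100:200"}, "vpn_label": 52},
]
vrfs_on_r3 = {
    "CE1": {"rt_import": {"100:300"}},
    "CE2": {"rt_import": {"100:200"}},
}
tables = import_into_vrfs(updates, vrfs_on_r3)
```

Although both updates carry the same IPv4 prefix 10.2/16, each VRF ends up with exactly one route and its own VPN label, because the route targets select the VRF and the RD kept the two routes distinct in the VPNv4 table.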

Transparent for IGP
R3 has one FIB per RT
One FIB for global RT
One FIB for VRF CE1 RT
One FIB for VRF CE2 RT
Each MPLS router has exactly one LFIB
PE routers must be connected with CE routers via (sub) interfaces
[Figure: A packet to 10.2.2.2 crosses the provider network R3, R2, R1 between the 10.3.0.0 and 10.2.0.0 sites. R3 uses its VRF FIB entry "B 10.2/16 via 1.1.1.1/32 use label 36; 52" and sends the packet with the labels 36 and 52. R2 pops label 36 and forwards the packet with label 52 only. R1 finds "52 S0.1" in its LFIB, removes the VPN label and delivers the plain IP packet. The CE routers only see RIPv2 routes (R 10.2/16, C 10.3/16).]
Now IP packets can be forwarded between the VPN sites. For example, IP packets
to 10.2.2.2 are forwarded from the CE1 router (right side) to the next hop, the VRF on
R3, which adds the labels {36; 52} to the MPLS header according to its FIB.
R2 pops the MPLS transport label and R1 can quickly deliver the IP packet
to the correct VPN according to the remaining VPN label {52}, which is stored
in the LFIB at R1 and points to the interface of the appropriate VPN.
R1 removes the MPLS VPN label {52} before the IP packet is delivered to
CE2 (left side). Thus, the VPNs do not notice any MPLS network in
between; MPLS is completely transparent.
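The forwarding walk-through above can be replayed with three small tables. The table names and the send function are hypothetical; the label values (36 as transport label, 52 as VPN label, S0.1 as outgoing interface) are the ones from the example.

```python
import ipaddress

# R3: FIB of the VRF, "use label 36; 52" (top of stack first).
R3_VRF_FIB = {ipaddress.ip_network("10.2.0.0/16"): [36, 52]}
# R2: LFIB entry for the transport label.
R2_LFIB = {36: "POP"}
# R1: LFIB entry for the VPN label, pointing to the customer interface.
R1_LFIB = {52: "S0.1"}


def send(dst: str):
    addr = ipaddress.ip_address(dst)
    # R3 looks up the destination in the VRF FIB and pushes both labels.
    stack = next(list(lbls) for net, lbls in R3_VRF_FIB.items() if addr in net)
    hops = [("R3 push", list(stack))]
    # R2 only looks at the top label and removes it.
    if R2_LFIB[stack[0]] == "POP":
        stack.pop(0)
    hops.append(("R2 pop", list(stack)))
    # R1 uses the remaining VPN label to pick the interface, then removes it.
    out_if = R1_LFIB[stack.pop(0)]
    hops.append(("R1 deliver via " + out_if, list(stack)))
    return hops
```

Calling send("10.2.2.2") yields the label stacks [36, 52], [52] and [] for the three hops, so the packet leaves R1 as a plain IP packet.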

Overlapping VPNs
IBGP Split Horizon Rule assures that R3 (HQ) does not forward
routes learned from peers
IP networks must be unique in overlapping situations!
[Figure: R1, R2 and R3 (HQ) exchange routes for the networks 10.1/16, 10.2/16 and 10.3/16 via IBGP. One branch VRF uses RD 10:201 with RTi = RTe = 10:100, the other branch VRF uses RD 10:202 with RTi = RTe = 10:200. The HQ VRF uses RD 10:200 with both 10:100 and 10:200 configured as RTi and RTe. The updates carry 10.1/16 with RTe 10:100, 10.2/16 with RTe 10:200, and 10.3/16 with RTe 10:100 and 10:200. Between the two branch VRFs there is no RT match.]
When using simple VPNs the RTi is equal to the RTe (keyword "both" when
configuring), but when overlapping VPNs are used, the Route Targets need to
differ according to the desired communication behavior.
In our example all routes from VPN-green and VPN-red are propagated to
R3 (HQ) and copied into the VRF table due to the configured RTe and RTi
values.
When R3 sends its updates towards R1 and R2, all routes out of R3's VRF (except
routes learned from IBGP sessions) are propagated to R1 and R2 with both
RTe values attached. These routes are then imported by R1 and R2 into the
appropriate VRF tables.
Due to the IBGP split horizon rule, R3 does not propagate routes learned from
R2 towards R1 and vice versa. So without the IBGP split horizon rule MPLS
VPNs would not exist.
Note: Both RTi and RTe can be configured multiple times. For example, one
VRF on a router can have three different RTi values specified. Therefore, all
IBGP updates whose RTe values match one of the specified RTi values can be
imported.
Note: some older IOS versions require that at least one RTi and one RTe are
identical.
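The effect of the route targets in the overlapping example can be checked with a small sketch. The assignment of the prefixes and RT values to R1, R2 and R3 below is an assumption read from the example (two branch sites with one RT each, HQ with both); the function name imported_prefixes is made up.

```python
SITES = {
    "R1": {"prefix": "10.1.0.0/16",
           "rt_import": {"10:100"}, "rt_export": {"10:100"}},
    "R2": {"prefix": "10.2.0.0/16",
           "rt_import": {"10:200"}, "rt_export": {"10:200"}},
    "R3": {"prefix": "10.3.0.0/16",
           "rt_import": {"10:100", "10:200"},
           "rt_export": {"10:100", "10:200"}},
}


def imported_prefixes(sites):
    """For every site, list the remote prefixes whose RTe matches its RTi."""
    result = {}
    for name, vrf in sites.items():
        result[name] = sorted(
            other["prefix"]
            for other_name, other in sites.items()
            if other_name != name and other["rt_export"] & vrf["rt_import"]
        )
    return result


reachability = imported_prefixes(SITES)
```

Under these assumptions both branch sites import only the HQ network, while the HQ imports both branch networks; the branches never see each other because their route targets do not match.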

Cell-based MPLS
If you need this...
ATM? Try some, buy some...

Cell-based MPLS
Label-switching controlled ATM
(LC-ATM)

On ATM switches
On Routers with ATM interfaces
Legacy ATM switches become
MPLS capable
Via firmware upgrade, if existing
control processor allows that (LS
1010, Cat 8510, Cat 8540, Cat 5500)
Via external Label Switch Controller
(LSC) attached on standard ATM
interface (MGX 8850, BPX 8650)
[Figure: An LSC (Cisco 7500/7200 router) is attached via an ATM link to a BPX 8650 and controls it using VSI.]
Enabling cell-based MPLS on Cisco IOS-based ATM switches is identical
to enabling frame-based MPLS on IOS routers. When enabling cell-based
MPLS on IOS routers with ATM interfaces, the command interface atm
X/X/X tag-switching must be used. The keyword tag-switching here
reserves the VC 0/32 for control messages.
An LSC is available for Cisco BPX switches. A special Virtual Switch Interface
(VSI) protocol is used between the standard ATM interface and the LSC. The
VSI basically only supports VC additions and deletions. All higher MPLS
operations are performed by the LSC using VC 0/32.
One main advantage of cell-mode MPLS is that it avoids NSAP addressing (and
mapping), which is needed to run PNNI.


Cell-mode MPLS Cells
ATM switches can only switch VPI/VCI, no MPLS labels!
Only the topmost label is inserted in the VPI/VCI field
Other reserved VPI/VCI fields are used for LDP/TDP and
routing updates
Note: Typically only a few VPI/VCI combinations are
supported by each switch
Labels are a very scarce resource!
Per-interface label allocation
[Figure: A frame-mode packet (Layer 2 header, MPLS header, IP packet) compared with its cell-mode form. The first cell carries the MPLS header and the beginning of the IP packet, subsequent cells carry the continued IP packet; every cell has an ATM header and an AAL5 part.]
The top label is always copied into the (VPI/)VCI fields. LDP/TDP sessions
are established via reserved VPI/VCI values. Typically an ATM switch only
provides a few VPI/VCI numbers, so it is difficult to map all MPLS labels
used in a router network.
Note that LC-ATM provides a per-interface label allocation since the ATM
switching matrix (LFIB) always contains the incoming interface! That is, the
same labels can be reused on different interfaces of the same machine. This
has a security advantage: Labeled packets are only accepted on those interfaces
where the labels have previously been assigned.
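The difference between a per-platform label space (frame-based LSRs, where the label is not bound to an interface) and the per-interface label space of LC-ATM can be illustrated with two lookup tables. Interface names, label values and function names are invented for this sketch.

```python
# Per-platform label space: the LFIB is keyed by the label only,
# so the incoming interface is never checked.
platform_lfib = {45: "towards net 10"}

# Per-interface label space (LC-ATM): the key contains the incoming
# interface, so the same label can be reused on another interface.
interface_lfib = {
    ("atm0", 45): "towards net 10",
    ("atm1", 45): "towards net 20",
}


def lookup_platform(in_interface, label):
    return platform_lfib.get(label)


def lookup_interface(in_interface, label):
    return interface_lfib.get((in_interface, label))
```

A packet with label 45 arriving on an unexpected interface is still forwarded by lookup_platform, whereas lookup_interface returns None for it. This is the security advantage mentioned above.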

Basic Principles Summary
MPLS Layer 2.5 packet is sent via AAL5
Top-of-stack label is always copied into VPI/VCI field
Per default: VPI=1, range can be configured
LDP, TDP and routing protocols are sent in-band in VC 0/32
by default (IETF)
Other channel can be configured
Out-band control channel typically not implemented (e. g.
Ethernet)
ATM switches typically perform control-driven label
requests downstream
Based on RT content, not actual data flow
Recursive process (request/response: "Ordered Control")
[Figure: The requests "Need label for net 10" travel downstream hop by hop (steps 1, 2, 3); the responses "Use label 1/45", "Use label 1/31" and "Use label 1/99" travel back upstream (steps 4, 5, 6).]
The main difference between frame-based MPLS in routers and cell-based
MPLS is the following: Routers can handle both IP packets (LDP, TDP,
routing updates) and labeled packets (MPLS data packets on layer 2.5). But
ATM switches can ONLY handle VPI/VCI-labeled packets.
As the top-of-stack MPLS label is now always carried in the VPI/VCI field, there
must be a dedicated VC for control packets such as LDP, TDP, and routing
protocols.
Per default, only the 16-bit VCI value carries the label value. Note that VPI
values are a scarce resource. Therefore the VPI value is set to 1 per default.
Optionally, a VPI range can be specified.
The MPLS control VC is by default configured on VC 0/32 and must use
LLC/SNAP encapsulation of IP packets as defined in RFC 1483. The
corresponding IOS keyword is aal5snap.

Label Request Procedure
A router requests a label for every destination with next
hop reachable via LC-ATM interface
An ATM switch can only allocate an incoming label if it
already has an outgoing label
Thus a label request can only be answered after outgoing label
had been requested
"Ordered control"
LSRs can always assign an incoming label
"Independent control"
LFIB = ATM switching matrix
[Figure: The requests "Need label for net 10" travel downstream hop by hop (steps 1, 2, 3); the responses "Use label 1/45", "Use label 1/31" and "Use label 1/99" travel back upstream (steps 4, 5, 6).]
Labels are requested via LDP/TDP as soon as an edge router (LSR) learns
about a destination which is reachable via a next hop through an LC-ATM
interface.
Each ATM-LSR can only allocate a label for this (requested) destination when
it already knows an outgoing label. Therefore the response message must be
delayed and another label request is sent downstream. Only when the last LSR
on the right side (or the ATM-LSR which is the egress ATM-LSR and needs L3
functionality) receives the request does it allocate a label and send a response to
the label request. Note that this last (egress) ATM-LSR has no outgoing label
as it is directly connected to the destination network. We assume that "net
10" is located at the right side next to the rightmost LSR.
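The recursive request/response procedure ("ordered control") can be sketched as a recursive function. The node names and the mapping of the labels 1/45, 1/31 and 1/99 to particular nodes are assumptions for illustration only.

```python
def request_label(nodes, incoming_label, i=0, log=None):
    """nodes[i] asks nodes[i + 1] for a label; returns (label, log)."""
    if log is None:
        log = []
    upstream, downstream = nodes[i], nodes[i + 1]
    log.append(f"{upstream} -> {downstream}: need label for net 10")
    if i + 2 < len(nodes):
        # The downstream node is not the egress: it has to obtain its
        # own outgoing label first, so the answer is delayed.
        request_label(nodes, incoming_label, i + 1, log)
    label = incoming_label[downstream]
    log.append(f"{downstream} -> {upstream}: use label {label}")
    return label, log


# Hypothetical chain: edge LSR, two ATM switches, egress LSR.
nodes = ["LSR-A", "ATM-1", "ATM-2", "LSR-B"]
incoming_label = {"ATM-1": "1/45", "ATM-2": "1/31", "LSR-B": "1/99"}
label, log = request_label(nodes, incoming_label)
```

The log contains three requests followed by three responses in reverse order, which corresponds to the steps 1 to 6 of the procedure: no node answers before it has received a label from its own downstream neighbor.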

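Reuse of Downstream Labels
Reusing downstream label leads to
interleaving of IP packets!

Allocate a separate downstream label for every upstream request

Prevent cell interleaving (watch packet boundaries) - "VC Merge"
[Figure: Sources A and B send cells via switch C towards D. D has told C "Use 2/43"; C has answered its two upstream neighbors with "Use 1/80" and "Use 1/81". Cells arriving at C with 1/80 and 1/81 all leave with the same label 2/43 and are interleaved.]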
Note the difference to the old AAL5 problem: All cells belonging to one AAL5
IP packet are not interleaved with the cells of another IP packet received on the
same interface, that is, from the same source (having the same VPI/VCI). But
a switch may indeed interleave the cells of different VPI/VCIs. The only
problem occurs if some cells are lost, especially the last cell which indicates
packet boundaries.
The problem illustrated above involves two sources (A and B) whose cells are
switched downstream with the same label. This is possible in a normal MPLS
network which consists of routers only! But with LC-ATM the cells of different
packets would be interleaved and the packets could not be reassembled correctly anymore.
Therefore, two solutions are implemented: Avoid cell interleaving (and assure
packet interleaving only) or allocate separate downstream labels for every
upstream request.

VC-Merge
Blocks incoming cells until last cell
of packet arrived
Saves labels but requires switch to
serialize all cells belonging to one
packet
Serialization delay increased and
buffer resources needed
Jitter increases!
AAL5 only marks the end cell of an IP packet. Therefore it is not possible to
aggregate several MPLS VCs into one VC using a single label: as soon as
cells are interleaved, subsequent switches cannot reassemble the IP packets.
If switches support "VC Merge", they are able to buffer all cells
belonging to one IP packet and send them at once. That is, the switches avoid
interleaving cells of different IP packets.
But most implementations block all other interfaces in the meantime! Then
the forwarding delay of a complete packet depends on concurrent packets.
Jitter occurs! This solution transforms the cell-based ATM network into a
classical frame-based network!
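Why interleaved cells break AAL5 reassembly, and how buffering per packet avoids it, can be shown with a toy model. A cell is modeled as a tuple (source, sequence number, last-cell flag); this representation and the function names are assumptions of the sketch, not a model of a real switch.

```python
def reassemble(cells):
    """AAL5-style reassembly: only the last-cell flag marks a boundary."""
    packets, current = [], []
    for source, seq, last in cells:
        current.append((source, seq))
        if last:
            packets.append(current)
            current = []
    return packets


def vc_merge(cells):
    """Buffer cells per incoming VC and release only complete packets."""
    buffers, out = {}, []
    for cell in cells:
        source, _seq, last = cell
        buffers.setdefault(source, []).append(cell)
        if last:
            out.extend(buffers.pop(source))
    return out


# Cells of two packets (from A and B) arrive interleaved and would be
# sent downstream with the same label.
arrivals = [("A", 1, False), ("B", 1, False), ("A", 2, True), ("B", 2, True)]
broken = reassemble(arrivals)
merged = reassemble(vc_merge(arrivals))
```

Without VC merge the first reassembled "packet" mixes cells of A and B; with VC merge every reassembled packet contains cells of one source only. The price is the buffering delay discussed above.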

Summary
The very basic idea:

MPLS decouples information used for forwarding (the
label) and information used for routing (the IP address)
MPLS transport
Is fundamental to other MPLS features
Requires a label distribution system (LDP/TDP)
Requires CEF to establish a fast FIB

Can do label stacking which allows greater flexibility

Differentiate frame-based and cell-based MPLS
MPLS VPNs
Additional label to differentiate VPNs
VPNv4 addresses and Route Targets to define VPN
membership of the VRFs

1
2005/03/11 {C} Herbert Haas
DNS Introduction
www.what-is-my-ip-address.com

"Except for Great Britain. According
to ISO 3166 and Internet tradition,
Great Britain's top-level domain
name should be gb. Instead, most
organizations in Great Britain and
Northern Ireland (i.e., the United
Kingdom) use the top-level domain
name uk. They drive on the wrong
side of the road, too."
DNS and BIND book
Footnote to the ISO 3166 two-letter country code TLDs

DNS Tree Growth
162,128,493
by 2002/7
The ISC about the new DNS survey method:
The new survey works by querying the domain system for the name assigned to every possible IP
address. However, this would take too long if we had to send a query for each of the potential 4.3
billion (2^32) IP addresses that can exist. Instead, we start with a list of all network numbers that
have been delegated within the IN-ADDR.ARPA domain. The IN-ADDR.ARPA domain is a
special part of the domain name space used to convert IP addresses into names. For each
IN-ADDR.ARPA network number delegation, we query for further subdelegations at each network
octet boundary below that point. This process takes about two days and when it ends we have a
list of all 3-octet network number delegations that exist and the names of the authoritative domain
servers that handle those queries. This process reduces the number of queries we need to do from
4.3 billion to the number of possible hosts per delegation (254) times the number of delegations
found. In the January 1998 survey, there were 879,212 delegations, or just 223,319,848 possible
hosts.
With the list of 3-octet delegations in hand, the next phase of the survey sends out a common
UDP-based PTR query for each possible host address between 1 and 254 for each delegation. In
order to prevent flooding any particular server, network or router with packets, the query order is
pseudo-randomized to spread the queries evenly across the Internet. For example, a domain server
that handles a single 3-octet IN-ADDR.ARPA delegation would only see one or two queries per
hour. Depending on the time of day, we transmit between 600 and 1200 queries per second. The
queries are streamed out asynchronously and we handle replies as they return. This phase takes
about 8 days to run.
See RFC 1296 for details of how traditional DNS surveys were made.
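The numbers quoted above are easy to verify, and the PTR query names for one 3-octet delegation can be generated with Python's standard ipaddress module. The network 192.0.2.0/24 is only a documentation example and the function name ptr_names is made up; the sketch does not send any queries.

```python
import ipaddress

TOTAL_ADDRESSES = 2 ** 32        # all possible IPv4 addresses
DELEGATIONS = 879_212            # 3-octet delegations, January 1998 survey
HOSTS_PER_DELEGATION = 254       # host addresses 1..254
POSSIBLE_HOSTS = DELEGATIONS * HOSTS_PER_DELEGATION


def ptr_names(network: str):
    """Yield the PTR query names for hosts 1..254 of a /24 network."""
    net = ipaddress.ip_network(network)
    for host in range(1, HOSTS_PER_DELEGATION + 1):
        yield (net.network_address + host).reverse_pointer


names = list(ptr_names("192.0.2.0/24"))
```

The product is indeed 223,319,848, which is roughly 5 percent of the 4.3 billion possible addresses. The first generated name is 1.2.0.192.in-addr.arpa, which shows how the octets are reversed inside the IN-ADDR.ARPA domain.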

Top Host Names - Worldwide
956841 www
336393 mail
56958 cpe
36107 router
35004 ftp
33720 ns2
33128 gw
27548 ns1
23019 pc1
21775 pc2
16432 smtp
15265 pc3
15177 pc4
14979 broadcast
14891 pc5
14877 gateway
14138 server
...big gap...
3884 cisco
3883 venus
3867 dev
3795 zeus
3765 jupiter
3720 mars
3656 l0
3647 t3
3567 www3
3511 loopback0
3470 pop
3452 mercury
3438 intranet
3404 demo
3397 alpha
3388 pc13
3330 pluto
3308 exchange
3253 linux
384 venus 204 mac4 172 mac9
356 pluto 201 hobbes 172 mac11
323 mars 201 hermes 170 mac8
288 jupiter 198 thor 169 phoenix
286 saturn 198 sirius 169 mac12
285 pc1 196 gw 169 hal
282 zeus 195 calvin 168 snoopy
262 iris 194 mac5 168 mac13
260 mercury 191 mac10 167 mac15
259 mac1 190 fred 167 mac14
258 orion 189 titan 167 grumpy
254 mac2 189 pc3 163 gandalf
240 newton 186 opus 162 pc4
234 neptune 186 mac6 160 uranus
233 pc2 185 charon 159 mac16
224 gauss 185 apollo 158 sleepy
222 eagle 179 mac7 158 io
213 mac3 179 athena 157 earth
209 merlin 177 alpha 156 europa
207 cisco 172 mozart 155 rigel
First list (one entry per line): Top Host Names July 2002. Second list (three columns): Top Host Names Jan 1992.
Notice that people used fancier names 10 years earlier. What can we
conclude from this?

History
Even in the early Arpanet hosts have been
identified by names

For people, not machines!

Name/Address bindings in HOSTS.TXT
files
[Figure: Network "SPark" with the hosts Eric (10.0.1.1), Kenny (10.0.1.2) and Stan (10.0.1.3). Host file of Eric:]
127.0.0.1 eric localhost
10.0.1.1 eric.spark eric
10.0.1.2 kenny.spark kenny
10.0.1.3 stan.spark stan
(Kenny and Stan have similar host files)

Through the 1970s, the ARPAnet was a small community of a few hundred hosts.
A single file called HOSTS.TXT contained a name-to-address mapping for every
host connected to the ARPAnet. The familiar UNIX host table, /etc/hosts, was
compiled from HOSTS.TXT.
HOSTS.TXT was maintained by SRI's Network Information Center ("the NIC")
and distributed from a single host, SRI-NIC. SRI is the Stanford Research
Institute in Menlo Park, California. SRI conducts research into many different
areas, including computer networking.
ARPAnet administrators typically emailed any changes to the NIC, and
periodically fetched the current HOSTS.TXT by FTP. Any changes were
compiled into a new HOSTS.TXT, typically once or twice a week.
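How a host table of this kind is used for name resolution can be sketched with a tiny parser for the /etc/hosts format (address first, then one or more names). The sample data is the host file of "eric" from the example; the function name parse_hosts and the rule that the first entry for a name wins are assumptions of this sketch.

```python
HOSTS = """\
127.0.0.1 eric localhost
10.0.1.1 eric.spark eric
10.0.1.2 kenny.spark kenny
10.0.1.3 stan.spark stan
"""


def parse_hosts(text):
    """Build a name -> address table; the first entry for a name wins."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name, address)
    return table


table = parse_hosts(HOSTS)
```

Looking up "kenny" returns 10.0.1.2. Every host needs its own complete copy of this table, which is exactly the property that stopped scaling.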

Hostfile Problems
Centrally maintained by Network
Information Center (NIC)
Copied by all hosts
Scalability problem
Consistency problem
Maintenance problem
Unfortunately this approach did not scale as the Arpanet was growing faster and
faster. Every additional host not only caused another line in HOSTS.TXT, but also
produced additional update traffic from and to SRI-NIC. Thus the total network
bandwidth necessary to distribute a new version of the hosts file is proportional
to the square of the total number of hosts! In those days memory was very
expensive, and additionally, modified hostnames on a local network became
visible to the Internet only after a long (distribution) delay. Furthermore the
name space was not yet hierarchically organized and this "directory" became
chaotic.
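The quadratic growth follows from two linear factors: the file has one line per host, and every host fetches the whole file. A minimal sketch, assuming a fixed (and arbitrarily chosen) size of 50 bytes per entry:

```python
def distribution_traffic(hosts: int, bytes_per_entry: int = 50) -> int:
    """Bytes transferred when every host fetches one new HOSTS.TXT."""
    file_size = hosts * bytes_per_entry  # one line per host
    return hosts * file_size             # one download per host
```

Doubling the number of hosts quadruples the traffic: distribution_traffic(200) is four times distribution_traffic(100), independent of the assumed entry size.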
For example, name collisions occurred, that is, two hosts in HOSTS.TXT could
have the same name. While the NIC could assign unique addresses, it had no
authority over host names. There was nothing to prevent someone from adding a
host with a conflicting name and violating the rules of the name organization. For
example, if somebody added a host with the same name as a major mail hub, he
could disrupt mail service for many users.
A successor system should decentralize administration, which would eliminate the
single-host bottleneck and relieve the traffic problem. Local management would
also make the task of keeping data up to date much easier. It should use a
hierarchical name space to name hosts, which would ensure the uniqueness of names.

1984: DNS
Paul Mockapetris (IAB) created DNS
Distributed database

World-wide and redundant

Maintained by Name Servers

Simulates hierarchical tree of mnemonic names

Each domain name is a node in a database

Goal: Simple "Hostname resolution"

But also stores other information


Paul Mockapetris, a member of USC's Information Sciences Institute, was
responsible for designing the architecture of the new system. In 1984, he
released RFCs 882 and 883, which describe the Domain Name System. Later,
these RFCs were superseded by RFCs 1034 and 1035, the current
specifications of the Domain Name System. RFCs 1034 and 1035 have now
been augmented by many other RFCs, which describe potential DNS security
problems, implementation problems, administrative gotchas, mechanisms for
dynamically updating name servers and for securing domain data, and much
more.
A few RFC examples about basic DNS concepts:
RFC 1034: Domain Names - Concepts and Facilities
RFC 1035: Domain Names - Implementation and Specification
RFC 1713: Tools for DNS debugging
RFC 1032: Domain Administrators Guide
RFC 1033: Domain Administrators Operations Guide
The basic idea was simply to "split the HOSTS.TXT file into thousands of
fragments". DNS replaces the host address with a human-readable name and
enables a mapping between names and addresses (and many other types of
information).
Note: Domain names are just indexes of the database, which may store
any kind of information, not only IP addresses!
The domain system is also important for forwarding emails. There are entry types to
define which computer handles mail for a given name, to specify where an
individual is to receive mail, and to define mailing lists.
But most often it is only used for "hostname resolution", that is, finding an IP
address for a given domain name.

Logical Tree of Names
IP net-IDs are "flat"
Arbitrary assignment
without semantic or
logical considerations

Hard to remember
DNS maps addresses to
names
DNS allows hierarchical
tree of names

No name collisions
anymore!

Max 127 levels

Concatenation results in
Fully Qualified Domain
Name (FQDN)
[Figure: DNS tree. The root domain "." is at the top. Below it the TLDs COM, ORG, BIZ, EDU, AT, ...; then the 2nd level domains DEBIAN (below ORG) and AC (below AT); then the 3rd level domain TUWIEN (below AC). WWW is attached below DEBIAN, WWW and GD below TUWIEN. Examples: WWW.DEBIAN.ORG. is 192.25.206.10, GD.TUWIEN.AC.AT. is 192.35.244.50, WWW.TUWIEN.AC.AT. is 128.130.102.130.]
Note that IP network addresses are flat. Although we often call IP addresses
structured, the net-IDs are indeed flat, that is, they have no further structure.
Moreover, IP address assignment has been done rather arbitrarily without taking
semantic or logical considerations into account. But what's most important:
people cannot easily memorize a 32-bit number written in decimal.
The DNS maps the whole "flat" IP address space into a logical and hierarchical
tree of names. The tree originates at the root domain, which is represented by a
single dot ".", while all other domains (first level domains, second level
domains, and so on) are attached below the root. The first level domains are
also known as "Top Level Domains" (TLDs).
The leaves of this tree and each node in between can be specified by
concatenating all names from the node up to the root. This is called a "Fully
Qualified Domain Name" (FQDN).
Note: This tree does not reflect any physical or geographical location of hosts!
For example, ten different hosts might be physically located in different networks
and each in a different country, but all can belong to the same domain!

Name Servers
The DNS tree is realized by Name Servers
The Domain Name Tree does NOT reflect
the physical network structure!
Each NS cares for a subset of the DNS
tree: zones
Flexible mappings

1:n (Routers or servers with several network interfaces)

n:1 (Multiple services behind a single IP address)
How is this hierarchical tree implemented? All information is stored in name
servers distributed world-wide, each of which of course knows only a fragment.
This fragment is called "zone" information. A "zone" is simply a part of the
tree or a subdomain. Zones are explained later in more detail.
DNS allows flexible mappings between addresses and names; they do not need
to be one-to-one! For example, a router might be known by a unique name but be
reachable by multiple addresses because it employs a number of interfaces.
Furthermore, a workstation might offer different services such as FTP, HTTP,
MAIL, and so on, and each service is identified by a separate name, for example
ftp.x.y.z, www.x.y.z or mail.x.y.z. These mappings are implemented by so-called
aliases.

Terminology
A "Domain" is a subtree
of the domain name
space
A "Domain Name" is the
name of a node in the
tree

Concatenated labels
from the root to the
current domain

Listed from right to left

Separated by dots
Max 255 characters
A "Label" is a
component of the
domain name
Max 63 characters
[Figure: Tree with the root ".", the TLDs COM and GOV, FBI below GOV, SECRET below FBI, and X-FILES and MIB below SECRET. Marked are the domain GOV, the domain FBI.GOV and the domain name (node) SECRET.FBI.GOV.]
A "Domain" is everything under a particular point in the tree and relates to the
naming structure itself, not the way things are distributed.
A "Domain Name" is the name of a node in this tree, that is, the index of the database.
It consists of all concatenated labels from the root to this node and must not
exceed 255 characters.
Thus a domain name is made up of several "Labels", which need only be unique
at a particular point in the tree. That is, both "name.y.z" and "name.x.y.z" are
allowed. Labels must not exceed 63 characters.
Note that DNS is not case sensitive, although DNS originates from UNIX
systems. That is, "www.nic.org" is the same as "WWW.NIC.ORG".
Due to SMTP restrictions, domain names may contain only characters of the
following sets: {a-z}, {A-Z}, {0-9}, and the dash character "-". Additional
language-specific characters might be supported in future implementations.
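The naming rules listed above (at most 255 characters per domain name, at most 63 characters per label, only letters, digits and the dash, no case sensitivity) translate directly into a small check. This is a simplified sketch with made-up function names; it only implements the rules stated here and ignores further restrictions of real DNS software.

```python
import string

ALLOWED = set(string.ascii_letters + string.digits + "-")


def is_valid_domain_name(name: str) -> bool:
    if name == ".":
        return True                  # the root domain itself
    if not name or len(name) > 255:
        return False
    if name.endswith("."):
        name = name[:-1]             # drop the root dot of an FQDN
    # Every label needs 1..63 characters from the allowed sets.
    return all(
        1 <= len(label) <= 63 and set(label) <= ALLOWED
        for label in name.split(".")
    )


def same_name(a: str, b: str) -> bool:
    # DNS names are compared without regard to case.
    return a.lower() == b.lower()
```

For example, is_valid_domain_name("www.nic.org.") is true, a name with a 64-character label or an underscore is rejected, and same_name("www.nic.org", "WWW.NIC.ORG") is true.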

The Root Domain
The root of the DNS tree is represented as
a dot "."

A true FQDN includes the dot

Otherwise "relative" domain name

Most people/applications don't care

However, DNS does care!

The root is implemented by several root
servers (currently 13)
Below the root, a domain may be called
top-level, second-level, third-level etc...
The root domain "." is always the rightmost "label" of an FQDN, although most
applications such as web browsers do not care about it. However, any DNS
configuration is absolutely sensitive to the proper use of this dot. Any domain
name without the root dot is regarded as a relative domain name.
The root is realized by 13 root servers (as of 2002) which are dispersed
world-wide for performance and redundancy reasons.
However, 9 root name servers are indeed located in the USA.
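What "relative" means in practice can be shown with a one-line rule: a name without the trailing dot is completed with the current origin, while a name with the dot is taken as it is. The function name qualify and the example origin are assumptions of this sketch.

```python
def qualify(name: str, origin: str) -> str:
    """Turn a possibly relative name into an FQDN using the origin."""
    if name.endswith("."):
        return name                  # already fully qualified
    return f"{name}.{origin}"        # relative: append the origin


short = qualify("www", "debian.org.")
absolute = qualify("www.debian.org.", "debian.org.")
mistake = qualify("www.debian.org", "debian.org.")
```

The third call shows the classic configuration mistake: without the trailing dot, www.debian.org is treated as relative and becomes www.debian.org.debian.org.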

Top Level Domains
Seven "generic domains" (gTLDs)

COM, EDU, GOV, INT, ORG, MIL, NET

Initially inside USA, now globally used

244 two-letter country codes

E.g. AT, DE, UK, ES, RU, CH, IT, AQ, ...

Initially outside USA only, now also "US"

Country code does not necessarily reflect real location!
Seven new TLDs

BIZ, INFO, NAME, MUSEUM, COOP, AERO, PRO
The Arpanet defined seven generic top level domains, gTLDs for short, which were
originally only assigned inside the USA.
com Commercial
edu Educational
org Non Profit Organizations (NPOs)
net Networking providers
mil US military (e. g. navy.mil, army.mil)
gov US government organizations (e. g. nasa.gov, nsf.gov)
int International organizations
In 1996, the restrictions for gTLDs were relaxed (except for mil and gov); for
example, even commercial organizations can use the net and org TLDs.
Additionally, the two-letter country codes defined in ISO 3166
are used. Currently, there are 244 country-specific registries.
New TLDs have also been introduced recently: AERO for the air transport industry,
BIZ for businesses, COOP for "Cooperatives", INFO for unrestricted use,
MUSEUM for (*surprise*) museums, NAME for individuals, and PRO for
"Professionals" such as accountants, lawyers, physicians, and so on.
The us domain has fifty subdomains that correspond to the fifty U.S. states. Each
is named according to the standard two-letter abbreviation for the state which has
been defined by the U.S. Postal Service.
Country and state domains typically reflect geographical locations, but not
necessarily!

Delegation and Zones
To ease administration, the authority over subdomains is delegated to other nameservers
A zone is a point of delegation or "Start of Authority" (SOA)
Zones relate to the way the database is partitioned and distributed
[Figure: domain tree with the root ".", ORG, BAR, and the leaves CROSS and FOO, partitioned into zone ".", zone ORG, zone CROSS.BAR.ORG, and zone FOO.BAR.ORG; the zones are linked by delegations]
Obviously, root and TLD name servers cannot hold all information about a
domain, and many organizations are so big that it is not reasonable to
maintain a whole domain database on a single server. Because of this,
administration is simplified by delegating the authority over a subdomain (also
called a zone) to another name server.
That is: name servers generally deal with zones, not domains!
The so-called "Start of Authority" (SOA) record of a name server specifies the
realm of the particular zone. Or in simpler words: each name server stores
information about a zone, and each zone is therefore a "Start of Authority".
A zone can span a whole domain or just be part of it. In the latter case a zone is
like a pruned domain: it contains all names from this point downwards in the
domain tree except those which are delegated to other zones (i.e. to other name
servers).
The org name servers also control different zones. Imagine if a root name server
loaded the root domain instead of the root zone: it would be loading the entire
name space! Instead, if a name server is asked for data in some delegated
subdomain, it can reply with a list of the right name servers to talk to.
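Which zone a name belongs to is decided by the most specific point of delegation above it. The following Python sketch illustrates this with a longest-suffix match; the function name is hypothetical and the zone list is simply taken from the slide's example tree.

```python
def enclosing_zone(name, zones):
    """Return the most specific zone in `zones` that contains `name`."""
    name = name.lower().rstrip(".")
    best = None
    for zone in zones:
        z = zone.lower().rstrip(".")  # the root zone "." becomes ""
        inside = z == "" or name == z or name.endswith("." + z)
        if inside and (best is None or len(z) > len(best)):
            best = z
    return "." if best == "" else best

# Zones from the slide's example tree.
ZONES = [".", "org", "foo.bar.org", "cross.bar.org"]
```

A name such as "www.foo.bar.org." falls into the zone foo.bar.org, whereas "bar.org." is not delegated any further and therefore remains in the zone org.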

Hostname Resolution
Recursive queries = the job is forwarded
    The response must be exact (or error message)
    Most burden on next name server
Iterative queries = all NS are queried top-down
    The response contains best answer already known
    Requested name server makes no further queries
[Figure: recursive vs. iterative resolution of "www.mit.edu.": the root and gTLD (e.g. EDU) servers reply with a list of mit name servers, and the MIT server replies with the address 18.181.0.31]
There are two kinds of queries for hostname resolution: recursive and iterative.
The recursive query puts more load on the queried server, because this server
is asked to do the whole job of name resolution on its own. Of course it may
also forward the query, and then the next server must perform the whole work.
A name server configured to forward all unresolved queries to a designated name
server is called a "forwarder".
Answering an iterative query is much easier. The queried server only has to
reply with the best information it currently knows. If the queried name server
isn't authoritative for the data requested, the initiator has to query other
name servers to find the answer.
Most name servers will recurse, since this permits them to cache the various
resource records used to access the foreign domain, in anticipation of further
similar requests.
The BIND 8 name server can be configured to refuse recursive queries.
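The iterative procedure can be illustrated with a toy model in Python. This is only a sketch: the server names, the table layout, and both functions are invented for illustration, and no real DNS traffic is involved. Each "server" replies with the best information it has, either a referral or the final answer, and the initiator follows the referrals top-down.

```python
# Toy data: each "server" maps a name suffix to a referral or a final answer.
# All server names and the table layout are invented for this sketch.
SERVERS = {
    "root": {"edu.": ("referral", "edu-ns")},
    "edu-ns": {"mit.edu.": ("referral", "mit-ns")},
    "mit-ns": {"www.mit.edu.": ("answer", "18.181.0.31")},
}

def ask(server, name):
    """One iterative query: return the best information this server has."""
    for suffix, reply in SERVERS[server].items():
        if name == suffix or name.endswith("." + suffix):
            return reply
    return ("error", None)

def resolve_iteratively(name, start="root"):
    """Follow referrals top-down until an answer or an error is returned."""
    server, path = start, []
    while True:
        path.append(server)
        kind, value = ask(server, name)
        if kind == "referral":
            server = value  # ask the next, more specific name server
        else:
            return value, path
```

A recursive query would hide this loop from the client: the client asks once, and the queried server runs the loop on the client's behalf.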

A Detailed Real-World Example
[Figure: resolving gd.tuwien.ac.at top-down. The query "gd.tuwien.ac.at" is sent to a.root-servers.net (zone "."), which replies with the list of at name servers (ns1.univie.ac.at, ns2.univie.ac.at, ns.uu.net). ns2.univie.ac.at (zone "ac.at") replies with the list of tuwien.ac.at name servers (tunamed.tuwien.ac.at, tunamec.tuwien.ac.at). tunamed.tuwien.ac.at (zone "tuwien.ac.at") replies with the address 192.35.244.50, and the client can start its FTP session ("Let me FTP something").]
The diagram above shows a real-world example of name resolution, starting at the
root name servers. Of course not every request needs to start at the root, since
most ISP name servers cache a lot of information or at least know the addresses
of authoritative name servers.
But in our example we start at the top. The whole process can be verified by
using standard DNS tools such as dig.

Note
Each questioned name server replies with more detailed information ... or the desired information itself!
A reference to another NS gives precious information about new zone authority - cached!
After being referred to another NS, the recursing NS knows the IP address of a
name server that is authoritative for a new zone. This is precious information
and is therefore cached.

Caching
First, the local NS resolves the name kenny.southpark.edu
Hereby it also learns the addresses of the southpark.edu NS
All this information is cached!
When resolving the name seamen.superbestfriends.southpark.edu the local NS notices that this name is a member of southpark.edu
Address of southpark.edu NS is cached
No need to start at root NS!
[Figure: two panels. First, the local NS resolves via the root NS and the southpark.edu NS. Later, the local NS contacts the southpark.edu NS and the superbestfriends.southpark.edu NS directly, without the root NS.]
Caching greatly reduces the resolution time and takes load off the root name
servers.
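The decision where to start a lookup can be sketched in a few lines of Python. The cache here is a plain dictionary from zone names to name server names; the function name and the server name ns.southpark.edu are hypothetical.

```python
def closest_cached_zone(name, ns_cache):
    """Find the longest suffix of `name` for which name servers are cached."""
    labels = name.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in ns_cache:
            return candidate, ns_cache[candidate]
    # Nothing more specific is known: start at the root.
    return ".", ns_cache.get(".")

# Assumed cache contents after kenny.southpark.edu has been resolved.
NS_CACHE = {
    ".": ["a.root-servers.net"],
    "southpark.edu": ["ns.southpark.edu"],  # hypothetical server name
}
```

For seamen.superbestfriends.southpark.edu the search stops at southpark.edu, so the root name servers are not bothered at all.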

Reverse Lookups
Very often reverse lookups are necessary
    "Have address but want name"
    For logging purposes or service restriction
Therefore the in-addr.arpa domain was created
    Given an IP address the associated hostname can be found
    Otherwise an exhaustive search in the domain space would be necessary to find any desired hostname
Reverse lookups are commonly used by WWW servers that log their users in a file,
or by IRC servers that want to restrict their service to a certain domain, for
example a closed discussion group exclusively for IEEE.ORG members.
In order to support reverse lookups, the old arpa domain is reused today; it
contains the in-addr subdomain. All IP addresses are attached as labels to
in-addr.arpa.
The ARPA TLD was originally only used during the transition from HOSTS.TXT to
DNS. All hosts were originally members of the arpa domain and later moved to
the specific TLDs. Today ARPA is reused for inverse lookups.
Reverse delegation is becoming increasingly important as organizations attempt
to verify the origin of requests to their servers by looking up the domain name
associated with the IP address making the request. Among the services this
applies to are FTP and mail.
Customers may not be able to access such services if reverse lookup for their
host IP addresses is not set up.

In-Addr.Arpa
Each byte of an IP address is treated as a label and attached under the in-addr.arpa domain
Expressed as character string for its decimal value ("0" - "255")
Labels are concatenated in reverse order
"10.206.25.192.in-addr.arpa"
[Figure: DNS tree with the branch ORG, DEBIAN, WWW and the branch ARPA, IN-ADDR, 192, 25, 206, 10. To answer "What's the domain name of 192.25.206.10?", the leaf 10 carries a pointer (PTR) to WWW.DEBIAN.ORG.]
The "in-addr.arpa." domain is the reverse tree for IPv4 addresses. The name
derives from "inverse (IP) address", and ARPA was one of the organizations
behind the creation of the Internet.
The whole IP address space is represented as a four-level tree which is attached to
the in-addr.arpa domain name. Each byte of the IP address is interpreted as an
ordinary label, allowing normal lookups. At the leaves of this tree a pointer
(PTR) record is found, which points to the official domain name of the host.
For simplicity the reverse zones should be organized on byte boundaries;
however, today tricks are used to delegate even names for subnets that are not
aligned on byte boundaries.
This is called the "classless in-addr" trick and is not discussed here. Hint: just
introduce an artificial fifth level in the tree.
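Building the reverse name is a purely mechanical step, as the following Python sketch shows. The helper reverse_name is hypothetical; the standard library attribute reverse_pointer of the ipaddress module produces the same mapping, only without the trailing root dot.

```python
import ipaddress

def reverse_name(ip):
    """Map a dotted IPv4 address to its FQDN below in-addr.arpa."""
    octets = ip.split(".")
    return ".".join(reversed(octets)) + ".in-addr.arpa."

# The same mapping from the standard library (no trailing root dot).
stdlib_name = ipaddress.ip_address("192.25.206.10").reverse_pointer
```

For the address on the slide, reverse_name("192.25.206.10") gives "10.206.25.192.in-addr.arpa.", which is the name a PTR query has to ask for.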

BIND
Berkeley Internet Name Domain (BIND)
    Implemented by Paul Vixie as an Internet name server for BSD-derived systems
    Most widely used name server on the Internet
    Version numbers: 4 (old but still used), 8, 9
BIND consists of
    A name server program "named"
    A resolver library for client applications
BIND deals with zones!
The most important DNS implementation is the Berkeley Internet Name
Domain (BIND), which was created by Paul Vixie. BIND consists of a
server (named; the "d" stands for "daemon") and a client, the resolver library.
The "resolver" is a collection of functions such as gethostbyname(3) and
gethostbyaddr(3) and is used by all Internet applications, such as Telnet, FTP,
web browsers, and others.
By the way: Windows systems use their own DNS implementation based on
BIND.

Resolver and Name Server
[Figure: client side: the user program sends user queries to the resolver and receives user responses; the resolver exchanges queries and responses with a foreign NS and uses a shared database for cache additions and references. Server side: the NS (named) reads its master files, uses a shared database (references, refreshes), answers queries from a foreign resolver, and exchanges maintenance queries and responses with a foreign NS.]
All DNS messages use port 53
Zone transfers use TCP
Simple queries use UDP
The diagram above shows the principal design of a DNS server and resolver
according to the IETF.
All DNS messages use port 53. Zone transfers use TCP for reliability, and simple
queries originated by clients use UDP for speed.
Note that replies longer than 512 bytes (check the implementation!)
might also be sent via TCP.
Early BIND implementations (up to version 4) did not cache query responses.
All modern DNS implementations do cache, unless caching is disabled.
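To make the transport difference more tangible, the following Python sketch builds a simple query message by hand. The function names and the fixed query ID are hypothetical; the message layout (12-byte header, length-prefixed labels, query type and class) and the two-byte length prefix used on TCP follow RFC 1035. Nothing is sent over the network here.

```python
import struct

def build_query(name, qtype=1, qclass=1, query_id=0x1234):
    """Build a DNS query message (default: type A = 1, class IN = 1)."""
    # Header: ID, flags (only "recursion desired" set), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is preceded by its length; a zero byte marks the root.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, qclass)

def frame_for_tcp(message):
    """On TCP every DNS message is preceded by a two-byte length field."""
    return struct.pack("!H", len(message)) + message
```

Over UDP the message returned by build_query would be sent as is to port 53; over TCP the framed variant is used, because a byte stream has no message boundaries of its own.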

Types of Name Servers
Primary Master (or "Master")
    Has data about a zone in a local file
    Therefore is authoritative about a zone
    Each zone has exactly one Primary
Secondary Master (or "Slave")
    Copies zone files from a Master Server (P or S)
    This is called "zone transfer" (TCP)
    Therefore also authoritative
    Each zone must have at least one Secondary
There are exactly two types of name servers which are authoritative for a zone:
the primary master and the secondary master. With BIND 8/9 the terms
master and slave are used instead.
Each zone must have exactly one primary name server. All configuration is done
in the master files or "zone files" of the primary.
A secondary name server for a zone gets the zone data from another name server
that is authoritative for the zone, called its master server. The master server is
either a primary or another secondary name server. When a secondary starts up, it
contacts its master name server and, if necessary, pulls the zone data over. This is
referred to as a zone transfer.
Note that secondary name servers are not second-class name servers. DNS
provides these two types of name servers to make administration easier: just
set up a primary master name server and specify some secondaries.
Once they are set up, the secondaries will transfer new zone data when necessary.
A name server can be a primary master for one zone and a secondary for another,
thereby providing enough redundancy to tolerate failures.
Secondary name servers were originally suggested in RFC 1035.

Resource Records
All database information is stored in resource records (RR)
Different classes: IN, HS, CH
    Only IN (Internet) is important today
RR Format:
[DOMAIN] [TTL] [CLASS] TYPE RDATA
    DOMAIN   Domain name to which the RR applies
    TTL      Time of validity in seconds
    CLASS    Network class (Internet "IN")
    TYPE     What type of information is specified
    RDATA    The data itself, whose meaning is specified by TYPE
Records are divided into classes, each of which defines various information
types. Today only the Internet class (IN) is important, while the others, Chaosnet
(CH) and Hesiod (HS), have only historic significance and were used at
MIT.
Within a class, records also come in several types, which correspond to the
different varieties of data that may be stored in the domain name space.
All DNS operations are formulated in terms of resource records (RRs, RFC
1035); furthermore, each query is answered with a copy of the matching RRs. RRs
are the smallest unit of information available through DNS.

Some Important RR Types
Type    Value   Meaning
A       1       Host address
NS      2       Authoritative name server
CNAME   5       Canonical name for an alias
SOA     6       Marks the start of a zone of authority
WKS     11      Well known service description
PTR     12      Domain name pointer
HINFO   13      Host information
MINFO   14      Mailbox or mail list information
MX      15      Mail exchange
TXT     16      Text strings
The table above shows some important RR types. Most important are the address
record (A), which specifies an IP address for address resolution, and the name
server record (NS), which specifies other name servers that are authoritative for
another zone. NS records are used for delegations and constitute the "glue" of
the hierarchical tree.
CNAME entries are used to assign an alias to a certain host, for example "WWW".
SOA marks the "Start of Authority" and is used as a preamble in each zone file.
Pointer (PTR) records are used for inverse queries, and Mail Exchange (MX)
entries specify the mail transfer agents that are responsible for forwarding
mail.

Root Servers
13 root servers implement the "."
    Maintained by ICANN
    Each of them knows all TLD name servers
    Most are even authoritative for the generic top-level domains
Name servers must maintain a list of root servers
    Stored in "root.hints" file (BIND)
    Queried one after the other until positive reply
    This list is also updated by requesting single root servers
In the absence of other information, resolution has to start at the root name servers. The Internet
has thirteen root name servers (as of this writing) spread across different parts of the network.
Two are on the MILNET, the U.S. military's portion of the Internet; one is on SPAN, NASA's
internet; two are in Europe; and one is in Japan. Clearly, if the root servers went offline, name
resolution would no longer work; therefore redundancy is crucial.
Each root server might be implemented by several physical servers. The utilization of these root
servers varies from some kbit/s (rarely) through some Mbit/s (average) up to 100 Mbit/s and more
(peaks). Root servers are typically connected to several ISPs, some of which provide free transit
service. Furthermore, a backup power supply (e.g. battery and generator) is needed in case
commercial power fails.
Server  Operator                              Cities
A       VeriSign Global Registry Services     Herndon VA, US
B       Information Sciences Institute        Marina del Rey CA, US
C       Cogent Communications                 Herndon VA, US
D       University of Maryland                College Park MD, US
E       NASA Ames Research Center             Mountain View CA, US
F       Internet Software Consortium          Palo Alto CA, US; San Francisco CA, US
G       U.S. DOD Network Information Center   Vienna VA, US
H       U.S. Army Research Lab                Aberdeen MD, US
I       Autonomica                            Stockholm, SE
J       VeriSign Global Registry Services     Herndon VA, US
K       Reseaux IP Europeens                  London, UK
L       IANA                                  Los Angeles CA, US
M       WIDE Project                          Tokyo, JP

Root Hints Example
.                     604800  IN  NS  G.ROOT-SERVERS.NET.
.                     604800  IN  NS  K.ROOT-SERVERS.NET.
.                     604800  IN  NS  H.ROOT-SERVERS.NET.
.                     604800  IN  NS  A.ROOT-SERVERS.NET.
.                     604800  IN  NS  B.ROOT-SERVERS.NET.
G.ROOT-SERVERS.NET.   604800  IN  A   192.112.36.4
K.ROOT-SERVERS.NET.   604800  IN  A   193.0.14.129
H.ROOT-SERVERS.NET.   604800  IN  A   128.63.2.53
A.ROOT-SERVERS.NET.   604800  IN  A   198.41.0.4
B.ROOT-SERVERS.NET.   604800  IN  A   128.9.0.107
[Callouts on the slide: root, TTL [s], Internet, name servers, address]
The slide above shows an actual example (fragment) of a root.hints file. Note
the five-column layout of each entry, which is typical for all DNS
entries.
From left to right each line specifies:
1. The domain to which this information applies
2. The TTL in seconds, i.e. how long this entry is valid if it is cached
3. The network class (almost always IN for Internet)
4. The type of entry, here NS for name server and A for address
5. The data itself, whose meaning is specified by column four (type)
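The five columns can be split mechanically. The following Python sketch assumes that all five columns are present on a line, as in the root hints fragment (real zone files may omit the domain, TTL, or class, which is not handled here); parse_rr is a hypothetical helper.

```python
def parse_rr(line):
    """Split one five-column record line into its fields."""
    domain, ttl, rr_class, rr_type, rdata = line.strip().split(None, 4)
    return {
        "domain": domain,
        "ttl": int(ttl),  # seconds
        "class": rr_class,
        "type": rr_type,
        "rdata": rdata,   # meaning depends on the type
    }

ns_record = parse_rr(".  604800  IN  NS  G.ROOT-SERVERS.NET.")
a_record = parse_rr("G.ROOT-SERVERS.NET.  604800  IN  A  192.112.36.4")
```

Taken together, the two parsed records say that G.ROOT-SERVERS.NET. is a name server for the root and can be reached at 192.112.36.4.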

Behind the Scenes
Frequently private root servers are used within organizations
    Isolated from official DNS
Recently several unofficial "roots" were available in the Internet
    Overlap official DNS and introduce new unofficial TLDs
Now ICANN is responsible for managing and coordinating the DNS to ensure universal resolvability
ICANN: global, NPO, public interest
Cares for distribution of unique IP addresses and domain names
ICANN's current efforts aim to ensure a single root for the Internet.
Recently, unofficial root name servers "polluted" the Internet with TLDs that are
not officially registered and caused wrong hostname resolutions.
Some companies persuade their users to point their resolvers to an
alternate root instead of the authoritative root.
Others (New.net, for example) create special browser plug-ins and other software
workarounds to achieve the same effect.

Caching
Caching is critical for DNS performance
    Offload root NS (only 13 root servers!)
    Offload other authoritative NS
Cached information
    Is non-authoritative
    Is valid as specified in TTL
A name server processing a recursive query discovers a lot of information about
the domain name space while doing so. Each time it is referred to another list of
name servers, it learns that those name servers are authoritative for some zone,
and it learns the addresses of those servers.
In order to accelerate future client requests and to reduce DNS traffic, all this
information is cached.
With version 4.9 and all version 8 BINDs, name servers even implement
negative caching; that is, if an authoritative name server responds to a query with
an answer that says the domain name or data type in the query doesn't exist, the
local name server will temporarily cache that information, too.
Every piece of DNS information has a Time To Live (TTL) assigned, which
specifies the number of seconds this information may be cached before it must be
discarded. Since BIND 8.2, the TTL value in the SOA record is only used for
negative caching.
Deciding on a TTL is essentially deciding on a trade-off between performance
and consistency. A small TTL ensures that data is consistent across the network,
because remote name servers will time it out more quickly and be forced to query
authoritative name servers more often for new data. On the other hand, this will
increase the DNS traffic and processing load and lengthen the resolution time on
average.
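How a cache honors the TTL can be sketched with a small Python class. This is only an illustration, not the way BIND is implemented; the class name and the injectable clock (which makes the expiry easy to try out without waiting) are assumptions.

```python
import time

class TtlCache:
    """Toy cache that discards entries once their TTL has elapsed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (value, expiry time)

    def put(self, key, value, ttl):
        """Store a value that may be used for `ttl` seconds."""
        self._entries[key] = (value, self._clock() + ttl)

    def get(self, key):
        """Return the cached value, or None if it is unknown or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires = entry
        if self._clock() >= expires:
            del self._entries[key]  # timed out: must be discarded
            return None
        return value
```

A small TTL makes get() return None sooner, which forces a new query to the authoritative name servers; that is exactly the trade-off between performance and consistency described above.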

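Example Config (1)
[Figure: domain tree with org at the top, foo below it, and ns, pub, and docs below foo; stan and kyle are located below pub, cartman and kenny below docs. The tree is split into ZONE foo.org and ZONE pub.foo.org. Name servers: ns.foo.org and stan.pub.foo.org.]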
The picture above shows an example domain "foo" having two name servers, "ns"
and "stan", each responsible for a different zone: ns is authoritative for the
foo.org zone and stan is authoritative for the pub.foo.org zone.
The following slides show example BIND configurations.

Example Config (2)
; zone file for the foo.org. zone
; records describing the zone foo.org. (= @)
@               IN  SOA  ns.foo.org. admin.kenny.docs.foo.org. (
                         199912245  ; serial number
                         360000     ; refresh time
                         3600       ; retry time
                         3600000    ; expire time
                         3600 )     ; default TTL
                IN  NS   ns.foo.org.
                IN  NS   ns.xyz.com.        ; secondary nameserver for @
                IN  MX   10 mail.foo.org.   ; mailserver for @ (preference 10 is assumed)
; delegation for the zone pub.foo.org.
pub             IN  NS   stan.pub.foo.org.
; glue records
ns              IN  A    216.32.78.1
stan.pub        IN  A    216.32.78.99
; hosts in the zone foo.org
mail            IN  A    216.32.78.10
linus           IN  A    216.32.78.20
kenny.docs      IN  A    216.32.78.100
cartman.docs    IN  A    216.32.78.150
The slide above shows an example configuration for the ns.foo.org name server.
Note the "glue records" that assign IP addresses to the name servers named in the
NS records, so that the delegations can actually be followed.

Timers in the SOA RR
Refresh time
    Tells slave at which time intervals it should check for zone changes
    Some hours (3-12 typically)
Retry time
    If master could not be reached
    Typically shorter than refresh time
Expire time
    Time after which unrefreshed zone data is definitely outdated (removed)
    Typically one week (also months)
TTL
    BIND pre 8.2: Specifies how long any cached entry is valid
    BIND 8.2 and later: Only valid for negative caching!
    Performance versus consistency!
Before BIND 8.2 all these values were configured in seconds. BIND 8.2 and later
releases also accept time values in minutes, hours, days, and weeks (m, h, d, w).
When the TTL expires, the name server must remove the respective entries from its
cache. This is also true for negative data ("negative caching").
Before BIND 8.2, the TTL value in the SOA record was the default TTL for all
entries of the zone. With BIND 8.2 and later it is actually a "Negative
Caching TTL" and only applies to negative cache entries.
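The unit suffixes simply scale a number of seconds. The following Python sketch converts such time values; it assumes that units may be concatenated (such as 1h30m) and that a bare number means seconds. The helper to_seconds is hypothetical and not part of BIND.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

def to_seconds(value):
    """Convert a time value such as '3600', '12h' or '1h30m' to seconds."""
    text = value.lower()
    parts = re.findall(r"(\d+)([smhdw]?)", text)
    # Reject input that contains anything besides number/unit pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError("bad time value: " + value)
    return sum(int(n) * UNIT_SECONDS[u or "s"] for n, u in parts)
```

With this helper the expire time of one week can be written as "1w" instead of 604800.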

Example Config (3)
; zone file for the 78.32.216.in-addr.arpa domain
@       IN  SOA  ns.foo.org. admin.kenny.docs.foo.org. (
                 1034       ; serial number
                 3600       ; refresh time
                 600        ; retry time
                 3600000    ; expire time
                 86400 )    ; TTL
        IN  NS   ns.foo.org.
1       IN  PTR  ns.foo.org.
10      IN  PTR  mail.foo.org.
20      IN  PTR  linus.foo.org.
99      IN  PTR  stan.pub.foo.org.
100     IN  PTR  kenny.docs.foo.org.
150     IN  PTR  cartman.docs.foo.org.
The slide above shows an example configuration for the ns.foo.org name server,
used for inverse resolution.
Note the PTR entries.
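The PTR entries mirror the A records of the forward zone, so they can be derived from them. The following Python sketch does this for a reverse zone aligned on a three-byte boundary, such as 78.32.216.in-addr.arpa; the function name and the input format (pairs of FQDN and address) are assumptions.

```python
def ptr_records(a_records, origin):
    """Build PTR lines for a three-byte reverse zone from (fqdn, ip) pairs."""
    lines = []
    for fqdn, ip in a_records:
        octets = ip.split(".")
        reverse_zone = ".".join(reversed(octets[:3])) + ".in-addr.arpa."
        if reverse_zone == origin:  # skip addresses outside this zone
            lines.append(octets[3] + " IN PTR " + fqdn)
    return lines

hosts = [
    ("ns.foo.org.", "216.32.78.1"),
    ("mail.foo.org.", "216.32.78.10"),
    ("kyle.pub.foo.org.", "216.32.22.50"),  # belongs to another reverse zone
]
reverse_lines = ptr_records(hosts, "78.32.216.in-addr.arpa.")
```

Only the last byte of each address appears as the record name, because the first three bytes are already given by the zone's origin.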

Example Config (4)
; zone file for pub.foo.org
@           IN  SOA    stan.pub.foo.org. hostmaster.stan.pub.foo.org. (
                       1034       ; serial number
                       3600       ; refresh time
                       600        ; retry time
                       3600000    ; expire time
                       86400 )    ; TTL
; Name Servers
            IN  NS     stan
            IN  NS     ns.foo.org.    ; secondary NS
; glue records
stan        IN  A      216.32.78.99
nameserver  IN  CNAME  stan
; other hosts:
kyle        IN  A      216.32.22.50
            IN  MX     1 mail.foo.com.
            IN  MX     2 picasso.art.net.
            IN  MX     5 mail.ct.oberon.tuwien.ac.at.
butters     IN  A      216.32.22.51
garison     IN  A      216.32.22.52
            IN  HINFO  VAX-11/780 UNIX
            IN  WKS    216.32.22.52 TCP ( telnet ftp netstat finger pop )
wendy       IN  A      216.32.34.2
            IN  HINFO  SUN UNIX
; etc.....
The slide above shows other example entries found in the stan.pub.foo.org name
server.
Note the additional information, such as MX records, host information (HINFO),
and well-known services (WKS).
Consider the security relevance of HINFO and WKS: both publish details about a
host's platform and services to anyone who can query the zone.

Delegations
Delegations are made when a zone has a parent domain
A parent name server acting as delegation point keeps a Name Server record (NS) that specifies responsible name servers for that subzone
A records that correspond with associated NS records are called glue records
Glue records are only necessary if the specified nameserver (NS record) is inside the subzone it serves!
    AND the parent is no secondary server for that zone
Every zone needs at least two name servers. One is called the primary or master,
the other is the secondary or slave. Today only the terms master and slave
should be used.
Delegations are implemented using NS records, which specify the authoritative
(master or slave) name servers for a specific zone of this domain.
Additionally, A records are necessary to specify the associated IP addresses. These
A records are the so-called "glue records".
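Whether a delegation needs glue depends only on where the name server's own name lives. The following Python sketch checks the first condition from the slide (the name server is inside the subzone it serves); needs_glue is a hypothetical helper, and the second condition about the parent being a secondary is not modeled.

```python
def needs_glue(ns_name, delegated_zone):
    """True if the name server's own name lies inside the delegated zone."""
    ns = ns_name.lower().rstrip(".")
    zone = delegated_zone.lower().rstrip(".")
    return ns == zone or ns.endswith("." + zone)
```

In the example configuration, stan.pub.foo.org. serves pub.foo.org. and lies inside it, so the parent zone foo.org must carry an A record for it; without that record a resolver would have to ask stan for stan's own address.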

Registration Terms
Registry
    Responsible for TLD zone maintenance
    One unique registry per TLD
Registrar
    Intermediate agent between customer and registry (ISP)
Registration
    Customer tells registrar which NS should be used for delegation to reach a subdomain
    Plus contact information
Network Solutions Inc. is responsible for (and hereby the only registry of) the
TLDs com, net, org, and edu. Network Solutions Inc. also acts as a registrar for
these TLDs. Since June 1999, ICANN has allowed other registrars for the TLDs com,
net, and org (see http://www.internic.net/regist.html for a list).

Domain Registrations
Many providers act as "registrars"
ICANN controls continental registrars
    USA: InterNIC (www.internic.net)
    Europe: RIPE (www.ripe.net)
    Asia: APNIC (www.apnic.net)
Domain name registration is independent of IP address assignment, and usually
any provider can act as a registrar, who applies for a registration at the regional
network information center (RIPE, APNIC, InterNIC) on behalf of the customer.
The overall control over the DNS has recently been handed to ICANN, the
Internet Corporation for Assigned Names and Numbers. Check out
http://www.icann.org.

Diagnostic Tools
DIG - Domain Information Groper
    Send domain name query packets to name servers
    Results are printed in a human-readable format
NSLOOKUP
    Query Internet name servers interactively
There are two standard DNS tools available on most workstations. The Domain
Information Groper (DIG) is the most important one and is typically found on any
UNIX and Linux distribution.
The syntax is:
dig [@server] domain [<query-type>] [<query-class>] [+<query-option>] [-<dig-option>] [%comment]
The other important tool is NSLOOKUP, which is also found on Windows operating
systems. NSLOOKUP is operated interactively; just enter "help" for a list of
commands and options.

Selected RFCs (1)
RFC 1034
    Domain Name Concept And Facilities
RFC 1035
    Domain Name Implementation and Specification
RFC 1101
    DNS Encoding Network Names And Other Types
RFC 1183
    New DNS RR Definitions
RFC 881 - The Domain Names Plan And Schedule
RFC 882 - Domain Names Concepts And Facilities
RFC 883 - Domain Name Implementation and Specification
RFC 897 - Domain Name System Implementation Schedule
RFC 921 - Domain Name System Implementation Schedule (Rev)
RFC 973 - Domain System Changes And Observations
RFC 974 - Mail Routing And the Domain System
RFC 1032 - Domain Administrators Guide
RFC 1033 - Domain Administrators Operations Guide
RFC 1034 - Domain Name Concept And Facilities
RFC 1035 - Domain Name Implementation and Specification
RFC 1101 - DNS Encoding Network Names And Other Types
RFC 1183 - New DNS RR Definitions
RFC 1348 - DNS NSAP RRs
RFC 1383 - An Experiment In DNS Based IP Routing
RFC 1386 - The US Domain
RFC 1394 - Relationship Of Telex Answerback Codes To Internet Domains

Selected RFCs (2)
RFC 1591
    Domain Name System Structure And Delegation
RFC 1664
    Using The Internet DNS To Distribute RFC1327 Mail Address Mapping Tables
RFC 1712
    DNS Encoding Of Geographical Location
RFC 1788
    ICMP Domain Name Messages
RFC 1794
    DNS Support For Load Balancing
RFC 1401 - Correspondence Between The IAB And DISA On The Use Of DNS Throughout The Internet
RFC 1464 - Using The Domain Name System To Store Arbitrary String Attributes
RFC 1480 - The US Domain
RFC 1535 - A Security Problem And Proposed Correction With Widely Deployed DNS Software
RFC 1536 - Common DNS Implementation Errors And Suggested Fixes
RFC 1537 - Common DNS Data File Configuration Errors
RFC 1591 - Domain Name System Structure And Delegation
RFC 1664 - Using The Internet DNS To Distribute RFC1327 Mail Address Mapping Tables
RFC 1637 - DNS NSAP Resource Records
RFC 1612 - DNS Resolver MIB Extensions
RFC 1611 - DNS Server MIB Extensions
RFC 1706 - DNS NSAP Resource Records
RFC 1712 - DNS Encoding Of Geographical Location
RFC 1788 - ICMP Domain Name Messages
RFC 1794 - DNS Support For Load Balancing

Selected RFCs (3)
RFC 1876
    A Means For Expressing Location Information In The Domain Name System
RFC 1886
    DNS Extensions To Support IP Version 6
RFC 1918
    Address Allocation for Private Internets
RFC 1982
    Serial Number Arithmetic
RFC 1995
    Incremental Zone Transfers In DNS
RFC 1996
    A Mechanism For Prompt Notification Of Zone Changes (DNS Notify)
RFC 2052
    A DNS RR For Specifying The Location Of Services (DNS SRV)
RFC 2065
    Domain Name System Security Extensions
RFC 2136
    Dynamic Updates In The Domain Name System (DNS Update)
RFC 1876 - A Means For Expressing Location Information In The Domain Name System
RFC 1886 - DNS Extensions To Support IP Version 6
RFC 1912 - Common DNS Operational and Configuration Errors
RFC 1918 - Address Allocation for Private Internets
RFC 1982 - Serial Number Arithmetic
RFC 1995 - Incremental Zone Transfers In DNS
RFC 1996 - A Mechanism For Prompt Notification Of Zone Changes (DNS Notify)
RFC 2052 - A DNS RR For Specifying The Location Of Services (DNS SRV)
RFC 2065 - Domain Name System Security Extensions
RFC 2136 - Dynamic Updates In The Domain Name System (DNS Update)
RFC 2137 - Secure Domain Name System Dynamic Update
RFC 2163 - Using the Internet DNS To Distribute MIXER Conformant Global Address Mapping (MCGAM)
RFC 2168 - Resolution of Uniform Resource Identifiers Using The Domain Name System
RFC 2181 - Clarifications To The DNS Specification

Selected RFCs (4)
RFC 2308
    Negative Caching Of DNS Queries (DNS Ncache)
RFC 2535
    Domain Name System Security Extensions
RFC 2541
    DNS Security Operational Considerations
RFC 2606
    Reserved Top Level DNS Names
RFC 2182 - Selection And Operation Of Secondary DNS Servers
RFC 2219 - Use Of DNS Aliases For Network Services
RFC 2230 - Key Exchange Delegation Record For The DNS
RFC 2240 - A Legal Basis For Domain Name Allocation
RFC 2247 - Using Domains In LDAP/X.500 Distinguished Names
RFC 2308 - Negative Caching Of DNS Queries (DNS Ncache)
RFC 2352 - A Convention For Using Legal Names As Domain Names
RFC 2517 - Building Directories From DNS: Experiences From WWW Seeker
RFC 2535 - Domain Name System Security Extensions
RFC 2536 - DSA KEYs And SIGs In The Domain Name System
RFC 2537 - RSA/MD5 KEYs And SIGs In The Domain Name System
RFC 2538 - Storing Certificates In The Domain Name System
RFC 2539 - Storage Of Diffie-Hellman Keys In The Domain Name System
RFC 2540 - Detached Domain Name System Information
RFC 2541 - DNS Security Operational Considerations
RFC 2606 - Reserved Top Level DNS Names

Selected RFCs (5)
RFC 2672
    Non-Terminal DNS Name Redirection
RFC 2673
    Binary Labels In The Domain Name System
RFC 2845
    Secret Key Transaction Authentication For DNS (TSIG)
RFC 2870
    Root Name Server Operational Requirements
RFC 2874
    DNS Extensions To Support IPv6 Address Aggregation And Renumbering
RFC 3007
    Secure Domain Name System Dynamic Update
RFC 2671 - Extension Mechanisms For DNS (EDNS0)
RFC 2672 - Non-Terminal DNS Name Redirection
RFC 2673 - Binary Labels In The Domain Name System
RFC 2694 - DNS Extensions To Network Address Translators (DNS_ALG)
RFC 2782 - A DNS RR For Specifying The Location Of Services (DNS SRV)
RFC 2826 - IAB Technical Comment On The Unique DNS Root
RFC 2845 - Secret Key Transaction Authentication For DNS (TSIG)
RFC 2870 - Root Name Server Operational Requirements
RFC 2874 - DNS Extensions To Support IPv6 Address Aggregation And Renumbering
RFC 2915 - The Naming Authority Pointer (NAPTR) DNS Resource Record
RFC 2916 - E.164 number and DNS
RFC 2929 - Domain Name System IANA Considerations
RFC 2931 - DNS Request And Transaction Signatures (SIG(0)s)
RFC 3007 - Secure Domain Name System Dynamic Update
RFC 3008 - Domain Name System Security (DNSSEC) Signing Authority
RFC 3071 - Reflections On The DNS, RFC 1591, And Categories Of Domains

44 {C} Herbert Haas 2005/03/11
Selected RFCs (6)
RFC 3090 - DNS Security Extension Clarification On Zone Status
RFC 3152 - Delegation Of IP6.ARPA
RFC 3172 - Management Guidelines & Operational Requirements For the Address And Routing Parameter Area Domain (ARPA)
RFC 3363 - Representing Internet Protocol Version 6 Addresses In The Domain Name System
RFC 3364 - Tradeoffs In Domain Name System Support For Internet Protocol Version 6
RFC - 3088 - OpenLDAP Root Service An experimental LDAP referral service
RFC - 3090 - DNS Security Extension Clarification On Zone Status
RFC - 3110 - RSA/SHA-1 SIGs And RSA KEYs In The Domain Name System
RFC - 3123 - A DNS RR Type For Lists Of Address Prefixes (APL RR)
RFC - 3130 - Notes From The State Of The Technology DNSSEC
RFC - 3152 - Delegation Of IP6.ARPA
RFC - 3172 - Management Guidelines & Operational Requirements For the Address And Routing Parameter Area Domain (ARPA)
RFC - 3197 - Applicability Statement For DNS MIB Extensions
RFC - 3225 - Indicating Resolver Support Of DNSSEC
RFC - 3226 - DNSSEC And IPv6 A6 Aware Server/Resolver Message Size Requirements
RFC - 3258 - Distributing Authoritative Name Servers Via Shared Unicast Addresses
RFC - 3363 - Representing Internet Protocol Version 6 Addresses In The Domain Name System
RFC - 3364 - Tradeoffs In Domain Name System Support For Internet Protocol Version 6
RFC - 3397 - Dynamic Host Configuration Protocol (DHCP) Domain Search Option
RFC - 3403 - Dynamic Delegation Discovery System Part Three The Domain Name System Database
RFC - 3425 - Obsoleting IQUERY

45 {C} Herbert Haas 2005/03/11
Summary
DNS initially only created for humans
Hierarchical tree of names
Addresses and other database information
Inverse resolution using in-addr.arpa
TLD
Primary vs Secondary nameservers
Port 53, TCP and UDP

46 {C} Herbert Haas 2005/03/11
Any Questions?

1
2010/02/15 {C} Herbert Haas
WLAN
802.11a-z

2 {C} Herbert Haas 2010/02/15
Wireless Products
WLAN is integrated
  E.g. Intel Centrino chipsets
Increasing data rates
  Towards Fast Ethernet speeds and more
Today strong native security solutions available
  IPsec/TLS grade
VoIP support
  QoS solutions available
Ongoing penetration in consumer market
  TV/radio links, wireless HiFi, various gadgets, ...
The first widespread commercial use of the 802.11b standard for networking was
made by Apple Computer under the trademark AirPort. On the non-Apple
market, Linksys could be considered the current leader.


3 {C} Herbert Haas 2010/02/15
A Very First Introduction (1)
Basic standards:
  802.11a, 802.11b, 802.11g, 802.11n
Frequencies used
  ISM 2.4 GHz (mostly used; 3-4 usable channels)
  ISM 5 GHz (more channels; depends on country)
Strange terms
  The client station is often called the "STA"
    This convention is NOT used in these chapters; we prefer the universal term "client"
  Outdoor Device Unit (ODU)
    Can be used outdoors (weatherproof)
Access Points (APs) manage traffic from, to, and between clients
  The radio cell is a shared collision network
  ALL traffic must go through the AP - there is no direct inter-client traffic possible (clients would refuse that)


4 {C} Herbert Haas 2010/02/15
A Very First Introduction (2)
The wireless network name (cell name) is called Service Set Identifier (SSID)
  Basic-SSID (SSID for a single cell)
  Extended-SSID (same SSID spans over multiple cells)
Typical security used:
  WiFi Protected Access (WPA)
Typical QoS used:
  WiFi Multi Media (WMM)
Typical distances possible
  Strongly depends on antennas
  20-50 meters indoor
  Up to 15 km outdoor (much more possible with some effort)
Typical cell throughput possible
  802.11b => 5-6 Mbit/s
  802.11a|g => up to 22|25 Mbit/s
  802.11n => probably 300-400 Mbit/s (will be ratified end 2006)

5
5 {C} Herbert Haas 2010/02/15
Evolution of the 802.11 Standards
1980s: Early developments - 215, 344, 860 kbit/s @ 900 MHz
1997: 802.1y aka 802.11
  1 or 2 Mbit/s, FHSS or DSSS
  902-928 MHz, problems with EU & Asia
1999: 802.11b
  1, 2, 5.5, 11 Mbit/s, only DSSS
  ISM 2.4000-2.4835 GHz, nearly world-wide available
  USA: 11 channels, Europe 13, Japan 14
  3 non-overlapping (1, 6, 11 with 22 MHz per channel)
1999: 802.11a (shipped in 2001)
  6, 9, 12, 18, 24, 36, 48, 54 Mbit/s, OFDM
  5.150-5.350 GHz, 8-12(-24) non-overlapping channels
  20 MHz per channel
2003: 802.11g
  1, 2, 5.5, 11, 12, 18, 24, 36, 48, 54 Mbit/s, DSSS and OFDM
  ISM 2.4 GHz => same channels as 802.11b
2004: 802.11i (Security)
  AES-CCM + 802.1x (TKIP/MIC only as migration solution)
2006: 802.11n
  Up to 600 Mbit/s via MIMO-OFDM
  Optimized MAC for higher throughput
Germany: "From 13 November 2002, frequencies in the bands 5150 MHz - 5350
MHz and 5470 MHz - 5725 MHz may be used for Wireless Local Area Networks
free of charge. The Regulatory Authority for Telecommunications and Posts
(RegTP) published a general assignment of these frequencies in its Official
Gazette of 13 November 2002."
802.11 began in the 902-928 MHz frequency range. However, as far as I know,
no standard was ever completed for this range, because the 900 MHz band is
license-free only in the Americas (North and South) and in Australia; in
Europe, Asia, and Africa it is not. Development therefore shifted quite
quickly to the 2.4 GHz band (which is license-free absolutely everywhere).
Infrared (900 nm), diffuse light versus directional light. Widely used in mobile
phones and laptops. IrDA 2000 standard. Transfer rate only 4 Mbit/s (directional).
Easy screening.

6
6 {C} Herbert Haas 2010/02/15
IEEE WLAN Standards Overview
802.11a - 5 GHz, ratified in 1999 (shipping 2001)
802.11b - 11 Mbit/s, 2.4 GHz, ratified in 1999
802.11c - MAC-layer bridging (802.1d)
802.11d - Additional regulatory domains (world mode)
802.11e - Quality of Service
802.11f - Inter-Access Point Protocol (IAPP)
802.11g - Higher data rate (>20 Mbit/s, actually 54 Mbit/s), 2.4 GHz
802.11h - 54 Mbit/s at 5 GHz using DFS and TPC (Europe)
802.11i - Authentication and security
802.11j - Japan regulatory conformance
802.11k - Radio Resource Management (Signal Quality, 2004)
802.11m - Various 802.11 improvements (bugfixes)
802.11n - Beyond 100 Mbit/s, longer distances (2004)
802.11p - Wireless Access for the Vehicular Environment (WAVE)
802.11r - Fast roaming
802.11s - Mesh networks
802.11T - Wireless Performance Prediction (WPP), test methods and metrics
802.11u - Interoperability with non-802 networks (e.g. cellular)
802.11v - WLAN Management
802.11a can communicate at a maximum rate of 72 Mbps but due to FCC
frequency restrictions, it is currently limited to 54 Mbps. If these regulations
change, a simple firmware upgrade will update your equipment.
There are two consortia for 802.11n: WWiSE and TGnSYNC.
802.11p (WAVE) is meant for "vehicular environments" such as ambulances and
passenger cars.
For 802.11n news, see
http://grouper.ieee.org/groups/802/11/Reports/tgnupdate.htm

7 {C} Herbert Haas 2010/02/15
Wireless Fidelity Alliance
Wi-Fi Alliance (1999)
  Certifies interoperability of IEEE 802.11 products and promotes them as the global, wireless LAN standard across all market segments
  Formerly known as Wireless Ethernet Compatibility Alliance (WECA)
Certified substandards
  802.11i => Wi-Fi Protected Access (WPA)
  802.11e => Wireless Multimedia (WMM)
www.wi-fi.com


10 {C} Herbert Haas 2010/02/15
Wireless Overview

11 {C} Herbert Haas 2010/02/15
A simple dot map of commercial wireless antennas in the USA
(Note: This was already the case in 2002!)
Source: Cybergeography.org
"I've suggested, half seriously, that the next step is direct mental input - the 'brain cap,' which
you put on your head and then the impressions, sights, everything, go directly into the brain.
I'm afraid this may turn us all into permanent 'couch potatoes' because we then need not travel
anywhere. We can experience anything, learn anything...just lying on the couch."
Arthur C. Clarke
The youthful-looking man above is Arthur C. Clarke: author of "2001: A
Space Odyssey" (and lots of other good science fiction novels), first promoter of
the geostationary orbit, and much else. Lives in Sri Lanka.

1
2010/02/15 {C} Herbert Haas
WLAN
Physically

3 {C} Herbert Haas 2010/02/15
Real Channel Overlapping (2.4 GHz)
IEEE 802.11b/g only specifies center frequencies and a spectral mask
  802.11b spectral mask requires that the signal be at least 30 dB down from its peak energy at 11 MHz from the center frequency and at least 50 dB down at 22 MHz
Therefore, actually ALL channels overlap
  Even the "non-overlapping" channels 1, 6, and 11
  Might be a problem with significant TX-power differences
[Figure: spectral mask (0 dB to -60 dB, steps at 11 MHz and 22 MHz from the center frequency) drawn over channels 1-12, center frequencies 2412 MHz to 2467 MHz in 5 MHz steps]
Since the spectral mask only defines power output restrictions up to ±22 MHz
from the center frequency, some people assume that the channel's energy doesn't
extend any further than that, but in reality it does. In fact, if the transmitter is
sufficiently powerful, the signal can be quite strong even beyond the ±22 MHz
point.
From this the so-called near/far problem follows: two communicating stations
encounter interference when a foreign station transmitting on an adjacent
channel is much closer to the receiver than the expected transmitter.
Therefore, it is incorrect to say that channels 1, 6, and 11 do not overlap. It is
more correct to say that, given the separation between channels 1, 6, and 11, the
signal on any channel should be sufficiently attenuated to interfere only minimally
with a transmitter on any other channel.
But this is not universally true. For example, a powerful transmitter on channel 1
can easily overwhelm a weaker transmitter on, e.g., channel 6. In one lab test,
throughput of a file transfer on channel 11 decreased slightly when a similar
transfer began on channel 1, indicating that even channels 1 and 11 can interfere
with each other a little.
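The mask rule and the near/far effect described here can be tried out with a few lines of Python. This is only a sketch: the channel formula (2407 + 5 x channel, with channel 14 at 2484 MHz) is general 802.11 knowledge rather than something stated on the slide, the stepwise mask is the simplified 30 dB / 50 dB rule from the slide, and the path-loss values in the usage note are made-up illustration numbers.

```python
def channel_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz channel (assumption: 5 MHz raster, channel 14 special)."""
    if channel == 14:
        return 2484
    if 1 <= channel <= 13:
        return 2407 + 5 * channel
    raise ValueError("channel must be in 1..14")


def mask_limit_db(offset_mhz: float) -> float:
    """Maximum level relative to the peak allowed by the simplified 802.11b mask."""
    offset = abs(offset_mhz)
    if offset < 11:
        return 0.0      # inside the main lobe: no restriction
    if offset < 22:
        return -30.0    # at least 30 dB down from 11 MHz on
    return -50.0        # at least 50 dB down from 22 MHz on


def worst_case_leakage_dbm(tx_dbm: float, path_loss_db: float,
                           own_channel: int, foreign_channel: int) -> float:
    """Upper bound on the power a foreign transmitter leaks into our channel."""
    offset = channel_center_mhz(foreign_channel) - channel_center_mhz(own_channel)
    return tx_dbm - path_loss_db + mask_limit_db(offset)


# Near/far illustration with hypothetical numbers:
# a nearby foreign station on channel 1 (40 dB path loss) versus
# a distant peer on our own channel 6 (95 dB path loss), both sending 20 dBm.
leak = worst_case_leakage_dbm(20, 40, own_channel=6, foreign_channel=1)   # -70.0
wanted = 20 - 95                                                           # -75
```

With these assumed numbers the leakage from the "non-overlapping" channel (-70 dBm) can be stronger than the wanted signal (-75 dBm), which is exactly the near/far situation.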

4 {C} Herbert Haas 2010/02/15
The Near/Far Problem
Foreign stations are closer and therefore "louder" than the distant peer
Only a problem if:
  Stations are very sensitive
  No roaming possible
[Figure: spectral masks of channel 6 and channel 11 (0 dB to -90 dB) drawn over channels 1-12; labels: "Channel 6 association", "Channel 11", "Ch 6", "Ch 11"]

5 {C} Herbert Haas 2010/02/15
802.11h: TPC and DFS
ETSI requires TPC and DFS for 5 GHz bands
  Otherwise only very limited powers allowed
Transmit Power Control (TPC)
  Reduces TX power if possible
  Provides minimum required TX power for each user
  Assures minimal interference
Dynamic Frequency Selection (DFS)
  Enables transmitter to move to another channel when it encounters 'Primary Applications' on its channel
  Basically designed to avoid interference with military RADAR
Interference Threshold I_th = 62 dBm/MHz is the maximum aggregate interference (as sensed by a node) allowed for channel access
Sharing Rules:
1. Non-Greedy Occupancy: No user may occupy the channel with rate 0 (no
data to send). This rule, for instance, prohibits devices from using jamming
techniques to gain exclusive use of a channel.
2. Channel Select: A channel is deemed accessible at a node if the aggregate
interference power at the intended receiver is less than I_th. The rules recognize
that a connection needs to be established between nodes before this rule can come
into operation. The channel width chosen is at the discretion of the node. A node
should be able to access any available channel in the allocation in question.
3. Range & Power Select: Nodes should reduce transmit power to the minimum
necessary to achieve the link margin they require. For the purposes of compliance
testing, for every reduction of transmit range by a factor A, the node must reduce
transmit power by a minimum of 20 log10(A) dB. Practical transmit power control
should be operational over a dynamic range of 12 dB with a step size of 1 dB.
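Rule 3 can be turned into a small calculation. The following Python sketch applies the 20 log10(A) rule and then snaps the result to the 1 dB step size and 12 dB dynamic range mentioned above; the function names and the rounding-up policy are illustrative assumptions, not taken from any standard text.

```python
import math


def required_power_reduction_db(range_factor: float) -> float:
    """Minimum power reduction when the transmit range shrinks by range_factor (A >= 1)."""
    if range_factor < 1:
        raise ValueError("range_factor must be >= 1")
    return 20 * math.log10(range_factor)


def tpc_setting_dbm(max_dbm: float, range_factor: float,
                    step_db: float = 1.0, dynamic_range_db: float = 12.0) -> float:
    """Transmit power after TPC, assuming reductions are rounded up to whole steps."""
    reduction = required_power_reduction_db(range_factor)
    steps = math.ceil(reduction / step_db - 1e-9)   # small epsilon guards float noise
    return max_dbm - min(steps * step_db, dynamic_range_db)
```

Halving the range (A = 2) requires about 6 dB less power, so a 20 dBm transmitter would be set to 13 dBm with 1 dB steps; for A = 10 the required 20 dB exceeds the 12 dB control range, and the sketch simply stops at 8 dBm.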

6 {C} Herbert Haas 2010/02/15
DFS Details
Non-Occupancy Time: 30 min
  The time a channel must not be used (after radar was detected on it)
Channel Availability Check: 60 sec
  Time before using a channel
Channel Move Time: 10 sec
  Must leave a channel within that time in case of radar detection
Channel Closing Transmission Time: 260 msec
  Total transmission time of certain management traffic during the Channel Move Time (Beacon and Probe Responses)
  No data traffic within the Channel Move Time!
Lightweight APs save a list of their current radar detections in flash;
therefore, after a reset the LAP still continues the non-occupancy time.

7 {C} Herbert Haas 2010/02/15
ISM 5 GHz - Europe [Jan 2007]
Austria (Source: Rundfunk & Telekom Regulierung GmbH)
[Figure: "Europe", a rough picture only]
  5.150-5.350 GHz: channels 1-8, 200 mW
  5.470-5.725 GHz: channels 9-19, 1 W
  Figure labels: "WLAN Usage: Indoor only! 802.11h", "Reserved for HIPERLAN (1 W ERP)", "Released in May 2005", "Released in fall 2005"
There are RADAR applications working at 5725-5825 MHz.
The 2003 World Radiocommunication Conference made 802.11a easier to use
worldwide (it opened the 5 GHz bands).

8 {C} Herbert Haas 2010/02/15
ISM 5 GHz - USA
Unlicensed National Information Infrastructure (U-NII)
[Figure: UNII-1 and UNII-2 (5.150-5.350 GHz, channels 1-8) and UNII-3 (5.725-5.825 GHz, channels 9-12); power labels in the figure: 50 mW indoor, 250 mW indoor, 1 W indoor, 200 mW outdoor, 1 W outdoor, 4 W outdoor]
WLAN usage:
  Indoor only! 40 mW with 6 dBi integrated antenna.
  In- and outdoor. Removable antennas allowed.
  Used mainly for bridging: 1 W with 6 dBi P2MP or max 23 dBi P2P
NEW: FCC allows removable antennas for all bands!
Generally the FCC allows much higher TX powers.
Additionally, there is a high-frequency 5 GHz range close to 6 GHz providing four
non-overlapping channels. This third band (UNII-3) is intended for long-distance
outdoor WLAN bridging, which can be done with 4 watts. The Cisco Aironet
1400 Bridge is designed for that UNII-3 band.
(ODU = Outdoor Unit)

9 {C} Herbert Haas 2010/02/15
5 GHz Comparison EU/USA
[Figure: Europe vs. USA band plan. USA: UNII-1 and UNII-2 (5.150-5.350 GHz) and UNII-3 (5.725-5.825 GHz, channels 9-12), labels "1 W indoor", "1 W outdoor", "4 W outdoor". Europe: channels 1-8 (200 mW) and channels 9-19 from 5.470 GHz (1 W)]
Generally the FCC allows much higher TX powers.
In the ETSI domain there are 11 additional channels in the middle of the 5-6 GHz
band, which can be used with up to 1 watt.
These high powers (more than 30 times higher than with 802.11g) and the
reduced noise and interference in that band more than compensate for the higher
free-space loss at 5 GHz, which results in remarkably long communication
distances (kilometers).
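The trade-off between higher power and higher free-space loss can be checked numerically. The sketch below uses the standard free-space path loss formula (distance in km, frequency in MHz, constant 32.44 dB); the 100 mW reference for 802.11g and the example frequencies 2440 MHz and 5500 MHz are assumptions chosen for illustration and do not come from the notes.

```python
import math


def mw_to_dbm(power_mw: float) -> float:
    """Convert a power in milliwatts to dBm."""
    return 10 * math.log10(power_mw)


def free_space_loss_db(distance_km: float, freq_mhz: float) -> float:
    """Free-space path loss between isotropic antennas."""
    return 20 * math.log10(distance_km) + 20 * math.log10(freq_mhz) + 32.44


# Extra loss at 5.5 GHz compared with 2.44 GHz over the same distance (about 7 dB)
extra_loss_db = free_space_loss_db(1.0, 5500) - free_space_loss_db(1.0, 2440)

# Power advantage of 4 W over an assumed 100 mW (about 16 dB)
power_gain_db = mw_to_dbm(4000) - mw_to_dbm(100)
```

Under these assumptions the power advantage (roughly 16 dB) is clearly larger than the additional free-space loss (roughly 7 dB), which is consistent with the claim above. The difference in path loss does not depend on the distance, since the distance term cancels.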

10
10 {C} Herbert Haas 2010/02/15
Regulations & Law
WLAN senders may radiate beyond premises borders
  Directional antennas which are used to reach across foreign premises should be reported (Austria, Germany)
  Therefore it is still legally difficult to prosecute layer-1 based DoS attacks
Germany: www.regtp.de Austria: www.rtr.at
USA: www.fcc.gov
In the European (ETSI) domain each country may still specify local requirements
in addition to the ETSI limits. However (fortunately), these country-specific
deviations seem to be disappearing.
Regarding the Austrian situation only:
In Austria, the Rundfunk & Telekom Regulierungs-GmbH works closely
with the Telekom-Control-Kommission (TKK) and the
Kommunikationsbehörde Austria (KommAustria, which falls under the
Federal Chancellery).
The radio interface descriptions are available online at:
http://www.rtr.at/de/tk/FRQSP2400MHz [Nov 2007]
(formerly http://www.bmvit.gv.at/radiointerfaces)

11
11 {C} Herbert Haas 2010/02/15
802.11d - "World Mode"
"Extensions to Operate in Additional Regulatory Domains"
Ratified in June 2001
Defines frequency and power limitations for different regulatory domains
Allows clients to roam across different regulatory domains
  APs are set to the appropriate regulatory domain
  During association, clients inherit the power and frequency requirements of this regulatory domain
[Figure: other users of the 5.0-6.0 GHz range: aeronautical navigation, satellite (FSS), radar and science, radionavigation, maritime navigation, amateur radio, radiolocation]
802.11d world mode allows a world-wide WLAN operator to announce the local
RF limits to its roaming clients via the AP.

12 {C} Herbert Haas 2010/02/15
"Surrounding" Applications
2.4 GHz ISM
This is just for your interest and to give a feeling for where GSM and UMTS
operate in comparison to WLAN.

13
13 {C} Herbert Haas 2010/02/15
US Frequency Plan (3 kHz - 300 GHz)
This picture should simply provide an impression of the FCC frequency plan,
from 3 kHz to 300 GHz (Extremely Low Frequency to Far Infrared).
Clearly it is not that easy to find a free frequency range for new applications.

14
2010/02/15 {C} Herbert Haas
Modulation Techniques
Spread Spectrum Basics
FHSS vs. DSSS
QAM Variants and CCK
OFDM

15
15 {C} Herbert Haas 2010/02/15
UNITED STATES PATENT OFFICE 2,292,387
SECRET COMMUNICATION SYSTEM
Hedy Kiesler Markey, Los Angeles, and George Antheil, Manhattan Beach, Calif.
Application June 10, 1941, Serial No. 397,412 6 Claims. (Cl. 250-2)
This invention relates broadly to secret communication systems involving the use of carrier waves of different
frequencies, and is especially useful in the remote control of dirigible craft, such as torpedoes.
An object of the invention is to provide a method of secret communication which is relatively simple and
reliable in operation, but at the same time is difficult to discover or decipher.
Hedy Lamarr was honored with the Viktor Kaplan Medal of the Austrian
Association of Patent Holders and Inventors on October 16, 1998. The medal,
considered the highest award which can be bestowed upon inventors in Austria,
was presented to Miss Lamarr for her pioneering contribution to enabling radio
communications to be made secure from interference and eavesdropping. Miss
Lamarr was proposed for the medal by Dr. Peter Paul Sint of the Austrian
Academy of Sciences. In support of the nomination, Dr. Sint stated that her
invention was decades ahead of its time and anticipated "essential elements of
digital logic." Hedy Lamarr was the recipient of a number of technology prizes in
the US during 1997. The presentation of the Viktor Kaplan Medal is the first such
recognition of her achievement in her homeland Austria. As with prior awards,
Miss Lamarr did not personally attend the Kaplan Medal presentation ceremony
in Esterhazy Palace in Eisenstadt, Austria. She was represented by her son,
Anthony Loder.
BTW: The tremendous fame of the movie "Ecstasy" is due above all to a single
scene in which the audience sees Hedy Lamarr swimming nude in a lake and then
running through a nearby forest. This sequence - lasting several minutes - is
considered the first nude scene in cinematic history and caused a worldwide
scandal in the 1930s. "Ecstasy" was then banned in many countries of the world -
most notably in the US - or only a radically expurgated version of it was
permitted to be shown.

16 {C} Herbert Haas 2010/02/15
Why Bandwidth Spreading?
If input power is spread over a large band: hard to intercept
The noise is reduced (compared to the noise in the total bandwidth used) by the spreading gain
To synchronize, we multiply with all possible shifted versions of the PN sequence
  Fast auto-correlation needed
[Figure: power density P over frequency f at each stage, next to the slide's spreading-gain formula in T and T_c.
  BW spreading: the sender reduces the spectral power density but conserves the total energy.
  To transmit: high-power narrowband interference and low-power broadband interference add to the signal.
  De-spread and bandpass filter: the receiver recovers the original signal by correlation.]
During transmission, narrowband and broadband interference add to the user signal.
The power density of the spread signal can be lower than that of the narrowband
signal. It can even be lower than the ambient noise.

17 {C} Herbert Haas 2010/02/15
Bandwidth Spreading Methods
Direct Sequence Spread Spectrum (DSSS)
  802.11b/g: 14 possible channels - 3 channels can be used simultaneously
  Can operate with SNR of 12 dB
  Throughput up to 11 Mbit/s (and more)
  Range up to 40 km (and more)
Frequency Hopping Spread Spectrum (FHSS)
  802.11: 79 possible channels - 15 channels can be used simultaneously
  Can operate with SNR of 18 dB
  Interference tolerant
  Less multipath problems
  Technically limited up to 2 Mbit/s
OFDM (Multicarrier Modulation)
  Actually used to minimize the required bandwidth but often referred to as a spreading technique in the WLAN context
DSSS defines a set of channels spaced across the whole radio bandwidth. There are 14 of these
channels, but channel 14 is reserved for Japan. DSSS modulates the data with a spreading code
(chipping) and transmits the result on only one of these channels. There has to be 30 MHz between
the carrier frequencies for multiple access points to operate within the same area without
interference. Since the entire bandwidth is 83.5 MHz, at most 3 DSSS access points
can operate within the same area. The limited total bandwidth available is also the method's
vulnerability: if narrowband interference occurs in the channel in use, one can only wait until it
disappears before communication can resume. In return DSSS gives a longer range. The
modulation technique can operate with a signal-to-noise ratio (SNR) of 12 dB, whereas FHSS
needs an SNR of 18 dB.
FHSS defines a set of channels spaced across the whole radio bandwidth. Here in Norway, there
are 79 such channels. When transmitting, FHSS uses only one channel at a time, following a
predetermined sequence and dwell time between hops. There are 78 such sequences, and they are
orthogonal so that they do not interfere with each other. This enables as many as approx. 15
access points belonging to different systems to coexist. Because the whole bandwidth is available
and the signal is sent/received on only a small part of it at a time, this is the method most tolerant
of narrowband interference. Such interference does not block the communication entirely, but only
when the hopping pattern happens to hit the interfering frequency, and only for the duration of the
dwell time, often set to 128 ms. Interference across the whole band would of course stop FHSS as
well. FHSS also has the lowest power consumption.
Orthogonal Frequency Division Multiplexing (OFDM) is a multicarrier transmission method and
actually tries to reduce the required bandwidth. Using OFDM together with QAM (see later), very
high data rates can be achieved, so a given bandwidth (20-22 MHz with WLANs) is
optimally utilized.

18 {C} Herbert Haas 2010/02/15
DSSS
User bit-pattern is modulated (substituted) with chipping sequence ("Barker code")
  Each bit of data is encoded by 11 bits of the chipping sequence
  802.11b: 22 MHz modulation bandwidth
[Figure: User Data XOR Chipping Sequence = Resulting Signal, with bit time t_b and chip time t_c]
Chipping Sequence: 10110111000 (Barker Code*)
In 802.11, the chipping sequence is known as the Barker code, an 11-bit
sequence (10110111000) with certain mathematical properties that make it ideal
for modulating radio waves. The basic data stream is XORed with the Barker
code to generate a series of data objects called chips. Each bit is "encoded" by the
11-bit Barker code, and each group of 11 chips encodes one bit of data.
Direct Sequence Spread Spectrum (DSSS) XORs a chipping sequence onto
the user data to spread the signal (digital modulation). The spread signal is then
modulated onto a carrier (analog modulation).
User data bit: bit length $t_b$
Chipping sequence: smaller bit length $t_c$ (chips)
Spreading factor: $s = t_b / t_c$
Bandwidth of the spread signal: $s \cdot w$, with $w$ the bandwidth of the original signal
Civil systems use spreading factors of 10-100 (the Barker code has factor 11);
military systems use spreading factors up to 10000.
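The XOR spreading described above is easy to reproduce. The following Python sketch spreads each data bit with the 11-chip Barker sequence from the slide and despreads by counting how many chips of each block differ from the code; the majority decision in the receiver is a simplification chosen for illustration (real receivers correlate the sampled radio signal).

```python
BARKER = [1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0]   # chipping sequence from the slide


def spread(bits):
    """XOR every data bit with all 11 chips of the Barker code."""
    return [bit ^ chip for bit in bits for chip in BARKER]


def despread(chips):
    """Recover data bits; tolerates a few flipped chips per block."""
    bits = []
    for start in range(0, len(chips), len(BARKER)):
        block = chips[start:start + len(BARKER)]
        # 0 mismatches means bit 0 was sent, 11 mismatches means bit 1 was sent
        mismatches = sum(received ^ chip for received, chip in zip(block, BARKER))
        bits.append(1 if mismatches > len(BARKER) // 2 else 0)
    return bits


data = [1, 0, 1, 1]
chips = spread(data)          # 44 chips for 4 bits: spreading factor 11
noisy = list(chips)
for position in (0, 5, 12):   # flip three chips to mimic interference
    noisy[position] ^= 1
recovered = despread(noisy)   # still equal to data
```

A data bit 0 is transmitted as the Barker code itself and a data bit 1 as its inverse, so a handful of corrupted chips per bit does not change the decision.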

19 {C} Herbert Haas 2010/02/15
Codes Used
For 5.5 and 11 Mbps data rates, Barker sequences are not used
Instead Complementary Code Keying (CCK) is used (64 8-bit code words)
By regulation, a DSSS system in the ISM band must have a minimum of 10 dB
processing gain
  1 Mbps: 11-bit Barker code, processing gain 10.4 dB
  11 Mbps: CCK, processing gain 11 dB
Any high-rate modulation is more susceptible to jamming, multipath interference,
and filter distortion than lower-rate modulation because of the higher required
SNR ($E_S/N_0$).
Processing gain is the reason why DS is relatively jamming resistant, provided
that the WLAN hardware is designed well (which is often not the case with cheap
hardware).
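The 10.4 dB figure for the Barker code follows directly from the number of chips per bit. A minimal sketch, assuming the usual definition of processing gain as 10 log10 of the spreading factor:

```python
import math


def processing_gain_db(chips_per_bit: float) -> float:
    """Processing gain of a direct-sequence system with the given spreading factor."""
    return 10 * math.log10(chips_per_bit)


barker_gain = processing_gain_db(11)   # about 10.41 dB, just above the 10 dB minimum
```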

21 {C} Herbert Haas 2010/02/15
FHSS
Available bandwidth split into several smaller channels with smaller bandwidth
Sender and receiver use one of these smaller channels for a part of the time, then jump to the next one
  Pseudo-random jump sequence
  Avoids being stuck in a bad frequency band
Slow hopping: multiple bits before frequency hop
Fast hopping: multiple frequency hops per bit
On multi-access media, collisions are only rare
ISM bandwidth (2.4 GHz) = 83 MHz is divided into 1 MHz channels for FHSS
FCC requires that any FHSS radio must visit at least 79 of the channels at least once in 30 seconds
  Minimum hop rate: 2.5 hops/second
Note: The original 802.11 implementations only used FHSS, but it is still used in critical environments today (airports etc.)
Frequency Hopping Spread Spectrum (FHSS) uses a radio that moves, or hops,
from one frequency to another at predetermined times and channels. The hopping
pattern is specified in the WLAN beacons.
The regulations require that the maximum time spent on any one channel is
400 ms. For the 1 and 2 Mbit/s FH systems, the hopping pattern must include 75
different channels and must use every channel before reusing any one.
For Wide Band Frequency Hopping (WBFH) systems, which permit data rates of
up to 10 Mbit/s, the rules require the use of at least 15 channels, and they cannot
overlap. With only 83 MHz of spectrum, this limits the systems to 15 channels,
thereby causing scalability issues.
GFSK is used for the modulation process.
FHSS was used for the initial IEEE 802.11 standard, providing up to 2 Mbit/s, but
it is still available today, manufactured by certain vendors, to allow wireless data
transmission in difficult environments such as airports.
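The idea of a shared pseudo-random hopping pattern can be sketched in a few lines. Note the assumptions: real 802.11 FHSS radios use hopping patterns defined in the standard, whereas this illustration simply shuffles the channel list with a seed that both sides are assumed to know; the 400 ms dwell time and the 79 channels are taken from the text.

```python
import random

DWELL_MS = 400        # maximum time on one channel, from the notes
NUM_CHANNELS = 79     # number of 1 MHz channels, from the slide


def hop_sequence(seed: int, num_channels: int = NUM_CHANNELS):
    """Pseudo-random order that uses every channel once before reusing any."""
    rng = random.Random(seed)          # same seed on sender and receiver
    channels = list(range(num_channels))
    rng.shuffle(channels)
    return channels


def channel_at(time_ms: int, seed: int) -> int:
    """Channel in use at a given time for stations sharing the same seed."""
    sequence = hop_sequence(seed)
    return sequence[(time_ms // DWELL_MS) % len(sequence)]


min_hop_rate = 1000 / DWELL_MS   # 2.5 hops per second
```

Because sender and receiver derive the same sequence from the same seed, they meet on the same channel at any time, while a station using a different seed collides with them only occasionally.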

22 {C} Herbert Haas 2010/02/15
QAM
802.11a and HiperLAN
  Wireless Medium: OFDM
  BPSK @ 6 and 9 Mbps
  QPSK @ 12 and 18 Mbps
  16-QAM @ 24 and 36 Mbps
  64-QAM @ 48 and 54 Mbps
802.11b
  Wireless Medium: DSSS
  DBPSK @ 1 Mbps
  DQPSK @ 2 Mbps
  16 CCK @ 5.5 Mbps
  256 CCK @ 11 Mbps
DBPSK: Only "1" causes periodic phase shifts.
[Figure: I/Q constellation diagrams. Standard PSK: two points (1, 0). Quadrature PSK (QPSK): four points (10, 11 above; 00, 01 below). 16-QAM. Other example: modem V.29, 2400 baud, max. 9600 bit/s, axes Re{U_i} and Im{U_i} with levels 1 V, 3 V, 5 V]
It is important to understand that spread spectrum (or OFDM) techniques are always combined
with a symbol modulation scheme. Quadrature Amplitude Modulation (QAM) is a general method
from which practical methods such as BPSK, QPSK, etc. are derived.
The main idea of QAM is to combine phase and amplitude shift keying. Since orthogonal
functions (sine and cosine) are used as carriers, they can be modulated separately, combined into a
single signal, and (due to the orthogonality property) separated again by the receiver.
And since $A\cos(\omega t + \varphi) = A\{\cos(\omega t)\cos(\varphi) - \sin(\omega t)\sin(\varphi)\}$, QAM can easily be
represented in the complex domain as $\mathrm{Re}\{A\,e^{i\varphi}\,e^{i\omega t}\}$.
The standard PSK method only uses phase jumps of 0° or 180° to describe a binary 0 or 1. In the
right picture above you see an enhanced PSK method, Quadrature PSK (QPSK). With QPSK
each state (phase shift) represents 2 bits instead of 1. Now it is
possible to transfer the same data rate with half the bandwidth.
The QPSK signal uses (relative to the reference signal)
- 45° for a data value of 11
- 135° for a data value of 10
- 225° for a data value of 00
- 315° for a data value of 01
To reconstruct the original data stream the receiver needs to compare the incoming signal with the
reference signal. Synchronization is therefore very important.
Why not code more bits per phase jump?
Especially in mobile communication there is too much interference and noise to decode correctly.
The more bits you use per phase jump, the closer together the signal states get, until it becomes
impossible to reconstruct the original data stream. In wireless communication QPSK has proven
to be a robust and efficient technique.
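The QPSK mapping listed above can be written down as a small modulator and demodulator working on complex baseband symbols. This is a sketch for illustration only: it uses unit amplitude, assumes perfect synchronization with the reference signal, and decides by quadrant; the function names are made up.

```python
import cmath
import math

# Phase (in degrees) for each pair of data bits, as listed in the notes
QPSK_PHASE_DEG = {"11": 45, "10": 135, "00": 225, "01": 315}


def qpsk_modulate(bits: str):
    """Map a bit string (even length) to unit-amplitude complex symbols."""
    if len(bits) % 2:
        raise ValueError("QPSK needs an even number of bits")
    return [cmath.rect(1.0, math.radians(QPSK_PHASE_DEG[bits[i:i + 2]]))
            for i in range(0, len(bits), 2)]


def qpsk_demodulate(symbols) -> str:
    """Decide each symbol by the quadrant it falls into."""
    pairs = []
    for symbol in symbols:
        if symbol.real >= 0 and symbol.imag >= 0:
            pairs.append("11")      # first quadrant, around 45 degrees
        elif symbol.real < 0 and symbol.imag >= 0:
            pairs.append("10")      # second quadrant, around 135 degrees
        elif symbol.real < 0:
            pairs.append("00")      # third quadrant, around 225 degrees
        else:
            pairs.append("01")      # fourth quadrant, around 315 degrees
    return "".join(pairs)


sent = "11100001"
symbols = qpsk_modulate(sent)                       # 4 symbols for 8 bits
disturbed = [s + (-0.3 - 0.2j) for s in symbols]    # moderate noise
received = qpsk_demodulate(disturbed)               # still equal to sent
```

The example also shows why more bits per symbol are fragile: with only four states a disturbance must push a symbol across an axis to cause an error, whereas denser constellations leave less room between neighboring states.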

23 {C} Herbert Haas 2010/02/15
CCK
Based on Marcel J. E. Golay (1951) polyphase complementary codes
Has ideal autocorrelation (ACF) properties
Complex codes
  6 bits of each byte select one of 64 unique orthogonal eight-chip-long polyphase complementary codes
  The other two bits rotate the whole code word (0, 90, 180 or 270 degrees)
8 chips => 1 symbol, hence 1.375 Mbaud => 11 Mchips/s
Symbol is an 8-dimensional vector with complex components:
  Data bits encode component phases using DQPSK
  $\varphi_1$ is contained in all 8 chips => rotates the vector
Same spectrum shape as with Barker code words
Example: 10110101
Assuming that the bits of an 8-bit word control the phase components according to
  d1 d0 -> $\varphi_1$
  d3 d2 -> $\varphi_2$
  d5 d4 -> $\varphi_3$
  d7 d6 -> $\varphi_4$
and that the following QPSK specification is true
  0 0 -> $0$
  0 1 -> $\pi$
  1 0 -> $\pi/2$
  1 1 -> $-\pi/2$
then the codeword transforms into
{1, -1, j, j, -j, j, -1, -1}
Based on Marcel J. E. Golay, 1951, spectrometer application. The Walsh
transform is a special case of the Fourier transform and is used for the correlation.
The eight components of the 8-dimensional vector are complex chips, as shown
in the example on the right (1, -1, j, j, -j, j, -1, -1).
CCK is a variation on M-ary Orthogonal Keying modulation, which uses an I/Q
modulation architecture with complex symbol structures. CCK allows 802.11b
multi-channel operation in the 2.4 GHz band using the existing 802.11 DSSS
channel structure. The spreading employs the same chipping rate and
spectrum shape as the 802.11 Barker code word. The spreading function for CCK
in 802.11b is chosen by the data word from a set of M nearly orthogonal vectors.
CCK uses one vector from a set of 64 complex (QPSK) vectors for the symbol
and thereby modulates 6 bits (one of 64) onto each 8-chip spreading code symbol.
In 802.11b, the formula that defines the CCK codewords has 4 phase terms.
The first of them modulates all of the chips and is used for the QPSK rotation
of the whole code vector. The second modulates every odd chip, the third
modulates every odd pair of chips, and the fourth modulates every odd quad of
chips.
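The slide's example can be reproduced in code. The codeword formula itself is not in the text above; the sketch assumes the well-known 802.11b 11 Mbit/s form, in which the first phase appears in all chips, the second in every odd chip, the third in every odd pair and the fourth in the first four chips, with the 4th and 7th chip negated. The bit-pair-to-phase table and the bit order (word written d7 ... d0) follow the slide example.

```python
import cmath
import math

# Phase for each bit pair (written d(i+1) d(i)), as in the slide's QPSK table
PHASE = {"00": 0.0, "01": math.pi, "10": math.pi / 2, "11": -math.pi / 2}


def cck_codeword(word: str):
    """8 complex chips for an 8-bit word written as d7 d6 ... d0."""
    d = word[::-1]                       # d[i] is now bit d_i
    p1 = PHASE[d[1] + d[0]]
    p2 = PHASE[d[3] + d[2]]
    p3 = PHASE[d[5] + d[4]]
    p4 = PHASE[d[7] + d[6]]

    def e(phi):
        return cmath.exp(1j * phi)

    # Assumed 802.11b CCK formula; chips 4 and 7 carry a minus sign
    return [e(p1 + p2 + p3 + p4), e(p1 + p3 + p4), e(p1 + p2 + p4), -e(p1 + p4),
            e(p1 + p2 + p3), e(p1 + p3), -e(p1 + p2), e(p1)]


def as_labels(chips):
    """Render chips as 1, -1, j, -j for comparison with the slide."""
    names = {(1, 0): "1", (-1, 0): "-1", (0, 1): "j", (0, -1): "-j"}
    return [names[(round(c.real), round(c.imag))] for c in chips]


example = as_labels(cck_codeword("10110101"))
# ['1', '-1', 'j', 'j', '-j', 'j', '-1', '-1']
```

For the word 10110101 the four phases come out as pi, pi, -pi/2 and pi/2, and the resulting chips match the codeword given on the slide.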

24 {C} Herbert Haas 2010/02/15
OFDM
Orthogonal Frequency Division Multiplexing (OFDM)
Avoids multipath-induced interference that always occurs at higher symbol rates
1966: Chang (Bell Labs) issued OFDM paper and patent
1993: Morris implemented first experimental OFDM WLAN at 150 Mbit/s
Basic idea:
  1) Split data stream into multiple lower-rate streams
  2) Convert n bits into m QAM symbols
  3) Regard the m QAM symbols as a discrete complex spectrum and convert it into the time domain via the inverse FFT ($\mathrm{FFT}^{-1}$)
The m complex QAM symbols must be "mirrored" appropriately in order to get real-valued time-domain values (hint: amplitudes even, phase odd)
Each element of the "QAM-vector" can be interpreted as a subchannel
Subchannels overlap!
  Approx. 50% less total bandwidth necessary than FDM
ISI is minimized because of orthogonal sub-bands
  Equivalent to Nyquist pulses in time domain
In Europe a special modulation type for digital radio, called Digital Audio
Broadcast (DAB), is used. This modulation method uses many frequencies at the
same time (Multicarrier Modulation, MCM). The big advantage is the
robustness against ISI. The higher the symbol rate, the stronger the ISI effect.
For this reason MCM splits the symbol stream into several lower-rate streams,
each on its own carrier.
For example:
If n symbols per second are spread over c carriers, then only n/c symbols per second need
to be transferred on each carrier, and each symbol represents 2 bits (as with QPSK). Only small parts
of the signal are destroyed during strong interference.
The DAB standard can use 192 to 1536 carriers at the same time.
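The basic idea from the OFDM slide (map bits to QAM symbols, mirror the spectrum, transform to the time domain) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an implementation of any standard: the QPSK bit mapping, all function names, and the naive O(N^2) inverse DFT (used instead of a real FFT to avoid dependencies) are choices made for this example.

```python
import cmath

# Assumed Gray-coded QPSK mapping: 2 bits -> one complex symbol.
QPSK = {(0, 0): 1 + 1j, (0, 1): -1 + 1j, (1, 1): -1 - 1j, (1, 0): 1 - 1j}


def bits_to_symbols(bits):
    """Group the bits in pairs and map each pair to a QPSK symbol."""
    return [QPSK[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


def mirrored_spectrum(symbols):
    """Build a spectrum with X[N-k] = conj(X[k]) ("amplitudes even, phase odd").

    Bin 0 (DC) and bin N/2 stay empty, so the inverse DFT is purely real.
    """
    m = len(symbols)
    n = 2 * (m + 1)
    spectrum = [0j] * n
    for k, s in enumerate(symbols, start=1):
        spectrum[k] = s
        spectrum[n - k] = s.conjugate()
    return spectrum


def idft(spectrum):
    """Naive inverse DFT: frequency domain -> time domain."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def dft(samples):
    """Naive forward DFT, as the receiver would apply it."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


symbols = bits_to_symbols([0, 0, 0, 1, 1, 1, 1, 0])
time_signal = idft(mirrored_spectrum(symbols))
# The imaginary parts vanish (up to rounding), so only real samples are sent.
real_samples = [x.real for x in time_signal]
# The receiver recovers the QAM symbols from bins 1..m of the forward DFT.
recovered = dft(real_samples)[1:len(symbols) + 1]
```

Each spectrum bin plays the role of one subchannel; a real system would use an FFT and add a guard interval.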

OFDM - 802.11a Details (1)
Channel BW is 20 MHz (occupied BW is 16.6 MHz)
52 subcarriers are used per channel
48 subcarriers carry the data
4 subcarriers are pilots which facilitate phase tracking for coherent demodulation
Subcarrier separation: 312.5 kHz (20 MHz/64)
Each of these subcarriers can be a BPSK, QPSK, 16-QAM or 64-QAM coded signal
[Figure: time-domain construction of an OFDM signal from its constituent carriers]
OFDM is efficiently realized by the use of effective signal processing, the fast Fourier
transform, in the transmitter and receiver. This significantly reduces the amount
of required hardware compared to earlier FDM systems. One of the benefits of
OFDM is its robustness against the adverse effects of multipath propagation with
respect to intersymbol interference. It is also spectrally efficient because the
subcarriers are packed maximally close together. OFDM also offers great
flexibility in the choice and realization of different modulation
alternatives.
OFDM, Orthogonal Frequency Division Multiplex, is a special form of
multicarrier modulation. The basic idea is to transmit broadband, high data rate
information by dividing the data into several interleaved, parallel bit streams and
letting each of these bit streams modulate a separate subcarrier. In this way the
channel spectrum is divided into a number of independent, non-selective frequency
subchannels. These subchannels are used for one transmission link between the
AP and the MNs.
The time domain construction of an OFDM signal from its constituent carriers is
shown above. The data values can be adjusted. For some data combinations the
peak power is much higher than for others, and this can complicate analog
amplifier design in OFDM systems. In multipath channels, the delays can cause
symbol overlap, destroying the perfect sum of sinusoids. This is easily fixed by
cyclically extending the signal by a length longer than the channel delay.

OFDM - 802.11a Details (2)
Symbol duration is 4 microseconds (250,000 symbols/sec)
With a guard interval of 800 ns
Optional shorter guard interval of 400 ns may be used in small indoor environments
Generation of orthogonal components is done in baseband (via DSPs), which is then upconverted to 5 GHz at the transmitter
Each subcarrier can be represented as a complex number
The time domain signal is generated by IFFT
The receiver downconverts, samples at 20 MHz and does an FFT to retrieve the original complex coefficients
The guard interval is needed to achieve the desired spectral shape.
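The 802.11a figures on the two detail slides fit together arithmetically, which the following sketch checks. The constants are taken from the slides; the function names are made up for this example, and the coding rates (1/2 for the lowest mode, 3/4 for the highest) are not on the slides but are assumed from the 802.11a rate table.

```python
CHANNEL_BW_HZ = 20e6        # channel bandwidth = receiver sample rate
FFT_SIZE = 64               # 20 MHz / 64 gives the subcarrier separation
DATA_SUBCARRIERS = 48       # 52 used subcarriers minus 4 pilots
SYMBOL_DURATION_S = 4e-6    # including the guard interval
GUARD_INTERVAL_S = 800e-9


def subcarrier_spacing_hz():
    """Spacing between adjacent subcarriers."""
    return CHANNEL_BW_HZ / FFT_SIZE


def useful_symbol_time_s():
    """Part of the OFDM symbol that the receiver's FFT actually evaluates."""
    return SYMBOL_DURATION_S - GUARD_INTERVAL_S


def data_rate_bps(bits_per_subcarrier, coding_rate):
    """Net bit rate: data subcarriers x bits per symbol x code rate / symbol time."""
    return DATA_SUBCARRIERS * bits_per_subcarrier * coding_rate / SYMBOL_DURATION_S


spacing = subcarrier_spacing_hz()          # 312.5 kHz, as on the slide
fastest = data_rate_bps(6, 3 / 4)          # 64-QAM: 6 bits per subcarrier
slowest = data_rate_bps(1, 1 / 2)          # BPSK: 1 bit per subcarrier
```

Orthogonality requires the useful symbol time to be the reciprocal of the subcarrier spacing: 1 / 312.5 kHz = 3.2 microseconds, which is exactly 4 microseconds minus the 800 ns guard interval.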

OFDM - Pros and Cons
Advantages
High spectrum efficiency
High multipath resistance
Generally better interference resistance
All this results in longer distances
Drawbacks
More expensive circuits
Higher power consumption (compared to 802.11b)
Envelope of multi-carrier modulation results in high crest factors (peak to average power)
Nonlinear effects in analog devices and ADCs
Results in BW spreading (higher order signals)
Four-Wave Mixing
Neighbor channel interference degrades receiver sensitivity
Therefore 30 mW EIRP limitation (2.4 GHz)
Channel overlapping is more critical ("Bart Simpson Head")
[Figure: power P over frequency f, spectrum shapes of DSSS and OFDM]
Note: OFDM was originally only planned for 802.11a in the "clean" 5 GHz band
since the QAM used here is relatively noise-sensitive, much more so than
DSSS. Considering this, how will 802.11g really perform in noisy 2.4 GHz
environments?

Antennas
...and a bit of physics.

"Was it not the God who wrote these signs,
that have calmed the alarm of my soul and have
opened to me a secret of nature?"
Ludwig Boltzmann, quoting "Faust" when
he first saw the Maxwell equations.
James Clerk Maxwell. The famous "Maxwell Equations",
a complete description of the EM field
All phenomena of the electromagnetic field are covered by the famous Maxwell
equations. Fortunately (or not?) we do not need these equations in the following
sections, but since they are so remarkable, short, and the basis of all this, they are
presented here in order to praise Maxwell.

Decibels
Why use decibels?
Extremely large and extremely small factors are mapped into a small interval
Multiplication and division are transformed into addition and subtraction
Increase   Factor      Decrease   Factor
0 dB       1 x         0 dB       1 x
1 dB       1.25 x      -1 dB      0.8 x
3 dB       2 x         -3 dB      0.5 x
6 dB       4 x         -6 dB      0.25 x
10 dB      10 x        -10 dB     0.10 x
12 dB      16 x        -12 dB     0.06 x
20 dB      100 x       -20 dB     0.01 x
30 dB      1000 x      -30 dB     0.001 x
40 dB      10,000 x    -40 dB     0.0001 x
We mostly need dB, dBm, and dBi,
and only rarely dBW and dBd (at least in the WLAN context)
Radio frequency signals are subject to various losses and gains as they pass from the
transmitter through cable to the antenna, through air (or solid obstructions), to the
receiving antenna, cable and receiving radio. With the exception of solid
obstructions, most of these figures and factors are known and can be used in the
design process to determine whether an RF system such as a WLAN will work.
The table above lists some examples by dB. An increase of 3 dB
indicates a doubling (2x) of power. An increase of 6 dB indicates a quadrupling
(4x) of power. Conversely, a decrease of 3 dB is a halving (1/2) of power, and a
decrease of 6 dB is a quarter (1/4) of the power.
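The decibel rules above follow from the definition dB = 10 log10(power ratio). A minimal sketch of the conversions, with function names chosen only for this example:

```python
import math


def ratio_to_db(power_ratio):
    """Power ratio (e.g. 2 for a doubling) -> decibels."""
    return 10.0 * math.log10(power_ratio)


def db_to_ratio(db):
    """Decibels -> power ratio."""
    return 10.0 ** (db / 10.0)


def mw_to_dbm(milliwatts):
    """Absolute power in mW -> dBm (0 dBm = 1 mW)."""
    return 10.0 * math.log10(milliwatts)


def dbm_to_mw(dbm):
    """dBm -> absolute power in mW."""
    return 10.0 ** (dbm / 10.0)


# Multiplying factors corresponds to adding decibels:
# 2 x 2 = 4  <->  3 dB + 3 dB = 6 dB (approximately).
quadrupling_db = ratio_to_db(2) + ratio_to_db(2)
```

The round numbers in the table are approximations: 3 dB is a factor of about 1.995, not exactly 2.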

Generating Radio Waves
Goal: Inject the waveguide wave from the sender into free space
Antennas are "opened" oscillator circuits
Radio waves are generated by accelerated electrons in the antenna
Antenna length L
Good efficiency if $L \approx \lambda$
$L = \lambda/2$ (dipole)
$L = \lambda/4$ (monopole)
To concentrate power in a desired direction requires $L > \lambda$
[Figure: real antenna length plus mirrored antenna length gives the effective antenna length]
An applied alternating voltage (e.g. oscillating at 2.4 GHz) forces the electrons to
move back and forth along the axis of an antenna. Each time the electrons
change direction they emit radiation. This radiation is 'oriented' or
polarized in the same way as the current from which it originated.

Antenna Gain
$G_{[dBi]} = 10 \log G$
$G = \frac{\text{maximum power density towards specific direction}}{\text{mean power density (isotropic radiation)}}$
$G = \frac{4 \pi A_e}{\lambda^2}$
Power density at the receiver (distance r from the sender): $S_R = \frac{P_S G_S}{4 \pi r^2}$
Hertz' dipole: G = 1.5
$\lambda/2$ dipole: G = 1.64 (= 2.14 dBi = 0 dBd)
Parabolic dish with 4 m diameter at 2.4 GHz: $G = 10^4$
Power at receiver's antenna output: $P_R = P_S G_S G_R \left( \frac{4 \pi r}{\lambda} \right)^{-2}$
$A_e$ ... effective antenna surface ("aperture").
The equation for the received power is sometimes also called "Friis' transmission
equation".
Note that for real-world (especially indoor) calculations, the effective antenna
gain is smaller because of obstacles, multipath, etc.
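The Friis transmission equation from the slide, P_R = P_S G_S G_R (4 pi r / lambda)^-2, can be evaluated directly. The sketch below uses linear (not dB) gains and SI units; the function names are assumptions made for this example.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def wavelength_m(f_hz):
    """Free-space wavelength for a given frequency."""
    return C / f_hz


def gain_from_aperture(a_e_m2, f_hz):
    """Linear antenna gain from the effective aperture: G = 4 pi A_e / lambda^2."""
    return 4.0 * math.pi * a_e_m2 / wavelength_m(f_hz) ** 2


def friis_received_power_w(p_s_w, g_s, g_r, r_m, f_hz):
    """Received power in watts for sender power p_s_w and linear gains g_s, g_r."""
    lam = wavelength_m(f_hz)
    return p_s_w * g_s * g_r * (4.0 * math.pi * r_m / lam) ** -2


# Two isotropic antennas (G = 1), 1 W sender, 2.4 GHz:
p_1km = friis_received_power_w(1.0, 1.0, 1.0, 1000.0, 2.4e9)
p_2km = friis_received_power_w(1.0, 1.0, 1.0, 2000.0, 2.4e9)
# Doubling the distance quarters the received power (inverse square law).
ratio = p_2km / p_1km
```

At 1 km and 2.4 GHz the result is roughly 1e-10 W, that is, about 100 dB below the transmitted power.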

Polarization
Linear polarization
Vertical or horizontal
Requires linear antenna elements
Elliptical polarization
Circular polarization is only a special case
Requires bent antenna elements
Transmitter and receiver antennas should be aligned for same polarization to achieve best performance
Otherwise "infinite" attenuation with "opposite" antennas
Or 3 dB attenuation between linear and circular antennas
Polarization changes with diffraction and reflection
Vertical polarization is preferred for long range transmission (ground effects attenuate the signal power in horizontal polarization)
Circular polarization antennas mitigate the effect of reflections
Principle also used for GPS
See helical antennas (for example)
Vertical polarization is the first choice for WLAN applications because most
deployments need to maximize the distance in the horizontal direction. As
can be seen in antenna diagrams, vertically polarized antennas are perfectly suited
for horizontal transmissions.
Vertical polarization is also preferred for long range transmission because
ground effects attenuate the signal power of horizontally polarized signals over long
distances.

Other Antenna Facts
Impedance Matching
Free space impedance is 377 Ohm
Antenna cables have 50 Ohm (typically)
Antenna must transform 50 to 377 Ohm
Without impedance matching
Reflections will result in standing waves
TX power will not be transferred efficiently to the antenna
Voltage Standing Wave Ratio (VSWR)
$s = U_{max} / U_{min} \geq 1$
s = 1 means ideal impedance matching
s > 1 means reflections and high ripples
=> higher rms values
=> higher loss
$Z_0 = \sqrt{\mu_0 / \varepsilon_0} \approx 377$ Ohm (far field).
Voltage maximum on open end, no current.
$U_{max} = |U_{incident}| + |U_{reflection}|$
$U_{min} = |U_{incident}| - |U_{reflection}|$
VSWR should be measured at the antenna feedpoint (where the reflection occurs),
which is typically not possible.
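The quantities in these notes can be checked numerically. The sketch below computes the free-space impedance from the vacuum constants and the VSWR from the incident and reflected voltage amplitudes, as defined above. The reflection coefficient formula for a resistive load is the standard transmission-line relation and is not taken from the slide; all function names are assumptions for this example.

```python
import math

MU_0 = 4.0e-7 * math.pi       # vacuum permeability in H/m (classical value)
EPS_0 = 8.8541878128e-12      # vacuum permittivity in F/m


def free_space_impedance_ohm():
    """Z0 = sqrt(mu0 / eps0), approximately 377 Ohm."""
    return math.sqrt(MU_0 / EPS_0)


def vswr(u_incident, u_reflected):
    """s = Umax / Umin with Umax = |Ui| + |Ur| and Umin = |Ui| - |Ur|."""
    u_i, u_r = abs(u_incident), abs(u_reflected)
    return (u_i + u_r) / (u_i - u_r)


def reflection_coefficient(z_load_ohm, z_line_ohm=50.0):
    """Voltage reflection coefficient of a load on a line (standard relation)."""
    return (z_load_ohm - z_line_ohm) / (z_load_ohm + z_line_ohm)


# A 75 Ohm load on a 50 Ohm cable reflects 20 % of the incident voltage.
gamma = reflection_coefficient(75.0)
s = vswr(1.0, gamma)   # equals 75 / 50 = 1.5
```

A perfectly matched load gives gamma = 0 and therefore s = 1, the ideal case named on the slide.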

Other Antenna Facts
Theorem of Reciprocity
Antenna impedance, gain, as well as antenna diagrams are equivalent for RX and TX
Near field versus far field
Shortening effect
Slower wave propagation in antenna ($c_{wire} < c_0$) plus capacitive effects on antenna ends demands shortening the antenna
Typically 3-8 %
The reciprocity theorem was first stated by Rayleigh and Helmholtz and was
later applied to the problem of antennas by Carson. This theorem basically says
that the antenna parameters remain the same no matter whether the antenna is
used for sending or receiving. More practically, when using two different
antennas, one for sending, the other for receiving, we would measure the same
currents on the receiving antenna even if we switch TX and RX. The reciprocity
theorem can be proved from Maxwell's equations and is only valid in isotropic
media between the antennas (e.g. certain ferrites are not isotropic).
The antenna endpoints contribute most to the shortening effect, while the inner half-wave
"pieces" remain constant. Therefore, the longer the antenna, the less
dramatic the effect.

Wave Propagation
Free space:
Fields $E, H \sim 1/r$
Power density $S = E \times H \sim 1/r^2$
Compared to cables: attenuation $\sim e^{-\alpha r}$
Along earth's surface also surface waves must be considered
Fields $E, H \sim e^{-\alpha r}$
The higher the frequencies the lower the effect of surface waves
"Quasi-optical" propagation
The 'inverse square law' is only valid for powers, not for field strengths.
Note that in general the energy is radiated over multiple wave components; for
example, surface waves may also exist along the earth's surface (usually only with
longer wavelengths).

Antenna Patterns
Field strengths as polar diagram
Scaled to maximum value (0 dB)
Logarithmic or linear (F~1/r)
Elevation and Azimuth
Often used for simple linear polarized antennas
Often corresponds to co- and cross-polarized patterns
E and H patterns
For linear polarized antennas
Distinguish:
E-Field and H-Field
Elevation and Horizontal
Both types are common (!)
High-gain antennas have significant null-angles
Complex antennas, such as many television broadcast antennas, radiate a
significant signal in both the horizontal and vertical polarizations. The azimuth
pattern for these antennas is often supplied for both polarizations, and the
complexity of the antenna can result in significantly different azimuth patterns for
the two polarizations.

WLAN Antenna Examples
Circular polarity (5 dBi)
Microstrip patch (6-18 dBi)
Omni (2-10 dBi)
Parabolic dish (20-30 dBi)
Sector (14 dBi)
Yagi (8-16 dBi)
Cisco
(21 dBi)
Use circular polarity wireless antennas where metal or reflective materials are
present.
Note: Vertical poles for antenna mounting have significant influence if
vertically polarized antennas are used (field distortions). This is especially critical for Yagi
antennas.
Hidetsugu Yagi (1886-1976) and Shintaro Uda (1896-1976), University of
Tohoku in Sendai/Japan.

Antennas & Patterns
Cisco WLAN antennas (only vertical radiation shown)
Omni, 5.2 dBi
Omni, 12 dBi
Omni, 5.2 dBi
Diversity, 2.2 dBi
Patch, 2.0 dBi
Dipole, 2.0 dBi
Consider the following two Cisco antennas for practical indoor installations in
larger halls:
The 6 dBi AIR-ANT2012 is a diversity patch antenna and offers an
illumination angle of 80 degrees horizontally and 55 degrees vertically. With this antenna the
distance can easily be doubled compared to an omni (rubber) antenna.
Using the AIR-ANT3549 with 8.5 dBi, the distance might be increased by a
factor of 2.7 compared to an omni. However, the tradeoff is smaller angles: 60/55
degrees, which is often a good thing for long but narrow halls (1:3).
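The distance factors quoted above follow from the free-space rule that received power falls with the square of the distance, so an extra gain of x dB stretches the range by a factor of 10^(x/20). The sketch below assumes pure free-space propagation and treats the quoted dBi values as gain relative to a 0 dBi reference; the function name is made up for this example.

```python
def range_factor(extra_gain_db):
    """Factor by which the free-space range grows for a given extra gain in dB."""
    return 10.0 ** (extra_gain_db / 20.0)


factor_6_db = range_factor(6.0)     # about 2: "the distance can be doubled"
factor_8_5_db = range_factor(8.5)   # about 2.7
```

Indoors the real improvement is usually smaller, since walls and multipath make the path loss grow faster than in free space.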

Some Cisco Antennas
Horizontal Vertical
Sector, 14 dBi
Yagi, 13.5 dBi
Dish, 21 dBi
Dish, 28 dBi
5.8 GHz
Some radio equipment manufacturers specifically warn against operating a transmitter without an antenna because it
can damage the transmitter. Most pieces of amateur or commercial radio equipment
carry this warning because they operate at a much higher transmitter power. The
reflected wave (a high standing wave ratio, SWR) caused by the lack of a proper antenna
or load can damage the final amplifier stage, known as the power amplifier (PA).
For Cisco Aironet equipment, the transmitter power output is 100 mW for the
350 series and 30 mW for the 340 series, so damage is unlikely but possible. If
you absolutely have a requirement to run the devices without antennas, it is
recommended that you turn the transmitter power down to 1-5 mW or use a 50-52
ohm "dummy load," just to be safe.

Waveguide Antennas
Standing wavelength $\lambda_g$ depends on
Tube diameter D
Open air wavelength $\lambda_o$
First maximum point is $\lambda_g / 4$ from the closed end
Flat maximum area
Total tube length: Open end should match (next) maximum
Ideally $3/4 \, \lambda_g$
$\lambda_o = 300 / f_{[MHz]}$ (in m)
$\lambda_{cut} = 1.706 \cdot D$
$1/\lambda_o^2 = 1/\lambda_{cut}^2 + 1/\lambda_g^2$
Waveguide antennas act as opened waveguides: standing waves and modes, high-pass
behavior. Goal: Find the point of maximum field strength of the standing
wave.
It is important to notice that the standing wavelength Lg is not the same as the
wavelength Lo calculated from the HF signal. Large tubes behave nearly like open air, where Lg
and Lo are almost the same, but when the tube diameter becomes smaller, Lg increases
until there is a point where Lg becomes infinite. This corresponds to the
diameter at which the HF signal does not enter the tube at all. So the waveguide tube
acts as a high-pass filter with the limit wavelength Lc = 1.706 x D. Lo can be
calculated from the nominal frequency: Lo/mm = 300/(f/GHz).
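The waveguide relations can be turned into a small calculator for a tube ("can") antenna. The sketch assumes the standard relation for the dominant mode of a circular waveguide, 1/Lo^2 = 1/Lc^2 + 1/Lg^2, together with Lc = 1.706 x D and Lo/mm = 300/(f/GHz) from the notes. The function names and the 100 mm example tube are assumptions made for this example.

```python
import math


def open_air_wavelength_mm(f_ghz):
    """Lo in mm from the nominal frequency in GHz."""
    return 300.0 / f_ghz


def cutoff_wavelength_mm(d_mm):
    """Lc: longest wavelength that still enters a tube of diameter D."""
    return 1.706 * d_mm


def guide_wavelength_mm(f_ghz, d_mm):
    """Standing wavelength Lg inside the tube; infinite at or below cutoff."""
    lo = open_air_wavelength_mm(f_ghz)
    lc = cutoff_wavelength_mm(d_mm)
    if lo >= lc:
        return math.inf  # the tube is too narrow: high-pass behavior
    return lo / math.sqrt(1.0 - (lo / lc) ** 2)


def probe_distance_mm(f_ghz, d_mm):
    """First field maximum: Lg / 4 from the closed end."""
    return guide_wavelength_mm(f_ghz, d_mm) / 4.0


def ideal_tube_length_mm(f_ghz, d_mm):
    """Open end at the next maximum: 3/4 Lg."""
    return 0.75 * guide_wavelength_mm(f_ghz, d_mm)


lg = guide_wavelength_mm(2.44, 100.0)      # 100 mm tube at 2.44 GHz
too_narrow = guide_wavelength_mm(2.44, 60.0)
```

For the 100 mm tube the guide wavelength comes out noticeably longer than the open-air wavelength of about 123 mm, while the 60 mm tube is below cutoff and yields an infinite Lg.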

FSL
Free Space Loss (FSL)
Real Loss > FSL
Reflects the RF power law $P \sim 1/r^2$
Defined as $10 \log (P_S / P_R)$
Double distance means
Additional 6 dB loss
Because power decreases by factor 4
Only with cables the total loss can be multiplied by two (exponential law)
$FSL = \left( \frac{4 \pi r}{\lambda} \right)^2$
Free Space Loss: No Fresnel zone encroachment assumed!
In German it is called "Freiraumdämpfung".

FSL - Simple Formulas
General:
$FSL_{dB} = 20 \log (f_{GHz}) + 20 \log (r_{km}) + 92.45$
$FSL_{dB} = 22 + 20 \log (r / \lambda)$
$FSL_{dB} = 20 \log (f_{MHz}) + 20 \log (r_{km}) + 32.45$
2.4 GHz:
$FSL_{dB} = 20 \log (r_{km}) + 100$
$r_{km} = 10^{(FSL - 100)/20}$
5.3 GHz:
$FSL_{dB} = 20 \log (r_{km}) + 107$
$r_{km} = 10^{(FSL - 107)/20}$
The formulas highlighted in blue are the most important for quick estimations of
the Free Space Loss.
Note that the inverse formulas are very sensitive regarding their exponent.
Slight differences in the FSL result in huge deviations of the distance.
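The general FSL formula and its inverse are easy to evaluate in code, which also makes the sensitivity of the inverse formula visible. A minimal sketch with assumed function names:

```python
import math


def fsl_db(f_ghz, r_km):
    """Free space loss in dB: 20 log(f/GHz) + 20 log(r/km) + 92.45."""
    return 20.0 * math.log10(f_ghz) + 20.0 * math.log10(r_km) + 92.45


def distance_km(fsl, f_ghz):
    """Inverse formula: the distance at which a given FSL is reached."""
    return 10.0 ** ((fsl - 92.45 - 20.0 * math.log10(f_ghz)) / 20.0)


loss_10km = fsl_db(2.4, 10.0)                        # about 120 dB
extra_for_double = fsl_db(2.4, 2.0) - fsl_db(2.4, 1.0)   # about 6 dB
# Sensitivity of the inverse: only 1 dB more allowed loss ...
stretch = distance_km(121.0, 2.4) / distance_km(120.0, 2.4)
# ... already changes the distance by about 12 %.
```

This is why a link budget that is off by a few dB can misjudge the reachable distance by tens of percent.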

General Attenuation Considerations
For isotropic antennas in free space, the attenuation of 5 GHz is higher
Friis: 20 log (5.25/2.4) = 6.8 dB
However only little material differences 'in general'
Typically 5 GHz is only 1-2 dB worse
Exceptions:
Grid spacing of reinforced concrete could match wavelengths
Red brick introduces approx. 10 dB additional attenuation for 5 GHz and wood lumber an additional 3-6 dB
Note: Reflections are a completely different story (and more complicated)
(Wood lumber German: Bauholz)
The reflection characteristics heavily depend on the wavelength used and the
particular thickness of the considered layer.

EIRP (for Spread Spectrum)
Equivalent Isotropically Radiated Power
Theoretical power for an isotropic antenna to reach same PSD as directional antenna
$EIRP = 10^{g_{dBi}/10} \cdot P$ [W]
National band-specific EIRP limits
Europe (ETSI) max EIRP
100 mW or 20 dBm for DSSS
= 17 dBm (50 mW) + 3 dBi
30 mW or 15 dBm for OFDM (typically)
Europe: 100 mW, except France, where only 7 dBm (5 mW) is allowed.
In the U.S., the FCC (Federal Communications Commission) defines power
limitations for wireless LANs in FCC Part 15.247. Manufacturers of 802.11
products must comply with Part 15 to qualify for selling their products within the
U.S. Regulatory bodies in other countries have similar rules.
The FCC eases EIRP limitations for fixed, point-to-point systems that use higher
gain directive antennas. If the antenna gain is at least 6 dBi, the FCC allows
operation up to 4 watts EIRP. This is 1 watt (the earlier limitation) plus 6 dB of
gain.
For antennas having gain greater than 6 dBi, the FCC requires you to reduce the
transmitter output power if the transmitter is already at the maximum of 1 watt.
The reduction, however, is only 1 dB for every 3 dB of additional antenna gain
beyond the 6 dBi mentioned above. This means that as antenna gain goes up, you
decrease the transmitter power by a smaller amount. As a result, the FCC allows
EIRP greater than 4 watts for antennas having gains higher than 6 dBi.
Note: Effective Radiated Power (ERP) restrictions exist for unlicensed service
only. Amateur radios may have MUCH more power...
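The point-to-point rule described above (1 W maximum, reduced by 1 dB for every 3 dB of antenna gain beyond 6 dBi) can be written as a short calculation. This is a sketch of the rule as stated in these notes, not a statement of the current regulation; the function names are made up for this example.

```python
def fcc_p2p_max_tx_dbm(antenna_gain_dbi):
    """Maximum transmitter power for a point-to-point link, per the rule in the notes."""
    excess_gain_db = max(0.0, antenna_gain_dbi - 6.0)
    return 30.0 - excess_gain_db / 3.0   # 30 dBm = 1 W


def eirp_dbm(tx_power_dbm, antenna_gain_dbi):
    """EIRP is transmitter power plus antenna gain (cable losses ignored)."""
    return tx_power_dbm + antenna_gain_dbi


# 6 dBi antenna: full 1 W allowed, EIRP = 36 dBm (4 W).
eirp_6 = eirp_dbm(fcc_p2p_max_tx_dbm(6.0), 6.0)
# 24 dBi dish: 18 dB excess gain costs 6 dB of TX power.
tx_24 = fcc_p2p_max_tx_dbm(24.0)
eirp_24 = eirp_dbm(tx_24, 24.0)
```

With the 24 dBi dish the allowed EIRP rises to 48 dBm, which illustrates why the notes say that EIRP greater than 4 watts is possible with higher gain antennas.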

EIRP In Other Countries
America (FCC)
Point-to-multipoint (typical AP usage)
30 dBm (1 W) and 1:1 power/gain reduction/increase
Point-to-point (typical bridging usage)
36 dBm (4 W) = 30 dBm + 6 dBi
G > 6 dBi requires minus 1 dB for each 3 dBi more gain
Japan, China: EIRP 10 mW

Diversity Antennas
Due to reflections, a short-time standing field is produced - with ripples, peaks and lows
Same picture for every frame if "nobody moves"
Therefore, use multiple antennas: one will likely pick up more energy than the other
[Figure: indoor office signal intensity map (source unknown)]
For small distances (rooms) the speed of light is approximately infinite.
On the other hand, the data rate is limited and every frame produces a nearly
instantaneous EM field (for a short period of time).
Due to reflections, a short-time standing field is produced with ripples, peaks
and lows. The picture is the same for every frame if "nobody moves".
Therefore, use multiple antennas: one will likely pick up more energy than the
other.

The EM Field
Reflections, diffractions and scattering are highly dynamic
Consider static and dynamic configurations
Multipath problems
"High signal strengths but low quality"
[Figure: indoor office signal intensity map (source unknown)]
Source: www.intersil.com
This picture shows the usefulness of diversity antennas.
Similar pictures can easily be made with 4NEC2X, available for Windows (or
NEC2, the free Linux version).

Why are bigger antennas better?
Assume we comply with 20 dBm EIRP
Then this can be reached in various ways:
P_TX (each side)   Gain (each side)   Link budget (yellow box)
17 dBm             3 dBi              FSL + 17 dBm + 6 dBi
10 dBm             10 dBi             FSL + 10 dBm + 20 dBi
0 dBm              20 dBi             FSL + 0 dBm + 40 dBi
Additionally, SNR is improved with higher gains
Therefore, try to maximize antenna gains !!!
It is important to understand the true importance of a high gain antenna. Since
the EIRP is limited by regulation, for transmitting it makes no difference whether you use a perfect
omni antenna with 100 mW or a 20 dBi dish with 1 mW.
But when signals are to be received, the antenna gain (of the receiver)
significantly increases the sensitivity and therefore lengthens the maximum
distance.
Note that in the yellow boxes above the FSL is assumed to have the same
(unknown) value each time. What changes is the TX power and the antenna gain.

Practical 2.4 GHz Distance Limits
ETSI limits 2.4 GHz EIRP to 20 dBm (also for P2P links)
A minimum RX power of -80 dBm can be assumed as practical limit
Then a maximum FSL of -120 dB is allowed
This results in a maximum distance of 10 km
[Figure: both stations with P = 0 dBm, G = 20 dBi; FSL = -120 dB => 10 km]
The typical practical distance limit of wireless bridges operating in an ETSI
domain at 2.4 GHz is approximately 10 km.
Assuming an RX power of -80 dBm, a data rate of 11 Mbit/s can easily be
achieved (the minimum signal level for 11 Mbit/s is -85 dBm or less).
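The 10 km figure can be reproduced with a small link budget calculation: the allowed free space loss is TX power plus both antenna gains minus the minimum RX power, and the general FSL formula is then solved for the distance. Cable and connector losses and any fade margin are ignored here; the function names are assumptions for this example.

```python
import math


def max_fsl_db(p_tx_dbm, g_tx_dbi, g_rx_dbi, rx_min_dbm):
    """Largest path loss the link tolerates (no cable loss, no fade margin)."""
    return p_tx_dbm + g_tx_dbi + g_rx_dbi - rx_min_dbm


def max_distance_km(fsl_db, f_ghz):
    """Distance at which the free space loss reaches fsl_db."""
    return 10.0 ** ((fsl_db - 92.45 - 20.0 * math.log10(f_ghz)) / 20.0)


# ETSI 2.4 GHz example: 0 dBm into a 20 dBi dish on both sides, -80 dBm RX limit.
budget = max_fsl_db(0.0, 20.0, 20.0, -80.0)
reach = max_distance_km(budget, 2.4)
```

Any fade margin is simply subtracted from the budget; because of the 20 log law, a 6 dB margin already halves the reach.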

Practical 5 GHz Distance Limits
Completely different situation
HIPERLAN band (5470-5725 MHz) released for WiFi
ETSI allows EIRP = 1 W = 30 dBm !!!
Also a minimum RX power of -80 dBm can be assumed as practical limit
Then a maximum FSL of -140 dB is allowed
This results in a maximum distance of 45 km
[Figure: both stations with P = 0 dBm, G = 30 dBi; FSL = -140 dB => 45 km]
Note: the 5 GHz band is nearly 750 MHz wide; this results in significantly
different wavelengths:
299,792,458 m/s / 5150 MHz = 0.0582 m
299,792,458 m/s / 5470 MHz = 0.0548 m
299,792,458 m/s / 5725 MHz = 0.0524 m
That is, the wavelengths differ by about 10 %.
Therefore an FSL of 140 dB can be reached either using 5150 MHz and 46.31 km
or 5725 MHz and 41.7 km. Currently only the upper bands can be used for
outdoor applications and 1 W EIRP, so we reasonably only consider 5470 MHz,
for which 140 dB FSL corresponds to 43.61 km.

Exploit Diversity (5.4 GHz)
Example:
TX antenna is 30 dBi parabola (1 W = 30 dBm EIRP = 0 dBm + 30 dBi)
RX antenna is 40 dBi parabola
Allows 150 dB FSL => 140 km !!!
Optionally an additional preamp can be used
E.g. + 10 dB => 160 dB FSL => 444 km theoretically
Problem: CSMA/CA timing must consider signal propagation time
140 km => 466 usec delay (but SIFS = 16 usec)
[Figure: each station transmits with 0 dBm on a 30 dBi TX antenna and receives on a 40 dBi RX antenna; FSL 150 dB possible, 140 km]
Regulation only limits the EIRP but not the sensitivity of a receiver. Therefore
the total distance can easily be increased with better RX-only antennas.
This can only be achieved by reusing the diversity antenna ports and disabling
diversity. Simply configure one port for TX and the other port (with the higher
gain antenna) for RX.
Although this sounds interesting, it involves a non-trivial antenna-pointing
challenge. Additionally, the bridges must support CSMA/CA timing adaptations,
otherwise frames cannot be acknowledged properly.
The frequency of 2.44 GHz corresponds to a wavelength of 0.122 m.
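The timing problem on such long links is plain propagation delay: distance divided by the speed of light. A minimal sketch with assumed names, using the SIFS value of 16 microseconds quoted on the slide:

```python
C = 299_792_458.0   # speed of light in m/s
SIFS_US = 16.0      # short interframe space quoted on the slide


def propagation_delay_us(distance_km):
    """One-way signal propagation time in microseconds."""
    return distance_km * 1000.0 / C * 1e6


def exceeds_sifs(distance_km):
    """True if the one-way delay alone is already longer than the SIFS."""
    return propagation_delay_us(distance_km) > SIFS_US


delay_140 = propagation_delay_us(140.0)   # roughly 467 microseconds
```

The one-way delay reaches the SIFS at only about 4.8 km, so an acknowledgment on a 140 km link arrives far outside the default timing window unless the bridges adapt their timers.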

SNR
Sensitivity is not the only important parameter for the receiver quality
Low noise level: Sensitivity is limiting
High noise level: SNR is limiting
Shannon 1948: Channel Capacity
Depends on bandwidth and SNR
Example: Required SNR for the Orinoco PCMCIA Silver/Gold
11 Mbps: SNR_min = 16 dB
5.5 Mbps: SNR_min = 11 dB
2 Mbps: SNR_min = 7 dB
1 Mbps: SNR_min = 4 dB
Although TX power is regulated (EIRP), the RX SNR has the same effect!
See e.g. RX 2400-o from SSB "Receive Booster" (8-10 dB plus)
The most important parameter to keep an eye on is the Signal-to-Noise Ratio
(SNR).
The effect is simple: the higher the SNR that client and AP observe, the higher the
possible data rate.
Therefore, the longer the distance between client and AP, the lower the data rate.

Typical Receiver Sensitivities
Orinoco cards PCMCIA Silver/Gold
11 Mbps: -82 dBm
5.5 Mbps: -87 dBm
2 Mbps: -91 dBm
1 Mbps: -94 dBm
CISCO cards Aironet 350
11 Mbps: -85 dBm
5.5 Mbps: -89 dBm
2 Mbps: -91 dBm
1 Mbps: -94 dBm
Edimax USB client
11 Mbps: -81 dBm
Belkin router/AP
11 Mbps: -78 dBm
Typical noise floor: -95 dBm, only +/- 2 dB differences between a, b, g
The Cisco 1240AG Access Point has the following sensitivity levels:
1 Mbit/s (2.4 GHz): -96 dBm
11 Mbit/s (2.4 GHz): -88 dBm
54 Mbit/s (both 2.4 and 5 GHz): -73 dBm

Cable Loss
Typical loss in common coaxial cables at 2.45 GHz
RG 58 (quite common, used for Ethernet): 1 dB per meter
RG 213 ("big black", quite common): 0.6 dB per meter
RG 174 (thin, seems to be the one used for pigtail adapter cables): 2 dB per meter
Aircom: 0.21 dB/m
Aircell: 0.38 dB/m
LMR-400: 0.22 dB/m
IEEE 802.3 (thick 'yellow' Ethernet coax): 0.3 dB/m
This is a very boring slide. Don't spend too much time on it.
OF COURSE it is IMPORTANT to know about cable attenuation.

Connector Loss
Add connector loss to cable loss before calculating the link budget
Typically between 0.1 and 0.5 dB at 2.45 GHz
Use as few connectors as possible
Loss depends on the quality of the connectors
Dielectric material, geometry, etc.
Best: N connectors or SMA connectors
Worse: Old BNC connectors
Avoid pigtails (= short cables with different connectors on each side)
30 cm may have ~ 1.5 dB!
Use single-unit converters instead
...also don't forget to consider connector losses.

WLAN Connectors
[Figure: N female, N male, RP-SMA female, RP-SMA male, RP-TNC female, RP-TNC male, MC, MMCX, MC]
Cisco uses reverse polarity for spread spectrum products to prevent connecting wrong antennas.
Cisco prefers the Reverse Polarity Threaded Neill-Concelman (RP-TNC) connector to
prevent a non-certified antenna from being connected inadvertently.

Link Example
Given 24 dBi dish
Output power must be reduced to -4 dBm
That is 0.4 mW (!) to stay within the legal limit of 20 dBm in Europe
Theoretical maximum range for a reliable link will be 8 km
Assuming 15 dB fade margin
Due to highly increased antenna gain in the receiver path (SNR)
Here is a final link budget example.

Quasi-optical Propagation
Requires "line-of-sight"
Reliable connections due to steady field strengths (no variabilities)
Small TX powers possible
Free-space wave propagation
Fading through interferences
Multiple waves with different phases
Fading controllers at the receivers (GSM, UMTS)
Diversity antennas (WLAN, GSM and UMTS)



The Fresnel Zones (1)
Surfaces where reflected rays would reach the receiver with a path extended by $\lambda/2$
=> Destructive interference
TX and RX located at focal points
Any path connecting F1, F2, and surface has same length
Rule of thumb:
If 60% of first Fresnel zone is clear of obstructions then nearly same link as a clear path
However might be unstable under bad weather conditions
Try to achieve full Fresnel zone clearance
[Figure: 1st and 2nd Fresnel zones around the line of sight between TX and RX, with distances $d_1$, $d_2$ and zone radius r]
Fresnel zone radius [m]: $r = \sqrt{\frac{n \lambda d_1 d_2}{d_1 + d_2}}$
The range of a wireless link depends on the maximum allowable path loss. For outdoor
links this is a straightforward calculation as long as there is a clear line of sight between the two
antennas with sufficient clearance for the Fresnel zone. For line of sight, you should be able to
see the remote location's antenna from the main site. There should be no obstructions
between the antennas themselves. This includes trees, buildings, hills, and so on.
Fresnel Zone (pronounced 'fre-nel', the "s" is silent)
The Fresnel zone is an elliptical area immediately surrounding the visual path. It varies depending on
the length of the signal path and the frequency of the signal. The Fresnel zone can be calculated,
and it must be taken into account when designing a wireless link.
It is the area around the visual line of sight that radio waves spread out into after they leave the
antenna. This area must be clear or else signal strength will weaken.
The Fresnel zone is an area of concern for 2.4 GHz wireless systems. The table of Fresnel zone radius
and earth curvature versus distance provides a guideline on height requirements for antennas based on
both line of sight and Fresnel zone requirements (for 2.4 GHz). Outdoors, every increase of 6 dB will
double the distance. Every decrease of 6 dB will halve the distance. Shorter cable runs and higher
gain antennas can make a significant difference to the range.
Point-to-Point
When connecting two points together (such as with an Ethernet bridge), the distance, obstructions, and
antenna location must be considered. If the antennas can be mounted indoors and the distance is
very short (several hundred meters), the standard dipole or mast mount 5.2 dBi omni-directional
antenna may be used. An alternative is to use two patch antennas. For very long distances (1/2 km or
more) directional high gain antennas must be used. These antennas should be installed as high as
possible, and above obstructions such as trees, buildings, and so on. With a line-of-sight
configuration, distances of up to 20 km at 2.4 GHz can be reached using parabolic dish antennas, if
a clear line of sight is maintained.
Point-to-Multipoint Bridge
In this case (in which a single point communicates with several remote points) the use of an
omni-directional antenna at the main communication point must be considered. The remote sites
can use a directional antenna that is directed at the main point antenna.
As a rule of thumb, the earth's curvature becomes significant at distances greater than 10 km.
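The radius of the n-th Fresnel zone at a point between the antennas is r = sqrt(n * lambda * d1 * d2 / (d1 + d2)). The sketch below evaluates it and applies the 60 % rule of thumb; the function names and the 10 km example link are assumptions for this illustration.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def fresnel_radius_m(n, d1_m, d2_m, f_hz):
    """Radius of the n-th Fresnel zone at distances d1 and d2 from the two antennas."""
    wavelength = C / f_hz
    return math.sqrt(n * wavelength * d1_m * d2_m / (d1_m + d2_m))


def required_clearance_m(d1_m, d2_m, f_hz, fraction=0.6):
    """Rule of thumb: keep 60 % of the first Fresnel zone free of obstructions."""
    return fraction * fresnel_radius_m(1, d1_m, d2_m, f_hz)


# 10 km link at 2.4 GHz: the zone is widest at the midpoint.
r_mid = fresnel_radius_m(1, 5000.0, 5000.0, 2.4e9)
clearance_mid = required_clearance_m(5000.0, 5000.0, 2.4e9)
```

At the midpoint of the 10 km link the first zone has a radius of roughly 18 m, so about 11 m of clearance around the line of sight is needed there.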

The Fresnel Zones (2)
Consideration especially important when Earth's bulge touches Fresnel zones
Distances > 9 km => high poles are required for antenna mount
Distance (km)   Fresnel zone (radius)   Earth curvature   Total
1.6             3                       1                 4
8               9                       1.5               10.5
16              13                      4                 17
24              16                      8.5               24.5
32              20                      15                35
40              22                      23                45
Optical horizon: $R_{[km]} = 3.57 \, (\sqrt{h_S} + \sqrt{h_R})$
Radio horizon: $R_{[km]} = 4.12 \, (\sqrt{h_S} + \sqrt{h_R})$
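The two horizon formulas can be compared directly in code. The sketch assumes that the antenna heights h_S and h_R are given in meters, which is the usual convention for the constants 3.57 and 4.12 but is not stated on the slide; the function names are made up for this example.

```python
import math


def optical_horizon_km(h_s_m, h_r_m):
    """Geometric line-of-sight range for sender and receiver heights in meters."""
    return 3.57 * (math.sqrt(h_s_m) + math.sqrt(h_r_m))


def radio_horizon_km(h_s_m, h_r_m):
    """Radio range, slightly larger because the atmosphere bends the waves."""
    return 4.12 * (math.sqrt(h_s_m) + math.sqrt(h_r_m))


# Two antennas on 10 m poles:
optical = optical_horizon_km(10.0, 10.0)
radio = radio_horizon_km(10.0, 10.0)
```

With 10 m poles on both sides the radio horizon is about 26 km; this is only the horizon limit, and the Fresnel zone clearance from the table still has to be added on top.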

Diffraction
Radio waves will be diffracted at the edges of objects.
It is possible to reach a receiver behind objects
[Figure: obstacle of height h between two stations at distances d1 and d2]
$Loss = 20 \log \left[ \frac{0.225}{h} \left( \frac{0.12 \, d_1 d_2}{2 (d_1 + d_2)} \right)^{1/2} \right]$
Also called "Inflexion"
Radio waves are diffracted at the edges of objects. It is possible to reach a
receiver behind objects.
However, this is bad design; don't expect high-quality signals.

Natural Attenuation
Fog and rain:
Approx 0.5 dB/km @ 2.4 GHz: still little effect
Dense snow storm is more critical
Signal scattering effect
Problem becomes really serious for higher frequencies
Molecule absorption effects
Therefore we are lucky with WLANs. (No fog, no rain)
Although 2.4 GHz signals pass rather well through walls, they have a tough time
passing through trees. The main difference is the water content in each. Walls are
rather dry; trees contain high levels of moisture. Radio waves in the 2.4 GHz
band are absorbed by water quite well.

Delay Spread
Consequence of multipath propagation
Receiver needs equalizer
Manufacturers specify delay spread limit
Example: Orinoco, Frame Error Rate (FER) < 1%
11 Mbps: 65 ns
5.5 Mbps: 225 ns
2 Mbps: 400 ns
1 Mbps: 500 ns
Note: Delay spread in wide areas with lots of multipaths can reach several µs!
Rule of thumb: Path length difference of 15 meters leads to 50 ns spreading
Solutions:
Directive antennas
Circular polarization
OFDM
[Figure: narrow pulses from sender, spread pulses at receiver (Inter-Symbol Interference)]
In order to minimize the reIlection rate it is better using directive antennas, even
iI you are at short distance, and being in line oI sight. Another possibility is also
to use circular wave polarisation antennas (helical antenna) that cancel quite
well the Iirst reIlexions. (that is because the reIlected signal has the opposite
circulation direction (leIt becomes right), so the receiver is insensitive to this
reIlected signal) The helical would be ideal.
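The rule of thumb (15 meters of path difference lead to 50 ns of spreading) follows directly from the speed of light. The following minimal sketch reproduces it and inverts it for the Orinoco limits listed on the slide; the function names are illustrative, and propagation at the vacuum speed of light is assumed.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458

# Delay spread limits from the slide (data rate in Mbit/s -> limit in ns)
ORINOCO_LIMITS_NS = {11: 65, 5.5: 225, 2: 400, 1: 500}

def excess_delay_ns(path_difference_m: float) -> float:
    """Extra delay of a reflected path that is path_difference_m longer."""
    return path_difference_m / SPEED_OF_LIGHT_M_PER_S * 1e9

def max_path_difference_m(delay_limit_ns: float) -> float:
    """Largest path length difference that stays within a delay spread limit."""
    return delay_limit_ns * 1e-9 * SPEED_OF_LIGHT_M_PER_S

print(f"15 m -> {excess_delay_ns(15):.0f} ns")
for rate, limit in ORINOCO_LIMITS_NS.items():
    print(f"{rate} Mbit/s tolerates about {max_path_difference_m(limit):.0f} m")
```

The output makes the trade-off visible: the 11 Mbit/s mode tolerates reflections that are only about 20 m longer than the direct path, whereas 1 Mbit/s tolerates roughly 150 m.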

Outdoor Antenna Safety
Antenna cables connect indoor and outdoor EM environment
  Prone to (in)direct lightning
  Can pick up electrical fields (=> currents) through dry air or EMI
There is no 100% solution to protect your equipment!
  But good chances to protect the indoor area (health, fire)
Use lightning arrestors (antenna cable) or grounding blocks (pwr/console coax) against surges
  DC-continuity type needed for WLAN with coax power supply (gas tube or spark gap)
Proper low-impedance grounding critical (not that easy!)
  Keep tower and coax at same potential (to prevent "side flashes")
[Figures: 0-3 GHz lightning protector, HyperGain model HGLN-F; dual F grounding block (F connectors are used in the Aironet 1400 series for the bridge supply cables); RP-TNC connectors (Aironet 350 series, antenna cables)]
WLAN equipment can be damaged by various electrical disturbances such as power line switching transients and voltage surges, as well as static build-up on outside wires and antennas.
Arrestors for coaxial cable also come in several types, each of which functions somewhat differently. DC blocking-type arrestors have a fixed frequency range and must be selected for a specific application. Their main advantage is that they present a high-impedance path to the frequencies found in lightning (less than 1 MHz) while offering a low impedance to signals created by your radio.
Arrestors that have DC continuity (gas tube and spark gap types) are broadband and can be used over a wider frequency range than the DC-blocking types. Also, in installations where the coax is also used to supply voltages to a remote device (such as a mast-mounted preamp or remote coax switch), the DC continuity-type arrestor must be used.
The Cisco Aironet Lightning Arrestor prevents energy surges from reaching the RF equipment by the shunting effect of the device. Surges are limited to less than 50 volts in about 0.0000001 seconds (100 nanoseconds). A typical lightning surge lasts about 0.000002 seconds (2 microseconds). The accepted IEEE transient (surge) suppression is 8 µs. The Lightning Arrestor is a 50-ohm transmission line with a gas discharge tube positioned between the center conductor and ground. This gas discharge tube changes from an open circuit to a short circuit almost instantaneously in the presence of voltage and energy surges, providing a path to ground for the energy surge.
Note: Lightning can occur even without a thunderstorm, whenever and wherever there is a sufficient charge build-up.
Note: Some towers, especially AM radio towers, are not grounded because the tower is actually isolated from ground, being used as the antenna. This is known as a hot tower, and you must isolate the bridge and all grounds from this type of tower.
However, the ARRL Antenna Book states, "The best protection from lightning is to disconnect all antennas from equipment and disconnect all equipment from power lines."
When lightning strikes, it will always try to find the shortest electrical path to ground. Proper grounding is critical to lightning protection. Lightning contains energy in a wide range of frequencies; therefore provide a low-impedance path to ground for the energy.

World Record (early 2005)
200 km without amplifiers
But an EIRP beyond legal limits
See
http://www.wifiworldrecord.com/
http://www.wifi-shootout.com/
[Figure: 200 km link between Nevada and Utah; 4 m dish, 300 mW, and 3 m dish, 300 mW; 3 m dish ~ 35 dBi]
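Why such a link exceeds legal EIRP limits can be seen from a quick calculation. The sketch below uses the standard definition EIRP [dBm] = transmit power [dBm] + antenna gain [dBi] - cable loss [dB]; the zero cable loss is an assumption for illustration, and the function names are hypothetical.

```python
import math

def mw_to_dbm(power_mw: float) -> float:
    """Convert a power in milliwatts to dBm."""
    return 10 * math.log10(power_mw)

def eirp_dbm(tx_power_mw: float, antenna_gain_dbi: float, cable_loss_db: float = 0.0) -> float:
    """Equivalent isotropically radiated power in dBm (cable loss assumed 0 dB by default)."""
    return mw_to_dbm(tx_power_mw) - cable_loss_db + antenna_gain_dbi

# Values from the slide: 300 mW into a 3 m dish with about 35 dBi
eirp = eirp_dbm(300, 35)
print(f"EIRP ~ {eirp:.1f} dBm ~ {10 ** (eirp / 10) / 1000:.0f} W")
```

Under these assumptions the 3 m dish radiates an EIRP of close to 60 dBm, i.e. on the order of a kilowatt, although the transmitter itself delivers only 300 mW.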

Tomorrow's Antenna Design
Microwave antenna design using genetic algorithms
  http://ic.arc.nasa.gov/projects/esg/research/antenna.htm

WLAN Protocol
In this chapter we discuss basic communication issues, such as synchronization,
coding, scrambling, modulation, and so on.

Protocol Layers
MAC layer
  Medium access control
  Fragmentation
PHY layer = PLCP + PMD
  Established signal for controlling
  Clear Channel Assessment (CCA)
  Service access point
Physical Layer Convergence Protocol (PLCP)
  Synchronization and SFD
  Header
Physical Medium Dependent (PMD)
  Modulation and coding
[Figure: IEEE 802 layer model. 802.1 Management, Bridging (802.1D), QoS, VLAN, ... alongside all layers; 802.2 Logical Link Control (LLC) on top; below it the Media Access Control (MAC) variants 802.3 CSMA/CD, 802.4 Token Bus, 802.5 Token Ring, 802.6 DQDB, 802.12 Demand Priority, and 802.11 Wireless, each with its own PHY. The 802.11 PHY is split into PLCP (Physical Layer Convergence Protocol) and PMD (Physical Medium Dependent).]
The 802.11 standard only describes the physical and the MAC layer. The physical layer is split into the PLCP and the PMD sublayers. The Medium Access Control takes over the layer 2 functions.
Every 802.11 layer takes over different tasks. The MAC layer is necessary for medium access and fragmentation. The PLCP part of the physical layer is necessary for controlling the CCA signal. The PMD part covers data modulation and coding.

Clear Channel Assessment
CCA is an algorithm to determine if the channel is clear
But what is "clear"?
  Either measuring only WLAN carrier signal strengths
  Or measuring the total power of both noise and carriers
Minimum RX signal power levels should be configured at receivers (APs & clients)
  CSMA would not allow sending any frames if the environmental noise level is too high
Part of PHY, used for MAC
The Clear Channel Assessment (CCA) algorithm is a fundamental method used in all wireless technologies to determine whether a channel is currently occupied or not.
Basically, a minimum power level threshold must be specified. If the currently measured RX power level for a given channel is below that threshold, the channel is considered non-occupied and a data frame can be sent.
Therefore the CCA threshold is the minimum allowable power level for legal WLAN clients or, equivalently, the maximum allowable noise power level.
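The threshold decision described above can be sketched in a few lines. This is an illustration only: the threshold value of -82 dBm and all names are hypothetical and not taken from the slides, and a real PHY additionally distinguishes carrier detection from pure energy detection.

```python
CCA_THRESHOLD_DBM = -82.0  # hypothetical threshold, chosen only for illustration

def channel_is_clear(rx_power_dbm: float, threshold_dbm: float = CCA_THRESHOLD_DBM) -> bool:
    """The channel counts as non-occupied if the measured power is below the threshold."""
    return rx_power_dbm < threshold_dbm

def may_transmit(samples_dbm, threshold_dbm: float = CCA_THRESHOLD_DBM) -> bool:
    """Allow a transmission only if every recent power sample was below the threshold."""
    return all(channel_is_clear(p, threshold_dbm) for p in samples_dbm)

print(may_transmit([-95.0, -93.5, -96.2]))  # quiet channel
print(may_transmit([-95.0, -60.0, -96.2]))  # one strong sample: channel busy
```

The sketch also shows the side effect mentioned on the slide: if the environmental noise stays above the threshold, `may_transmit` never returns True and CSMA would not send any frames.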

FHSS Frame Format
PLCP header always runs at 1 Mbit/s
User data up to 2 Mbit/s
Synchronization with 80-bit string "01010101..."
All MAC data is scrambled by an $s(z) = z^7 + z^4 + 1$ polynomial to block any DC component
Start Frame Delimiter (SFD)
  Start of the PLCP header
  0000110010111101 bit string
PLCP Length Word (PLW)
  Length of user data including the 32-bit CRC of the user data (value between 0 and 4095)
  Protects user data
PLCP Signaling Field (PSF)
  Describes the data rate of the user data
Header Error Check (HEC)
  16-bit CRC
  Protects header

Field: Synchronization | SFD | PLW | PSF | HEC | MAC + Data
Bits:  80              | 16  | 12  | 4   | 16  | variable
(Synchronization and SFD form the PLCP preamble; PLW, PSF, and HEC form the PLCP header.)
The FHSS frame format is only presented for historical interest... if there is any.
Note that some vendors still produce FHSS-based 802.11 devices for special purposes (high-interference environments).
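How a polynomial such as s(z) = z^7 + z^4 + 1 breaks up long runs of identical bits can be shown with a small self-synchronizing scrambler. This is a sketch of the principle only: the initial register state, the bit ordering, and the function names are assumptions and not taken from the standard or from the slides.

```python
def _feedback(state: int) -> int:
    """XOR of the register taps at delay 4 and delay 7 (z^4 and z^7)."""
    return ((state >> 3) ^ (state >> 6)) & 1

def scramble(bits, state: int = 0b1111111):
    """Scramble a bit sequence; the scrambled bits feed the 7-bit shift register."""
    out = []
    for b in bits:
        s = b ^ _feedback(state)
        out.append(s)
        state = ((state << 1) | s) & 0x7F  # shift the scrambled bit in
    return out

def descramble(bits, state: int = 0b1111111):
    """Inverse operation; the received (scrambled) bits feed the register."""
    out = []
    for s in bits:
        out.append(s ^ _feedback(state))
        state = ((state << 1) | s) & 0x7F
    return out

data = [0] * 16                      # a long run of zeros (DC component)
scrambled = scramble(data)
print(scrambled)                     # no longer all zeros
print(descramble(scrambled) == data)
```

Because the descrambler register is filled with received bits, it synchronizes itself after seven bits even if its initial state differs from the sender's.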

DSSS Frame Format
PLCP header always runs at 1 Mbit/s (802.11 standard)
User data up to 11 Mbit/s (802.11b standard)
Synchronization (128 bit)
  Also used for controlling the signal amplification
  And compensation for frequency drifting
Start Frame Delimiter (SFD)
  1111001110100000
Signal (Rate)
  0x0A 1 Mbit/s (DBPSK)
  0x14 2 Mbit/s (DQPSK)
  Other values reserved for future use
  11 Mbit/s today with CCK
Service
  0x00 802.11 frame
  Other values reserved for future use
Length
  16 bit instead of 12 bit in FHSS
Header Error Check (HEC)
  16-bit CRC (ITU-T CRC-16 standard polynomial)

Field: Synchronization | SFD | Signal | Service | Length | HEC | MAC + Data
Bits:  128             | 16  | 8      | 8       | 16     | 16  | variable
(Synchronization and SFD form the PLCP preamble; Signal, Service, Length, and HEC form the PLCP header.)
802.11g and 802.11a use a similar frame format
The DSSS frame format shown here is used by 802.11b, but the frame format is also valid for 802.11g and 802.11a, except that the values of the fields are different.
The most important thing to understand here is that only the PLCP headers are sent with the lowest supported data rate. The following MAC header and the payload can be sent with a higher data rate.
The symbol rate is constant from the very beginning of the frame to the very end. What changes is only a 'jump' in the QAM family (i.e. in the code complexity), which causes a change in the information rate.
Even a distant receiver should at least be able to decode the PLCP (because the PLCP has a low data rate) in order to determine the QAM code required to decode the remainder of the frame (the data part).
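The cost of sending the PLCP part at 1 Mbit/s can be estimated from the field lengths on the slide. The sketch below is a simplification under assumptions: it uses the long preamble shown here, treats the Signal value as the rate in units of 100 kbit/s (which matches 0x0A and 0x14 on the slide), and ignores inter-frame spaces and ACKs. All names are illustrative.

```python
# Field lengths in bits, taken from the slide
PLCP_FIELDS_BITS = {
    "sync": 128, "sfd": 16,                               # PLCP preamble
    "signal": 8, "service": 8, "length": 16, "hec": 16,   # PLCP header
}
PLCP_RATE_MBPS = 1.0  # preamble and header always run at 1 Mbit/s

def signal_to_rate_mbps(signal: int) -> float:
    """Signal field value interpreted as multiples of 100 kbit/s."""
    return signal / 10

def frame_airtime_us(mac_bytes: int, rate_mbps: float) -> float:
    """Air time in microseconds: PLCP part at 1 Mbit/s plus MAC part at rate_mbps."""
    plcp_us = sum(PLCP_FIELDS_BITS.values()) / PLCP_RATE_MBPS
    return plcp_us + mac_bytes * 8 / rate_mbps

for rate in (1.0, 2.0, 11.0):
    print(f"1500 bytes at {rate} Mbit/s: {frame_airtime_us(1500, rate):.0f} us")
```

The PLCP part always takes 192 µs, so at 11 Mbit/s it accounts for a noticeable share of the air time of a 1500-byte frame, whereas at 1 Mbit/s it is almost negligible.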

MAC Principles
Responsible for several tasks
  Medium access
  Roaming
  Authentication
  Data services
  Energy saving
Asynchronous data service
  Ad-hoc and infrastructure networks
Realtime service
  Only infrastructure networks
The MAC layer is responsible for many tasks. The most important one is controlling the medium access. Roaming, authentication, and energy-saving mechanisms are also included here. The basic services are the asynchronous data service, for ad-hoc and infrastructure networks, and the time-bounded service, for infrastructure networks only. With the asynchronous data service, broadcast and multicast frames are possible.
General rule: Collisions cannot be detected, so each packet is acknowledged (except MAC-level retransmissions).

MAC Header - Overview
Frame Control (FC) includes
  Protocol version, frame type
  Encryption information
  2 Distribution System (DS) bits
Duration ID (D-ID) for virtual reservations
  Includes the RTS/CTS values
Addresses are interpreted according to the DS bits
Sequence Control (SC) to avoid duplicates

Field: FC | D-ID | Address 1 | Address 2 | Address 3 | Address 4 | SC | Data   | CRC
Bytes: 2  | 2    | 6         | 6         | 6         | 6         | 2  | 0-2312 | 4
(All fields before Data form the MAC header.)
The picture above shows the standard MAC header with its fields.
Frame Control (FC). These 2 bytes contain information about the protocol version, the frame type, encryption information, and the important DS bits.
Duration ID (D-ID). This field includes the RTS/CTS values, i.e. the NAV values.
Address. These 4 address fields contain IEEE 802.11 MAC addresses. The interpretation of these addresses depends on the DS bits.
Sequence Control (SC). The sequence control is a value to avoid frame duplications.
Data. A MAC frame can include any kind of data (max. 2312 bytes).
Checksum (CRC). A 32-bit checksum to protect the frame.

MAC Header - More Specific
Header length: 10-30 bytes
Total maximum length: 2346 bytes (without CRC)
Time field also used for power saving
Some of these fields can be omitted with certain frame types

Field: Ctrl | Time | Address 1 | Address 2 | Address 3 | Address 4 | Seq | Data   | CRC-32
Bytes: 2    | 2    | 6         | 6         | 6         | 6         | 2   | 0-2312 | 4
Time: required time for data plus ACK (also for CSMA/CA)

Ctrl field: Ver | Type | Sub-Type | To DS | From DS | More Frag | Retry | Pwr Mgmt | More Data | WEP | Order
Bits:       2   | 2    | 4        | 1     | 1       | 1         | 1     | 1        | 1         | 1   | 1

Seq field: Number of Fragment | Sequence Number of message (not frame)
Bits:      4                  | 12
2312 bytes is the maximum frame length without encryption etc.
Most adapters allow at least 2346-byte frames (total length).
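The bit layout of the control field can be decoded with a few shifts and masks. The following sketch assumes the field is transmitted least significant byte first, as specified for 802.11, and that the bits are numbered in the order listed on the slide (Ver in the lowest two bits, Order in the highest). The function name and the dictionary keys are illustrative.

```python
import struct

def parse_frame_control(frame: bytes) -> dict:
    """Decode the 2-byte control field at the start of an 802.11 MAC frame."""
    (fc,) = struct.unpack_from("<H", frame, 0)  # little-endian 16-bit value
    return {
        "version":   fc & 0b11,
        "type":      (fc >> 2) & 0b11,      # 0 = management, 1 = control, 2 = data
        "subtype":   (fc >> 4) & 0b1111,
        "to_ds":     bool((fc >> 8) & 1),
        "from_ds":   bool((fc >> 9) & 1),
        "more_frag": bool((fc >> 10) & 1),
        "retry":     bool((fc >> 11) & 1),
        "pwr_mgmt":  bool((fc >> 12) & 1),
        "more_data": bool((fc >> 13) & 1),
        "wep":       bool((fc >> 14) & 1),
        "order":     bool((fc >> 15) & 1),
    }

beacon = parse_frame_control(bytes([0x80, 0x00]))   # management frame, subtype 8
to_ap = parse_frame_control(bytes([0x08, 0x01]))    # data frame with To DS set
print(beacon["type"], beacon["subtype"])
print(to_ap["type"], to_ap["to_ds"], to_ap["from_ds"])
```

Only the first two bytes of the frame are inspected here; the remaining header fields follow at the byte offsets given in the table above.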

Header Details - Addresses
Infrastructure network: cell address = AP's MAC address

To DS | From DS | Address 1 | Address 2 | Address 3 | Address 4 | Usage
0     | 0       | Receiver  | Sender    | Cell      | --        | Used for all mgmt and ctrl frames. Used for data frames in ad-hoc or broadcast situations.
0     | 1       | Receiver  | Cell      | Sender    | --        | Communication inside BSS: frame from AP to receiver. Sender is originator. ACK must be sent to AP.
1     | 0       | Cell      | Sender    | Receiver  | --        | Communication inside BSS: frame from sender to AP. Should be relayed to receiver.
1     | 1       | Cell      | Cell      | Receiver  | Sender    | Communication between APs. Address 1 is the receiving AP, address 2 is the sending AP.
Four addresses are used in bridging mode, but bridging is a very proprietary feature with lots of additional undocumented tricks.
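The address table on this slide is essentially a lookup keyed by the two DS bits. A minimal sketch of that lookup follows; the role names mirror the slide, while the function name and the example addresses are hypothetical.

```python
# (To DS, From DS) -> meaning of Address 1..4, as given in the slide's table
ADDRESS_ROLES = {
    (0, 0): ("receiver", "sender", "cell", None),
    (0, 1): ("receiver", "cell", "sender", None),
    (1, 0): ("cell", "sender", "receiver", None),
    (1, 1): ("receiving AP", "sending AP", "receiver", "sender"),
}

def label_addresses(to_ds: int, from_ds: int, addresses) -> dict:
    """Map the up to four header addresses to their roles for the given DS bits."""
    roles = ADDRESS_ROLES[(to_ds, from_ds)]
    return {role: addr for role, addr in zip(roles, addresses) if role is not None}

# Hypothetical frame from a station to its AP (To DS = 1, From DS = 0)
header_addresses = ["00:0b:85:00:00:01", "00:13:ce:00:00:02", "00:13:ce:00:00:03", None]
print(label_addresses(1, 0, header_addresses))
```

In the example, Address 1 is the cell (the AP that must receive and relay the frame), while the final receiver only appears in Address 3.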

Note
If an AP is used, ANY traffic runs over the AP
  Because stations do not know whether the receiver is associated to this AP or another AP
Cell address = AP's MAC address
  Always specified in header
  Not needed in ad-hoc network



Service Set Management Frames
Beacon frame
  Sent periodically by AP to announce its presence and relay information, such as timestamp, SSID, and other parameters
  Radio NICs continually scan all 802.11 radio channels and listen to beacons as the basis for choosing which access point is best to associate with
Probe request frame
  Once a client becomes active, it searches for APs in range using probe request frames
  Sent on every channel in an attempt to find all APs in range that match the SSID and client-requested data rates
Probe response frame
  Typically sent by APs
  Contains synchronization and AP load information (also other capabilities)
  Can be sent by any station (ad hoc)
[Figure: message sequence between initiator and responder: probe request, probe response, authentication request, authentication response, association request, association response]
Authentication frame: 802.11 authentication is a process whereby the access point either accepts or rejects the identity of a radio NIC. The NIC begins the process by sending an authentication frame containing its identity to the access point. With open system authentication (the default), the radio NIC sends only one authentication frame, and the access point responds with an authentication frame indicating acceptance (or rejection). With the optional shared key authentication, the radio NIC sends an initial authentication frame, and the access point responds with an authentication frame containing challenge text. The radio NIC must send an encrypted version of the challenge text (using its WEP key) in an authentication frame back to the access point. The access point ensures that the radio NIC has the correct WEP key (which is the basis for authentication) by checking whether the challenge text recovered after decryption is the same as the one sent previously. Based on the result of this comparison, the access point replies to the radio NIC with an authentication frame signifying the result of authentication.
Deauthentication frame: A station sends a deauthentication frame to another station if it wishes to terminate secure communications.

Authentication and Association
Authentication frame
  AP either accepts or rejects the identity of a radio NIC
Deauthentication frame
  Sent by any station that wishes to terminate the secure communication
Association request frame
  Used by client to specify: cell, supported data rates, and whether CFP is desired (then client is entered in a polling list)
Association response frame
  Sent by AP, contains an acceptance or rejection notice to the radio NIC requesting association
Reassociation request frame
  To support reassociation to a new AP
  The new AP then coordinates the forwarding of data frames that may still be in the buffer of the previous AP waiting for transmission to the radio NIC
Reassociation response frame
  Sent by AP, contains an acceptance or rejection notice to the radio NIC requesting reassociation
  Includes information regarding the association, such as association ID and supported data rates
Disassociation frame
  Sent by any station to terminate the association
  E.g. a radio NIC that is shut down gracefully can send a disassociation frame to alert the AP that the NIC is powering off
Association request frame: 802.11 association enables the access point to allocate resources for and synchronize with a radio NIC. A NIC begins the association process by sending an association request to an access point. This frame carries information about the NIC (e.g., supported data rates) and the SSID of the network it wishes to associate with. After receiving the association request, the access point considers associating with the NIC, and (if accepted) reserves memory space and establishes an association ID for the NIC.
Association response frame: An access point sends an association response frame containing an acceptance or rejection notice to the radio NIC requesting association. If the access point accepts the radio NIC, the frame includes information regarding the association, such as association ID and supported data rates. If the outcome of the association is positive, the radio NIC can utilize the access point to communicate with other NICs on the network and systems on the distribution (i.e., Ethernet) side of the access point.
Reassociation request frame: If a radio NIC roams away from the currently associated access point and finds another access point having a stronger beacon signal, the radio NIC will send a reassociation frame to the new access point. The new access point then coordinates the forwarding of data frames that may still be in the buffer of the previous access point waiting for transmission to the radio NIC.
Reassociation response frame: An access point sends a reassociation response frame containing an acceptance or rejection notice to the radio NIC requesting reassociation. Similar to the association process, the frame includes information regarding the association, such as association ID and supported data rates.
Disassociation frame: A station sends a disassociation frame to another station if it wishes to terminate the association. For example, a radio NIC that is shut down gracefully can send a disassociation frame to alert the access point that the NIC is powering off. The access point can then relinquish memory allocations and remove the radio NIC from the association table.

Beacon Details
Clients verify their current cell by examining the beacon
Beacon is typically sent 10 times per second
Information carried by beacon:
  Timestamp (8 bytes)
  Beacon interval (2 bytes, time between two beacons)
  Cell address (6 bytes)
  All supported data rates (3-8 bytes)
  Optional: FH parameter (7 bytes, hopping sequence, dwell time)
  Optional: DS parameter (3 bytes, channel number)
  ATIM (4 bytes, power saving in ad-hoc nets) or TIM (infrastructure nets)
  Optional but very common: vendor-specific INFORMATION ELEMENTS (IEs)
Problem: Beacons reveal features and existence of cell
Security relevance: The beacon is always sent with the lowest supported data rate (1 Mbit/s for 802.11b/g or 6 Mbit/s with 802.11a), and therefore the beacon reveals the existence of a cell even over large distances.
However, there is no real workaround, as you cannot disable beacons. One could specify an increased 'required' data rate and disable lower 'required' rates. Only 'required' rates are used for management frames. This reduces the detection range.
Additionally, you can increase the beacon interval from 100 ms to e.g. 1 second. However, this may affect the roaming service.

SSID
32 bytes, case sensitive
Spaces can be used, but be careful with trailing spaces
Multiple SSIDs can be active at the same time; assign the following to each SSID:
  VLAN number
  Client authentication method
  Maximum number of client associations using the SSID
  Proxy mobile IP
  RADIUS accounting for traffic using the SSID
  Guest mode
  Repeater mode, including authentication username and password
Only "Enterprise" APs support multiple SSIDs
  Cisco: 16
  One broadcast SSID, others kept secret
  Repeater-mode SSID
AP# configure terminal
AP(config)# interface dot11radio 0
AP(config-if)# ssid batman
AP(config-ssid)# accounting accounting-method-list
AP(config-ssid)# max-associations 15
AP(config-ssid)# vlan 3762
AP(config-ssid)# end
If you want the access point to allow associations from client devices that do not specify an SSID in their configurations, you can set up a guest SSID. The access point includes the guest SSID in its beacon. The access point's default SSID, tsunami, is set to guest mode. However, to keep your network secure, you should disable the guest mode SSID on most access points.
If your access point will be a repeater or will be a root access point that acts as a parent for a repeater, you can set up an SSID for use in repeater mode. You can assign an authentication username and password to the repeater-mode SSID to allow the repeater to authenticate to your network like a client device.
If your network uses VLANs, you can assign one SSID to a VLAN, and client devices using the SSID are grouped in that VLAN.
SSID broadcasting. In some cases, such as public Internet access applications, you can broadcast the SSID to enable user radio cards to automatically find available access points. For private applications, it is generally best not to broadcast the SSID for security reasons, because it invites intruders. With multiple SSIDs you can mix and match the broadcasting of SSIDs.

The IEEE 802.11 Protocol
CSMA/CA

Access Methods - CSMA/CA
Distributed Coordination Function (DCF)
  Asynchronous data service
  Optionally with RTS/CTS
Point Coordination Function (PCF)
  Intended for realtime service (e.g. VoIP)
  Polling method
  Optional
[Figure: "Distributed Foundation Wireless Medium Access Control" (DFWMAC), consisting of DCF (CSMA/CA) and PCF]
In the 802.11 standard, 3 access methods are defined: one method that is based on a CSMA/CA version (must be supported), one optional method which avoids the problem of invisible devices, and an optional, collision-free method. The first two methods are called Distributed Coordination Function (DCF), and the third method is the so-called Point Coordination Function (PCF). DCF methods can only support asynchronous services; PCF supports asynchronous and time-bounded services. However, an access point is necessary for PCF methods.
Note: The PCF is optional and only very few APs or Wi-Fi adapters actually implement it.

Superframe
Beacon is sent by "Point Coordinator" (PC = AP)
Minimum CP period guaranteed
  To avoid starvation of non-realtime data
  At least one frame can be sent
Note: Poll frames and ACKs omitted in this picture!
[Figure: time axis t divided into superframes. Each superframe starts with a beacon (B) and consists of a Contention-Free Period (CFP; PCF regime: polling; realtime data such as VoIP) followed by a Contention Period (CP; DCF regime: contention; data). Beacons (B) are repeated every beacon interval; the next superframe follows.]
Typically the Point Coordinator (PC) is integrated in the AP, but this is not required. In order to give an idea of the basic principle of the superframe, the CFP, and the CP, many details have been omitted. The details of both periods are explained in the following slides.
Note that the beacon frames are primarily used to detect stations within this cell.

CSMA Access Method
Basic Ideas
  No standing waves in free space => no Ethernet-like collision detection possible
  Collision is detected by missing ACKs!
  Truncated random exponential backoff like in Ethernet and 802.3
  Simple fragmentation mechanism
    Ethernet compatibility
    Performance (interferences)
Details
  CCA to determine medium state
    CSMA: "Listen before talk"
  A safety Inter-Frame Space (DIFS | PIFS | SIFS, plus backoff) must be awaited before TX
    CW is a multiple of the Ethernet slot time
    If medium is busy: backoff
  Slot time: 47 µs (9 µs)
  DCF Inter-Frame Space (DIFS)
    Longest waiting time, 128 µs (34 µs)
    Used for asynchronous data services
  PCF Inter-Frame Space (PIFS)
    Used for APs to stop user communication, 78 µs (25 µs)
  Short Inter-Frame Space (SIFS)
    Shortest waiting time, highest priority, 28 µs (16 µs)
    Used for ACKs
[Figure: timing diagram. After "Medium busy", a station waits SIFS, PIFS, or DIFS (TX waiting time); after DIFS, the competition window for the random backoff mechanism follows, measured in slot times up to its maximum, before the next frame is sent.]
The picture above shows some important parameters which apply before a device can access the medium. DIFS, PIFS, and SIFS control the priority with which a device gains access. The medium can be free or busy; the current status is detected with the help of the CCA signal.
Real collision detection would require a full-duplex connection to detect collisions also at the "end" of a wireless connection. This would be too expensive for wireless LAN hardware.
Note that the whole frame with all control information, FCS, etc. is up to 2346 bytes long.
The DIFS parameter describes the longest waiting time and consequently the lowest priority. It is used for data transfer.
The PIFS is used for time-bounded services. If an access point needs to poll some devices, it only needs to wait PIFS.
The SIFS is the shortest waiting time; frames which use the SIFS have the highest priority. All control frames (e.g. acknowledgements) use this time, so they cannot be blocked by data transfer.
Note: The numbers in brackets relate to the 802.11a standard values.
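The three inter-frame spaces are not independent. In the 802.11 standard, PIFS is defined as SIFS plus one slot time and DIFS as SIFS plus two slot times. The sketch below (function name illustrative) reproduces the bracketed 802.11a values from the slide. Note that the first set of values on the slide (28, 78, 128 µs) is spaced by 50 µs, so under this relation it corresponds to a slot time of 50 µs rather than the 47 µs printed.

```python
def inter_frame_spaces(sifs_us: int, slot_us: int) -> dict:
    """Derive PIFS and DIFS from SIFS and the slot time (all values in microseconds)."""
    return {
        "SIFS": sifs_us,                  # highest priority (ACK, CTS)
        "PIFS": sifs_us + slot_us,        # point coordinator
        "DIFS": sifs_us + 2 * slot_us,    # ordinary data frames
    }

print(inter_frame_spaces(sifs_us=16, slot_us=9))   # 802.11a values in brackets on the slide
print(inter_frame_spaces(sifs_us=28, slot_us=50))  # matches the 28/78/128 values
```

Because each class waits exactly one slot longer than the previous one, a station with a higher-priority frame always seizes the medium before lower-priority stations have finished waiting.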

Backoff Policies
Random backoff reduces collisions
Competition window (CW)
  Start value of 7 slot times
  After every collision CW is doubled
  To a max of 255
Post-backoff
  After successful transmission
  To avoid "channel capture"
Exception: long silent durations
  Station may send immediately after DIFS
The random backoff time is a number of slots. Every competition window starts with a size of 7 slots. After every collision the competition window is doubled, up to a maximum of 255. The DFWMAC with CSMA/CA method works fine with few devices, but there will be too many collisions with too many devices.
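The growth of the competition window can be sketched as follows. The code assumes that "doubled" means the sequence 7, 15, 31, ... (i.e. 2 * (CW + 1) - 1), which is the reading consistent with the maximum of 255, and that the backoff is drawn uniformly from 0 to CW slots. The function names are illustrative.

```python
import random

CW_MIN, CW_MAX = 7, 255  # start value and maximum from the slide

def next_cw(cw: int) -> int:
    """Competition window after one more collision (assumed sequence 7, 15, 31, ...)."""
    return min(2 * (cw + 1) - 1, CW_MAX)

def backoff_slots(collisions: int, rng=random) -> int:
    """Random number of slots to wait after the given number of collisions."""
    cw = CW_MIN
    for _ in range(collisions):
        cw = next_cw(cw)
    return rng.randint(0, cw)  # randint includes both end points

cw, sequence = CW_MIN, []
for _ in range(7):
    sequence.append(cw)
    cw = next_cw(cw)
print(sequence)
print(backoff_slots(collisions=3))
```

After five collisions the window has reached its maximum, so further collisions no longer spread the stations out any more; this is one reason why the method degrades with many devices.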

CSMA/CA in Action
Point-to-point communication
Acknowledgment is sent after SIFS
  Before all other communications
  Guaranteed collision free
Retransmitted frames have no higher priority over other frames
[Figure: timing diagram. The sender waits DIFS and sends data; the receiver answers with an ACK after SIFS. Other stations wait (waiting time), then after DIFS and the competition window (CW) they send their data.]
The picture above shows the DFWMAC with CSMA/CA method for a point-to-point communication. After the user data, an acknowledgement is sent. This acknowledgement is sent after SIFS, so it is transferred collision free and before all other communication. If a data packet needs to be retransmitted, the device needs to wait DIFS and also uses the normal backoff mechanism. These retransmitted frames have no advantage compared with others.

CSMA/CA with RTS/CTS
Avoids the problem of invisible devices or "hidden stations"
  A station receives data from two other devices
  The two other devices don't see each other
  Each device thinks the medium is free => collision
2 special packets: RTS and CTS
  Every station must listen to these packets
Four-way handshake:
  1. RTS
  2. CTS
  3. Data
  4. ACK
[Figure: timing diagram of the access method. After DIFS the sender transmits an RTS; the receiver answers with a CTS after SIFS; after another SIFS the sender transmits the data; after a further SIFS the receiver returns the ACK. Other stations set NAV (RTS) and hidden stations set NAV (CTS) as their waiting time; after the ACK and DIFS the competition window (CW) follows and other stations may send data.]
To avoid the problem of invisible devices (devices which don't see each other), the DFWMAC with RTS/CTS method was created. Two special packets, the RTS and the CTS packet, help to fix this problem. Every 802.11 device must listen to these packets.
If a station (sender) wants to send some packets, it sends an RTS packet first. This RTS packet includes information about the target device (receiver) and about the approximate transfer time. All other stations will receive this RTS packet. The stations save the approximate transfer time in a so-called Network Allocation Vector (NAV). During this time, the other stations will not send any packets. The receiver station will also receive this packet and send a CTS packet after SIFS. The CTS packet is a signal for the sending station to start its transmission. It also includes more exact information about the transfer time. All other stations receive this packet too and adjust their NAV. After the NAV has expired the medium is free for all, and a new competition can start.
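The durations that other stations store in their NAV can be sketched as follows. In the 802.11 standard the RTS announces the time for CTS, data, and ACK plus three SIFS, and the CTS announces what remains of that after the CTS itself. The sketch is simplified under assumptions: PLCP preamble and header times are ignored, all frames are sent at the same rate, and the function names are illustrative. CTS and ACK frames are 14 bytes long.

```python
def airtime_us(length_bytes: int, rate_mbps: float) -> float:
    """Transmission time of the MAC part of a frame in microseconds (PLCP overhead ignored)."""
    return length_bytes * 8 / rate_mbps

def nav_durations_us(data_bytes: int, rate_mbps: float, sifs_us: float,
                     cts_bytes: int = 14, ack_bytes: int = 14):
    """Return (NAV announced by the RTS, NAV announced by the CTS)."""
    cts = airtime_us(cts_bytes, rate_mbps)
    ack = airtime_us(ack_bytes, rate_mbps)
    data = airtime_us(data_bytes, rate_mbps)
    nav_rts = 3 * sifs_us + cts + data + ack   # SIFS-CTS-SIFS-Data-SIFS-ACK
    nav_cts = nav_rts - sifs_us - cts          # what is left once the CTS has been sent
    return nav_rts, nav_cts

# Hypothetical example: 1000-byte frame at 1 Mbit/s with a SIFS of 10 microseconds
print(nav_durations_us(1000, 1.0, 10))
```

Stations that hear only the CTS (the hidden stations) therefore stay quiet for exactly the remaining part of the exchange.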

RTS/CTS => "Virtual Reservation"
Collisions can only occur at the beginning or after a transmission
Much more overhead
  RTS/CTS packets increase the total access delay
Usage guidelines
  Only when longer frames are sent on average (> 500 bytes)
  When hidden stations are expected
This process is called 'virtual reservation'. Collisions can only occur at the beginning or at the end of a transmission, within the competition window. The big disadvantage of the DFWMAC with RTS/CTS method is the additional traffic. The RTS/CTS packets increase the traffic and thus the access delay. Therefore this method is only used with longer frames.

PCF - Polling Principle
Guaranteed transmission parameters
  Minimum data rate
  Maximum access delay
AP necessary (!)
  For medium access control
  Polling and time-keeping
  Acts as "point coordinator"
Point Coordinator (PC) splits access time into a superframe
  Contention-free period (PCF method)
  Contention period (DCF method)
Target Beacon Transmission Time (TBTT) is announced in each beacon
[Figure: time axis t divided into superframes. Each superframe starts with a beacon (B) and consists of a Contention-Free Period (CFP; PCF regime: polling; realtime data such as VoIP) followed by a Contention Period (CP; DCF regime: contention; data). Beacons (B) are repeated every beacon interval; the next superframe follows.]
Neither DCF method supports transfer guarantees. With DFWMAC-PCF some parameters can be defined, such as a minimal bandwidth or a maximal access delay. For this PCF method an access point is necessary, so it can only be used within infrastructure networks. The access point is necessary to control the medium access and to interrogate the different stations (polling).
The access time of the medium is split into a so-called 'superframe' by the point coordinator. The superframe consists of a competition period (CP) and a competition-free period (CFP). The CFP is optional and does not need to be supported by the AP. If it is supported, then the AP also controls the changes from one phase to the other. The beacon is sent periodically, primarily to detect other stations in the cell. Additionally, the superframe always begins with a beacon. Note that the superframe is sometimes also called CFP interval.

CFP Policy
Beacon starts CFP by announcing maximum duration of CFP
  Can be a multiple of beacon intervals
  Intermediate beacons indicate the remaining CFP duration
Between two successive CFPs there must be space to send at least one frame in the CP mode!
The AP may finish the CFP earlier!
  By sending the CF-End control frame
CFP is optional
  CSMA/CA-only clients must not interfere
  CFP also relies on CSMA/CA
Not all clients need to support CFP. The NAV is set by the beacon frame for all stations.

PCF Medium Access
[Figure: PCF medium access over one superframe, with time marks t_0 to t_4. Rows: point coordinator (frames D_1 to D_4, CF-End), stations (frames U_1, U_2, U_4), and the stations' NAV (no competition). The medium is full until t_1; the point coordinator accesses the medium after PIFS, and the frames are separated by SIFS. After CF-End the competition period follows.]
The picture above shows the schematic view of the DFWMAC-PCF method. Remember that the data transfer takes place via the point coordinator. All data is sent by or received by the point coordinator. D_n and U_n are data/user data from or to the point coordinator.

PCF Algorithm
The competition-free zone starts at t_0
The medium becomes free at t_1
After PIFS the PC can access the medium
  No other station can access because PIFS is smaller than DIFS
Now the PC polls the first station (D1)
  Stations may answer with user data after SIFS
Stations must ACK within PIFS
  PIFS is shortest idle period within CFP
All frames are sent through the AP!
AP maintains list of all stations that should be polled
  Announced by association process
PC continuously polls listed stations
PC can send data together with beacon (piggy-back)
By sending a CF-End frame the PC starts the CP
After the medium becomes free at t_1, the point coordinator waits PIFS before it can access the medium. PIFS is smaller than DIFS, so no other station can get access to the medium before the point coordinator. Now the point coordinator can start sending data to the first station. The station can answer immediately after SIFS.
Now the point coordinator starts to send data to station 2. With the CF-End packet the point coordinator opens the competition period.
In this mode the access point has complete control. By consistently polling all stations it is possible to guarantee a fixed bandwidth. But this method also has a disadvantage: if some stations have nothing to send, much bandwidth remains unused.

802.11g/b Compatibility
"b" expects CCK preamble and cannot detect OFDM signals
Therefore collisions with legacy "b"
Compatibility mode
g-devices only use RTS/CTS
Always 1 Mbit/s and BPSK
Newer "g" sends a CCK-based CTS before each OFDM-based data frame
"g" suffers from reduced throughput
8-14 Mbit/s instead of 22 Mbit/s
"g" reaches longer distances (=> OFDM)
Cell design must consider b-only clients
Only when same power level used !



Realtime Problems with 802.11
Available BW is shared among clients
No traffic priorities
Once a station gains access it may keep the medium for as long as it chooses
Low bitrate stations (e.g. 1 Mbit/s) will significantly delay all other stations
No service guarantees
PCF does not support traffic classes
However, the PCF is typically not implemented in APs and client adapters
Originally (1999) WLAN QoS was only provided by the PCF algorithm, which was never actually implemented in any products.

Specific PCF Problems
Irregular Beacon delays
Stations may finish each transmission even if TBTT already expired
Up to 2304 bytes (2312 bytes if encrypted, new: even 2342 bytes allowed)
Station may even send all fragments of an L2-fragmented packet
Hidden station and interferences
No traffic classes means: All applications have equal TX opportunity
Since the beacon is sent using CSMA rules, significant delays are possible.
802.11a: 250 usec delay on average, but can reach 4.9 ms(!)

802.11e - EDCF and HCF
New coordination functions relying on Traffic Classes (TCs)
Enhanced DCF (EDCF)
Better CHANCES for high-priority classes
But NO GUARANTEES ("best effort QoS")
Performed within CP
Hybrid Coordination Function (HCF)
Is an enhanced PCF
Allows precise QoS configurations on the HC:
BW control
Guaranteed throughput
Fairness between stations
Classes of traffic
Jitter limits
Performed within CFP
The HCF is the most complex coordination function for WLANs.

802.11e - HCF Details
Stations announce their TC queue lengths
The Hybrid Coordinator (HC=AP) does not need to follow round robin but may use any coordination scheme
Stations are given a Transmit Opportunity (TXOP)
They may send multiple packets in a row, for a given time period
During the CP, the HC can resume control of the access to the medium by sending CF-Poll packets to stations
Also allows sending multiple data frames followed by a single ACK

802.11e - Facts
Concept Summary
CP allows to prioritize certain TCs instead of stations
More important traffic classes will be preferred - statistically
CFP allows bandwidth reservation by stations and non-round-robin polling
Not yet implemented (Fall 2004)
Hybrid Controller (HC) required
Controls all other "enhanced stations"
Typically implemented within AP (not necessarily)
"QBSS" instead of BSS
Main driver for QoS is "Voice over Wireless IP" (VoWIP)
802.11e uses the Enhanced Distributed Coordination Function (EDCF) and the Hybrid Coordination Function (HCF). All stations which support the Enhanced Distributed Coordination Function are called "Enhanced Stations". One of these Enhanced Stations controls all other stations and is called the "Hybrid Controller".
The EDCF is used during the CP, and the HCF is used during both CP and CFP (the complete superframe). In 802.11e there are different Traffic Categories (TCs). For every TC an Enhanced Station needs to maintain its own back-off timer and other parameters. The TC priorities resolve collisions if two or more timers reach zero at the same time.

802.11e - Algorithm (1)
All traffic is separated into TCs
Enhanced stations must maintain a separate back-off timer for each TC
Up to 8 priority queues (one for each TC)
"Virtual Stations" inside enhanced stations
Each TC has a different priority value
To avoid collisions if the counters of two TCs expire
TCs compete within Arbitration Interframe Space (AIFS)
Different AIFS for each TC possible
At least one DIFS long
Persistence factor (PF) solves collision
Used to calculate new back-off values
PF=1..16
Legacy stations must have a CWmin=15 and PF=2
A so-called Persistence Factor (PF) is used if a collision of TCs from different stations occurs.
Enhanced Stations have a CWmin of 0..255.
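How the PF enters the calculation of new back-off values is not spelled out on the slide. The following minimal sketch assumes the rule used in the 802.11e drafts, newCW = ((oldCW + 1) * PF) - 1, and an upper bound CWmax of 1023; both the formula and the bound are assumptions, not taken from the slides. With the legacy values CWmin=15 and PF=2 it reproduces the familiar doubling 15, 31, 63, ...

```python
def next_contention_window(old_cw: int, pf: int, cw_max: int = 1023) -> int:
    """Contention window after a collision (assumed 802.11e draft rule)."""
    if not 1 <= pf <= 16:
        raise ValueError("PF must be in the range 1..16")
    # Assumption: newCW = ((oldCW + 1) * PF) - 1, capped at cw_max
    return min((old_cw + 1) * pf - 1, cw_max)


# Legacy behaviour (CWmin=15, PF=2): the window doubles after each collision
legacy = [15]
for _ in range(3):
    legacy.append(next_contention_window(legacy[-1], pf=2))
print(legacy)
```

A PF of 1 leaves the window unchanged after a collision, which is one way a high-priority TC could keep its back-off short.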

802.11e - Algorithm (2)
Transmission Opportunity (TXOP)
Time slot during which a station may send
EDCF-TXOP
Issued by EDCF algorithm
Limited by system-wide TXOP limit announced in beacon frames
Polled-TXOP
Issued by HCF
Limited by parameter announced in poll frame
HCF can redefine TXOP at any time
And finish the CP earlier
HC also supports controlled contention
Polling frames announce sending desire of other stations
Legacy stations must wait until end of controlled contention period
The Polled-TXOP is limited by a parameter which is announced in poll frames and which replaces the NAV timer.

802.11e - Queuing Concept
[Figure: queuing concept of an IEEE 802.11e station. Eight queues, ordered from higher to lower priority as TC7, TC6, TC5, TC4, TC3, TC0, TC1, TC2, each with its own backoff instance and the parameters AIFS (>= 34 usec), CW (0-255) and PF (1-16). A scheduler resolves "virtual collisions" by granting the TXOP to the highest priority before the media access attempt on the wireless medium (PHY). For comparison, a legacy station has only one priority and a single backoff instance with DIFS = 34 usec, CW = 15 and PF = 2.]
Legacy DCF uses AIFS = 34 usec, CWmin = 15, PF = 2.
Enhanced stations perform EDCF with AIFS[TC] >= 34 usec, CWmin[TC] = 0-255, PF[TC] = 1-16.

WiFi Multimedia - WMM
WMM implements a subset of 802.11e to satisfy urgent QoS needs
Certification start: 09/2004
Only supports prioritized media access:
4 access categories per device: voice, video, best effort, and background
Does not support guaranteed throughput
Cisco 1100/1200 APs already provide a subset of 802.11e.

Legacy QoS
Most legacy (non-802.11e) APs only support downstream QoS
On the AP, create QoS policies and apply them to VLANs
If you do not use VLANs on your network, you can apply your QoS policies to the access point's Ethernet and radio ports
Note: APs do not classify packets!
Only already classified packets are prioritized (DSCP, client type, 802.1p)
EDCF-like queuing is performed on the radio port; only FIFO on Ethernet egress port
Only 802.1Q tagging supported - no ISL !!!



802.1x and WAN Congestion
Congestion on WAN links: prioritize 802.1x packets
Classify and mark RADIUS packets using the Cisco Modular QoS Command Line (MQC)
Method to determine the appropriate queue size for the 802.1x/RADIUS packets
And to determine how to enable queuing on router interfaces
! Create ACL for interesting traffic
ip access-list extended LEAPACL
 permit udp any host 172.24.100.156 eq 1645
! Classify
class-map match-any LEAPCLASS
 match access-group name LEAPACL
! This is a policy group
policy-map MARKLEAP
 class LEAPCLASS
  ! DSCP 26 corresponds to AF31 (Class=3, 1=low drop)
  set ip dscp 26
! Attach marker on interface
interface FastEthernet0/0.100
 encapsulation dot1Q 100
 ! Mark inbound (input) packets only
 service-policy input MARKLEAP
policy-map LEAPQUEUE
 class LEAPCLASS
  ! 8 kbit/s if needed (dynamic management)
  bandwidth 8
! Attach policy-map on WAN interface
interface Serial3/0:0
 ip address 172.24.100.66 255.255.255.252
 load-interval 30
 service-policy output LEAPQUEUE
Remember that the DSCP has the following format: XYZPP0, where XYZ selects one of four assured forwarding classes (like "premium", "gold", "silver", and "bronze") and "PP" represents the drop precedence bits (RFC 2597: "low", "medium", and "high").
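The XYZPP0 layout can be checked with a few lines of Python. The sketch below decodes a DSCP value into its assured forwarding name; the function name is chosen here for illustration only. It confirms that the value 26 used in the configuration above is AF31.

```python
def decode_af_dscp(dscp: int) -> str:
    """Decode a 6-bit DSCP value of the form XYZPP0 into its AF name."""
    if not 0 <= dscp <= 0b111111:
        raise ValueError("DSCP is a 6-bit value")
    af_class = dscp >> 3            # bits XYZ: assured forwarding class 1..4
    drop = (dscp >> 1) & 0b11       # bits PP: drop precedence 1..3
    if dscp & 1 or not 1 <= af_class <= 4 or not 1 <= drop <= 3:
        raise ValueError(f"{dscp} is not an assured forwarding code point")
    return f"AF{af_class}{drop}"


print(decode_af_dscp(26))  # 26 = 011010 binary -> AF31
```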

WLAN
Security Summary

Threat Summary
Simple eavesdropping
Radio broadcast
Reduce TX powers!
Encryption (WEP, TKIP, AES, IPsec)
Authentication
Shared secrets vs. stolen devices, large nets
Centralized AAA => 802.1x
Mutual authentication (Rogue APs)
DoS Attacks
Physical jamming
Difficult to prevent (shielding, directional antennas)

WLAN Security Overview
802.11 Standard
WEP Encryption
Open Authentication
Shared Authentication
TKIP & MIC 802.1x
802.11i
AES
WPA-2
WPA
IPsec VPN

WEP Problems
Content
This chapter presents a detailed overview of today's WLAN security problems and solutions.
This subchapter provides an introduction to WEP, the original 802.11 standard's only method for encryption, authentication and integrity protection.

Intro
Wireless LAN is a perfect medium for attackers
Sniffers easily remain undetected
Outdoor attacks
Simple DoS attacks through jamming
Vulnerabilities found in initial standards
Authentication / Encryption / Integrity
Centralized management of user credentials
"Mobile devices" => frequent hardware theft
Rogue APs often remain undetected
Mutual auth required
Interoperability of security features of different vendors still in question (nevertheless WPA)
Lots of cracker tools available (WEPCrack, AsLeap, ...)
2002/2003: 66% of WLANs unprotected (but better security awareness in 2004)
Compared to all other physical communication media, the wireless realm is the medium of choice for attackers and hackers. The main reason is that there are no wires an attacker needs to attach to. Moreover, the attacker can hide in another building or in a car, more than 100 meters outside the building (given a good antenna).
Since there are no wires, it is not possible to protect the physical medium from interference or jamming; therefore DoS attacks are critical. An attacker could even destroy the sensitive receiving devices by jamming at very high power levels.
Additionally, there are other problems, caused by the 802.11 design itself:
The standard encryption, integrity and authentication method has serious design flaws.
There is no means to manage user credentials centrally, which leads to bad practical security designs.
The standard security concept is based on device-bound secrets; therefore hardware theft opens security holes for that network.
The standard security concept does not allow authenticating the infrastructure devices; therefore so-called "rogue access points" can be installed by attackers.
Proprietary security enhancements caused an interoperability problem for several years.
Dozens of cracker tools are available on the Web.
And finally, WLAN security awareness only became widespread in the last year, and still too many WLAN networks are poorly secured or not secured at all. In 2002/2003, almost two-thirds of all WLAN networks were unprotected.

RC4 Facts
Simple and fast stream cipher
Variable key lengths (1-256 bytes)
15 times faster than 3DES
8-16 operations per output byte
Also used by SSL/TLS
Designed 1987 by Ron Rivest for RSA Security
Kept as trade secret by RSA Security but leaked out in 1994
Period is larger than 10^100 !!!
The security of stream ciphers depends 1) on the pseudo-randomness of the keystream they produce, and 2) on the implementation, which must guarantee that each keystream is only used once! Since encryption and decryption are the same operation (XOR), cryptanalysis is typically simple if two plaintexts are encrypted with the same keystream (for example, assume that one plaintext is known).
A stream cipher can be as secure as a block cipher of comparable key length, but typically stream ciphers are much faster and use far less code. For example, if 3DES can produce 3 Mbit/s on a Pentium II, then RC4 could achieve 45 Mbit/s, which is 15 times faster!
The RC4 algorithm had been kept as a trade secret by RSA Security, but in September 1994 the code was anonymously posted to the Cypherpunks mailing list.

How RC4 Works
// Initialization of S and T
for i = 0 to 255 do
    S[i] = i;
    T[i] = K[i mod keylen];
// Initial permutation of S, dictated by T
j = 0;
for i = 0 to 255 do
    j = (j + S[i] + T[i]) mod 256;
    Swap (S[i], S[j]);
// Keystream generation, one byte k per iteration
i, j = 0;
while (1)
    i = (i + 1) mod 256;
    j = (j + S[i]) mod 256;
    Swap (S[i], S[j]);
    t = (S[i] + S[j]) mod 256;
    k = S[t];
Initialize S[0]..S[255] with ascending numbers.
Initialize T[0]..T[255] with the key K (if keylen is less than 256 then repeat K as often as necessary).
Use T to produce the initial permutation of S. To do so, go from S[0] to S[255] and swap each S[i] with another byte dictated by T[i].
After that, S still contains all numbers from 0 to 255 but in a permuted order.
Now again swap S[i] with another byte in S, but this time it is dictated by S itself (the key is no longer used).
After S[255] is reached, repeat again with S[0], as long as there are bytes to encrypt or decrypt.
XOR byte k with the plaintext byte or ciphertext byte for encryption or decryption respectively.
Possible key lengths range from 1 to 256 bytes (i.e. 8 to 2048 bits).
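The steps above can be tried out with a short Python sketch. It folds the array T into the expression key[i % len(key)], which is equivalent; the function name is chosen here for illustration. The key "Key" with the plaintext "Plaintext" is a widely published RC4 test vector.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt data with RC4 (both are the same XOR operation)."""
    if not 1 <= len(key) <= 256:
        raise ValueError("key length must be 1..256 bytes")
    # Initialize S with ascending numbers, then permute it using the key
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    # Generate one keystream byte per data byte and XOR it in
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


ciphertext = rc4(b"Key", b"Plaintext")
print(ciphertext.hex())  # bbf316e8d940af0ad3
```

Applying rc4 with the same key to the ciphertext returns the plaintext, since the identical keystream is XORed in a second time.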

General Stream Cipher Issues
Every stream cipher is supposed to produce a good pseudorandom "keystream"
This is the idea of a "one-time pad"
The keystream is XORed with the plaintext
This method is secure if
The keystream generator has high entropy (i.e. really random)
Each keystream is only used once



Wired Equivalent Privacy (WEP)
Only encryption method of the 802.11 standard
Used for privacy, integrity and authentication
Shared key method
Either one static key
Or short list of dynamic keys (up to four)
Key lengths:
40 bit (default, aka "64 bit" with IV)
Optionally 104 bit (or "128 bit" with IV)
No key distribution method defined(!)
The Wired Equivalent Privacy (WEP) algorithm is meant to provide a level of privacy close to that of a wired network, as its name suggests.
WEP uses the RC4 PRNG algorithm from RSA Data Security Inc. RC4 is a well-studied stream cipher, which expands a key into an infinite pseudorandom sequence.
This RC4 key consists of a 40-bit or 104-bit secret key and a 24-bit Initialization Vector (IV).
Note: The 40- or 104-bit WEP key is used as the base key for each packet. When combined with the 24-bit initialization vector, it is sometimes called the "WEP seed". WEP seeds therefore consist of 64 or 128 bits in total, and many manufacturers refer to the 104-bit WEP keys as 128-bit keys for this reason.
Unfortunately the IEEE 802.11 standard does not specify how to distribute the WEP keys to infrastructure and client devices.
Typically, most vendors allow up to four WEP keys to be specified, which can be chosen dynamically in order to confuse attackers.
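The seed arithmetic from the note can be made concrete with a small sketch. It assumes the usual WEP convention that the IV is prepended to the secret key; the function name and the sample key bytes are purely illustrative.

```python
def wep_seed(iv: bytes, secret_key: bytes) -> bytes:
    """Per-packet RC4 key ("WEP seed"): 24-bit IV followed by the secret key."""
    if len(iv) != 3:
        raise ValueError("the IV is 24 bits (3 bytes)")
    if len(secret_key) not in (5, 13):
        raise ValueError("the secret key is 40 or 104 bits (5 or 13 bytes)")
    return iv + secret_key


seed_64 = wep_seed(b"\x00\x00\x01", b"\x01\x02\x03\x04\x05")
seed_128 = wep_seed(b"\x00\x00\x01", bytes(range(13)))
print(len(seed_64) * 8, len(seed_128) * 8)  # 64 128
```

Only the last 40 or 104 bits of each seed are secret; the first 24 bits travel in clear text in every frame.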

Basic Principle
Payload is XORed with an RC4-generated pseudorandom keystream K
K depends on shared key and 24 bit Initialization Vector (IV)
Ciphertext C = Plaintext P XOR Keystream K
[Figure: WEP frame format: MAC | IV (24 bits) | Key ID (8 bits: 6 bits pad and 2 bits key ID) | Payload | ICV (CRC-32). Payload and ICV are RC4 encrypted.]
Both the visible initialization vector and the shared secret WEP key are used by the RC4 algorithm to produce a pseudo-random keystream for encryption and decryption.
This keystream is mixed with the payload using the XOR operation. In principle the RC4 encryption would be very secure, if there were no severe design flaws.
The weaknesses within WEP were first exposed by researchers from Intel, the University of California at Berkeley, and the University of Maryland. The most damning report came from Fluhrer, Mantin, and Shamir, which outlined a passive attack. Stubblefield, Ioannidis, and Rubin at AT&T Labs and Rice University implemented it and recovered a hidden WEP key based on the attacks proposed in the Shamir et al. paper (aka the Fluhrer et al. paper). This attack took just hours to implement.
Ron Rivest, inventor of the RC4 algorithm, recommends that
"Users consider strengthening the key scheduling algorithm by preprocessing the base key and any counter or initialization vector by passing them through a hash function such as MD5. Alternatively, weaknesses in the key scheduling algorithm can be prevented by discarding the first 256 output bytes of the pseudo-random generator before beginning encryption. Either or both of these techniques suffice to defeat the [Fluhrer, Mantin, and Shamir] attacks on WEP."

WEP - Design Flaw in Detail
The Problem:
XOR operation eliminates two identical terms!
If same S is used on different plaintexts, then
C1 = S XOR P1 and C2 = S XOR P2
C1 XOR C2 = P1 XOR P2
Same keystream S cancels out!
If P1 is known then P2 can be easily calculated!

P1          1 0 1 0 0 0 1 1 0 1
S           0 1 1 1 0 1 0 1 0 1
C1          1 1 0 1 0 1 1 0 0 0

P2          0 0 1 0 0 1 0 1 1 1
S           0 1 1 1 0 1 0 1 0 1
C2          0 1 0 1 0 0 0 0 1 0

C1 XOR C2   1 0 0 0 0 1 1 0 1 0
P1 XOR P2   1 0 0 0 0 1 1 0 1 0
Although RC4 is a very good algorithm, its application in WEP reveals some remarkable security flaws. WEP is insecure when the same keystream is used more than once; the key length and the random properties of the keystream do not matter at all!
This is because the XOR operation eliminates two identical terms. That is, if an attacker sniffed ciphertext C1 and ciphertext C2, which had been produced with the same keystream S, then the WEP algorithm actually performed the following operations:
C1 = S XOR P1 and C2 = S XOR P2.
Hence C1 XOR C2 cancels out S and equals P1 XOR P2. Thus, if plaintext P1 is known, P2 can be easily calculated!
Note: This attack method also works for a subset of these "vectors": if a part of P1 is known, then the corresponding part of P2 can be calculated.
Knowledge of parts of the plaintext message can enable statistical attacks to recover all plaintexts. These statistical attacks become increasingly practical as more ciphertexts that use the same keystream are known. Once one of the plaintexts becomes known, it is trivial to recover all of the others.
Although most 802.11 equipment is designed to disregard encrypted content for which it does not have the key, it is relatively simple to change the configuration of the drivers. Active attacks, which require transmission, seem to be more difficult, yet not impossible. Many 802.11 products come with programmable firmware, which has been reverse-engineered and modified to give attackers the ability to inject traffic.
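The cancellation can be replayed with the 10-bit patterns from the slide. In this sketch the attacker is assumed to know C1, C2 and P1, but never S.

```python
# Bit patterns from the slide
P1 = 0b1010001101
P2 = 0b0010010111
S = 0b0111010101          # keystream, unknown to the attacker

# What the sender transmits
C1 = P1 ^ S
C2 = P2 ^ S

# The keystream cancels out when both ciphertexts are XORed
assert C1 ^ C2 == P1 ^ P2

# With P1 known, P2 follows without any knowledge of S
recovered_p2 = C1 ^ C2 ^ P1
print(f"{recovered_p2:010b}")  # 0010010111
```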

IV Collisions
Keystream should change for each packet
Assures that same plaintexts result in different ciphertext
802.11 does not specify how to pick IVs
Many implementations reset IV to zero at startup and then count up
Only 2^24 IV choices => collisions will occur !!!
Attacker could maintain a "codebook" of all possible S
1500 byte x 2^24 = 24 GByte
Matter of hours only
Shared key length does not hamper the attack!
Because of the XOR properties it is crucial to continuously change the key that makes up the particular keystream, ideally for each packet sent! The key is made up of the shared secret and the IV, and the latter was intended to assure collision protection. But actually, the standard does not specify how to change the IV. There is no strict requirement to change IVs at all!
Example of an attack duration:
A busy access point, which constantly sends 1500 byte packets at 11 Mbit/s, will exhaust the space of IVs after 1500*8/(11*10^6)*2^24 = approx. 18,000 seconds, or 5 hours. This allows an attacker to collect two ciphertexts that are encrypted with the same keystream and perform statistical attacks to recover the plaintext.
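The arithmetic of this example, together with the codebook size from the slide, can be reproduced as follows. The function names are chosen here for illustration; packet size, data rate and IV width are the values from the text.

```python
def iv_exhaustion_seconds(packet_bytes: int = 1500,
                          rate_bps: float = 11e6,
                          iv_bits: int = 24) -> float:
    """Time until all IVs are used when packets are sent back to back."""
    seconds_per_packet = packet_bytes * 8 / rate_bps
    return seconds_per_packet * 2 ** iv_bits


def codebook_bytes(packet_bytes: int = 1500, iv_bits: int = 24) -> int:
    """Storage for one full-length keystream per possible IV."""
    return packet_bytes * 2 ** iv_bits


seconds = iv_exhaustion_seconds()
print(f"{seconds:.0f} s = {seconds / 3600:.1f} h")
print(f"{codebook_bytes() / 2 ** 30:.1f} GiB")
```

The result is a little over 18,000 seconds, i.e. roughly five hours, and a codebook of about 23.4 GiB, which matches the rounded 24 GByte on the slide.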
Now it is clear that the shared key length does not affect this sort of attack at all (also see Jesse Walker's "Unsafe at any key length" paper). If P1 is known then P2 is immediately available. Much network traffic contains predictable information, and the attack becomes much easier when three or more packets collide. Certain devices on the market use the IV in a simply predictable way, for instance by incrementing it by one for each packet. Furthermore, the IV value is reset at each startup.
One New York computer security consultant who was quoted in a Wall Street Journal article says he was able to access the computer network of his client, a major financial services firm on Wall Street, while sitting on a bench across the street.
Common wireless sniffing tools are WEPCrack and AirSnort.

Integrity Vulnerability
Encrypted CRC is used to check integrity
But CRC is linear:
CRC(X XOR Y) = CRC(X) XOR CRC(Y)
Thus payload bits can be manipulated, because
RC4_K(X XOR Y) = RC4_K(X) XOR Y
RC4_K(CRC(X XOR Y)) = RC4_K(CRC(X)) XOR CRC(Y)
Attacker can easily modify known bytes of packets (at least L3/L4 header structures are known)

                        payload            CRC
plaintext               011010010101 ...   0110
keystream               100110110010 ...   1100
ciphertext              111100100111 ...   1010
manipulation frame      000010000000 ...   1001
manipulated ciphertext  111110100111 ...   0011 (correct CRC)
Furthermore, WEP is also used to protect the integrity of a frame in combination with the CRC. But the CRC is a linear operation and can therefore be additively decomposed.
Because of this property, an attacker could XOR a plaintext X with another plaintext Y for manipulation purposes and only has to calculate CRC(X) XOR CRC(Y) to get CRC(X XOR Y). Because of the linearity, this operation can also be applied successfully even when the CRC is RC4-encrypted!
Thus the "integrity check" does not prevent packet modification, and an attacker can easily flip bits in packets, modify active streams, or bypass access control. Even partial knowledge of the packet is sufficient if the attacker only wants to modify the known portion.
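The following sketch demonstrates such a modification with Python's zlib.crc32. Two points are assumptions of the sketch: a random byte string stands in for the RC4 keystream, and the sample payload is invented. Note that the practical CRC-32 is affine rather than strictly linear because of its fixed initial and final XOR values, so the patch for the ICV additionally contains the CRC of an all-zero message of the same length.

```python
import os
import zlib


def icv(data: bytes) -> bytes:
    """CRC-32 integrity check value as 4 bytes."""
    return zlib.crc32(data).to_bytes(4, "little")


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def flip_bits(ciphertext: bytes, delta: bytes) -> bytes:
    """Attacker: change the plaintext by delta without knowing the keystream."""
    icv_patch = xor(icv(delta), icv(bytes(len(delta))))
    return xor(ciphertext, delta + icv_patch)


# Sender: encrypt payload plus ICV with a keystream unknown to the attacker
plaintext = b"PAY 0100 EUR"
keystream = os.urandom(len(plaintext) + 4)
ciphertext = xor(plaintext + icv(plaintext), keystream)

# Attacker: knows (or guesses) the plaintext layout and flips the amount
delta = xor(b"PAY 0100 EUR", b"PAY 9100 EUR")
forged = flip_bits(ciphertext, delta)

# Receiver: decrypts and validates the ICV, which still matches
decrypted = xor(forged, keystream)
body, received_icv = decrypted[:-4], decrypted[-4:]
print(body, received_icv == icv(body))
```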

Bit-Flipping Attack Example
Attacker catches and manipulates encrypted frame, updates ICV
AP decrypts frame, validates ICV and forwards frame
Router detects fault and sends predictable error message
Keystream = C'' XOR P''
[Figure: message flow between attacker, AP and router with the frames C', P', P'' and C''.]

Arbaugh Attack
Allows to arbitrarily expand a known keystream of size n
Easily done with known messages (e.g. DHCP discoveries)
Create a message of size n-3 and encrypt it with the known keystream
Only the last byte (4th CRC byte) is not encrypted: trial and error!
On average only 128 trials necessary for every additional byte!
The Arbaugh Attack
Here is a more detailed example to understand the Arbaugh attack:
1. Find an initial keystream S of size n. For example look for DHCP-Discover messages, which have a fixed size and a broadcast MAC destination address. The known plaintext of the DHCP-Discover message consists of a source IP address of 0.0.0.0, a broadcast destination IP address 255.255.255.255, and some other fixed header information. This method reveals 24 bytes of keystream, that is n = 24.
2. Create a message M of size n - 3, that is 21 bytes in our case, for example an ARP request or an ICMP packet.
3. Create the ICV of the message M and append three bytes of it to the message, resulting in a plaintext P.
4. XOR the known keystream with the plaintext: C = P XOR S.
5. Instead of the true fourth byte of the ICV, append a test byte Bi to the ciphertext C. For example the first test byte could be B0 = 0x00. The resulting ciphertext packet is Ci.
6. Send Ci to the AP. If the last byte Bi (i.e. the fourth byte of the ICV) was correctly encrypted, then the AP accepts the packet and the network will send a response. If Bi was wrongly encrypted, then the AP will discard the packet silently. Next try B1 = 0x01, B2 = 0x02, ..., B255 = 0xFF. On average Bi is found after 128 trials.
7. Since the whole ICV is known as plaintext, calculate the unknown keystream byte S25 = Bi XOR ICV4. Remember that Bi = ICV4 XOR S25.
Practically, one could create ICMP echo requests of increasing length. If the frame has been correctly encrypted, there will be an ICMP echo reply. (Remember that the payload of an ICMP packet may have arbitrary length.)
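The steps above can be simulated in a few lines. This is only a sketch under simplifying assumptions: the class SimulatedAP is a hypothetical stand-in for the access point and the responding network, a pseudorandom byte string replaces the RC4 keystream of one IV, and an all-zero message replaces the ARP or ICMP packet.

```python
import random
import zlib


def icv(data: bytes) -> bytes:
    return zlib.crc32(data).to_bytes(4, "little")


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


class SimulatedAP:
    """Hypothetical AP: accepts a frame only if its decrypted ICV is valid."""

    def __init__(self, keystream: bytes):
        self._keystream = keystream

    def accepts(self, frame: bytes) -> bool:
        plain = xor(frame, self._keystream)
        return icv(plain[:-4]) == plain[-4:]


def extend_keystream(known: bytes, ap: SimulatedAP) -> bytes:
    """Recover keystream byte n+1 from n known bytes (steps 2 to 7)."""
    n = len(known)
    message = bytes(n - 3)                  # step 2: any message of n-3 bytes
    plain = message + icv(message)          # step 3: n+1 bytes in total
    cipher = xor(plain[:n], known)          # step 4: last ICV byte still open
    for guess in range(256):                # steps 5 and 6: trial and error
        if ap.accepts(cipher + bytes([guess])):
            return known + bytes([guess ^ plain[n]])   # step 7
    raise RuntimeError("no test byte was accepted")


rng = random.Random(2004)
full_keystream = bytes(rng.getrandbits(8) for _ in range(64))
ap = SimulatedAP(full_keystream)

known = full_keystream[:24]                 # step 1: n = 24 bytes known
while len(known) != 32:
    known = extend_keystream(known, ap)
print(known == full_keystream[:32])
```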

Attacks Summary (1)
Keystream reuse (IV collisions)
Dictionary-building attacks
Allows real-time automated decryption of all traffic
Bit-flipping attacks
Attacker intercepts WEP-encrypted packet, flips bits, recalculates CRC and retransmits forged packet to AP with same IV
Because CRC32 is correct, AP accepts and forwards frame
Layer 3 end device rejects and sends a predictable response
AP encrypts response and sends it to attacker
Attacker uses response to derive key


The WEP attacks presented so far are only the simplest ones. Here is a summary of the most practical attack methods.
Keystream reuse attack:
This method, already described, is typically combined with dictionary building and statistical analysis. In the end the attacker has created a large dictionary containing all keystreams possible with the WEP keys in use and can then perform real-time decryption of all traffic.
Bit flipping attacks:
The attacker could make guesses about the headers of a packet, which typically contain a lot of predictable redundancy. In particular, all that needs to be guessed is the destination IP address. Now the attacker can flip the appropriate bits to transform the destination IP address so that the packet is sent to another machine under the attacker's control. Most wireless networks are connected to the Internet, and the APs will decrypt each packet that is destined for a wired destination. This is also called a redirection attack.
If a guess can be made about the TCP headers of the packet, the attacker could change the destination port to port 80, which will allow it to be forwarded through most firewalls. Note that the IP checksum can be easily spoofed and the TCP checksum is disregarded by the network.
Changing an IP address is relatively simple. Assume the high and low 16-bit words of the original IP address are IPH1 and IPL1, and should be changed to IPH2 and IPL2. If CRC1 is the original IP checksum, then CRC2 = CRC1 - IPH2 - IPL2 + IPH1 + IPL1 in one's complement math.
If the attacker knows CRC1 by some means, he then figures out CRC2 as above and computes CRC1 XOR CRC2 to obtain the bits that must be flipped in the checksum field. Another way is to make guesses about the IP address and see if they work. The TCP reaction attack works by observing how the system reacts to forgeries. A correctly guessed IP will be accepted by the system, while a bad one causes the packet to be dropped into the bit bucket. This only works on TCP packets, because the attacker needs the ACKs that TCP sends (the TCP ACK packet is of a standard size) when the TCP checksum is correct.
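The checksum update for a changed destination address can be verified with a small sketch in one's complement arithmetic. The sample header and the new address 10.0.0.1 are illustrative values only; the function names are chosen here and are not part of any library.

```python
def ones_complement_add(a: int, b: int) -> int:
    """16-bit one's complement addition with end-around carry."""
    total = a + b
    return (total & 0xFFFF) + (total >> 16)


def ip_checksum(words: list[int]) -> int:
    """Checksum over 16-bit header words (checksum field set to zero)."""
    total = 0
    for word in words:
        total = ones_complement_add(total, word)
    return ~total & 0xFFFF


def patch_checksum(crc1: int, old_words: list[int], new_words: list[int]) -> int:
    """CRC2 = CRC1 + old words - new words, in one's complement math."""
    total = ~crc1 & 0xFFFF
    for old, new in zip(old_words, new_words):
        total = ones_complement_add(total, ~old & 0xFFFF)  # subtract old word
        total = ones_complement_add(total, new)            # add new word
    return ~total & 0xFFFF


# Sample header: 192.168.0.1 -> 192.168.0.199, checksum field (word 5) zeroed
header = [0x4500, 0x0073, 0x0000, 0x4000, 0x4011, 0x0000,
          0xC0A8, 0x0001, 0xC0A8, 0x00C7]
crc1 = ip_checksum(header)

# Redirect the packet to 10.0.0.1 (IPH2 = 0x0A00, IPL2 = 0x0001)
crc2 = patch_checksum(crc1, [0xC0A8, 0x00C7], [0x0A00, 0x0001])
bits_to_flip = crc1 ^ crc2
print(hex(crc1), hex(crc2), hex(bits_to_flip))
```

The value bits_to_flip is what the attacker XORs into the encrypted checksum field, in the same way as the address bits themselves.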

Attacks Summary (2)
Fluhrer, Mantin, Shamir (FMS) attack on RC4
RC4 key scheduling is insufficient
The beginning of the pseudorandom stream should be skipped, otherwise some IV values reveal information about the key state
Key can be recovered after several million packets
"WEPplus" = WEP with avoidance of weak IVs
KoreK Attack
Packet manipulation, reinjection and CRC analysis
Key can be recovered after several 100,000 packets
Arbaugh Attack
Calculate arbitrary additional bytes on a known but short keystream
Fluhrer et al. attack:
Some IV values reveal information about the key state, thus the shared keys can be recovered after several million packets. In the RC4 algorithm the Key Scheduling Algorithm (KSA) initializes the cipher state from the IV and the base key. A flaw in the way WEP uses RC4 allows "weak" IVs to occur. The RC4 key scheduling is insufficient: the beginning of the pseudorandom stream should be skipped.
The KoreK attack was first implemented in the tool "ChopChop" and is now part of nearly all WEP cracking tools, such as aircrack or airsnort.
The Arbaugh attack is also an acceleration tool and therefore part of many modern WEP cracking tools.

Interim Solutions: TKIP and MIC
Content
This chapter presents a detailed overview of today's WLAN security problems and solutions.
This subchapter provides an introduction to TKIP and MIC.

802.11i
Two new network types
Transition Security Network (TSN)
Robust Security Network (RSN)
An RSN only allows devices using TKIP/Michael and CCMP
A TSN supports both RSN and pre-RSN (WEP) devices
Problem: broadcast packets have to be transmitted with the weakest common denominator security method
Consider a single client only supporting WEP


Task Group i (TGi) was formed in March 2001 as a split from the MAC Enhancements Task Group (TGe). Its charge was to "enhance the 802.11 Media Access Control (MAC) to enhance security and authentication mechanisms." TGi finished work on the 802.11i standard, and it has been approved.
802.11i defines two WLAN network types: Transition Security Network (TSN) and Robust Security Network (RSN). RSNs only allow devices which support TKIP/Michael and CCMP. TSNs support both RSN devices and legacy pre-RSN, i.e. WEP, devices. The drawback of a TSN is that broadcast packets have to be transmitted with the weakest common denominator security method. If there is a device using WEP in a TSN network, it weakens the security of broadcast traffic for all the devices. RSN is definitely preferred, and getting all networks to use CCMP exclusively is the long-term goal.

802.11i
Message Integrity Check (MIC)
Nonlinear algorithm
Temporal Key Integrity Protocol (TKIP or "WEP2")
Also uses RC4-based WEP without the known flaws
Per-packet keys through IV mixing
Replay protection
Essentially a patch for WEP
Counter Mode CBC MAC (CCMP)
= AES + CBC-MAC
Replaces WEP !!!
(requires new HW support)
Pre-standard 802.11i (WPA)
Ratified 802.11i (WPA2)
First WPA2 certifications already since 1st Sept 2004
Recently, the IEEE 802.11i Security Task Group released two "informative texts" providing WEP hardening: MIC and TKIP. The IEEE 802.11 Task Group "i" is working on standardizing WLAN encryption improvements. Two new network types, called Transition Security Network (TSN) and Robust Security Network (RSN), have been defined.
The Temporal Key Integrity Protocol (TKIP, initially referred to as WEP2) is an interim solution (as part of TSN) that fixes the key reuse problem of WEP. TKIP is a compromise between strong security and the ability to use existing hardware. It still uses RC4, but adds per-packet keys plus replay protection through a keyed packet authentication mechanism (Michael MIC).
TKIP begins with a 128 bit "temporal key" shared among clients and access points. TKIP combines the temporal key with the client's MAC address and then adds a 6-byte IV to produce the key that will encrypt the data. Thus each station uses different keystreams for encryption. TKIP changes keys every 10,000 packets, using a dynamic distribution method.
In the long run (RSN), the IEEE plans to use the Advanced Encryption Standard (AES) instead of the RC4 used by TKIP, combined with Counter Mode - Cipher Block Chaining - Message Authentication Code (CBC-MAC) to provide strong integrity and message authentication. The term "Wireless Robust Authenticated Protocol" (WRAP) is also sometimes used synonymously for this concept.
The Wi-Fi Alliance specified TKIP and MIC as mandatory features of the Wi-Fi Protected Access (WPA) protocol, while AES should be part of WPA2.
Note: Wi-Fi and particular vendors use different TKIP/MIC algorithms, which are not compatible. Even WPA was intended to be an intermediate solution (because the Wi-Fi Alliance only picked a subset of the IEEE 802.11i working draft 3.0).

MIC (as used by WPA)
Encrypted checksum
=> Nonlinear function now
Uses "Michael" algorithm
Much more lightweight than MD5 or SHA
Uses separate 64-bit key
Data Integrity Key (DIK) derived from PTK after WPA key management
AP and STA use different MIC keys (128-bit DIK is split)
[Figure: frame format: MAC Header | DATA | MIC (additional 8 byte) | ICV (Integrity Check Value, 4 byte CRC). DATA, MIC and ICV are RC4 encrypted.]
The Message Integrity Check (MIC) provides data integrity similar to a CRC, but
uses a non-linear operation, the "Michael" algorithm, and is therefore not
vulnerable to manipulation after RC4 encryption in the way a linear CRC is.
The MIC is based on a seed value or a secret key, the destination and source
MAC addresses, and the payload. That is, any change of these values
significantly alters the MIC.
The 802.11i task group felt that other commonly used hashing algorithms such as SHA-1 were
too computation-intensive to calculate on legacy hardware, so they agreed on the simpler Michael
algorithm. Like many hash algorithms, Michael is calculated over the length of the packet, but all
of the scrambling it does is based on shift operations and XOR additions, which are quick to
calculate. Michael uses a key called the Michael key, which is derived during the WPA procedure
(pairwise key).
But according to the 802.11i specification, the Michael algorithm "provides only
weak protection against active attack." Therefore MIC countermeasures have
been specified by 802.11i: 1) logging and 2) disable and deauthenticate. If two
Michael failures occur within one minute, both ends should disable all packet
reception and transmission. In addition, the AP should deauthenticate all stations
and delete all security associations, a rather drastic solution.
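The countermeasure rule described above (log every failure; on two Michael failures within one minute, shut traffic down) can be sketched as a small timer. This is only an illustration: the class name, the method names and the 60-second hold time (taken from the slide that follows) are assumptions, not an excerpt from any driver.

```python
from collections import deque

WINDOW_S = 60.0  # two failures within this window trigger countermeasures
HOLD_S = 60.0    # assumed minimum time during which traffic stays blocked


class MicCountermeasures:
    """Hypothetical helper that tracks Michael MIC failures on one link."""

    def __init__(self):
        self.failures = deque(maxlen=2)  # timestamps of the last two failures
        self.blocked_until = 0.0
        self.log = []

    def on_mic_failure(self, now):
        """Record a failure; return True if countermeasures are triggered."""
        self.log.append(now)  # step 1: logging
        self.failures.append(now)
        if len(self.failures) == 2 and self.failures[1] - self.failures[0] <= WINDOW_S:
            # step 2: disable and deauthenticate (modelled as a block timer)
            self.blocked_until = now + HOLD_S
            self.failures.clear()
            return True
        return False

    def traffic_allowed(self, now):
        return now >= self.blocked_until


cm = MicCountermeasures()
cm.on_mic_failure(0.0)            # isolated failure: only logged
cm.on_mic_failure(100.0)          # more than 60 s later: still only logged
fired = cm.on_mic_failure(130.0)  # second failure within 60 s: block traffic
```

The sketch also makes the denial-of-service concern visible: anyone who can provoke two MIC failures per minute keeps `traffic_allowed` false.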

MIC Problems
Michael algorithm
Provides security level of only 20 bit strength
Attacker can construct forgery after approx. 2^19 tries (520,000 frames)
MIC Countermeasures
Upon two MIC failures within 60 seconds, this AP disassociates all stations for at least 60 seconds and erases current keys in use
So attacker forgery trials become nearly impossible
Typically turned OFF (DoS!!!)
Figure: Key, DA, SA, Payload -> MMH Hash -> 8-byte MIC (WPA)

Cisco MIC (CMIC)
Uses a seed value as pseudo-key
Uses sequence number (AP verifies order)
Figure: Seed, DA, SA, LLC, SNAP, SEQ, Payload -> MMH Hash -> 4-byte MIC (Cisco CMIC)
Frame layout: DATA | MIC (additional 4 bytes) | ICV (4-byte CRC, Integrity Check Value)
Note: The Cisco Message Integrity Check serves the same purpose as the 802.11i
MIC and is in fact stronger than Michael. It is based on Shai Halevi and Hugo
Krawczyk's MMH hashing algorithm.

TKIP (As used by WPA)
Features
Longer and unpredictable IV through IV/key mixing
Encrypted replay protection number (TSC)
WPA TKIP
48 bit IV, includes MAC
Fast S-box mixer
Fresh session keys on every association
Figure (TKIP key mixing): Phase 1 mixes the TX-MAC (48 bits), the TEK (Temporal Encryption Key, 128 bits) and the upper 32 bits of the TKIP Sequence Counter (TSC) into the TKIP mixed Transmit Address and Key (TTAK, 80 bits). Phase 2 combines the TTAK, the TEK and the lower 16 bits of the TSC into the "WEP Seed": a 24-bit IV (padded so as to avoid weak IVs) plus a 104-bit WEP key, which RC4 turns into the key stream.
WPA's TKIP solution complies with the 802.11i proposals and uses fresh session
keys on every association as well as a 48-bit IV space. The mixing functions are based
on substitution boxes (S-boxes), which are computationally very efficient compared
to other hash functions.
The Temporal Encryption Key (TEK) is derived from the "Pairwise Master Key"
(PMK, also called "base key"), which has been negotiated by the WPA key
management protocol. The TEK is used to securely hash a packet counter, the TKIP
Sequence Counter (TSC), and the transmit MAC address. A second hash stage
enhances the security of the S-box principle.
The TSC is split into 16-bit and 32-bit parts. The 16-bit part is padded to 24 bits to
produce a traditional IV. The padding is done in a way that avoids the possibility of
weak IV generation. Interestingly, the 32-bit part is not used for the transmitted IV
generation; instead, it is utilized in the TKIP per-packet key mixing.
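The TSC split and the IV padding can be illustrated with a few lines of Python. The function names are made up for this sketch, and the exact padding rule (second IV byte = (high byte OR 0x20) AND 0x7F) is how the 802.11i text is commonly quoted; treat it as an assumption and check the standard before relying on it.

```python
def split_tsc(tsc):
    """Split a 48-bit TSC into (upper 32 bits, lower 16 bits)."""
    if not 0 <= tsc < 1 << 48:
        raise ValueError("TSC must fit in 48 bits")
    return tsc >> 16, tsc & 0xFFFF


def wep_iv_from_tsc(tsc):
    """Pad the lower 16 TSC bits to a 24-bit WEP-style IV.

    Assumed rule: the inserted middle byte is derived from the high byte
    so that IV patterns known to yield weak RC4 keys cannot occur.
    """
    _, lo16 = split_tsc(tsc)
    hi8, lo8 = lo16 >> 8, lo16 & 0xFF
    return bytes([hi8, (hi8 | 0x20) & 0x7F, lo8])


hi32, lo16 = split_tsc(0x123456789ABC)
iv = wep_iv_from_tsc(0x123456789ABC)
```

Because only the upper 32 bits feed phase 1 of the key mixing, the phase 1 result stays the same for 65,536 consecutive packets: `split_tsc(0xFFFF)[0]` and `split_tsc(0x0)[0]` are equal, while `split_tsc(0x10000)[0]` differs.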
Phase 1 eliminates the use of the same key by all connections, and the second phase
reduces the correlation between the IV and the per-packet key.
The TSC starts at 0 and increases by 1 for each packet. TSCs must be remembered
because they must never repeat for a given key. Each receiver keeps track of the
highest value it has received from each MAC address. If it receives a packet that has
a TSC value lower than or equal to one it has already received, it assumes it is a
replay and drops it. Thus, packets are only accepted in sequence.
TKIP is only a software add-on and can reuse the existing WEP hardware.
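The receiver-side replay rule described above (remember the highest TSC per transmitter and drop anything not strictly larger) can be sketched as follows. The class and method names are hypothetical.

```python
class ReplayFilter:
    """Hypothetical per-transmitter TSC replay check."""

    def __init__(self):
        self.highest = {}  # transmitter MAC -> highest TSC accepted so far

    def accept(self, mac, tsc):
        """Return True if the frame is fresh, False if it must be dropped."""
        if tsc <= self.highest.get(mac, -1):
            return False  # equal or lower TSC: treated as a replay
        self.highest[mac] = tsc
        return True


rx = ReplayFilter()
results = [rx.accept("00:11:22:33:44:55", t) for t in (0, 1, 1, 5, 3)]
```

In this run the third frame (TSC repeated) and the fifth frame (TSC lower than 5) are dropped, while counters of other transmitters are tracked independently.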

TKIP Details
Phase 1
The high-order 32 bits of the TSC are combined with the TA and the first 80 bits
of the TEK.
This phase of the key mixing is an iteration involving inexpensive addition, XOR,
and AND operations, plus an S-box lookup reminiscent of the RC4 algorithm.
These were chosen for their ease of computation on low-end devices such as
APs.
Phase 1 produces an 80-bit value called TKIP mixed Transmit Address and Key
(TTAK). Note that the only input of this phase that changes between packets is
the TSC. Because it uses the high-order bits, it only changes every 64K packets.
Phase 1 can thus be run infrequently and use a stored TTAK to speed up
processing. The inclusion of the transmitter's MAC address is important to allow
a pair of stations to use the same TEK and TSC values and not repeat RC4 keys.
Phase 2
Now the TTAK from phase 1 is combined with the full TEK and the full TSC.
This phase again uses inexpensive operations, including addition, XOR, AND,
OR, bit-shifting, and an S-box.
The output is a 128-bit WEP seed that will be used as the RC4 key in the same
manner as traditional WEP.
In the phase 2 algorithm, the first 24 bits of the WEP seed are constructed from
the TSC in a way that avoids certain classes of weak RC4 keys.
BTW: TKIP was designed as a 5 year interim solution only! Obviously it will be
used much longer than intended.

Cisco TKIP ("CKIP")
Simple proprietary solution
Still uses 24 bit IV but calculates per-packet WEP keys from IV
Hash-based mixer
Figure: Base WEP Key + IV -> HASH -> Packet Key; IV + Packet Key -> RC4 -> KEY STREAM
Because of urgent security demands from the market, Cisco developed a proprietary
"Cisco KIP" (CKIP), which is based on hashing the static WEP key together
with the 24-bit IV to obtain the actual packet key.
Cisco's solution also provides per-packet keys, but it is recommended to use
WPA's TKIP because:
WPA's TKIP is computationally more efficient.
It is more secure, because of the PMK involved.
The dynamic RC4 key space is much bigger compared to CKIP.
Nearly all important vendors support WPA.

Security
Against rumors, TKIP is reasonably safe!
For each packet, the 48-bit IV is mixed with the 128-bit PTK to create a 104-bit RC4 key
There is practically no statistical correlation
Estimated one weak-IV per century (!)
Countermeasures against traffic re-injection
Sequence numbers + MIC
Robust 4-way handshake
Only problem: WPA-PSK
Which uses a specified passphrase to PMK mapping => good passphrase required !!!
Otherwise dictionary attack possible


The estimated interval between weak-IV frames with TKIP is about a century,
so by the time a cracker has collected the necessary 3,000 or more interesting IV
frames, he or she would be 300,000 years old. [Found somewhere: CHECK!]
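The passphrase-to-PMK mapping mentioned on the slide is defined in IEEE 802.11i as PBKDF2 with HMAC-SHA1, the SSID as salt, 4096 iterations and a 256-bit output. A minimal sketch using only the Python standard library follows; the function name and the sample passphrases are illustrative assumptions.

```python
import hashlib


def wpa_psk_to_pmk(passphrase, ssid):
    """Map a WPA-PSK passphrase to a 256-bit PMK (PBKDF2-HMAC-SHA1, 4096 rounds)."""
    return hashlib.pbkdf2_hmac(
        "sha1",
        passphrase.encode("utf-8"),
        ssid.encode("utf-8"),  # the SSID acts as salt
        4096,
        32,
    )


pmk = wpa_psk_to_pmk("correct horse battery staple", "ExampleNet")


def dictionary_attack(target_pmk, ssid, candidates):
    """Try candidate passphrases offline; return the match or None."""
    for word in candidates:
        if wpa_psk_to_pmk(word, ssid) == target_pmk:
            return word
    return None


found = dictionary_attack(pmk, "ExampleNet", ["password", "correct horse battery staple"])
```

The mapping is deterministic and involves no secret other than the passphrase, which is why an attacker who knows the SSID can test a word list offline. (A real attack verifies candidates against a captured 4-way handshake rather than against the PMK itself; comparing PMKs here just keeps the sketch short.)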

AES and CCMP
Content
This chapter presents a detailed overview of today's WLAN security problems and
solutions.
This subchapter provides an introduction to AES and CCMP.

802.11i
Message Integrity Check (MIC)
Nonlinear algorithm
Temporal Key Integrity Protocol (TKIP or "WEP2")
Also uses RC4-based WEP without the known flaws
Per-packet keys through IV mixing
Replay protection
Essentially a patch for WEP
Counter Mode CBC MAC (CCMP)
= AES + CBC-MAC
Replaces WEP !!!
(requires new HW support)
Pre-standard 802.11i - TSN (WPA)
Ratified 802.11i - RSN (WPA2)
First WPA2 certifications already since 1st Sept 2004

WPA2 aka 802.11i
Exactly the same as WPA1 except...
CCMP (AES in counter mode) instead of RC4
HMAC-SHA1 instead of HMAC-MD5 for the EAPoL MIC
Against rumors WPA2 is only a LITTLE better than WPA1
But neither will be cracked in the near future !!!
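How secure is AES compared to RC4?
RC4 as used here has a key length of up to 128 bits, whereas AES supports keys of
up to 256 bits, that is, an AES key can be 128 bits longer. (CCMP itself uses
128-bit AES keys.) If only brute-force attacks are assumed (the algorithms
themselves being safe enough) and Moore's law is taken into account (computing
power doubles every 18 months), then every additional key bit buys another 18
months. A key that is 128 bits longer therefore puts AES 128 * 18 months ahead of
RC4, that is roughly 190 years, and in any case far more than 10 years.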

802.11i: CCMP - Overview
AES for data encryption (privacy)
128-bit block cipher
No per-packet keying needed
HW-realization recommended
Key-life determined by 48-bit IV
AES requires a feedback mode
To avoid the risks associated with the trivial Electronic Codebook (ECB) mode
Repeating patterns are not hidden
Not recommended for messages longer than one block !
The IEEE is still deciding which feedback mode to standardize for AES encryption - two choices:
Counter Mode CBC MAC (CCM)
Provides encryption, authenticity and integrity
Applied on both header and data
IV also used to prevent replay attacks
WLAN's current favourite
Offset Codebook (OCB) mode
Problem: patented
Also supported by some WLAN vendors
The 802.11i standard was finished in May 2004 and approved in June 2004. The
main result, WPA2, includes support for a more robust encryption algorithm
(CCMP: AES in Counter mode with CBC-MAC) to replace TKIP, and
optimizations for handoff (reduced number of messages in the initial key
handshake, pre-authentication, and PMKSA caching).
The Advanced Encryption Standard (AES) is considered a state-of-the-art
encryption method. It was designed recently, uses Rijndael as its algorithm, and is
the official successor of DES and 3DES. This 128-bit block cipher is considered
unbreakable for the next ten years or so.
CCM is actually the block cipher mode of AES that provides both encryption
and authentication. It is a combination of counter-mode encryption and CBC-MAC
authentication, two modes that have been studied extensively for
many years. CCM was developed as a non-patented alternative to OCB ("Offset
Codebook") for use in secure wireless networks, but it can be used in almost any
situation that requires secure communications. With CCM encryption and
authentication
Links:
Rijndael description and algorithm:
http://csrc.nist.gov/CryptoToolkit/aes/rijndael/
AES Lounge:
http://www.iaik.tu-graz.ac.at/research/krypto/AES/

Cipher Block Chaining (CBC)
No patent
Encryption and MAC use different nonces
Collision attacks possible but sufficient mitigation when key management provides frequent key changes
Identical ciphertext blocks result only when:
Same key and
Same plaintext and
Same IV is used
CBC is self-synchronizing
If an error (including loss of one or more entire blocks) occurs in block c_j but not c_{j+1}, then c_{j+2} is correctly decrypted to x_{j+2}.
Although CBC mode decryption recovers from errors in ciphertext blocks,
modifications to a plaintext block x_j during encryption alter all subsequent
ciphertext blocks. This impacts the usability of chaining modes for applications
requiring random read/write access to encrypted data.
An exposed IV might allow a man-in-the-middle (MITM) to change the IV value
in transit. Changing the IV changes only the deciphered plaintext of the first
block, without garbling the second block. Any or all bits of the first block
plaintext can be changed systematically with complete control.
The most obvious way to prevent deliberate MITM changes to the first block
plaintext via the IV is to encipher the IV; that prevents an opponent from
changing plaintext bits systematically.
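Both CBC properties discussed above (recovery after a damaged ciphertext block, and controlled changes to the first plaintext block via the IV) can be tried out with a short sketch. Since the Python standard library has no AES, the block cipher here is a toy 4-round Feistel construction built from SHA-256; it is only a stand-in and must not be used for real encryption. All names and sample messages are made up.

```python
import hashlib


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def _f(key, rnd, half):
    # Round function of the toy cipher (stand-in for a real block cipher)
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:8]


def toy_encrypt(key, block):
    l, r = block[:8], block[8:]
    for rnd in range(4):
        l, r = r, xor(l, _f(key, rnd, r))
    return l + r


def toy_decrypt(key, block):
    l, r = block[:8], block[8:]
    for rnd in reversed(range(4)):
        l, r = xor(r, _f(key, rnd, l)), l
    return l + r


def cbc_encrypt(key, iv, blocks):
    out, prev = [], iv
    for x in blocks:
        prev = toy_encrypt(key, xor(x, prev))  # c_j = E(x_j XOR c_{j-1})
        out.append(prev)
    return out


def cbc_decrypt(key, iv, blocks):
    out, prev = [], iv
    for c in blocks:
        out.append(xor(toy_decrypt(key, c), prev))  # x_j = D(c_j) XOR c_{j-1}
        prev = c
    return out


key, iv = b"demo key", bytes(16)
plain = [b"PAY 0000100 EUR ", b"TO ACCOUNT 12345", b"REFERENCE ABCDEF"]
cipher = cbc_encrypt(key, iv, plain)

# 1) Self-synchronization: damage c_0, then c_2 still decrypts correctly.
damaged = [xor(cipher[0], b"\x01" + bytes(15))] + cipher[1:]
recovered = cbc_decrypt(key, iv, damaged)

# 2) IV manipulation: the attacker XORs the wanted change into the IV.
wanted = b"PAY 9999999 EUR "
evil_iv = xor(iv, xor(plain[0], wanted))
tampered = cbc_decrypt(key, evil_iv, cipher)
```

After the first experiment only blocks 0 and 1 are affected; after the second, the first block reads `PAY 9999999 EUR ` while the remaining blocks are untouched, which is exactly why an unprotected IV is dangerous.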

Counter Mode (CCM)
Instead of directly encrypting the data only a counter is encrypted
Message is then XORed with this encrypted counter
Counter = nonce (SQNR, Source-MAC, Priority fields)
WPA2 supports FIPS 140-2 compliant security, basically AES in counter mode.
(An early draft included AES-OCB instead, but it was dropped due to patent
issues.) A 48-bit IV protects against replay attacks.
Authentication and integrity are maintained using an 8-byte CBC-MAC with a
48-bit nonce. Besides the data, the source and destination MAC addresses in the
header are also protected by the CBC-MAC. (These fields are called Additional
Authentication Data, AAD.)
The CBC-MAC, the nonce, and an additional 2 bytes of IEEE 802.11 overhead make
the CCMP packet 16 octets larger than an unencrypted IEEE 802.11 packet.
The AP advertises cipher suites both in beacons and probe responses.
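The counter-mode idea from the slide (encrypt a counter, then XOR the result with the message) can be sketched as follows. Counter mode only needs the forward direction of the block cipher, so this sketch substitutes HMAC-SHA256 for AES because the standard library has no AES; that substitution, the nonce layout (priority, source MAC, sequence number) and all names are assumptions for illustration only.

```python
import hashlib
import hmac


def keystream_block(key, nonce, counter):
    # Stand-in for AES_K(nonce || counter), truncated to one 16-byte block
    msg = nonce + counter.to_bytes(2, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()[:16]


def ctr_xcrypt(key, nonce, data):
    """Encrypt or decrypt: the same XOR operation works in both directions."""
    out = bytearray()
    for offset in range(0, len(data), 16):
        ks = keystream_block(key, nonce, offset // 16 + 1)
        out.extend(b ^ k for b, k in zip(data[offset:offset + 16], ks))
    return bytes(out)


key = b"\x00" * 16
priority = b"\x00"
src_mac = bytes.fromhex("020000000001")
seq = (1).to_bytes(6, "big")  # 48-bit packet number
nonce = priority + src_mac + seq

message = b"counter mode needs no padding"
ciphertext = ctr_xcrypt(key, nonce, message)
roundtrip = ctr_xcrypt(key, nonce, ciphertext)
```

The ciphertext has the same length as the message, and the sequence number inside the nonce guarantees a fresh keystream per packet as long as it never repeats for a given key.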

Offset Code Book (OCB)
Patented
Combines authentication and encryption
Slightly faster than CBC encryption
More prone to collision attacks than CBC-MAC
If a particular collision on 128-bit values occurs, then an attacker can modify the message without being detected by the OCB authentication function
Weak authentication algorithm - uses same nonce for encryption and authentication
In order to limit the probability of a successful forgery attempt to less than 2^-64 change the key after 2^32 blocks of data
Indeed strong enough for many people but does not justify 128-bit AES as successor of DES
AES-OCB is a mode that augments the normal encryption process
by incorporating an offset value.
The routine is initiated with a unique nonce (a 128-bit number) that is used
to generate an initial offset value. The nonce is XORed
with a 128-bit string (referred to as value L).
The output of the XOR is AES-encrypted with the AES key, and the result is the
offset value.
The plaintext data is XORed with the offset and is then
AES-encrypted with the same AES key.
The output is then XORed with the offset once again. The
result is the ciphertext block to be transmitted.
The offset value changes after processing each block by XORing
the offset with a new value of L.
See http://www.cs.ucdavis.edu/~rogaway/ocb/index.html

OCB Algorithm
Convention: Message M, Key K, Nonce N
Define [formula] from which the offset follows.
Then the message is split into M_1, ..., M_m, where only M_m is typically a non-128 bit block. The messages M_1, ..., M_{m-1} are encrypted as follows: [formula]
While M_m is encrypted using [formula], with [symbol] denoting the length of this block: [formula]
The authentication is performed in two steps:
[formula] ... "Checksum"
[formula] ... "MAC Tag" of arbitrary length, depending on security vs. transmission cost trade-off. Typically 32..80 (documentation)
C_m 0* ... last ciphertext block padded with zeros to full 128 bit length

802.11 Standard Authentication
Content
This chapter presents a detailed overview of today's WLAN security problems and
solutions.
This subchapter provides an introduction to the 802.11 standard authentication
methods.
Objective
After completing this chapter you should be able to:
Highlight the design flaws of the WLAN standard authentication
Explain the design idea of 802.1x
Compare EAP-TLS, LEAP, PEAP, EAP-TTLS, EAP-FAST with each other
and emphasize important security features
Explain the design concept of WPA and WPA2
Implement a reliable 802.1x infrastructure over a WAN connection
List important issues to be considered when choosing a VPN design
Explain PSPF

802.11 Standard Authentication Methods
Open System Authentication
Anyone is granted access
Ideal for transient users
Default method
All frames sent in clear, even when WEP is enabled
Shared Key Authentication
Relies on WEP algorithm
Every user has same shared key, and same as AP
Only client device authentication
User is not authenticated (device theft critical)
AP is not authenticated (!)
Vulnerable.
Figure (Open System): Initiator -> Responder: Authentication request; Responder -> Initiator: Authentication result (OK)
Figure (Shared Key): Initiator -> Responder: Authentication request; Responder -> Initiator: Challenge and IV; Initiator -> Responder: WEP encrypted response; Responder -> Initiator: Authentication result
Open System Authentication allows anyone to gain access to the WLAN. It is generally
applicable where public access should be provided, for example in universities, airports, or hotels.
The authentication process is realized using "management" frames with "authentication" as
subtype. Specifically, the open system method is indicated using an algorithm identification field.
Shared Key Authentication uses the WEP algorithm to implement a four-step handshake
procedure, provided that each user has the same shared key. Shared Key Authentication only
enables client authentication, so the client can never be sure whether the AP is a "rogue" AP.
Furthermore, WEP is vulnerable, and hence this authentication process can be attacked.
This four-step procedure requires WEP support on both sides. It is assumed that both sides
possess the same shared key. The initiator sends an authentication request management frame
indicating that it wishes to use "shared key authentication". The responder replies by sending an
authentication management frame containing a 128-octet challenge text. This challenge text is
generated by using the WEP pseudo-random number generator (PRNG) with the "shared secret"
and a random IV. The initiator receives the challenge and the IV and sends a WEP-encrypted
version of the challenge back to the responder, using the shared secret and the IV. The
responder decrypts the received frame and verifies the 32-bit CRC integrity check and that the
challenge text matches the one it sent. If so, the authentication is successful
and the responder completes the process by sending the authentication result. Optionally, the
initiator and the responder switch roles and repeat the process to ensure mutual authentication.
However, mutual authentication is seldom implemented. The value of the status code field is set to
zero when successful, and to an error value if unsuccessful. The element identifier indicates that
the challenge text is included. The length field gives the length of the challenge text and is
fixed at 128. The challenge text includes the random challenge string.
Besides the WEP design flaws, the whole authentication is tied to the device identity, not the user's
identity. That is, a stolen device can be abused to gain access to the WLAN.

Shared Key Authentication
Attacker captures 2nd and 3rd authentication message and has
Plaintext P (the challenge)
Ciphertext C = RC4_K(P)
The keystream is simply S = C XOR P
Other fields than the challenge are known a priori
Have always the same value in each authentication process
Possessing S, an attacker can correctly respond to each challenge
Never use Shared Key Authentication !!!
Figure: Initiator -> Responder: Authentication request; Responder -> Initiator: Challenge and IV; Initiator -> Responder: WEP encrypted response; Responder -> Initiator: Authentication result
Never use Shared Key Authentication
An attacker can easily capture the 2nd and 3rd authentication messages and then
possesses a plaintext (the challenge) and the corresponding ciphertext. Remember
that the keystream S can be easily calculated by XORing both messages.
Other fields (besides the challenge) are rather static and can be guessed; they
always have the same values in each authentication process.
Having S, an attacker can easily authenticate to the network, as he is able to
correctly respond to each challenge sent by a responder.
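The attack described above fits in a few lines. The sketch below implements plain RC4, lets a legitimate station answer one challenge, and then shows how an eavesdropper who never learns the key answers a fresh challenge by reusing the recovered keystream together with the captured IV. Frame headers and the ICV are left out for brevity, and the key, IV and variable names are made up.

```python
import os


def rc4(key, n):
    """Return n bytes of RC4 keystream for the given key."""
    s = list(range(256))
    j = 0
    for i in range(256):  # key scheduling
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    i = j = 0
    out = bytearray()
    for _ in range(n):  # keystream generation
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


secret = b"\x01\x02\x03\x04\x05"  # 40-bit WEP key, unknown to the attacker
iv = b"\xaa\xbb\xcc"              # sent in the clear with the response

# Legitimate handshake, observed by the attacker
challenge = os.urandom(128)                       # message 2 (plaintext P)
response = xor(challenge, rc4(iv + secret, 128))  # message 3 (ciphertext C)

# Attacker: S = C XOR P
keystream = xor(challenge, response)

# Later, the attacker is challenged and answers with the same IV
new_challenge = os.urandom(128)
forged_response = xor(new_challenge, keystream)

# Responder side: decrypt with IV || secret and compare
accepted = xor(forged_response, rc4(iv + secret, 128)) == new_challenge
```

`accepted` is true although the attacker never touched `secret`: because the station chooses the IV of its response, replaying the captured IV makes the responder regenerate exactly the keystream the attacker already holds.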

802.1x and EAP Authentication
Content
This chapter presents a detailed overview of today's WLAN security problems and
solutions.
This subchapter provides an overview of 802.1x authentication and various EAP
protocols.

802.1x Authentication - Intro
Port-based network access control method utilizing IETF's Extensible Authentication Protocol (EAP)
Supports mutual authentication between client and AP
Dynamic WEP/TKIP key distribution and refresh
Only for unicast traffic
Each client has its own key, as long as AP has enough key slots
Session lifetime
But static and shared broadcast key
Either pre-configured or automatically assigned after authentication
Centralized user credential management via RADIUS
Various client credentials supported
(Fast) L2 roaming support (possible)
The IEEE is working on a supplement to the 802.1d standard which will define the changes
necessary to the operation of a MAC layer bridge in order to provide port-based network access
control capability. This standard is known as 802.1x and has been adopted by the 802.11i working
group.
802.1x provides port-based access control, that is, a special authentication mechanism is used to
switch a bridge port or the AP from an unauthorized state into an authorized state. Only the latter
state allows traffic other than 802.1x traffic.
Using 802.1x, a wireless client that associates with an AP cannot gain access to the network until
the user performs a network logon or provides other strong credentials. Practically, when the user
enters a username and password into a network logon dialog box or its equivalent, the client and
an authentication server, a RADIUS server, perform a mutual authentication. Additionally, the
RADIUS-based authentication server (AS) allows centralized user credential management.
Note that the AP acts as a pass-through device, while the actual authentication process is performed
by the authentication server. The authentication server and client then derive a client-specific
WEP/TKIP key to be used by the client for the current logon session. User passwords and session
keys are never transmitted in the clear over the wireless link.
The whole authentication process is conducted by the Extensible Authentication Protocol
(EAP), which has been defined in RFC 2284 as a PPP extension. Note that EAP is only a
meta-authentication protocol. EAP initiates the process and carries the actual authentication protocol,
for example the Transport Layer Security (TLS) protocol and others. Most of them provide a
session identifier and therefore allow seamless handover between access points without the need
for re-authentication.
Note that 802.1x can only negotiate per-user session keys for unicast transmission. A single static
broadcast key must also be configured on an access point for 802.1x clients to receive broadcast
and multicast messages. This is typically performed automatically.
Reauthentication can be easily realized, because each AP can ask the central AS whether the client
is already authenticated. This principle supports fast roaming (even better if there is a caching
instance in between).
Wi-Fi's WPA also requires 802.1x as authentication method.
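The port-based access control principle described above (an unauthorized port passes nothing but 802.1x traffic) can be modelled in a few lines. The class and method names are hypothetical; the EtherType 0x888E is the EAPoL value that also appears on the protocol-layer slide.

```python
ETHERTYPE_EAPOL = 0x888E  # 802.1x (EAPoL) frames
ETHERTYPE_IPV4 = 0x0800


class ControlledPort:
    """Hypothetical model of one 802.1x-controlled port (or AP association)."""

    def __init__(self):
        self.authorized = False  # every port starts in the unauthorized state

    def on_auth_result(self, success):
        """Called when the authentication server reports success or failure."""
        self.authorized = bool(success)

    def admits(self, ethertype):
        """Unauthorized ports only pass 802.1x traffic."""
        return self.authorized or ethertype == ETHERTYPE_EAPOL


port = ControlledPort()
before = (port.admits(ETHERTYPE_EAPOL), port.admits(ETHERTYPE_IPV4))
port.on_auth_result(True)
after = (port.admits(ETHERTYPE_EAPOL), port.admits(ETHERTYPE_IPV4))
```

Before authentication only EAPoL frames get through; after a successful result all traffic is admitted, and a later failed re-authentication would close the port again.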

What is EAP?
Extensible: allows to develop and deploy new authentication protocols easily
No SW update on authenticator (AP) needed
Only supplicant and AS server need to be updated
See RFC 2284
Figure (protocol layers): authentication methods (TLS, MD5, AKA/SIM, TTLS, PEAP, FAST, LEAP) on top of EAP; EAP is carried over PPP, 802.3 or 802.11 via 802.1x ("EAPoL" or "EAPoW"), or over RADIUS / UDP / IP / 802.3
802.1x relies on EAP as the underlying authentication protocol carrier. EAP is
extensible, as it allows new authentication protocols to be developed and deployed
easily without changing the AP software. That is, EAP can be imagined as a container
for authentication schemes.
The picture above shows the layers involved. EAP itself is either carried by a
layer-2 protocol such as 802.3 ("EAP over LAN", EAPoL) or 802.11 ("EAP over
Wireless", EAPoW), or by RADIUS ("EAP over RADIUS").
In order to be carried over RADIUS, the EAP information is decomposed into
information elements; additionally, new Attribute Value Pairs (AVPs) had to
be defined ("eap-radius").
See RFC 2284 for further details.

802.1x - Protocol Layers
Authenticator (AP) blocks access until client is authenticated
Only accepts Ethertype 0x888E (EAPoL)
802.1x frames are sent to multicast DA = 01-80-C2-00-00-03
Authenticator translates 802.1x to UDP/IP
Figure (protocol stacks): Supplicant [EAP's Authentication Method / EAP / 802.1x / 802.11] <- EAP over Wireless (EAPoW) or EAP over LAN (EAPoL) -> Authenticator (802.11 AP) [802.1x / 802.11 towards the client; RADIUS / UDP/IP / 802.3 towards the server] <- EAP over RADIUS -> Authentication Server (e.g. Cisco ACS) [EAP's Authentication Method / EAP / RADIUS / UDP/IP / 802.3]
Each 802.1x-based authentication consists of three participants:
1. The client, who is called "Supplicant"
2. The "Authenticator", which is actually the AP
3. An "Authentication Server" which must support eap-radius.
The Supplicant and the Authentication Server authenticate each other,
but this handshake is intercepted by the Authenticator, which forwards these
messages to the endpoints.
Of course the Authenticator must be authenticated as well. This is typically done by
a shared secret between the Authenticator and the Authentication Server.
Note: The Authenticator is basically an 802.1x-to-UDP "bridge".
When an 802.1x-capable host starts up, it will initiate the authentication phase
by sending the EAPoL-Start 802.1x protocol data unit (PDU) to the reserved
IEEE multicast MAC address (01-80-C2-00-00-03) with the Ethernet type or
length set to 0x888E.
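As an illustration of the EAPoL-Start PDU just described, the sketch below assembles the bytes of such a frame for wired Ethernet: destination 01-80-C2-00-00-03, EtherType 0x888E, followed by the 4-byte EAPoL header (version, packet type 1 = Start, body length 0). The function name and the source MAC address are placeholders; on 802.11 the same EAPoL payload would follow an LLC/SNAP header instead of an Ethernet II header.

```python
import struct

PAE_GROUP_ADDR = bytes.fromhex("0180c2000003")  # reserved IEEE multicast address
ETHERTYPE_EAPOL = 0x888E
EAPOL_TYPE_START = 1  # 0 = EAP-Packet, 1 = Start, 2 = Logoff, 3 = Key


def eapol_start_frame(src_mac, version=1):
    """Build an Ethernet II frame carrying an EAPoL-Start PDU (no FCS)."""
    ethernet = PAE_GROUP_ADDR + src_mac + struct.pack("!H", ETHERTYPE_EAPOL)
    eapol = struct.pack("!BBH", version, EAPOL_TYPE_START, 0)  # empty body
    return ethernet + eapol


frame = eapol_start_frame(bytes.fromhex("020000000001"))
```

The resulting frame is 18 bytes long (14 bytes of Ethernet header plus 4 bytes of EAPoL header); a real NIC would pad it to the minimum Ethernet frame size and append the FCS.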

802.1x - EAP Concept
Participants: Supplicant, Authenticator (802.11 AP), Authentication Server (e.g. Cisco ACS)
Figure (message sequence):
Client associates with AP
AP blocks all traffic
User provides authentication credentials; credentials forwarded via RADIUS -> User authenticated
AS provides authentication credentials; credentials forwarded via EAPoW -> RADIUS Server authenticated
Both ends derive unicast WEP key
AS: Send unicast WEP key to AP
AP creates broadcast WEP key
AP: Send broadcast WEP key encrypted with unicast WEP key to client
AP accepts WEP encrypted packets
This picture illustrates the basic concept of 802.1x and EAP.
1) When the AP receives an association request from the client, the AP requires
the client to authenticate via EAP.
2) The client sends his/her authentication credentials. The AP cannot verify these
by itself but forwards them to a preconfigured authentication server (a
RADIUS server).
3) The RADIUS server finds this user in its database and verifies the correctness
of the associated credentials.
4) During this EAP negotiation, the client can also authenticate the RADIUS
server, and by doing this, the AP is authenticated implicitly as well (because the
AP and the RADIUS server are bound via a shared secret).
5) Both client and RADIUS server determine a unicast WEP key.
6) The RADIUS server sends this unicast WEP key to the AP.
7) The AP creates a random broadcast WEP key, encrypts it using the unicast
WEP key and forwards it to the client.
8) Now, the client and the AP can communicate using WEP encrypted packets.
The AP will decrypt each correctly encrypted packet from the client and will
forward it to the wired LAN.
Note: Cisco supports broadcast key rotation. After a certain amount of time the
AP dynamically distributes a new broadcast key to the clients. Obviously this
feature is only possible when EAP is enabled.

802.1x - EAP Protocol
Participants: Supplicant, Authenticator (802.11 AP), Authentication Server (e.g. Cisco ACS)
Supplicant <-> Authenticator: EAP over LAN (EAPoL) / EAP over Wireless (EAPoW); Authenticator <-> Authentication Server: EAP over RADIUS
Figure (message sequence):
802.11 ASSOC Request (Open)
802.11 ASSOC Response
EAPoW Start
EAP Request ID
EAP Response ID -> RADIUS Access Request (EAP)
RADIUS Access Challenge (EAP) -> EAP Request Method
EAP Response Method -> RADIUS Access Request (EAP)
RADIUS Access Accept (EAP), with MPPE attributes for keys -> EAP SUCCESS
EAP 4-Way Key-exchange Handshake
Original 802.1x used single EAPoW key message.
New improved 802.1x (802.1aa) uses a 4-way handshake to prevent MITM attacks.
EAP provides an envelope that can carry many different kinds of authentication
types: challenge/response, one-time passwords (OTPs), SecurID tokens, digital
certificates, etc. What exactly happens between "EAP Start" and "EAP Success"
depends upon the type of authentication being used.
The original 802.1x standard used a single EAPoW Key message to distribute the
session keys, but the new improved 802.1x (called 802.1aa) uses a four-way
handshake to prevent man-in-the-middle attacks that might otherwise
compromise these keys. After both ends of the wireless association (the client
and the AP) have session keys, data sent over the air can be encrypted to prevent
eavesdropping.
One of the most important benefits will be felt by users who are roaming within
an organization's wireless LAN and require a seamless connection. If they are
asked to authenticate themselves each time they pass from one conference room
to another, they will want to give up security in favor of convenience.
Using the connection re-establishment mechanism provided by the TLS
handshake, users can have one seamless connection while roaming between
different APs connected to the same backend server. If the session ID is still
valid, the wireless client and server can share previously negotiated secrets to
establish a new handshake and keep the connection alive.
Additionally, secure session timeouts trigger re-authentication and new WEP
(TKIP) keys.

802.1x - EAP-TLS (1)
First secure 802.1x realization, EAP method 13 (RFC 2716)
Relies on Transport Layer Security (TLS)
Successor of SSL version 3.0, adopted by IETF
Both clients and AS authenticated via certificates
Only TLS authentication and tunnel establishment procedure (tunnel not used)
TLS also used to derive link-layer key between endpoints
Problems:
Client identity is not protected
No fast session reconnection
Need for PKI (practical: certificate stored in token card or similar)
Prerequisite for WPA certification
Until May 2005 the only required EAP method for WPA
Figure (message sequence): EAPoW Start; EAP ID Request; EAP ID Response -> RADIUS Access Request (EAP); TLS Authentication (Client Certificate, Server Certificate)
The TLS Working Group was established in 1996 to standardize a "transport
layer" security protocol. The working group began with SSL version 3.0, and in
1999, RFC 2246, the "TLS Protocol Version 1.0", was published as a Proposed
Standard. The working group has also published RFC 2712, "Addition of
Kerberos Cipher Suites to Transport Layer Security (TLS)", as a Proposed
Standard, and two RFCs on the use of TLS with HTTP.
EAP-TLS was the first widely implemented 802.1x method for WLANs.
EAP-TLS supports session expiration and 802.1x re-authentication by using the
RADIUS session timeout option (RADIUS IETF attribute 27, Session-Timeout).
To avoid IV reuse (IV collisions), the base WEP key is rotated
before the IV space is exhausted.
However, several problems mean that EAP-TLS is seldom used:
Every client needs a certificate. This is only affordable if a PKI is available
providing a CA for certificate management, revocation and so on.
In the first three messages of the EAP starting sequence, the user ID is revealed. This
is considered privacy-critical today.
Fast session reconnection (for VoIP) is not possible.
Remember: A certificate is a cryptographically signed structure that guarantees
the association between at least one identifier and a public key.
Note: The name in the client certificate must be the same username as in the AS
user database. This is one important reason to choose a private Root-CA, besides
the advantage of more control.

802.1x - EAP-TLS (2)
After each re-authentication a new session key can be generated based on
the same master key
Note: TLS details omitted in the picture
Such as record details (server_key_exchange, change_cipher_spec, ...)
[Figure: EAP-TLS exchange (EAP-Type = EAP-TLS)]
ClientHello: Random_1, Session_ID
ServerCertificate, ServerHello: Random_2, Session_ID
ClientCertificate
Pre-masterSecret (encrypted with server's public key)
Both client and AS compute:
MasterSecret = PRF (Pre-masterSecret, Random_1, Random_2, "master secret")
Session Key = PRF (MasterSecret, Random_1, Random_2, "client EAP encryption")
Authenticator MAY choose subsequent keying material
(encryption keys, MAC-keys, and IV) from this session key
(for example using the 1st 32-byte block as encryption key,
the 2nd 32-byte block as MAC-key and so on.)
The SessionID can be used for fast re-authentication purposes.
As part of the TLS handshake between the server and the client, the client
generates a pre-master secret, encrypts it with the server's public key, and
sends it to the AS. Another option would be to use a Diffie-Hellman exchange
to derive the pre-master secret.
The pre-master secret, the server and client random values, and the "master secret"
string are used to generate a master secret per session. The Pseudo Random
Function (PRF) is then applied again to the master secret, the client and server
random values, and the "client EAP encryption" string to generate the 128-bit
session keys, Message Authentication Code (MAC) keys and initialization values
(for block ciphers only).
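The two derivation steps above can be sketched in a few lines of Python. This is a minimal sketch, assuming the PRF is the TLS 1.0 PRF from RFC 2246 (P_MD5 XOR P_SHA-1) and that 128 bytes of key material are requested; the function names, the output length and the sample inputs are illustrative assumptions, not taken from the slides.

```python
import hmac


def p_hash(hash_name: str, secret: bytes, seed: bytes, length: int) -> bytes:
    """TLS 1.0 P_hash: chain HMAC outputs until `length` bytes are available."""
    out = b""
    a = seed  # A(0) = seed
    while len(out) < length:
        a = hmac.new(secret, a, hash_name).digest()            # A(i) = HMAC(secret, A(i-1))
        out += hmac.new(secret, a + seed, hash_name).digest()  # HMAC(secret, A(i) + seed)
    return out[:length]


def tls10_prf(secret: bytes, label: bytes, seed: bytes, length: int) -> bytes:
    """TLS 1.0 PRF: P_MD5 over the first half of the secret XOR P_SHA1 over the second."""
    half = (len(secret) + 1) // 2
    s1, s2 = secret[:half], secret[len(secret) - half:]
    md5_part = p_hash("md5", s1, label + seed, length)
    sha_part = p_hash("sha1", s2, label + seed, length)
    return bytes(x ^ y for x, y in zip(md5_part, sha_part))


def derive_eap_tls_keys(pre_master: bytes, random_1: bytes, random_2: bytes,
                        key_material_len: int = 128):
    """Derive the master secret and the EAP-TLS key material (length is an assumption)."""
    randoms = random_1 + random_2
    master = tls10_prf(pre_master, b"master secret", randoms, 48)
    session = tls10_prf(master, b"client EAP encryption", randoms, key_material_len)
    return master, session


# Dummy inputs for illustration only; real values come from the TLS handshake.
master, session = derive_eap_tls_keys(bytes(range(48)), b"\x01" * 32, b"\x02" * 32)
encryption_key, mac_key = session[:32], session[32:64]  # split as suggested on the slide
```

Because both sides know the pre-master secret and both random values, the client and the AS can run this computation independently and arrive at the same keys.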
Note that both the client and the AS independently derive the session keys.
However, the length of the session key is determined by the authenticator (the
AP) and is sent to the client in the EAPoL key message at the end of the EAP
authentication.
A TLS session is governed by a security context, which consists of the session
identifier, peer certificate, compression method, cipher spec for the session key,
MAC algorithm parameters, and the shared master secret.
TLS sessions expire after some time and the AS can be notified via RADIUS.
EAP-TLS is natively supported in Mac OS X 10.3 and above, Windows 2000 SP4,
Windows XP, Windows Mobile 2003 and above, and Windows CE 4.2.

802.1x - LEAP
Cisco's lightweight implementation
Fast Secure Roaming (< 150 ms)
Challenge-response based on shared secrets
Implemented similarly to MS-CHAPv2 (two-stage MD4 hashing of passwords)
Can utilize existing Windows NT Domain Services authentication databases
as well as Windows 2000 Active Directory databases
No support for LDAP and NIS
Drivers for Windows 95, 98, Me, 2000, NT and XP; uses the Windows logon
as the Cisco LEAP logon
Also Linux and Mac support
Vulnerable to dictionary attacks
Secure if strong passwords are enforced (10 chars at minimum)
Cisco's Lightweight EAP (LEAP) implementation is widely deployed because of its simplicity,
as it is based on shared user secrets. Furthermore, only LEAP supports fast secure roaming,
which is necessary if low-delay applications are used (e.g. VoIP).
Cisco has developed drivers for most versions of Microsoft Windows (Windows 95, 98, Me,
2000, NT and XP) and uses the Windows logon as the Cisco LEAP logon.
A software shim in the Windows logon allows the username and password information to be
passed to the Cisco Aironet client driver. The driver converts the password into a Windows
NT key and hands the username and Windows NT key to the Cisco NIC. The NIC executes 802.1x
transactions with the AP and the authentication, authorization, and accounting (AAA) server.
Note: Neither the password nor the password hash is ever sent across the wireless medium.
Additionally, any Open Database Connectivity (ODBC) data source that uses MS-CHAP passwords
can also be used with LEAP.
Note: If an AS is used for both Cisco LEAP and MAC authentication, the MAC address should
use a different strong password for the required MS-CHAP/CHAP field. If not, an eavesdropper
can spoof a valid MAC address and use it as a username and password combination for Cisco
LEAP authentication.
Note: The LEAP key generation mechanism is proprietary, and a new key is generated at every
(re)authentication, thus achieving key rotation. The session timeout in RADIUS allows for
periodic key rotation, which protects against sniffing and cracking the keys. The RADIUS
exchanges for LEAP include a couple of Cisco-specific attributes in the RADIUS messages.
To avoid IV reuse (IV collisions), LEAP rotates the base WEP key before the IV space is
exhausted.
Note: LEAP is only as strong as the passwords used. Therefore it is vulnerable to dictionary
attacks. Passwords of at least 10 characters should be used.
Implementation details of LEAPv1: CHALLENGE_LEN = 8, RESPONSE_LEN = 24,
KEY_LEN = 16 [bytes]

LEAP / MSCHAPv2 Flaws
AS sends 8 byte challenge
Client encrypts challenge 3 times using NT hash of the password as DES seed (= key)
DES requires a 7 byte seed value in this algorithm
So client splits 16 byte NT hash into three portions:
Seed1 = B1 .. B7
Seed2 = B8 .. B14
Seed3 = B15, B16, 0x00, 0x00, 0x00, 0x00, 0x00
Flaw: third DES output is cryptographically weak, leaving only 2^16 possible permutations
After B15 and B16 are known, we can significantly reduce the number of potential
matches in our dictionary file, using the known 2 bytes of the user's hash as a
keying mechanism
The 8-byte challenge is encrypted three times, using Seed3 for the third DES
encryption. Since the attacker knows the challenge and the encrypted response, a
simple brute-force attack over the 2^16 possible values quickly recovers Seed3.
Now the search duration in the attacker's dictionary file can be significantly
reduced. Assuming that this dictionary file has been prepared such that it already
contains the NT hash of each password, the lookup algorithm only needs to consider
hashes whose bytes 15 and 16 match the recovered Seed3.
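The seed split and the dictionary shortcut can be illustrated with a small Python sketch. It deliberately leaves out the DES step and uses an MD5 digest as a stand-in for the NT hash, since MD4 is not reliably available in the standard library; the function names and the word list are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def split_nt_hash(nt_hash: bytes) -> Tuple[bytes, bytes, bytes]:
    """Split a 16-byte hash into the three 7-byte DES seeds described above."""
    if len(nt_hash) != 16:
        raise ValueError("hash must be exactly 16 bytes")
    seed1 = nt_hash[0:7]                   # B1 .. B7
    seed2 = nt_hash[7:14]                  # B8 .. B14
    seed3 = nt_hash[14:16] + b"\x00" * 5   # B15, B16 and five zero bytes
    return seed1, seed2, seed3


def stand_in_hash(password: str) -> bytes:
    """Placeholder 16-byte digest; NOT the real NT hash, which is MD4 based."""
    return hashlib.md5(password.encode("utf-16-le")).digest()


def index_by_tail(words: List[str]) -> Dict[bytes, List[str]]:
    """Group dictionary words by bytes 15 and 16 of their (stand-in) hash."""
    index: Dict[bytes, List[str]] = {}
    for word in words:
        index.setdefault(stand_in_hash(word)[14:16], []).append(word)
    return index


words = ["test", "secret", "password", "letmein", "cisco"]
index = index_by_tail(words)

# Pretend the attacker has already brute-forced Seed3 (only 2**16 candidates).
recovered_tail = split_nt_hash(stand_in_hash("test"))[2][:2]
candidates = index[recovered_tail]  # only these entries need the full check
print(candidates)
```

With a realistic dictionary, the two recovered bytes cut the number of candidates by a factor of roughly 65536 on average, which is why the zero-padded third seed is such a serious weakness.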

Asleap
Offline attack on LEAP
Principle:
LEAP performs unencrypted MSCHAPv2 (challenge-handshake)
Asleap captures challenge and encrypted reply and performs an offline dictionary attack
Written by Joshua Wright
http://asleap.sourceforge.net/
Also see Leapcrack
[Figure: example of Asleap cracking the password "test"]
A good policy should require a password length of at least 12 characters,
including numbers, mixed case, and punctuation. It should also require that
passwords be based neither on words found in any dictionary nor on any variant
of the username.
There are cracking dictionaries for hundreds of languages and for commonly used
words, such as names of places, people, and movies. Usually the only way to
enforce strong passwords is with tools that check passwords at creation time.
Users are good at choosing easy-to-remember passwords and tend to ignore
unenforced rules. It is a good idea to run regular, automated password cracking on
your organization's passwords and to warn users or disable accounts with bad
passwords. Your organizational environment determines what strength of
password enforcement and what frequency of password changes are acceptable to
your user community.
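A creation-time check along the lines of the policy above could look like the following sketch. The function name, the messages and the tiny word list are hypothetical; the dictionary and username tests are simple exact or substring matches, so a real tool would need to test variants as well.

```python
import string
from typing import List, Set


def check_password(password: str, username: str, dictionary_words: Set[str],
                   min_length: int = 12) -> List[str]:
    """Return the list of policy violations; an empty list means the password passes."""
    problems = []
    if len(password) < min_length:
        problems.append("too short")
    if not any(c.isdigit() for c in password):
        problems.append("no digit")
    if not (any(c.islower() for c in password) and any(c.isupper() for c in password)):
        problems.append("no mixed case")
    if not any(c in string.punctuation for c in password):
        problems.append("no punctuation")
    lowered = password.lower()
    if lowered in dictionary_words:
        problems.append("dictionary word")
    if username and username.lower() in lowered:
        problems.append("contains username")
    return problems


print(check_password("test", "alice", {"test", "password"}))
print(check_password("xK9#mQ2$vL7!pR", "alice", {"test", "password"}))
```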

802.1x - EAP-TTLS
Created by Funk and Certicom (Internet draft)
EAP method 21
Widely implemented, also Linux support; but no Cisco support
Supports ANY inner authentication method
Any EAP method
As well as older methods such as CHAP, PAP, MS-CHAP and MS-CHAPv2
[Figure: basic idea of EAP-TTLS: an outer EAP (EAP-TTLS, TLS using server certificates) carries AVPs with PAP, CHAP, MSCHAP, MSCHAPv2, ...]
EAP-TTLS was developed by Funk Software and Certicom, and was first
supported by Agere Systems, Proxim, and Avaya. Today EAP-TTLS is being
considered by the IETF as a new standard.
The structures of Tunneled TLS (TTLS) and PEAP are quite similar. Both are
two-stage protocols that establish security in stage one and then exchange
authentication in stage two.
Stage one of both protocols establishes a TLS tunnel and authenticates the
authentication server to the client with a certificate. Once that secure channel has
been established, client authentication credentials are exchanged in the second
stage.

802.1x - EAP-TTLS
RADIUS-like AVPs between client and server
Client certificate not required but user has two identities:
1. An anonymous identity such as "anonymous@example.com" and
2. The real identity, which is only sent encrypted, such as "user342@example.com".
Client identity protected by TLS
Fast session reconnect (but too slow for VoIP)
[Figure: detailed layering: PAP, CHAP, MSCHAP or MSCHAPv2 carried in AVPs, inside TLS, inside EAP, over Ethernet or RADIUS]
Unlike PEAP, EAP-TTLS supports any authentication method, not only
EAP methods. Therefore, there is no inner EAP session; instead, RADIUS-like AVPs
are used to carry the authentication data.
EAP-TTLS often uses PAP (also with Linux).
As with PEAP, user identity information is protected.
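To make the "RADIUS-like AVPs" more concrete, the sketch below encodes and decodes attribute-value pairs. It assumes the Diameter-style AVP header that the EAP-TTLS specification uses (4-byte code, 1-byte flags, 3-byte length, optional vendor ID, data padded to a 4-byte boundary); the helper names are hypothetical, and any password padding the specification requires for PAP is not shown.

```python
import struct
from typing import List, Tuple

AVP_USER_NAME = 1       # same attribute numbers as in RADIUS
AVP_USER_PASSWORD = 2
FLAG_VENDOR = 0x80      # V bit: a 4-byte vendor ID follows the header
FLAG_MANDATORY = 0x40   # M bit


def encode_avp(code: int, data: bytes, mandatory: bool = True) -> bytes:
    """Encode one AVP without vendor ID; the length covers header + data, not padding."""
    flags = FLAG_MANDATORY if mandatory else 0
    length = 8 + len(data)
    header = struct.pack("!IB", code, flags) + length.to_bytes(3, "big")
    return header + data + b"\x00" * (-length % 4)


def decode_avps(buf: bytes) -> List[Tuple[int, bytes]]:
    """Decode a sequence of AVPs into (code, data) pairs."""
    avps = []
    offset = 0
    while offset < len(buf):
        code, flags = struct.unpack_from("!IB", buf, offset)
        length = int.from_bytes(buf[offset + 5:offset + 8], "big")
        header_len = 12 if flags & FLAG_VENDOR else 8
        if length < header_len:
            raise ValueError("invalid AVP length")
        avps.append((code, buf[offset + header_len:offset + length]))
        offset += length + (-length % 4)  # skip padding to the next 4-byte boundary
    return avps


# The real identity travels inside the TLS tunnel as a User-Name AVP.
payload = encode_avp(AVP_USER_NAME, b"user342@example.com")
print(decode_avps(payload))
```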

802.1x - Other EAP Choices
More than 44 EAP types already defined
EAP-AKA: username and password (UMTS systems)
EAP-MD5: No dynamic WEP keys, no mutual authentication, dictionary attacks possible (EAP method 4)
EAP-GTC: Generic Token Card (EAP method 6), no mutual authentication
PEAP-GTC: Cisco's PEAP method
EAP-SIM: Used for SIM-card based devices (3GPP, also known as EAP-GSM)
EAP-SRP: Secure Remote Password
...
EAP-FAST: Successor of LEAP
See dedicated section
PEAP-EAP-TLS
Another Microsoft solution similar to EAP-TLS
There are other EAP methods which are currently not so important in the 802.11
WLAN world.
EAP-AKA works similarly to LEAP. AKA stands for Authentication and Key
Agreement. It is also used with HTTP Authentication and GSM. See
draft-arkko-pppext-eap-aka-12.txt for details.
EAP-MD5 does not support mutual authentication and is not strong enough, although some
vendors use it with WLAN devices.
EAP-GTC is typically only used as the inner EAP method of PEAP. In this case it is often
called "PEAP-GTC".
EAP-SIM is used by 3GPP applications (GSM and UMTS). SIM stands for Subscriber
Identity Module.
EAP-SRP (Secure Remote Password) is a method used by some vendors, mainly
Orinoco.
WPA note: EAP-MD5, EAP-GTC, EAP-OTP, and EAP-MSCHAPv2 cannot be used
alone with WPA. They can only be used as inner authentication algorithms with
EAP-PEAP and EAP-TTLS.
Microsoft supports another form of PEAPv0 (which Microsoft calls PEAP-EAP-TLS)
that Cisco and other third-party server and client software don't support.
PEAP-EAP-TLS does require a client-side digital certificate located on the client's
hard drive or on a more secure smartcard. PEAP-EAP-TLS is very similar in operation to
the original EAP-TLS but provides slightly more protection, because the
portions of the client certificate that are unencrypted in EAP-TLS are encrypted in
PEAP-EAP-TLS.
Since few third-party clients and servers support PEAP-EAP-TLS, users should
probably avoid it unless they only intend to use Microsoft desktop clients and servers.

EAP Types Overview
1-6 Assigned by RFC
1 Identity
2 Notification
3 Nak (response only)
4 MD5-Challenge
5 One-Time Password (OTP)
6 Generic Token Card (GTC)
7-8 Not assigned
9 RSA Public Key Authentication
10 DSS Unilateral
11 KEA
12 KEA-VALIDATE
13 EAP-TLS
14 Defender Token (AXENT)
15 RSA Security SecurID EAP
16 Arcot Systems EAP
17 EAP-Cisco Wireless (LEAP)
18 Nokia IP SmartCard authentication
19 SRP-SHA1 Part 1
20 SRP-SHA1 Part 2
21 EAP-TTLS
22 Remote Access Service
23 UMTS Authentication and Key Agreement
24 EAP-3Com Wireless
25 PEAP
26 MS-EAP-Authentication
27 Mutual Authentication w/Key Exchange (MAKE)
28 CRYPTOCard
29 EAP-MSCHAP-V2
30 DynamID
31 Rob EAP
32 SecurID EAP
33 EAP-TLV
34 SentriNET
35 EAP-Actiontec Wireless
36 Cogent Systems Biometrics Authentication EAP
37 AirFortress EAP
38 EAP-HTTP Digest
39 SecureSuite EAP
40 DeviceConnect EAP
41 EAP-SPEKE
42 EAP-MOBAC
43 EAP-FAST
44-191 Not assigned; can be assigned by IANA on the advice of a designated expert
192-253 Reserved; requires standards action
254 Expanded types
255 Experimental usage
This list is just for reference.

PEAP

802.1x using PEAP
Created by Cisco and Microsoft
Similar to EAP-TTLS
Open standard
EAP method 25
Since the third EAP message is always sent in the clear
Client may send a routing realm instead of the user identity to protect the user identity
[Figure: basic idea of PEAP: an outer EAP (EAP-PEAP, TLS using server certificates) carries an inner EAP (EAP-MSCHAPv2, username/password)]
Protected EAP (PEAP) has been developed by Cisco and Microsoft and is only
available on the newest Microsoft platforms (XP).
PEAP is a two-stage protocol that establishes a secure TLS tunnel which carries an
inner EAP session.
PEAP only supports EAP-type authentication. Microsoft proposes MS-CHAPv2,
while Cisco prefers Generic Token Cards (EAP-GTC). Cisco differentiates "v0"
and "v1" while Microsoft only knows "PEAP", which means PEAPv0 and only
supports MSCHAPv2. Cisco's implementation "v0" also supports EAP-SIM,
while "v1" also supports EAP-GTC.
The main advantage of PEAP is that client certificates are not necessary.
PEAP supports MS-DB but not LDAP-DB.
The PEAP result is the so-called Compound Session Key (CSK), which is
actually a concatenation of the Master Session Key (MSK), which is 64 bytes,
and the Extended Master Session Key (EMSK), which is also 64 bytes.
The MSK and EMSK are defined in RFC 3748 (also known as RFC 2284bis) as
follows:
MSK: Key derived between the peer and the EAP server and exported to the
authenticator.
EMSK: Additional keying material derived between the peer and the EAP server. Unlike
the MSK, it is not shared with the authenticator. It is reserved for future use and not
defined in the current RFC. In addition, the PEAP key mechanisms are designed for future
extensibility; the exchange sequences (and choreographies) and formats can be used for
handling any key material; binding inner, outer, and other intermediate methods; and
verifying the security between the layers that are required for future algorithms.

Version Overview
PEAPv0
Supported since Windows XP SP1
Microsoft proposes MS-CHAPv2
EAP method 29
PEAPv1
Cisco's proposal: EAP-GTC
EAP method 6
PEAPv2
Latest draft
Security updates and more features
Various cipher-suites supported
MITM protection through "crypto-binding"
Note:
PEAPv0 and PEAPv1 both refer to the outer authentication method, the mechanism
that creates the secure TLS tunnel to protect subsequent authentication transactions, while
EAP-MSCHAPv2, EAP-GTC, and EAP-SIM refer to the inner authentication method,
which performs the user or device authentication. PEAPv0 supports the inner EAP methods
EAP-MSCHAPv2 and EAP-SIM, while PEAPv1 supports the inner EAP methods EAP-GTC
and EAP-SIM. Since Microsoft only supports PEAPv0 and doesn't support PEAPv1,
Microsoft simply calls PEAPv0 "PEAP" without the v0 or v1 designator. Another
difference between Microsoft and Cisco is that Microsoft only supports
PEAPv0/EAP-MSCHAPv2 mode but not PEAPv0/EAP-SIM mode.
However, Microsoft supports another form of PEAPv0 called PEAP-EAP-TLS that
Cisco and other third-party server and client software don't support. PEAP-EAP-TLS
does require a client-side digital certificate located on the client's hard drive or on a more
secure smartcard. PEAP-EAP-TLS is very similar in operation to the original EAP-TLS
but provides slightly more protection, because the portions of the client certificate
that are unencrypted in EAP-TLS are encrypted in PEAP-EAP-TLS. Since few
third-party clients and servers support PEAP-EAP-TLS, users should probably avoid it unless
they only intend to use Microsoft desktop clients and servers.

PEAP as Pipe Model
Only supports EAP-type authentication
Client certificate not required
Fast session reconnect (but too slow for VoIP)
Version 2 still in development
[Figure: PEAP detailed: inner EAP (MSCHAPv2 or GTC, or EAP-TLS, ...) carried in TLVs inside TLS, inside the outer EAP (PEAP), over Ethernet or RADIUS]
Security Claims of PEAPv2
Intended use: Wireless or wired networks, and over the Internet, where
physical security cannot be assumed.
Auth. mechanism: Use arbitrary EAP and TLS authentication mechanisms for
authentication of the client and server.
Ciphersuite negotiation: Yes
Mutual authentication: Yes. Depends on the type of EAP method used within
the tunnel and the type of authentication used within TLS.
Integrity protection: Yes
Replay protection: Yes
Confidentiality: Yes
Key derivation: Yes
Key strength: Variable
Dictionary attack prot.: Not susceptible
Fast reconnect: Yes
Crypt. binding: Yes
Acknowledged S/F: Yes
Session independence: Yes
Fragmentation: Yes
State synchronization: Yes [80211Req]

PEAPv2 Layers
In PEAPv2 Part 1
Outer-TLVs are used to help establish the TLS tunnel, but no Inner-TLVs are used
In PEAPv2 Part 2
TLS records may encapsulate zero or more Inner-TLVs, but no Outer-TLVs
EAP packets used within tunneled EAP authentication methods are carried within Inner-TLVs
[Figure: layering. Part 1: EAP / PEAP / TLS plus optional Outer-TLVs. Part 2: EAP / PEAP / TLS / Inner-TLVs (EAP-Payload TLV) / EAP]
The TLS v1.0 mandatory-to-implement ciphersuite TLS_DHE_DSS_WITH_3DES_EDE_CBC_SHA
must be supported.
For lightweight devices, other TLS cipher suites are also supported.
PEAPv2 clients and servers SHOULD support:
TLS_RSA_WITH_3DES_EDE_CBC_SHA
TLS_RSA_WITH_RC4_128_MD5
TLS_RSA_WITH_RC4_128_SHA
TLS_RSA_WITH_AES_128_CBC_SHA

PEAPv2: Provisioning of Credentials
Provisioning inside a server-authenticated TLS tunnel
Provisioning inside a server-unauthenticated TLS tunnel
If the TLS tunnel cannot be validated by the client (lacking required credentials), the client may instead rely on the inner EAP method
Although this reduces deployment costs, MITM attacks are possible!
An implementation is therefore optional and not recommended
Unfortunately, many people use PEAP inside a "server-unauthenticated TLS
tunnel". This is a supported method, but it conflicts with the initial idea of
secure-tunnel authentication.
Therefore always install appropriate root certificates!

PEAPv2
Cipher suites other than certificate-based ones are also supported
E.g. DH-based
If certificates are sent by the server
The client only verifies whether the server possesses the corresponding private key
The client does not need to validate via the trust anchor (CA)
If the validation of the server certificate fails (because the private key
validation fails or the certificate parameters are invalid), then the
"provisioning inside a server-unauthenticated TLS tunnel" mode must not be entered.

PEAPv2 - MITM Protection
A sequence of zero or more inner EAP authentication methods can be negotiated
Crypto-Binding TLVs must be sent in the PEAP success/failure (Result TLV) messages
In a sequence, a Crypto-Binding TLV must also be sent by both parties after each EAP method
The server should not reveal any sensitive data to the client until after the Crypto-Binding TLV has been properly verified!
Note that with every EAP method, there must be a final EAP success/failure
indication sent in the clear to inform the authenticator.
An early Cisco solution to the MITM problem with pre-PEAPv2 versions is to
force the client to choose a PEAP trust anchor. That is, the client must select a
root certificate issuer from a list. If the certificate offered by the server cannot be
validated via the pre-selected trust anchor, the authentication process stops.
Unfortunately, "any" can also be selected.

Crypto-Binding TLVs
PEAPv2 derives keys by combining keys from TLS and the inner EAP methods
The Crypto-Binding TLV calculation includes
The first two Outer-TLVs messages sent by both peer and EAP-server (used for TLS tunnel establishment)
The EAP-Type (= set to PEAP) sent in the first two messages by both peer and EAP-server
Outer-TLVs SHOULD NOT be included in other PEAP packets since there is no
mechanism to detect modification.
For subsequent packets (after the first two), a modification of the cleartext EAP
Type will likely result in failure anyway; hence it is not included in the
Crypto-Binding calculation.

DoS Attacks
Theoretically possible if the attacker
Can modify unprotected fields in the PEAP packet such as the EAP protocol or PEAP version number
Can modify protected fields in a packet to cause decode errors

PEAPv2 - Other Features
Fast session resumption
Using the "sessionID" of the TLS protocol and the Server-Identifier TLV in PEAP
Server may send a Server-Identifier TLV to give the client a hint which sessionID should be used (protected by MAC)
If too much time has elapsed since the previous authentication, the server will not allow the continuation
The inner authentication may or may not be skipped!
TLS compression must be supported
PEAPv2 "fast reconnect" is desirable in applications such as wireless roaming,
since it minimizes interruptions in connectivity. It is also desirable when the
"inner" EAP mechanism requires user interaction. The user
should not be required to re-authenticate herself, using biometrics, token cards or
similar, every time the radio connectivity is handed over between access points in
wireless environments.
Since PEAPv2 Part 1 may not provide client authentication, establishment of a
TLS session (and an entry in the TLS session cache) does not by itself provide an
indication of the peer's authenticity. Implementations that do not remove TLS
session cache entries after a failed PEAPv2 Part 2 authentication or failed
protected termination MUST use other means than successful TLS resumption as
the indicator of whether the client is authenticated or not. TLS resumption MUST
only be enabled if the implementation supports TLS session cache removal!
If an EAP server implementing PEAPv2 removes TLS session cache entries of
peers failing PEAPv2 Part 2 authentication, then it MAY skip the PEAPv2 Part 2
conversation entirely after a successful session resumption, successfully
terminating the PEAPv2 conversation.

PEAPv2 Fragmentation
A single TLS message may consist of multiple TLS records
A single TLS record may be up to 16384 bytes in length
A TLS certificate message may in principle be as long as 16 MByte
Fragmentation needed
RADIUS cannot handle such long messages
Multilink PPP (MRRU LCP) method supported on Ethernet/802.3
But there's no PPP in 802.11 which could negotiate that
PEAPv2 defines its own fragmentation support
DoS attacks (reassembly lockup) can be mitigated by setting a maximum size for
one group of TLV messages (e.g. 64 KB)
Fragmentation support is not that easy. It requires sequence numbers, ACKs and
NAKs (fortunately already provided by EAP), several flags such as (M)ore
fragments and (S)tart, and a length field.
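The interplay of the "more fragments" flag, the length field and a reassembly size limit can be sketched as follows. This is a simplified model, assuming flag values borrowed from EAP-TLS (0x80 for "length included", 0x40 for "more fragments"); the `Fragment` class and function names are hypothetical, and the per-fragment ACKs of the real protocol are omitted.

```python
from dataclasses import dataclass
from typing import List

FLAG_LENGTH_INCLUDED = 0x80   # assumed flag values, as used by EAP-TLS
FLAG_MORE_FRAGMENTS = 0x40
MAX_MESSAGE_SIZE = 64 * 1024  # reassembly limit suggested on the slide


@dataclass
class Fragment:
    flags: int
    total_length: int  # only meaningful when FLAG_LENGTH_INCLUDED is set
    data: bytes


def fragment(message: bytes, mtu: int) -> List[Fragment]:
    """Split a message into fragments of at most `mtu` payload bytes."""
    chunks = [message[i:i + mtu] for i in range(0, len(message), mtu)] or [b""]
    fragments = []
    for i, chunk in enumerate(chunks):
        flags = 0
        if i == 0:
            flags |= FLAG_LENGTH_INCLUDED   # first fragment announces the total size
        if i < len(chunks) - 1:
            flags |= FLAG_MORE_FRAGMENTS    # every fragment but the last sets M
        fragments.append(Fragment(flags, len(message) if i == 0 else 0, chunk))
    return fragments


def reassemble(fragments: List[Fragment]) -> bytes:
    """Rebuild the message, refusing oversized or inconsistent fragment groups."""
    first = fragments[0]
    if not first.flags & FLAG_LENGTH_INCLUDED:
        raise ValueError("first fragment must carry the total length")
    if first.total_length > MAX_MESSAGE_SIZE:
        raise ValueError("announced message size exceeds the reassembly limit")
    buf = bytearray()
    for frag in fragments:
        buf += frag.data
        if len(buf) > first.total_length:
            raise ValueError("more data received than announced")
        if not frag.flags & FLAG_MORE_FRAGMENTS:
            break
    if len(buf) != first.total_length:
        raise ValueError("message incomplete")
    return bytes(buf)


message = bytes(range(256)) * 10          # 2560 bytes, e.g. a certificate chain
parts = fragment(message, 1000)
print(len(parts), reassemble(parts) == message)
```

Checking the announced length before buffering anything is what bounds the memory an attacker can tie up with a bogus fragment group.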

PEAPv2 Key Derivation
New keys are derived from the TLS master secret to protect the conversation
within the PEAPv2 tunnel
Since normal TLS keys are used in the handshake, they should not be used in a different context
Combines key material from the TLS exchange with key material from inner
key-generating EAP methods
To bind inner authentication mechanisms to the TLS tunnel
The input for the cryptographic binding includes the following:
[a] The PEAPv2 tunnel key (TK) is calculated using the first 40 octets of the (secret) key material
generated as described in the EAP-TLS algorithm ([RFC2716] Section 3.5). More explicitly, the
TK is the first 40 octets of the PRF as defined in [RFC2716]:
PRF(master secret, "client EAP encryption", random)
where random is the concatenation of client_hello.random and server_hello.random.
[b] The first 32 octets of the MSK provided by each successful inner EAP method, for each
successful EAP method completed within the tunnel.
ISK1..ISKn are the MSK portions of the EAP keying material obtained from methods 1 to n. The
ISKj shall be the first 32 octets of the generated MSK of the jth EAP method. If the MSK length is
less than 32 octets, it shall be padded with 0x00's to ensure the MSK is 32 octets. Similarly, if no
keying material is provided for the EAP method, then ISKj shall be set to zero (e.g. 32 octets of
0x00).
The PRF algorithm is based on the PRF from IKEv2 shown below ("|" denotes concatenation):
K = Key, S = Seed, LEN = output length, represented as binary in a single octet.
PRF (K, S, LEN) = T1 | T2 | T3 | T4 | ... where:
T1 = HMAC-SHA1 (K, S | LEN | 0x01)
T2 = HMAC-SHA1 (K, T1 | S | LEN | 0x02)
T3 = HMAC-SHA1 (K, T2 | S | LEN | 0x03)
T4 = HMAC-SHA1 (K, T3 | S | LEN | 0x04)
...
The intermediate combined key is generated after each successful EAP method inside the tunnel.
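The iterated HMAC-SHA1 construction and the ISK padding rule translate directly into Python. This is a minimal sketch of the formulas above; the function names `peap_prf` and `isk_from_msk` are hypothetical, and how TK and the ISKs are finally combined into the compound keys is not shown.

```python
import hashlib
import hmac
from typing import Optional


def peap_prf(key: bytes, seed: bytes, length: int) -> bytes:
    """PRF(K, S, LEN) = T1 | T2 | ... with Ti = HMAC-SHA1(K, T(i-1) | S | LEN | i)."""
    if not 0 < length < 256:
        raise ValueError("LEN must fit into a single octet")
    len_octet = bytes([length])
    out = b""
    t = b""          # T0 is empty, so T1 = HMAC-SHA1(K, S | LEN | 0x01)
    counter = 1
    while len(out) < length:
        t = hmac.new(key, t + seed + len_octet + bytes([counter]), hashlib.sha1).digest()
        out += t
        counter += 1
    return out[:length]


def isk_from_msk(msk: Optional[bytes]) -> bytes:
    """First 32 octets of the inner method's MSK, zero padded; all zero if no key."""
    if msk is None:
        return b"\x00" * 32
    return msk[:32].ljust(32, b"\x00")


tunnel_key = b"\x0a" * 40                      # dummy TK for illustration only
isk1 = isk_from_msk(b"\x0b" * 64)              # inner method exported a 64-byte MSK
isk2 = isk_from_msk(None)                      # inner method without keying material
derived = peap_prf(tunnel_key, b"example seed" + isk1, 60)
print(len(derived), isk2 == bytes(32))
```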

Crypto-Binding TLV
The Crypto-Binding TLV is used to prove that both peers participated in the
sequence of authentications
That is, the TLS session and inner EAP methods that generate keys
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|M|R|       TLV Type (12)       |          Length (56)          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Reserved    |    Version    | Received Ver. |   Sub-Type    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~      Nonce (32 bytes; temporally unique;                      ~
|      used for compound MAC key derivation at each end)        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Compound MAC                          |
~ (Computed using the HMAC-SHA1-160 keyed MAC that provides 160 ~
|               bits of output using the CMK key)               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Crypto-Binding TLV MUST be used to perform cryptographic binding
after each successful EAP method in a sequence of EAP methods is complete in
PEAPv2 Part 2. The Crypto-Binding TLV can also be used during Protected
Termination.
The Crypto-Binding TLV must carry the version number received during the
PEAP version negotiation. The receiver of the Crypto-Binding TLV must verify
that the version in the Crypto-Binding TLV matches the version it sent during the
PEAP version negotiation. If this check fails, the TLV is invalid.
The receiver of the Crypto-Binding TLV must verify that the subtype is not set to
any value other than the ones allowed. If this check fails, the TLV is invalid.
This message format is used for the Binding Request (B1) and also the Binding
Response. This uses TLV type CRYPTO_BINDING_TLV. PEAPv2
implementations MUST support this TLV, and this TLV cannot be responded to
with a NAK TLV.
The MAC is computed over:
1. The entire Crypto-Binding TLV attribute with the MAC field zeroed out.
2. The EAP Type sent by the other party in the first PEAP message.
3. All the Outer-TLVs from the first PEAP message sent by the EAP server to the peer.
If a single PEAP message is fragmented into multiple PEAP packets, then the
Outer-TLVs in all the fragments of that message MUST be included.
4. All the Outer-TLVs from the first PEAP message sent by the peer to the EAP
server. If a single PEAP message is fragmented into multiple PEAP packets, then
the Outer-TLVs in all the fragments of that message MUST be included.
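The TLV layout and the compound MAC computation can be sketched as below. This is an illustration under assumptions: the four MAC inputs are simply concatenated in the order listed above, the compound MAC key (CMK) is taken as given, and the version, sub-type and Outer-TLV values are dummy data; the function names are hypothetical.

```python
import hashlib
import hmac
import struct

TLV_TYPE_CRYPTO_BINDING = 12
TLV_LENGTH = 56          # 4 fixed octets + 32-byte nonce + 20-byte compound MAC
MAC_LEN = 20             # HMAC-SHA1-160 output
EAP_TYPE_PEAP = 25


def build_tlv(version: int, received_version: int, sub_type: int, nonce: bytes,
              mac: bytes = b"\x00" * MAC_LEN, mandatory: bool = True) -> bytes:
    """Serialize a Crypto-Binding TLV; the MAC field defaults to all zeros."""
    if len(nonce) != 32 or len(mac) != MAC_LEN:
        raise ValueError("nonce must be 32 bytes and MAC 20 bytes")
    type_field = (0x8000 if mandatory else 0) | TLV_TYPE_CRYPTO_BINDING  # M bit + type
    header = struct.pack("!HHBBBB", type_field, TLV_LENGTH, 0,
                         version, received_version, sub_type)
    return header + nonce + mac


def compound_mac(cmk: bytes, tlv_with_zero_mac: bytes, eap_type: int,
                 server_outer_tlvs: bytes, peer_outer_tlvs: bytes) -> bytes:
    """HMAC-SHA1 over the zeroed TLV, the EAP type and both sets of Outer-TLVs."""
    data = tlv_with_zero_mac + bytes([eap_type]) + server_outer_tlvs + peer_outer_tlvs
    return hmac.new(cmk, data, hashlib.sha1).digest()


def seal(cmk, version, received_version, sub_type, nonce, eap_type, srv_tlvs, peer_tlvs):
    """Build the TLV and fill in the compound MAC."""
    zeroed = build_tlv(version, received_version, sub_type, nonce)
    mac = compound_mac(cmk, zeroed, eap_type, srv_tlvs, peer_tlvs)
    return zeroed[:-MAC_LEN] + mac


def verify(cmk, tlv, eap_type, srv_tlvs, peer_tlvs) -> bool:
    """Recompute the MAC over the TLV with its MAC field zeroed and compare."""
    zeroed = tlv[:-MAC_LEN] + b"\x00" * MAC_LEN
    expected = compound_mac(cmk, zeroed, eap_type, srv_tlvs, peer_tlvs)
    return hmac.compare_digest(expected, tlv[-MAC_LEN:])


cmk = b"\x11" * 20                 # dummy compound MAC key
tlv = seal(cmk, 2, 2, 0, b"\x22" * 32, EAP_TYPE_PEAP, b"server-outer", b"peer-outer")
print(len(tlv), verify(cmk, tlv, EAP_TYPE_PEAP, b"server-outer", b"peer-outer"))
```

Because the Outer-TLVs and the EAP type from the first messages enter the MAC, an attacker who altered them during tunnel establishment is detected at this point.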

EAP-FAST
Content
This chapter presents a detailed overview of today's WLAN security problems and
solutions.
This subchapter provides an introduction to EAP-FAST, which is considered
the successor of LEAP.

Quick Facts
Cisco, LEAP successor
Design by Cisco but open draft (IETF)
Initially known as "Tunneled EAP (TEAP)" or "LEAPv2"
Supported by client devices since Q4/2004
Goals:
PEAP/EAP-TTLS-like security
Simple deployment
Fast roaming support (VoIP)
Computationally lightweight
Symmetric cryptography is used
Key concept:
Also TLS-protected inner EAP authentication
But PACs instead of X.509 certificates
[Figure: protocol layering, top to bottom: inner EAP or other method, TLV encapsulation protocol, TLS, EAP-FAST, EAP, carrier protocol (EAPoL, RADIUS, Diameter, ...)]
EAP-FAST has been designed by Cisco and can be considered the successor of
LEAP. Unlike LEAP, EAP-FAST is an IETF draft (see
draft-cam-winget-eap-fast-01.txt).
Client support has been available since Q4/2004. The main goals of the EAP-FAST
design are:
- Strong authentication and session key provision similar to PEAP or EAP-TTLS
- Simple deployment without the use of a PKI
- Fast roaming support in order to allow for VoIP applications (WDS integration)
- Computationally lightweight by using symmetric cryptography
EAP-FAST uses so-called Protected Access Credentials (PACs) instead of
certificates. The protocol must facilitate the use of a single strong shared secret by
the peer while enabling the servers to minimize the per-user and per-device state they
must cache and manage.

PACs
First, Protected Access Credentials (PACs) are generated by the authentication
server and distributed to the clients
Either manually ("out-of-band")
Or automatically ("in-band" during "phase 0")
PACs consist of a secret and an opaque part
Secret part contains keying material
Opaque part is sent by the client to prove that he/she also possesses the secret part
Note: a "Phase 0" has also been specified for in-band provisioning, to provide the
peer with a shared secret to be used in the secure phase 1 conversation. In phase 0,
the Authenticated Diffie-Hellman Protocol (ADHP) can be used for PAC-key
exchanges. This phase is independent of other phases; hence, any other scheme
(in-band or out-of-band) can be used in the future. The main goal of phase 0 is to
eliminate the requirement in the client to establish a master secret every time a
client requires network access.
The PAC-Opaque contains the PAC-Key, encrypted with a strong key known only
to the server, and is sent to the server with the TLS ClientHello.

PAC Components (Detailed)
1) PAC Key
32 bytes
Randomly generated by AS
Used as TLS pre-master-secret to establish the "phase 1" tunnel
2) PAC Opaque
Variable-length field
Sent to AS during phase 1 tunnel establishment
Can only be interpreted by AS
Contains the PAC key and the peer's identity
3) PAC Info
Variable-length field
Contains readable information such as authority identity (A-ID), PAC issuer, and PAC-key lifetime
An EAP-FAST authentication server is identified by its Authority Identity (A-ID).
This A-ID is unique to each server along with the server master key. When an
EAP-FAST session starts, the server sends its A-ID in the EAP-FAST start packet.
Based on the A-ID, the EAP-FAST client selects the correct PAC.
Supports MS-DB and LDAP-DB. No support for OTP.
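The A-ID based PAC selection described above can be sketched as a simple lookup table on the client. This is only an illustration: the names Pac and PacStore are hypothetical and do not come from any EAP-FAST implementation, and the PAC fields are reduced to the three parts named on the slide.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Pac:
    """Hypothetical client-side view of one PAC."""
    a_id: bytes        # Authority Identity of the issuing server
    pac_key: bytes     # secret part, never sent over the air
    pac_opaque: bytes  # opaque part, handed back to the server


class PacStore:
    """Caches one PAC per authentication server, keyed by A-ID."""

    def __init__(self) -> None:
        self._by_a_id: Dict[bytes, Pac] = {}

    def add(self, pac: Pac) -> None:
        self._by_a_id[pac.a_id] = pac

    def select(self, a_id: bytes) -> Optional[Pac]:
        # Called when the EAP-FAST start packet announces the server's A-ID.
        return self._by_a_id.get(a_id)
```

If select() returns None, the client has no PAC for this server and would have to fall back to provisioning (phase 0 or manual).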

Concept
Two or three EAP-FAST phases
Phase 0: (Optional) automatic PAC provision
Phase 1: TLS tunnel establishment
Phase 2: Mutual authentication
After authentication
Master Session Keys (MSKs) are derived
AS can update the client with a fresh PAC key
A client may cache multiple PACs to communicate with different authentication servers
Today nearly everybody uses Phase 0. Since ACS 4.0, many additional EAP-FAST
features have been introduced, including the so-called Authenticated Phase 0,
where the server (ACS) is first authenticated using a normal X.509 certificate.
This server certificate is also used to negotiate a tunnel key for Phase 0
(instead of Diffie-Hellman).

802.1x - EAP-FAST - Details
[Figure: message sequence between Supplicant, Authenticator (802.11 AP), and Authentication Server; EAPoL between supplicant and AP, EAP over RADIUS between AP and AS]
Optional Phase 0: TLS via DH
After MS-CHAPv2 authentication a PAC is assigned to the client (disconn.)
OR
Manual PAC creation and assignment
PAC contents: PAC-Key; PAC-Opaque (PAC-Key protected with AS_priv); PAC-Info (TTL, issuer)
Phase 1: TLS Tunnel Establishment
PAC-Opaque sent to AS; AS recovers the PAC-Key from the PAC-Opaque using AS_priv; TLS
Phase 2: Inner Authentication
PAP, GTC, ...
As explained on the previous page, the optional phase 0 ("PAC provisioning") can
be either unauthenticated (DH used) or authenticated (server X.509 certificate
used).
Note: Especially when configuring the ACS, the keys are named differently:
- The AS_priv key is also known as the Master Key
- The PAC-key is also known as the Tunnel Key

Note
No Server States Needed!

The PAC-opaque is sent by the client and contains the PAC-key, which is encrypted by the ACS's private key
Only after receiving the PAC-opaque does the server know the shared secret and can establish the TLS tunnel with it
One of the main advantages of EAP-FAST is that authentication servers do not
have to maintain state information for each client.
A client begins authentication by sending the PAC-opaque to the server, which
contains the PAC-key encrypted with a strong key known only to the server.
That is, upon receiving the PAC-opaque, the server decrypts it and thereby
recovers the PAC-key.
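The stateless server idea can be illustrated with a small sketch: the server keeps only its master key, wraps the PAC-key and peer identity into the PAC-opaque, and recovers both when the opaque part comes back. Everything here is an assumption made for illustration. In particular, the HMAC-SHA256 keystream and tag are a toy stand-in for a real authenticated cipher, and the byte layout of the opaque blob is invented, not the EAP-FAST wire format.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream from HMAC-SHA256 in counter mode (illustration only)."""
    out = b""
    counter = 0
    while len(out) < length:
        block = nonce + counter.to_bytes(4, "big")
        out += hmac.new(key, b"enc" + block, hashlib.sha256).digest()
        counter += 1
    return out[:length]


class AuthServer:
    """Holds only a master key; no per-client PAC state is stored."""

    def __init__(self) -> None:
        self.master_key = os.urandom(32)

    def _tag(self, nonce: bytes, ciphertext: bytes) -> bytes:
        return hmac.new(self.master_key, b"mac" + nonce + ciphertext,
                        hashlib.sha256).digest()

    def issue_pac(self, identity: str):
        """Return (pac_key, pac_opaque) for provisioning to the client."""
        pac_key = os.urandom(32)
        nonce = os.urandom(16)
        plain = pac_key + identity.encode("utf-8")
        stream = _keystream(self.master_key, nonce, len(plain))
        ciphertext = bytes(p ^ s for p, s in zip(plain, stream))
        return pac_key, nonce + ciphertext + self._tag(nonce, ciphertext)

    def recover(self, pac_opaque: bytes):
        """Recover (pac_key, identity) from a PAC-opaque sent by a client."""
        if len(pac_opaque) < 16 + 32 + 32:
            raise ValueError("PAC-opaque too short")
        nonce, ciphertext, tag = (pac_opaque[:16], pac_opaque[16:-32],
                                  pac_opaque[-32:])
        if not hmac.compare_digest(tag, self._tag(nonce, ciphertext)):
            raise ValueError("PAC-opaque was not issued by this server")
        stream = _keystream(self.master_key, nonce, len(ciphertext))
        plain = bytes(c ^ s for c, s in zip(ciphertext, stream))
        return plain[:32], plain[32:].decode("utf-8")
```

After issue_pac() the server can forget the PAC-key entirely; recover() rebuilds it from what the client presents, which is the reason no server-side state is needed.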

Unauthenticated Phase 0 - Detailed
PAC auto-provisioning using TLS with DH key agreement to establish a secure tunnel
Additionally, MS-CHAPv2 is used to authenticate the client and to prevent MITM
After the PAC has been successfully provisioned, EAP-FAST is restarted to gain network access
Therefore, after a successful PAC provisioning transaction, an EAP failure occurs to terminate the EAP-FAST session
Afterwards, the newly provisioned PAC can be used to establish an authenticated session
Source: Cisco Systems
Manual provisioning ("out-of-band")
- May be necessary if a non-Microsoft-format database is used (such as
LDAP) which does not support MSCHAPv2 credentials
- PAC files can be generated manually at the ACS and distributed manually
to client devices

EAP-FAST Phases - Detailed
Phase 1
Client sends only the PAC opaque to the server, not the PAC key
The server decrypts the PAC opaque using its master-key
Now server and client have the same PAC key
The PAC key is used to create a TLS tunnel for this client's authentication
Phase 2
Inside the TLS tunnel, user authentication credentials are passed securely (Phase 2)
E.g. using EAP-GTC
Source: Cisco Systems
The client response is cryptographically bound to the EAP authentication success
message. This prevents a Man-In-The-Middle (MITM) attack in which the
attacker (client) attempts to provide a false response to the server in order to
obtain the session key.

Phase 1 - Details
[Figure: message sequence between Supplicant, Authenticator (802.11 AP), and Authentication Server; EAPoL between supplicant and AP, EAP over RADIUS between AP and AS]
EAP Request/Identity
EAP Response/Identity (username or anonymous user)
EAP-FAST Start, Authority-Identity (A-ID TLV)
EAP-FAST TLS/ClientHello (client_random, PAC_Opaque, use TLS_RSA_WITH_RC4_128_SHA ciphersuite)
Note: Any ciphersuite might be supported. The RSA key exchange is not executed, but 128-bit RC4 is used for confidentiality and SHA-1 for authenticity
Generate Master_Secret and tunnel keys using client_random, server_random, and PAC-key
EAP-FAST Request, TLS/ServerHello (server_random), TLS/ChangeCipherSpec, TLS/Finished (encrypted keys and secrets)
Generate Master_Secret and tunnel keys using client_random, server_random, and PAC-key
EAP-FAST TLS/ChangeCipherSpec, TLS/Finished
Now both sides are ready to transmit and receive protected authentication messages, i.e. the TLS tunnel has been established
Note: Since a PAC may be used as a credential for other applications beyond
EAP-FAST, the PAC key is further hashed using T-PRF to generate a fresh TLS
master_secret. Additionally, hashing the PAC-key is required to stretch it to
the required 48-octet master_secret:
master_secret = T-PRF(PAC-key, "PAC to master secret label hash",
server_random + client_random, 48)
Key material for EAP-FAST tunnel protection:
key_block = PRF(master_secret, "key expansion", server_random +
client_random)
("+" denotes concatenation)
In case EAP-FAST authentication employs 128-bit RC4 and SHA-1, the key_block
is partitioned as follows:
client_write_MAC_secret [hash_size = 20]
server_write_MAC_secret [hash_size = 20]
client_write_key [key_material_length = 16]
server_write_key [key_material_length = 16]
client_write_IV [IV_size = 0]
server_write_IV [IV_size = 0]
session_key_seed [seed_size = 40]
After phase 2, the MSKs are derived. Part of the MSK is forwarded to the AP by
the AS using the RADIUS MS-MPPE attributes (RFC 2548).
The pseudorandom function is used as defined in RFC 2246.
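The derivation chain in these notes (PAC-key to master secret to key block to individual keys) can be sketched as follows. The PRF follows the RFC 2246 construction (P_MD5 XOR P_SHA-1). The T-PRF shown here, an iterated HMAC-SHA1 over label, a zero octet, seed, a two-octet output length and a counter, is an assumption based on the EAP-FAST specification and should be checked against the draft before relying on it. The function names are hypothetical.

```python
import hmac

# Partition for 128-bit RC4 with SHA-1, as listed in the notes (sizes in octets).
KEY_BLOCK_LAYOUT = [
    ("client_write_MAC_secret", 20),
    ("server_write_MAC_secret", 20),
    ("client_write_key", 16),
    ("server_write_key", 16),
    ("client_write_IV", 0),
    ("server_write_IV", 0),
    ("session_key_seed", 40),
]


def t_prf(key: bytes, label: bytes, seed: bytes, out_len: int) -> bytes:
    """Assumed T-PRF: T_i = HMAC-SHA1(key, T_(i-1) + S + out_len + i)."""
    s = label + b"\x00" + seed
    length = out_len.to_bytes(2, "big")
    out, t, i = b"", b"", 1
    while len(out) < out_len:
        t = hmac.new(key, t + s + length + bytes([i]), "sha1").digest()
        out += t
        i += 1
    return out[:out_len]


def p_hash(hash_name: str, secret: bytes, seed: bytes, length: int) -> bytes:
    """P_hash data expansion from RFC 2246."""
    out, a = b"", seed  # a starts as A(0) = seed
    while len(out) < length:
        a = hmac.new(secret, a, hash_name).digest()          # A(i)
        out += hmac.new(secret, a + seed, hash_name).digest()
    return out[:length]


def tls10_prf(secret: bytes, label: bytes, seed: bytes, length: int) -> bytes:
    """TLS 1.0 PRF: P_MD5 over the first half XOR P_SHA-1 over the second."""
    half = (len(secret) + 1) // 2
    s1, s2 = secret[:half], secret[len(secret) - half:]
    md5_part = p_hash("md5", s1, label + seed, length)
    sha_part = p_hash("sha1", s2, label + seed, length)
    return bytes(x ^ y for x, y in zip(md5_part, sha_part))


def derive_tunnel_keys(pac_key: bytes, server_random: bytes,
                       client_random: bytes):
    """Return (master_secret, dict of partitioned tunnel keys)."""
    seed = server_random + client_random
    master_secret = t_prf(pac_key, b"PAC to master secret label hash", seed, 48)
    total = sum(size for _, size in KEY_BLOCK_LAYOUT)
    key_block = tls10_prf(master_secret, b"key expansion", seed, total)
    keys, offset = {}, 0
    for name, size in KEY_BLOCK_LAYOUT:
        keys[name] = key_block[offset:offset + size]
        offset += size
    return master_secret, keys
```

Both peers run derive_tunnel_keys() with the same three inputs and therefore end up with the same 112 octets of key block; the two IV entries are empty because RC4 is a stream cipher.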

Phase 2 - Details
[Figure: message sequence between Supplicant, Authenticator (802.11 AP), and Authentication Server; EAPoL between supplicant and AP, EAP over RADIUS between AP and AS]
EAP Request/Identity
EAP Response/Identity (user-ID)
EAP Request, List of supported EAP-types (e.g. EAP-GTC, ...)
Inner EAP procedures
Result: key material
Now check whether both sides came to the same result
EAP Request, Crypto_Binding TLV
EAP Response, Crypto_Binding TLV
EAP Request, Final_Result TLV
EAP Response, Final_Result TLV
Cleartext EAP Success/Failure indication
All EAP messages are encapsulated in the EAP Message TLV. Assumption:
Phase 1 has been successful, or a TLS session has been successfully resumed.
Phase 2 key derivations are used to prove tunnel integrity and to generate session
keys. The details depend on the inner EAP method. The inner keying material is
always expanded (if necessary) to (at least) 32 octets. The inner keying material
(i.e. the result of the inner EAP exchange) is fed into a PRF to generate the
MSK.
The phase 2 inner authentication method over EAP-TLV can be EAP-SIM,
EAP-OTP, EAP-GTC, or MSCHAPv2.

Additional Facts
Client can resume TLS session by sending its session-ID (in a ClientHello)
Bypass inner EAP conversation
But server must cache client's session-ID, master_secret, and CipherSpec
EAP-FAST supports single sign-on (SSO) using username and password during Windows networking logon
Also supports separate machine authentication
Seamless migration from LEAP to EAP-FAST possible
Similar AP settings
ACU reconfiguration via ACAT
WPA is also supported

WPA and WPA2
Content
This chapter presents a detailed overview of today's WLAN security problems and
solutions.
This subchapter provides an overview of the WPA procedures.

Introduction
802.1x alone does not (need to) provide key management
Often 802.1x is simply combined with WEP
Even 802.1x with TKIP would always start with the same base key
Basic Idea of WPA:
Strong per-user, per-session, per-packet keying (TKIP and MIC)
Use 802.1x and dynamic transient key management
Alternatively pre-shared keys (SOHO apps.) instead of 802.1x
WPA starts with a security capability negotiation
Therefore cipher suites must be configured on the AP
AP advertises capabilities in beacon and in probe-response frames
"Cipher Suite" = Auth. Method + Encryption Method
Client can select the desired method during association request
The basic idea of WPA is to combine 802.1x authentication with TKIP and
MIC. Furthermore, dynamically established master keys should be the basis to
calculate dynamic per-user, per-session, and per-packet keys, using TKIP.
Key management can be performed either through RADIUS (as 802.1x does, in
which case it is called "WPA-EAP") or alternatively via pre-shared keys without
any additional servers. Both mechanisms generate a master session key for
the Authenticator (AP) and Supplicant (client station).
WPA allows "cipher suites" to be configured on the AP, while the clients may
select the most appropriate one during the association process.

WPA/WPA-2
Certified EAP Methods

EAP-TLS (originally the only one)
EAP-TTLS/MSCHAPv2
PEAPv0/EAP-MSCHAPv2
PEAPv1/EAP-GTC
EAP-SIM
Native OS support
Windows XP with Service Pack 2 and WPA2 patch
No support for Win2k
Linux: wpa_supplicant (large feature set)


The master key is calculated "pair-wise", that is, on the AP as well as on the
client device, either based on 802.1x authentication states or on a Pre-Shared
Key (PSK). The WPA-PSK method is only used when there is no Authentication
Server available, typically in home installations.
Note: WPA-PSK is not supported by Cisco ADU or ACU.
Note: When using WPA encryption on an access point, encryption key 1 must not
be used, as the WPA key negotiation mechanism uses this key position in the AP
to transfer authentication data to the client.
WPA2 supports FIPS 140-2 compliant security, basically AES in counter mode.
(An early draft included AES-OCB instead, but it was dropped due to patent
issues.) A 48-bit IV protects against replay attacks.
Authentication and integrity are maintained using an 8-byte CBC-MAC with a
48-bit nonce. Besides the data, the source and destination MAC addresses in the
header are also protected by the CBC-MAC. (These fields are called Additional
Authentication Data (AAD).)
The CBC-MAC, the nonce, and an additional 2 bytes of IEEE 802.11 overhead make
the CCMP packet 16 octets larger than an unencrypted IEEE 802.11 packet.
The AP advertises cipher suites both in beacons and probe responses.

WPA Concepts
1) Pairwise Master Key (PMK) is negotiated between client and AS
Based on 802.1x credentials or based on a PSK in home environments
PMK is designed to last the entire session
Should be exposed as little as possible (therefore PTK needed)
2) PMK is pushed from AS to AP
Via RADIUS-Access-Accept message
3) AP generates Pairwise Transient Key (PTK)
Negotiated via Four-Way Handshake to client
PTK= HASH (PMK, AP_nonce, STA_nonce, AP_MAC, STA_MAC)
From PTK, other working keys are generated (KCK, KEK, TK)
4) AP also derives a Group Temporal Key (GTK)
To decrypt multicast and broadcast traffic
Must be the same on all clients (!)
Needs to be updated periodically (e.g. when a device leaves the network)
AP sends new GTK to each client, encrypted with client's PTK
Each client must acknowledge the new GTK
Unlike WEP, which uses a single key for unicast data encryption and typically a
separate key for multicast and broadcast data encryption, WPA uses a set of four
different keys for each wireless client-wireless AP pair (known as the pairwise
temporal keys) and a set of two different keys for multicast and broadcast traffic.
The set of four-way handshake messages exchanges the values needed to determine
the pairwise temporal keys, verifies that each wireless peer has knowledge of
the PMK (by verifying the value of the MIC), and indicates that each wireless
peer is ready to begin encrypting and providing message integrity protection
for subsequent unicast data frames and EAPOL-Key messages.
For multicast and broadcast traffic, the wireless AP derives a 128-bit group
encryption key and a 128-bit group integrity key and sends these values to the
wireless client using an EAPOL-Key message, encrypted with the EAPOL-Key
encryption key and integrity-protected with the EAPOL-Key integrity key. The
wireless client acknowledges the receipt of the EAPOL-Key message with an
EAPOL-Key message.
When a device leaves the network, the GTK also needs to be updated to prevent
the device from receiving any more multicast or broadcast messages.

The Basic Steps
PMK is derived from the master key of the preceding 802.1x negotiations
Four WPA (main) steps are performed after 802.1x authentication
Each step of this procedure is protected by dedicated transient (temporary) keys
[Figure: Client (Supplicant), AP (Authenticator), and AS; 802.1x authentication using any EAP method precedes the four steps]
1. Calculate PMK (on client and AS)
2. Push PMK to AP
3. Use PMK to derive, bind, and verify PTK
4. Use Group Key Handshake to send GTK from AP to client
The Pairwise Master Key (PMK) is typically calculated using some
authentication data which has been derived at the end of a preceding 802.1x/EAP
negotiation. For example, if EAP-TLS is used, then
PMK = PRF(MasterKey, clientHello.random, serverHello.random, "client EAP encryption")
WPA implements a new 4-Way Handshake and a Group Key Handshake for
generating and exchanging data encryption keys between the Authenticator and
Supplicant. This handshake is also used to verify that both Authenticator and
Supplicant know the master session key.

WPA - Basic Handshake (Simplified)
1. The AP sends a nonce value and the STA now can construct the PTK
2. The STA sends its own nonce value to the AP together with a MIC
3. The AP sends the GTK and a sequence number together with another MIC
This SeqNr will be used in the next multicast or broadcast frame, so the STA can perform basic replay detection
4. The STA sends a confirmation to the AP
[Figure: message sequence between Client (STA), AP, and AS: AS pushes PMK to AP, so STA and AP both hold the PMK; AP to STA: AP_nonce; STA derives PTK; STA to AP: STA_nonce, MIC; AP derives PTK; AP to STA: GTK, MIC; STA to AP: Ack]
Note: WPA also includes the requirement to use open key authentication and to
obsolete the flawed shared-key authentication. Like 802.11i, WPA capabilities
are advertised in beacons, probe responses, association requests, and
reassociation requests.

WPA Details - Transient Keys
The PTK (512 bit) is the basis to derive additional transient keys
Data Encryption Key (128 bit)
For unicast frames
Aka Temporal Key (TK)
Data Integrity Key (128 bit)
For unicast MIC
Key Encryption Key (KEK, 128 bit)
To encrypt EAPoL key messages
Key Integrity Key (KIK, 128 bit)
To calculate the MIC for EAPoL key messages
The GTK (256 bit) is the basis to derive
A Group Encryption Key (GEK)
A Group Integrity Key (GIK)


Based on the PTK, several temporary working keys are derived:
- A 128-bit Data Encryption Key for unicast transmission, which is
similar to a WEP key and consists of bits 256-n of the PTK (n depends
on the cipher used).
- A 128-bit Data Integrity Key for the unicast MIC.
- A 128-bit EAPoL Key Encryption Key (KEK) to encrypt EAPoL key
messages. This key simply consists of bits 128-255 of the PTK.
- A 128-bit EAPoL Key Integrity Key (KIK) to calculate the MIC for
EAPoL key messages. This key is also called "Key Confirmation Key"
(KCK) and consists of bits 0-127 of the PTK.
Based on the GTK, these temporary working keys are derived:
- A 128-bit Group Encryption Key (GEK), which is also known as
Group Transient Key (GTK), to encrypt multicast and broadcast frames.
This key simply consists of bits 128-255 of the GTK.
- A 128-bit Group Integrity Key (GIK) to calculate the MIC for multicast
and broadcast frames. This key simply consists of bits 0-127 of the
GTK.
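The bit ranges above translate directly into byte slices. The sketch below assumes a 64-byte (512-bit) PTK so that all four 128-bit keys fit, and it assumes that the Data Encryption Key occupies bits 256-383 and the Data Integrity Key bits 384-511; the notes only fix the positions of KIK and KEK. The GTK is split with GIK in bits 0-127 and GEK in bits 128-255. Function names are hypothetical.

```python
from typing import Dict


def split_ptk(ptk: bytes) -> Dict[str, bytes]:
    """Slice a 512-bit PTK into the four pairwise working keys."""
    if len(ptk) != 64:
        raise ValueError("this sketch expects a 64-byte PTK")
    return {
        "KIK": ptk[0:16],                   # bits 0-127, aka KCK
        "KEK": ptk[16:32],                  # bits 128-255
        "data_encryption_key": ptk[32:48],  # assumed: bits 256-383 (TK)
        "data_integrity_key": ptk[48:64],   # assumed: bits 384-511
    }


def split_gtk(gtk: bytes) -> Dict[str, bytes]:
    """Slice a 256-bit GTK into the two group working keys."""
    if len(gtk) != 32:
        raise ValueError("this sketch expects a 32-byte GTK")
    return {
        "GIK": gtk[0:16],   # bits 0-127
        "GEK": gtk[16:32],  # bits 128-255
    }
```

Bit 0 is taken to be the first bit of the first byte, so "bits 0-127" is simply the first 16 bytes.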

(WPA - Detailed)
All WPA procedure messages are of type "EAPoL Key Messages"
Temporary Key (TK) consists of bits 256-n of the PTK, depending on the cipher used
Same Group Transient Key (GTK) is assigned to all clients within a VLAN
[Figure: message sequence between Client (Supplicant), AP (Authenticator), and AS]
AS to AP: Push PMK to AP
AP to Client: Nonce_1, MAC_1
Client: Generate random Nonce_2
Client: Derive PTK = EAPoL_PRF (PMK, Nonce_1, Nonce_2, MAC_1, MAC_2)
Client: Derive KEK and KIK from PTK
Client to AP: Nonce_2, MAC_2, MIC (using KIK)
AP: Derive PTK = EAPoL_PRF (PMK, Nonce_1, Nonce_2, MAC_1, MAC_2)
AP: Derive KEK and KIK from PTK
AP: Verify MIC using KIK
AP to Client: Install PTK, Start_Seq_Number, MIC (using KIK) ("OK, use this PTK")
Client to AP: Start_Seq_Number, MIC (using KIK) ("OK, I will use this PTK and I am ready to communicate properly")
Client and AP: Install Temporary Key (TK)
AP: Generate random Nonce_3
AP: Generate random GTK, derive GEK and GIK
AP to Client: Nonce_3, GTK + MIC (encr. using KEK and Nonce_3) ("Use this GTK")
Client to AP: ACK, MIC ("OK")
The temporary key exchange is initiated by the AP and consists of the following
steps:
1. The AP sends an EAPoL key message including Nonce_1 and MAC_1. This
message is not encrypted, and no MIC is possible at that stage.
2. Now, the client can calculate a "Pairwise Transient Key" (PTK) and
derives the KEK and KIK.
3. The client sends an EAPoL key message including Nonce_2 and MAC_2 plus
MIC. The MIC is calculated using the EAPoL-KIK.
4. Now, the AP can also derive the PTK and can verify the MIC.
5. The AP sends an EAPoL key message including a MIC and a start sequence
number to indicate that the AP is now ready to send encrypted unicast frames
as well as EAPoL key frames.
6. The client also sends an EAPoL key message including a MIC and a start
sequence number to indicate that the client is now also ready to send unicast
frames as well as EAPoL key frames.
7. The AP finally calculates a 128-bit Group Encryption Key (GEK) as well as a
128-bit Group Integrity Key (GIK) and transmits these values via an EAPoL
key message (encrypted with EAPoL-KEK and protected by EAPoL-KIK) to
this client.
8. The client acknowledges this message by sending a valid EAPoL key
message.
Note: The basic idea of all this is to use a PMK to generate "fresh" PTKs for
encryption.
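Steps 1 to 4 above can be sketched as a small simulation in which both sides derive the PTK from the PMK, the two nonces and the two MAC addresses, and the AP checks the client's MIC. Several details are assumptions rather than statements from these notes: the PRF is modelled as iterated HMAC-SHA1 with the label "Pairwise key expansion" and sorted inputs, as I understand the IEEE 802.11i construction, and the MIC is a simplified HMAC-SHA1 truncated to 16 bytes over the nonce and MAC address only, not over a real EAPoL key frame.

```python
import hmac
import os


def eapol_prf(key: bytes, label: bytes, data: bytes, n_bytes: int) -> bytes:
    """Assumed PRF: concatenated HMAC-SHA1(key, label + 0x00 + data + i)."""
    out, i = b"", 0
    while len(out) < n_bytes:
        out += hmac.new(key, label + b"\x00" + data + bytes([i]),
                        "sha1").digest()
        i += 1
    return out[:n_bytes]


def derive_ptk(pmk: bytes, nonce_1: bytes, nonce_2: bytes,
               mac_1: bytes, mac_2: bytes) -> bytes:
    """Derive a 64-byte PTK; sorting makes the result role-independent."""
    data = (min(mac_1, mac_2) + max(mac_1, mac_2)
            + min(nonce_1, nonce_2) + max(nonce_1, nonce_2))
    return eapol_prf(pmk, b"Pairwise key expansion", data, 64)


def mic(kik: bytes, message: bytes) -> bytes:
    """Simplified MIC keyed with the KIK (first 16 bytes of the PTK)."""
    return hmac.new(kik, message, "sha1").digest()[:16]


def simulate_handshake(pmk_client: bytes, pmk_ap: bytes,
                       mac_1: bytes, mac_2: bytes):
    """Run steps 1-4; return (mic_ok, ptk_client, ptk_ap)."""
    nonce_1 = os.urandom(32)                       # step 1: AP sends Nonce_1
    nonce_2 = os.urandom(32)                       # step 2: client picks Nonce_2
    ptk_client = derive_ptk(pmk_client, nonce_1, nonce_2, mac_1, mac_2)
    message = nonce_2 + mac_2                      # step 3: client message
    client_mic = mic(ptk_client[:16], message)
    ptk_ap = derive_ptk(pmk_ap, nonce_1, nonce_2, mac_1, mac_2)  # step 4
    mic_ok = hmac.compare_digest(mic(ptk_ap[:16], message), client_mic)
    return mic_ok, ptk_client, ptk_ap
```

When both sides hold the same PMK the MIC verifies and the PTKs match; a client with a wrong PMK fails at step 4, which is how the handshake proves knowledge of the PMK without ever transmitting it.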

GTK Issues
GTK is either

A pseudo-random number chosen by AP
The first PTK that the AP uses
GTK Usage
Cannot be used with sequence numbers because it is used for ALL clients
Distant clients might overhear some frames
So management and broadcast frames are encrypted via WEP only
Broadcast key rotation recommended

WPA-2: PKC
WPA2 mandates both TKIP and AES capability
TKIP is used by the network if at least one client supports TKIP only
PMK Proactive Key Caching (PKC) support
AP caches credentials for 1 hour to allow fast reconnect
According to Microsoft Knowledge Base Article 815485: "With 802.1X, the
rekeying of unicast encryption keys is optional. Additionally, 802.11 and 802.1X
provide no mechanism to change the global encryption key used for multicast and
broadcast traffic. With WPA, rekeying of both unicast and global encryption keys
is required. For the unicast encryption key, the Temporal Key Integrity Protocol
(TKIP) changes the key for every frame, and the change is synchronized between
the wireless client and the wireless access point (AP). For the global encryption
key, WPA includes a facility for the wireless AP to advertise the changed key to
the connected wireless clients."
WPA-2 PMK Caching: PKC allows a client to store PMKs and reuse them when it
later associates with the same AP or LAP. To support PKC, the client calculates
and sends PMKIDs, i.e. a hash of the PMK, a string, the station MAC, and the
AP MAC. This 'PMK SA Identifier' is sent in an association request. The
PMKID uniquely identifies the PMK on the WLC, and therefore the 802.1x
authentication can be bypassed. The client can send more than one key name in
the association request. If the access point or WLC sends a success in the
association response, then the client and access point proceed directly to the
4-way handshake.
Note:
- PKC is automatically enabled on a Cisco WLC when WPA2 is enabled for a
WLAN.
- PKC does not work with Aironet Desktop Utility (ADU) as client supplicant.
- PMK cache records are kept for 1 hour for non-associated clients.
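The PMKID lookup can be sketched as follows. The notes describe the PMKID as a hash of the PMK, a string, the station MAC and the AP MAC; the concrete form used here, the first 128 bits of HMAC-SHA1 keyed with the PMK over the string "PMK Name" followed by AP MAC and station MAC, is my understanding of IEEE 802.11i and should be treated as an assumption. The PmkCache class and its one-hour lifetime handling are hypothetical illustrations of the caching behaviour, not controller code.

```python
import hmac
from typing import Dict, Iterable, Optional, Tuple

PMK_LIFETIME_S = 3600  # cache records are kept for one hour


def pmkid(pmk: bytes, ap_mac: bytes, sta_mac: bytes) -> bytes:
    """Assumed PMKID: HMAC-SHA1-128(PMK, "PMK Name" + AP MAC + STA MAC)."""
    return hmac.new(pmk, b"PMK Name" + ap_mac + sta_mac, "sha1").digest()[:16]


class PmkCache:
    """AP/WLC-side cache mapping PMKIDs to PMKs with an expiry time."""

    def __init__(self) -> None:
        self._entries: Dict[bytes, Tuple[bytes, float]] = {}

    def store(self, pmk: bytes, ap_mac: bytes, sta_mac: bytes,
              now: float) -> bytes:
        key_id = pmkid(pmk, ap_mac, sta_mac)
        self._entries[key_id] = (pmk, now + PMK_LIFETIME_S)
        return key_id

    def lookup(self, candidate_ids: Iterable[bytes],
               now: float) -> Optional[bytes]:
        """Return the first unexpired PMK named in the association request."""
        for key_id in candidate_ids:
            entry = self._entries.get(key_id)
            if entry is not None and now < entry[1]:
                return entry[0]
        return None
```

A hit in lookup() corresponds to the case where 802.1x authentication is bypassed and both sides go straight to the 4-way handshake; a miss (None) means a full authentication is required.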

WPA-2: Pre-Authentication
Pre-authentication support

Allows a client to pre-authenticate with the AP toward which it is moving
But still maintains a connection to the AP it's moving away from
Note that pre-authentication is done through the AP to which the client is currently associated!
Roaming times below 100 ms
While PKC reduces the reauthentication time on APs or WLCs where the client
has been authenticated once, preauthentication reduces roaming delays because it
allows clients to authenticate to other APs or WLCs without association. Note
that the preauthentication process is realized through the current AP or WLC to
which the client is currently associated! Using preauthentication the client can
establish PMKs with all APs or WLCs. The PTK handshake is only performed
when the client actively associates to a new AP or WLC. In this case the
association request again carries a PMK SA Identifier as explained in the PKC
section above.

WPA-PSK (1)
ONLY useful for home WLANs
Relies on Pre-Shared Key (PSK) only
No AAA server needed
PMK is a 4096-times hash of:
Passphrase (8-63 chars or 64 hex digits)
SSID and SSID-length
Nonces
WPA-PSK is the alternative to server-based keys (SBKs). In WPA-PSK, users must
share a passphrase that may be from eight to 63 ASCII characters or 64
hexadecimal digits (256 bits). Each character in the passphrase must have an
encoding in the range of 32 to 126 (decimal), inclusive (IEEE Std. 802.11i-2004,
Annex H.4.1). The space character is included in this range.
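The input rules above can be checked with a few lines of code. This is a minimal sketch; the function name classify_psk_input and its return strings are hypothetical.

```python
import string


def classify_psk_input(value: str) -> str:
    """Classify a WPA-PSK entry as 'hex-key' or 'passphrase', else raise."""
    # 64 hexadecimal digits encode the 256-bit key directly.
    if len(value) == 64 and all(c in string.hexdigits for c in value):
        return "hex-key"
    # Otherwise: 8 to 63 characters, each encoded in 32..126 (space included).
    if 8 <= len(value) <= 63 and all(32 <= ord(c) <= 126 for c in value):
        return "passphrase"
    raise ValueError("not a valid WPA-PSK passphrase or 64-digit hex key")
```

For example, classify_psk_input("correct horse battery") is accepted as a passphrase because the space character (code 32) lies inside the permitted range.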
In November 2003, Robert Moskowitz, a senior technical director at ICSA Labs
(part of TruSecure), released "Weakness in Passphrase Choice in WPA Interface".
In this paper, Moskowitz described a straightforward formula that would reveal
the passphrase by performing a dictionary attack against WPA-PSK networks. This
weakness is based on the fact that the pairwise master key (PMK) is derived from
the combination of the passphrase, SSID, length of the SSID, and nonces. The
concatenated string of this information is hashed 4,096 times to generate a
256-bit value, which is then combined with nonce values. The information
required to create and verify the session key is broadcast with normal traffic
and is readily obtainable; the challenge then becomes the reconstruction of the
original values. Moskowitz explains that the pairwise transient key (PTK) is a
keyed-HMAC function based on the PMK; by capturing the four-way authentication
handshake, the attacker has the data required to subject the passphrase to a
dictionary attack.

WPA-PSK (2)
2003: Robert Moskowitz published an effective dictionary attack against WPA-PSK
Passphrase should be more than 20 characters !!!
Attack Tools: CoWPAtty, KisMAC, WPA Cracker, ...
According to Moskowitz, "a key generated from a passphrase of less than about
20 characters is unlikely to deter attacks." In late 2004, Takehiro Takahashi, then
a student at Georgia Tech, released WPA Cracker.
Around the same time, Josh Wright, a network engineer and well-known security
lecturer, released coWPAtty. Both tools are written for Linux systems and
perform a brute-force dictionary attack against WPA-PSK networks in an attempt
to determine the shared passphrase. Both require the user to supply a dictionary
file and a dump file that contains the WPA-PSK four-way handshake. Both
function similarly; however, coWPAtty contains an automatic parser while WPA
Cracker requires the user to perform a manual string extraction.
Additionally, coWPAtty has optimized the HMAC-SHA1 function and is
somewhat faster. Each tool uses the PBKDF2 (Password-Based Key Derivation
Function) algorithm that governs PSK hashing to attack and determine the
passphrase. Neither is extremely fast or effective against larger passphrases,
though, as each must perform 4,096 HMAC-SHA1 iterations with the values as
described in the Moskowitz paper.
PBKDF2 is a key derivation function that is part of RSA Laboratories' Public-
Key Cryptography Standards (PKCS) series, specifically PKCS #5 v2.0, also
published as Internet Engineering Task Force's RFC 2898. It replaces an earlier
standard, PBKDF1, which could only produce derived keys up to 160 bits long.
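The PSK hashing step can be reproduced with the PBKDF2 implementation in the Python standard library. The sketch assumes the passphrase-to-key mapping of IEEE 802.11i Annex H.4: HMAC-SHA1 as the underlying function, the SSID as salt, 4,096 iterations and a 256-bit output. Under that assumption the nonces are not part of this step and only enter later, when the PTK is derived. The function name wpa_psk_pmk is hypothetical.

```python
import hashlib


def wpa_psk_pmk(passphrase: str, ssid: str) -> bytes:
    """Map a WPA-PSK passphrase to the 256-bit PMK (assumed Annex H.4 mapping)."""
    return hashlib.pbkdf2_hmac(
        "sha1",                      # PBKDF2 with HMAC-SHA1
        passphrase.encode("ascii"),  # passphrase characters are 32..126
        ssid.encode("utf-8"),        # the SSID acts as the salt
        4096,                        # iteration count
        32,                          # 32 bytes = 256 bits
    )
```

Because the SSID is the salt, the same passphrase yields different PMKs on differently named networks, so a precomputed dictionary is only valid for one SSID. Each candidate passphrase costs the attacker the full 4,096 iterations, which is why long passphrases remain the practical defence.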
