Vous êtes sur la page 1sur 18

Towards root cause analysis of

BGP routing dynamics


Matthew Caesar, Lakshmi Subramanian,
Randy H. Katz
mccaesar@cs.berkeley.edu
Motivation
Interdomain routing suffers from many
problems
Instability
Slow convergence after changes
Misconfigurations
Poor visibility into dynamics
What is the spectrum of causes of route changes?
What are the primary causes of instability?
How does BGP respond to a routing change?
Incomplete model incomplete solutions
How can we improve routing?
BGP Health monitoring
system
Collect routes from routers
Infer properties of network
elements
Redistribute information
Achieves greater visibility
Our focus: how to do
inference
AS
AS
AS
U
p
d
a
t
e
s
Updates
U
p
d
a
t
e
s
Health
Monitor (HM)
I
n
f
e
r
e
n
c
e
s
Inferences
In
fe
re
n
c
e
s
Uses of a health monitor
Troubleshooting/debugging
BGP policies based on historical
link/node stability
Online help to path selection
Help customers choose upstream
providers
Inference problem
Terminology:
Event: an activity that generates routing updates
Suspect location set: set of ASs and links where
event could have occurred
Suspect cause set: set of types of events that
could have occurred
Problem: Given route updates observed at
multiple vantage points, determine the
suspect set = suspect cause set, suspect
location set} of routing events that trigger
each update
Correlating Observations: Main
idea
Quiescent: Assume bursts of updates to a single prefix are
correlated
Turbulent: Assume updates to many prefixes are correlated
Assuming correlated observations are independent worsens
precision, but assuming independent observations are
correlated worsens accuracy
Time ?
#

u
p
d
a
t
e
d

p
r
e
f
i
x
e
s

?
Quiescent
Turbulent
A
c
t
i
v
i
t
y

Time
Our approach
Refined suspect set
for minor event
Refinement across
multiple views
Inferences, updates
from other views
Correlate updates
based on time
Link analysis
of topology
Potential link/AS
causing major event
Detection Algorithm
Quiescent Turbulent
Split updates into groups
Continuous updates
Burst of updates
for one prefix
Inference rules from
a single view
Why continuous
flaps occur?
Suspect set
Updates from a
single view
Turbulent Inference: Example
Scenario #1:
~1000 prefixes updated that used (A,C)
(A,C) suspect
D
B
E
A
C
2000 prefixes
3000 prefixes
1000 prefixes
V
1
0
0
0
Turbulent Inference: Example
Scenario #2:
~4000 prefixes updated that used (A,B),
~2000 that used (B,E)
(A,B) suspect
D
B
E
A
C
2000 prefixes
3000 prefixes
1000 prefixes
V
4000
2000
2000
Turbulent Inference: Example
Scenario #3:
~2000 prefixes updated that used (A,B),
~2000 that used (B,E)
(B,E) suspect
D
B
E
A
C
2000 prefixes
3000 prefixes
1000 prefixes
V
2000
2000
TurbulentInfer: Issues
Effects of simultaneously occurring
independent events are
overshadowed
Two large events simultaneously
occurring
Transition from Quiescent to
Turbulent periods
Quiescent Inference: Example
REROUTE
Silence, route change,
silence
Potential causes
(yellow):
Link/router repair
MED/LocalPref decrease
Hold-down expired
Potential causes (blue):
Link/router failure
MED/LocalPref increase
Hold-down triggered
Previous path
Final path
worsened
improved
QuiescentInfer:
Inference across views
Did NOT observe
event,
remains
in advertised state
DB
V1
D
V2
Observed
event
(ii) Observed by one view
DA
V1
D
V2
Observed
event
Observed
event
(i) Observed by two views
Event did not
occur here
Event did not
occur here
QuiescentInfer: Issues
Simultaneous events:
Eg. Flap in one view, Reroute in another
Eg 2. Advertisement in one view, Withdrawal in another
Community attribute changes
Community change can trigger reroute several hops away
Multiple peering links
AS 5
AS 1
AS 2
(a) Singly-peered (b) Multiply-peered
AS 3 AS 4
AS 5
AS 1
AS 2
AS 3 AS 4
Validation
Well-known historical events
UUNET (10/3/02), AT&T (8/28/02) routing difficulties
Internet worms
BGP beacons
View at the origin
View at the origin results
BGP Beacon results
Results
70% of updates can be pinpointed to a
single inter-AS link (pair of ASs)
More precise inference for more major
events
Breakdown of updates by cause
Results
Few ASs, links causing majority of
updates
Many previously unknown major events
Peering instability: July 21 2003: link between
two tier 1s affected reachability to several
popular prefixes over 10 hour period
Misconfiguration: Jan 26 2003: AS erroneously
advertised 500 prefixes
Reroute: Jan 23 2003: link between two tier
1s causes many prefixes to switch to alternate
paths
Future work
Investigate continuously flapping
prefixes
Apply statistical inference techniques
Placement of views
Alarms/Triggers to detect unhealthy
behavior
Real time analysis
http://www.cs.berkeley.edu/~mccaesar/hmon.html

Vous aimerez peut-être aussi