
Network Troubleshooting checklist (v0.3)
Incident / problem reference: <reference details>
Checklist completed by: <Name of person who completed checklist>

Critical checks have been completed*   Yes <>  No <X>
Important checks have been completed   Yes <>  No <X>

1. Information
1.1) <Description of issue, incident or outage, including the symptoms displayed or experienced>*
1.2) Is it a loss of connectivity?   Yes <>  No <X>
1.3) Is it a degradation of services?   Yes <>  No <X>
1.4) Are there any visual records or photos available?   Yes <>  No <X>
1.5) Has a physical inspection been conducted?   Yes <>  No <X>
1.6) <Describe how the business was impacted by stating the undesired outcome>
1.7) Conditions: <The business and IT conditions present when the incident or outage occurred>
1.8) Do we have a diagram or insight into the customer solution or configuration?   Yes <>  No <X>

2. Timelines
(Lifecycle of the incident and problem. Recording these times will assist in highlighting where improvements are possible in the SLA.)

2.1) Time when incident started (actual: something has happened to a component or an event has occurred)
  Date: <dd/mm/yy>   Time: <hh:mm>
2.2) Time when incident was detected (incident is detected either by monitoring tools, IT personnel or, worst case, the user/customer)
  Date: <dd/mm/yy>   Time: <hh:mm>
  Analysis: Time to detect was acceptable?   Yes <>  No <X>
2.3) Time of diagnosis (underlying cause: do we know what happened?)
  Date: <dd/mm/yy>   Time: <hh:mm>
2.4) Time of repair (process to fix failure started or corrective action initiated)
  Date: <dd/mm/yy>   Time: <hh:mm>
  Analysis: Time to repair was acceptable?   Yes <>  No <X>
2.5) Time of recovery (component recovered: the component is back in production, business ready to be resumed)
  Date: <dd/mm/yy>   Time: <hh:mm>
2.6) Time of restoration (normal operations resume: the service is back in production)
  Date: <dd/mm/yy>   Time: <hh:mm>
  Analysis: Time to restore service to operational state was acceptable?   Yes <>  No <X>
2.7) Time of workaround (service is back in production with workaround)
  Date: <dd/mm/yy>   Time: <hh:mm>
2.8) Time of escalation (to third level support if required)
  Date: <dd/mm/yy>   Time: <hh:mm>
  Analysis: Workarounds and escalation were acceptable?   Yes <>  No <X>
2.9.1) Time period service was unavailable (SLA measure): <minutes>
2.9.2) Time period service was degraded (SLA measure): <minutes>
2.9.3) SLA targets achieved? *   Yes <>  No <X>

NETWORK CHECKLIST
* Critical
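
As a rough illustration of how the recorded timestamps could be turned into the durations this section asks about, the sketch below computes elapsed minutes between events. The dictionary keys, the sample timestamps and the choice of which intervals to report are assumptions made for illustration only; the checklist itself does not define them.

```python
from datetime import datetime

# Matches the <dd/mm/yy> and <hh:mm> fields of the checklist.
TIMESTAMP_FORMAT = "%d/%m/%y %H:%M"


def minutes_between(start: str, end: str) -> float:
    """Return the minutes elapsed from start to end ('dd/mm/yy hh:mm')."""
    started = datetime.strptime(start, TIMESTAMP_FORMAT)
    ended = datetime.strptime(end, TIMESTAMP_FORMAT)
    return (ended - started).total_seconds() / 60


def timeline_durations(times: dict) -> dict:
    """Derive durations from the recorded events (key names are hypothetical)."""
    return {
        "time_to_detect_min": minutes_between(times["started"], times["detected"]),
        "time_to_diagnose_min": minutes_between(times["detected"], times["diagnosed"]),
        "time_unavailable_min": minutes_between(times["started"], times["restored"]),
    }


# Made-up sample values, not taken from a real incident.
sample = {
    "started": "01/03/24 08:00",
    "detected": "01/03/24 08:12",
    "diagnosed": "01/03/24 09:00",
    "restored": "01/03/24 10:30",
}
durations = timeline_durations(sample)
print(durations)
```

Whether an interval is acceptable still depends on the SLA that applies to the circuit; the sketch only does the arithmetic.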

3. Proximate cause investigation
3.1) Were there any changes logged in this time period?   Yes <>  No <X>
3.2) List of changes logged: <list and description of changes>
3.3) Last known change on device impacted? *  <description of change>
3.4) Was any planned maintenance scheduled for this time period?   Yes <>  No <X>
3.5) List and time of planned maintenance tasks: <list, time and description of maintenance tasks>
3.6) Main circuit being investigated: <main circuit id>
3.6.1) Location A: <details of Location A>
3.6.2) Location B: <details of Location B>
3.7) Are there any other circuits with similar issues?   Yes <>  No <X>
  <additional circuits associated with issue>
3.8) Proximate cause investigation
3.8.1) Have all proximate causes been investigated?   Yes <>  No <X>
3.8.2) Prevalent weather conditions: <details of weather conditions at location A and location B>
3.8.3) Is the problem being investigated suspected to be associated with the weather conditions?   Yes <>  No <X>
3.8.4) Has preventative maintenance been conducted at the relevant high site within the acceptable review period?   Yes <>  No <X>
3.9) Is the issue associated with an environmental problem?   Yes <>  No <X>
3.9.1) Is it electrical power related?   Yes <>  No <X>
3.9.2) Is it lightning related?   Yes <>  No <X>
3.9.3) Is there water damage?   Yes <>  No <X>
3.9.4) Is it heat related?   Yes <>  No <X>
3.9.5) Has there been a change request submitted that has impacted the circuit?   Yes <>  No <X>
3.9.6) Are there any hardware or component failures identified?   Yes <>  No <X>
3.9.7) Have vendor and software bugs been eliminated?   Yes <>  No <X>
3.9.8) Have contractor and service provider faults been eliminated?   Yes <>  No <X>
3.9.10) Are there any other suspected causes identified?   Yes <>  No <X>
3.10) <Description of any other suspected causes identified>

4. Device checks
4.1) Are there any issues highlighted by the device checks? *   Yes <>  No <X>
4.1.1) Device fan speed status: <output of chassis show fans>
  Acceptable?   Yes / No
4.1.2) Device temperature status: <output of chassis show temperature>
  Acceptable?   Yes / No
4.1.3) Device power supply status: <output of chassis show power>
  Acceptable?   Yes / No
4.1.4) Device power on self tests: <output of chassis post show>
  Any power on self test errors?   Yes / No
4.1.5) Device archive: <output of device-archive show>
  Any unexplained resets or violations?   Yes / No
4.1.6) Device system status:
  <output of system health show cpu-utilization>
  <output of system health show cpu-load average>
  <output of system health show memory-utilization>
  <output of system health show file-system utilization>
  Any highlighted system related issues?   Yes / No
4.2) Type of connection:   Access / dot1q / QinQ
  VLANs being used: <vlan ids, vlan show>
  VLANs are correct?   Yes / No
4.3) Firmware
4.3.1) Radio: <firmware version>
  Correct version?   Yes / No
4.3.2) Switch: <firmware version>
  Correct version?   Yes / No
4.3.3) Latest release notes: Do the latest release notes of the firmware describe or identify any of the problems being experienced?   Yes / No
4.4) MAC addresses visible
  Learned from Location A: <mac addresses>
  Learned from Location B: <mac addresses>
4.4.1) MAC addresses are correct?   Yes / No
4.5) Results from RFC2544 tests: <copy of RFC2544 results>
4.6) Results from Y.1731 SLA measurement


TBH-311swt> cfm frame-loss sh
+----------------- MEP FRAME LOSS MEASUREMENT MESSAGE INFORMATION ------------------+
|                |Local|Remote           |Remote|    |    |Loss|Loss|Bad |Seq  |Rep |
|Service         |Mepid|Mac Address      |Mepid |LMM |LMR |Near|Far |Seq |Miss |Time|
+----------------+-----+-----------------+------+----+----+----+----+----+-----+----+
|tag672          |10   |00:03:18:67:B9:22|1     |200 |200 |0   |0   |0   |0    |5   |
+----------------+-----+-----------------+------+----+----+----+----+----+-----+----+

TBH-311swt> cfm delay show
+----------------- MEP DELAY MEASUREMENT MESSAGE INFORMATION ------------------+
|                |Local|Remote           |    |    |    |Delay   |Jitter  |Rep |
|Service         |Mepid|Mac Address      |RMep|DMM |DMR |in us   |in us   |Time|
+----------------+-----+-----------------+----+----+----+--------+--------+----+
|tag672          |10   |00:03:18:67:B9:22|1   |10  |10  |1790    |40      |5   |
+----------------+-----+-----------------+----+----+----+--------+--------+----+
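
When several services have to be checked, the frame-loss rows shown above can be read programmatically. The following is a minimal sketch that splits one data row on its column separators and flags rows that may need a closer look. The column names and the flagging rule are this sketch's own assumptions; the exact meaning of each counter should be confirmed against the device documentation.

```python
# Labels chosen by this sketch for the columns of the sample frame-loss output.
FRAME_LOSS_COLUMNS = [
    "service", "local_mepid", "remote_mac", "remote_mepid",
    "lmm", "lmr", "loss_near", "loss_far", "bad_seq", "seq_miss", "rep_time",
]
TEXT_COLUMNS = {"service", "remote_mac"}


def parse_frame_loss_row(row: str) -> dict:
    """Split one '|'-delimited data row into a dict keyed by column label."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    if len(cells) != len(FRAME_LOSS_COLUMNS):
        raise ValueError(
            f"expected {len(FRAME_LOSS_COLUMNS)} columns, got {len(cells)}"
        )
    record = {}
    for name, cell in zip(FRAME_LOSS_COLUMNS, cells):
        record[name] = cell if name in TEXT_COLUMNS else int(cell)
    return record


def needs_attention(record: dict) -> bool:
    """Assumed rule: flag reported loss, sequence errors or unanswered LMMs."""
    return (
        record["loss_near"] > 0
        or record["loss_far"] > 0
        or record["bad_seq"] > 0
        or record["seq_miss"] > 0
        or record["lmr"] < record["lmm"]
    )


# Data row copied from the sample output in this checklist.
row = "|tag672          |10   |00:03:18:67:B9:22|1     |200 |200 |0   |0   |0   |0    |5   |"
record = parse_frame_loss_row(row)
print(record["service"], needs_attention(record))
```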

4.7) Utilization
4.7.1) Is there a customer utilization problem? *   Yes <>  No <X>
4.7.2) Long term (weekly) utilization graph: <copy of long term utilization graph>
4.7.3) Short term (daily) utilization graph: <copy of short term utilization graph>
4.7.4) Realtime utilization graph: <copy of link utilization of link using STG utility>
4.7.5) Offered capacity: What is the capacity being transmitted or received from the core aggregation point to the ring under investigation?
  <TX capacity>
  <RX capacity>
  Is this offered capacity exceeding the ring capacity?   Yes / No
4.8) Device management
4.8.1) <details of the device management configuration>
4.8.2) Is the device accessible and manageable via CLI and SNMP?   Yes <>  No <X>
4.8.3) Have you moved the CPE from the last mile radio connection and cascaded it directly off the high site switch?   Yes / No
4.8.4) Has this changed the symptoms?   Yes / No
Has a sniffer trace of the problem been obtained?   Yes / No
Are there any deductions that can be made from the trace?   Yes / No
<details and deductions made from the sniffer trace>

5. Port checks
5.1) Configuration
5.1.1) Ethernet port speed setting*:   10 <>  100 <>  1000 <>  10000 <>  Auto <>
5.1.2) Are the Ethernet port speed settings correct?   Yes / No
5.1.3) Ethernet port duplex setting*:   Half <>  Full <>  Duplex <>
5.1.4) Are the Ethernet port duplex settings correct?   Yes / No
5.2) Statistics
5.2.1) Are the port statistics within acceptable limits?   Yes / No
5.2.2) Do the link LEDs indicate proper cable connection?   Yes / No
5.2.3) Switch port statistics: <copy of switch port statistics>
5.2.4) Radio statistics: <copy of radio port statistics>
5.3) Errors
5.3.1) Are there CRC errors displayed on the port?   Yes <>  No <X>
5.3.2) Are these errors cabling related?   Yes / No
5.3.3) Are these errors radio related?   Yes / No
5.3.4) Are there any pause frames on the switch?   Yes / No
5.3.5) Are there any pause frames on the radio?   Yes / No
5.3.6) Are the traffic counters incremented in both directions?   Yes / No
5.4) Administrative
5.4.1) Has the port been administratively reset?   Yes / No
5.4.2) Has the port been physically reset?   Yes / No
5.4.3) Is the switch port correctly configured as a trunk or access port as per the requirement?   Yes / No
5.4.4) Is the port disabled, either administratively or due to a fault?   Yes / No
5.4.5) Have the port and cable been correctly labelled physically?   Yes / No
5.4.6) Do the physical connections correspond to the labelling?   Yes / No
5.4.7) Have the ports been labelled in the switch or device configuration?   Yes / No
5.4.8) Do the switch and device labels correspond to the physical connections?   Yes / No
5.5) OAM
5.5.1) Has an Ethernet OAM loopback been performed on the last mile link?   Yes / No
5.5.2) <results of eoam>

6. Reticulation checks
6.1) Cabling
6.1.2) Has a visual check of the cabling been conducted and is it acceptable? *   Yes <>  No <X>
6.1.3) Are photos available of the reticulation?   Yes / No
6.1.4) Type of cabling being used for circuit
6.1.4.1) Copper   Yes / No
6.1.4.2) Fibre   Yes / No
6.1.5) Is the patch cabling damaged?   Yes / No
6.1.6) Is the port, SFP or XFP damaged?   Yes / No
6.2) Results of link tests (including patch cables) (if copper)
  Location A: <Results of tests at Location A>
    Are the copper tests acceptable?   Yes / No
  Location B: <Results of tests at Location B>
    Are the copper tests acceptable?   Yes / No
6.3) Copper checks
  Are the maximum copper cable lengths being exceeded?   Yes / No
  Has a cable recently been moved from one switch port to another?   Yes / No
6.4) Fibre checks
6.4.1) Are the fibre pigtails correctly connected (RX to TX)?   Yes / No
6.4.2) Does the fibre pigtail have a half-break?   Yes / No
6.4.3) Does the fibre exceed maximum allowable attenuation?   Yes / No
6.4.4) Are the different devices connected via fibre operating at incompatible light frequencies?   Yes / No
6.4.5) Is an attempt being made to incorrectly connect a multimode device to a single mode device?   Yes / No
6.4.6) Has the fibre cable been recently cleaned?   Yes / No
6.4.7) Results of circuit ATP (if fibre): <copy of ATP results>
  Location A: Measured TX loss <dB>   Measured RX loss <dB>
  Location B: Measured TX loss <dB>   Measured RX loss <dB>
6.4.8) Is fibre damage or signal loss suspected?   Yes / No
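
The attenuation question and the measured TX/RX loss fields come down to comparing each reading against a limit. A minimal sketch of that comparison follows. It assumes losses are recorded as positive dB values and that the maximum allowable attenuation is taken from the optic or cable datasheet. The measurement names and all sample numbers are made up for illustration.

```python
def losses_over_limit(measured_db: dict, max_allowable_db: float) -> list:
    """Return the names of readings whose loss exceeds the limit, sorted."""
    return sorted(
        name for name, loss in measured_db.items() if loss > max_allowable_db
    )


# Hypothetical readings in dB; replace with the values from the ATP results.
measured = {
    "location_a_tx": 3.1,
    "location_a_rx": 2.8,
    "location_b_tx": 7.4,
    "location_b_rx": 3.0,
}
over = losses_over_limit(measured, max_allowable_db=6.0)
print(over)
```

An empty list means no reading exceeded the limit that was supplied; it says nothing about whether that limit is the right one for the installed optics.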

7. Traffic checks
7.1.1) Rate limit provisioned: <rate limit configured>
7.1.2) Is this the rate limit ordered by customer? *   Yes <>  No <X>
7.1.3) Are the management VLANs set up correctly (geographically partitioned) and rate limited?   Yes / No
7.2.1) Filters
  <broadcast containment filter configured>
  <status of broadcast containment filters - pm show pm-instance xxx-xxx>
7.2.2) What is the ratio of unicasts versus broadcasts and is it normal (less than 10%)?   Ratio:    Yes / No
7.2.3) Has broadcast filter been configured?   Yes / No
7.2.4) Is there a broadcast problem?   Yes / No
7.3) IP Checks
  If DHCP is being used, are IP addresses being correctly assigned?   Yes / No
  Is the correct subnet and subnet mask associated with the IP address?   Yes / No
  Are pings and tracerts operating as expected?   Yes / No
  Do pings with different packet sizes fail?   Yes / No
  <copy of tracerts using WinMTR tool>
7.4) Transmission paths
7.4.1) Is there a congestion problem on either the primary or backup transmission path? *   Yes <>  No <X>
7.4.2) Is there a layer 2 loop (or symptoms of one) present on any of the transmission paths? *   Yes <>  No <X>
7.4.2) What type of path protection protocol is being used?   PBB-TE / G.8032 / IEEE 802.1D / Other (radio or proprietary)
  Are there any path protection flaps being recorded or logged?   Yes / No
  Is the configuration of the path protection correct?   Yes / No
7.4.3) Primary path being used configuration: <details of primary path being used>
7.4.4) Primary path capacity: <copy of primary path utilization>
7.4.5) Total provisioned capacity to subscribers including BTS: <Mb/s>
7.4.6) Total actual provisioned capacity on the transmission backhaul: <Mb/s>
7.4.7) Sold vs actual provisioned oversubscription rate: <%>
7.5.1) Backup path being used configuration: <details of backup path being used>
7.5.2) Backup path capacity: <copy of backup path utilization>
7.6) Backhaul
7.6.1) Is the same VRF being used for RX and TX (asymmetrical routing)?   Yes / No
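
Two of the traffic checks are simple ratios: the broadcast level in 7.2.2 and the oversubscription rate in 7.4.5 to 7.4.7. The sketch below shows one way to compute them. It assumes the broadcast level is expressed as the broadcast share of all counted unicast and broadcast frames, and that the oversubscription rate is sold capacity divided by actual backhaul capacity. Both interpretations, and the function names, are assumptions rather than definitions taken from the checklist.

```python
def broadcast_share_pct(unicast_frames: int, broadcast_frames: int) -> float:
    """Broadcast frames as a percentage of unicast plus broadcast frames."""
    total = unicast_frames + broadcast_frames
    if total == 0:
        return 0.0
    return 100 * broadcast_frames / total


def broadcast_level_is_normal(
    unicast_frames: int, broadcast_frames: int, threshold_pct: float = 10.0
) -> bool:
    """True when the broadcast share is below the threshold (10% by default)."""
    return broadcast_share_pct(unicast_frames, broadcast_frames) < threshold_pct


def oversubscription_pct(sold_mbps: float, backhaul_mbps: float) -> float:
    """Capacity sold to subscribers as a percentage of backhaul capacity."""
    if backhaul_mbps <= 0:
        raise ValueError("backhaul capacity must be positive")
    return 100 * sold_mbps / backhaul_mbps


# Illustrative numbers only.
print(broadcast_share_pct(9500, 500))
print(oversubscription_pct(2000, 500))
```

Counter values would typically come from the port statistics collected in section 5; the capacity figures come from items 7.4.5 and 7.4.6.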

8. Logs
8.1) Do the logs report a network or port flap?   Yes <>  No <X>
8.2) Do the logs highlight an interference problem?   Yes <>  No <X>
8.3) Do the logs highlight another problem?   Yes <>  No <X>
8.4) Switch logs: <copy of switch logs, log flash view>
8.5) Radio logs: <copy of high site and CPE radio logs>
8.6) <results of latest wireless radio channel surveys>

9. Radio checks
9.1) RF Checks
9.1.1) What type of RF equipment is being used?   <Vendor>
9.1.2) If this is a PtMP link, has the configuration been disabled/reapplied (alternatively removed/reapplied) to resolve any connectivity issues?   Yes / No
9.1.3) Do you have pictures of the LOS?   Yes / No
9.1.4) Are there visible LOS issues?   Yes / No
  Is the antenna correctly aligned?   Yes / No
  Is expected throughput over the distance of the link acceptable?   Yes / No
  Is there a problem with BERs?   Yes / No
  Are the links at the high site synchronised?   Yes / No
  Has self interference been eliminated?   Yes / No
  Has external interference been eliminated?   Yes / No
  Is there noise from adjacent channels?   Yes / No
  Is there any damage visible with the equipment or devices?   Yes / No
  Is the POE equipment functioning correctly?   Yes / No
  Is the device management functioning and reporting no problems?   Yes / No
9.2) MTU settings: <details of MTU settings along transmission path>
  Are the MTU settings acceptable?   Yes / No
9.3) Microwave
  Transmitted power levels: <power levels>
  Received field strength (the reading must match the value resulting from hop calculations): <field strength>
  Bit error rate: <rate>
  Hop performance: <performance metric>
9.4) Loop testing (these are service disruptive)
  Local loops: usually used to test the cables interfacing the equipment upstream.   Yes / No
  Baseband loop: permits testing of the RF circuits.   Yes / No
9.5) Alarms
9.5.1) Are there any alarms being reported by the devices or element manager that are relevant to the problem being experienced? *   Yes <>  No <X>

10. CPE Wireless

10.1) Is there an authentication problem?   Yes / No
10.2) Is there a provisioning problem?   Yes / No
10.3) Is there a path or route problem?   Yes / No
10.4) Has the OSS been checked for this IP or subscriber?   Yes / No
10.5) Is there a problem with the provisioning of the IP or subscriber on the OSS?   Yes / No
10.6) Is the correct static IP address being allocated?   Yes / No
10.7) Is there a current problem with anonymous subscriber levels on the network being high?   Yes / No
10.8) Is the throughput test from the CPE to the Core acceptable? (bandwidth test tool on CPE and from a Windows PC)   Yes / No
10.9) Is the latency from the CPE to the core acceptable?   Yes / No
10.10) Static IP address being used: <ip address>
10.11) Additional static IP addresses being used: <ip route>
10.12) Shared VLAN being used: <vlan id>
10.13) POP being used: <name of POP>
10.14) Are there any other subscribers with a problem?   Yes <>  No <X>
  <list of subscribers with similar problems>
10.15) Is this related to a transmission or access problem?   Yes / No
10.16) Do all the session authentication servers pass a health test?   Yes / No
  <list of failed health checks on PPPOE (or alternative) servers>
10.17) Is there a problem in the Core IP network?   Yes / No
10.18) Utilization graphs customer (real-time, daily, weekly): <customer utilization graphs>
10.19) Is there a customer utilization problem?   Yes / No
10.20) Utilization graphs Core and peering (real-time, daily, weekly): <core utilization graphs>
10.21) Is there a core or peering utilization problem?   Yes / No
10.24) What is the MAC address or unique identifier of the subscriber?   <MAC address>
10.25) What is the BTS(s) to which the subscriber connects?   <BTS id>
10.26) Is there a problem with the BTS?   Yes / No
10.29) Are there problems with specific urls?   Yes / No
10.30) Are these problems fixed by excluding the urls on WCCP?   Yes / No
10.31) Are the traceroutes in both directions acceptable?   Yes / No
10.32) Are there any firewall   Yes <>  No <X>
10.33) Is the   Yes / No
10.34) Is the   Yes / No
10.35) Is this an access point for laptops or phones?   Yes / No
10.36) Who is the vendor of the device?   <name of vendor>
10.37) Is a proprietary protocol enabled? (Disable for laptop access)   Yes / No

11. VoIP
11.1) Provide a snapshot of the current MOS scores for VoIP.
  <Example output from NAM module>
11.2) Interconnection status: <Details and graphs of Interconnections to mobile and fixed line operators>
11.3) Softswitch session status: <Details and graphs of softswitch session details>
11.4) SBC session status: <Details and graphs of SBC session details>
11.5) Capacity
11.5.1) Is the problem related to a last mile link utilization issue?   Yes <>  No <X>
11.5.2) Is the problem related to a port capacity or license count issue?   Yes <>  No <X>
11.6) Calls
11.6.1) Are there problems to certain destinations?   Yes <>  No <X>
11.6.2) <Description and details of destination prefixes/numbers with problems>
11.6.3) Are there one-way voice problems?   Yes <>  No <X>
11.6.4) Is the appropriate codec being utilized?   Yes <>  No <X>
11.7) Tests
11.7.2) Are VoIP test results from a suitable web based voice testing service acceptable?   Yes <>  No <X>

12. QA

11.1) Installation and provisioning
11.2) Testing
  Has an RFC2544 test been completed?   Yes <>  No <X>
  If the CPE is unable to perform an RFC2544 test, has an MTR test been done to a core device in the DC?   Yes <>  No <X>
  Have you performed bandwidth testing using a tool?   Yes <>  No <X>
11.3) Results of bandwidth tests (initial installation): <Bandwidth test results>

13. CPE
12.1) Has the problem been solved using the checklist?   Yes <X>  No <X>
12.2) If No, what checks need to be added to cater for the problem?
  <Additional checks required>

14. Feedback
12.1) Has the problem been solved using the checklist?   Yes <X>  No <X>
12.2) If No, what checks need to be added to cater for the problem?
  <Additional checks required>