Available resources include:
- Brocade FOS documentation
- Brocade Connect and/or Brocade partner sites
- Training materials, including Products, FRUs and LEDs (Web-based training module associated with this course)
- Brocade switch provider information, including compatibility matrices
March 2008
Many common SAN problems are related to the following (in alphabetical order):
- Configuration: A port, device, or switch is not correctly configured. Problems accessing a switch or connecting switches or end devices can be related to configuration problems.
- Firmware Download: FTP configuration and release.plist confusion.
- Licensing: Customers do not have the license to do what they are attempting. Problems connecting switches can be related to licensing problems.
- Marginal Links: Bad or marginal cables/GBICs/SFPs. Problems related to performance, or problems that occur when connecting switches or end devices, can be related to marginal links.
- Zoning: Problems that occur when end devices are not able to access each other can be related to zoning.
Any switch added to an existing fabric must be configured properly:
- LAN configuration information
- Fabric configuration information

Your configuration plan should include a checklist that answers the following questions:
- Are special port configurations required?
- Are the correct license keys installed?
- What versions of firmware are running in the fabric?
- Will you be using any additional capabilities, e.g. ACLs, ADs, FCIP?
- Gather all required information for the new or replacement switch using a switch connection checklist
- Configure the new or replacement switch to join an existing fabric
Fabric Segmentation
Fabric segmentation is generally caused by one of the following conditions:
1. Licensing problems: Switches segment due to value line license limitations
2. Zoning conflicts: The zoning configuration in both fabrics cannot be merged
3. Admin Domain (AD) conflict: The AD configuration and/or AD zoning configurations cannot be merged
4. Fabric parameters conflict: fabric.ops parameters do not match
5. Port parameters conflict: ISL port settings are not compatible. FCIP tunnel settings must match.
6. Domain ID overlap: Two or more switches have the same domain ID
7. Access Control List (ACL): If the configuration is strict, all switches must comply

In addition, all switches in a fabric with user-defined ADs 1-254, ACLs, and/or a zoning database size greater than 256K must support the Reliable Commit Service (RCS) protocol.
SAN Troubleshooting Basics
2008 Brocade Communications Systems, Inc. All rights reserved.
fabstatsshow output
Fabric Manager
switchshow Output
RSL1_ST01_B20:admin> switchshow
switchName:   RSL1_ST01_B20
switchState:  Online
switchMode:   Native
switchRole:   Principal
switchDomain: 1
switchId:     fffc01
switchWwn:    10:00:00:05:1e:02:12:2c
zoning:       ON (lab1)
Area Port Media Speed State
==============================
  0    0   id    N4   Online    F-Port  10:00:00:00:c9:53:c6:c5
  1    1   id    N2   Online    E-Port  segmented, (domain overlap) (Trunk master)
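When many ports are involved, the segmentation reason can be pulled out of captured switchshow text with a short script. The following is a minimal sketch, not a Brocade tool: the regular expression and the function name segmented_ports are assumptions based only on the two port lines shown above, and other Fabric OS versions may lay the table out differently.

```python
import re

# Port table copied from the switchshow output above.
SWITCHSHOW = """\
Area Port Media Speed State
==============================
  0    0   id    N4   Online    F-Port  10:00:00:00:c9:53:c6:c5
  1    1   id    N2   Online    E-Port  segmented, (domain overlap) (Trunk master)
"""

# Assumed layout: area number, port number, then a
# "segmented, (<reason>)" note later on the same line.
SEGMENTED = re.compile(r"^\s*\d+\s+(\d+)\s.*\bsegmented,\s*\(([^)]*)\)")

def segmented_ports(text):
    """Return {port_number: reason} for every port line marked segmented."""
    found = {}
    for line in text.splitlines():
        match = SEGMENTED.match(line)
        if match:
            found[int(match.group(1))] = match.group(2)
    return found

print(segmented_ports(SWITCHSHOW))
```

For the sample above, only port 1 is reported, with the reason "domain overlap".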
RSL1_ST10_B41:admin> errshow -r
Fabric OS: v5.2.0a
2007/01/31-12:50:27, [FABR-1001], 4,, WARNING, rsl1_st10_b41_1, port 8, ELP rejected by the other switch
fabstatsshow Output
RSL1_ST01_B20:admin> fabstatsshow
Description                     Count
-----------------------------------------
domain ID forcibly changed:     0
E_Port offline transitions:     7 (Last on port 14)
Reconfigurations:               6
Segmentations due to:
  Loopback:                     0
  Incompatibility:              8   < Identifies mismatch
  Overlap:                      0
  Zoning:                       0
  E_Port Segment:               0
Licensing Conflicts
Switches can be purchased with value line licenses:
- A value line 2 license enables the switch to exist in a two-domain fabric
- A value line 4 license enables the switch to exist in a four-domain fabric

Prior to Fabric OS v3.1.2/4.2, value line licensed switches segmented in fabrics that exceeded the allowable number of domains.
After Fabric OS v3.1.2/4.2, value line licensed switches in fabrics that exceed the allowable number of domains have a grace period: the switch is allowed to join the fabric, but Web Tools access is disabled after 45 days.
quietmode on:
0x102b9f00 (tFcph): Jan 31 18:44:15 CRITICAL FABRIC-SIZE_EXCEEDED, 1, Critical fabric size (3) exceeds supported configuration (2). Switch status marginal. Contact Technical Support.
0x102b9f00 (tFcph): Jan 31 18:44:15 CRITICAL FABRIC-WEBTOOL_LIFE, 1, Webtool will be disabled in 44 days 23 hours and 50 minutes
sw4100:admin> cfgshow
Defined configuration:
 <truncated output>
Effective configuration:
 cfg:  cfg4
 zone: Red_Zone  1,4; 1,5

sw4900:admin> cfgshow
Defined configuration:
 <truncated output>
Effective configuration:
 cfg:  cfg4
 zone: Red_Zone  1,4; 1,5
 zone: Blue_Zone 2,8; 2,11
sw4100:admin> cfgshow
Defined configuration:
 <truncated output>
 alias: Device1 1,1
 <truncated output>
Effective configuration:
 No effective configuration

sw4900:admin> cfgshow
Defined configuration:
 <truncated output>
 zone: Device1 1,1; 2,3
 <truncated output>
Effective configuration:
 No effective configuration
sw4100:admin> cfgshow
Defined configuration:
 <truncated output>
 zone: Green_Zone 1,1; 2,3
 <truncated output>
Effective configuration:
 No effective configuration

sw4900:admin> cfgshow
Defined configuration:
 <truncated output>
 zone: Green_Zone 2,3; 1,1
 <truncated output>
Effective configuration:
 No effective configuration
sw4100:admin> errshow -r
Fabric OS: v5.1.0c
2006/08/15-11:37:54, [FABR-1001], 202,, WARNING, sw4100, port 1, Zone conflict
To identify the cause of a zoning conflict, perform the following actions on both fabrics:
- Display the current zone configuration in both fabrics (cfgshow)
- Review the zone configurations in both fabrics for configuration, type, and content mismatches
- Verify that the Advanced Zoning license is installed (licenseshow)
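The type and content checks can be sketched in a few lines once both cfgshow outputs have been transcribed into dictionaries. This is an illustrative sketch only: the dictionary layout, the function name zone_conflicts, and the rule that member order counts as a content mismatch are assumptions drawn from the Device1 and Green_Zone examples earlier, not a description of how Fabric OS performs the merge.

```python
def zone_conflicts(defined_a, defined_b):
    """Compare two defined zoning databases.

    Each database maps an object name to (object_type, [members in order]).
    Returns a list of (name, kind) tuples, where kind is "type" or "content".
    """
    conflicts = []
    for name in sorted(set(defined_a) & set(defined_b)):
        type_a, members_a = defined_a[name]
        type_b, members_b = defined_b[name]
        if type_a != type_b:
            conflicts.append((name, "type"))
        elif members_a != members_b:
            # Assumption: member order counts, as in the Green_Zone example.
            conflicts.append((name, "content"))
    return conflicts

# Hand-transcribed from the cfgshow examples; combined here for illustration.
fabric_a = {
    "Device1": ("alias", ["1,1"]),
    "Green_Zone": ("zone", ["1,1", "2,3"]),
}
fabric_b = {
    "Device1": ("zone", ["1,1", "2,3"]),
    "Green_Zone": ("zone", ["2,3", "1,1"]),
}
print(zone_conflicts(fabric_a, fabric_b))
```

With these inputs, Device1 is flagged as a type mismatch (alias versus zone) and Green_Zone as a content mismatch.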
You can also review these values by uploading the switch configuration file with the configupload command or Fabric Manager baseline
When upgrading from Fabric OS v5.1 to v5.2, what happens to ports set to mode L0.5, L1, or L2?
- The long-distance mode is still displayed in command line output (switchshow, etc.), but modes L0.5, L1, and L2 cannot be configured
- To change the distance on these ports, use mode LD or LS

When connecting a Fabric OS v5.2 switch to a pre-Fabric OS v5.2 switch, both ports on the link must have the same mode
Result: Use mode LS or LD
Verify settings are the same by invoking portcfgshow on both switches and comparing output
Domain ID Conflicts
If you have to escalate the problem, send the escalation team both supportsave files
If the end device logs in as a loop or fabric device, it is assigned a 24-bit address. Until then, it has no source ID (SID) with which to initiate communication in the fabric.
Login
- Device-to-switch connectivity
- FLOGI to Fabric Port (FFFFFE)
- Security policy check
  - Device Connection Control policy (DCC_POLICY)
  - Access Control List (ACL)

Switch responses:
- Accept: Assign fabric-unique 24-bit address
- No response: Do not assign fabric address
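The 24-bit address assigned on a successful FLOGI can be split into its three bytes when reading CLI output. The sketch below assumes the standard Fibre Channel layout (domain, area, port or AL_PA, one byte each), which the slides do not spell out; the function name and the sample value are illustrative.

```python
def split_fc_address(pid):
    """Split a 24-bit Fibre Channel address into (domain, area, port) bytes."""
    if not 0 <= pid <= 0xFFFFFF:
        raise ValueError("Fibre Channel addresses are 24 bits wide")
    return (pid >> 16) & 0xFF, (pid >> 8) & 0xFF, pid & 0xFF

# Illustrative address only; any 24-bit value works.
domain, area, port = split_fc_address(0x1400E2)
print(domain, area, port)
```

For 0x1400E2 this gives domain 20 (0x14), area 0, and a low byte of 226 (0xE2).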
AD255 is the Physical Fabric view
AD0-AD254 will have a filtered view of the Name Server
Device attribute data may be registered:
- Device model and vendor
- Firmware and driver revisions
- Host name
Initiators register using State Change Registration (SCR)
Initiators are notified by the Name Server through Registered State Change Notifications (RSCNs)
Initiator PLOGI to each target device, based upon Name Server query results
Process Login (PRLI) from initiator to target(s)
Provides the end-to-end connectivity for device communication
Don't forget about LUN Masking and Persistent Binding
Storage array may implement LUN Masking:
- Is the initiator WWN (Port or Node) presented to the array properly?
- Are the correct LUNs made available to the initiator by the array?
HBAs may use Persistent Binding to specify LUN WWN or 24-bit PID to OS device mapping:
- Is the target LUN WWN (Port or Node) or PID specified correctly in the host file(s)?
- An entry may be required for new or replaced target LUNs
For remote devices, there are several commands to choose from, but start with nscamshow
Tells if remote devices are seen within the fabric.
Name Server (ns*) commands are filtered by ADs in FOS v5.2+
If ADs are implemented, select AD255 (Physical Fabric View):
rsl1_st15_b20_1:admin> ad --select 255
Next get a view of the fabric configuration with cfgshow or just get a supportsave
Super command script file. It gets all these commands and more!
Light/Signal
Fibre Channel Layer 0 connectivity
The actual light transmitted and received over FC cabling
- Use the switchshow command to verify light/signal is being transmitted from a device
- Use portflagsshow to see if LED is seen
- Additionally, use sfpshow to verify the SFP is not faulty
Light/Signal (cont.)
Successful light (still no speed/synchronization) output examples Use output of switchshow, portshow, and portflagsshow to verify light is being received:
CLI output information associated with the port when speed negotiation is successful:
- switchshow: port speed will display the speed and State will display Online
- portshow: port speed will display the configured or negotiated speed
- portflagsshow: the Physical column will display No_Sync or In_Sync
Ensure the port is set to default values: portcfgdefault 1
Or manually set the port to auto-negotiate speed: portcfgspeed 1 0
Physical Connectivity
Physical connectivity between a device and a switch port includes light/signal, speed, and link negotiation processes
After speed negotiation, the connecting points have to synchronize
Devices can get into a condition defined as marginal when they go into and out of sync
Commands that help identify this issue include:
- porterrshow
- errshow (its output may also be relevant)
Fabric Watch can greatly augment the event reporting found in the error log (RASLog)
A delta of the counters can help you isolate a problem to a port and/or the connected HBA or Storage device
Note that you can clear the port counters using portstatsclear on a per-port/port-group basis (granularity is dependent on FOS version).
The link counters cannot be cleared without a reboot.
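Taking a delta is easy to script once two counter snapshots have been transcribed. The sketch below is a hedged illustration: the snapshot dictionaries, their values, and the function name counter_deltas are hypothetical, and collecting the numbers from porterrshow or portstatsshow output is left out.

```python
def counter_deltas(before, after):
    """Return {port: {counter: increase}} for counters that rose between snapshots."""
    deltas = {}
    for port, counters in after.items():
        earlier = before.get(port, {})
        rising = {
            name: value - earlier.get(name, 0)
            for name, value in counters.items()
            if value > earlier.get(name, 0)
        }
        if rising:
            deltas[port] = rising
    return deltas

# Hypothetical snapshots keyed by port number; the values are made up.
snapshot_1 = {0: {"enc_in": 0, "crc_err": 0}, 1: {"enc_in": 12, "crc_err": 3}}
snapshot_2 = {0: {"enc_in": 0, "crc_err": 0}, 1: {"enc_in": 40, "crc_err": 3}}
print(counter_deltas(snapshot_1, snapshot_2))
```

Only port 1 appears in the result, with enc_in up by 28, which narrows attention to that port and whatever is attached to it.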
portstatsclear can be used to clear the error statistics to the left of the dotted line. The other counters are cleared on a reboot/fastboot.
portstatsshow
Good for monitoring exact values of counters
Error Counters
Certain port counters can point to physical link layer issues:
- enc_in: Increments when 8b/10b encoding errors are detected within a frame. enc_in errors are always detected on the ingress port.
- crc_err: Indicates corruption within the frame. Always seen on the ingress port, but the frame is passed by the switch unaltered through the fabric (like a trail of bread crumbs).
- enc_in and/or crc_err = Possible bad media (SFP, cable, patch panel)
disc_c3: Class 3 frame has been discarded because it is not routable to a destination address
- Corrupted or not-online Destination ID (DID)
- Timeout exceeded (Condor ASIC hold time exceeded)
- Counter may increment when FC nodes and/or switches rapidly transition between online and offline; look at fabriclog -s output (described in the Logical Connectivity slide later)
Link Counters
These are point-to-point errors; they do not propagate through the fabric
- Link failures: error conditions that cause a port to drop out of an active state
  - Require the reconnecting device to FLOGI back into the fabric (no speed negotiation required, since the device does not lose synchronization)
- Loss of sync: occurs when bit and word synchronization on the link is lost
- Loss of signal: occurs when light or an electrical signal is lost on a link
  - Both require the connected device to renegotiate speed and FLOGI back into the fabric
If you experience device connectivity and/or performance issues and rising link counters, look for:
- Bad cables/SFPs/patch-panel connections
- Repeating cycles of online/offline states in fabriclog -s output
portcfggport: Lock to G-Port if HBA/storage has difficulties negotiating initial Loop Initialization
  portcfggport <port> <0|1>
portcfg mirrorport: A port configured as a mirror port will prevent HBA/storage login
  portcfg mirrorport <[slot/]port#> --enable
  Disable a mirror port that is configured to connect a device:
  portcfg mirrorport <[slot/]port#> --disable
Login Services
Three different levels of login:
- Fabric Login (FLOGI) is used by an N_Port or NL_Port (Nx_Ports) to establish service parameters with the switch
  - The following information is implicitly captured and put into the Name Server during this process: type; COS; PID; PortName (port WWN); and NodeName (node WWN)
- N_Port Login (PLOGI) is used by one Nx_Port to establish service parameters with another N_Port or NL_Port
- Process Login (PRLI) is used by an upper-level process in one port to establish image pairs and service parameters with the corresponding upper-level process in the other port
  - For example, it can be used to establish the environment between related SCSI processes on an originating Nx_Port and a responding Nx_Port
When devices first connect, their address is 000000 (unless they are loop devices, in which case their address is 0000pp)
FLOGI is required before any frame can be sent through the fabric
FLOGI is sent to the well-known address FFFFFE (Fabric F_Port)
portflagsshow Lists the translation of all port login state flags; same as portshow portFlags output
portshow
portstatsshow BB Credit
portcamshow
Hardware enforced SID/DID zone tables are kept in ASIC
portcamshow <port>
Port 1 transitioned from Offline to Online multiple times
Check physical connectivity for bad cables, SFPs, patch-panel connections, etc.
A port login (PLOGI) to the Name Server can be confirmed by looking at the Name Server information
Verify using the nsshow command
An unsuccessful port login means no information within the Name Server
State Change Registration (SCR): Nx_Port request to receive notification when something in the fabric changes
- FC devices that choose to receive RSCNs must register for this service
- Devices send a State Change Registration (SCR) to FFFFFD
- Registration indicates that the device wants to be notified of changes
Registered State Change Notification (RSCN): issued by the Fabric Controller Service or an Nx_Port to devices that registered (issued an SCR requesting this notification); only sent to devices within an affected zone
Registration is optional
- SCSI initiators normally register
- SCSI targets do not register
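The two conditions for receiving an RSCN (the device registered with an SCR, and it shares a zone with the device that changed) can be modelled as a set intersection. This is a simplified sketch under stated assumptions: the device and zone names are hypothetical, the function name rscn_recipients is invented for illustration, and the changed device itself is left out of the result as a simplification.

```python
def rscn_recipients(changed_device, zones, registered):
    """Devices that registered via SCR and share a zone with the changed device."""
    recipients = set()
    for members in zones.values():
        if changed_device in members:
            recipients |= members & registered
    recipients.discard(changed_device)
    return sorted(recipients)

# Hypothetical fabric: two zones, and only the hosts (initiators) registered.
zones = {"Red_Zone": {"host1", "array1"}, "Blue_Zone": {"host2", "array2"}}
registered = {"host1", "host2"}
print(rscn_recipients("array1", zones, registered))
```

Here only host1 is notified when array1 changes state, because host2 is registered but sits in an unaffected zone.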
Sometimes it isn't a device driver issue. Applications can fail if their I/O is not satisfied quickly. ("Quickly" is a relative term.)
If necessary, FOS gives the ability to suppress RSCNs per port:
Response when devices are online; but one does not respond to the fcping ELS ECHO frame:
rsl1_st15_b20_1:admin> fcping 0x0a0000 0x1400e2
Source: 0xa0000
Destination: 0x1400e2
Zone Check: Not Zoned
Pinging 0xa0000 with 12 bytes of data:
received reply from 0xa0000: 12 bytes time:650 usec
<truncated output>
5 frames sent, 5 frames received, 0 frames rejected, 0 frames timeout
Round-trip min/avg/max = 567/618/674 usec
Pinging 0x1400e2 with 12 bytes of data:
Request timed out
<truncated output>
5 frames sent, 0 frames received, 0 frames rejected, 5 frames timeout
Round-trip min/avg/max = 0/0/0 usec
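When fcping results are collected for many device pairs, the summary lines can be scanned for targets that never answered. The following is a minimal sketch that assumes the two line formats seen in the capture above ("Pinging <address> ..." and "<n> frames sent, <n> frames received, ..."); the function name unresponsive is hypothetical.

```python
import re

# Target and summary lines copied from the fcping output above.
FCPING = """\
Pinging 0xa0000 with 12 bytes of data:
5 frames sent, 5 frames received, 0 frames rejected, 0 frames timeout
Pinging 0x1400e2 with 12 bytes of data:
5 frames sent, 0 frames received, 0 frames rejected, 5 frames timeout
"""

TARGET = re.compile(r"^Pinging (0x[0-9a-fA-F]+) ")
SUMMARY = re.compile(r"^(\d+) frames sent, (\d+) frames received")

def unresponsive(text):
    """Return the addresses that answered none of the ECHO frames sent."""
    silent, current = [], None
    for line in text.splitlines():
        target = TARGET.match(line)
        if target:
            current = target.group(1)
            continue
        summary = SUMMARY.match(line)
        if summary and current:
            sent, received = int(summary.group(1)), int(summary.group(2))
            if sent > 0 and received == 0:
                silent.append(current)
    return silent

print(unresponsive(FCPING))
```

For the sample, only 0x1400e2 is reported as unresponsive.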
The mechanism for devices to log in to each other through PLOGI is the same as that used for device-to-switch login
The switch acts as a middleman:
- Passing PLOGI/PRLI requests and ACCEPT responses, or
- Discarding such requests if the devices are not zoned together or in the same AD
1. Configure the port as a mirror port by invoking the following command:
   portcfg mirrorport <[slot/]port#> --enable
   To verify the configuration, invoke portcfgshow <[slot/]port#> and switchshow
2. Connect an FC Analyzer to the mirror port and verify that it comes online
3. Configure a port mirroring connection between the SID and DID through the mirror port:
   portmirror --add <mirrorportnumber> <SourceID> <DestID>
   The mirror port must be online
   To verify the mirror connection, invoke portmirror --show
4. Start the FC Analyzer capture, reproduce the problem, stop the capture, and review the output
5. Remove the port mirror connection with the portmirror --delete command:
   portmirror --delete <mirrorportnumber> <SourceID> <DestID>
6. Remove the mirror port configuration (to allow other connections to this port):
   portcfg mirrorport <[slot/]port#> --disable
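Because the setup and teardown commands mirror each other, it can help to generate both lists from the same three values so that nothing is left configured after the capture. The sketch below only builds command strings from the syntax shown in the procedure; the function name mirror_session and the sample port and addresses are hypothetical, and the sketch does not connect to a switch.

```python
def mirror_session(mirror_port, source_id, dest_id):
    """Return (setup, teardown) command lists for one port-mirroring capture."""
    setup = [
        f"portcfg mirrorport {mirror_port} --enable",
        f"portcfgshow {mirror_port}",
        "switchshow",
        f"portmirror --add {mirror_port} {source_id} {dest_id}",
    ]
    teardown = [
        f"portmirror --delete {mirror_port} {source_id} {dest_id}",
        f"portcfg mirrorport {mirror_port} --disable",
    ]
    return setup, teardown

# Hypothetical values: mirror port 3, and illustrative source/destination IDs.
setup, teardown = mirror_session("3", "0x0a0000", "0x1400e2")
for command in setup + teardown:
    print(command)
```

The analyzer capture itself happens between the two lists; the teardown order follows the procedure (delete the mirror connection first, then disable the mirror port).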
There are several different types of switch support data that can be collected from a Brocade switch, router, or Director:
- Switch error logs (RASLogs)
- Audit logs
- FFDC files
- Panic dump and core files
- Trace dump files
RASLog - Overview
Starting in Fabric OS v4.4, the System Message Log is called the Reliability, Availability, and Serviceability Log (RASLog)
RASLog error messages are defined in one of two groups:
- External messages (CRITICAL, ERROR, WARNING, and INFO) can be viewed by admin-level users
- Internal messages (DEBUG and PANIC) cannot be viewed by admin-level users
In Fabric OS v5.1+, certain security- and zoning-related commands cause an AUDIT flag to be added to error messages
Start 2006/03/08-11:59:32, [ZONE-3006], 9, AUDIT, INFO, NDAST01-B48, User: admin, Role: admin, Event: cfgdisable, Status: success, Info: Current zone configuration disabled. End
RASLog - Management
Use the following commands to view the RASLog associated with external messages:
- Display all external messages in the error log with no line breaks: errdump (default display order: least-recent to most-recent)
- Display all external messages in the error log with line breaks: errshow (default display order: least-recent to most-recent)
- Use errdump -r or errshow -r to display error messages in reverse order: most-recent to least-recent
- Clear all internal and external messages from the error log with the admin-level errclear command
Forward RASLog and Console log entries to a syslogd daemon on a host computer (syslogdipadd)
Especially important on dual-CP systems as host computer logs maintain a single, sequentially ordered, merged file for both CPs
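Once RASLog entries land in a host-side file, splitting each line into fields makes filtering by severity or message ID straightforward. This is a sketch under assumptions: the field names (sequence, flags, switch) are interpretations of the sample lines in this material, not an official format definition, and the function name parse_raslog is hypothetical.

```python
from datetime import datetime

def parse_raslog(line):
    """Split one external RASLog line into named fields."""
    stamp, msg_id, seq, flags, severity, rest = [
        part.strip() for part in line.split(",", 5)
    ]
    switch, _, message = rest.partition(",")
    return {
        "time": datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S"),
        "msg_id": msg_id.strip("[]"),
        "sequence": int(seq),
        "flags": flags,  # empty, or a flag such as AUDIT or FFDC
        "severity": severity,
        "switch": switch.strip(),
        "message": message.strip(),
    }

# Line copied from the errshow example earlier in this material.
entry = parse_raslog(
    "2007/01/31-12:50:27, [FABR-1001], 4,, WARNING, rsl1_st10_b41_1, "
    "port 8, ELP rejected by the other switch"
)
print(entry["msg_id"], entry["severity"], entry["message"])
```

The same function handles an AUDIT-flagged line, in which case the flags field reads "AUDIT" instead of being empty.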
AUDIT messages are always sent to the console, and can be configured to go to syslog servers
In an AD-aware fabric, Audit Log configuration is done from AD255
Commands involved in configuring the Audit Log include:
- auditcfg to enable auditing and define what gets audited (filters)
- syslogdipadd to specify the IP address of a syslog server configured to receive audit messages
FFDC - Overview
To minimize requests for problem recreation from certain Brocade-defined events, Fabric OS captures First Failure Data Capture (FFDC) data
- Goal: Allow Brocade engineers to gain insight into problems that are transient, difficult to recreate, or difficult to solve
- Triggered by error MSG_IDs that are selected by Brocade engineering
- Messages are written to the console and the error log with an FFDC flag
Automatically collects supportshow-like information (based on CLI commands) as readable text when the selected event occurs
- A single FFDC event may create one or more FFDC files
- Up to 4 MB for all FFDC files combined (if the maximum size is reached, a RASLog message is generated, and periodic console messages are sent)
FFDC files are stored on the switch, and transferred by supportsave (automatically deletes files) or savecore (does not automatically delete files)
FFDC - Configuring
Enable and disable the FFDC functionality with the supportffdc command
Enabled by default - disable only if directed to do so by next-level support
switch:admin> supportffdc
--enable   <Enable FFDC>
--disable  <Disable FFDC>
--show     <Show FFDC state>
FFDC - Capturing
The supportsave command uploads the FFDC data via FTP, and deletes it from the switch
File name indicates the triggering event, and date/time stamp (example: FSSM1005-2006-08-12-114707.ffdc)
The savecore command also uploads the FFDC data via FTP (same file name), but does not delete it from the switch
switch:admin> savecore
following 1 directories contains core files:
[ ]0: /core_files/ffdc_data
Welcome to core files management utility.
Menu
1(or R): Remove all core files
2(or F): FTP all core files
3(or r): Remove marked files
4(or f): FTP marked files
5(or m): Mark Files for action
6(or u): Un Mark Files for action
9(or e): Exit
Your choice:
In a dual-CP Director, each CP can create these files, so always check both CPs
To upload (via FTP) or remove panic dump and core files, use the savecore command

switch:admin> savecore -l
/core_files/panic/core.873
/core_files/zoned/core.1234
/core_files/zoned/core.5678
/mnt/core_files/nsd/core.873
/mnt/core_files/panic/core.873
switch:admin> savecore -h 192.168.204.188 -u jsmith -d core_files_here -p password -f /core_files/zoned/,/mnt/core_files/nsd/
/core_files/zoned//core.1234: 1.12 kB 382.60 B/s
/core_files/zoned//core.5678: 1.12 kB 381.95 B/s
/mnt/core_files/nsd//core.873: 1.12 kB 382.53 B/s
Files transferred successfully!
The results from the trace operation are stored in a trace dump file
- Triggered by a panic, a timeout, a CRITICAL-level event, or a manual trigger
- Binary file, retained in persistent memory
- Can be uploaded automatically or manually via FTP
Use the traceftp command to manage the uploading (but not deleting) of trace dumps:
- traceftp -n: Manually upload trace dumps via FTP
- traceftp -e: Enable automatic FTP upload of trace dumps
- traceftp -d: Disable automatic FTP upload of trace dumps
With traceftp -e, specify the FTP server to which trace dumps are uploaded with the supportftp command; this must be done, or trace dump files will not be automatically uploaded
Identify faults on the switch by checking the RASLog (errdump) for error-related messages
As needed, compare time stamps between the RASLog and the Audit Log to determine whether user actions were a problem source
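The time-stamp comparison can be sketched as a simple window search: for a given fault time, list the audited user actions that happened shortly before it. Everything here is illustrative: the function name actions_before, the ten-minute window, and the sample entries are assumptions, and the pairing of these particular events is made up.

```python
from datetime import datetime, timedelta

def actions_before(fault_time, audit_events, window_minutes=10):
    """Return audit events that occurred within the window before a fault."""
    window = timedelta(minutes=window_minutes)
    return [
        (when, what)
        for when, what in audit_events
        if timedelta(0) <= fault_time - when <= window
    ]

# Hypothetical entries; the timestamps and the pairing are made up.
audit_log = [
    (datetime(2006, 3, 8, 11, 59, 32), "cfgdisable"),
    (datetime(2006, 3, 8, 9, 15, 0), "login"),
]
fault = datetime(2006, 3, 8, 12, 1, 10)
for when, what in actions_before(fault, audit_log):
    print(when, what)
```

In this made-up case only the cfgdisable event falls inside the window, which would make it the first user action to examine.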
If available, also capture the Audit logs, so that past user actions can be identified