Participant Guide
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 1
3rd edition (July 2008)
Table of Contents
Destination Driver Events .................................................................................. 140
Read Link Status (RLS) and Switch-on-a-Chip (SOC)............................................ 143
What is SOC or SBOD?...................................................................................... 148
Field Case........................................................................................................ 160
Drive Channel State Management ...................................................................... 161
SAS Backend.................................................................................................... 163
Terms and Conditions
Agreement
This Educational Services and Products Terms and Conditions (“Agreement”) is between
LSI Corporation (“LSI”), a Delaware corporation, doing business in AL, AZ, CA, CO, CT,
DE, FL, GA, KS, IL, MA, MD, MN, NC, NH, NJ, NY, OH, OR, PA, SC, UT, TX, VA and WA
as LSI Corporation, with a place of business at 1621 Barber Lane, Milpitas, California
95035 and you, the Student. By signing this Agreement, or clicking on the “Accept”
button as appropriate, Student accepts all of the terms and conditions set forth below.
LSI reserves the right to change or modify the terms and conditions of this Agreement
at any time.
Course materials
The course materials are derived from end-user publications and engineering data
related to LSI’s Engenio Storage Group (“ESG”) and reflect the latest information
available at the time of printing but will not include modifications if they occurred after
the date of publication. In all cases, if there is discrepancy between this information and
official publications issued by LSI, LSI’s official publications shall take precedence.
LSI assumes no liability for the accuracy or correctness of the course materials
provided to Student and assumes no obligation to correct any errors contained herein
or to advise Student of any such errors. LSI makes no commitment to update the
course materials, and LSI reserves the right to change the course materials, including
any terms and conditions, from time to time at its sole discretion. LSI reserves the
right to seek all available remedies for any illegal misuse of the course materials by
Student.
Certification
Student acknowledges that purchasing or participating in an LSI course does not imply
certification with respect to any LSI certification program. To obtain certification,
Student must successfully complete all required elements in an applicable LSI
certification program. LSI may update or change certification requirements at any time
without notice.
Ownership
LSI and its affiliates retain all right, title and interest in and to the course materials,
including all copyrights therein. LSI grants Student permission to use the course
materials for personal, educational purposes only. The resale, reproduction, or
distribution of the course materials, and the creation of derivative works based on the
course materials, is prohibited without the prior express written permission of LSI.
Nothing in this Agreement shall be construed as an assignment of any patents,
copyrights, trademarks, or trade secret information or other intellectual property rights.
Testing
While Student is participating in a course, LSI may test Student's understanding of the
subject matter. Furthermore, LSI may record Student's participation in a course on
videotape or by other recording means. Student agrees that LSI is the owner of all such
test results and recordings, and may use such test results and recordings subject to
LSI's privacy policy.
Software license
All software utilized or distributed as course materials, or an element thereof, is licensed
pursuant to the license agreement accompanying the software.
Indemnification
Student agrees to indemnify, defend and hold LSI, and all its officers, directors, agents,
employees and affiliates, harmless from and against any and all third party claims for
loss, damage, liability, and expense (including reasonable attorney's fees and costs)
arising out of content submitted by Student, Student's use of course materials (except
as expressly outlined herein), or Student's violations of any rights of another.
Disclaimer of warranties
THE COURSE MATERIALS (INCLUDING ANY SOFTWARE) ARE PROVIDED ON AN “AS
IS” AND “AS AVAILABLE” BASIS, WITHOUT WARRANTY OF ANY KIND. LSI DOES
NOT WARRANT THAT THE COURSE MATERIALS: WILL MEET STUDENT'S
REQUIREMENTS; WILL BE UNINTERRUPTED, TIMELY, SECURE, OR ERROR-FREE; OR
WILL PRODUCE RESULTS THAT ARE RELIABLE. LSI EXPRESSLY DISCLAIMS ALL
WARRANTIES, WHETHER EXPRESS, IMPLIED OR STATUTORY, ORAL OR WRITTEN,
WITH RESPECT TO THE COURSE MATERIALS, INCLUDING WITHOUT LIMITATION THE
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE WITH RESPECT TO THE SAME. LSI EXPRESSLY DISCLAIMS ANY WARRANTY
WITH RESPECT TO ANY TITLE OR NONINFRINGEMENT OF ANY THIRD-PARTY
INTELLECTUAL PROPERTY RIGHTS, OR AS TO THE ABSENCE OF COMPETING CLAIMS,
OR AS TO INTERFERENCE WITH STUDENT’S QUIET ENJOYMENT.
Limitation of liability
STUDENT AGREES THAT LSI SHALL NOT BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, CONSEQUENTIAL OR EXEMPLARY DAMAGES, INCLUDING BUT
NOT LIMITED TO, DAMAGES FOR LOSS OF PROFITS, GOODWILL, USE, DATA OR
OTHER SUCH LOSSES, ARISING OUT OF THE USE OR INABILITY TO USE THE COURSE
MATERIALS, EVEN IF LSI HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
DAMAGES. LSI'S LIABILITY FOR DAMAGES TO STUDENT FOR ANY CAUSE
WHATSOEVER, REGARDLESS OF THE FORM OF ANY CLAIM OR ACTION, SHALL NOT
EXCEED THE AGGREGATE FEES PAID BY STUDENT FOR THE USE OF THE COURSE
MATERIALS INVOLVED IN THE CLAIM.
Miscellaneous
Student agrees to not export or re-export the course materials without the appropriate
United States and foreign government licenses, and shall otherwise comply with all
applicable export laws. In the event that course materials in the form of software are
acquired by or on behalf of a unit or agency of the United States government (the
“Agency”), the Agency agrees that such software is comprised of “commercial computer
software” and “commercial computer software documentation” as such terms are used
in 48 C.F.R. 12.212 (Sept. 1995) and is provided to the Agency for evaluation or
licensing (A) by or on behalf of civilian agencies, consistent with the policy set forth in
48 C.F.R. 12.212; or (B) by or on behalf of units of the Department of Defense,
consistent with the policies set forth in 48 C.F.R. 227.7202-1 (June 1995) and
227.7203-3 (June 1995).
This Agreement shall be governed by and construed in accordance with the laws of the
State of California, without regard to its choice of law or conflict of law provisions. In the
event of any conflict between foreign laws, rules and regulations and those of the
United States, the laws, rules and regulations of the United States shall govern.
In any action or proceeding to enforce the rights under this Agreement, the prevailing
party shall be entitled to recover reasonable costs and attorneys' fees.
In the event that any provision of this Agreement shall, in whole or in part, be
determined to be invalid, unenforceable or void for any reason, such determination shall
affect only the portion of such provision determined to be invalid, unenforceable or void,
and shall not affect the remainder of such provision or any other provision of this
Agreement. This Agreement constitutes the entire agreement between LSI and
Student relating to the course materials and supersedes any prior agreements, whether
written or oral, between the parties.
Trademark acknowledgments
Engenio, the Engenio design, HotScale™, SANtricity, and SANshare™ are trademarks or
registered trademarks of LSI Corporation. All other brand and product names may be
trademarks of their respective companies.
Copyright notice
© 2006, 2007, 2008 LSI Corporation. All rights reserved.
Storage Systems Diagnostics and Troubleshooting Course
Outline
Course Description:
Storage Systems Diagnostics and Troubleshooting is an advanced course that presents
the technical aspects of diagnosing and troubleshooting LSI-based storage systems
through advanced data analysis and in-depth troubleshooting.
The basic objective of this course is to equip the participants with the essential concepts
associated with troubleshooting and repairing LSI-based storage systems using either
SANtricity™ Storage Management software, analysis of support data, or controller shell
commands.
The information contained in the course is derived from internal engineering publications
and is confidential to LSI Corporation. It reflects the latest information available at the
time of printing but may not include modifications if they occurred after the date of
publication.
Prerequisites:
Ideally, the successful student will have completed both the Installation and
Configuration and the Support and Maintenance courses offered by Global Education
Services at LSI Corporation.
Students should have at least six months of field exposure with LSI storage products and
technologies in a support function.
Audience:
This course is designed for customer support personnel responsible for diagnosing and
troubleshooting LSI storage systems through the use of support data analysis and
controller shell access. The course is designed for individuals employed as Tier 3 support
of LSI-based storage systems.
It is assumed that the student has in-depth experience and knowledge with Fibre
Channel Storage Area Network (SAN) technologies including RAID, Fibre Channel
topology, hardware components, installation, and configuration.
Course Length:
Approximately 4 days in length with 60% lecture and 40% hands-on lab.
Course Objectives
Upon completion of this course, the participant will be able to:
• Recognize the underlying behavior of LSI-based storage systems
• Analyze a storage system for failures through the analysis of support data
• Successfully analyze back-end Fibre Channel errors
• Successfully interpret configuration errors
Course Modules
1. Storage System Support Data Analysis
2. Storage System Level Overview
3. Configuration Overview and Analysis
4. IO Driver and Drive Side Error Reporting and Analysis
Module 1: Storage System Support Data Overview
Upon completion, the participant should be able to do the following:
• Describe the purpose of the files that are included within the All Support Data
Capture
• Analyze the Major Event Log at a high level in order to diagnose an event
All Support Data Capture
• Benefits
– Provides a point-in-time snapshot of system status.
– Contains all logs needed for a ‘first look’ at system failures.
– Easy customer interface through the GUI.
– Non-disruptive
• Drawbacks
– Requires GUI accessibility.
– Can take some time to gather on a large system.
All Support Data Capture Files - 06.xx.xx.xx
• driveDiagnosticData.bin
– Drive log information contained in a binary format.
• majorEventLog.txt
– Major Event Log
• NVSRAMdata.txt
– NVSRAM settings from both controllers
• objectBundle
– Binary format file containing java object properties
• performanceStatistics.csv
– Current performance statistics by volume
• persistentReservations.txt
– Volumes with persistent reservations will be noted here
• readLinkStatus.csv
– RLS diagnostic information in comma separated value format
• recoveryGuruProcedures.html
– Recovery Guru procedures for all failures on the system
• recoveryProfile.csv
– Log of all changes made to the configuration
• socStatistics.csv
– SOC diagnostic information in comma separated value format
• stateCaptureData.dmp/txt
– Informational shell commands run on both controllers
• storageArrayConfiguration.cfg
– Saved configuration for use in the GUI script engine
• storageArrayProfile.txt
– Storage array profile
• unreadableSectors.txt
– Unreadable sectors will be noted here, noting the volume and drive LBA
All Support Data Capture Files - 07.xx.xx.xx
• Contains all the same files as the 06.xx.xx.xx releases, but adds three new files:
– Connections.txt
• Lists the physical connections between expansion trays
– ExpansionTrayLog.txt
• ESM event log for each ESM in the expansion trays
– featureBundle.txt
• Lists all premium features and their status on the system
Major Event Log (MEL) Overview
Major Event Log Facts
• Array controllers log events and state transitions to an 8192-event circular buffer.
– Log is permanent
– Survives:
• Power cycles
• Controller swaps
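The wrap-around behavior of such a log can be sketched in Python. This is an illustrative model only; the real MEL is persisted by controller firmware (which is how it survives power cycles and controller swaps), and the event dictionary fields here are invented:

```python
from collections import deque

# Illustrative model of an 8192-event circular buffer like the MEL.
# (The real log is persisted by controller firmware; this sketch only
# shows the wrap-around behavior of a fixed-size event log.)
MEL_CAPACITY = 8192

mel = deque(maxlen=MEL_CAPACITY)

for sequence_number in range(10_000):      # log 10,000 events
    mel.append({"seq": sequence_number, "desc": "state transition"})

# Once the buffer is full, the newest events overwrite the oldest:
print(len(mel))        # 8192
print(mel[0]["seq"])   # 1808 -- events 0..1807 have been overwritten
```

The practical consequence is the same as for the real MEL: on a busy array, old events are eventually overwritten, so logs should be captured promptly after a failure.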
General Raw Data Categories (06.xx)
Byte Swapping
• Remember: when byte swapping, select all of the bytes in the field.
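As a concrete illustration (the field value below is hypothetical, not from an actual log): byte swapping a 4-byte field means reversing all four bytes at once, not swapping adjacent pairs:

```python
def swap_field(raw: bytes) -> bytes:
    """Reverse ALL of the bytes in a raw-data field, not just pairs."""
    return raw[::-1]

# A hypothetical 4-byte field as it appears in a raw dump:
field = bytes.fromhex("0d010000")

print(swap_field(field).hex())          # "0000010d"
print(int.from_bytes(field, "little"))  # 269, i.e. 0x010d
```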
Quick View of the Locations of the Raw Data Fields (06.xx)
MELH - Signature
MEL version - 2 means 5.x code or 06.x code
Event Description - Includes: Event Group, Component, Internal Flags, Log Group &
Priority
I/O Origin – refer to the MEL spec for the event type
Reporting Controller - 0=A 1=B
Valid? - 0=Not valid 1=Valid data
O1 - Number of Optional Data Fields
O2 - Total length of all of the Optional Data Fields in Hex
F1 - Length of this optional data field
F2 - Data field type (If there is a value of 0x8000 this is a continuation of
the previous optional data field. This would be read as a continuation
of the previous data field type 0x010d.)
F3 - The “cc” means drive-side channel; the following value refers to
the channel number and is 1-relative.
Sense Data - Vendor specific depending on the component type.
N/U - Not Used
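A sketch of how the optional data fields (counted by O1/O2, each with an F1 length and F2 type) might be walked, including the 0x8000 continuation convention. The 2-byte big-endian field widths, the helper name, and the sample bytes are assumptions for illustration; the MEL spec is the authority on the actual layout:

```python
import struct

CONTINUATION_FLAG = 0x8000   # set in the type word of continuation fields

def parse_optional_fields(data: bytes):
    """Walk the optional data fields of a raw MEL entry.

    Assumed layout per field: a 2-byte length (F1) and a 2-byte type
    (F2), big-endian, followed by the payload. A continuation field
    carries the previous field's type with bit 0x8000 set (e.g. a
    continuation of type 0x010d); its payload is appended to the
    previous field. Widths here are illustrative; consult the MEL spec.
    """
    fields = []
    offset = 0
    while offset + 4 <= len(data):
        length, ftype = struct.unpack_from(">HH", data, offset)
        payload = data[offset + 4 : offset + 4 + length]
        if ftype & CONTINUATION_FLAG and fields:
            # Continuation: append payload to the previous field.
            fields[-1] = (fields[-1][0], fields[-1][1] + payload)
        else:
            fields.append((ftype, payload))
        offset += 4 + length
    return fields

# Hypothetical bytes: a type-0x010d field split across two records.
raw = bytes.fromhex("0002010dcc010002810d0102")
print(parse_optional_fields(raw))   # one field of type 0x010d,
                                    # payload reassembled to 4 bytes
```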
Comparison of the Locations of the Summary Information and Raw
Data (07.xx)
Quick View of the Locations of the Raw Data Fields (07.xx)
Event Description - includes: Event Group, Component, Internal Flags, Log Group & Priority
1. – I/O Origin
2. - Reserved
3. - Controller reported by (0=A 1=B)
4. - Number of optional data fields present
5. - Total length of optional Data
6. - Single optional field length
7. - Data field type; data field types that begin with 0x8000 are a continuation of the
previous data field of the same type
MEL Summary Information
• Skey/ASC/ASCQ
– Defined in Chapter 11 (06.xx), 12 (07.xx) of the Software Interface Spec
• AEN Posted events
– Event 3101
• Drive returned check condition events
– Event 100a
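The Skey/ASC/ASCQ notation is simply three hex values joined with slashes; a small helper (illustrative, not part of any LSI tool) splits one apart:

```python
def parse_sense_code(code: str):
    """Split a 'Skey/ASC/ASCQ' string such as '6/3f/80' into integers."""
    skey, asc, ascq = (int(part, 16) for part in code.split("/"))
    return skey, asc, ascq

print(parse_sense_code("6/3f/80"))   # (6, 63, 128) = 0x6 / 0x3f / 0x80
```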
Controller Return States
• Return status and RPC function call as defined in the MEL Specification
• Return Status
0x01 = RETCODE_OK
• RPC Function Call
0x07 = createVolume_1()
Event Specific Codes
6/3f/80 = Drive no longer usable (The controller set the drive state to
“Failed – Write Failure”)
Sense Data (SIS Chapter 5)
• Byte 14 FRU = 0x7d
– FRU is Drive Group (Devnum = 0x60000d)
• Byte 26 = 0x02
– Tray ID = 2
• Byte 27 = 0x05
– Slot = 5
Sense Data (SIS Chapter 5)
• Byte 26 = 0xd5
1101 0101; masking the high bit gives 0101 0101 = 0x55 = tray 85
• Byte 27 = 0x69
0110 1001
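The masking step for byte 26 can be shown in code (assuming, for illustration, that the top bit of byte 26 is a flag bit that must be cleared before reading the tray number; the helper name is ours, not a firmware symbol):

```python
def tray_from_byte26(value: int) -> int:
    """Clear the high (flag) bit of sense-data byte 26 to get the tray.

    Assumption for illustration: 0xd5 = 1101 0101 masks to
    0101 0101 = 0x55 = tray 85.
    """
    return value & 0x7F

print(tray_from_byte26(0xD5))   # 85
print(tray_from_byte26(0x02))   # 2 (matches the Tray ID = 2 example)
```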
Automatic Volume Transfer
• IO Origin field
– 0x00 = Normal AVT
– 0x01 = Forced AVT
• LUN field
– Number of volumes being transferred
– Will be 0x00 if it is a forced volume transfer
Mode Select Page 2C
• IOP ID Field
o Contains the Host Number that issued the Mode Select (referenced in the
tditnall command output)
• Optional data is defined in the Software Interface Specification, section 6.15 (or 5.15)
Module 2: Storage System Analysis
Upon completion, the participant should be able to do the following:
• Log into the controller shell
• Identify and modify the controller states
• Recognize the battery function within the controllers
• Describe the network functionality
• List developer functions available within the controller shell commands
State Capture Data File
Amethyst/Chromium (06.16.xx, 06.19.xx/06.23.xx)
The following commands are collected in the state capture for the Amethyst and
Chromium releases:
Crystal (07.10.xx.xx)
The following commands are collected in the state capture for the Crystal release:
Accessing the Controller Shell
• 06.xx firmware controllers allow access to the controller shell over the network via
rlogin
• 07.xx firmware controllers allow access to the controller shell over the network via
telnet
• The shell allows the user to access controller firmware commands & routines directly.
Controller Analysis
• bidShow 255 (07.xx)
getObjectGraph_MT / getObjectGraph_MT 99
• Prior to Chromium 2 (06.60.xx.xx), and in Crystal (07.xx), the getObjectGraph_MT
command was used several times to collect the following:
• getObjectGraph_MT 1 – Controller Information
• getObjectGraph_MT 4 – Drive Information
• getObjectGraph_MT 8 – Component Status
Additional Output
Knowledge Check
Find the same information in the StateCaptureData.txt file. List what command was
referenced to find the information.
Command Referenced
06.xx 07.xx
Controller Firmware version:
Board ID:
Network IP Address:
Volume Ownership (by SSID):
ESM Firmware Version:
Additional Commands
Debug Queue
• There is no standard for data written to the debug queue; each core asset team
writes the information it feels is needed for debug.
• Because so much data is being written to the debug queue, it is important to gather
it as soon as possible after the initial failure.
• Because there is no standard for the data written to the debug queue, multiple
development teams must work in conjunction to analyze the debug queue.
Debug Queue Rules
• First, check ‘dqlist’ to verify which trace contains events during the time of failure.
• There may not be a debug queue trace file that contains the timeline of the
failure; in this case, no information can be gained.
• First data capture is a must with the debug queue, as information is logged very
quickly.
Summary
• Look at the first / last timestamps and remember that they’re in GMT.
• Don’t just type ‘dqprint’ unless you actually want to flush and print the ‘trace’
trace file (the one we’re currently writing new debug queue data to). Only typing
‘dqprint’ can actually make you lose the useful data if you’re not paying
attention.
• Keep in mind that the debug queue wasn’t designed for you to read, only for you
to collect and someone in development to read.
• Remember, even LSI developers, when looking at debug queue traces, need to
go back to the core asset team that actually wrote the code that printed specific
debug queue data, in order to decode it.
Knowledge Check
True False
True False
The Debug Module is needed for access to all controller shell commands.
True False
Modifying Controller States
• Controller states can be modified via the GUI to place a controller offline, in
service mode, or online, or to reset a controller
• These same functions can be achieved from the controller shell if GUI access is
not available
• Commands that end in _MT use the SYMbol layer and require that the network
be enabled, but do not require that the controller actually be on the network.
The controller must also be through Start Of Day
• The _MT commands are valid for both 06.xx and 07.xx firmware
• The legacy (06.xx and lower) commands are referenced in the ‘Troubleshooting
and Technical Reference Guide Volume 1’ on page 27
• To transfer all volumes from the alternate controller and place the
alternate controller in service mode
-> setControllerServiceMode_MT 1
• While the controller is in service mode, it is still powered on and is available for
shell access. However, it is not available for host I/O, similar to a ‘passive’ mode.
• To transfer all volumes from the alternate controller and place the
alternate controller offline
-> setControllerToFailed_MT 1
• While the controller is offline, it is powered off and is unavailable for shell access.
It is not available for host I/O
• To place the alternate controller back online from either an offline
state, or from in service mode
-> setControllerToOptimal_MT 1
• This will place the alternate controller back online and active; however, it will not
automatically redistribute the volumes to the preferred controller
– sysReboot
– resetController_MT 0
– isp rdacMgrAltCtlReset
• Reset the alternate controller
– altCtlReset 2
– resetController_MT 1
Diagnostic Data Capture (DDC)
Brief History
• Multiple ancient IO events in the field
• Ancient IO
• Master abort due to a bad address accessed by the Fibre Channel chip results in a
PCI error
DDC Trigger
• MEL event gets logged whenever DDC logs are available in the system
• Batteries
– Get enabled if the system has batteries which are sufficiently charged
– DDC logs triggered by ancient IO MAY survive without batteries, as
ancient IO does not cause a hard reboot.
• DDC info is persistent across power cycles and controller reboots, provided the
following is true:
– System contains batteries which are sufficiently charged
• Binary
DDC MEL Events
• MEL_EV_DDC_AVAILABLE
– Event # 6900
– Diagnostic data is available
– Critical
• MEL_EV_DDC_RETRIEVE_STARTED
– Event # 6901
– Diagnostic data retrieval operation started
– Informational
• MEL_EV_DDC_RETRIEVE_COMPLETED
– Event # 6902
– Diagnostic data retrieval operation completed
– Informational
• MEL_EV_DDC_NEEDS_ATTENTION_CLEARED
– Event # 6903
– Diagnostic data Needs Attention status cleared
– Informational
Knowledge Check
1) A controller can only be placed offline via the controller shell interface.
True False
True False
True False
True False
Module 3: Configuration Overview and Analysis
Upon completion, the participant should be able to do the following:
• Describe the difference between the legacy configuration structures and the new
07.xx firmware configuration database
• Analyze an array’s configuration from shell output and recognize any errors in
the configuration
Configuration Overview and Analysis
• In 06.xx firmware, the storage array configuration was maintained as data structures
resident in controller memory with pointers to related data structures
• The data structures were written to DACstore with physical references (devnums)
instead of memory pointer references
• A drawback of this design is that the physical references used in DACstore
(devnums) could change, which could cause a configuration error when the
controllers read the configuration information from DACstore
• As of 07.xx the storage array configuration has been changed to a database design
Configuration Overview and Analysis
• Pieces
– Pieces are simply the slice of a disk that one volume is utilizing; there
could be multiple pieces on a drive, but a piece can only reference one
drive
• Piece Structures
– Piece structures maintain the following configuration data
• A pointer to the volume structure
• A pointer to the drive structure
• Devnum of drive that the piece resides on
• Spared devnum if a global hot spare has taken over
• The piece’s state
• Drive Structures
– Drive structures maintain the following configuration data
• The drive's devnum and tray/slot information
• Blocksize, Capacity, Data area start and end
• The drive’s state and status
• The drive’s flags
• The number of volumes resident on the drive (assuming it is
assigned)
• Pointers to all pieces that are resident on the drive (assuming it is
assigned)
• Volume Structures
– Volume structures maintain the following information
• SSID number
• RAID level
• Capacity
• Segment size
• Volume state
• Volume label
• Current owner
• Pointer to the first piece
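The pointer relationships among pieces, drives, and volumes can be sketched as plain data structures. Names and field sets are illustrative, not actual firmware symbols; in controller memory the links are pointers, while DACstore persists devnums in their place:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative sketch of the 06.xx in-memory configuration structures.

@dataclass
class Drive:
    devnum: int          # physical reference, persisted to DACstore
    tray: int
    slot: int
    pieces: list = field(default_factory=list)   # pointers to resident pieces

@dataclass
class Volume:
    ssid: int
    raid_level: int
    label: str
    first_piece: Optional["Piece"] = None        # pointer to the first piece

@dataclass
class Piece:
    volume: Volume                 # pointer to the volume structure
    drive: Drive                   # pointer to the drive structure
    devnum: int                    # devnum of the drive the piece resides on
    spared_devnum: Optional[int] = None   # set if a global hot spare took over
    state: str = "optimal"

# Hypothetical values, wired up the way the structures reference each other:
drive = Drive(devnum=0x60000D, tray=2, slot=5)
volume = Volume(ssid=0, raid_level=5, label="LUN0")
piece = Piece(volume=volume, drive=drive, devnum=drive.devnum)
drive.pieces.append(piece)
volume.first_piece = piece
```

The sketch also makes the drawback noted earlier visible: the persisted link is the devnum, so if the physical reference changes, the stored configuration no longer matches the hardware.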
Configuration Overview and Analysis (07.xx)
• Each record maintains a reference to its parent record and its own specific state
info
• The “Virtual Disk Manager” (VDM) uses this information and facilitates the
configuration and I/O behaviors of each volume group
– VDM is the core module that consists of the drive manager, the piece
manager, the volume manager, the volume group manager, and the
exclusive operations manager
• Pieces
– Pieces may also be referenced as ‘Ordinals’. Just remember that piece ==
ordinal and ordinal == piece
• Piece Records
– Piece records maintain the following configuration data
• A reference to the RAID Volume Record
• Update Timestamp of the piece record
• The persisted ordinal (which piece number, in stripe order, this
record is in the RAID Volume)
• The piece’s state
• Drive Records
– Note that there is no reference to the piece record itself, only the ordinal
value
– The parent record for an assigned drive is the Volume Group record
– Volume Records only refer back to their parent volume group record via
the WWN of the volume group
– Note that the Volume Group record does not reference anything but itself
07.xx configuration layout
o 07.xx configuration uses hard-set values, such as physical device WWNs
and internally set WWN values for RAID Volumes and Volume Groups,
which will not change once created.
• Provides for a more robust and reliable means of handling failure scenarios
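The record model can be sketched the same way (names and WWN values are illustrative, not actual firmware symbols): each record carries a stable WWN reference to its parent instead of a memory pointer or devnum, so the reference survives drive moves and devnum changes:

```python
from dataclasses import dataclass

# Illustrative sketch of the 07.xx database-style configuration records.

@dataclass
class VolumeGroupRecord:
    wwn: str                  # internally set at creation; never changes

@dataclass
class VolumeRecord:
    wwn: str
    parent_group_wwn: str     # refers back to the volume group by WWN

@dataclass
class PieceRecord:
    parent_volume_wwn: str    # reference to the RAID Volume record
    ordinal: int              # piece number, in stripe order
    state: str = "optimal"

# Hypothetical WWNs, wired up the way the records reference their parents:
group = VolumeGroupRecord(wwn="60080E5000123400")
vol = VolumeRecord(wwn="60080E5000123401", parent_group_wwn=group.wwn)
piece = PieceRecord(parent_volume_wwn=vol.wwn, ordinal=0)

# Moving the physical drive changes its devnum, but the WWN-based
# parent references above are unaffected.
```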
Knowledge Check
True False
3) Shell commands to analyze the config did not change between 06.xx and 07.xx.
True False
Drive and Volume State Management
Volume State Management
Beginning with Crystal, there are different classifications for volume group states
• Incomplete – Drives are missing and there is not enough redundancy available to
allow I/O operations to continue
• Exported – Volume group and associated volumes are offline as a result of a user
initiated export (used in preparation for a drive migration)
Hot Spare Behavior
• A hot spare can spare for a failed drive or NotPresent drive that has failed pieces
• If an InUse hot spare drive fails and that failure causes any volumes in the
volume group to transition to the failed state, then the failed InUse hot spare will
remain integrated in the VG to provide the best chance of recovery
• If none of the volumes in the volume group are in the failed state, then the failed
InUse hot spare is de-integrated from the volume group, making it a “failed
standby” hot spare, and another optimal standby hot spare will be integrated
• If the failure occurred due to a reconstruction read error, then the InUse hot spare
drive won’t be failed, but it will be de-integrated from the volume group, and
integration will not be retried with another standby hot spare drive. This “read
error” information is not persisted or held in memory, so integration will be
retried if the controller is ever rebooted or if an event occurs that would start
integration.
• When copyback completes, the InUse hot spare drive is de-integrated from its
group and is transitioned to a Standby Optimal hot spare drive.
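The InUse hot spare failure rules above amount to a small decision function. The sketch below is a simplification with invented names, not controller code:

```python
def handle_inuse_spare_failure(volumes_failed: bool,
                               read_error_during_reconstruction: bool):
    """Sketch of the InUse hot spare failure rules described above.

    Returns (spare_stays_integrated, retry_with_another_spare).
    """
    if read_error_during_reconstruction:
        # Spare is not failed but is de-integrated; no retry with another
        # standby spare (the non-persisted read-error info forgets this
        # on a controller reboot, at which point integration is retried).
        return (False, False)
    if volumes_failed:
        # Failed InUse spare stays integrated for the best chance of recovery.
        return (True, False)
    # Otherwise: de-integrate ("failed standby") and integrate another
    # optimal standby hot spare.
    return (False, True)
```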
Volume Mappings Information
Knowledge Check
Portable Volume Groups in 07.xx
• The manual drive migration procedure is now gone and has been replaced by
portable volume group functionality.
o The Volume Group is placed in the “Export” state and the drives marked
offline and spun down
o Drive references are removed once all drives in the “Exported” volume
group are physically removed from the donor system
o The user must specify that the configuration of the new disks be
“Imported” to the current system configuration
o Once “Imported” the configuration data on the migrated group and the
existing configuration on the receiving system are synchronized and the
volume group is brought online
RAID 6 Volumes in 07.xx
• First we should get the “Marketing” stuff out of the way
• XBB-2 (Which will release with Emerald 7.3x) will support RAID 6
o P is for parity, just like we’ve always had for RAID 5 and can be used to
reconstruct data
• A RAID 6 Volume Group can survive up to two drive failures and maintain access
to user data
• Minimum number of drives for a RAID 6 Volume Group is five drives with a
maximum of 30
• There is some additional capacity overhead due to the need to store both P and
Q data (i.e. the capacity of two disks instead of one like in RAID 5)
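The P+Q capacity overhead can be seen with a quick calculation, a sketch only:

```python
def usable_drives(raid_level: int, n_drives: int) -> int:
    # RAID 5 loses one drive's worth of capacity to parity (P);
    # RAID 6 loses two (P and Q).
    overhead = {5: 1, 6: 2}[raid_level]
    return n_drives - overhead

# An 8-drive group: RAID 5 keeps 7 drives' worth of capacity, RAID 6 keeps 6.
r5 = usable_drives(5, 8)
r6 = usable_drives(6, 8)
```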
Troubleshooting Multiple Drive Failures
• When addressing a multiple drive failure, there are several key pieces of information
that need to be determined prior to performing any state modifications.
• RAID Level
o Is it a RAID 6?
– RAID 6 volume group failures occur after 3 drives have failed in
the volume group
o Is it a RAID 3/5 or RAID 1?
– RAID 3/5 volume group failures occur after two drives have failed in
a volume group.
o RAID 1 volume group failures occur when enough drives fail to cause an
incomplete mirror.
– This could be as few as two drives or half the drives + 1.
o RAID 0 volume groups are dead upon the first drive failure
• Despite the drive failures, is each individual volume group configuration complete?
– i.e., are all drives accounted for, regardless of failed or optimal?
• How many drives have failed, and to which volume group does each drive belong?
• In what order did the drives fail in each individual volume group?
• Are there any backend errors that led to the initial drive failures?
o This is the most common cause of multiple drive failures; all backend
issues must be fixed or isolated before continuing any further
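The per-RAID-level failure thresholds above can be captured as a small helper. This is a sketch; RAID 1 is expressed as its best case (two drives), since the real tolerance depends on which mirror halves fail:

```python
def drives_to_fail_group(raid_level: int) -> int:
    """Minimum number of failed drives that can fail a volume group,
    per the rules described above."""
    if raid_level == 0:
        return 1            # no redundancy: the first failure kills the group
    if raid_level == 1:
        return 2            # worst case: both halves of one mirrored pair
    if raid_level in (3, 5):
        return 2            # group fails on the second drive failure
    if raid_level == 6:
        return 3            # survives two failures; the third kills it
    raise ValueError(f"unknown RAID level {raid_level}")
```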
• RAID 5 and RAID 3 Volume Groups
o After the second drive failure, the volume group and associated volumes
are marked as failed; no I/Os have been accepted since the second drive
failed
o Up until the second drive failure, data in the stripe is consistent across
the drives
• RAID 0
o As there is no redundancy, these arrays generally cannot be recovered.
However, the drives can be revived and checked – no guarantees can be
made that the data will be recovered.
• If any of the drives have an ‘offline’ status (06.xx), reviving drives could revert them
to an unassigned state
Multiple Drive Failures – How Many Drives?
• Assuming the volume group configuration is complete and all drives are accounted
for, you need to determine how many drives have failed
• Using the output of ionShow 12, determine whether or not these drives are in an
open state
o If the drives are in a closed state, they will be inaccessible, and attempts
to spin up, revive, or reconstruct them will likely fail
• Determining the failure order is just as important as determining the status of the
failed volume group’s configuration
• Oftentimes, failures occur close together and will show up either at the same
timestamp or within seconds of each other in the MEL
isp cfgPrepareDrive,0x<phydev>
Note: this is the only command that uses the “phydev” address, not the
devnum address
• This command will spin the drive up, but not place it back in service.
It will still be listed as failed by the controller.
However, since it is spun up, it will service direct disk reads of the DACstore region
necessary for the following commands.
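Once the drive-failure events are extracted from the MEL, establishing the failure order is a simple oldest-first sort. The sketch below assumes a made-up `{"time", "drive"}` record format; it is not an actual MEL parser:

```python
def failure_order(events):
    """Sort drive-failure events oldest-first. Python's sort is stable,
    so events sharing a timestamp keep their original order and can be
    flagged for manual inspection."""
    return sorted(events, key=lambda e: e["time"])

events = [
    {"time": 1200, "drive": "[1,4]"},
    {"time": 1150, "drive": "[1,2]"},
    {"time": 1200, "drive": "[1,7]"},  # same timestamp as [1,4]: inspect manually
]
mel_order = [e["drive"] for e in failure_order(events)]
```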
I’ve got my failure order, what’s next?
• Using the information on the previous slides, you should now have determined
the failure order of the drives.
• Special considerations need to be made depending on the RAID level of the failed
volume group
o For RAID 6 volume groups, the most important piece of information is the
first two drives that failed
o For RAID 5 volume groups, the most important piece of information is the
first drive that failed
o For RAID 1 volume groups, the most important piece of information is the
first drive that failed, causing the mirror to break.
• Before making any modifications to the failed drives, any unused global hot spares
should be failed to prevent them from sparing for drives unnecessarily.
Reviving Drives
• Begin with the last drive that failed and revive drives until the volume group
becomes degraded
o Check to see if the volume group is degraded; if not, move on to the next
drive (last → first) and revive it. Repeat this step until the volume group
is degraded
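The revive procedure is last-failed-first, stopping as soon as the group reaches degraded. Sketched below with hypothetical callables (`revive`, `group_is_degraded`) standing in for the real shell commands:

```python
def revive_until_degraded(failed_last_first, revive, group_is_degraded):
    """Revive drives from last-failed to first-failed until the volume
    group reports degraded; returns the drives actually revived."""
    revived = []
    for drive in failed_last_first:
        if group_is_degraded():
            break               # stop: do not revive more than necessary
        revive(drive)
        revived.append(drive)
    return revived

# Simulated check: the group becomes degraded after two revives.
_state = {"revived": []}
def _revive(d): _state["revived"].append(d)
def _degraded(): return len(_state["revived"]) >= 2

revived_order = revive_until_degraded(["d3", "d2", "d1"], _revive, _degraded)
```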
Cleanup
• If data checks out, reconstruct the remaining failed drives, replace drives as
warranted
• Once reconstructions have begun, the previously failed hot spares can be revived
Multiple Drive Failures – A Few Final Notes
• If there is any doubt about the failure order, the array configuration, or you are
simply not confident – find a senior team member to consult with prior to taking any
action.
– Beyond this you can ALWAYS escalate
• You are dealing with a customer’s data, be mindful of this at all times.
– Think about what you are doing, establish a plan based on high level
facts
• If there are multiple drive failures, there is a chance that a backend problem is at fault
– Get the failure order information, address the backend issue, spin up the
drives, and restore access.
Offline Volume Groups
• This behavior can cause situations where a volume group is left in an offline status
with all drives present, or with one drive listed as out of service
Offline Volume Groups (06.xx)
• In order to bring a volume group online through the controller shell with no
pieces out of service, or only one piece out of service:
– isp cfgMarkNonOptimalDriveGroupOnline,<SSID>
• Where ‘SSID’ is any volume in the group; this only needs to be run once
against any volume in the group
• Volume Groups that do not have all members (drives) present during start of day
will transition to their appropriate state
• Even though the group is listed as degraded or dead, it is possible that all
volumes will still be in an optimal state since no pieces are marked as out
of service
Clearing the Configuration
• In extreme situations it may be necessary to clear the configuration from the system
• This can be accomplished by either clearing the configuration information from the
appropriate region in DACstore or by completely wiping DACstore from the drives
and rebuilding it during start of day
– Advanced >> Recovery >> Clear Configuration >> Storage Array (07.xx)
– sysWipe
– sysWipeZero 1 (06.xx)
– dsmWipeAll (07.xx)
• Where <writeNewDacstore> is either 0, to not write a new DACstore until
start of day or until the drive is (re)inserted into a system, or 1, to write a new
clean DACstore once it has been cleared
• There are times when the Feature Enable Identifier key becomes
corrupt; in order to clear it and generate a new Feature Enable Identifier,
use the following command.
• For 07.xx systems, you must also remove the safe header from the
database
• dbmRemoveSubRecordType 18 (07.xx)
• Once this has been completed on both controllers, they will need to both
be rebooted in order to generate a new ID.
• All premium feature keys will need to be regenerated with the new ID
and reapplied.
• There are times that volumes are lost and need to be recovered, either due to a
configuration problem with the storage array, or the customer simply deleted the
wrong volume
• Multiple pieces of information must be known about the missing volume in order
to ensure data recovery
– Drives and piece order of the drives in the missing volume group
– Capacity of each volume in the volume group
– Disk offset where each volume starts
– Segment size of the volumes
– RAID level of the group
– Last known state of the drives
• This information can be obtained relatively easily from historical capture-all-
support-data files
• Finding Capacity, Offset, RAID level, and Segment size
• The last known state of the drives is a special case: where a drive was previously
failed in a volume prior to the deletion of the volume, it must be failed again after
the recreation of the volume in order to maintain consistent data/parity
• When specifying the capacity, specify it in bytes for a better chance of data
recovery; if entered in gigabytes, there could be some rounding discrepancies in the
outcome
• A lost volume can be created using this method as many times as necessary until the
data is recovered, as long as no writes take place to the volume while it is
recreated improperly
• NEVER use this method to create a brand new volume that contains no data. Doing
so will cause data corruption upon degradation, since the volume was never
initialized during creation.
• If creating volumes using the GUI, instead of the ‘recover volume’ CLI command,
steps must first be taken in the controller shell in order to prevent initialization
• There is a flag in the controller shell that defines whether or not to initialize the data
region of the drives upon new volume creation
– writeZerosFlag
Recovering Lost Volumes – Setup
-> writeZerosFlag
value = 0 = 0x0
-> writeZerosFlag=1
-> writeZerosFlag
value = 1 = 0x1
-> VKI_EDIT_OPTIONS
1) writeZerosFlag=1
->
Recovering Lost Volumes
• A lost volume can be created using this method as many times as necessary until the
data is recovered, as long as no writes take place to the volume while it is
recreated improperly
• NEVER use this method to create a brand new volume that contains no data. Doing
so will cause data corruption upon degradation, since the volume was never
initialized during creation
• Always verify, once the volume has been recreated, that the system has been
cleaned up from all changes made during the volume recreation process
-> writeZerosFlag
value = 1 = 0x1
-> writeZerosFlag=0
-> writeZerosFlag
value = 0 = 0x0
-> VKI_EDIT_OPTIONS
1) writeZerosFlag=1
->
Recovering Lost Volumes – IMPORTANT
Knowledge Check
1) 06.xx – List the process required to determine the drive failure order for a
volume group.
2) 07.xx – List the process required to determine the drive failure order for a
volume group.
True False
True False
Module 4: Fibre Channel Overview and Analysis
Upon completion, you should be able to do the following:
• Describe how Fibre Channel topology works
• Determine how Fibre Channel topology relates to the different protocols that LSI
uses in its storage array products
• Analyze backend errors for problem determination and isolation
Fibre Channel
• Arbitration is required for one port (the ‘initiator’) to communicate with another (the
‘target’)
Fibre Channel Arbitrated Loop (FC-AL) – The LIP
• Prior to beginning I/O operations on any drive channel, a Loop Initialization (LIP)
must occur.
• A 128-bit (four-word) map is passed around the loop by the loop master (the
controller)
– Each offset in the map corresponds to an ALPA and has a state of either
0 for unclaimed or 1 for claimed
• The LIP process is the same regardless of the drive trays attached (JBOD & SBOD)
Fibre Channel Arbitrated Loop (FC-AL) – The LIP
– If a device was not previously addressed it will pass the frame on to the
next device in the loop
– The loop map is once again passed from device to device in the loop
– Each device will check its hard address against the loop map
– If the offset of the loop map that corresponds to the device’s hard
address is available (set to 0) it will set that bit to 1, assuming the
corresponding ALPA, and pass the loop map on to the next device
– If the hard address is not available it will pass the loop map on and await
the LISA stage of initialization
– Devices that assumed an ALPA in the LIPA phase will simply pass the
map on to the next device
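The hard-address claiming pass above can be sketched as a walk over a bitmap. This is a simplification of the real four-word map and its LIPA/LIHA/LISA phases:

```python
def claim_pass(loop_map, devices):
    """One pass of the loop map: each device claims its hard address if
    the corresponding offset is still 0; returns the devices left
    unclaimed (they must wait for the LISA soft-address phase)."""
    unresolved = []
    for dev, hard_addr in devices:
        if loop_map[hard_addr] == 0:
            loop_map[hard_addr] = 1   # claim the ALPA
        else:
            unresolved.append(dev)    # address already taken: wait for LISA
    return unresolved

loop_map = [0] * 128
# Two drives with a duplicate hard address (e.g. a tray-ID 'ones'-digit clash)
waiting = claim_pass(loop_map, [("drv_a", 0x10), ("drv_b", 0x10)])
```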
Fibre Channel Arbitrated Loop (FC-AL) – The LIP
• How are hard addresses determined?
– Hard Addresses are determined by the ‘ones’ digit of the drive tray ID
and the slot position of the device in the drive tray
– Controllers are set via hardware to always assume the same hard IDs to
ensure that they assume the lower two ALPA addresses in the loop map
(0x01 for “A” and 0x02 for “B”)
– I/Os that were in progress when the LIP occurred can be recovered
quickly without the need for lengthy timeouts and retries
– Devices that had not assumed an ALPA on the loop map in the LIPA and
LIHA phase of initialization will now take the first available ALPA in the
loop map
– When the LISA frame is received again by the loop master, it will check
the frame header for a specific value that indicates that LISA has
completed
Fibre Channel Arbitrated Loop (FC-AL) – The LIP
• Once LISA has completed, the loop master will distribute the loop map again, and
each device will enter its hex ALPA in the order that it is received
• The loop master will distribute the completed loop map to all devices to inform them
of their position in the loop relative to the loop master
• The loop master ends the LIP by transmitting a CLS (Close) frame to all devices on
the loop, placing them in monitor mode
• Problems can arise from the ‘ones’ digit of the tray ID not being unique among the
drive trays on a given loop
– A bad device on the loop will corrupt the ALPA map, resulting in devices
not assuming the correct address or not participating in the loop
• The net of these conditions is that LIPs become a disruptive process that can have
adverse effects on the operation of the loop
Fibre Channel Arbitrated Loop (FC-AL) – Communication
• Each port has what is referred to as a Loop Port State Machine (LPSM) that is used
to define the behavior when it requires access or use of the loop
• While the loop is idle, the LPSM will be in MONITOR mode and transmitting IDLE
frames
• In order for one device to communicate with another, arbitration must be performed
– An ARB frame will be passed along the loop from the initiating device to
the target device
– If the ARB frame is received back and still contains the ALPA of the
initiating device, the port will transition from MONITOR to ARB_WON
– An OPN (Open) frame will be sent to the device that it wishes to open
communication with
– CLS (Close) is sent and the device ports return to the MONITOR state
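The MONITOR → ARB_WON → open → close sequence can be sketched as a tiny state machine. The state and event names below are simplified from the text, not taken from the FC-AL standard:

```python
# Simplified LPSM transitions for the arbitration sequence described above.
TRANSITIONS = {
    ("MONITOR", "arb_returned_own_alpa"): "ARB_WON",
    ("ARB_WON", "sent_opn"): "OPEN",
    ("OPEN", "sent_cls"): "MONITOR",
}

def step(state, event):
    # Unknown (state, event) pairs leave the port in its current state.
    return TRANSITIONS.get((state, event), state)

s = "MONITOR"
for ev in ("arb_returned_own_alpa", "sent_opn", "sent_cls"):
    s = step(s, ev)
# A full arbitrate/open/close cycle returns the port to MONITOR.
```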
Knowledge Check
1) The Fibre Channel protocol does not have very much overhead for login and
communication.
True False
True False
Drive Side Architecture Overview
SCSI Architecture Model Terminology
• nexus: A relationship between two SCSI devices, and the SCSI initiator port and
SCSI target port objects within those SCSI devices.
• I_T nexus: A nexus between a SCSI initiator port and a SCSI target port.
• logical unit: A SCSI target device object, containing a device server and task
manager, that implements a device model and manages tasks to process commands
sent by an application client.
Role column
FCdr – Fibre Channel drive
SATAdr – SATA drive
SASdr – SAS drive
ORP columns indicate the overall state of the lu for disk device types (normally should
be “+++”).
+) alternate itn is up
d) alternate itn is degraded
-) alternate itn is down
x) there is no alternate itn
-) chosen itn is not preferred
(blank) no itn preferences
The Channels column indicates the state of the itn on that channel which is for its lu.
*) up & chosen
+) up & not chosen
D) degraded & chosen
d) degraded & not chosen
-) down
x) not present
Fibre Channel Overview and Analysis
• In order to reset the backend statistics that are displayed by the previous
commands
o iopPerfMonRestart
Knowledge Check
Destination Driver Events
Destination Driver Events (Error Codes)
02-0b/00/00 IO timeout
ff-00/01/00 ITN fail timeout (ITN has been disconnected for too long)
ff-00/02/00 device fail timeout (all ITNs to the device have been disconnected for too long)
ff-00/03/00 cmd breakup error
Destination Driver Events (Error Codes)
Read Link Status (RLS) and Switch-on-a-Chip (SOC)
• Each port on each device maintains a Link Error Status Block (LESB) which tracks
the following errors
• Read Link Status (RLS) is a link service that collects the LESB from each device
• Transmission Words
– Two types:
• Data Word
– Dxx.y, Dxx.y, Dxx.y, Dxx.y
RLS Diagnostics
• Analyze RLS Counts:
– Identify the first device (in Loop Map Order) that detects high number of
Link Errors
• Link Error Severity Order: LF > LOS > ITW
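Ranking devices by the severity order LF > LOS > ITW can be expressed as a sort key. The device names and counts below are made up for illustration; real analysis also weighs loop-map order, as noted above:

```python
def rank_by_rls(devices):
    """Sort devices so the likeliest culprit comes first: highest Link
    Failure (LF) count, then Loss of Signal (LOS), then Invalid
    Transmission Word (ITW)."""
    return sorted(devices,
                  key=lambda d: (d["LF"], d["LOS"], d["ITW"]),
                  reverse=True)

devices = [
    {"dev": "[0,3]", "LF": 0, "LOS": 12, "ITW": 400},
    {"dev": "[0,9]", "LF": 7, "LOS": 3,  "ITW": 90},
]
suspects = rank_by_rls(devices)
# [0,9] ranks first: LF outranks LOS and ITW in severity.
```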
RLS Diagnostics Example
Example:
• Drive [0,9] has high error counts in ITW, LF, and LOS
Important Notes:
• Logs need to be interpreted, not merely read
• The data is representative of errors seen by the devices on the loop
• There is no standard for error counting
• Different devices may count errors at different rates
• RLS counts are still valid in SOC environments
• They are not, however, valid for SATA trays
What is SOC or SBOD?
• Switch-On-a-Chip ( SOC )
• Switched Bunch Of Disks (SBOD)
Features:
• Crossbar switch (Loop-Switch)
• Supported in FC-AL topologies
• Per device monitoring
SOC Components
• Controllers
– 6091 Controller
– 399x Controller
• Drive Trays
– 2Gb SBOD ESM (2610)
– 4Gb ESM (4600 – Wrigley)
SBOD vs JBOD
What is the SES?
SOC Statistics
• In order to clear the drive side SOC statistics
clearSocErrorStatistics_MT
socShow 1
Determining SFP Ports
• 4Gb SBOD drive enclosure ports start from the center and go to the outside
(Wrigley-Husker)
– Other misc.
• Bypassed, Byp_LIPF8, Byp_TmOut, Byp_RxLOS, Byp_Sync,
Byp_LIPIso, Byp_LTBI, Byp_Manu, Byp_Redn, Byp_Snoop,
Byp_CRC, Byp_OS
Port Insertion Count (PIC)
• Port insertion count – The number of times the device has been inserted into this
port.
• The value is incremented each time a port successfully transitions from the
bypassed state to inserted state.
• Range: 0–255 (2^8 values)
• Possible States:
– Note: This implies that a loop can go down and up multiple instances in
one SOC polling cycle and only be detected once.
– Range: 0–255 (2^8 values)
Relative Frequency Drive Error Avg.
(relFrq count / RFDEA)
• SBODs are connected to multiple devices.
• Over time, clocks tend to drift. SBODs employ a clock check feature, comparing
the relative frequency of all attached devices to the clock connected to the
SBOD.
• If one transmitter is transmitting at the slow end of the tolerance range and its
partner at the fast end, then the two clocks are in specification but will have
extreme difficulty communicating
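The slow-end/fast-end case can be made concrete with a quick ppm calculation. The ±100 ppm tolerance here is an assumed figure chosen for illustration, not a quoted specification:

```python
def freq_gap_ppm(tx_ppm_offset: int, rx_ppm_offset: int) -> int:
    """Relative frequency gap between two clocks, in parts per million."""
    return abs(tx_ppm_offset - rx_ppm_offset)

# Both clocks within an assumed +/-100 ppm tolerance, at opposite extremes:
gap = freq_gap_ppm(-100, +100)   # each clock is "in spec", yet 200 ppm apart
```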
• Loop Cycle allows for an understanding that an attempt is being made to bring
up the loop.
– Does not mean the loop has come up
• Possible States:
– Same as Loop States (LS)
• Up, Down, Transition states as loop is coming up
Ordered Set Error Count (OSErr / OSEC)
• Number of Ordered Sets that are received with an encoding error.
Other values
• Sample Time:
Analysis of RLS/ SOC
• RLS is an error reporting mechanism that reports errors as seen by the devices
on the array.
• Do not always expect the first capture of the RLS/SOC data to pinpoint the
problematic device.
Analysis of SOC
• Errors are generally not propagated through the loop in a SOC Environment.
– What is recorded is the communication statistics between two devices.
• The component connected to the port with the highest errors in the
aforementioned stats is the most likely candidate for a bad component
Known Limitations
• Non-optimal configurations
– i.e. improper cabling
Field Case
• readLinkStatus.csv
• RLS stats show drive tray 1 & 2 are on channel 1 & 3 (All counts zero)
• The drive tray can continue to operate after it is up without the SES.
Drive Channel State Management
This feature provides a mechanism for identifying drive-side channels where device
paths (IT nexus) are experiencing channel related I/O problems.
• There are two states for a drive channel – OPTIMAL and DEGRADED
• Error types tracked against a drive channel include:
– Timeout errors
– Controller detected errors: misrouted FC frames and bad ALPA errors, for
example
– Drive detected errors: SCSI parity errors, for example
– Link Down errors
• When a drive channel is marked degraded a critical event will be logged to the
MEL and a needs attention condition set in Storage Manager
• A drive channel that is marked degraded will be persisted through a reboot, as
the surviving controller will direct the rebooting controller to mark the path
degraded
– If there is no alternate controller, the drive path will be marked OPTIMAL
again
• The drive channel will not automatically transition back to an OPTIMAL state
(with the exception of the above situation) unless directed by the user via the
Storage Manager software
SAS Backend
SAS Backend Overview and Analysis
• Statistics collected from PHYs
– A SAS Wide port consists of multiple PHYs, each with independent error
counters
SAS Error counts
• IDWC – Invalid Dword Count
– A dword that is not a data word or a primitive (i.e., in the character
context, a dword that contains an invalid character, a control character in
other than the first character position, a control character other than
K28.3 or K28.5 in the first character position, or one or more characters
with a running disparity error). This could mark the beginning of a loss of
Dword synchronization. After the fourth non-nullified (if followed by a
valid Dword) Invalid Dword, Dword synchronization is lost.
• Not available through the GUI; only through the CLI or the support bundle.
– sasClearPhyErrStats
– clearSasErrorStatistics_MT
SAS Backend Overview and Analysis
• If a PHY has a high error count, look at the device that the PHY is directly
attached to
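That guideline can be applied mechanically when comparing two samples of the PHY statistics (for example, from two support bundles). The data layout and function name below are hypothetical, purely to illustrate the delta-and-threshold idea:

```python
# Flag PHYs whose error counters grew between two samples of the statistics
# (e.g. from two support bundles). The data layout here is hypothetical;
# real counters come from the CLI or the support bundle.

def suspect_phys(before, after, threshold=10):
    """Return PHY ids whose Invalid Dword Count (IDWC) grew by more than
    `threshold` between samples; each such PHY points at the device it is
    directly attached to."""
    suspects = []
    for phy_id, counts in after.items():
        delta = counts["IDWC"] - before.get(phy_id, {}).get("IDWC", 0)
        if delta > threshold:
            suspects.append(phy_id)
    return suspects
```

For example, `suspect_phys({0: {"IDWC": 5}}, {0: {"IDWC": 500}, 1: {"IDWC": 3}})` flags only PHY 0, whose directly attached device is then the first thing to examine.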
Appendix A: SANtricity Managed Storage Systems
Key features
• Disk performance
• SANtricity robustness
• Native IB interfaces
• Dedicated data cache
• Switched-loop backend
• 4 Gb/s interfaces
• FC | SATA intermixing
6998 / 6994 / 6091 (Front)
Attribute: 3994 | 3992
Key features
• Performance value
• SANtricity robustness
• FC | SATA intermixing
• 4 Gb/s interfaces
• Switched-loop backend
3992 (Back)
3994 (Back)
Appendix B: Simplicity Managed Storage Systems
Affordable and reliable storage designed for SMB, departmental, and remote-site
customers
Intuitive, task-oriented management software designed for sites with limited IT
resources that need to be self-sufficient
FC and SAS connectivity with support for SAS/SATA drives (SATA drive support mid-2007)
Key features
• Shared DAS
• High availability/reliability
• SAS host interfaces
• Robust, intuitive Simplicity software
• Snapshot / Volume Copy
Drives: 42 SAS
1333
Attribute: 1532
Drives: 42 SAS
1532
Attribute: 1932
Key features
• High availability/reliability
• Robust, intuitive Simplicity software
• 4 Gb/s host interfaces
• Snapshot / Volume Copy
Drives: 42 SAS
1932
SAS Drive Tray (Front)
Appendix C – State, Status, Flags (06.xx)
0 Optimal
1 Non-existent drive
2 Unassigned, w/DACstore
3 Failed
4 Replaced
5 Removed – optimal pg2A = 0
6 Removed – replaced pg2A = 4
7 Removed – Failed pg2A = 3
8 Unassigned, no DACstore
Drive State (d_flags)
0x00000100 Drive is locked for diagnostics
0x00000200 Drive contains config. sundry
0x00000400 Drive is marked deleted by Raid Mgr.
0x00000800 Defined drive without drive
0x00001000 Drive is spinning or accessible
0x00002000 Drive contains a format or accessible
0x00004000 Drive is designated as HOT SPARE
0x00008000 Drive has been removed
0x00010000 Drive has an ADP93 DACstore
0x00020000 DACstore update failed
0x00040000 Sub-volume consistency checked during SOD
0x00080000 Drive is part of a foreign rank (cold added).
0x00100000 Change vdunit number
0x00200000 Expanded DACstore parameters
0x00400000 Reconfiguration performed in reverse VOLUME order
0x00800000 Copy operation is active (not queued).
Volume State, Status, Flags
From pp 17 – 18, Troubleshooting and Technical Reference Guide – Volume 1
VOLUME Flags (vd_flags)
These flags are bit values, and the following flags are valid:
0x00000001 Configured
0x00000002 Open
0x00000004 On-Line
0x00000008 Not Suspended
0x00000010 Resources available
0x00000020 Degraded
0x00000040 Spare piece - VOLUME has Global Hot Spare drive in use
0x00000080 RAID 1 ping-pong state
0x00000100 RAID 5 left asymmetric mapping
0x00000200 Write-back caching enabled
0x00000400 Read caching enabled
0x00000800 Suspension in progress while switching Global Hot Spare drive
0x00001000 Quiescence has been aborted or stopped
0x00010000 Prefetch enabled
0x00020000 Prefetch multiplier enabled
0x00040000 IAF not yet started, don't restart yet
0x00100000 Data scrubbing is enabled on this unit
0x00200000 Parity check is enabled on this unit
0x00400000 Reconstruction read failed
0x01000000 Reconstruction in progress
0x02000000 Data initialization in progress
0x04000000 Reconfiguration in progress
0x08000000 Global Hot Spare copy-back in progress
0x90000000 VOLUME halted; awaiting graceful termination of any reconstruction,
verify, or copy-back
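Since these are bit values, a flags word can be decoded by masking each bit in turn. A minimal sketch (only a subset of the table above is included, and the function name is invented for this example):

```python
# Decode a vd_flags word into the flag names listed above.
# Only a subset of the table is included here for brevity.

VD_FLAG_NAMES = {
    0x00000001: "Configured",
    0x00000002: "Open",
    0x00000004: "On-Line",
    0x00000020: "Degraded",
    0x00000200: "Write-back caching enabled",
    0x00000400: "Read caching enabled",
    0x01000000: "Reconstruction in progress",
}

def decode_vd_flags(value):
    """Return the names of every flag bit set in `value`, in bit order."""
    return [name for bit, name in sorted(VD_FLAG_NAMES.items())
            if value & bit == bit]
```

For example, a captured value of 0x627 decodes to a configured, open, on-line, degraded volume with both caching flags set.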
From p 27, Troubleshooting and Technical Reference Guide – Volume 1
Appendix D – Chapter 2 - MEL Data Format
Major Event Log Specification 349-1053040 (Software Release 6.16)
LSI Logic Confidential
Chapter 2: MEL Data Format
The event viewer formats and displays the most meaningful fields of major event log entries from
the controller. The data displayed for individual events varies with the event type and is described
in the Events Description section. The raw data contains the entire major event data structure
retrieved from the controller subsystem. The event viewer displays the raw data as a character
string. Fields that occupy multiple bytes may appear to be byte swapped depending on the host
system. Fields that may appear as byte swapped are noted in the table below.
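The byte-swap effect can be seen by interpreting the same raw bytes with both byte orders, e.g. with Python's struct module:

```python
import struct

# The same four raw bytes read as big-endian vs little-endian: this is why a
# multi-byte MEL field can "appear byte swapped" depending on the host system
# that formats the raw data.

raw = bytes([0x12, 0x34, 0x56, 0x78])

big,    = struct.unpack(">I", raw)   # big-endian:    0x12345678
little, = struct.unpack("<I", raw)   # little-endian: 0x78563412
```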
Table 2-1: MEL Data Fields
Note: If the log entry field does not have a version number, the format will be as shown below.
Table 2-2: Constant Data Field format, No Version Number
If the log entry field contains version 1, the format will be as shown below.
Table 2-3: Constant Data Field Format, Version 1
The Event Number is a 4-byte encoded value that includes bits for drive and controller inclusion,
event priority, and the event value. The Event Number field is encoded as follows:
2.2.4.1. Event Number - Internal Flags Field Details
The Internal Flags are used internally within the controller firmware for events that require unique
handling. The host application ignores these values.
The Log Group field indicates what kind of event is being logged. All events are logged in the
system log. The values for the Log Group Field are described as follows:
2.2.4.4. Event Number - Event Group Field Details
2.2.5. Timestamp (Bytes 20 - 23) Field Details
The Timestamp field is a 4-byte value that corresponds to the real-time clock on the controller.
The real-time clock is set (via the boot menu) at the time of manufacture. It is incremented every
second, counting from January 1, 1970.
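Because the field is a plain count of seconds since January 1, 1970, it can be converted to a calendar date with any epoch-based time library, e.g.:

```python
from datetime import datetime, timezone

# A MEL timestamp is a 4-byte count of seconds since January 1, 1970,
# taken from the controller real-time clock.

def mel_timestamp_to_utc(seconds):
    """Convert a raw MEL timestamp to a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

Note the clock reflects when the controller's real-time clock was set, so the decoded date is only as accurate as that clock.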
The IOP ID is used by MEL to associate multiple log entries with a single event or I/O. The IOP ID
is guaranteed to be unique for each I/O. A valid IOP ID may not be available for certain MEL
entries and some events use this field to log other information. The event descriptions will
indicate if the IOP ID is being used for unique log information.
Logging of data for this field is optional and is zero when not specified.
A valid I/O Origin may not be available for certain MEL entries and some events use this field to
log other information. The event descriptions will indicate if the I/O Origin is being used for unique
log information. Logging of data for this field is optional and is zero when not specified. When
decoding MEL events, additional FRU information can be found in the Software Interface
Specification.
2.2.10. Controller Number (Bytes 40-43) Field Details
The Controller Number field specifies the controller associated with the event being logged.
Logging of data for this field is optional and is zero when not specified.
This field identifies the category of the log entry. It is identical to the Event Group field
encoded in the Event Number.
This field identifies the component type associated with the log entry. It is identical to the
Component Group list encoded in the Event Number.
Table 2-12: Component Type Field Details
Table 2-13: Component Type Location Values
This field contains a value of 1 if the component location field contains valid data. If the
component location data is not valid or cannot be determined, the value is 0.
2.2.16. Optional Field Data Field Details
The length, in bytes, of the optional field data (including the Data Field Type).
Appendix E – Chapter 30 – Data Field Types
Major Event Log Specification 349-1053040 (Software Release 6.16)
LSI Logic Confidential
Chapter 30: Data Field Types
This table describes data field types.
Table 30-1: Data Field Types
Appendix F – Chapter 31 – RPC Function Numbers
Table 31-1: SYMbol RPC Functions
Appendix G – Chapter 32 – SYMbol Return Codes
Major Event Log Specification 349-1053040 (Software Release 6.16)
LSI Logic Confidential
Return Codes
Appendix H – Chapter 5 - Host Sense Data
The first byte of all sense data contains the response code field, which indicates the error type
and the format of the sense data:
If the response code is 0x70 or 0x71, the sense data format is Fixed. See “5.1.1. Request Sense
Data - Fixed Format” on page 5-189. If the response code is 0x72 or 0x73, the sense data format
is Descriptor. See “5.1.2. Request Sense Data - Descriptor Format” on page 5-205.
For more information on sense data response codes, see SPC-3, SCSI Primary Commands.
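That rule reduces to a simple classifier. A sketch (the function name is invented here; per SPC-3, the response code occupies the low 7 bits of byte 0, so the VALID bit in fixed-format sense data is masked off):

```python
# Classify the sense data format from the response code in the first byte:
# 0x70/0x71 -> Fixed format, 0x72/0x73 -> Descriptor format.

def sense_format(first_byte):
    """Return 'fixed', 'descriptor', or 'unknown' for sense data byte 0."""
    code = first_byte & 0x7F          # response code is the low 7 bits
    if code in (0x70, 0x71):
        return "fixed"
    if code in (0x72, 0x73):
        return "descriptor"
    return "unknown"
```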
5.1.1.1. Incorrect Length Indicator (ILI) - Byte 2
This bit is used to inform the host system that the requested non-zero byte transfer length for a
Read or Write Long command does not exactly match the available data length. The information
field in the sense data will be set to the difference (residue) of the requested length minus the
actual length in bytes. Negative values will be indicated by two's complement notation. Since the
controller does not support Read or Write Long, this bit is always zero.
5.1.1.3. Information Bytes - Bytes 3-6
This field is implemented as defined in the SCSI standard for direct access devices. The
information could be any one of the following types of information:
• The unsigned logical block address indicating the location of the error being reported.
• The first invalid logical block address if the sense key indicates an illegal request.
The value in this field will be 152 (0x98) in most cases. However, there are situations when only
the standard sense data will be returned. For these sense blocks, the additional sense length is
10 (0x0A).
The command-specific field will always be zero-filled for sense data returned for commands other
than Reassign Blocks.
The FRU field can be used to determine where the error occurred. As an example, the Additional
Sense Code for a SCSI bus parity error is returned for a parity error detected on either the host
bus or one of the drive buses. In this case, the FRU field must be evaluated to determine whether
the error occurred on the host channel or a drive channel.
Because of the large number of replaceable units possible in an array, a single byte is not
sufficient to report a unique identifier for each individual field replaceable unit. To provide
meaningful information that will decrease field troubleshooting and problem resolution time, FRUs
have been grouped. The defined FRU groups are listed below.
A FRU group consisting of the host SCSI bus, its SCSI interface chip, and all initiators and other
targets connected to the bus.
A FRU group consisting of the SCSI interface chips on the controller which connect to the drive
buses.
A FRU group consisting of the controller logic used to implement the on-board data buffer.
A FRU group consisting of the ASICs on the controller associated with the array functions.
5.1.1.7.5. Controller Other Group (0x05)
A FRU group consisting of all controller related hardware not associated with another group.
A FRU group consisting of subsystem components that are monitored by the array controller,
such as power supplies, fans, thermal sensors, and AC power monitors. Additional information
about the specific failure within this FRU group can be obtained from the additional FRU bytes
field of the array sense.
A FRU group consisting of subsystem components that are configurable by the user, on which
the array controller will display information (such as faults).
A FRU group consisting of the attached enclosure devices. This group includes the power
supplies, environmental monitor, and other subsystem components in the sub-enclosure.
A FRU group consisting of a drive (embedded controller, drive electronics, and Head Disk
Assembly), its power supply, and the SCSI cable that connects it to the controller; or supporting
sub-enclosure environmental electronics.
For SCSI drive-side arrays, the FRU code designates the channel ID in the most significant nibble
and the SCSI ID of the drive in the least significant nibble. For Fibre Channel drive-side arrays,
the FRU code contains an internal representation of the drive’s channel and id. This
representation may change and does not reflect the physical location of the drive. The sense data
additional FRU fields will contain the physical drive tray and slot numbers.
NOTE: Channel ID 0 is not used because a failure of drive ID 0 on this channel would produce an
FRU code of 0x00, which the SCSI-2 standard defines as meaning that no specific unit has been
identified as failed or that the data is not available.
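For the SCSI drive-side case, the nibble packing described above can be decoded directly (the function name is invented for this example):

```python
# For SCSI drive-side arrays, the FRU code packs the channel ID in the high
# nibble and the drive's SCSI ID in the low nibble. Channel ID 0 is never
# used, so a valid drive FRU code is never 0x00.

def decode_scsi_drive_fru(fru_code):
    """Split a drive-group FRU code (0x10-0xFF) into (channel, scsi_id)."""
    channel = (fru_code >> 4) & 0x0F
    scsi_id = fru_code & 0x0F
    return channel, scsi_id
```

Remember this applies only to SCSI drive-side arrays; for Fibre Channel drive-side arrays the FRU code is an internal representation, and the physical tray and slot come from the additional FRU fields instead.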
This field is valid for a sense key of Illegal Request when the sense-key specific valid (SKSV) bit
is on. The sense-key specific field will contain the data defined below. In this release of the
software, the field pointer is only supported if the error is in the CDB.
• C/D = 1 indicates the illegal parameter is in the CDB.
• C/D = 0 indicates that the illegal parameter is in the parameters sent during a Data Out phase.
• BPV = 0 indicates that the value in the Bit Pointer field is not valid.
• BPV = 1 indicates that the Bit Pointer field specifies which bit of the byte designated by the Field
Pointer field is in error. When a multiple-bit error exists, the Bit Pointer field will point to the most
significant (left-most) bit of the field.
The Field Pointer field indicates which byte of the CDB or the parameter was in error. Bytes are
numbered from zero. When a multiple-byte field is in error, the pointer will point to the most-
significant byte.
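Assuming the standard SPC fixed-format layout, where the sense-key specific information occupies bytes 15-17 of the sense data, the C/D, BPV, Bit Pointer, and Field Pointer values described above can be extracted as follows (a sketch; the function name is invented):

```python
# Decode the sense-key-specific field pointer information, assuming the
# standard SPC fixed-format layout: byte 15 holds SKSV (bit 7), C/D (bit 6),
# BPV (bit 3), and the bit pointer (bits 2-0); bytes 16-17 hold the field
# pointer, MSB first.

def decode_field_pointer(sense):
    """Return a dict of field-pointer info, or None if SKSV is not set."""
    b = sense[15]
    if not (b & 0x80):                 # SKSV must be set for valid data
        return None
    return {
        "cdb_error": bool(b & 0x40),   # C/D: 1 = error is in the CDB
        "bpv": bool(b & 0x08),         # Bit Pointer field is valid
        "bit_pointer": b & 0x07,       # bit within the byte in error
        "field_pointer": (sense[16] << 8) | sense[17],  # byte in error
    }
```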
This is a bit-significant field that indicates the recovery actions performed by the array controller.
The total retry count is for all errors seen during execution of a single CDB set.
These fields store information when multiple errors are encountered during execution of a
command. The ASC/ASCQ pairs are presented in order of most recent to least recent error
detected.
5.1.1.13.1. FRU Group Qualifiers for the Host Channel Group (Code 0x01)
FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates which
host channel is reporting the failed component. The least significant byte provides the device type
and state of the device being reported
5.1.1.13.2. Mini-hub Port
Mini-Hub Port indicates which of the Mini-Hub ports is being referenced. For errors where the
Mini-Hub port is irrelevant, port 0 is specified.
Controller Number indicates which controller the host interface is connected to.
The least significant byte provides the device type and state of the device being reported.
Host Channel Number indicates which channel of the specified controller is involved. Values 1
through 4 are valid.
5.1.1.13.4.2. Host Channel Device Type Identifier
The Host Channel Device Type Identifier is defined as:
5.1.1.13.5. FRU Group Qualifiers For Controller Drive Interface Group (Code 0x02)
FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates which
drive channel is reporting the failed component. The least significant byte provides the device
type and state of the device being reported.
The Mini-Hub Port indicates which of the Mini-Hub ports is being referenced. For errors where the
Mini-Hub port is irrelevant, port 0 is specified.
5.1.1.13.5.3. Drive Channel Number
Drive Channel Number indicates which drive channel is involved. Values 1 through 6 are valid.
5.1.1.13.6. FRU Group Qualifiers For The Subsystem Group (Code 0x06)
FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates which
primary component fault line is reporting the failed component. The information returned depends
on the configuration set up by the user. For more information, see OLBS 349-1059780, External
NVSRAM Specification for Software Release 7.10. The least significant byte provides the device
type and state of the device being reported. The format for the least significant byte is the same
as Byte 27 of the FRU Group Qualifier for the Sub-Enclosure Group (0x08).
5.1.1.13.7. FRU Group Qualifiers For The Sub-Enclosure Group (Code 0x08)
FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates which
enclosure identifier is reporting the failed component. The least significant byte provides the
device type and state of the device being reported.
Statuses are reported such that the first enclosure for each channel is reported, followed by the
second enclosure for each channel.
The Sub-Enclosure Device Type Identifier is defined as
5.1.1.13.8. FRU Group Qualifiers For The Redundant Controller Group (Code 0x09)
FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates which tray
contains the failed controller. The least significant byte indicates the failed controller within the
tray.
5.1.1.13.9. FRU Group Qualifiers For The Drive Group (Code 0x10 – 0xFF)
FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates the tray
number of the affected drive. The least significant byte indicates the drive’s physical slot within
the drive tray indicated in byte 26.
5.1.1.13.9.2. Drive Group LSB Format:
This field provides information read from the array controller VLSI chips and other sources. It is
intended primarily for development testing, and the contents are not specified.
The error detection point field will indicate where in the software the error was detected. It is
intended primarily for development testing, and the contents are not specified.
This field contains the original Command Descriptor Block received from the host.
This bit position field provides information about the host. Definitions are given below.
5.1.1.19. Controller Serial Number - Bytes 54-69
This sixteen-byte field contains the manufacturing identification of the array hardware. Bytes of
this field are identical to the information returned by the Unit Serial Number page in the Inquiry
Vital Product Data.
The Array Application Software Revision Level matches that returned by an Inquiry command.
The LUN number field is the logical unit number in the Identify message received from the host
after selection.
This field indicates the status of the LUN. Its contents are defined in the logical array page
description in the Mode Parameters section of this specification, except for the value 0xFF,
which is unique to this field.
A value of 0xFF returned in this byte indicates the LUN is undefined or is currently unavailable
(reported at Start of Day before the LUN state is known).
The host ID is the SCSI ID of the host that selected the array controller for execution of this
command.
This field contains the software revision level of the drive involved in the error if the error was a
drive error and the controller was able to retrieve the information.
This field identifies the Product ID of the drive involved in the error if the error was a drive error
and the controller was able to determine this information. This information is obtained from the
drive Inquiry command.
This byte indicates the configured RAID level for the logical unit returning the sense data. The
values that can be returned are 0, 1, 3, 5, or 255. A value of 255 indicates that the LUN RAID
level is undefined.
These bytes identify the source of the sense block returned in the next field. Byte 102 identifies
the channel and ID of the drive. Refer to the FRU group codes for physical drive ID assignments.
Byte 103 is reserved for identification of a drive logical unit in future implementations and it is
always set to zero in this release.
For drive detected errors, these fields contain the data returned by the drive in response to the
Request Sense command from the array controller. If multiple drive errors occur during the
transfer, the sense data from the last error will be returned.
This field contains the controller’s internal sequence number for the I/O request.
MMDDYY/HHMMSS
Appendix I – Chapter 11 – Sense Codes
Chapter 11: Sense Codes
11.2. Additional Sense Codes and Qualifiers
This section lists the Additional Sense Code (ASC) and Additional Sense Code Qualifier
(ASCQ) values returned by the array controller in the sense data. SCSI-2-defined codes are used
when possible. Array-specific error codes are used when necessary and are assigned SCSI-2
vendor-unique codes in the range 0x80-0xFF. More detailed sense key information may be
obtained from the array controller command descriptions or the SCSI-2 standard.
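The split between SCSI-2-defined and vendor-unique codes follows directly from the ASC value. A minimal sketch (function names are illustrative):

```python
def is_vendor_unique_asc(asc: int) -> bool:
    """SCSI-2 reserves ASC values 0x80-0xFF for vendor-unique codes."""
    return 0x80 <= asc <= 0xFF

def classify_asc_ascq(asc: int, ascq: int) -> str:
    """Label an ASC/ASCQ pair as SCSI-2 defined or vendor unique."""
    kind = "vendor unique" if is_vendor_unique_asc(asc) else "SCSI-2 defined"
    return f"ASC 0x{asc:02X} / ASCQ 0x{ascq:02X} ({kind})"
```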
Codes defined by SCSI-2 and the array vendor-specific codes are shown below. The most
probable sense keys returned for each error are also listed in the table for reference. A sense
key enclosed in parentheses in the table indicates that the sense key is determined by the
value in byte 0x0A. See Section .