Agenda
Increasingly Complex Architectures
  Specialization within IT/IS
  Command and Control Issues
  Consequences of "Fix it NOW!"
Preparation
  Understanding Your Deployment
  Knowing Your Routine
  Knowing Your Limits
Execution
  Ask Your Neighbors
  Identify/Refine Your Target
  Problem vs. Routine
  Client, Server or Both?
  Recent Changes
  Lather, Rinse, Repeat...
Specialization within IT/IS
  "We don't handle that"
  Different team
  Communication often rare and/or difficult
  Simple questions answered slowly
  No one really sees the big picture
Command and Control Issues
  "We can't do that until the next window"
  Change control != everyone informed
  Software integration demands team integration as well
  Multiple vendors/contractors may be involved
Consequences of "Fix it NOW!"
  Panic mode
  Time-to-resolution faces sometimes arbitrary limits
  All hands on deck
  Overall technical guidance lacking
  Troubleshooting becomes scattershot
"Pluralitas non est ponenda sine necessitate."
"Plurality should not be posited without necessity."
William of Ockham, c. 1285-1349
Close relatives:
  When two theories explain the same phenomenon, choose the simpler
  "...admit no more causes... than such as are both true and sufficient..." (Newton)
  KISS: Keep It Simple, Stupid
Multiple independent failures are highly unlikely
Far more likely that one root failure triggered additional problems
Playing "it could be..." introduces complexity and (probably) politics
Don't chase rabbits!
Understanding Your Deployment
  It's far more than just your stuff
  Hardware (or lack thereof!)
  Operating system
  Network (within the data center)
  Network (long haul/extranet/VPN)
  Dependencies (directory, SAN)
  Special-purpose devices (firewalls/proxies/reverse proxies)
  Network appliances
Knowing Your Routine
  Understand what "normal" looks like
  Be sure to profile peak time too!
  Logins/sessions per day
  User patterns (e.g. Accounting end-of-month)
  Domino platform statistics can be VERY useful
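One way to capture what normal looks like, peak time included, is to keep a separate baseline for each hour of the day. The following is a minimal sketch, not a Domino feature: the function name, the sample format, and the session counts are all hypothetical, and real samples would come from whatever statistics you already collect.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_hourly_baseline(samples):
    """Group (hour_of_day, value) samples by hour.

    Returns {hour: (mean, standard deviation)} so that peak hours
    get their own idea of "normal" instead of a daily average.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


# Hypothetical session counts: quiet at 03:00, busy at 10:00.
samples = [(3, 10), (3, 12), (3, 14), (10, 480), (10, 500), (10, 520)]
baseline = build_hourly_baseline(samples)
```

With per-hour figures, 500 sessions at 10:00 reads as routine while 500 sessions at 03:00 stands out immediately.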
Knowing Your Limits
  Vendor benchmarks
  Third-party testing/whitepapers
  Software specifications
  CPU utilization
  RAM consumption
  ESPECIALLY important in virtual environments
Ask Your Neighbors
  A quick check with peers may identify a common problem quickly
  Formalize this process if you can (weekly outage reports?)
  May also be indicative of general network issues
  Allows you to handle some issues without vendor involvement
Identify/Refine Your Target
  Most-missed aspect of troubleshooting
  Identify scope/range of affected users
  Identify scope/range of affected servers
  LOOK FOR COMMON FACTORS!
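Looking for common factors can be as simple as counting which attribute values most affected users share. This is a rough sketch under assumed data: the record fields (site, mail_server, client) and the 80% threshold are hypothetical and would be replaced by whatever you actually know about the affected users.

```python
from collections import Counter


def common_factors(affected, min_share=0.8):
    """Return (attribute, value, share) for values shared by at least
    min_share of the affected records."""
    counts = Counter()
    for record in affected:
        for key, value in record.items():
            counts[(key, value)] += 1
    total = len(affected)
    return sorted(
        (key, value, n / total)
        for (key, value), n in counts.items()
        if n / total >= min_share
    )


# Hypothetical problem reports gathered from the help desk.
affected = [
    {"site": "Austin", "mail_server": "mail02", "client": "8.5"},
    {"site": "Boston", "mail_server": "mail02", "client": "8.0"},
    {"site": "Austin", "mail_server": "mail02", "client": "8.5"},
]
factors = common_factors(affected)
```

A shared value is only a lead, not a verdict: check whether unaffected users share it too before treating it as the target.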
Problem vs. Routine
  Take a snapshot of the problem
  Compare it to routine data
  May identify particular areas of concern
  May allow vendor to focus their efforts better/faster
  Pay particular attention to the period just BEFORE the problem (last 10 minutes)
  Be prepared to be pointed in a different direction!
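Comparing a problem snapshot against routine data can be sketched as a simple relative-change check. Everything here is an assumption for illustration: the metric names, the values, and the 25% tolerance are hypothetical, and the snapshot would ideally cover the minutes just before the problem began.

```python
def compare_to_routine(snapshot, routine, tolerance=0.25):
    """Return {metric: relative change} for metrics that differ from
    their routine value by more than `tolerance` (0.25 = 25%)."""
    findings = {}
    for name, normal in routine.items():
        if name not in snapshot or normal == 0:
            continue  # nothing to compare, or relative change undefined
        change = (snapshot[name] - normal) / normal
        if abs(change) > tolerance:
            findings[name] = round(change, 2)
    return findings


# Hypothetical routine values and a snapshot taken during the problem.
routine = {"sessions": 500, "cpu_pct": 40, "ldap_ms": 20}
snapshot = {"sessions": 510, "cpu_pct": 42, "ldap_ms": 180}
findings = compare_to_routine(snapshot, routine)
```

In this made-up case only the directory lookup time stands out, which is exactly the kind of result that can point you (or the vendor) in a different direction than the original complaint.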
Client, Server or Both?
  DON'T GO AFTER A FLY WITH A SLEDGEHAMMER!
  Resist the urge to turn on all the debug
  Overly ambitious debug can present its own performance cost
    DEBUG_TCP_ALL in IBM Lotus Domino
    VP_TRACE_ALL in IBM Lotus Sametime
    debug=FINEST in Java
  It's worth a round of data gathering to target server debug more specifically
  High-level client-side debug correlates well with trace logs
    Live HTTP Headers (Firefox add-on)
    Firebug (Firefox add-on)
    Fiddler (MSIE proxy)
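The same "target, don't blanket" principle applies to any logging framework. As an illustration only, here is how it looks with Python's standard logging module rather than Domino or Sametime settings; the logger names app.directory and app.mail are hypothetical.

```python
import logging

# Attach a handler, then keep everything quiet by default.
logging.basicConfig(format="%(name)s %(levelname)s %(message)s")
logging.getLogger().setLevel(logging.WARNING)

# Turn up detail only for the component the evidence points at.
logging.getLogger("app.directory").setLevel(logging.DEBUG)

directory_log = logging.getLogger("app.directory")
mail_log = logging.getLogger("app.mail")

directory_log.debug("bind took %d ms", 180)   # emitted
mail_log.debug("router queue depth %d", 3)    # suppressed, inherits WARNING
```

Only the suspect component pays the cost of verbose output, and the resulting log is small enough to read.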
Recent Changes
  Back to Change Control
  Look for ANY changes close to the start of the problem
  Don't forget to check for OS patches/updates
  Look for new stuff too...
  Check all along the data flow
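If change records are available in any structured form, listing the ones closest to the start of the problem is straightforward. This sketch assumes a hypothetical record layout (when, where, what), made-up dates, and an arbitrary seven-day lookback.

```python
from datetime import datetime, timedelta


def changes_near(changes, problem_start, lookback=timedelta(days=7)):
    """Return changes made within `lookback` before the problem began,
    closest to the start of the problem first."""
    earliest = problem_start - lookback
    recent = [c for c in changes if earliest <= c["when"] <= problem_start]
    return sorted(recent, key=lambda c: problem_start - c["when"])


# Hypothetical change log covering several points along the data flow.
problem_start = datetime(2024, 3, 14, 9, 30)
changes = [
    {"when": datetime(2024, 3, 1, 22, 0), "where": "mail server", "what": "application fix pack"},
    {"when": datetime(2024, 3, 13, 23, 0), "where": "firewall", "what": "rule update"},
    {"when": datetime(2024, 3, 12, 2, 0), "where": "directory server OS", "what": "OS patches"},
    {"when": datetime(2024, 3, 15, 1, 0), "where": "proxy", "what": "new appliance"},
]
suspects = changes_near(changes, problem_start)
```

Note that the list is only as good as its inputs: changes that never went through Change Control will not appear, so ask around as well.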
Lather, Rinse, Repeat...
  Be prepared to cycle through this process several times
  Apply the same principles to each area of troubleshooting
  Example:
    Identify/Refine shows only particular users suffering
    Logs show directory issues
    Now, users not experiencing problems are the "routine"
    Troubleshoot the directory by comparing problem users against routine users, e.g. get LDIF dumps for both
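Comparing LDIF dumps for a problem user and a routine user can be sketched as an attribute-by-attribute diff. This is a deliberately simplified parser that assumes one "attribute: value" pair per line, with no folded lines and no base64 values; the entries, the mailHost attribute, and the ignore list are hypothetical.

```python
def parse_ldif_entry(text):
    """Parse a single simple LDIF entry into {attribute: set of values}."""
    attrs = {}
    for line in text.strip().splitlines():
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        attrs.setdefault(name.strip().lower(), set()).add(value.strip())
    return attrs


def diff_entries(problem, routine, ignore=("dn", "cn", "uid", "mail")):
    """Return {attribute: (problem values, routine values)} where they differ,
    skipping attributes expected to be unique per user."""
    diffs = {}
    for name in sorted(set(problem) | set(routine)):
        if name in ignore:
            continue
        p, r = problem.get(name, set()), routine.get(name, set())
        if p != r:
            diffs[name] = (sorted(p), sorted(r))
    return diffs


problem_ldif = """dn: uid=jdoe,ou=people,o=example
uid: jdoe
objectClass: inetOrgPerson
mailHost: mail02
"""
routine_ldif = """dn: uid=asmith,ou=people,o=example
uid: asmith
objectClass: inetOrgPerson
mailHost: mail01
"""
diffs = diff_entries(parse_ldif_entry(problem_ldif), parse_ldif_entry(routine_ldif))
```

Running the same diff across several problem/routine pairs helps separate a real pattern from ordinary per-user variation.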