
Monitoring, Diagnosing, and Repairing
Eric Anderson
U.C. Berkeley
Overview
 What is System Administration?
– What is the problem?
– Goals of Dissertation Research
– Goals of System Administration
 Monitoring, diagnosing, and repairing
 Dissertation Timeline
 Conclusion

1-Sep-19 2
What is the problem?
 Problems occur in systems, and result in loss of
productivity
– Server failures → denial of service
– System overload → lower productivity
 Cost is too high
– Cost of ownership estimated at $5,000-$15,000/year/machine
– Median salary (~$50k) / (median # machines/admin) → ~$700
 Our goal: Reduce cost by
– Repairing problems faster (possibly automatically)
– Handling more problems

Goals of Dissertation Research
 Describe field of System Administration
 Monitoring, Diagnosing, and Repairing:
– Approach: Synthesize solutions from other fields of research
1) Detect previously ignored problems
2) Automatic repair of some problems
3) Reduce number of administrators needed
4) Support users’ understanding of system
 Apply here & distribute software
 Thesis: Through our approach, we can achieve
goals 1-4.

Goals of System Administration
Goal: Support cost-effective use of the computer
environment
More specifically (some non-technical):
Environment: uniform, customizable, high performance and
available
Faults & errors: recovery from benign errors, protection from
malicious attacks
Users: training, accounting & planning, legal

Monitoring, Diagnosing, and Repairing (MDR)
 Introductory examples
 Fundamental requirements
 Environmental constraints
 Previous work
 Six key innovations
 Architecture
 Details on innovations
 Evaluation methodology

MDR: Examples — Intro
 Four examples
1) Broken component
2) Resource overload — transient
3) Resource contention — user program
4) Resource exhaustion — long term
 Previous Solutions
– Pay someone to watch
– Ignore or wait for someone to complain
– Specialized scripts (not general → vast repeated work)

MDR: Example 1
Web server has crashed/hung
 Gather information: process existence, service
uptime, restart times
 Analyze data: process not responding, and hasn’t
been recently restarted.
 Automatic repair: restart daemon.
 Notify administrator: had to restart daemon.
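The gather/analyze/repair/notify steps above can be sketched as a watchdog; `is_responding`, `restart`, and the 300-second holdoff are illustrative assumptions, not details from the talk.

```python
MIN_RESTART_INTERVAL = 300  # seconds; assumed holdoff to avoid restart loops

def check_and_repair(is_responding, last_restart, now, restart, notify):
    """Restart the daemon only if it is down and wasn't just restarted.

    Returns True if an automatic repair was performed."""
    if is_responding():
        return False
    if now - last_restart < MIN_RESTART_INTERVAL:
        # Recently restarted and still failing: escalate instead of looping.
        notify("daemon failing despite restart; needs human attention")
        return False
    restart()
    notify("daemon was unresponsive; restarted automatically")
    return True
```

The holdoff check is what separates "automatic repair" from "notify administrator": a daemon that keeps dying is handed to a human rather than restarted forever.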

MDR: Example 2
The NOW is “slow.”
 Gather data: load, process info, CPU info
 Analyze data: bounds on expected values
 Notify administrator: fileserver overloaded.
 Visualize data: nfsd’s are overloaded.
 Repair: admin moves data, adds disks, or starts
more nfsd’s

MDR: Example 3
User running program
 Gather: user statistics, CPU, disk
 Visualize: spending too much time waiting on remote
accesses
(User fixes program; gathering and visualization repeated)
 Analyze: some nodes have less throughput
 Visualize: those have other jobs running on them
 Repair: user is benchmarking so kills all extraneous
processes

MDR: Example 4
Web server increasing beyond capacity
 Gather: CPU, request rate, reply latency
 Analyze: Burst lengths getting longer, latency
increasing
 Visualize: Graph of burst lengths & CPU usage over
time
 Repair: Order more machines, install load balancer

MDR: Fundamental Requirements
 Gathering
 Flexible data gathering, self-describing storage
 Analyzing
 Calculate statistical measures, identify relevant statistics.
 Notifying
 Flexible infrequent messages to administrators or users
 Visualizing
 Maximize information/pixel, support multiple interfaces
 Repairing
 Automate simple repairs, support group operations

MDR: Environmental Constraints
 Change is inherent
– Web/MBone didn’t exist 5 years ago; now most sites have them.
 Problems on many time-scales
– Second-to-minute transients vs. week-to-month capacity problems
 Must operate under very adverse conditions
– Often used when system is broken
– Would like at least post-mortem analysis
 Need to handle hundreds – thousands of nodes
– Scalability: All sites are getting larger, possibly wide area
– Our system has 200 (NOW) – 2000 (Soda) nodes

MDR: Previous Systems
 Many previous systems: I’ve looked at about 16.
 Not comprehensive, not extensible.
 Look at a few that did a nice job of a piece:
 [Fink97] — Run test, notify display engine
+ Easy to add tests
+ Selectivity of notification good
– Tests are just programs (redo gathering)
– Central, non-fault tolerant solution
– Many hard coded constants

MDR: Previous Systems, cont.
 [Hard92] — buzzerd: Pager notification system
+ Flexible rules for notification
+ External interface for adding notify requests
– Simplistic gathering
– Poor fault tolerance
 [Pier96] — Igor group fixes
+ Flexible operations
+ Nice reporting of success/failure
– Weak security, runs as root
– No delegation of responsibility

MDR: Six Key Innovations (1-3)
 Replicated, semi-hierarchical, data storage nodes
– Rendezvous point for programs
– Handles scaling and fault-tolerance
 Self-describing structures
– Functions (visualize, summarize) + data go in database (OO)
– DB has machine and human readable descriptions of data
 End-to-end notification
– Detect problems in MDR system
– Guarantee important messages get to users

MDR: Six Key Innovations (4-6)
 Aggregation and High Resolution Color Displays
– Reduce information to manageable amounts
– Maximize information per unit area
 Partially self-configuring
– Learn averages, deviations, burst sizes
– Learn which values are relevant to problems
 Secure, user-specified group repairs
– Don’t enable malicious attacks
– Automate repairs of many machines

MDR: Architecture
[Architecture diagram: vmstat, ping, and tcpdump gather threads feed a gather agent and aggregation layer into an SQL-based data repository engine; consumers of the repository include a daemon restarter, an e-mail/phone notifier, a diagnostic console, long-term graphing, and a tolerance/relevance learner.]
MDR-Arch: Derivations

[Diagram: a reduced derivation of the architecture, showing the SQL-based data repository feeding the daemon restarter, the e-mail/phone notifier, the diagnostic console, and the tolerance/relevance learner.]
Key: Semi-Hier. DBs.
[Diagram: two top-level caches above three mid-level caches, each mid-level cache backed by per-node databases.]
 Fault tolerance
 Scalability:
– Caches don’t need to commit to disk — authoritative copy
elsewhere.
– Batching updates over wide area links.
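The update-batching point above can be sketched as a cache that buffers observations and ships them upstream in groups, trading a little latency for far fewer wide-area messages; the class shape and batch size are illustrative assumptions.

```python
class BatchingCache:
    """Buffers updates and forwards them upstream in batches."""

    def __init__(self, batch_size, send):
        self.batch_size = batch_size
        self.send = send          # callable that ships one batch upstream
        self.pending = []

    def update(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Ship whatever is pending (e.g. on a timer or at shutdown).
        if self.pending:
            self.send(self.pending)
            self.pending = []
```

Because the authoritative copy lives elsewhere, a cache that loses its pending batch loses nothing irrecoverable, which is what lets it skip committing to disk.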

Key: Self-Describing
 De-couple data gathering, data storage, and data use
 Self-Describing for Humans
– Descriptions of meanings of values stored with tables
– Description of methods of gathering stored with tables
– Column names help make the data self-documenting
 Self-Describing for Computers
– Functions for visualizing or summarizing data
– Indication of resource selection from resource statistics
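One way to sketch the self-describing idea, assuming an SQL repository as in the architecture: pair each data table with metadata rows giving each column's meaning and gathering method. The table and column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cpu_load (host TEXT, ts INTEGER, load1 REAL);
-- Human-readable descriptions stored alongside the data itself.
CREATE TABLE table_descriptions (
    tbl TEXT, col TEXT, meaning TEXT, gathered_by TEXT
);
INSERT INTO table_descriptions VALUES
  ('cpu_load', 'load1', '1-minute load average', 'vmstat thread, 30s poll');
""")

def describe(conn, table, column):
    """Look up the stored meaning and gathering method of a column."""
    return conn.execute(
        "SELECT meaning, gathered_by FROM table_descriptions "
        "WHERE tbl = ? AND col = ?", (table, column)).fetchone()
```

A visualizer or summarizer can then interpret tables it has never seen, which is what de-couples gathering, storage, and use.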

Key: End-to-End Notification
Recall: System must operate under extreme conditions
 Humans must validate that system is still working
– Standalone display can indicate timestamps, mark out-of-date data
– Wireless machine could intermittently contact notification
system
– Pager could be automatically paged every so often
 Problems should be propagated to end users.
– Flexible notification — connected systems, e-mail, pager.
– Limit over-notification
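The "limit over-notification" point can be sketched as duplicate suppression with a holdoff window, so an administrator sees each problem once rather than once per poll; the interface and window length are assumptions.

```python
class Notifier:
    """Delivers messages but suppresses repeats within a holdoff window."""

    def __init__(self, holdoff, deliver):
        self.holdoff = holdoff    # seconds between repeats of one problem
        self.deliver = deliver    # e.g. send e-mail or page
        self.last_sent = {}       # problem key -> last delivery time

    def notify(self, key, message, now):
        last = self.last_sent.get(key)
        if last is not None and now - last < self.holdoff:
            return False          # suppressed: already reported recently
        self.last_sent[key] = now
        self.deliver(message)
        return True
```

A persistent problem still re-notifies after the window expires, which also gives the end-to-end property: silence is bounded.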

Key: Aggregation & HiRes
 System target has hundreds – thousands of nodes
 Aggregate by showing out of bounds, relevant values
(via automatic tuning)
 Also want overview of system
– Aggregate across similar statistics; show value (fill) &
dispersion (shade)
– Use color to highlight important values.
– Aggregate across values (machine utilization = CPU + disk +
memory)
– Maximize data/pixel [Tufte]
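A minimal sketch of the aggregation above: collapse one statistic across nodes into a central value (the "fill") and a dispersion (the "shade"), and combine per-resource utilizations into one machine-level figure. The equal weighting of CPU, disk, and memory is an illustrative assumption.

```python
from statistics import mean, pstdev

def aggregate(values):
    """Summarize one statistic across many nodes as fill + shade."""
    return {"fill": mean(values), "shade": pstdev(values)}

def machine_utilization(cpu, disk, memory):
    # Equal weighting assumed; a deployed system would tune these.
    return (cpu + disk + memory) / 3.0
```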

Key: Agg & HiRes: Snapshot

Key: Self-Configuring
 Single statistics
– Phase 1: Calculate averages, standard deviations, burst
sizes
– Worked in other systems [Jaco88, Karn91]
 Identify relevant statistics
– Give the system Boolean examples (variables out of bounds, system working/not working) and get back a classifying function.
– Works for Boolean disjunctions in some cases:
• With lots of irrelevant variables [Litt89]
• With random bad examples [Sloa89]
• In some cases, with malicious bad examples [Ande94]
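The "learn averages and deviations" step can be sketched with the exponentially weighted estimators used for TCP round-trip times [Jaco88]; the gains (1/8, 1/4) and the 4-deviation bound are the conventional choices, assumed here rather than taken from the talk.

```python
def ewma_update(avg, dev, sample, g_avg=0.125, g_dev=0.25):
    """One Jacobson-style update of the running average and mean deviation."""
    err = sample - avg
    avg += g_avg * err
    dev += g_dev * (abs(err) - dev)
    return avg, dev

def out_of_bounds(sample, avg, dev, k=4.0):
    # Flag samples more than k learned deviations from the learned average.
    return abs(sample - avg) > k * dev
```

Feeding each new measurement through `ewma_update` gives per-statistic bounds with no hard-coded constants, addressing the complaint about [Fink97].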

Key: Secure Remote Actions
 Security because of malicious attacks, benign errors
 Delegation to remove SA from the loop
 Independence from particular algorithms
– Building a library
– Program with principals (hosts, users), and properties
(signed, sealed, verifiable)
 Use secure, run-time extensible languages
 Actions report through gathering system

MDR: Testing Methodology
 Fault injection
– Deliberately make the system slow
– Break hardware/software components
 Feature comparison
– Paper comparison with other systems
 Usage in practice
– Experience important to show system works
– We have need of administrative tools
 Testimonials
– Experience at other sites lends credibility

MDR: Demo
 Hierarchical structure working (1 level right now)
 Alternative Interface
 Fault Injection
 Need for Aggregation
 Crufty right now
 Demo

Timeline: Key Pieces
1) (DBs) Replicated, semi-hierarchical, data storage nodes
2) (SDS) Self-describing structures
3) (Vis) Aggregation and High Resolution Color Displays
4) (E2EN) End-to-end notification
5) (ReS) Automatic Restart
6) (Cfg) Partially self-configuring
7) (Rep) Secure, user-specified group repairs

Timeline
Deadlines: LISA 6/97, USENIX 12/97, OSDI 3/98, LISA 6/98, SOSP 3/99, Graduation 12/98
[Gantt chart, June 1997 – March 1999:]
– Prototype 1, 2, 3 (DBs, SelfD, Vis)
– Prototype 4, 5 (Notify, Restart)
– Prototype 6, 7 (AConfig, Repair)
– Experience with 1-7
– Architecture of complete system
– Writing
Conclusion
 Description of field shows breadth
 Monitoring, diagnosing, and repairing shows depth
– Examples show importance of problem
– Fundamental goals & environmental constraints show
understanding of problem
– Key innovations show differences from previous systems.
– Architecture and initial prototype show approach to problem
– Testing methods show ways to validate solution.
 Timeline shows plan & milestones to graduation

Old Slides
Solutions
 Managing stable storage
 Supporting users
 Simplifying security
 Monitoring, diagnosing, and repairing

Managing Stable Storage
 Consistency vs. availability
 Fault tolerance
 Scalability
 Recoverability
 Customization

Supporting Users
 Automated help desk
– Searchable collection of questions
– Easy method for addition
 Remote device access
 Site-wide training

Goals: Environment
 Uniform
– Supports user mobility by eliminating arbitrary changes
– Increases effectiveness by avoiding need for users to learn multiple
interfaces
 Customizable
– Handles special systems and special needs [firewalls, servers]
– Obviously reduces uniformity

Goals: Environment, cont.
 High Performance
– Increases effectiveness of users [HCI/psych]
– Limited by cost-effectiveness
 Available
– Effectiveness is 0 if system isn’t working
– Balanced against expense

Goals: Faults & Errors
 Benign errors:
– Accidentally deleted files
– Unnoticed runaway processes
 Malicious attacks:
– TCP SYN attack
– Sendmail bugs
– Data stealing
– False data injection

Goals: Users
 Training
– Troubleshooting = one-on-one training
– Larger sessions = classes
 Accounting
– Supports management, helps billing
 Capacity Planning
– Expanding systems takes time
 Legal
– Sensitive information needs protection

Simplifying Security
USENIX talk asks, “If cryptography is so great, why isn’t it used more?”
SAs worry about security to protect data.
 Goal: Ease development of secure applications
 Write programs using principals & properties rather than keys and algorithms
 Unify various forms of available cryptography (public key, secret-key, PGP,
Kerberos)
 My use: protected, transferable rights to allow various actions
– Modify system configurations (add filesystems, printers)
– Kill/restart processes (runaway, after configurations modified)
– Access data (private logs, for backups, etc.)

Conclusion
 System administration as area of research
– Description of field
– Areas for future research
• Managing stable storage
• Supporting users
 Initial investigation of research area
– Monitoring, diagnosing, and repairing
• Broad, draws from many fields

