Repairing
Eric Anderson
U.C. Berkeley
Overview
What is System Administration?
– What is the problem?
– Goals of Dissertation Research
– Goals of System Administration
Monitoring, diagnosing, and repairing
Dissertation Timeline
Conclusion
1-Sep-19 2
What is the problem?
Problems occur in systems and result in loss of productivity
– Server failures → denial of service
– System overload → lower productivity
Cost is too high
– Cost of ownership estimated at $5,000-$15,000/year/machine
– Median salary (~$50k) / (median # machines/admin) ≈ $700
Our goal: Reduce cost by
– Repairing problems faster (possibly automatically)
– Handling more problems
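The salary figure on this slide implies a machines-per-administrator ratio. A quick check, taking the slide's rounded numbers at face value (an assumption, since the underlying survey data are not shown):

```latex
\frac{\$50{,}000 \text{ per admin per year}}{n \text{ machines per admin}} \approx \$700
\quad\Longrightarrow\quad
n \approx \frac{50{,}000}{700} \approx 71
```

So the quoted $700 corresponds to a median of roughly 71 machines per administrator.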
Goals of Dissertation Research
Describe field of System Administration
Monitoring, Diagnosing, and Repairing:
– Approach: Synthesize solutions from other fields of research
1) Detect previously ignored problems
2) Automatic repair of some problems
3) Reduce number of administrators needed
4) Support users’ understanding of system
Apply here & distribute software
Thesis: Through our approach, we can achieve
goals 1-4.
Goals of System Administration
Goal: Support cost-effective use of the computer
environment
More specifically (some non-technical):
Environment: uniform, customizable, high performance and
available
Faults & errors: recovery from benign errors, protection from
malicious attacks
Users: training, accounting & planning, legal
Monitoring, Diagnosing, and
Repairing (MDR)
Introductory examples
Fundamental requirements
Environmental constraints
Previous work
Six key innovations
Architecture
Details on innovations
Evaluation methodology
MDR: Examples — Intro
Four examples
1) Broken component
2) Resource overload — transient
3) Resource contention — user program
4) Resource exhaustion — long term
Previous Solutions
– Pay someone to watch
– Ignore or wait for someone to complain
– Specialized scripts (not general → vast repeated work)
MDR: Example 1
Web server has crashed/hung
Gather information: process existence, service
uptime, restart times
Analyze data: process not responding, and hasn’t
been recently restarted.
Automatic repair: restart daemon.
Notify administrator: had to restart daemon.
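The gather/analyze/repair/notify flow of Example 1 can be sketched as a small decision function. This is a minimal illustration, not the system's actual code: the `ServiceStatus` fields, the `choose_action` name, and the 300-second restart window are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    """Gathered facts about one daemon (field names are hypothetical)."""
    process_exists: bool
    responding: bool
    seconds_since_restart: float


def choose_action(status, min_restart_interval=300.0):
    """Return 'ok', 'restart', or 'escalate' for a monitored daemon."""
    if status.process_exists and status.responding:
        return "ok"
    if status.seconds_since_restart < min_restart_interval:
        # It was restarted recently and is failing again: restarting in a
        # loop would hide the problem, so hand it to the administrator.
        return "escalate"
    # Not responding and not recently restarted: safe to repair automatically.
    return "restart"


hung = ServiceStatus(process_exists=True, responding=False,
                     seconds_since_restart=3600.0)
print(choose_action(hung))  # prints "restart"
```

In either the "restart" or the "escalate" case the administrator would still be notified, matching the last step on the slide.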
MDR: Example 2
The NOW is “slow.”
Gather data: load, process info, CPU info
Analyze data: bounds on expected values
Notify administrator: fileserver overloaded.
Visualize data: nfsd’s are overloaded.
Repair: admin moves data, adds disks, or starts
more nfsd’s
MDR: Example 3
User running program
Gather: user statistics, CPU, disk
Visualize: spending too much time waiting on remote
accesses
(User fixes program, gathering, visualization repeated)
Analyze: some nodes have less throughput
Visualize: those have other jobs running on them
Repair: user is benchmarking so kills all extraneous
processes
MDR: Example 4
Web server load increasing beyond capacity
Gather: CPU, request rate, reply latency
Analyze: Burst lengths getting longer, latency
increasing
Visualize: Graph of burst lengths & CPU usage over
time
Repair: Order more machines, install load balancer
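The "burst lengths getting longer" analysis in Example 4 can be illustrated as follows. This is a sketch under assumptions: a burst is taken to be a run of consecutive samples above a capacity threshold, and the trend test is a crude comparison of the earlier and later halves. Both choices, and the function names, are hypothetical.

```python
def burst_lengths(request_rates, capacity):
    """Lengths of consecutive runs of samples above capacity."""
    lengths, run = [], 0
    for rate in request_rates:
        if rate > capacity:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths


def is_growing(lengths):
    """Crude trend test: later bursts are longer on average than earlier ones."""
    if len(lengths) < 2:
        return False
    mid = len(lengths) // 2
    early, late = lengths[:mid], lengths[mid:]
    return sum(late) / len(late) > sum(early) / len(early)


rates = [10, 120, 10, 130, 140, 10, 150, 160, 170]
bursts = burst_lengths(rates, capacity=100)
print(bursts, is_growing(bursts))  # prints "[1, 2, 3] True"
```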
MDR: Fundamental Requirements
Gathering
Flexible data gathering, self-describing storage
Analyzing
Calculate statistical measures, identify relevant statistics.
Notifying
Flexible infrequent messages to administrators or users
Visualizing
Maximize information/pixel, support multiple interfaces
Repairing
Automate simple repairs, support group operations
MDR: Environmental Constraints
Change is inherent
– No Web or Mbone 5 years ago; now most or many sites have them.
Problems on many time-scales
– Second-to-minute transients vs. week-to-month capacity problems
Must operate under very adverse conditions
– Often used when system is broken
– Would like at least post-mortem analysis
Need to handle hundreds to thousands of nodes
– Scalability: All sites are getting larger, possibly wide area
– Our system has 200 (NOW) to 2000 (Soda) nodes
MDR: Previous Systems
Many previous systems: I’ve looked at about 16.
Not comprehensive, not extensible.
Look at a few that did a nice job of a piece:
[Fink97] — Run test, notify display engine
+ Easy to add tests
+ Selectivity of notification good
– Tests are just programs (redo gathering)
– Central, non-fault tolerant solution
– Many hard coded constants
MDR: Previous Systems, cont.
[Hard92] — buzzerd: Pager notification system
+ Flexible rules for notification
+ External interface for adding notify requests
– Simplistic gathering
– Poor fault tolerance
[Pier96] — Igor group fixes
+ Flexible operations
+ Nice reporting of success/failure
– Weak security, runs as root
– No delegation of responsibility
MDR: Six Key Innovations (1-3)
Replicated, semi-hierarchical, data storage nodes
– Rendezvous point for programs
– Handles scaling and fault-tolerance
Self describing structures
– Functions (visualize, summarize) + data go in database (OO)
– DB has machine and human readable descriptions of data
End to end notification
– Detect problems in MDR system
– Guarantee important messages get to users
MDR: Six Key Innovations (4-6)
Aggregation and High Resolution Color Displays
– Reduce information to manageable amounts
– Maximize information per unit area
Partially self-configuring
– Learn averages, deviations, burst sizes
– Learn which values are relevant to problems
Secure, user-specified group repairs
– Don’t enable malicious attacks
– Automate repairs of many machines
MDR: Architecture
[Figure: MDR architecture diagram. Recoverable component labels: Gather Agent (vmstat, ping, and tcpdump threads); SQL-based Data Repository; Aggregation Engine; Daemon Restarter; E-mail or Phone Notifier; Diagnostic Console; Long-term graphing; Tolerance, Relevance Learner]
MDR-Arch: Derivations
[Figure: architecture diagram detail. Recoverable component labels: Daemon Restarter; E-mail or Phone Notifier; Diagnostic Console; Tolerance, Relevance Learner]
Key: Semi-Hier. DBs.
[Figure: semi-hierarchical data storage nodes; two nodes are labeled "Top level cache"]
Key: Self-Describing
De-couple data gathering, data storage, and data use
Self-Describing for Humans
– Descriptions of meanings of values stored with tables
– Description of methods of gathering stored with tables
– Column names help with self-description
Self-Describing for Computers
– Functions for visualizing or summarizing data
– Indication of resource selection from resource statistics
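One way to picture self-describing storage is a repository where each data table is accompanied by rows describing what its columns mean, how they were gathered, and which function summarizes them. The sketch below uses Python's built-in `sqlite3` purely for illustration; the table names, the `column_info` schema, and the `summarize` helper are hypothetical and are not taken from the MDR system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cpu_load (host TEXT, ts INTEGER, load1 REAL);
-- Hypothetical metadata table: human-readable meaning and gathering method,
-- plus a machine-readable hint for how to summarize the column.
CREATE TABLE column_info (
    table_name TEXT, column_name TEXT,
    meaning TEXT, gathered_by TEXT, summarize_with TEXT
);
""")
conn.execute(
    "INSERT INTO column_info VALUES (?, ?, ?, ?, ?)",
    ("cpu_load", "load1", "1-minute load average", "sampled every 60 s", "AVG"),
)
conn.executemany(
    "INSERT INTO cpu_load VALUES (?, ?, ?)",
    [("u1", 0, 0.5), ("u1", 60, 1.5)],
)

ALLOWED_SUMMARIES = {"AVG", "MIN", "MAX"}


def summarize(conn, table, column):
    """Summarize a column using only what the repository says about it."""
    row = conn.execute(
        "SELECT meaning, summarize_with FROM column_info "
        "WHERE table_name = ? AND column_name = ?",
        (table, column),
    ).fetchone()
    if row is None:
        raise KeyError(f"no description stored for {table}.{column}")
    meaning, func = row
    if func not in ALLOWED_SUMMARIES:
        raise ValueError(f"unsupported summary function: {func}")
    # Identifiers cannot be bound as SQL parameters; table and column are
    # trusted here only because they matched a row in column_info.
    value = conn.execute(f"SELECT {func}({column}) FROM {table}").fetchone()[0]
    return meaning, value


print(summarize(conn, "cpu_load", "load1"))
```

A generic display or analysis tool written this way needs no built-in knowledge of `cpu_load`, which is the decoupling of gathering, storage, and use that the slide describes.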
Key: End-to-End Notification
Recall: System must operate under extreme conditions
Humans must validate that system is still working
– Standalone display can indicate timestamps, mark out-of-date data
– Wireless machine could intermittently contact notification
system
– Pager could be automatically paged every so often
Problems should be propagated to end users.
– Flexible notification — connected systems, e-mail, pager.
– Limit over-notification
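Two of the ideas on this slide, marking out-of-date data and limiting over-notification, can be sketched in a few lines. This is an illustration under assumptions: the `Notifier` class, the `stale_sources` helper, and the time thresholds are hypothetical, and timestamps are plain seconds supplied by the caller.

```python
class Notifier:
    """Suppress repeats of the same problem within min_interval seconds."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last_sent = {}   # problem key -> time of last delivered message
        self.sent = []        # delivered (time, key, message) tuples

    def notify(self, key, message, now):
        last = self.last_sent.get(key)
        if last is not None and now - last < self.min_interval:
            return False      # too soon: avoid over-notification
        self.last_sent[key] = now
        self.sent.append((now, key, message))
        return True


def stale_sources(last_update, now, max_age):
    """Names of data sources whose newest sample is older than max_age."""
    return sorted(name for name, ts in last_update.items()
                  if now - ts > max_age)


notifier = Notifier(min_interval=600)
print(notifier.notify("fs1-overload", "fileserver overloaded", now=0))    # True
print(notifier.notify("fs1-overload", "fileserver overloaded", now=300))  # False
print(stale_sources({"node-a": 100, "node-b": 10}, now=120, max_age=60))  # ['node-b']
```

A display that lists `stale_sources` alongside the data lets a human see that the monitoring system itself has stopped reporting, which is the end-to-end check the slide asks for.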
Key: Aggregation & HiRes
Target system has hundreds to thousands of nodes
Aggregate by showing only out-of-bounds, relevant values (via automatic tuning)
Also want overview of system
– Aggregate across similar statistics; show value (fill) &
dispersion (shade)
– Use color to highlight important values.
– Aggregate across values (machine utilization = CPU + disk +
memory)
– Maximize data/pixel [Tufte]
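Aggregating one statistic across similar nodes into a value and a dispersion can be sketched as below. The mapping of mean to "fill" and standard deviation to "shade" is an assumption made for illustration, as are the function names and the two-standard-deviation cutoff for highlighting.

```python
from statistics import mean, pstdev


def aggregate(values_by_node):
    """Return (value, dispersion) for one statistic across many nodes."""
    values = list(values_by_node.values())
    return mean(values), pstdev(values)


def outliers(values_by_node, k=2.0):
    """Nodes more than k standard deviations from the group mean."""
    center, spread = aggregate(values_by_node)
    return sorted(node for node, v in values_by_node.items()
                  if spread > 0 and abs(v - center) > k * spread)


# Nine lightly loaded nodes and one busy node (made-up numbers).
cpu = {f"n{i}": 0.1 for i in range(9)}
cpu["n9"] = 0.9

fill, shade = aggregate(cpu)   # one glyph summarizes ten nodes
print(round(fill, 2), round(shade, 2))
print(outliers(cpu))           # nodes worth highlighting in color
```

The display then needs one glyph per group plus a highlight for each outlier, rather than one glyph per node.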
Key: Agg & HiRes: Snapshot
[Figure: snapshot of the aggregated display]
Key: Self-Configuring
Single statistics
– Phase 1: Calculate averages, standard deviations, burst
sizes
– Worked in other systems [Jaco88, Karn91]
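Learning averages and standard deviations for a single statistic can be done incrementally, without storing every sample. The sketch below uses Welford's online algorithm as one standard way to do this; the slides do not say which method is intended, and the `RunningStats` class and the three-standard-deviation bound are assumptions.

```python
import math


class RunningStats:
    """Incremental mean and standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def stddev(self):
        """Sample standard deviation; 0.0 until two samples are seen."""
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

    def out_of_bounds(self, x, k=3.0):
        """True if x lies more than k standard deviations from the mean."""
        return self.n > 1 and abs(x - self.mean) > k * self.stddev()


stats = RunningStats()
for sample in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(sample)
print(stats.out_of_bounds(20), stats.out_of_bounds(6))  # prints "True False"
```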
Identify relevant statistics
– Give system Boolean examples (which variables are out of bounds, and whether the system is working) → get function.
– Works for Boolean disjunctions in some cases:
• With lots of irrelevant variables [Litt89]
• With random bad examples [Sloa89]
• In some cases, with malicious bad examples [Ande94]
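Learning which out-of-bounds variables matter can be illustrated with a mistake-driven learner for Boolean disjunctions. The sketch below is a Winnow-style multiplicative-weights learner, a well-known algorithm for settings with many irrelevant variables; whether it is the method the cited works denote is an assumption. The ground-truth rule `system_broken` (variables 0 and 3 matter, the other six do not) is made up for the example.

```python
from itertools import product


class Winnow:
    """Winnow-style learner for monotone Boolean disjunctions."""

    def __init__(self, n, alpha=2.0):
        self.weights = [1.0] * n
        self.threshold = float(n)
        self.alpha = alpha

    def predict(self, x):
        active = sum(w for w, xi in zip(self.weights, x) if xi)
        return active >= self.threshold

    def update(self, x, label):
        """Learn from one example; return True if the prediction was wrong."""
        if self.predict(x) == label:
            return False
        # Missed a failure: promote active variables. False alarm: demote them.
        factor = self.alpha if label else 1.0 / self.alpha
        self.weights = [w * factor if xi else w
                        for w, xi in zip(self.weights, x)]
        return True


def system_broken(x):
    """Hypothetical ground truth: broken if variable 0 or 3 is out of bounds."""
    return bool(x[0] or x[3])


examples = list(product([0, 1], repeat=8))
learner = Winnow(n=8)
for _ in range(100):   # repeat passes until one completes with no mistakes
    mistakes = sum(learner.update(x, system_broken(x)) for x in examples)
    if mistakes == 0:
        break

relevant = [i for i, w in enumerate(learner.weights)
            if w >= learner.threshold]
print(relevant)  # prints "[0, 3]"
```

Only false alarms demote weights, and a false alarm never involves a relevant variable, so the relevant weights only grow. That is why the total number of mistakes stays small even when most variables are irrelevant.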
Key: Secure Remote Actions
Security because of malicious attacks, benign errors
Delegation to remove SA from the loop
Independence from particular algorithms
– Building a library
– Program with principals (hosts, users), and properties
(signed, sealed, verifiable)
Use secure, run-time extensible languages
Actions report through gathering system
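Programming with principals and properties rather than keys and algorithms can be illustrated with a tiny "signed action" helper. This is a sketch under assumptions: a shared-secret HMAC from Python's standard library stands in for the "signed" property, the function names and message layout are hypothetical, and there is no replay protection or delegation here.

```python
import hashlib
import hmac
import json


def sign_action(key, principal, action):
    """Bind an action to a principal with a keyed MAC (illustrative only)."""
    payload = json.dumps({"principal": principal, "action": action},
                         sort_keys=True)
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify_action(key, message):
    """Return the decoded request if the tag checks out, else None."""
    expected = hmac.new(key, message["payload"].encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, message["tag"]):
        return None
    return json.loads(message["payload"])


key = b"example-shared-secret"   # placeholder; never hard-code real keys
msg = sign_action(key, "admin@host1", {"op": "restart", "target": "httpd"})
print(verify_action(key, msg)["action"]["op"])   # prints "restart"

tampered = dict(msg, payload=msg["payload"].replace("restart", "kill"))
print(verify_action(key, tampered))              # prints "None"
```

Callers deal only in principals and a verified-or-not result, so the underlying primitive could be swapped (for example, to public-key signatures) without changing them, which is the independence from particular algorithms that the slide calls for.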
MDR: Testing Methodology
Fault injection
– Deliberately make the system slow
– Break hardware/software components
Feature comparison
– Paper comparison with other systems
Usage in practice
– Experience important to show system works
– We need administrative tools ourselves
Testimonials
– Experience at other sites lends credibility
MDR: Demo
Hierarchical structure working (1 level right now)
Alternative Interface
Fault Injection
Need for Aggregation
Crufty right now
Demo
Timeline: Key Pieces
1) (DBs) Replicated, semi-hierarchical, data storage nodes
2) (SDS) Self describing structures
3) (Vis) Aggregation and High Resolution Color Displays
4) (E2EN) End to end notification
5) (ReS) Automatic Restart
6) (Cfg) Partially self-configuring
7) (Rep) Secure, user-specified group repairs
Timeline
Deadlines: LISA 6/97, USENIX 12/97, OSDI 3/98, LISA 6/98, Graduation 12/98, SOSP 3/99
[Figure: Gantt chart; time axis marked June 1997, Dec 1997, June 1998, Dec 1998, Mar 1999. Rows:]
– Prototype 1,2,3 (DBs, SelfD, Vis)
– Prototype 4,5 (Notify, Restart)
– Prototype 6,7 (AConfig, Repair)
– Experience with 1-7
– Architecture of Complete System
– Writing
Conclusion
Description of field shows breadth
Monitoring, diagnosing, and repairing shows depth
– Examples show importance of problem
– Fundamental goals & environmental constraints show
understanding of problem
– Key innovations show differences from previous systems.
– Architecture and initial prototype show approach to problem
– Testing methods show ways to validate solution.
Timeline shows plan & milestones to graduation
Old Slides
Solutions
Managing stable storage
Supporting users
Simplifying security
Monitoring, diagnosing, and repairing
Managing Stable Storage
Consistency vs. availability
Fault tolerance
Scalability
Recoverability
Customization
Supporting Users
Automated help desk
– Searchable collection of questions
– Easy method for addition
Remote device access
Site-wide training
Goals: Environment
Uniform
– Supports user mobility by eliminating arbitrary changes
– Increases effectiveness by avoiding need for users to learn multiple
interfaces
Customizable
– Handles special systems and special needs [firewalls, servers]
– Obviously reduces uniformity
Goals: Environment, cont.
High Performance
– Increases effectiveness of users [HCI/psych]
– Limited by cost-effectiveness
Available
– Effectiveness is 0 if system isn’t working
– Balanced against expense
Goals: Faults & Errors
Benign errors:
– Accidentally deleted files
– Unnoticed runaway processes
Malicious attacks:
– TCP SYN attack
– Sendmail bugs
– Data stealing
– False data injection
Goals: Users
Training
– Troubleshooting = one-on-one training
– Larger sessions = classes
Accounting
– Supports management, helps billing
Capacity Planning
– Expanding systems takes time
Legal
– Sensitive information needs protection
Simplifying Security
USENIX talk asks: “If cryptography is so great, why isn’t it used more?”
SAs worry about security in order to protect data.
Goal: Ease development of secure applications
Write programs using principals & properties rather than keys and algorithms
Unify various forms of available cryptography (public-key, secret-key, PGP, Kerberos)
My use: protected, transferable rights to allow various actions
– Modify system configurations (add filesystems, printers)
– Kill/restart processes (runaway, after configurations modified)
– Access data (private logs, for backups, etc.)
Conclusion
System administration as area of research
– Description of field
– Areas for future research
• Managing stable storage
• Supporting users
Initial investigation of research area
– Monitoring, diagnosing, and repairing
• Broad, draws from many fields