W3a Reliability, MTTF, Availability, Redundancy

Basic Concepts
Reliability, MTTF, Availability, Redundancy, etc.
Syamsul Bahrin Abdul Hamid

MCT Dept. IIUM
CprE 545: Fault Tolerant Systems (G. Manimaran) 2

Ideas Ahead
WHY?
Definitions
• Reliability of a system is defined to be the probability
that the given system will perform its required function
under specified conditions for a specified period of
time.
• MTBF (Mean Time Between Failures): Average time a system will run between failures. The MTBF is usually expressed in hours. This metric is more useful to the user than the reliability measure.

Approach
HOW?
Approaches to increase the reliability of a system
Increasing reliability of a system
1. Load vs. Strength 1. More vs. Less

2. Commercial vs. Military
3. Quality vs. Cost

Reliability expressions
• Exponential Failure Law:
• Reliability of a system is often modeled as:
– R(t) = exp(-λt)
• where λ is the failure rate expressed as
percentage failures per 1000 hours or as failures
per hour.
– When the product λt is small (λt << 1),
• R(t) ≈ 1 - λt
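The exponential failure law and its small-λt approximation can be compared numerically. The following is a minimal sketch; the function names and the failure rate of 10^-4 per hour are illustrative assumptions, not values from the slides.

```python
import math

def reliability(failure_rate: float, t: float) -> float:
    """Exponential failure law: R(t) = exp(-lambda * t)."""
    return math.exp(-failure_rate * t)

def reliability_linear(failure_rate: float, t: float) -> float:
    """First-order approximation R(t) ~ 1 - lambda * t (only for small lambda * t)."""
    return 1.0 - failure_rate * t

lam = 1e-4  # assumed failure rate in failures per hour (illustrative)
for t in (10, 100, 1000, 10000):
    exact = reliability(lam, t)
    approx = reliability_linear(lam, t)
    # The gap between the two grows as lambda * t grows.
    print(f"t={t:>6} h  lambda*t={lam * t:<6g}  exact={exact:.4f}  linear={approx:.4f}")
```

The approximation is close while λt is well below 1 and breaks down as λt approaches 1, where the linear form reaches zero but the exponential does not.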

Relation between MTBF and the Failure rate
• MTBF is the average time a system will run between failures and is given by:
– $\mathrm{MTBF} = \int_0^\infty R(t)\,dt = \int_0^\infty e^{-\lambda t}\,dt = 1/\lambda$
– In other words, the MTBF of a system is the reciprocal of the failure rate.
– If λ is the number of failures per hour, the MTBF is expressed in hours.

A simple example
• A system has 4000 components, each with a failure rate of 0.02% per 1000 hours. Calculate λ and MTBF.
• λ = (0.02 / 100) * (1 / 1000) * 4000 = 8 * 10^-4 failures/hour
• MTBF = 1 / (8 * 10^-4) = 1250 hours
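The arithmetic of this example can be checked with a short script. This is a sketch under the same assumption the slide makes implicitly: the component failure rates simply add up, so any single component failure counts as a system failure. The function names are hypothetical.

```python
def percent_per_1000h_to_per_hour(percent: float) -> float:
    """Convert a failure rate given in % per 1000 hours to failures per hour."""
    return percent / 100.0 / 1000.0

def system_failure_rate(n_components: int, component_rate: float) -> float:
    """Assumes identical components whose failure rates add up."""
    return n_components * component_rate

component_rate = percent_per_1000h_to_per_hour(0.02)  # 2e-7 per hour
lam = system_failure_rate(4000, component_rate)       # about 8e-4 per hour
mtbf_hours = 1.0 / lam                                 # about 1250 hours

print(f"lambda = {lam:.1e} failures/hour, MTBF = {mtbf_hours:.0f} hours")
```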

Relation between Reliability and MTBF
• For small λt, R(t) ≈ (1 – λt) = (1 – t / MTBF)

• Therefore,
– MTBF ≈ t / (1 – R(t))
[Figure: reliability R(t), from 0 to 1.0, plotted against time t (0, 1 MTBF, 2 MTBF); the value 0.36 is marked on the curve.]

An example
• A first generation computer contains 10000 components, each with λ = 0.5%/(1000 hours). What is the period of 99% reliability?
• MTBF = t / (1 – R(t)) = t / (1 – 0.99)
– t = MTBF * 0.01 = 0.01 / λav
– Where λav is the average failure rate
– N = No. of components = 10000
– λ = failure rate of a component
• = 0.5% / (1000 hours) = 0.005/1000 = 5 * 10^-6 per hour
• Therefore, λav = N λ = 10000 * 5 * 10^-6 = 5 * 10^-2 per hour
• Therefore, t = 0.01 / (5 * 10^-2) = 0.2 hours = 12 minutes
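The same calculation can be sketched in code, together with the exact exponential form for comparison. The function names are hypothetical, and the exact form t = -ln(R) / λ is added here as a cross-check; it does not appear on the slide.

```python
import math

def time_to_reliability_linear(target_r: float, failure_rate: float) -> float:
    """Solve R = 1 - lambda * t for t (the approximation used in the example)."""
    return (1.0 - target_r) / failure_rate

def time_to_reliability_exact(target_r: float, failure_rate: float) -> float:
    """Solve R = exp(-lambda * t) for t."""
    return -math.log(target_r) / failure_rate

component_rate = 0.5 / 100.0 / 1000.0   # 0.5% per 1000 hours -> 5e-6 per hour
system_rate = 10000 * component_rate     # 5e-2 per hour

t_linear = time_to_reliability_linear(0.99, system_rate)
t_exact = time_to_reliability_exact(0.99, system_rate)

print(f"linear approximation: {t_linear * 60:.2f} minutes")
print(f"exact exponential:    {t_exact * 60:.2f} minutes")
```

At 99% reliability the product λt is only 0.01, so the two results differ by well under a minute.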

Maintainability
• Maintainability of a system is the probability of
isolating and repairing a “fault” in the system within a
given time.
• Maintainability is given by:
– M(t) = 1 – exp(-µt)
– Where µ is the repair rate
– And t is the permissible time constraint for the
maintenance action
– µ = 1/(Mean Time To Repair) = 1/MTTR
– M(t) = 1 – exp(-t/MTTR)
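A small sketch of the maintainability formula follows. The function names and the MTTR value of 0.5 hours are illustrative assumptions; the inverse function is simply the formula above solved for t.

```python
import math

def maintainability(t: float, mttr: float) -> float:
    """M(t) = 1 - exp(-t / MTTR): probability that a repair completes within time t."""
    return 1.0 - math.exp(-t / mttr)

def time_for_maintainability(target_m: float, mttr: float) -> float:
    """Invert M(t): the time allowance needed to reach a target probability."""
    return -mttr * math.log(1.0 - target_m)

mttr = 0.5  # assumed mean time to repair in hours (illustrative)
for t in (0.25, 0.5, 1.0, 2.0):
    print(f"t = {t:4.2f} h  M(t) = {maintainability(t, mttr):.3f}")

print(f"time for 90% maintainability: {time_for_maintainability(0.90, mttr):.2f} h")
```

Note that allowing exactly one MTTR gives a completion probability of 1 - 1/e, roughly 63%, not 100%.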

Availability
• Availability of a system is the probability that the system will be
functioning according to expectations at any time during its
scheduled working period.
• Availability = System up-time / (System up-time + System down-time)
• System down-time = No. of failures * MTTR
• System down-time = System up-time * λ * MTTR
• Therefore,
– Availability = System up-time / (System up-time + (System up-time * λ * MTTR))
• = 1 / (1 + (λ * MTTR))
– Since λ = 1 / MTBF, Availability = MTBF / (MTBF + MTTR)
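The two forms of the availability expression can be checked against each other. This is a sketch only; the MTBF of 1250 hours is taken from the earlier simple example, while the MTTR of 2 hours is an assumed value for illustration.

```python
def availability_from_mtbf(mtbf: float, mttr: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

def availability_from_rate(failure_rate: float, mttr: float) -> float:
    """Availability = 1 / (1 + lambda * MTTR)."""
    return 1.0 / (1.0 + failure_rate * mttr)

mtbf = 1250.0  # hours, from the earlier example
mttr = 2.0     # hours, assumed for illustration

a1 = availability_from_mtbf(mtbf, mttr)
a2 = availability_from_rate(1.0 / mtbf, mttr)
print(f"availability = {a1:.4%} (MTBF form), {a2:.4%} (failure-rate form)")
```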

Assembly Line
ASSEMBLY ANYONE?
1. Read the instructions
2. Design a process flow
3. Ask the general supervisor (front desk) if required
4. Record all times and activities
5. Reflect on what happened

1. Defining Level of Maintenance
Maintenance can be performed in 2 ways:
a. By Default
Equipment is repaired as it fails, usually on an emergency basis.
b. By Plan
There has been forethought as to what level of maintenance is required.
The Overall View Of Maintenance
We make money only when the equipment is running
[Diagram: "Equipment Fully Functional (Making RM)" in the centre, with "Scheduled Takedown / Maintenance / Scheduled Repair" on one side and "Unscheduled Breakdown / Maintenance / Unscheduled Repair" on the other.]
The question is not IF but WHEN we do the maintenance
1. Reactive Maintenance
• This is maintenance done on an emergency basis.
• This type of maintenance incurs very high costs: spare parts must be kept in stock, unscheduled work requires overtime, and there are unscheduled productivity and production losses when parts are not available.
• There are 2 major types of emergencies:
• Real
• Contrived
• Real emergencies can be broken down further as follows:
- Unforeseeable
- Foreseeable
  - Unrecognized
  - Recognized
    - Unreported
    - Reported, not acted upon
• Real Emergencies
• Due to something which is broken and needs to be fixed.
• Either production or manufacturing output is compromised, or a grave safety or environmental condition exists.
• Real emergencies are either foreseeable or unforeseeable.
• Unforeseeable: there was no reasonable or economical way that the problem could have been detected before it became an emergency.
• Foreseeable: consists of 2 types, unrecognized and recognized.
• Unrecognized (but foreseeable): conditions which could lead to an emergency represent a situation that requires management direction or emphasis, and sometimes training. Workers who do not recognize a potential emergency must be trained to see the small problems that will grow into big problems.
• Recognized problems are of 2 types: reported and unreported.
• Unreported (but recognized): these emergencies represent a definite problem for management. Knowledgeable people do not feel compelled to report problems because they feel that no action will be taken even if they do.
• Reported (and recognized) but not acted upon: the worst kind of emergency. It is the sole responsibility of the maintenance department and creates credibility issues. This kind of issue creates another form of emergency, the contrived emergency.
• Contrived Emergencies
• Not really emergencies. An abuse of the priority system.
• They come about when the originator feels that the job would not be started unless given high priority.
• Emergencies are expensive, so it is important that no reported pre-emergency conditions are ignored.
• Building credibility takes time, but it is the only way that contrived emergencies will be eliminated.
Moving Towards Maintenance By Plan
From Reactive Maintenance To Planned Maintenance

Reactive Maintenance | Planned Maintenance
No tactics, run to failure | Pre-planned maintenance tactics
Effective emergency response | Effective balance between reactive/preventive/predictive
Hectic work environment | Controlled work environment
Maintenance By Plan
• Value of maintenance by plan
• Without planning and scheduling, resources are in reactive mode, which is only 30-40% effective
• “Loading” resources by planning and scheduling on a daily basis increases the effectiveness by more than 60%
• Significant improvement in OEE
• Reduces maintenance cost by up to 40%
• Reduces usage of parts by up to 66% compared to emergency mode
• Reduces human stress
Maintenance By Plan
• Maintenance by Plan can be broken down as follows:
• Preventive Maintenance
• Predictive Maintenance
• Proactive Maintenance
1. Preventive Maintenance (PM)
• Everybody in a maintenance team has heard of this category of planned maintenance. Preventive maintenance has various benefits, among them:
• Reduced cost of maintenance
• Increased uptime of production equipment
• Higher worker productivity (while performing PM procedures)
• What do we mean by PM?
• Basic maintenance performed on machinery or facilities at established intervals or frequency.
• PM Procedures
• A good PM procedure should incorporate the following elements:
1. A list of tools, parts or instruments required to perform the PM, at the beginning of every procedure
2. A form for recording measurements and readings
3. A limit or range of values to indicate whether the measurement is normal
4. Safety considerations such as “lock out, tag out” or “hot work” procedures
5. A data form that ensures the data are actually taken and that the technician is really on site
6. Advance arrangement of any shutdown required for the PM, to ensure that the technician is given enough time to do the PM properly
• PM Sample
• PM Sample results
2. Predictive Maintenance (PDM)
A successful PDM program relies on a dedicated effort to detect, analyze and correct problems before failure occurs.
PDM Cycle
[Flowchart: Periodic Monitoring → Measurement → Exceed Engineering Limit? If No, return to Periodic Monitoring. If Yes: Analyze Problem → Write Corrective Work Order → Repair Equipment → return to Periodic Monitoring.]
• Once a piece of equipment has been added to the program and baselined, it enters the PDM cycle. The established parameters are measured periodically (weekly, bi-weekly, monthly, etc).
• If the measurement exceeds the established engineering limit, it must be analyzed further.
• Analysis can take many forms. For example, a vibration signature can be taken on rotating equipment. A trained analyst may review the signature for common problems such as misalignment and imbalance, as well as for less common problems like resonance.
• Once the source of the problem is determined, the best repair activity can be chosen. If the engineering limit is set low enough, there will still be plenty of time to correct the problem before further damage occurs. A work request is usually written to start the repair process.
• Correction of the root problem allows the equipment to re-enter the periodic monitoring program.
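The decision logic of the PDM cycle can be sketched as a small function. This is an illustration only: the `Reading` class, the step names and the example values are hypothetical, and a real program would involve an analyst and a work order system rather than a list of strings.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Reading:
    equipment: str
    value: float  # measured parameter, e.g. vibration velocity
    limit: float  # established engineering limit for that parameter

def pdm_cycle(reading: Reading) -> List[str]:
    """Return the steps one pass through the PDM cycle would take."""
    steps = ["periodic monitoring", "measurement"]
    if reading.value > reading.limit:
        # Limit exceeded: analyze, raise a work order, then repair.
        steps += ["analyze problem", "write corrective work order", "repair equipment"]
    # In both branches the equipment returns to periodic monitoring.
    steps.append("return to periodic monitoring")
    return steps

print(pdm_cycle(Reading("pump", value=0.10, limit=0.30)))
print(pdm_cycle(Reading("blower", value=0.45, limit=0.30)))
```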
The spectrum of PDM
• There has been a historical misconception that equipment failures cannot be predicted.
• With predictive technology, a vast number of equipment failures can be predicted.
• Next we look at the various detection methods for PDM.
The spectrum of PDM
Spectrum of Predictive Maintenance

Equipment Category | Equipment Types | Failure Mode | Failure Cause | Detection Method
Rotating Machinery | Pumps, Motors, Compressors, Blowers | Premature Bearing Loss | Excessive Force | Vibration and Lube Analysis
Rotating Machinery | Pumps, Motors, Compressors, Blowers | Lubrication Failure | Over, Under, or Improper Lube; Heat and Moisture | Spectrographic & Ferrographic Analysis
Electrical Equipment | Motors, Cables, Starters, Transformers | Insulation Failure | Heat, Moisture | Time / Resistance Test, IR Scans, Oil Analysis
Electrical Equipment | Motors, Cables, Starters, Transformers | Corona Discharge | Moisture, Splice Method | Ultrasound
Heat Transfer Equipment | Exchangers, Condensers | Fouling | Sediment / Material Build Up | Heat Transfer Calculations
Containment and Transfer Equipment | Tanks, Piping, Reactors | Corrosion | Chemical Attack | Corrosion Meters, Thickness Checks
Containment and Transfer Equipment | Tanks, Piping, Reactors | Stress Cracks | Metal Fatigue | Acoustic Emission
The mortality of machinery
Finding The Parameters
• Failures that form the latter part of the curve are caused by identifiable physical phenomena.
• Depending on machine complexity, there may be several aging processes at work in a single piece of equipment, any of which may cause ultimate failure.
• These processes are usually related to the basic physics of the materials and how the machine is used.
Finding The Parameters (cont)
• Knowledge of the physical properties of materials comes from either theoretical or empirically derived conclusions.
• To understand how failure can be predicted, the mortality of machinery and the finding of parameters need to be understood.
• Example 1:
• Ohm’s Law and the theory of potential differences may define the nature of electrical current in conductors and insulators.
• Example 2:
• In addition, many parameters used to predict failures follow from empirical studies and the application of statistical analysis to actual failures.
• Experiments in the 1920s by Arvid Palmgren helped predict the life of bearings under various load conditions.
• He came up with a formula for this.
• Example 3:
• Further experiments in the 1930s showed that measuring the total movement of the machine during operation, and the speed of this movement, could essentially accomplish the measurement of forces on the bearings. This movement is called vibration.
• Thus, forces on a bearing can be determined by measuring vibration at or near the bearing.
Defining Limits
• Measurement of a physical parameter is not enough to detect the destructive effects on a machine or process.
• It is important to establish a limit, or a rate of change in the parameter, that may be excessive or damaging.
• One method of developing a limit requires that a number of failures be observed before a safe limit is established.
• Good PDM requires that the limit be tested at the same time as the monitoring of other factors on a device.
Defining Limits (cont)
• If time permits, the device is to be taken out of service and inspected for the defect or failure in question
• Ideally the limit will be set at a measurement value just below the point corresponding to the first discovery of irreparable or costly defects
• Various engineering limits have been established by manufacturers, professional bodies and industrial groups.
• A vibration institute has established levels of equipment health as a function of vibration velocity, based on experiments.
• Useful for categorizing the vibration level of equipment operating between 600 rpm and 3600 rpm
• To determine whether equipment has issues or not, we could start by identifying equipment with vibration velocity > 0.30 ips (inches per second)
• A further breakdown can be done if the number of equipment items is still large.
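The first-pass screening against the 0.30 ips limit can be sketched as a simple filter. The equipment names and readings below are made-up illustrative data, and the function name is hypothetical; only the 0.30 ips threshold comes from the slide.

```python
SCREENING_LIMIT_IPS = 0.30  # vibration velocity threshold from the slide

# Hypothetical survey data: equipment name -> overall vibration velocity (ips)
readings = {
    "pump A": 0.12,
    "blower B": 0.34,
    "compressor C": 0.51,
    "motor D": 0.29,
}

def flag_for_analysis(readings, limit=SCREENING_LIMIT_IPS):
    """Return (name, velocity) pairs above the limit, worst first."""
    flagged = [(name, v) for name, v in readings.items() if v > limit]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

for name, velocity in flag_for_analysis(readings):
    print(f"{name}: {velocity:.2f} ips exceeds {SCREENING_LIMIT_IPS:.2f} ips")
```

Sorting worst first is one way to do the further breakdown when the flagged list is still long.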
The 4Ts of Correction
• The correction phase of PDM is one of the most important parts of the program
• Too often we happily spend too much time collecting and analyzing data, without getting involved in correcting the problem
• The issue with correcting problems at most plants lies in the employee/machine interface.
• The employee needs 4 things: time, target, tools and training
The 4Ts of Correction (cont)
• Time: employees must be given time to do the work. A good alignment procedure may take 2-12 hours; schedule a shutdown to ensure the work is done at its best
• Target: employees need to know the target to be achieved. For example, in the case of alignment, what is the required target for a good alignment: +/- 0.02 mm?
• Tools: the tools to perform the job must be available to the employees. Too often employees are required to perform their job using specialized or high precision tools which are either not available or in a locked closet somewhere.
The 4Ts of Correction (cont)
• Training: employees need training in the skills and methods required for common repairs derived from predictive maintenance.
3. Proactive Maintenance (PAM)
• Proactive maintenance is a maintenance strategy for
stabilizing the reliability of machines or equipment.
• Its central theme involves directing corrective actions
aimed at failure root causes, not active failure symptoms,
faults, or machine wear conditions.
3. Proactive Maintenance (PAM) (cont)
• A typical proactive maintenance regimen involves three steps:
1. setting a quantifiable target or standard relating to a root
cause of concern (e.g., a target fluid cleanliness level for
a lubricant),
2. implementing a maintenance program to control the root
cause property to within the target level (e.g., routine
exclusion or removal of contaminants), and
3. routine monitoring of the root cause property using a
measurement technique (e.g., particle counting) to verify
the current level is within the target.
Comparison Between 4 Types Of Maintenance

Maintenance Strategy | Technique Needed | Human Body Parallel
Proactive Maintenance | Monitoring and correction of failing root causes | Cholesterol and blood pressure
Predictive Maintenance | Monitoring of vibration, heat, alignment, wear debris | Detection of heart disease using EKG or ultrasonics
Preventive Maintenance | Periodic component replacement | By-pass or transplant surgery
Reactive Maintenance | Large maintenance budget | Heart attack or stroke
Criticality And Priority
• Where to start?
• While all equipment is important, some is more important than the rest
• In manufacturing, criticality can only be measured in terms of its impact on the entire production system
• Criticality depends on many factors, and generally requires some analysis to identify.
• Factors include:
• Capacity
• Reliability
• Safety and environmental considerations
• Process capability
Criticality And Priority
• Criticality Factors Definitions
• Control points: the machines which have the least capacity in the process, as covered in Capability Planning and OEE courses.
• Process capability: the ability of the process to produce parts that conform to engineering specifications (CpK), as covered in Statistical Process Control (SPC) courses.
• Reliability: the probability that a product performs its intended function for a specified length of time (MTBF)
• Maintainability: the ease and/or cost of maintaining or repairing products (MTTR), and system availability
Criticality Dictates Maintenance
Possible Tactics
2. The Management Metric
A. MTBF/MTBR/MTTF/MTTR
Example:
In a steel manufacturing plant, the plant has been divided into 3 sections: foundry, casting and finishing. Find the MTBF for the foundry section, assuming these data are for the foundry alone.
No. of emergency work orders in the past 24 hours = 10
Total equipment failures = 4
Functional equipment failures = 6
(Do not worry about the exact definition of each type of failure. An emergency work order needs to be written any time an asset has a problem and a maintenance person is called to the asset to investigate or make a repair.)
Time: 24 hours
Calculation:
MTBF = 24 hours / 10 emergency work orders = 2.4 hours
Example: Find the MTTR for the foundry:
Total repair time (including wait time) = 4 hrs
Total emergency work orders = 10
Calculation:
MTTR = 4 hours / 10 emergency work orders = 0.4 hours
= 0.4 hrs x 60 min/hr = 24 min
MTTR is always reported in minutes.
Example:
MTTF/MTBR = MTBF – MTTR
= 2.4 hrs – 0.4 hrs
= 2.0 hrs
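The three foundry calculations can be reproduced in a few lines. This is a minimal sketch; the function names are hypothetical, and it follows the definitions used in these examples, where MTBF is the observation period divided by the number of emergency work orders.

```python
def mean_time_between_failures(period_hours: float, failures: int) -> float:
    """MTBF = observation period / number of failures (emergency work orders)."""
    return period_hours / failures

def mean_time_to_repair(total_repair_hours: float, failures: int) -> float:
    """MTTR = total repair time (including wait time) / number of failures."""
    return total_repair_hours / failures

mtbf = mean_time_between_failures(24, 10)  # 2.4 hours
mttr = mean_time_to_repair(4, 10)          # 0.4 hours
mttf = mtbf - mttr                         # 2.0 hours (MTTF/MTBR)

print(f"MTBF = {mtbf:.1f} h, MTTR = {mttr * 60:.0f} min, MTTF/MTBR = {mttf:.1f} h")
```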
Example:
Time: 5 days
Previous MTBF: Day 1 = 3 hours
Current MTBF: Day 5 = 4.2 hours
Calculation:
MTBF % change = 4.2 hours (current MTBF on Day 5) / 3 hours (MTBF on Day 1)
= 1.4, i.e. a 40% increase
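The ratio and the percentage change are two views of the same comparison, which a short sketch makes explicit. The function names are hypothetical.

```python
def mtbf_ratio(current: float, previous: float) -> float:
    """Ratio of current to previous MTBF (1.0 means no change)."""
    return current / previous

def mtbf_percent_change(current: float, previous: float) -> float:
    """Percentage change relative to the previous MTBF."""
    return (current - previous) / previous * 100.0

ratio = mtbf_ratio(4.2, 3.0)
change = mtbf_percent_change(4.2, 3.0)
print(f"ratio = {ratio:.1f}, change = {change:+.0f}%")
```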
CLASS DISCUSSION ON:
PRESENTATION BY 2 GROUPS:
(ORDER WILL BE RANDOM)
https://www.random.org/integers/?mode=advanced
*** 10 MIN EACH ***
This will be in the next class:

Exercise 5
The table below shows failure data obtained from an assembly plant making windshields, in hours of running.
TBF: Time Between Failures
TTF: Time To Failure
TTR: Time To Repair

Event | Clock (hrs) | TBF | TTF | TTR
Base line | 1400 | | |
Fails | 1420 | | |
Restarts | 1425 | | | 5
Fails | 1654 | 234 | 229 |
Restarts | 1663 | | | 9
Fails | 1897 | 243 | 234 |
Restarts | 1904 | | | 7
Fails | 2010 | 113 | 106 |
Restarts | 2022 | | | 12
Fails | 2312 | 302 | 290 |
Restarts | 2321 | | | 9
Fails | 2498 | 186 | 177 |
Restarts | 2508 | | | 10
Fails | 2690 | 192 | 182 |
Restarts | 2703 | | | 13
Today | 2900 | | |

Calculate the following from the data: MTBF, MTTR, MTTF / MTBR, Availability
Exercise 5 : (cont)
MTBF = hours
MTTR = hours
MTTF/MTBR= hours
Availability = %
Unavailability =
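As a starting point for the exercise, the TBF, TTF and TTR columns can be derived from the fail/restart clock readings. This is a sketch only; the function names are hypothetical, and it deliberately leaves out the base line and "Today" rows, so how to treat those intervals remains a decision for the reader.

```python
from statistics import mean

# (event, clock reading in hours), copied from the exercise table
LOG = [
    ("fail", 1420), ("restart", 1425), ("fail", 1654), ("restart", 1663),
    ("fail", 1897), ("restart", 1904), ("fail", 2010), ("restart", 2022),
    ("fail", 2312), ("restart", 2321), ("fail", 2498), ("restart", 2508),
    ("fail", 2690), ("restart", 2703),
]

def derive_columns(log):
    """Rebuild the TBF, TTF and TTR columns from the event log."""
    tbf, ttf, ttr = [], [], []
    last_fail = last_restart = None
    for event, clock in log:
        if event == "fail":
            if last_fail is not None:
                tbf.append(clock - last_fail)      # failure to next failure
            if last_restart is not None:
                ttf.append(clock - last_restart)   # restart to next failure
            last_fail = clock
        else:
            ttr.append(clock - last_fail)          # failure to restart
            last_restart = clock
    return tbf, ttf, ttr

def summarize(log):
    """Mean of each column; the keys follow the exercise's terminology."""
    tbf, ttf, ttr = derive_columns(log)
    return {"MTBF": mean(tbf), "MTTF": mean(ttf), "MTTR": mean(ttr)}

tbf, ttf, ttr = derive_columns(LOG)
print("TBF:", tbf)
print("TTF:", ttf)
print("TTR:", ttr)
```

Calling `summarize(LOG)` gives the three means, from which availability and unavailability follow using the formulas given earlier.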
