The MTBF, or mean time between failures, is a statistical measure used to predict the behavior of a large group of samples, or units. For example, the MTBF may be used to determine maintenance schedules, to determine how many spares should be kept on hand to compensate for failures in a group of units, or as an indicator of system reliability. To calculate MTBF, you need to know the total unit hours of testing conducted during the trial in question and the number of failures that occurred.
The formula for mean time between failures, or MTBF, is:
MTBF = T / R
where T is the total number of unit hours from the trial in question, and R is the number of failures.
An Example of Calculating MTBF
Whether you're evaluating the reliability of new software or trying to decide how many spare widgets to keep on hand in your warehouse, the process for calculating MTBF is the same.
Determine the Total Time Tested
The first metric you need is the total "unit hours" of testing that took place in your reliability study. Imagine that your subject is warehouse widgets, and that 50 of them were tested for 500 hours each. In that case, the total unit hours spent testing is:
50 widgets × 500 hours = 25,000 unit hours
Identify the Number of Failures
Next, identify the number of failures across the entire population that was tested. In this case, suppose there were 10 widget failures in total.
Divide the Number of Test Hours by the Number of Failures
You know that 25,000 total unit hours of testing took place, and there were 10 widget failures. Divide the total number of test hours by the number of failures to find the mean time between failures:
25,000 unit hours ÷ 10 failures = 2,500 unit hours per failure
So in this particular data model, the MTBF is 2,500 unit hours.
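The three steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name mtbf and its parameter names are made up for this example and do not come from any particular reliability library.

```python
def mtbf(units_tested: int, hours_per_unit: float, failures: int) -> float:
    """Return the mean time between failures, in unit hours per failure."""
    if failures <= 0:
        # With no observed failures, T / R cannot be computed.
        raise ValueError("MTBF is undefined when no failures were observed")
    total_unit_hours = units_tested * hours_per_unit  # T in the formula
    return total_unit_hours / failures                # T / R


# The warehouse widget example: 50 widgets, 500 hours each, 10 failures.
widget_mtbf = mtbf(units_tested=50, hours_per_unit=500, failures=10)
print(widget_mtbf)  # 2500.0
```

The guard against zero failures is a design choice for this sketch: a trial with no failures does not give you an MTBF from this formula, so the function refuses to return one rather than dividing by zero.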
Putting the MTBF Into Context
Before you jump into calculating a "reliability equation" like the MTBF, it's important to understand its context. The MTBF isn't meant to predict the behavior of a single unit; instead, it's meant to predict the typical results from a group of units. In the example above, your calculations aren't telling you that each widget is expected to last 2,500 hours. Instead, they're saying that if you run a group of widgets, the group will average one failure for every 2,500 unit hours of combined running time.
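To see what a group-level reading looks like in practice, here is a rough sketch that turns an MTBF into an expected failure count, which is the kind of figure you might use when deciding how many spares to stock. It assumes the failure rate stays roughly the same as it was during the trial, which may not hold for real equipment. The fleet size of 100 widgets and the 1,000 hours of running time are invented for illustration, as is the function name expected_failures.

```python
def expected_failures(units: int, hours_each: float, mtbf_hours: float) -> float:
    """Estimate failures across a group, assuming a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be a positive number of unit hours")
    total_unit_hours = units * hours_each  # combined running time of the group
    return total_unit_hours / mtbf_hours   # one failure per MTBF unit hours


# Hypothetical fleet: 100 widgets running 1,000 hours each, MTBF of 2,500.
print(expected_failures(units=100, hours_each=1000, mtbf_hours=2500.0))  # 40.0
```

The result is an average for the group as a whole. It says nothing about which widgets will fail or when any single widget will fail.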
Another Statistic: The MTTR Calculation
One of the challenges of statistics is making your statistical models echo real-world situations as precisely as possible. So your reliability calculations might also need to include the MTTR, or mean time to repair – whether for estimating downtime within your systems or budgeting personnel hours to effect said repairs.
To calculate the MTTR, divide the total time spent on repairs by the number of repairs made. So, if during your warehouse widget test your maintenance crew worked 500 person hours and made 10 repairs, you could calculate the MTTR:
500 person hours ÷ 10 repairs = 50 person hours per repair
So your MTTR is 50 person hours per repair. This isn't a prediction that every repair, or even most repairs, will take 50 person hours; in fact, actual repair times may vary quite a bit. It just tells you that when you take a step back and look at your widget population as a whole, repair times average out to that figure.
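The MTTR calculation follows the same pattern as the MTBF and can be sketched the same way. As before, the function name mttr and its parameter names are hypothetical and chosen only for this example.

```python
def mttr(total_repair_hours: float, repairs: int) -> float:
    """Return the mean time to repair, in person hours per repair."""
    if repairs <= 0:
        # Without any repairs there is nothing to average.
        raise ValueError("MTTR is undefined when no repairs were made")
    return total_repair_hours / repairs


# The warehouse widget example: 500 person hours spent on 10 repairs.
print(mttr(total_repair_hours=500, repairs=10))  # 50.0
```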