When you build models in statistics, you will usually test them to make sure they match real-world situations. The residual is a number that helps you determine how close your theorized model is to the phenomenon in the real world. Residuals are not too hard to understand: They are just numbers that represent how far away a data point is from what it “should be” according to the model's prediction.
Mathematically, a residual is the difference between an observed data point and the expected (or estimated) value for what that data point should have been. The formula for a residual is R = O - E, where “O” means the observed value and “E” means the expected value. This means that positive values of R show observations higher than expected, whereas negative values show observations lower than expected. For example, you might have a statistical model that says when a man’s weight is 140 pounds, his height should be 6 feet, or 72 inches. When you go out and collect data, you might find someone who weighs 140 pounds but is 5 feet 9 inches, or 69 inches. The residual is then 69 inches minus 72 inches, giving you a value of negative 3 inches. In other words, the observed data point is 3 inches below the expected value.
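The formula R = O - E can be written as a very small Python function. This is only an illustrative sketch: the function name residual and its parameter names are chosen here for the example and do not come from any particular library. The numbers are the height example above.

```python
def residual(observed, expected):
    """Return the residual R = O - E."""
    return observed - expected


# Height example from the text: the model expects 72 inches,
# and the person measured is 69 inches tall.
r = residual(observed=69, expected=72)
print(r)  # -3, i.e. 3 inches below the expected value
```

A negative result means the observation fell below the model's expectation, and a positive result means it fell above.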
Residuals are especially useful when you want to check whether your theorized model works in the real world. When you create a model and calculate its expected values, you are theorizing. But when you collect data, you might find that the data don't match the model. One way to detect this mismatch between your model and the real world is to calculate residuals. For example, if your residuals are consistently far from zero, meaning your observations sit far from the estimated values, your model might not have a strong underlying theory. An easy way to use residuals in this way is to plot them.
When you calculate the residuals, you end up with a list of numbers, which is hard to interpret by eye. Plotting the residuals can often reveal patterns that help you determine whether the model is a good fit. Two properties are worth checking in a residual plot. First, residuals for a good model should be scattered on both sides of zero. That is, the plot should have about the same number of negative residuals as positive residuals. Second, residuals should appear to be random. If you see structure in your residual plot, such as a clear linear or curved trend, your original model could have an error.
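The two checks described above can also be approximated numerically. The sketch below is a rough illustration under stated assumptions, not a substitute for looking at the plot: the helper names sign_counts and pearson are hypothetical, the data are made up, and the correlation check is a technique swapped in here that can only detect a linear trend in the residuals. A curved pattern can have a correlation near zero and still indicate a problem.

```python
def sign_counts(residuals):
    """Count positive and negative residuals (zeros are ignored)."""
    positives = sum(1 for r in residuals if r > 0)
    negatives = sum(1 for r in residuals if r < 0)
    return positives, negatives


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: predictor values and two sets of residuals.
x = [1, 2, 3, 4, 5, 6]
scattered = [0.5, -0.4, 0.3, -0.6, 0.4, -0.2]  # no obvious pattern
trending = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]   # rises steadily with x

# Check 1: roughly equal numbers of positive and negative residuals.
print(sign_counts(scattered))
print(sign_counts(trending))

# Check 2: a correlation near 0 suggests no linear trend in the residuals;
# a value near +1 or -1 suggests the model is missing something.
print(pearson(x, scattered))
print(pearson(x, trending))
```

Note that the trending residuals pass the first check, with three on each side of zero, yet fail the second. This is why both properties matter.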
Special Residuals: Outliers
Outliers, or residuals of extremely large magnitude (positive or negative), appear unusually far away from the other points on your plot of residuals. When you find a residual that is an outlier in your data set, you must think carefully about it. Some scientists recommend removing outliers because they are “anomalies” or special cases. Others recommend further investigation as to why you have such a large residual. For example, you might be making a model of how stress affects school grades and theorize that more stress usually means worse grades. If your data show this to be true except for one person, who has very low stress and very low grades, you might ask yourself why. Such a person might simply not care about anything, including school, explaining the large residual. In this case, you might consider taking that data point out of your data set because you want to model only students who care about school.
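One simple way to flag candidate outliers for closer inspection is to look for residuals that sit more than a chosen number of standard deviations from the mean residual. The sketch below assumes a cutoff of 2 standard deviations, which is a common rule of thumb rather than anything prescribed above. The function name flag_outliers and the data are hypothetical. Flagging a point only marks it for investigation, since whether to remove it remains a judgment call.

```python
import statistics


def flag_outliers(residuals, threshold=2.0):
    """Return the indices of residuals lying more than `threshold`
    sample standard deviations from the mean residual.

    The default threshold of 2.0 is an assumed rule of thumb.
    """
    mean = statistics.mean(residuals)
    spread = statistics.stdev(residuals)
    return [
        i for i, r in enumerate(residuals)
        if abs(r - mean) > threshold * spread
    ]


# Hypothetical residuals: most are small, one is very large and negative.
residuals = [0.5, -0.3, 0.2, -0.4, 0.1, -0.2, 0.3, -6.0]

flagged = flag_outliers(residuals)
print(flagged)  # [7], the position of the -6.0 residual
```

In a small data set a single extreme residual inflates the standard deviation itself, so this rule can miss outliers. A plot of the residuals is still the more reliable first look.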