Linear regression is a statistical method for examining the relationship between a dependent variable, denoted y, and one or more independent variables, denoted x. The dependent variable must be continuous, meaning it can take on any value within a range, or at least close to continuous. The independent variables can be of any type. Linear regression cannot show causation by itself, although in most applications the independent variables are thought to affect the dependent variable.
Linear Regression Is Limited to Linear Relationships
By its nature, linear regression only looks at linear relationships between the dependent and independent variables. That is, it assumes a straight-line relationship between them. Sometimes this assumption is incorrect. For example, the relationship between income and age is curved: income tends to rise in early adulthood, flatten out in later adulthood and decline after people retire. You can tell whether this is a problem by plotting the data, for example with a scatterplot of the dependent variable against each independent variable.
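Besides plotting, the residuals of a straight-line fit can reveal curvature. The following is a minimal sketch in plain Python; the age and income figures are made up for illustration (a symmetric curve peaking at age 50), and the helper name fit_line is an assumption, not part of any library.

```python
# Hypothetical data: income (in thousands) rises, peaks at age 50, then declines.
ages = list(range(20, 81, 5))
incomes = [60 - 0.05 * (a - 50) ** 2 for a in ages]

def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = fit_line(ages, incomes)
residuals = [y - (intercept + slope * a) for a, y in zip(ages, incomes)]

# Residuals that are negative at both ends and positive in the middle
# indicate a curved relationship that a straight line cannot capture.
print(f"slope = {slope:.4f}")
for a, r in zip(ages, residuals):
    print(f"age {a:2d}  residual {r:+7.2f}")
```

In this constructed example the fitted slope is essentially zero even though income clearly depends on age, which is the kind of misleading result a straight-line fit can give for a curved relationship.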
Linear Regression Only Looks at the Mean of the Dependent Variable
Linear regression looks at the relationship between the mean of the dependent variable and the independent variables. For example, if you look at the relationship between the birth weight of infants and maternal characteristics such as age, linear regression will look at the average weight of babies born to mothers of different ages. However, sometimes you need to look at the extremes of the dependent variable. For example, babies are at risk when their weights are low, so here the low end of the weight distribution matters more than the average.
Just as the mean is not a complete description of a single variable, linear regression is not a complete description of relationships among variables. You can deal with this problem by using quantile regression, which models a chosen quantile of the dependent variable (such as the 10th percentile) rather than its mean.
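To make the idea of quantile regression concrete, here is a minimal sketch for a single predictor in plain Python. It is not how statistical software does it: it brute-forces the line through every pair of points and keeps the one with the smallest "pinball" loss, relying on the fact that, for one predictor, some best-fitting quantile line passes through at least two data points. The data and the function names (pinball_loss, total_loss, quantile_line) are hypothetical.

```python
from itertools import combinations

def pinball_loss(residual, q):
    """Asymmetric absolute loss whose minimizer is the q-th quantile."""
    return q * residual if residual >= 0 else (q - 1) * residual

def total_loss(x, y, intercept, slope, q):
    """Sum of pinball losses of a candidate line over all points."""
    return sum(pinball_loss(yk - (intercept + slope * xk), q)
               for xk, yk in zip(x, y))

def quantile_line(x, y, q):
    """Brute-force quantile regression; returns (intercept, slope).

    Tries the line through every pair of points, so it is only
    practical for very small data sets.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # skip vertical lines
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = total_loss(x, y, intercept, slope, q)
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]

# Hypothetical data: maternal age (years) and birth weight (grams).
mother_age = [18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40]
birth_weight = [3100, 3350, 2900, 3400, 3250, 3500,
                2700, 3450, 3600, 2500, 3550, 3300]

median_fit = quantile_line(mother_age, birth_weight, 0.5)
low_fit = quantile_line(mother_age, birth_weight, 0.1)
print("median line (intercept, slope):", median_fit)
print("10th percentile line (intercept, slope):", low_fit)
```

Comparing the two fitted lines shows whether the low end of the birth weight distribution changes with maternal age differently from the middle, which a regression on the mean would not reveal.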
Linear Regression Is Sensitive to Outliers
Outliers are data points that differ markedly from the rest of the data. Outliers can be univariate (based on one variable) or multivariate (based on a combination of variables). If you are looking at age and income, univariate outliers would be things like a person who is 118 years old, or one who made $12 million last year. A multivariate outlier would be an 18-year-old who made $200,000. In this case, neither the age nor the income is very extreme on its own, but the combination is: very few 18-year-olds make that much money.
Outliers can have a large effect on the fitted regression line. You can detect this problem by requesting influence statistics, such as leverage and Cook's distance, from your statistical software.
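As an illustration of one such influence statistic, the sketch below computes Cook's distance for a regression with a single predictor, using the standard formulas for leverage and residual variance. The data are invented (income in thousands of dollars, with the last point playing the role of the 18-year-old who made $200,000), and the function name cooks_distances is hypothetical; real statistical packages provide their own, more general routines.

```python
def cooks_distances(x, y):
    """Cook's distance for each point in a one-predictor least squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    p = 2  # estimated parameters: intercept and slope
    mse = sum(r ** 2 for r in residuals) / (n - p)

    distances = []
    for xi, r in zip(x, residuals):
        leverage = 1 / n + (xi - mean_x) ** 2 / sxx
        distances.append(r ** 2 / (p * mse) * leverage / (1 - leverage) ** 2)
    return distances

# Hypothetical data; the last point is the multivariate outlier.
ages = [18, 25, 30, 35, 40, 45, 50, 55, 60, 18]
incomes = [20, 35, 45, 52, 60, 66, 70, 72, 75, 200]

distances = cooks_distances(ages, incomes)
for age, income, d in zip(ages, incomes, distances):
    print(f"age {age:2d}  income {income:3d}  Cook's distance {d:.3f}")
```

The point with the largest Cook's distance is the one whose removal would change the fitted line the most, so it is the first candidate to inspect.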
Data Must Be Independent
Linear regression assumes that the data are independent. That means that the scores of one subject (such as a person) have nothing to do with those of another. This assumption is often, but not always, sensible. Two common cases where it fails are clustering in space and clustering in time.
A classic example of clustering in space is student test scores, when you have students from various classes, grades, schools and school districts. Students in the same class tend to be similar in many ways: they often come from the same neighborhoods, have the same teachers and so on. Thus, they are not independent.
Examples of clustering in time are any studies where you measure the same subjects multiple times. For example, in a study of diet and weight, you might measure each person multiple times. These data are not independent because what a person weighs on one occasion is related to what he or she weighs on other occasions. One way to deal with non-independent data, whether clustered in space or in time, is to use multilevel models.
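Fitting a multilevel model is beyond a short example, but a rough check of how strongly data are clustered can be sketched. The code below computes the intraclass correlation from a one-way analysis of variance, which is a swapped-in diagnostic rather than a multilevel model. It assumes equally sized groups; the weights and the function name intraclass_correlation are hypothetical.

```python
def intraclass_correlation(groups):
    """One-way ANOVA estimate of the intraclass correlation.

    `groups` is a list of equally sized lists, one per cluster
    (for example, repeated measurements on one person).
    """
    k = len(groups[0])  # measurements per group
    g = len(groups)     # number of groups
    if any(len(grp) != k for grp in groups):
        raise ValueError("this sketch assumes equally sized groups")

    grand_mean = sum(sum(grp) for grp in groups) / (g * k)
    group_means = [sum(grp) / k for grp in groups]

    ms_between = k * sum((m - grand_mean) ** 2 for m in group_means) / (g - 1)
    ms_within = sum((v - m) ** 2
                    for grp, m in zip(groups, group_means)
                    for v in grp) / (g * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical weights (kg) of three people, each measured four times.
weights = [
    [70.1, 70.4, 69.8, 70.0],
    [85.2, 84.9, 85.5, 85.0],
    [60.3, 60.0, 60.6, 60.2],
]
icc = intraclass_correlation(weights)
print(f"intraclass correlation = {icc:.3f}")
```

A value close to 1 means that measurements within the same person are far more alike than measurements from different people, so treating them as independent observations would be inappropriate.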