In sports, hard work in training and practice sessions is often rewarded, roughly in proportion, with high placings in competitions and games. In other words, the old-school refrain of "No pain, no gain!" rings with a lot of truth, although a more optimistic framing of the same idea is, "The harder you objectively work, the greater your level of objective success."

You could test this idea by choosing 100 distance runners at random (perhaps using an online survey to collect participants) and having them race each other over a distance of 5 kilometers (3.1 miles). You could ask them to report how many miles per week they ran on average in the three months before this test.

If you then plotted a graph of 5K speed vs. average miles per week, you would expect to see a positive **correlation** between training and performance. But would this be a "perfect" correlation? In other words, can you think of reasons to expect data points that would deviate from the predicted relationship between training volume and 5K speed?

Welcome to the world of linear regression analysis, a marvelous and usually quite interesting tool to help scrutinize and quantify relationships between apparently related variables. In addition to the example above, you can imagine countless others (e.g., rainfall vs. vegetation level; income vs. access to medical care in the U.S.) of personal and civic interest.

Read on for more than you ever expected to know about matters related to the now-famous "R-squared formula" in statistics.

## About Linear Equations

A **linear equation** is so named because it produces a straight line when graphed using x and y coordinates. It can be expressed in the form:

**y** = a + b**x**

In this scheme, a and b are constants, x is called the **independent variable**, and y is known as the **dependent variable**. Another way to state this relationship is "the variation of y with x."

What this translates to in the real world is that x is usually a variable you can control or pick in an experiment or analysis (such as the number of miles run), and y is a variable that seems to have some kind of dependency on x (such as running speed).

**Example:** Graph the equation y = 5x − 7.

In linear equations, a is known as the y-intercept. You can see from the graph that this is the value of y where the graph crosses the y-axis (here, y = −7). If a straight line does not cross the y-axis at a single point, then it is a vertical line, and its equation assumes the form x = a constant. Such a graph does not establish anything at all about y as a function of x and cannot be put in the form **y** = a + b**x**.

The constant b is called the *slope* of the line, familiarly known as "rise over run" in introductory mathematics courses. It can be positive (represented by an upward-sloping line in relation to the x- and y-axes), negative (a downward-sloping line) or 0 (a horizontal line).
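As a minimal sketch of the example above, the following Python snippet tabulates a few points on y = 5x − 7 and recovers the y-intercept and slope from them. The function name `line` and the chosen x values are illustrative assumptions, not part of the original example.

```python
def line(x, a=-7.0, b=5.0):
    """Return y = a + b*x; the defaults give the example line y = 5x - 7."""
    return a + b * x


# A handful of x values chosen only for illustration
xs = [-1, 0, 1, 2, 3]
points = [(x, line(x)) for x in xs]

# The y-intercept is the value of y when x = 0
y_intercept = line(0)

# The slope is "rise over run" between any two points on the line
(x1, y1), (x2, y2) = points[1], points[3]
slope = (y2 - y1) / (x2 - x1)

for x, y in points:
    print(f"x = {x:>2}, y = {y:>5.1f}")
print("y-intercept:", y_intercept)  # -7.0
print("slope:", slope)              # 5.0
```

Plotting these points on graph paper and joining them gives the straight line the example asks for.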

## What Is Correlation Between Variables?

Above, you were invited to consider the impact of a variable behavior (physical training) on an outcome (a 5K time) proposed to hinge to some unknown but considerable extent on that variable behavior.

By choosing a sizable number of subjects for your analysis (N = 100), you aim to determine whether a meaningful and reproducible relationship exists; if you only looked at three or four runners and one or two happened to have a cold on test day, the results would be less helpful.

If you charged $10 for an app that you developed and somehow had no start-up or maintenance costs, your profit would just be the number of units you sold times ten: y = 10x. There would thus be a "perfect," or invariant, correlation between the number of units sold and profit. If you plotted the graph, a single line would obviously join all the points.

But what about correlations that are clearly in play but are not "perfect"? In science, this is in fact the case most of the time, and linear regression analysis is the tool scientists use to determine the extent or power of any relationships determined between variables in the world.

## What Is Confounding in Statistics?

Imagine sampling 1,000 people from the U.S. population who report consuming more than three cups of coffee per day and comparing the collective rate of lung cancer in this group to the lung-cancer rate of 1,000 randomly chosen Americans who report drinking no coffee at all. Would you be surprised to find that the coffee-drinking group wound up experiencing significantly more lung cancers than the abstainers?

If you're already thinking that either the study design was flawed, or there is something insidious and previously unknown about coffee, you're on the right track. It would perhaps not be surprising to find that the rate of cigarette smoking is far higher among heavy coffee drinkers than in people who drink moderate amounts or none at all.

In this case, cigarette smoking is known as a **confounding variable**. Because it is associated with the independent variable (coffee drinking) and also has measurable effects on the outcome of interest (lung cancer), it can create a misleading association between the two. Statisticians and researchers have to be able to control for such confounding variables when designing studies and analyzing the data these produce.
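The coffee example can be sketched with a few lines of Python. Every number below is a hypothetical assumption invented for illustration (the smoker counts and the risk figures are not real epidemiological data), and the names `groups`, `risk` and `crude_rate` are likewise made up. In this sketch, lung-cancer risk depends only on smoking, yet the coffee group still shows a higher overall rate because it contains more smokers.

```python
# Hypothetical head counts, invented only to illustrate confounding
groups = {
    "coffee": {"smokers": 600, "nonsmokers": 400},
    "no coffee": {"smokers": 200, "nonsmokers": 800},
}

# Assumed lung-cancer risk: depends on smoking status only, never on coffee
risk = {"smokers": 0.10, "nonsmokers": 0.01}


def crude_rate(counts):
    """Overall lung-cancer rate in a group, ignoring smoking status."""
    cases = sum(n * risk[status] for status, n in counts.items())
    return cases / sum(counts.values())


for name, counts in groups.items():
    print(f"{name}: {crude_rate(counts):.1%}")
```

Comparing smokers with smokers and nonsmokers with nonsmokers makes the apparent coffee effect vanish in this toy setup, which is the idea behind controlling for a confounding variable.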

## About Regression Analysis

Say you carry out your training-versus-5K time analysis, and much to your delight, you see that there is in fact a relationship between work and results: Those who report more rigorous preparation tend to have faster times. But the graph is not a line by any means; instead, it is a cloud of points, called a *scatter plot*, through which a line could be drawn to capture the mathematical "essence" of the cloud.

In order to perform what is called a linear regression analysis, which is the process used to determine a best line of fit in a scatter plot, you must be able to make two assumptions. One is that the relationship is in fact linear rather than, say, curvilinear, as when y varies with some power of x (such as x^{2}) or exponentially with x.

The other is that y is a *continuous* variable, that is, not a *discrete* variable such as the number of classes (1, 2 or 3) taken in a semester.

In a graph of 5K speed vs. training volume for your 100 subjects, there is no true line representing the graph. That means that there is also no real slope or y-intercept. There is, however, a line that best fits all of the plotted points and minimizes the total difference between the line and all of the individual data points. This line produces an estimate of the y-intercept and slope and the equation describing it is of the form noted above:

**ŷ** = a + b**x**

ŷ is called "y hat," and the graph is called a **line of best fit** or, for reasons soon to become clear, a **least-squares line**.

- As you may have determined, you aren't expected to solve these equations by hand. Not only will your calculator perform this function for you, but you can also use any number of online tools to do the job for you (see the Resources for an example).

## What Is the Correlation Coefficient r?

In the above equation, the constants a and b are estimates derived from the mean values of x and y in the sample (such as average training volume and average 5K time), written as x̅ and y̅. The derivation is too extensive for this discussion, but for completeness' sake,

a = y̅ − bx̅

b = ∑[(x − x̅)(y − y̅)] / ∑(x − x̅)^{2}

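The constant b is derived from the deviations of the x and y values from their means: the numerator measures how x and y vary together, and the denominator measures how much x varies on its own. The quality of the "fit" depends on a different set of deviations, the vertical distances (y − ŷ) between each data point and the line. The line of best fit is the one that makes the sum of the squares of these distances as small as possible, which is why it is called a least-squares line.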
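The formulas for a and b above translate directly into a few lines of Python. The sketch below uses a tiny hypothetical sample (five invented runners, with speed in km/h), and the function names `least_squares` and `sum_squared_errors` are illustrative assumptions. It also checks that nudging the fitted intercept or slope increases the total squared vertical distance between the points and the line.

```python
# Hypothetical sample, invented for illustration (not real study data):
# average weekly miles (x) and 5K speed in km/h (y) for five runners.
miles = [10, 20, 30, 40, 50]
speed = [10.0, 11.5, 12.0, 13.5, 14.0]


def least_squares(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x using the formulas above."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of (x - x_bar)(y - y_bar); denominator: sum of (x - x_bar)^2
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def sum_squared_errors(xs, ys, a, b):
    """Total squared vertical distance between the points and the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


a, b = least_squares(miles, speed)
best = sum_squared_errors(miles, speed, a, b)

print(f"y-hat = {a:.2f} + {b:.2f}x")          # y-hat = 9.20 + 0.10x
print(f"sum of squared errors: {best:.2f}")   # 0.30

# Nudging the intercept or the slope away from the fitted values worsens the fit
print(sum_squared_errors(miles, speed, a + 0.1, b) > best)   # True
print(sum_squared_errors(miles, speed, a, b + 0.01) > best)  # True
```

With real data you would normally let a calculator or statistics package do this work; the point here is only that the arithmetic behind the line is short.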

The expression for the constant b above can be written:

b = r(S_{y}/S_{x}),

where S_{y} and S_{x} are the standard deviations of the y and x values in the set, respectively. At last, you have arrived at a key quantity in regression analysis: the **correlation coefficient r**, which can vary between −1.0 and 1.0.

- r is the bottom item on the output screen of the LinRegTTest on TI-83, TI-83+ and TI-84+ calculators.
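If you would rather not rely on a calculator screen, r can be computed in a few lines of Python. The sketch below assumes the common deviation-based formula for r (the article does not state one) and uses hypothetical data: five invented runners and five invented app-sales figures. The function name `pearson_r` is also an assumption. It checks the relationship b = r(S_{y}/S_{x}) numerically.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Correlation coefficient r, computed from deviations about the means."""
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical running sample: weekly miles and 5K speed in km/h
miles = [10, 20, 30, 40, 50]
speed = [10.0, 11.5, 12.0, 13.5, 14.0]
r = pearson_r(miles, speed)

# The slope recovered from r and the two sample standard deviations
slope_from_r = r * statistics.stdev(speed) / statistics.stdev(miles)

# Hypothetical app sales at $10 per unit: profit is exactly 10 times units sold
units_sold = [1, 2, 3, 4, 5]
profit = [10 * u for u in units_sold]
r_app = pearson_r(units_sold, profit)

print(f"r (running sample) = {r:.3f}")
print(f"b = r(S_y/S_x)     = {slope_from_r:.3f}")
print(f"r (app sales)      = {r_app:.3f}")
```

For the invented running data, the slope recovered from r matches the least-squares slope of 0.1, and the app-sales data give an r of 1 (up to floating-point rounding), as expected for a perfect correlation.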

## What Is the Coefficient of Determination?

The correlation coefficient r on its own is very useful. A value of 1.0 indicates a perfect positive correlation, as in the example of your app sales, and a value close to 1.0 indicates a near-perfect one. A value close to −1.0 indicates a strong negative correlation, in which an increase in the independent variable (say, hours spent partying) is associated with a decrease in the dependent variable (say, GPA).

A second important quantity in linear regression analysis is the **coefficient of determination**. In discussions of linear regression, the coefficient of determination is always the square of the correlation coefficient r, so it is simply (r)^{2} = r^{2}. Note that this value cannot be negative.

The coefficient of determination is not merely a numerical transformation from the correlation coefficient; it also has great explanatory value in many cases. It is usually expressed as a percentage rather than a decimal number, for this is the language statisticians prefer to use when conveying information to other scientists and especially the public.

## Why Use the r^{2} Value?

First, it is useful to know what r^{2} actually represents. It is best defined as *the percentage of variation in the dependent or predicted variable (y) that can be explained by variation in the independent or explanatory variable (x)* using the best-fit line generated by the regression analysis.

If the value of r^{2} in your running study turned out to be 0.64, you could state that 64 percent of the variation in 5K times was explained by differences in training volume. (Quick quiz: What values of r could result in a coefficient of determination of 0.64?)

By the same token, the value 1 − r^{2}, expressed as a percentage, represents the percent of variation in *y* that is not explained by variation in *x*. This may appear to be a trivially true result, but in some cases, you may be more explicitly interested in differences rather than similarities.
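The "explained" and "unexplained" shares can be computed directly from a fitted line. The following sketch again uses a small hypothetical sample (five invented runners, with speed in km/h) and illustrative variable names. It compares the variation left over around the best-fit line with the total variation in y, and confirms that the result equals the square of r.

```python
import math

# Hypothetical sample, invented for illustration (not real study data)
miles = [10, 20, 30, 40, 50]
speed = [10.0, 11.5, 12.0, 13.5, 14.0]

n = len(miles)
x_bar = sum(miles) / n
y_bar = sum(speed) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, speed))
s_xx = sum((x - x_bar) ** 2 for x in miles)

# Least-squares slope and intercept
b = s_xy / s_xx
a = y_bar - b * x_bar
predicted = [a + b * x for x in miles]

# Total variation in y, and the part left unexplained by the line
ss_total = sum((y - y_bar) ** 2 for y in speed)
ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(speed, predicted))

r_squared = 1 - ss_residual / ss_total
r = s_xy / math.sqrt(s_xx * ss_total)

print(f"explained by x:   {r_squared:.1%}")      # 97.1%
print(f"unexplained:      {1 - r_squared:.1%}")  # 2.9%
print(f"r squared:        {r ** 2:.4f}")
```

The very high value here reflects how tidy the invented numbers are; real training data would scatter far more widely.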

In your running analysis, if you did not divide your subjects into categories based on factors such as age, sex and general health, you could expect a number of confounding variables to add variation in 5K times that training volume does not explain, thus driving down the value of r^{2} and exposing the limits of the investigative power of your analysis.

## Linear Regression Calculator

In the Resources, you'll find an example of a tool that allows you to input as many x and y values as you wish from a data set and perform a linear regression, generating r and r^{2} in the process. Playing around with increasingly large data sets and tinkering with the variation by "feel" is a great way to familiarize yourself with linear regression and its graphical implications.