Statisticians and scientists often need to investigate the relationship between two variables, commonly called x and y. The purpose of testing two such variables is usually to see whether there is some link between them, known as a correlation. For example, a scientist might want to know whether hours of sun exposure can be linked to rates of skin cancer. To describe the strength of a correlation between two variables mathematically, such investigators often use R².
Statisticians use the technique of linear regression to find the straight line that best fits a series of x and y data pairs. They do this through a series of calculations that derive the equation of the best line. This mathematical description of the line is a linear equation with the general form y = mx + b, where x and y are the two variables in the data pairs, m is the slope of the line and b is its y-intercept.
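As a rough illustration of those calculations, the sketch below assumes that "best fit" means ordinary least squares, which is the usual convention but is not stated above. The function name fit_line and the sample data are made up for the example.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b.

    Assumes ordinary least squares and at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx              # slope
    b = mean_y - m * mean_x    # y-intercept: the line passes through the means
    return m, b


# Hypothetical data pairs that lie close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
m, b = fit_line(xs, ys)
print(f"y = {m:.2f}x + {b:.2f}")
```

For these sample values the slope comes out close to 2 and the intercept close to 0, as the data were chosen to suggest.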
The calculations which find the best straight line will produce a linear equation to fit any set of data, even if that data is not actually very linear. In order to have an indication of how well the data actually fit a straight line, statisticians also calculate a number known as the correlation coefficient. This is given the symbol r or R and is a measure of how closely aligned the data pairs are to the best straight line through them.
Significance of R
R can take any value from -1 to 1. A negative value of R simply means that the best-fit straight line slants downwards moving left to right, rather than upwards. The closer R is to either of the two extremes, the better the fit of the data points to the line, with either -1 or 1 being a perfect fit. An R value of zero means that there is no straight-line relationship between the variables, as happens when the points are scattered at random. If the data points are well aligned to the straight line, there is said to be some correlation between them, hence the name correlation coefficient for R.
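The behaviour of R at its extremes can be checked with a small sketch. It assumes R is the Pearson correlation coefficient, the standard choice for straight-line fits; the function name pearson_r and the three data sets are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [3, 5, 7, 9, 11])    # points exactly on a rising line
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # points exactly on a falling line
r_flat = pearson_r(xs, [4, 1, 0, 1, 4])   # U-shaped points, no straight-line trend
print(r_up, r_down, r_flat)
```

The first two data sets sit exactly on a line, so R reaches 1 and -1. The third has no upward or downward straight-line trend, so R is zero.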
Some statisticians prefer to work with the value of R², which is simply the correlation coefficient squared, or multiplied by itself, and is known as the coefficient of determination. R² is closely related to R and also describes the correlation between the two variables, but it differs in two ways: it is never negative, so it does not show the direction of the line, and it measures the proportion of the variation in the y variable that can be attributed to variation in the x variable. An R² value of 0.9, for example, means that 90 percent of the variation in the y data is accounted for by variation in the x data. This does not necessarily mean that x is truly affecting y, only that the two vary together.
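The two descriptions of the coefficient of determination, as R multiplied by itself and as the share of variation in y accounted for by the line, can be compared numerically. The sketch below assumes a least-squares straight-line fit and the Pearson form of R; the function names and the data are hypothetical.

```python
def r_squared_from_r(xs, ys):
    """Square of the Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


def r_squared_from_fit(xs, ys):
    """Share of the variation in y accounted for by the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    # Variation left over after the fit, and total variation in y.
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
by_r = r_squared_from_r(xs, ys)
by_fit = r_squared_from_fit(xs, ys)
print(f"{by_r:.4f} {by_fit:.4f}")
```

For a straight-line fit with one x variable the two calculations agree, apart from floating-point rounding.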