In statistics we often look for connections between two or more variables: we want to know whether knowing something about one variable lets us say something about another. For example, we might be interested in how weight varies with height, or how blood cholesterol level varies with the dose of statins taken. The exploration of relationships between variables is called regression analysis or regression modelling. The examples given below are for a simple regression analysis involving just two variables and a straight line model. More complex situations arise very often in statistics, but the same general principles apply.
Figure 1 illustrates some (fictitious) data of the type that might arise from an experiment.
On the horizontal axis we have a controlled or non-random variable, x, taking values 1, 2, …, 10. On the vertical axis the values, y, are not so neat: it looks as though the y values tend to increase as x increases, but there is clearly some other variation too, giving rise to fluctuations in the values of y. If we can assume that these fluctuations are random, and that without them the points would lie on a straight line, then we can calculate the regression line for y on x as shown. This line is the best estimate we can make from the given data of the underlying relationship between y and x: the straight line on which the data points would lie if we were able to eliminate the random variation in y.
The regression line can be used to predict the y value for a given x value. Predicting y for a value of x within the range of the data (e.g. x = 6.5 as shown) is called interpolation, and it is generally reliable. Predicting y for a value of x well beyond the data (e.g. x = 20) is called extrapolation, and it can be very inaccurate because the simple straight line relationship between x and y may break down outside the observed range. (Many statistical howlers arise from extrapolation: a favourite example extrapolates the growth to date in the number of Elvis impersonators to predict that 1 in 3 of the world’s population will be an Elvis impersonator by 2019.)
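As a rough illustration of the prediction step, here is a minimal Python sketch. It assumes the regression line is fitted by the usual least-squares criterion, which the text above does not state explicitly. The data values and the names `fit_line` and `predict` are made up for this example and are not the data behind figure 1.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(line, x):
    """Read the predicted y off the fitted line at the given x."""
    intercept, slope = line
    return intercept + slope * x


# Made-up data: x is controlled (1 to 10), y fluctuates about a line.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1.3, 2.1, 2.4, 3.6, 3.9, 4.2, 5.5, 5.4, 6.8, 7.1]

line = fit_line(xs, ys)

# x = 6.5 lies inside the observed range: interpolation.
print("interpolated y at x = 6.5:", round(predict(line, 6.5), 2))

# x = 20 lies far outside the observed range: extrapolation.
# The arithmetic is identical, but nothing in the data supports
# the straight line holding this far out.
print("extrapolated y at x = 20:", round(predict(line, 20), 2))
```

The code treats both predictions identically, and that is exactly the danger: the formula gives no warning when x has moved outside the range in which the straight line was observed to hold.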
In figure 2, again for fictitious data, we have a rather different situation.
Here neither variable is controlled. Perhaps we have selected a random sample of individuals and measured two things, x and y. (In a real example we might measure height and weight, or IQ and salary, hoping in each case to spot any connections there might be.) Here the variation represents the differences between individuals rather than random fluctuations or errors of measurement, and no amount of fine measurement will eliminate it. So it doesn’t make sense to think in terms of a straight line relationship between y and x. Instead, we think in terms of the average (or mean) value of y for a given x. Under the right conditions (x and y coming from a two-dimensional, or bivariate, normal distribution) the mean value of y as x varies will form a straight line. It is this regression line that is shown in the diagram. We can use the regression line to predict the mean value of y for a given value of x (e.g. x = 6.5 as shown). As before, this is interpolation and fairly safe. Extrapolation to values of x beyond the data is unsafe and best avoided.
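For readers who want the formula behind the straight line of means, the standard result for a bivariate normal distribution is given below. The symbols are notation introduced here rather than taken from the text: μ_x and μ_y are the means of x and y, σ_x and σ_y their standard deviations, and ρ the correlation between them.

```latex
\[
  E[\, y \mid x \,] \;=\; \mu_y + \rho \, \frac{\sigma_y}{\sigma_x} \, (x - \mu_x)
\]
```

The right-hand side is linear in x, which is why the mean value of y traces out a straight line as x varies. In practice the five parameters are unknown and are estimated from the sample.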
Because we have two random variables here, we could perfectly well be interested in the mean value of x as y varies. That is, we could be interested in the regression line of x on y. (Note that this would not have made sense for the situation shown in figure 1 where x was a controlled variable.) Figure 3 shows the regression line for x on y in red.
This regression line would be used to estimate the mean value of x for a given y (e.g. y = 4.5 as shown).
The fact that the two regression lines are quite distinct is sometimes found puzzling, but it need not be. Asking two distinct questions – how does the mean value of y vary with x? and how does the mean value of x vary with y? – should be expected to give two distinct answers!
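The difference between the two lines can be checked numerically. The sketch below again assumes least-squares fitting and uses made-up paired measurements. The helper `fit_line` and all variable names are hypothetical and do not come from the figures.

```python
def fit_line(us, vs):
    """Return (intercept, slope) of the least-squares line of vs on us."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    suv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    suu = sum((u - mean_u) ** 2 for u in us)
    slope = suv / suu
    return mean_v - slope * mean_u, slope


# Made-up paired measurements on ten individuals; neither is controlled.
xs = [2.1, 3.4, 4.0, 4.8, 5.5, 6.1, 6.9, 7.4, 8.2, 9.0]
ys = [3.9, 2.8, 5.1, 3.6, 5.9, 4.4, 6.6, 4.9, 6.1, 7.2]

# Regression of y on x: mean of y for a given x, y = a_yx + b_yx * x.
a_yx, b_yx = fit_line(xs, ys)

# Regression of x on y: mean of x for a given y, x = a_xy + b_xy * y.
a_xy, b_xy = fit_line(ys, xs)

# To compare the lines on the same axes (y plotted against x),
# rewrite the x-on-y line as y = (x - a_xy) / b_xy; its slope is 1 / b_xy.
slope_xy_in_plane = 1 / b_xy

print("slope of y on x:", round(b_yx, 3))
print("slope of x on y, drawn on the same axes:", round(slope_xy_in_plane, 3))

# The product of the two fitted slopes equals the squared sample
# correlation, so the lines coincide only when the points are collinear.
print("b_yx * b_xy (squared correlation):", round(b_yx * b_xy, 3))
```

Both fitted lines pass through the point of means, so they cross there. The more scattered the data, the smaller the squared correlation and the wider the angle between the two lines.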