Linear Regression

Further Study

The data sets in this module were created by F. J. Anscombe and were first published in the article "Graphs in Statistical Analysis," American Statistician, Vol. 27, pp. 17-21 (1973). An alternative data set created by J. M. Bremmer can be found here.

Mathematical Details. Our treatment of linear regression has been quite limited. In particular, the mathematical details have been omitted. For further information, including a discussion of more esoteric (but important) considerations, such as the standard deviation of the regression, the confidence intervals for the slope and intercept, and the use of a weighted linear regression, consult the following textbook:

Miller, J. C.; Miller, J. N. Statistics for Analytical Chemistry, Ellis Horwood: Chichester

A handout showing the derivation of the equations for a linear regression is available here.

Effect of Outliers. This applet demonstrates how an outlier affects the results of a linear regression. The original data consist of five points that follow a linear model. Click to add a sixth point and observe its effect on the regression line. Try placing the sixth point at various positions, both near to and far from the line, as well as at low, middle, and high values of X.

Residuals. As you have seen, it is not a good idea to rely on the value of R^{2} or R as the sole measure of a model's appropriateness. An additional tool for evaluating a regression model is a plot of the residual error in Y as a function of X. If a model is appropriate, then the residual errors should be randomly scattered around a value of zero. To calculate the residual error for each value of X, first use your regression line to calculate the predicted Y; that is,

Y_{pred} = slope*X + intercept

Next, calculate the residual errors

RE = (Y_{expt} - Y_{pred})
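The two calculations above can be sketched in a few lines of Python. This is only an illustration, not the module's own procedure: the function names fit_line and residual_errors are hypothetical, the slope and intercept come from the standard least-squares formulas, and the X and Y values are those of the first data set in Anscombe's 1973 paper, which is assumed (not confirmed) to correspond to Data Set 1 of this module.

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares straight line."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def residual_errors(x, y, slope, intercept):
    """RE = Y_expt - Y_pred, with Y_pred = slope*X + intercept."""
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


# Anscombe's first data set (assumed to match this module's Data Set 1).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

slope, intercept = fit_line(x, y)
re = residual_errors(x, y, slope, intercept)

# Print the residual errors in order of increasing X, ready for plotting.
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
for xi, r in sorted(zip(x, re)):
    print(f"X = {xi:4.1f}   RE = {r:+.3f}")
```

The printed X and RE columns can be pasted into a spreadsheet or any plotting tool. For a least-squares line that includes an intercept, the residual errors always sum to zero, so the sum alone says nothing about the fit; it is the pattern of the residuals along X that matters.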

Finally, plot the residual errors vs. the values of X and examine the plot. Here are the three data sets that you worked with in this module. Create residual plots for each and see if the results agree with your earlier determination about the validity of fitting a straight line to the data. For Data Set 2, try both the straight-line model and a 2^{nd}-order polynomial model. For Data Set 3, try the straight-line model with and without the apparent outlier.
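As a rough sketch of the Data Set 2 comparison, the Python code below fits both a straight line and a 2^{nd}-order polynomial and lists the residual errors for each model. Several things here are assumptions rather than part of the module: the helper names polyfit_normal and residual_errors are hypothetical, the fit is obtained by solving the least-squares normal equations directly (a spreadsheet or statistics package would serve equally well), and the values are those of the second data set in Anscombe's 1973 paper, which is assumed to correspond to Data Set 2 here.

```python
def polyfit_normal(x, y, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients [c0, c1, ..., c_degree] for c0 + c1*X + c2*X^2 + ...
    """
    n = degree + 1
    # Augmented matrix [A | b]: A[i][j] = sum(X^(i+j)), b[i] = sum(Y * X^i).
    aug = [
        [sum(xi ** (i + j) for xi in x) for j in range(n)]
        + [sum(yi * xi ** i for xi, yi in zip(x, y))]
        for i in range(n)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (aug[i][n] - s) / aug[i][i]
    return coeffs


def residual_errors(x, y, coeffs):
    """RE = Y_expt - Y_pred for a polynomial model with the given coefficients."""
    return [
        yi - sum(c * xi ** k for k, c in enumerate(coeffs))
        for xi, yi in zip(x, y)
    ]


# Anscombe's second data set (assumed to match this module's Data Set 2).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

line_coeffs = polyfit_normal(x, y, 1)   # [intercept, slope]
quad_coeffs = polyfit_normal(x, y, 2)   # [c0, c1, c2]
line_res = residual_errors(x, y, line_coeffs)
quad_res = residual_errors(x, y, quad_coeffs)

# List both sets of residual errors in order of increasing X.
for xi, r1, r2 in sorted(zip(x, line_res, quad_res)):
    print(f"X = {xi:4.1f}   RE(line) = {r1:+.3f}   RE(quadratic) = {r2:+.3f}")
```

When reading the output, look for a systematic pattern in the straight-line residuals (for example, negative at both ends of the X range and positive in the middle); such a pattern suggests that the model is missing curvature, whereas residuals from an appropriate model scatter randomly around zero.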