Linear Regression

email comments to harvey@depauw.edu or william.otto@maine.edu

Further Study

The data sets in this module were created by F. J. Anscombe and were first published in the article "Graphs in Statistical Analysis," American Statistician, Vol. 27, pp. 17-21 (1973). An alternative data set created by J. M. Bremmer can be found here.

Mathematical Details. Our treatment of linear regression has been quite limited. In particular, the mathematical details have been omitted. For further information, including a discussion of more esoteric (but important) considerations, such as the standard deviation of the regression, the confidence intervals for the slope and intercept, and the use of a weighted linear regression, consult the following textbook:

Miller, J. C.; Miller, J. N. Statistics for Analytical Chemistry, Ellis Horwood: Chichester

A handout showing the derivation of the equations for a linear regression is available here.

Effect of Outliers. This applet demonstrates how an outlier affects the results of a linear regression. The original data consist of five points that follow a linear model. Click to add a sixth point and observe its effect on the regression line. Try placing the sixth point at various places, both near and far from the line, as well as at low, middle, and high values of X.
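For readers without access to the applet, the same experiment can be approximated numerically. The following is a minimal sketch, not the applet's own code: the five-point data set (points lying exactly on Y = 2X + 1), the helper names fit_line and fit_with_extra_point, and the choice to place the sixth point 5 units above the line are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope*x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical five-point data set lying exactly on y = 2x + 1
base_x = [1.0, 2.0, 3.0, 4.0, 5.0]
base_y = [2.0 * x + 1.0 for x in base_x]


def fit_with_extra_point(x6, y6):
    """Refit the line after adding a sixth point (x6, y6) to the base data."""
    return fit_line(base_x + [x6], base_y + [y6])


base_slope, base_intercept = fit_line(base_x, base_y)

# Place the sixth point 5 units above the original line at low, middle, and high X
low_slope, low_intercept = fit_with_extra_point(1.0, 8.0)
mid_slope, mid_intercept = fit_with_extra_point(3.0, 12.0)
high_slope, high_intercept = fit_with_extra_point(5.0, 16.0)

for label, slope, intercept in [
    ("five points only", base_slope, base_intercept),
    ("outlier at low X", low_slope, low_intercept),
    ("outlier at middle X", mid_slope, mid_intercept),
    ("outlier at high X", high_slope, high_intercept),
]:
    print(f"{label:20s} slope = {slope:.3f}  intercept = {intercept:.3f}")
```

With this particular data set, an outlier at the middle of the X range shifts the intercept but leaves the slope unchanged, whereas the same vertical displacement at either end of the range tilts the line. Other data sets and outlier positions will give different numbers, so treat the output as an illustration of the trend rather than a general rule.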

Residuals. As you have seen, it is not a good idea to rely on the value of R² or R as the sole measure of a model's appropriateness. An additional tool for evaluating a regression model is a plot of the residual error in Y as a function of X. If a model is appropriate, then the residual errors should be randomly scattered around a value of zero. To calculate the residual error for each value of X, use your regression line to calculate the predicted Y; that is

Ypred = slope*X + intercept

Next, calculate the residual errors

RE = (Yexpt - Ypred)

Finally, plot the residual errors vs. the values of X and examine the plot. Here are the three data sets that you worked with in this module. Create residual plots for each and see if the results agree with your earlier determination about the validity of fitting a straight line to the data. For Data Set 2, try both the straight-line model and a 2nd-order polynomial model. For Data Set 3, try the straight-line model with and without the apparent outlier.
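The residual calculation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of the original module: the data are hypothetical (five points on the curve Y = X², chosen so that a straight line is clearly the wrong model), and the helper names fit_line and residual_errors are invented for this example. The residuals are printed as a table; they can be plotted against X with any plotting tool.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope*x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def residual_errors(xs, ys, slope, intercept):
    """Return RE = Yexpt - Ypred for each point, with Ypred = slope*X + intercept."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Hypothetical curved data (y = x**2), deliberately not linear
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]
y_expt = [x ** 2 for x in x_values]

slope, intercept = fit_line(x_values, y_expt)
residuals = residual_errors(x_values, y_expt, slope, intercept)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
for x, re in zip(x_values, residuals):
    print(f"X = {x:.1f}  RE = {re:+.3f}")
```

For these data the residuals are positive at both ends of the X range and negative in the middle. A systematic pattern of this kind, rather than a random scatter around zero, suggests that the straight-line model is not appropriate, even though the residuals of a least-squares line with an intercept always sum to zero.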