Outliers

last modified on 1/23/07

email comments to harvey@depauw.edu or william.otto@maine.edu

Introduction

When we collect and analyze several replicate portions of a material, we do so to characterize that material in some way. Suppose, for example, that we gather seven replicate sediment samples from a local stream and bring them back to the lab to determine the concentration of Pb in the sediment. After analyzing each replicate, we obtain the following results (in ppb)

4.5, 4.9, 5.6, 4.2, 6.2, 5.2, 9.9

and report the average (5.8 ppb) as an estimate of the amount of Pb in the sediment and the standard deviation (1.9 ppb) as an estimate of the uncertainty in that result.
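
The summary values above can be checked with a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and it assumes the reported standard deviation is the sample standard deviation (n - 1 in the denominator), which is what reproduces the 1.9 ppb quoted above.

```python
import statistics

# Pb concentrations (ppb) for the seven replicate sediment samples
pb_ppb = [4.5, 4.9, 5.6, 4.2, 6.2, 5.2, 9.9]

# Arithmetic mean of the replicates
mean_pb = statistics.mean(pb_ppb)

# Sample standard deviation (assumption: n - 1 denominator)
std_pb = statistics.stdev(pb_ppb)

print(f"mean = {mean_pb:.1f} ppb, s = {std_pb:.1f} ppb")
# prints: mean = 5.8 ppb, s = 1.9 ppb
```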

The mean and standard deviation given above provide a reasonable summary of the sediment if the seven replicates come from a single population. The last value of 9.9 ppb, however, appears unusually large in comparison to the other results. If this sample is not from the same population as the other six samples, then it should not be included in our final summary. Such data points are called outliers.

Discarding the last result changes the mean from 5.8 ppb to 5.1 ppb and the standard deviation from 1.9 ppb to 0.73 ppb; neither change is trivial. Is discarding the suspect data point justifiable? Is it ever justifiable to discard a data point? If so, what criteria should influence a decision to reject an apparent outlier?
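
To see how much a single value can shift the summary, the comparison can be sketched in Python as follows. The helper name `summarize` is hypothetical, and the sample standard deviation (n - 1 denominator) is again assumed.

```python
import statistics

def summarize(values):
    """Return (mean, sample standard deviation) for a list of results."""
    return statistics.mean(values), statistics.stdev(values)

# All seven Pb results (ppb); the last one is the suspect value
all_results = [4.5, 4.9, 5.6, 4.2, 6.2, 5.2, 9.9]

# The same data with the suspect value of 9.9 ppb removed
without_suspect = all_results[:-1]

mean_all, std_all = summarize(all_results)
mean_six, std_six = summarize(without_suspect)

print(f"all seven:    mean = {mean_all:.1f} ppb, s = {std_all:.1f} ppb")
print(f"without 9.9:  mean = {mean_six:.1f} ppb, s = {std_six:.2f} ppb")
# prints:
# all seven:    mean = 5.8 ppb, s = 1.9 ppb
# without 9.9:  mean = 5.1 ppb, s = 0.73 ppb
```

The standard deviation falls by more than half when one point out of seven is removed, which is why the decision to reject a value needs an objective criterion.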

When you complete this module you should:

• appreciate how the position of one data point relative to the remaining data affects your ability to determine if it is an outlier
• be able to use the Q-test to identify and reject a possible outlier
• understand how paying attention to your data as it is collected can help you in identifying possible outliers

Before tackling some problems, read an explanation of how the Q-test works by following the link on the left.
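
As a rough preview of the calculation, one common form of the Q-test statistic (Dixon's Q) divides the gap between the suspect value and its nearest neighbor by the range of the whole data set. The sketch below assumes that definition, which may differ in detail from the linked explanation, and uses a hypothetical helper name `q_statistic_for_largest`. It computes only the statistic; deciding whether to reject the value still requires comparing Q to a tabulated critical value for the chosen confidence level and number of samples.

```python
def q_statistic_for_largest(values):
    """Q = gap / range, treating the largest value as the suspect point.

    Assumes the common Dixon's Q definition; needs at least three values.
    """
    ordered = sorted(values)
    gap = ordered[-1] - ordered[-2]      # suspect value minus nearest neighbor
    data_range = ordered[-1] - ordered[0]  # largest minus smallest value
    return gap / data_range

# Pb concentrations (ppb) from the sediment example
pb_ppb = [4.5, 4.9, 5.6, 4.2, 6.2, 5.2, 9.9]

q = q_statistic_for_largest(pb_ppb)
print(f"Q = {q:.2f}")
# prints: Q = 0.65
```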