CHAPTER 11 SIMPLE LINEAR REGRESSION AND CORRELATION

Figure 11-9 Patterns for residual plots. (a) satisfactory, (b) funnel, (c) double bow, (d) nonlinear. [Adapted from Montgomery, Peck, and Vining (2001).]

is preferred. It requires judgment to assess the abnormality of such plots. (Refer to the discussion of the "fat pencil" method in Section 6-7.)

We may also standardize the residuals by computing $d_i = e_i / \sqrt{\hat{\sigma}^2}$, $i = 1, 2, \ldots, n$. If the errors are normally distributed, approximately 95% of the standardized residuals should fall in the interval $(-2, +2)$. Residuals that are far outside this interval may indicate the presence of an outlier, that is, an observation that is not typical of the rest of the data. Various rules have been proposed for discarding outliers. However, outliers sometimes provide important information about unusual circumstances of interest to experimenters and should not be automatically discarded. For further discussion of outliers, see Montgomery, Peck, and Vining (2001).

It is frequently helpful to plot the residuals (1) in time sequence (if known), (2) against the $\hat{y}_i$, and (3) against the independent variable $x$. These graphs will usually look like one of the four general patterns shown in Fig. 11-9. Pattern (a) in Fig. 11-9 represents the ideal situation, while patterns (b), (c), and (d) represent anomalies. If the residuals appear as in (b), the variance of the observations may be increasing with time or with the magnitude of $y_i$ or $x_i$. Data transformation on the response $y$ is often used to eliminate this problem. Widely used variance-stabilizing transformations include the use of $\sqrt{y}$, $\ln y$, or $1/y$ as the response. See Montgomery, Peck, and Vining (2001) for more details regarding methods for selecting an appropriate transformation. If a plot of the residuals against time has the appearance of (b), the variance of the observations is increasing with time. Plots of residuals against $\hat{y}_i$ and $x_i$ that look like (c) also indicate inequality of variance.
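To illustrate why a transformation such as $\ln y$ can stabilize the variance, the following minimal sketch uses hypothetical, deterministic data (not taken from the text) in which the spread of the response is proportional to its mean. The function name `spreads` and the 10% spread are assumptions made only for this illustration.

```python
import math

def spreads(mu, delta=0.10):
    """Return (raw spread, ln spread) for two hypothetical observations
    placed at mu*(1 - delta) and mu*(1 + delta)."""
    y_low, y_high = mu * (1 - delta), mu * (1 + delta)
    raw = y_high - y_low                          # grows in proportion to mu
    logged = math.log(y_high) - math.log(y_low)   # equals ln((1+delta)/(1-delta))
    return raw, logged

# Hypothetical mean levels of the response (illustrative only).
for mu in [10.0, 20.0, 40.0, 80.0]:
    raw, logged = spreads(mu)
    print(f"mean={mu:5.1f}  spread of y={raw:6.2f}  spread of ln y={logged:.4f}")
```

On the original scale the spread widens as the mean increases, which is the funnel pattern (b); on the logarithmic scale the spread does not depend on the mean. Other relationships between spread and mean would call for a different transformation, such as $\sqrt{y}$ or $1/y$.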
Residual plots that look like (d) indicate model inadequacy; that is, higher order terms should be added to the model, a transformation on the x-variable or the y-variable (or both) should be considered, or other regressors should be considered.

EXAMPLE 11-7

The regression model for the oxygen purity data in Example 11-1 is $\hat{y} = 74.283 + 14.947x$. Table 11-4 presents the observed and predicted values of $y$ at each value of $x$ from this data set, along with the corresponding residual. These values were computed using Minitab and show

11-8 ADEQUACY OF THE REGRESSION MODEL

Table 11-4 Oxygen Purity Data from Example 11-1, Predicted Values, and Residuals

No.   Hydrocarbon Level, x   Oxygen Purity, y   Predicted Value, $\hat{y}$   Residual, $e = y - \hat{y}$
 1           0.99                 90.01               89.069009                    0.940991
 2           1.02                 89.05               89.518136                   -0.468136
 3           1.15                 91.43               91.464353                   -0.034353
 4           1.29                 93.74               93.560279                    0.179721
 5           1.46                 96.73               96.105332                    0.624668
 6           1.36                 94.45               94.608242                   -0.158242
 7           0.87                 87.59               87.272501                    0.317499
 8           1.23                 91.77               92.662025                   -0.892025
 9           1.55                 99.42               97.452713                    1.967287
10           1.40                 93.65               95.207078                   -1.557078
11           1.19                 93.54               92.063189                    1.476811
12           1.15                 92.52               91.614062                    0.905938
13           0.98                 90.56               88.919300                    1.640700
14           1.01                 89.54               89.368427                    0.171573
15           1.11                 89.85               90.865517                   -1.015517
16           1.20                 90.39               92.212898                   -1.822898
17           1.26                 93.25               93.111152                    0.138848
18           1.32                 93.41               94.009406                   -0.599406
19           1.43                 94.98               95.656205                   -0.676205
20           0.95                 87.33               88.470173                   -1.140173

the number of decimal places typical of computer output. A normal probability plot of the residuals is shown in Fig. 11-10. Since the residuals fall approximately along a straight line in the figure, we conclude that there is no severe departure from normality. The residuals are also plotted against the predicted value $\hat{y}_i$ in Fig. 11-11 and against the hydrocarbon levels $x_i$ in Fig. 11-12. These plots do not indicate any serious model inadequacies.
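The standardized residuals $d_i = e_i / \sqrt{\hat{\sigma}^2}$ for this example can be sketched as follows. The observed and predicted values are copied from Table 11-4; the function name `standardized_residuals` is hypothetical, and the sketch assumes that $\hat{\sigma}^2$ is estimated as $SS_E/(n - 2)$, the usual estimator for simple linear regression.

```python
import math

# Observed oxygen purity y and predicted value y_hat, copied from Table 11-4.
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]
y_hat = [89.069009, 89.518136, 91.464353, 93.560279, 96.105332,
         94.608242, 87.272501, 92.662025, 97.452713, 95.207078,
         92.063189, 91.614062, 88.919300, 89.368427, 90.865517,
         92.212898, 93.111152, 94.009406, 95.656205, 88.470173]

def standardized_residuals(y, y_hat, n_params=2):
    """Return (d, sigma2_hat), where d_i = e_i / sqrt(sigma2_hat).

    Assumption: sigma2_hat = SS_E / (n - n_params), with n_params = 2
    (intercept and slope) for simple linear regression.
    """
    residuals = [yi - fi for yi, fi in zip(y, y_hat)]   # e_i = y_i - y_hat_i
    ss_e = sum(e * e for e in residuals)                # error sum of squares
    sigma2_hat = ss_e / (len(y) - n_params)
    scale = math.sqrt(sigma2_hat)
    return [e / scale for e in residuals], sigma2_hat

d, sigma2_hat = standardized_residuals(y, y_hat)
inside = sum(1 for di in d if -2 < di < 2)
print(f"sigma2_hat = {sigma2_hat:.4f}")
print(f"{inside} of {len(d)} standardized residuals lie in (-2, +2)")
```

A standardized residual well outside $(-2, +2)$ would flag the corresponding observation for closer inspection as a possible outlier; it would not by itself justify discarding the observation.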

11-8.2 Coefficient of Determination ($R^2$)

The quantity

$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T} \qquad \text{(11-34)}$$

is called the coefficient of determination and is often used to judge the adequacy of a regression model. Subsequently, we will see that in the case where $X$ and $Y$ are jointly distributed random variables, $R^2$ is the square of the correlation coefficient between $X$ and $Y$. From

[Figure 11-10: normal probability plot of the residuals. Horizontal axis: Residuals. Vertical axis: Cumulative normal probability.]

[Figure: residual plot. Vertical axis: Residuals.]
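As a numerical check on Equation 11-34, the following minimal sketch computes $R^2 = 1 - SS_E/SS_T$ from the observed and predicted values listed in Table 11-4. The function name `coefficient_of_determination` is hypothetical, and the sketch assumes $SS_T = \sum (y_i - \bar{y})^2$ and $SS_E = \sum (y_i - \hat{y}_i)^2$.

```python
# Observed oxygen purity y and predicted value y_hat, copied from Table 11-4.
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]
y_hat = [89.069009, 89.518136, 91.464353, 93.560279, 96.105332,
         94.608242, 87.272501, 92.662025, 97.452713, 95.207078,
         92.063189, 91.614062, 88.919300, 89.368427, 90.865517,
         92.212898, 93.111152, 94.009406, 95.656205, 88.470173]

def coefficient_of_determination(y, y_hat):
    """Return (R^2, SS_E, SS_T) using R^2 = 1 - SS_E / SS_T."""
    y_bar = sum(y) / len(y)
    ss_t = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # error sum of squares
    return 1.0 - ss_e / ss_t, ss_e, ss_t

r2, ss_e, ss_t = coefficient_of_determination(y, y_hat)
print(f"SS_E = {ss_e:.3f}, SS_T = {ss_t:.3f}, R^2 = {r2:.3f}")
```

The second form of Equation 11-34 is used because it needs only the residuals and the observed responses; the first form gives the same value whenever $SS_R = SS_T - SS_E$.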