in the row labeled 18, we see the numbers 1.33, 1.73, 2.10, 2.88, and 3.92. Locate 57 relative to these numbers; it is greater than the last entry, 3.92. This entry is in the column labeled 0.001. This means that the probability, when $\beta = 0$, that the random variable $|t|$ is as large as or larger than 3.92 is 0.001, or in compact notation,
$$\mathrm{Prob}[\,|t| \ge 3.92\,] = 0.001.$$
Of course the probability that $|t|$ is as large as or larger than 57 is even less. So we can say that the p-value of the test of $\beta = 0$ for the acid content data is less than 0.001. This is a formal confirmation of our previous informal conclusion that $\beta \ne 0$. In other words, as we would expect, the cheap measurement of the acid number of a chemical sample contains significant information about the organic acid content of the sample.
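The table lookup can be checked numerically. The following is a minimal sketch, not part of the original text, that approximates the two-sided tail probability of a t distribution with 18 degrees of freedom by integrating its density with Simpson's rule. The function names `t_density` and `two_sided_p_value` and the step count are illustrative assumptions.

```python
import math


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_value(t_abs, df, steps=20000):
    """Approximate Prob[|t| >= t_abs] using Simpson's rule on [0, t_abs].

    steps must be even (hypothetical default chosen for illustration).
    """
    h = t_abs / steps
    total = t_density(0.0, df) + t_density(t_abs, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    central_half = total * h / 3  # Prob[0 <= t <= t_abs]
    # Guard against tiny negative values caused by rounding error.
    return max(0.0, 1.0 - 2.0 * central_half)


if __name__ == "__main__":
    print("Prob[|t| >= 3.92], 18 df:", two_sided_p_value(3.92, 18))
    print("Prob[|t| >= 57],   18 df:", two_sided_p_value(57.0, 18))
```

The first probability should agree with the column label 0.001 to about three decimal places, and the second should be far smaller, which is all the argument above requires.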
Least-squares tests and estimates are optimal if the population of errors can be assumed to have a normal distribution. If the normality assumption is not satisfied, then least-squares procedures are still valid but they may be far from optimal. Consider the least-squares test of $\beta = 0$ described in the preceding section and suppose the error population is nonnormal. The least-squares test is still approximately valid (provided the sample size is not too small and the nonnormality is not too extreme) in the sense that the calculated p-value is approximately equal to the true p-value, but there are other tests that are more powerful in the sense that they are better at detecting when $\beta \ne 0$.
There are two approaches to dealing with the question of normality. One is to check the data for normality and, if nonnormality is detected, try to correct it. The other is to use regression methods other than least squares, such as those presented in Chapters 4, 5, and 6, which do not depend on the assumption of normality.
A number of plots and tests, based on the residuals, have been developed for checking the normality of the errors, but here we mention only the normal probability plot. The standardized residuals are put in increasing order and plotted against what their expected values would be if they came from a sample of $n$ independent standard normal random variables. The plot should look nearly linear if the assumption of normality is valid. A normal probability plot of the residuals from the acid content data, shown in Figure 3.4, looks sufficiently linear to be consistent with the assumption of normality.
Figure 3.4 Normal probability plot of the residuals for the acid content data. (Axis label: normal scores.)
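The construction of a normal probability plot can be sketched in a few lines. The example below is an illustration under stated assumptions, not the procedure used to draw Figure 3.4: the expected values of the ordered standard normal sample are approximated by Blom's formula, $\Phi^{-1}((i - 0.375)/(n + 0.25))$, the residuals are hypothetical values made up for the example (they are not the acid content residuals), and the correlation between the two columns is used only as a rough numerical summary of how linear the plot is.

```python
from statistics import NormalDist


def normal_scores(n):
    """Approximate expected values of n ordered standard normal draws.

    Uses Blom's approximation, an assumption of this sketch.
    """
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25))
            for i in range(1, n + 1)]


def plot_points(residuals):
    """Pair each normal score with the corresponding ordered residual."""
    ordered = sorted(residuals)
    return list(zip(normal_scores(len(ordered)), ordered))


def correlation(points):
    """Pearson correlation of the (score, residual) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical standardized residuals, for illustration only.
example_residuals = [0.5, -1.1, 1.6, -0.4, 0.2, -1.8, 0.8, -0.1, 1.2, -0.7]

if __name__ == "__main__":
    pts = plot_points(example_residuals)
    for score, resid in pts:
        print(f"{score:7.3f}  {resid:6.2f}")
    print("correlation:", correlation(pts))
```

Plotting the second column against the first gives the normal probability plot; a correlation close to 1 corresponds to a plot that looks nearly linear.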
When a normal probability plot is very nonlinear, the data can sometimes be transformed so that normality is more closely approximated.
3.6 AN EXAMPLE OF MULTIPLE REGRESSION
Simple regression involves only one explanatory variable; multiple regression involves more than one. The turnip green data set in Section 1.2 has three explanatory variables. Let us do a regression analysis of these data. The linear regression model for these data is
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + e.$$
In terms of the observed data the model is
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + e_i \qquad (3.6)$$
for $i = 1, 2, \ldots, 27$.
The discussion in Section 3.3 extends to multiple regression. The least-squares estimates of $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$ are defined to be the values $\hat\beta_0$, $\hat\beta_1$, $\hat\beta_2$, and $\hat\beta_3$ which give the least sum of squares of the residuals, that is, which give the least value of $\sum e_i^2$, where $e_i = y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3})$.
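To make the definition concrete, here is a minimal sketch that finds the coefficients minimizing $\sum e_i^2$ for a model with an intercept and several explanatory variables. It uses the standard normal equations solved by Gaussian elimination, which is one common way to carry out the minimization and is an assumption of this sketch rather than something stated in the passage. The function names and the small data set are hypothetical; the data are not the turnip green data.

```python
def least_squares(x_rows, y):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    # Design matrix with a leading column of ones for the intercept.
    design = [[1.0] + [float(v) for v in row] for row in x_rows]
    k = len(design[0])
    # Augmented normal equations: (X'X) b = X'y.
    a = [[sum(r[i] * r[j] for r in design) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(design, y))]
         for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, k):
            factor = a[r][col] / a[col][col]
            for c in range(col, k + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(a[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (a[i][k] - tail) / a[i][i]
    return beta


def sum_sq_residuals(beta, x_rows, y):
    """Sum of e_i^2 with e_i = y_i - (b0 + b1*x_i1 + ... + bp*x_ip)."""
    return sum(
        (yi - (beta[0] + sum(b * x for b, x in zip(beta[1:], row)))) ** 2
        for row, yi in zip(x_rows, y)
    )


# Hypothetical data: six observations on three explanatory variables.
x_example = [(1, 2, 0), (2, 0, 1), (0, 1, 3), (3, 1, 1), (1, 4, 2), (2, 3, 5)]
y_example = [2.1, 7.8, 9.55, 9.5, 6.9, 18.65]

if __name__ == "__main__":
    beta_hat = least_squares(x_example, y_example)
    print("estimates:", beta_hat)
    print("sum of squared residuals:",
          sum_sq_residuals(beta_hat, x_example, y_example))
```

Any other choice of coefficients should give a sum of squared residuals at least as large as the one printed, which is exactly what the definition of the least-squares estimates requires.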
Algebraic formulas for these four least-squares estimates are very messy if they are written out for each individual parameter. But if we introduce matrix notation the formula for the vector of estimates is quite compact.
Matrix Notation. Boldface letters are used to denote matrices and vectors, capital letters for matrices and lowercase letters for vectors. Vectors in formulas are taken to be column vectors. (Elsewhere, it is sometimes more convenient to write a vector as a row vector.) Let
