Galton's Sweet Pea Data in Java

Table 1.4 Galton's Sweet Pea Data
Diameter of Parent Peas     Mean Diameter of
(in 1/100 of an inch)       Offspring Peas
21                          17.5
20                          17.3
19                          16.0
18                          16.3
17                          15.6
16                          16.0
15                          15.3
Source: Galton (1886).
Table 1.4 and Figure 1.2 display part of his data. For each of seven diameters, he found sweet peas having approximately that diameter and planted them. These were the "parent" sweet peas. After the plants grew and produced peas, these "offspring" sweet peas were harvested and their diameters were measured. Galton noticed two things about these data. First, the averages of the offspring diameters had an approximately linear relationship with the parent diameters. Just by eye we can see in Figure 1.2 that a straight line could be drawn that fits the data fairly well. Second, he noticed that the average diameters of the offspring peas appear to "regress" toward a common average. (In 1877 he used the word "revert" but in an 1885 paper he changed to the word "regress".) The overall average diameter of the offspring peas is about 16.3. For each of the seven parental diameters, the average diameter of the offspring differs from the parental diameter in the direction of the overall average. For example, the offspring of parents with diameter 21 have an average diameter of 17.5, which is in the direction of 16.3. And the offspring of parents with diameter 15 have an average diameter of 15.3, which is also in the direction of 16.3. Galton later referred to this phenomenon as "regression toward mediocrity".

Figure 1.2 Plot of the sweet pea data (horizontal axis: parent peas).

One might think that regression would imply that, after many generations, all sweet peas would end up having the same diameter. But the regression pertains only to the average diameters of the offspring peas. The individual peas have diameters that vary around the average. The variability of the individual diameters compensates for the regression of the average diameters, so that the distribution of diameters in the population of offspring peas is actually the same as the distribution of diameters in the population of parent peas.
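Galton's two observations can be checked numerically from the seven rows of Table 1.4. The following Java sketch is illustrative only: the class name SweetPeas and its method names are hypothetical, the line is fitted by ordinary least squares (the text only says a line can be drawn by eye), and the "overall average" is taken to be the unweighted mean of the seven tabulated offspring means, which is an assumption because the table does not give the number of peas behind each mean.

```java
import java.util.Locale;

// Hypothetical helper class; the names are illustrative and not from the text.
final class SweetPeas {
    // Parent diameters and mean offspring diameters as listed in Table 1.4.
    static final double[] PARENT = {21, 20, 19, 18, 17, 16, 15};
    static final double[] OFFSPRING = {17.5, 17.3, 16.0, 16.3, 15.6, 16.0, 15.3};

    // Arithmetic mean of an array.
    static double mean(double[] v) {
        double sum = 0.0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    // Ordinary least-squares slope of y on x: Sxy / Sxx.
    static double slope(double[] x, double[] y) {
        double xBar = mean(x);
        double yBar = mean(y);
        double sxy = 0.0;
        double sxx = 0.0;
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - xBar) * (y[i] - yBar);
            sxx += (x[i] - xBar) * (x[i] - xBar);
        }
        return sxy / sxx;
    }

    // Least-squares intercept: the fitted line passes through (xBar, yBar).
    static double intercept(double[] x, double[] y) {
        return mean(y) - slope(x, y) * mean(x);
    }

    // Counts rows whose offspring mean moves AWAY from the overall offspring
    // mean relative to the parent diameter. A zero shift (offspring mean equal
    // to parent diameter) is not counted as moving away.
    static int countMovingAway(double[] x, double[] y) {
        double overall = mean(y); // assumption: unweighted mean of the row means
        int away = 0;
        for (int i = 0; i < x.length; i++) {
            if ((y[i] - x[i]) * (overall - x[i]) < 0) {
                away++;
            }
        }
        return away;
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "overall offspring mean = %.2f%n", mean(OFFSPRING));
        System.out.printf(Locale.ROOT, "fitted line: offspring = %.3f + %.3f * parent%n",
                intercept(PARENT, OFFSPRING), slope(PARENT, OFFSPRING));
        System.out.println("rows moving away from the overall mean: "
                + countMovingAway(PARENT, OFFSPRING));
    }
}
```

For the tabulated values the fitted slope is roughly 0.34, well below 1, which is the numerical counterpart of the regression effect: a one-unit difference between parents corresponds to only about a third of a unit difference between offspring averages. Note that the row for parent diameter 16 has a tabulated offspring mean of 16.0, so its shift is zero at the table's precision; the check therefore counts rows that move away from the overall mean rather than requiring a strict move toward it.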
NOTES

1.3a. In regression terminology we say that the variable Y "depends" on the variables X1, ..., Xp or that the value of Y is a "response" to the values of X1, ..., Xp. Although the words suggest so, this should not be taken to mean that the relationship between the X's and Y is necessarily one of cause and effect. A regression model can be used to describe associative relationships as well as causative relationships.

1.3b. There are a number of reasons why we cannot expect Y to be an exact mathematical function of X1, X2, and X3. The measurements of the three explanatory variables are incomplete in the sense that, for example, the air temperature measurement was taken at one particular time but the temperature probably varied somewhat throughout the production process. The measurements are subject to error due to imprecision in the measuring instruments and human errors. There are certainly other factors, besides sunlight, soil moisture, and air temperature, that influence the production of vitamin B2. The production of vitamin B2 is so complicated that it seems impossible to express it exactly in any mathematical formula.

1.7a. For more about Francis Galton's contributions to statistics see Stigler (1986, 8).

1.7b. To see how variability compensates for the regression of the averages, look at Figure 1.3. The ellipse represents a cloud of data points in a large sweet pea experiment. The x-coordinate of a data point is the diameter of a parent pea and the y-coordinate is the diameter of one of its offspring peas. Let us focus on peas having diameter 21 or more; for convenience we call these "large" peas. The large parent peas are those associated with the
