(White, 1989; Mielniczuk and Tyrcha, 1993; Geman et al., 1992). Hampshire and Perlmutter (1990) specialize this result to show that an MLP classifier approaches the Bayesian decision boundary with a wide variety of penalty functions, including those that do not allow the MLP outputs to be interpreted as probabilities. A similar result for the softmax activation function (Equation (2.3), p. 11) does not appear in the literature, but can be seen to follow if we consider a softmax output for a particular class to be equivalent to a logistic activation function for that class versus all of the other classes.
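The softmax remark above can be checked numerically. The following is a minimal sketch, assuming the standard softmax form for Equation (2.3); the function names `softmax`, `logistic` and `one_vs_rest` are illustrative and not taken from the text. It shows that the softmax output for class k equals a logistic function applied to the class-k activation minus the log of the summed exponentials of all the other activations.

```python
import math

def softmax(a):
    # Standard softmax, shifted by max(a) for numerical stability.
    m = max(a)
    e = [math.exp(v - m) for v in a]
    s = sum(e)
    return [v / s for v in e]

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def one_vs_rest(a, k):
    # Class k versus all other classes: pool the other activations
    # with log-sum-exp, then apply a logistic function to the difference.
    rest = [v for j, v in enumerate(a) if j != k]
    m = max(rest)
    log_rest = m + math.log(sum(math.exp(v - m) for v in rest))
    return logistic(a[k] - log_rest)

if __name__ == "__main__":
    activations = [1.0, -0.5, 2.0, 0.3]  # arbitrary illustrative values
    for k, p in enumerate(softmax(activations)):
        print(k, p, one_vs_rest(activations, k))
```

The two printed columns should agree up to rounding error, since exp(a_k) / (exp(a_k) + S) = 1 / (1 + S exp(-a_k)), where S is the sum of the exponentials of the other activations.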
There are two different aspects here:
1. Equations (4.1) and (4.2) show that $\rho$ is minimized when $z_q = \mathrm{E}[t_q \mid x]$;
2. the results cited above show that this minimum is reached under certain conditions. That is, $z_q \to \mathrm{E}[t_q \mid x]$ provided the MLP is allowed to have an arbitrary number of hidden layer units.
However, allowing the number of hidden layer units to be arbitrarily large for a fixed set of training data is not a good strategy. As Richard and Lippmann (1991) point out, the estimates of the posterior probabilities will only be good if:
1. sufficient training data are available;
2. the network has sufficient complexity to implement the correct classification;
3. the classes are sampled with the correct a priori probabilities.
Note that a combination of logistic outputs with a sum of squares penalty function gives unbiased estimators of the posterior probability of class membership, given the conditions above. Hence we have the asymptotic property that the outputs from a multi-class MLP with a sum of squares penalty function and logistic output functions will also sum to one. However, these are asymptotic results and may not help greatly in dealing with small samples. Note that nothing in this discussion depends on the fact that we are fitting an MLP model. All that is assumed is that the method is flexible enough to model $\mathrm{E}[t \mid x]$.
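The idealized limit described above can be illustrated without any MLP. The following is a minimal sketch, assuming a model flexible enough to output any vector at a single input point and a small made-up set of one-hot targets observed at that point (the data and the names `mean_target` and `sum_of_squares` are hypothetical). The vector minimizing the sum of squares is then the mean of the targets, which is the vector of observed class frequencies, and its components sum to one.

```python
def mean_target(targets):
    # Component-wise mean of the target vectors: the least-squares minimizer.
    n = len(targets)
    return [sum(column) / n for column in zip(*targets)]

def sum_of_squares(z, targets):
    # Sum of squares penalty, summed over sample points and over classes q.
    return sum((zq - tq) ** 2 for t in targets for zq, tq in zip(z, t))

# Hypothetical one-hot targets observed at one input point:
# 5 samples of class 1, 3 of class 2 and 2 of class 3.
targets = [[1, 0, 0]] * 5 + [[0, 1, 0]] * 3 + [[0, 0, 1]] * 2

z_star = mean_target(targets)

if __name__ == "__main__":
    print("minimizer:", z_star, "sum:", sum(z_star))
    print("penalty at minimizer:", sum_of_squares(z_star, targets))
    print("penalty at a perturbed point:",
          sum_of_squares([0.6, 0.3, 0.1], targets))
```

Nothing here constrains the outputs to sum to one; that property follows from the targets themselves summing to one, which is the sense in which it holds only asymptotically for a fitted network.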
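Suppose we have a training set, $\mathcal{D}$, consisting of $N$ sample points. We consider the left-hand side of equation (4.1), but now we take the expectation over $\mathcal{D}$. We consider the training set to have been drawn from a population of training sets of size $N$, and to emphasize this we write $z(x; \mathcal{D})$ for $z$ estimated from $\mathcal{D}$ and evaluated at the point $x$. Suppressing the summation over $q$ we write (using the same argument as in the derivation of (4.1)),

$$
\begin{aligned}
\mathrm{E}_{\mathcal{D}}\!\left[\left(z(x;\mathcal{D}) - \mathrm{E}[t \mid x]\right)^{2}\right]
&= \mathrm{E}_{\mathcal{D}}\!\left[\left(z(x;\mathcal{D}) - \mathrm{E}_{\mathcal{D}}[z(x;\mathcal{D})]\right)^{2}\right]
 + \left(\mathrm{E}_{\mathcal{D}}[z(x;\mathcal{D})] - \mathrm{E}[t \mid x]\right)^{2} \\
&= \text{variance} + \text{bias}^{2}.
\end{aligned}
$$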
There are a number of forms of consistency, depending on the type of convergence that is considered; however, any reasonable definition will require that both bias and variance go to zero as the size of the training set is increased. The derivation of consistency results for the MLP can be described heuristically (Geman et al., 1992) as the process of decreasing the bias by letting the size of the network grow as $N \to \infty$, but not so fast that the variance grows too quickly. However, for finite $N$ there is a trade-off between bias and variance. As one fits models with more parameters the bias is decreased; however, the variance of
the estimator is increased (Geman et al., 1992). Hence there is an optimal point beyond which adding more parameters is counterproductive. Unfortunately in practice estimating this point is difficult and computationally expensive. This is essentially the problem of overfitting and in many instances the best practical advice that can be offered is simply to limit the number of parameters in an arbitrary fashion. See Section 5.3.5 (p. 61) for a discussion of the Network Information Criteria and other methods for model selection (see also Hastie et al., 2001, Section 7 for a discussion of this).
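The trade-off described above can be explored by simulation. The following is a minimal sketch, not an MLP: it assumes a made-up conditional mean `true_mean` and uses a piecewise-constant least-squares fit with `m` bins as a stand-in for a model with `m` parameters (all names and settings are illustrative). It repeatedly draws training sets of size N, fits the model, and estimates the variance and squared bias of the fitted value at a single point.

```python
import random

def true_mean(x):
    # Hypothetical E[t | x] on [0, 1); chosen only for illustration.
    return 0.2 + 0.6 * x * x

def draw_training_set(n, rng):
    # n sample points with Bernoulli targets t in {0, 1}.
    xs = [rng.random() for _ in range(n)]
    return [(x, 1.0 if rng.random() < true_mean(x) else 0.0) for x in xs]

def fit_bins(data, m):
    # Least-squares fit of a piecewise-constant model with m parameters:
    # each bin's value is the mean target of the points falling in it.
    sums, counts = [0.0] * m, [0] * m
    for x, t in data:
        b = min(int(x * m), m - 1)
        sums[b] += t
        counts[b] += 1
    overall = sum(sums) / len(data)  # fallback for empty bins
    return [s / c if c else overall for s, c in zip(sums, counts)]

def decompose(m, x0=0.9, n=50, n_sets=2000, seed=0):
    # Monte Carlo estimate of mean squared error, variance and bias^2
    # of the fitted value at x0, taken over training sets of size n.
    rng = random.Random(seed)
    preds = []
    for _ in range(n_sets):
        z = fit_bins(draw_training_set(n, rng), m)
        preds.append(z[min(int(x0 * m), m - 1)])
    mean_pred = sum(preds) / n_sets
    variance = sum((p - mean_pred) ** 2 for p in preds) / n_sets
    bias_sq = (mean_pred - true_mean(x0)) ** 2
    mse = sum((p - true_mean(x0)) ** 2 for p in preds) / n_sets
    return mse, variance, bias_sq

if __name__ == "__main__":
    for m in (1, 2, 5, 10, 25):
        mse, variance, bias_sq = decompose(m)
        print(f"m={m:2d}  mse={mse:.4f}  variance={variance:.4f}  bias^2={bias_sq:.4f}")
```

The returned mean squared error equals variance plus squared bias up to rounding, mirroring the decomposition in the text. How the two terms move as `m` grows depends on the assumed `true_mean`, the evaluation point and N, so the printed values are best read as an illustration rather than a general result.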
Minimizing the sum of squares penalty function coincides with finding the maximum likelihood estimator for a random variable with a Gaussian distribution. However, in a classification task the target takes values in the set $\{0, 1\}$, and the Gaussian distribution is not a plausible model. An alternative is to assume that $t$ is a Bernoulli random variable and let $\pi = P(t = 1)$ and $1 - \pi = P(t = 0)$. We can then write

$$P(t) = \pi^{t}(1 - \pi)^{1 - t}, \tag{4.3}$$

which is the likelihood function for a Bernoulli random variable. The Bernoulli distribution can be written as a member of the exponential family of distributions$^{3}$

$$P(t; \theta, \phi) = \exp\left\{\frac{t\theta - b(\theta)}{a(\phi)} + c(t, \phi)\right\}, \tag{4.4}$$

where $\theta$ is a location parameter and $\phi$ is a scale parameter. The fact that $t$ has an exponential family distribution means that a well developed body of theory, that of generalized linear models (McCullagh and Nelder, 1989), can be applied. Write $\mu = \mathrm{E}(t)$; then in a linear model $\mu = w^{T}x$ and the $w$ are estimated by least squares. In a generalized linear model we write $\eta = w^{T}x$, where $\eta$ is called the linear predictor, and then model $\eta = g(\mu)$, where $g$ is the link function. The link function for which $\theta = \eta$ is referred to as the canonical link. If $P$ is a Gaussian distribution then $\theta = \mu$ and $\phi = \sigma^{2}$, and if $g$ is the identity function we have $\eta = w^{T}x = \mu$, the usual linear model. To write (4.3) in the form of equation (4.4), we reparameterize it