A Statistical Approach to Neural Networks for Pattern Recognition, by Robert A. Dunne. Copyright © 2007 John Wiley & Sons, Inc.

Learning consists of minimizing the discrepancies between the observed and fitted values. To implement this in a neural network, a penalty function p = p(z*, t) is imposed, such that the function has a minimum value when the output values equal the target values. We then minimize p over the unknown parameters. Minimization techniques can be divided into two types: local, which find a local minimum; and global¹, which attempt to find the global minimum. Generally, global minimization is more desirable, but it is also more computationally demanding and, due to the complexity of the models and the potential size of the data sets, it is often impractical. The most widely used methods for neural networks are a class of local minimization schemes based on gradient information.

¹ Global techniques include genetic algorithms, simulated annealing, and perhaps Bayesian learning, which integrates over the parameter space rather than finding the minima.

There is now a vast literature on function minimization as applied to neural networks. Battiti (1992) gives a survey of the main characteristics of the different methods and their mutual relations, sub-titled "between steepest descent and Newton's method". The paper makes the point that steepest descent uses first-order information and is computationally cheap per iteration, whereas Newton's method uses the full second-order information and is computationally expensive per iteration. Many minimization schemes attempt to gain the benefits of second-order information without incurring the full computational costs. Recent surveys of the area can be found in Bishop (1995a), Ripley (1996), and Smagt (1994).

Barnard and Cole (1989) give details of a conjugate gradient implementation, and Charalambous (1990) gives an interesting variant on conjugate gradient, using subvectors of the weights rather than the whole weight vector, that may be feasible for parallel MLP architectures. Møller (1993) recommends a version of conjugate gradients called scaled conjugate gradients, and Fahlman (1989) gives an algorithm called quickprop. Both of these appear widely in the neural network literature. Jervis and Fitzgerald (1993) conclude that scaled conjugate gradient and quickprop are both expedient minimization schemes. Himmelblau (1990) treats the problem of calculating the MLP Hessian matrix in parallel.

In the case where the MLP is used for function approximation, rather than classification, the lack of a non-linear activation function at the output layer means that, for given inputs and a given R matrix, determining the output-layer weight matrix is a linear problem. This means that a linear approach (Gauss-Markov) can be used to estimate the output-layer weights and then a non-linear one to estimate R, at each learning iteration. This appears to have been considered independently by Webb et al. (1988) and Barton (1991).

Fahlman (1989) gives a discussion of the problem of comparing minimization algorithms. The paper raises the issues of
benchmark problems; stopping criteria (which can make a large difference to the reported learning times); the reporting of training times, including best, worst, and average; and the question of non-converging trials.
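The contrast drawn above between steepest descent and Newton's method, and the effect of a stopping criterion on reported iteration counts, can be illustrated with a minimal sketch. The quadratic penalty, the fixed learning rate of 0.05, the gradient-norm tolerance, and all function names below are assumptions chosen for illustration; they are not taken from Battiti (1992) or Fahlman (1989).

```python
import math

def penalty(w):
    # Hypothetical ill-conditioned quadratic standing in for the penalty p.
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def gradient(w):
    # First-order information: the gradient of the quadratic above.
    return [10.0 * w[0], w[1]]

def hessian(w):
    # Second-order information: constant for a quadratic penalty.
    return [[10.0, 0.0], [0.0, 1.0]]

def solve_2x2(h, g):
    # Solve h * step = g for a 2x2 matrix h by Cramer's rule.
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    return [(g[0] * h[1][1] - h[0][1] * g[1]) / det,
            (h[0][0] * g[1] - h[1][0] * g[0]) / det]

def steepest_descent_step(w, g, rate=0.05):
    # Cheap per iteration: a fixed multiple of the gradient (assumed rate).
    return [rate * g[0], rate * g[1]]

def newton_step(w, g):
    # Costlier per iteration: requires the Hessian and a linear solve.
    return solve_2x2(hessian(w), g)

def minimize(w0, step_fn, tol=1e-8, max_iter=10_000):
    # Stop when the gradient norm falls below tol; return weights and count.
    w = list(w0)
    for iteration in range(max_iter):
        g = gradient(w)
        if math.hypot(g[0], g[1]) < tol:
            return w, iteration
        step = step_fn(w, g)
        w = [w[0] - step[0], w[1] - step[1]]
    return w, max_iter  # a non-converging trial under this criterion

w_sd, n_sd = minimize([1.0, 1.0], steepest_descent_step)
w_nt, n_nt = minimize([1.0, 1.0], newton_step)
print("steepest descent iterations:", n_sd)
print("Newton iterations:", n_nt)
```

On this quadratic, Newton's method reaches the minimum in a single step, while steepest descent needs several hundred cheaper iterations. Changing `tol` changes the steepest descent count, which is one reason the stopping criterion has to be reported alongside any timing comparison.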
There is also a vast numerical analysis literature on function minimization; see Nash (1990), Adby and Dempster (1974), Rowan (1990), Nelder and Mead (1965), O'Neill (1971), Shanno and Phua (1980), and Press et al. (1992).
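As an illustration of the linear sub-problem noted earlier for function-approximation MLPs, the sketch below holds the hidden-layer weights fixed and solves for the output-layer weights by ordinary least squares. It covers only the linear half of the scheme. The tanh hidden units, the use of the normal equations, the synthetic data, and every name in the code are assumptions made for illustration, not the formulation of Webb et al. (1988) or Barton (1991).

```python
import math

def hidden_features(x, hidden_weights):
    # One design-matrix row: a bias term plus the tanh hidden-unit outputs.
    return [1.0] + [math.tanh(a * x + c) for a, c in hidden_weights]

def solve_linear_system(a, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_output_weights(xs, ts, hidden_weights):
    # Least-squares output weights v from the normal equations H'H v = H't.
    rows = [hidden_features(x, hidden_weights) for x in xs]
    k = len(rows[0])
    hth = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    htt = [sum(r[i] * t for r, t in zip(rows, ts)) for i in range(k)]
    return solve_linear_system(hth, htt)

# Hypothetical fixed hidden-layer weights as (slope, offset) pairs.
hidden_weights = [(1.0, 0.0), (2.0, -1.0)]
# Hypothetical output weights used only to generate noise-free targets.
true_output_weights = [0.5, -1.0, 2.0]

xs = [i / 10.0 for i in range(-20, 21)]
ts = [sum(v * h for v, h in zip(true_output_weights,
                                hidden_features(x, hidden_weights)))
      for x in xs]

fitted = fit_output_weights(xs, ts, hidden_weights)
print(fitted)
```

Because the targets here are generated without noise from a linear output layer, the fitted weights should agree with `true_output_weights` up to rounding error. In a full training loop this solve would alternate with a non-linear update of the hidden-layer weights.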
