Learning consists of minimizing the discrepancies between the observed and fitted values. To implement this in a neural network, a penalty function p = p(z*, t) is imposed, such that the function attains its minimum value when the output values equal the target values. We then minimize p over the unknown parameters. Minimization techniques can be divided into two types: local, which find a local minimum; and global¹, which attempt to find the global minimum. Generally, global minimization is more desirable, but it is also more computationally demanding and, given the complexity of the models and the potential size of the data sets, it is often impractical. The most widely used methods for neural networks are a class of local minimization schemes based on gradient information.

¹ Global techniques include genetic algorithms, simulated annealing, and perhaps Bayesian learning, which integrates over the parameter space rather than finding the minima.

There is now a vast literature on function minimization as applied to neural networks. Battiti (1992) gives a survey of the main characteristics of the different methods and their mutual relations, subtitled "between steepest descent and Newton's method". The paper makes the point that steepest descent uses first-order information and is computationally cheap per iteration, whereas Newton's method uses the full second-order information and is computationally expensive per iteration.
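As a minimal sketch of the contrast between steepest descent and Newton's method, the following compares the two on a toy two-parameter quadratic penalty. The quadratic, the fixed step size, the tolerance and all function names are illustrative assumptions; none of them is taken from the text or from Battiti (1992).

```python
# Toy penalty (an assumption for illustration): p(w) = 0.5 * w^T A w - b^T w.
# A is deliberately ill-conditioned; the minimum is at w = (1, 1).
A = [[10.0, 0.0], [0.0, 1.0]]
b = [10.0, 1.0]


def gradient(w):
    """First-order information: grad p(w) = A w - b."""
    return [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
            A[1][0] * w[0] + A[1][1] * w[1] - b[1]]


def norm(v):
    return (v[0] ** 2 + v[1] ** 2) ** 0.5


def steepest_descent(w, step=0.1, tol=1e-8, max_iter=10_000):
    """Cheap iterations: each one needs only the gradient."""
    for k in range(max_iter):
        g = gradient(w)
        if norm(g) < tol:
            return w, k
        w = [w[0] - step * g[0], w[1] - step * g[1]]
    return w, max_iter


def newton(w, tol=1e-8, max_iter=100):
    """Expensive iterations: each one solves a linear system in the Hessian."""
    for k in range(max_iter):
        g = gradient(w)
        if norm(g) < tol:
            return w, k
        # The Hessian of this quadratic is A; solve A d = g by Cramer's rule.
        det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
        d = [(A[1][1] * g[0] - A[0][1] * g[1]) / det,
             (A[0][0] * g[1] - A[1][0] * g[0]) / det]
        w = [w[0] - d[0], w[1] - d[1]]
    return w, max_iter


w_sd, sd_iters = steepest_descent([0.0, 0.0])
w_nt, nt_iters = newton([0.0, 0.0])
print("steepest descent:", w_sd, "iterations:", sd_iters)
print("Newton:          ", w_nt, "iterations:", nt_iters)
```

Because this penalty is exactly quadratic, Newton's method reaches the minimum in a single iteration, while steepest descent with this step size needs well over a hundred. The trade-off is that every Newton iteration must form the Hessian and solve a linear system in it, which is the costly part when the number of weights is large.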
A Statistical Approach to Neural Networks for Pattern Recognition by Robert A. Dunne. Copyright © 2007 John Wiley & Sons, Inc.
benchmark problems; stopping criteria (which can make a large difference to the reported learning times); reporting of training times, including best, worst, average and the question of non-converging trials.
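To illustrate how much the stopping criterion alone can change what gets reported, here is a small sketch that runs plain gradient descent from several starting points under two gradient tolerances and summarizes the best, worst and average iteration counts together with the number of non-converging trials. The toy penalty p(w) = w^4, the step size, the iteration cap and the function names are all assumptions made for this illustration.

```python
def gradient(w):
    # Gradient of the toy penalty p(w) = w**4, chosen because it converges slowly.
    return 4.0 * w ** 3


def train(w, tol, step=0.1, max_iter=1000):
    """Gradient descent; returns the iteration count, or None if the cap is hit."""
    for k in range(max_iter):
        if abs(gradient(w)) < tol:
            return k
        w -= step * gradient(w)
    return None


def report(starts, tol):
    """Summarize training times over several trials for one stopping tolerance."""
    counts = [train(w0, tol) for w0 in starts]
    done = [c for c in counts if c is not None]
    return {
        "tol": tol,
        "best": min(done) if done else None,
        "worst": max(done) if done else None,
        "average": sum(done) / len(done) if done else None,
        "non_converging": len(counts) - len(done),
    }


starts = [0.5, 1.0, 1.5]
loose = report(starts, tol=1e-3)
tight = report(starts, tol=1e-6)
print(loose)
print(tight)
```

With the looser tolerance every trial stops within the cap, whereas with the tighter one none does, so the same method on the same problem would be reported either as reliably convergent or as failing, depending only on the stopping rule. Averages computed over converged trials only can therefore hide the non-converging ones unless those are counted separately.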
There is also a vast numerical analysis literature on function minimization; see Nash (1990), Adby and Dempster (1974), Rowan (1990), Nelder and Mead (1965), O'Neill (1971), Shanno and Phua (1980) and Press et al. (1992).
