CIS 311: Neural Networks
Neural Network Tuning and Overfitting Avoidance
(Continuation)
3.4. Growing and Pruning of Networks
The neural network architecture impacts considerably its performance. Finding proper neural network architectures requires search in the space of the network topologies, which in the general case is enormous. Especially for the case of neural networks trained by gradient methods, like the backpropagation algorithm, network topology search is a computationally very slow process as it involves slow training and it is therefore an infeasible technique.
The alternatives for finding proper neural network architectures are:
- perform training of a small network with a small number of units and weights, and grow it iteratively with corresponding training and weight updating;
- perform training of a large network and next prune it iteratively by removing units and weights. There are two popular approaches to network pruning:
- weight elimination: train the network and eliminate connections associated with weights having magnitudes below some threshold;
- node pruning: train the network and next prune units if the removal of a unit leads to error decrease more than a certain threshold.
3.5. Neural Network Committees
The idea for organising neural network committees is based on the divide-and-conquer principle: "solve a complex task by decomposing it into simple subtasks addressed by separate networks and combine the networks to construct a solution of the whole task".
The learning task is distributed among several neural networks trained separately and they together decide as a committee the overall solution of the task.
The neural network committees are universal approximators.
There are two main groups of neural network committees:
- static: the overall solution is decided without the direct use of the input examples. There are two methods in this group:
- dynamic: the overall solution is decided with the use of the examples with an integrating mechanism. There are two methods in this group:
3.5.1. Network Ensemble Averaging
Ensemble averaging suggests to train several neural networks differently and next to combine their outputs for producing the overall solution. The ultimate objective is to improve the overall solution quality based on the expectation that differently trained networks even if converging to local optima when considered together will produce a solution that is better than the individual network subsolutions.
Ensemble Averaging Algorithm
Initialization: Examples {( xe, ye )}e=1N
Do
There are two approaches to combining the network outputs:
- averaging committees: they are always better than a randomly selected network
ocommittee = ( 1 / Ki ) Sk=1K ok where: K is the number of networks in the committee
- weighted committees: perform a weighted combination of the network outputs giving greater weight to the better networks:
ocommittee = Sk=1K ( 1 / ( Ek * Sk=1K 1 / Ek )) * ok where: Ek = ( 1/N ) Se=1N ( ye - oe )2
The weighted committees of networks are always better than the best member if the errors Ek are accurately measured.
3.5.2. Boosting of Neural Networks
Neural network boosting suggests to train the networks in the committee on different subsets from the examples' set.
Boosting by resampling suggests to train repeatedly a neural network on various distributions of the examples and next to combine the already trained networks into a single composite model.
In boosting each example is associated with a corresponding weight. By changing the example weights during boosting the neural network learning process is directed toward different examples from the training set which leads to different networks. The final neural network is a combination of the outputs of the individual networks which are also weighted by a function of their accuracy on the training examples [Freund and Schapire, 1996].
Boosting Networks Algorithm
Initialization: Examples {( xe, ye )}e=1N,
associate each e-th example with initial weight L1( e ) = 1 / N
Repeat
- compute the error as a sum of the weights of misclassified examples
et = Se, oe=/=ye Lt( e )
- terminate if et > 0.5
otherwise compute the weights
Lt+1( e ) = Lt( e ) / Z * ( et / ( 1 - et )) when the network classifies correctly e: oe=ye
Lt+1( e ) = Lt( e ) / Z * 1, when the network misclassifies e: oe=/=ye
where Z is normalization of Lt( e )) to make it a distribution.
Finally: build a classifier by weighting as follows
oboosting( e ) = max St, e log( 1 / ( et / ( 1 - et ))) * ot,e
Suggested Readings:
C. Bishop (1995). Neural Networks for Pattern Recognition, Oxford University Press, Oxford, UK, Section 9, pp.353-368.
Y. Freund and R.E. Schapire (1996). Experiments with a New Boosting Algorithm, in: Proc. of 12th Int. Conference on Machine Learning ICML-96.
S. Haykin (1999). Neural Networks. A Comprehensive Foundation, Second Edition, Prentice-Hall, Inc., New Jersey, 1999.