CIS 311: Neural Networks

Neural Network Tuning and Overfitting Avoidance

(Continuation)

3.4. Growing and Pruning of Networks

The neural network architecture impacts considerably its performance. Finding proper neural network architectures requires search in the space of the network topologies, which in the general case is enormous. Especially for the case of neural networks trained by gradient methods, like the backpropagation algorithm, network topology search is a computationally very slow process as it involves slow training and it is therefore an infeasible technique.

The alternatives for finding proper neural network architectures are:

- perform training of a small network with a small number of units and weights, and grow it iteratively with corresponding training and weight updating;

- perform training of a large network and next prune it iteratively by removing units and weights. There are two popular approaches to network pruning:

- weight elimination: train the network and eliminate connections associated with weights having magnitudes below some threshold;

- node pruning: train the network and next prune units if the removal of a unit leads to error decrease more than a certain threshold.

3.5. Neural Network Committees

The idea for organising neural network committees is based on the divide-and-conquer principle: "solve a complex task by decomposing it into simple subtasks addressed by separate networks and combine the networks to construct a solution of the whole task".

The learning task is distributed among several neural networks trained separately and they together decide as a committee the overall solution of the task.

The neural network committees are universal approximators.

There are two main groups of neural network committees:

- static: the overall solution is decided without the direct use of the input examples. There are two methods in this group:

- ensemble averaging: combines linearly, as a linear sum the outputs from the neural networks in the committee
- boosting: converts a weak network into one that achieves high accuracy

- dynamic: the overall solution is decided with the use of the examples with an integrating mechanism. There are two methods in this group:

- mixtures of experts: combines nonlinearly the network outputs by means of a single neural network
- hierarchical mixture of experts: combines nonlinearly the network outputs by means of several networks composed hierarchically

3.5.1. Network Ensemble Averaging

Ensemble averaging suggests to train several neural networks differently and next to combine their outputs for producing the overall solution. The ultimate objective is to improve the overall solution quality based on the expectation that differently trained networks even if converging to local optima when considered together will produce a solution that is better than the individual network subsolutions.

Ensemble Averaging Algorithm

Initialization: Examples {( xe, ye )}e=1N
Do

- design a set of neural networks
- train the networks with the same training examples but with different initial conditions, that is with different initial weights
- continue even to overtraining
- combine the network outputs

There are two approaches to combining the network outputs:

- averaging committees: they are always better than a randomly selected network

ocommittee = ( 1 / Ki ) Sk=1K ok where: K is the number of networks in the committee

- weighted committees: perform a weighted combination of the network outputs giving greater weight to the better networks:

ocommittee = Sk=1K ( 1 / ( Ek * Sk=1K 1 / Ek )) * ok where: Ek = ( 1/N ) Se=1N ( ye - oe )2

The weighted committees of networks are always better than the best member if the errors Ek are accurately measured.

3.5.2. Boosting of Neural Networks

Neural network boosting suggests to train the networks in the committee on different subsets from the examples' set.
Boosting by resampling suggests to train repeatedly a neural network on various distributions of the examples and next to combine the already trained networks into a single composite model.

In boosting each example is associated with a corresponding weight. By changing the example weights during boosting the neural network learning process is directed toward different examples from the training set which leads to different networks. The final neural network is a combination of the outputs of the individual networks which are also weighted by a function of their accuracy on the training examples [Freund and Schapire, 1996].

Boosting Networks Algorithm

Initialization: Examples {( xe, ye )}e=1N, associate each e-th example with initial weight L1( e ) = 1 / N
Repeat

- iterations t during which each network is trained using examples from a distribution specified by the weights Lt( e )

- compute the error as a sum of the weights of misclassified examples

et = Se, oe=/=ye Lt( e )

- terminate if et > 0.5
otherwise compute the weights

Lt+1( e ) = Lt( e ) / Z * ( et / ( 1 - et )) when the network classifies correctly e: oe=ye

Lt+1( e ) = Lt( e ) / Z * 1, when the network misclassifies e: oe=/=ye

where Z is normalization of Lt( e )) to make it a distribution.

Finally: build a classifier by weighting as follows

oboosting( e ) = max St, e log( 1 / ( et / ( 1 - et ))) * ot,e

Suggested Readings:

C. Bishop (1995). Neural Networks for Pattern Recognition, Oxford University Press, Oxford, UK, Section 9, pp.353-368.

Y. Freund and R.E. Schapire (1996). Experiments with a New Boosting Algorithm, in: Proc. of 12th Int. Conference on Machine Learning ICML-96.

S. Haykin (1999). Neural Networks. A Comprehensive Foundation, Second Edition, Prentice-Hall, Inc., New Jersey, 1999.