CIS 311: Neural Networks
Neural Network Tuning and Overfitting Avoidance
(Continuation)
3. Neural Network Tuning
Overfitting can be avoided by balancing the statistical bias and variance. This balance can be sought by tuning:
- the network learning algorithm
- the provided training examples
The following neural network tuning strategies serve this purpose:
- regularization
- early stopping
- growing neural networks
- pruning neural networks
- committees of neural networks
3.1. Regularization
The risk of overfitting the examples can be reduced if a variance-related term is included in the error function to penalize neural network models with high curvature. For this reason, a weight decay term that encourages the learning of low-magnitude weights is incorporated into the error, making it a regularized average error (RAE):
$RAE = \frac{1}{N} \left( \sum_{i=1}^{N} ( y_i - f( x_i ) )^2 + k \sum_{j=1}^{M} w_j^2 \right)$
where $k$ is a regularization parameter, $w_j$ are the network weights, $N$ is the number of training examples, and $M$ is the number of all weights in the network.
The regularization acts as a roughness penalty, since small-magnitude weights imply a more "regular" approximation. A choice of k=0 favors network function surfaces that interpolate the example points tightly, while a large k favors flat function surfaces.
The choice of proper values for the regularization parameter k is subtle since it determines the degree of fitting the provided training examples and governs the amount of smoothing.
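The RAE formula can be illustrated with a few lines of NumPy. This is only a minimal sketch: the function name `regularized_average_error` and the toy values are hypothetical, and the network outputs f(x_i) are assumed to be already computed and passed in as `y_pred`.

```python
import numpy as np

def regularized_average_error(y, y_pred, weights, k):
    """RAE = (1/N) * (sum of squared residuals + k * sum of squared weights)."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n = y.shape[0]                              # N: number of examples
    squared_error = np.sum((y - y_pred) ** 2)   # data-fitting term
    penalty = k * np.sum(weights ** 2)          # weight decay term
    return (squared_error + penalty) / n

# Toy values, chosen only for illustration (not from the lecture).
y = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.5, 2.5]
w = [3.0, -4.0]

rae_no_penalty = regularized_average_error(y, y_pred, w, k=0.0)
rae_with_penalty = regularized_average_error(y, y_pred, w, k=0.1)
print(rae_no_penalty, rae_with_penalty)
```

With k=0 only the data-fitting term remains, while k=0.1 adds the penalty on the large weights, so the same fit is scored as a worse model.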
3.1.1. Selecting a Proper Regularization Parameter
Bearing in mind that the examples are usually noisy, an error function with too small a k will tend to overfit the training data, and hence the noise. This leads to wildly undulating surfaces that are unlikely to make good predictions.
At the other extreme, very large values of k produce quite smooth surfaces that underfit the examples and do not give good predictions either.
Proper values for the regularization parameter k may be selected from a certain interval relying on a proof that as long as k lies within this interval:
$0 < k < 2 \sigma^2 / ( w^T w )$
the mean squared error of the RAE approximator is smaller than that of the best least squares approximator without regularization.
The residual error over the training examples can be used as an unbiased estimator of the noise variance $\sigma^2$.
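The interval above can be evaluated numerically. The sketch below does so for a linear least squares model, which is an assumption made here for simplicity. The function name `regularization_upper_bound`, the synthetic data, and the choice of dividing the residual sum of squares by N - M (the usual unbiased estimate for a linear model with M weights) are all illustrative and not taken from the lecture.

```python
import numpy as np

def regularization_upper_bound(X, y):
    """Return (k_max, sigma2, w) with k_max = 2 * sigma2 / (w^T w)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = X.shape
    # Unregularized least squares weights.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ w
    # Assumption: noise variance estimated from the training residuals
    # with N - M degrees of freedom.
    sigma2 = residuals @ residuals / (n - m)
    k_max = 2.0 * sigma2 / (w @ w)
    return k_max, sigma2, w

# Synthetic noisy data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=50)

k_max, sigma2, w_ls = regularization_upper_bound(X, y)
print("estimated noise variance:", sigma2)
print("admissible interval: 0 < k <", k_max)
```

Any k inside the printed interval is a candidate; the sketch does not say which value within the interval is best.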
3.1.2. Network Learning with Weight Decay
The regularization technique applied to the error function requires modification of the training rules:
- in the case of gradient descent learning, the training rule becomes $w( t ) = ( 1 - \eta k ) w( t-1 ) - \eta \, \partial E_e / \partial w( t-1 )$, where $\eta$ is the learning rate
- in the case of least squares learning, the formula becomes $w = ( X^T X + kI )^{-1} X^T y$
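Both modified rules can be tried on a linear model. In the sketch below, the function names, the synthetic data, and the learning rate choice are hypothetical. It also assumes batch updates with the error E = 0.5 * ||Xw - y||^2 instead of the per-example error E_e, because with that choice the weight decay rule settles at the same weights as the regularized least squares formula, which makes the two easy to compare.

```python
import numpy as np

def ridge_least_squares(X, y, k):
    """Closed form: w = (X^T X + k I)^(-1) X^T y."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(m), X.T @ y)

def gradient_descent_weight_decay(X, y, k, eta, n_steps):
    """Iterate w(t) = (1 - eta*k) * w(t-1) - eta * dE/dw(t-1)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y)   # gradient of 0.5 * ||Xw - y||^2 (batch assumption)
        w = (1.0 - eta * k) * w - eta * grad
    return w

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=50)

k = 0.5
# Step size kept below 1 / (largest eigenvalue of X^T X + k) so the iteration is stable.
eta = 1.0 / (np.linalg.norm(X, 2) ** 2 + k)

w_ridge = ridge_least_squares(X, y, k)
w_gd = gradient_descent_weight_decay(X, y, k, eta, n_steps=500)
w_ols = ridge_least_squares(X, y, 0.0)

print("ridge weights:       ", w_ridge)
print("weight decay weights:", w_gd)
print("weight norms (k=0 vs k=0.5):", np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

The factor (1 - eta*k) shrinks the weights a little at every step, which is why the regularized solution has a smaller norm than the unregularized one.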
3.2. Early Stopping
The network error on the given data generally decreases during training as a function of the number of training epochs (learning cycles with all examples). However, the error on unseen testing examples typically begins to increase after a certain epoch. Therefore, the training process can be stopped at the point where the error on this held-out set is minimal.
When a quadratic error criterion is used, this early stopping technique leads to network performance similar to that achieved using weight decay regularization.
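The stopping rule can be sketched as a training loop that monitors the error on a held-out set and keeps the weights from the best epoch. Everything specific in the sketch is an assumption for illustration: a linear model stands in for the network, the function name `train_with_early_stopping` and the `patience` parameter (how many epochs to wait without improvement before stopping) are hypothetical, and the data is synthetic.

```python
import numpy as np

def train_with_early_stopping(X_train, y_train, X_val, y_val,
                              eta=0.05, max_epochs=500, patience=10):
    """Batch gradient descent that returns the weights with the lowest held-out error."""
    w = np.zeros(X_train.shape[1])
    best_w, best_val, best_epoch = w.copy(), np.inf, 0
    for epoch in range(1, max_epochs + 1):
        # One epoch: a gradient step on the mean squared training error.
        grad = X_train.T @ (X_train @ w - y_train) / len(y_train)
        w = w - eta * grad
        val_error = np.mean((X_val @ w - y_val) ** 2)
        if val_error < best_val:
            # New minimum of the held-out error: remember these weights.
            best_w, best_val, best_epoch = w.copy(), val_error, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            break
    return best_w, best_val, best_epoch

# Synthetic data (hypothetical): few noisy examples, many inputs.
rng = np.random.default_rng(1)
true_w = rng.normal(size=10)
X_train = rng.normal(size=(40, 10))
y_train = X_train @ true_w + rng.normal(size=40)
X_val = rng.normal(size=(40, 10))
y_val = X_val @ true_w + rng.normal(size=40)

best_w, best_val, best_epoch = train_with_early_stopping(X_train, y_train, X_val, y_val)
print("stopped at best epoch", best_epoch, "with held-out error", best_val)
```

Returning the stored best weights instead of the final ones matters, since the last few epochs before the loop exits are by construction no better on the held-out set.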
3.3. Training with Noise
Training with noise can be implemented by adding a random vector of real values to each training example.
In the case of incremental learning, a new random vector is added to each example before it is presented to the network for updating the weights.
In the case of batch learning, several copies of the training set are made in advance by perturbing each example with different random vectors.
Training with noise of small amplitude helps to improve network generalization and is also equivalent to weight decay regularization.
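The batch and incremental variants described above can be sketched as follows. The lecture only asks for a random vector of real values, so the use of Gaussian noise is an assumption, as are the function names, the linear model used for the incremental update, and the toy data.

```python
import numpy as np

def noisy_copies(X, y, n_copies, amplitude, rng):
    """Batch variant: stack several copies of the training set,
    each perturbed with a different random vector per example."""
    X = np.asarray(X, dtype=float)
    X_aug = np.concatenate(
        [X + amplitude * rng.normal(size=X.shape) for _ in range(n_copies)]
    )
    y_aug = np.tile(np.asarray(y, dtype=float), n_copies)  # targets are not perturbed
    return X_aug, y_aug

def incremental_epoch_with_noise(w, X, y, eta, amplitude, rng):
    """Incremental variant: add a fresh random vector to each example
    just before it is used for a weight update (linear model assumed)."""
    for i in rng.permutation(len(y)):
        x_noisy = X[i] + amplitude * rng.normal(size=X.shape[1])
        error = x_noisy @ w - y[i]
        w = w - eta * error * x_noisy   # gradient step on 0.5 * error^2
    return w

# Toy data (hypothetical, for illustration only).
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0])

X_aug, y_aug = noisy_copies(X, y, n_copies=5, amplitude=0.1, rng=rng)
w = incremental_epoch_with_noise(np.zeros(4), X, y, eta=0.05, amplitude=0.1, rng=rng)
print(X_aug.shape, y_aug.shape, w)
```

Here `amplitude` plays the role of the noise amplitude mentioned above; it is kept small so that the perturbed examples stay close to the originals.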
Suggested Readings:
S. Geman, E. Bienenstock and R. Doursat (1992). Neural Networks and the Bias/Variance Dilemma, Neural Computation, vol. 4, no. 1, pp. 1-58.
J. Moody (1994). Prediction Risk and Architecture Selection for Neural Networks, in: V. Cherkassky, J.H. Friedman and H. Wechsler (Eds.), From Statistics to Neural Networks: Theory and Pattern Recognition Applications, NATO ASI Series F, Springer-Verlag, New York, pp. 147-165.
C. Bishop (1995). Neural Networks for Pattern Recognition, Oxford University Press, Oxford, UK, Section 9, pp. 338-352.
S. Haykin (1999). Neural Networks: A Comprehensive Foundation, Second Edition, Prentice-Hall, Inc., New Jersey.