CIS 311: Neural Networks

Neural Network Tuning and Overfitting Avoidance

1. The Bias/Variance Dilemma

One of the most serious problems that arises in neural network learning is overfitting of the provided training examples. The learned function then fits the training data very closely but does not generalise well; that is, it cannot model unseen data from the same task sufficiently well.

Since the ultimate goal of machine learning is generalisation, such overfitting problems should be detected and addressed carefully.

Statistical and generalisation theories provide criteria for avoiding overfitting of the training data and for increasing generalisation.

1.1. Definitions

These criteria are employed to balance the statistical bias and the statistical variance during neural network learning, in order to achieve the smallest average generalisation error (on unseen data from the same task).

Statistical bias is the restriction that the complexity of the neural network architecture imposes on how accurately the target function can be fitted. The statistical bias accounts only for the degree of fitting the given training data, not for the level of generalisation.

Statistical variance is the deviation of the learned neural network from one data sample to another sample that could be described by the same target function model. It is the variance that accounts for generalisation: it measures how strongly the fitted network depends on the specificities of the particular data provided.

When learning from a fixed and finite example set, the residual error may be low, but this alone is not enough for high generalisation, since the examples are unreliable due to noise and other uncertainties. Such difficulties can be analysed by decomposing the error E_D of the neural network f( x ), taken as an approximation of the true function f^( x ) = E[ y|x ] with respect to the training data D, into two components:

E_D[ ( f( x ) - E[ y|x ] )^2 ] = ( E_D[ f( x ) ] - E[ y|x ] )^2 + E_D[ ( f( x ) - E_D[ f( x ) ] )^2 ] = BIAS^2( f( x ) ) + VAR( f( x ) )

where: BIAS( f( x ) ) = E_D[ f( x ) ] - E[ y|x ] is the statistical bias
and VAR( f( x ) ) = E_D[ ( f( x ) - E_D[ f( x ) ] )^2 ] is the statistical variance
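This decomposition can be checked numerically with a short Monte Carlo sketch (an illustration, not part of the lecture: all concrete choices are assumptions, with a degree-3 polynomial least-squares fit standing in for the network f( x ) and sin( 2*pi*x ) as the target E[ y|x ]). Fitting the same model to many data sets D confirms that the averaged squared error equals BIAS^2 plus VAR at every evaluation point:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed target function E[y|x] for the illustration.
    return np.sin(2 * np.pi * x)

def sample_dataset(n=30, noise=0.3):
    # One training set D: noisy observations y = E[y|x] + noise.
    x = rng.uniform(0.0, 1.0, n)
    return x, true_f(x) + rng.normal(0.0, noise, n)

def fit_predict(x_train, y_train, x_eval, degree=3):
    # A degree-3 polynomial fit stands in for the learned network f(x).
    return np.polyval(np.polyfit(x_train, y_train, degree), x_eval)

x_eval = np.linspace(0.05, 0.95, 50)   # points where f(x) is evaluated
n_datasets = 500                       # Monte Carlo estimate of E_D[.]

preds = np.array([fit_predict(*sample_dataset(), x_eval)
                  for _ in range(n_datasets)])

mean_pred = preds.mean(axis=0)                        # E_D[ f(x) ]
bias2 = (mean_pred - true_f(x_eval)) ** 2             # BIAS^2( f(x) )
var = ((preds - mean_pred) ** 2).mean(axis=0)         # VAR( f(x) )
mse = ((preds - true_f(x_eval)) ** 2).mean(axis=0)    # E_D[ (f(x) - E[y|x])^2 ]

# The decomposition holds pointwise: MSE = BIAS^2 + VAR.
assert np.allclose(mse, bias2 + var)
```

The assertion holds exactly (up to floating point), since the decomposition is an algebraic identity once E_D[ . ] is replaced by the sample average over the generated data sets.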

The neural network performance can be improved by reducing both the statistical bias and the statistical variance. However, there is a natural trade-off between bias and variance, the so-called bias/variance dilemma:

A neural network that fits the provided training examples closely has a low bias but a high variance. Reducing the network variance leads to a decrease in the degree of fitting the data, and hence to a higher bias.

A possible strategy for reducing both the statistical bias and the variance is to consider more data points, that is, to increase the number of training examples.
A possible strategy for reducing the statistical bias is growing the neural network.
A possible strategy for reducing the statistical variance is pruning the neural network.

1.2. Measuring the Bias and the Variance

The statistical bias and variance estimate how close the learned neural network model f( x ), with its estimated weights, is to the true regression function f^( x ) = E[ y|x ] = X^T w with true weights w. The corresponding mean squared error MSE is simply the sum of the squared bias and the variance:

MSE( f( x ) ) = BIAS^2( f( x ) ) + VAR( f( x ) )

There is a pragmatic methodology for measuring the integrated bias and variance [Geman et al., 1992] (page 28). The best model inferred from the neural network using all the given training examples is assumed as the true regressor:

f*( x ) = X^T w = E[ y|x ]

Next, a number of different data sets D1, D2, ..., D10, each of size N/2, are randomly drawn from the same training series. Ten models are produced by estimating the neural network with each corresponding data set:

f*_D1( x ), f*_D2( x ), ..., f*_D10( x )

The statistical bias is assessed on the future testing data, containing S examples, using the formula:

BIAS^2( f( x ) ) = ( 1 / S ) Σ_{i=1}^{S} ( f_m( x_i ) - f*( x_i ) )^2

where: f_m( x_i ) is the average output of these 10 neural networks, f_m( x_i ) = ( 1 / 10 ) Σ_{j=1}^{10} f*_Dj( x_i ), at the i-th testing example vector x_i.

The statistical variance is assessed as follows:

VAR( f( x ) ) = ( 1 / S ) Σ_{i=1}^{S} ( ( 1 / 10 ) Σ_{j=1}^{10} ( f*_Dj( x_i ) - f_m( x_i ) )^2 )

where j enumerates the neural networks f*_D1( x ), f*_D2( x ), ..., f*_D10( x ).

Empirically, the variance contribution VAR to the MSE error is taken as an estimate of how strongly the neural network depends on the particularities of the training examples.
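The measurement procedure above can be sketched in code. Everything concrete below is an assumption for illustration, since the lecture fixes no particular model or task: a degree-5 polynomial least-squares fit stands in for the neural network, the fit on all N training examples plays the role of the assumed true regressor f*( x ), and an evenly spaced grid serves as the S testing inputs:

```python
import numpy as np

rng = np.random.default_rng(1)

# Training series of N examples (assumed toy task for the illustration).
N = 200
x_train = rng.uniform(0.0, 1.0, N)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, N)

# The model fitted on ALL examples is assumed as the true regressor f*(x).
f_star_coeffs = np.polyfit(x_train, y_train, 5)

# Ten data sets D1..D10 of size N/2, randomly drawn from the training series.
x_test = np.linspace(0.05, 0.95, 100)          # future testing data, S = 100
preds = []
for j in range(10):
    idx = rng.choice(N, size=N // 2, replace=False)
    coeffs = np.polyfit(x_train[idx], y_train[idx], 5)   # model f*_Dj
    preds.append(np.polyval(coeffs, x_test))
preds = np.array(preds)                        # shape (10, S)

f_m = preds.mean(axis=0)                       # f_m(x_i): average of 10 models
f_star = np.polyval(f_star_coeffs, x_test)     # f*(x_i) on the test inputs

bias2 = np.mean((f_m - f_star) ** 2)           # BIAS^2( f(x) )
var = np.mean((preds - f_m) ** 2)              # VAR( f(x) ): average over i, j

print(f"BIAS^2 = {bias2:.5f}  VAR = {var:.5f}")
```

Since all ten models share the architecture of the assumed true regressor, the measured BIAS^2 is small here; VAR reflects how much each half-size sample pulls the fit away from the averaged model.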

2. Nonlinear Cross Validation

The generalisation error can be estimated by nonlinear cross-validation (NCV) [Moody, 1994], a statistical estimation procedure developed especially for nonlinear models such as neural networks.

Nonlinear Cross-validation Algorithm

- train a neural network using all the provided examples D = { ( x_e, y_e ) }_{e=1}^{N}

- prepare a number of disjoint subsets D_i of D, 1 ≤ i ≤ v, so that the subsets D_i contain non-overlapping examples

- re-estimate the trained neural network with each set of remaining examples D_j = D \ D_i, where D_j ∩ D_i = ∅ and D_j contains N_j examples; this produces the perturbed network versions F_j

- calculate the cross-validation error of each perturbed network version F_j on the corresponding held-out subset D_i, that is on all the data except D_j ( N_i = N - N_j ):

ncv_Di( F_j ) = ( 1 / N_i ) Σ_{e ∈ Di} ( y_e - F_j( x_e ) )^2 for each F_j, 1 ≤ j ≤ v

- obtain the v-fold cross-validation estimate of the prediction error by averaging the errors of all perturbed neural network versions F_j:

NCV = ( 1 / v ) Σ_{j=1}^{v} ncv_Di( F_j )
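The NCV algorithm can be sketched end to end. All concrete choices below are assumptions for illustration (a tiny one-hidden-layer tanh network trained by full-batch gradient descent on a toy task, with v = 5 folds); the re-estimation step continues training from the weights obtained on the full data set, as the algorithm prescribes:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression task with N examples (assumed for the illustration).
N = 120
X = rng.uniform(-1.0, 1.0, (N, 1))
y = np.sin(3.0 * X[:, 0]) + rng.normal(0.0, 0.2, N)

def train(X, y, w=None, hidden=8, epochs=2000, lr=0.05):
    """One-hidden-layer tanh network, full-batch gradient descent on MSE.
    Passing existing weights w re-estimates (perturbs) a trained network."""
    n, d = X.shape
    if w is None:
        W1 = rng.normal(0.0, 0.5, (d, hidden)); b1 = np.zeros(hidden)
        W2 = rng.normal(0.0, 0.5, hidden);      b2 = np.zeros(1)
    else:
        W1, b1, W2, b2 = (p.copy() for p in w)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                  # hidden activations
        err = H @ W2 + b2 - y                     # output minus target
        gW2 = H.T @ err / n
        gb2 = err.mean()
        dH = np.outer(err, W2) * (1.0 - H ** 2)   # backprop through tanh
        gW1 = X.T @ dH / n
        gb1 = dH.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2

def predict(w, X):
    W1, b1, W2, b2 = w
    return np.tanh(X @ W1 + b1) @ W2 + b2

# Step 1: train a network on all provided examples D.
w_full = train(X, y)

# Step 2: v disjoint subsets Di of D.
v = 5
folds = np.array_split(rng.permutation(N), v)

# Steps 3-4: re-estimate with the remaining examples Dj = D \ Di (perturbed
# version Fj), then measure its error on the held-out subset Di.
ncv_errors = []
for Di in folds:
    Dj = np.setdiff1d(np.arange(N), Di)
    w_j = train(X[Dj], y[Dj], w=w_full, epochs=500)   # perturbed Fj
    ncv_errors.append(np.mean((y[Di] - predict(w_j, X[Di])) ** 2))

# Step 5: average over the perturbed versions.
NCV = np.mean(ncv_errors)
print(f"NCV estimate of the prediction error: {NCV:.4f}")
```

Warm-starting each fold from the full-data weights is what distinguishes this from plain v-fold cross-validation: only a short re-estimation run is needed per fold, since each perturbed network starts near a solution.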