*CIS 311: Neural Networks*

**Neural Network Tuning and Overfitting Avoidance**

**1. The Bias/Variance Dilemma**

One of the most serious problems that arises in connectionist learning by neural networks
is *overfitting* of the provided training examples. This means that the learned function *fits the training data very closely* but *does not generalise well*, that is, it cannot model sufficiently well unseen data from the same task.

*Since the ultimate goal of machine learning is attaining generalisation, such overfitting problems should be detected and addressed carefully*.

Criteria for avoiding overfitting of the training data and increasing the generalisation are given by the *statistical* and *generalisation theories*.

1.1. Definitions

These criteria are employed to *balance the statistical bias and the statistical variance* during neural network learning, in order to achieve the smallest average generalisation error (on unseen data from the same task).

*Statistical bias is the complexity restriction that the neural network architecture imposes on the degree to which the target function can be fitted accurately.*
The statistical bias accounts only for the degree of fitting the given training data, not for the level of generalisation.

*Statistical variance is the deviation of the neural network learning efficacy from one data sample to another sample that could be described by the same target function model.*
It is the statistical variance that accounts for generalisation: it measures how well the neural network fits the examples without regard to the specificities of the provided data.

When learning from a fixed and finite example set, the residual error may be low, but this is not enough for high generalisation, since the examples are not reliable due to noise and uncertainties.
Such difficulties may be addressed by *decomposing the error* *E*_{D} of the neural network:

*E*_{D}[ ( *f*( **x** ) - *E*[ *y*|**x** ] )^{2} ] = ( *E*_{D}[ *f*( **x** ) ] - *E*[ *y*|**x** ] )^{2} + *E*_{D}[ ( *f*( **x** ) - *E*_{D}[ *f*( **x** ) ] )^{2} ]

where: BIAS( *f*( **x** ) ) = *E*_{D}[ *f*( **x** ) ] - *E*[ *y*|**x** ]

and VAR( *f*( **x** ) ) = *E*_{D}[ ( *f*( **x** ) - *E*_{D}[ *f*( **x** ) ] )^{2} ]
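The decomposition above can be checked numerically. The sketch below is a minimal simulation under assumed details: the task is y = sin(x) plus Gaussian noise, and a low-degree polynomial fit stands in for a small neural network; many training samples *D* are drawn, and the averaged squared error at fixed query points is compared against BIAS^{2} + VAR.

```python
# Numerical check of E_D[(f(x) - E[y|x])^2] = BIAS^2 + VAR.
# Assumptions: task y = sin(x) + noise; a degree-2 polynomial plays the
# role of the neural network f(x); 500 training samples D are simulated.
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0.0, np.pi, 50)          # fixed query points x
f_true = np.sin(x_test)                        # E[y|x], the true regressor

preds = []
for _ in range(500):                           # many data samples D
    x = rng.uniform(0.0, np.pi, 30)
    y = np.sin(x) + rng.normal(0.0, 0.3, 30)   # noisy training targets
    coeffs = np.polyfit(x, y, 2)               # fit one model per sample D
    preds.append(np.polyval(coeffs, x_test))
preds = np.array(preds)                        # shape (500, 50)

mean_f = preds.mean(axis=0)                    # E_D[ f(x) ]
bias2 = (mean_f - f_true) ** 2                 # BIAS^2 at each x
var = preds.var(axis=0)                        # VAR at each x
mse = ((preds - f_true) ** 2).mean(axis=0)     # E_D[ (f(x) - E[y|x])^2 ]

# The decomposition holds pointwise: MSE = BIAS^2 + VAR
assert np.allclose(mse, bias2 + var)
```

The identity holds exactly for any sample of models, since it is an algebraic property of the mean and population variance.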

The neural network performance can be improved if we *reduce both the statistical bias and the statistical variance*. However, there is a natural trade-off between the bias and the variance, the so-called *bias/variance dilemma*:

*A neural network that fits the provided training examples closely has a low bias but a high variance. If we reduce the network variance, this will lead to a decrease in the level of fitting the data*.

A possible strategy for reducing both the statistical bias and the statistical variance is to consider more data points, that is, to increase the number of training examples.

A possible strategy for reducing the statistical bias is growing the neural network.

A possible strategy for reducing the statistical variance is pruning the neural network.
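The trade-off behind these strategies can be illustrated by simulation. In the sketch below, polynomial degree stands in for network size (an assumption, not part of the original text): "growing" (a higher degree) lowers the bias, while "pruning" (a lower degree) lowers the variance.

```python
# Illustration of the bias/variance dilemma, with polynomial degree as an
# assumed stand-in for neural network size. Growing the model reduces bias;
# pruning it reduces variance.
import numpy as np

rng = np.random.default_rng(1)
x_test = np.linspace(-1.0, 1.0, 40)
f_true = np.sin(3.0 * x_test)                  # assumed target function

def bias2_and_var(degree, n_samples=300, n_points=30, noise=0.2):
    """Estimate integrated BIAS^2 and VAR for one model size."""
    preds = []
    for _ in range(n_samples):
        x = rng.uniform(-1.0, 1.0, n_points)
        y = np.sin(3.0 * x) + rng.normal(0.0, noise, n_points)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    preds = np.array(preds)
    bias2 = ((preds.mean(axis=0) - f_true) ** 2).mean()
    var = preds.var(axis=0).mean()
    return bias2, var

small = bias2_and_var(1)    # pruned model: high bias, low variance
large = bias2_and_var(7)    # grown model: low bias, higher variance

assert small[0] > large[0]  # growing reduced the bias ...
assert small[1] < large[1]  # ... at the cost of increased variance
```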

1.2. Measuring the Bias and the Variance

The statistical bias and variance estimate how close the learned neural network model *f*( **x** ), with estimated weights **ŵ**, is to the true regression function *f**( **x** ) = *E*[ *y*|**x** ] = **X**^{T}**w** with true weights **w**.
The corresponding mean squared error MSE is written simply as a sum of the squared bias and the variance:

MSE( *f*( **x** ) ) = BIAS^{2}( *f*( **x** ) ) + VAR( *f*( **x** ) )

There is a pragmatic methodology for measuring the *integrated bias and variance* [Geman et *al*., 1992] (page 28). The best model inferred from the neural network using the given training
examples is assumed to be the true regressor:

*f**( **x** ) = **X**^{T}**w** = *E*[ *y*|**x** ]

Next, a number of different sets *D*_{1}, *D*_{2}, ..., *D*_{10}, each of size *N*/2, are randomly generated
from the same training series. Ten models are produced by estimating the neural network with each corresponding data set:

*f*_{D1}( **x** ), *f*_{D2}( **x** ), ..., *f*_{D10}( **x** )

The statistical bias is assessed with the future testing data of size *S* using the formula:

BIAS^{2}( *f*( **x** ) ) = ( 1 / *S* ) ∑_{i=1}^{S} ( *f*_{m}( **x**_{i} ) - *f**( **x**_{i} ) )^{2}

where *f*_{m}( **x**_{i} ) is the average from these 10 neural networks:

*f*_{m}( **x**_{i} ) = ( 1 / 10 ) ∑_{j=1}^{10} *f*_{Dj}( **x**_{i} )

The statistical variance is assessed as follows:

VAR( *f*( **x** ) ) = ( 1 / 10 ) ∑_{j=1}^{10} ( 1 / *S* ) ∑_{i=1}^{S} ( *f*_{Dj}( **x**_{i} ) - *f*_{m}( **x**_{i} ) )^{2}

where *j* enumerates the neural networks *f*_{D1}( **x** ), *f*_{D2}( **x** ), ..., *f*_{D10}( **x** ).

Empirically, the variance contribution VAR to the MSE error is taken to estimate how strongly the neural network depends on the particular circumstances in the training examples.
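The measurement procedure above can be sketched in code. The details below are assumptions made for illustration: an ordinary least-squares line **X**^{T}**w** stands in for the trained neural network, the model fitted on all *N* examples serves as the "true" regressor *f**, and the ten half-size subsets are drawn at random from the training series.

```python
# Sketch of the integrated bias/variance measurement [Geman et al., 1992],
# with a least-squares linear model assumed in place of a neural network.
import numpy as np

rng = np.random.default_rng(2)
N, S = 200, 100                                # training size, test size

# Assumed synthetic training series: y = 2x - 1 + noise
x_train = rng.uniform(-1.0, 1.0, N)
y_train = 2.0 * x_train - 1.0 + rng.normal(0.0, 0.5, N)
x_test = rng.uniform(-1.0, 1.0, S)             # future testing data

def fit(x, y):
    """Least-squares fit of f(x) = w1*x + w0, i.e. X^T w."""
    return np.polyfit(x, y, 1)

w_star = fit(x_train, y_train)                 # best model on all N examples
f_star = np.polyval(w_star, x_test)            # assumed true regressor f*

# Ten models f_D1 ... f_D10, each estimated from a random subset of size N/2
preds = []
for _ in range(10):
    idx = rng.choice(N, N // 2, replace=False)
    preds.append(np.polyval(fit(x_train[idx], y_train[idx]), x_test))
preds = np.array(preds)                        # shape (10, S)

f_m = preds.mean(axis=0)                       # average of the 10 networks
bias2 = ((f_m - f_star) ** 2).mean()           # integrated BIAS^2( f(x) )
var = ((preds - f_m) ** 2).mean()              # integrated VAR( f(x) )
print(f"BIAS^2 = {bias2:.5f}, VAR = {var:.5f}")
```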

**2. Nonlinear Cross Validation**

The generalisation error can be estimated by *nonlinear cross-validation* (NCV) [Moody, 1994].
Nonlinear cross-validation is a statistical estimation procedure developed especially for nonlinear models such as neural networks.

__Nonlinear Cross-validation Algorithm__

- *train a neural network* using all the provided examples *D* = {( **x**_{e}, *y*_{e} )}, *e* = 1, 2, ..., *N*;

- *prepare* a number of disjoint subsets *D*_{i}, *i* = 1, 2, ..., *v*, from the training data *D*;

- *re-estimate the trained neural network* with the subsets of remaining examples *D* \ *D*_{j}, producing a perturbed network version *F*_{j} for each *j* = 1, 2, ..., *v*;

- *calculate* the cross-validation error of each perturbed network version *F*_{j} with the held-out subset *D*_{j} of size *N*_{j},

that is with the data excluded when re-estimating *F*_{j}:

ncv_{Dj}^{Fj} = ( 1 / *N*_{j} ) ∑_{( **x**_{e}, *y*_{e} ) ∈ *D*_{j}} ( *y*_{e} - *F*_{j}( **x**_{e} ) )^{2}

- *obtain the v-fold cross-validation estimate* of the prediction error by averaging the errors of all perturbed neural network versions *F*_{j}:

NCV = ( 1 / *v* ) ∑_{j=1}^{v} ncv_{Dj}^{Fj}
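The steps above can be sketched as follows. Two simplifying assumptions are made for illustration: a least-squares polynomial stands in for the neural network, and each perturbed version *F*_{j} is re-fitted from scratch on *D* \ *D*_{j} rather than re-estimated from the trained weights, as the algorithm specifies for a real network.

```python
# Minimal sketch of v-fold nonlinear cross-validation (NCV).
# Assumptions: a degree-3 least-squares polynomial plays the role of the
# neural network; re-estimation is approximated by full re-fitting.
import numpy as np

rng = np.random.default_rng(3)
N, v = 120, 6                                   # N examples, v folds

# Assumed synthetic task: y = sin(2x) + noise
x = rng.uniform(-1.0, 1.0, N)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.3, N)

folds = np.array_split(rng.permutation(N), v)   # disjoint subsets D_1..D_v

ncv_errors = []
for j in range(v):
    held_out = folds[j]                         # subset D_j
    rest = np.concatenate([folds[i] for i in range(v) if i != j])
    w_j = np.polyfit(x[rest], y[rest], 3)       # perturbed version F_j on D \ D_j
    resid = y[held_out] - np.polyval(w_j, x[held_out])
    ncv_errors.append(np.mean(resid ** 2))      # ncv of F_j on D_j

NCV = np.mean(ncv_errors)                       # v-fold prediction-risk estimate
print(f"NCV = {NCV:.4f}")
```

With this setup the NCV estimate converges toward the noise level of the task as the fit improves, which is exactly the prediction risk the procedure is meant to report.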

**Suggested Readings**:

S. Geman, E. Bienenstock and R. Doursat (1992). Neural Networks and the Bias/Variance Dilemma,
*Neural Computation*, vol. 4, no. 1, pp. 1-58.

J. Moody (1994). Prediction Risk and Architecture Selection for Neural Networks, in:
V.Cherkassky, J.H.Friedman and H.Wechsler (Eds.), *From Statistics to Neural Networks:
Theory and Pattern Recognition Applications*, NATO ASI Series F,
Springer-Verlag, New York, pp. 147-165.

C. Bishop (1995). *Neural Networks for Pattern Recognition*,
Oxford University Press, Oxford, UK, Section 9, pp.332-338.

S. Haykin (1999). *Neural Networks: A Comprehensive Foundation*,
Second Edition, Prentice-Hall, Inc., New Jersey.