*CIS 311: Neural Networks*

**Practical Aspects of Backpropagation**

*Example*: Learning the XOR examples with a one-hidden-layer network that has two hidden units and one output unit (assuming a threshold activation function)

*Solution*: Such a multilayer network correctly classifies the XOR examples with the following weights:

$$
\begin{array}{lll}
w_{10} = -3/2 & w_{20} = -1/2 & w_{30} = -1/2 \\
w_{11} = 1 & w_{21} = 1 & w_{31} = -2 \\
w_{12} = 1 & w_{22} = 1 & w_{32} = 1
\end{array}
$$

| (x1, x2) | Y | Explanation |
|---|---|---|
| (0,0) | 0 | The output *unit 3 is off* because the hidden units 1 and 2 are off |
| (0,1) | 1 | The output *unit 3 is on* due to the positive excitation from unit 2 |
| (1,0) | 1 | The output *unit 3 is on* due to the positive excitation from unit 2 |
| (1,1) | 0 | The output *unit 3 is off* because of the inhibitory effect from unit 1 |
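The solution can be checked by hand or with a few lines of code. The following is a minimal sketch, not part of the original notes. It assumes that each weight is indexed as (receiving unit, source), that source 0 is a bias input fixed at 1, that the third row of weights refers to the second input (w12, w22, w32), and that a threshold unit is on when its net input is positive. The names `step`, `W` and `xor_net` are illustrative.

```python
def step(net):
    """Threshold activation: the unit is on (1) when its net input is positive."""
    return 1 if net > 0 else 0

# Weights per receiving unit: (bias weight, weight from source 1, weight from source 2).
# Assumed reading of the weights listed in the solution.
W = {
    1: (-1.5, 1.0, 1.0),   # hidden unit 1: w10, w11, w12
    2: (-0.5, 1.0, 1.0),   # hidden unit 2: w20, w21, w22
    3: (-0.5, -2.0, 1.0),  # output unit 3: w30, w31, w32
}

def xor_net(x1, x2):
    """Forward pass: return the states of hidden units 1 and 2 and of output unit 3."""
    h1 = step(W[1][0] + W[1][1] * x1 + W[1][2] * x2)
    h2 = step(W[2][0] + W[2][1] * x1 + W[2][2] * x2)
    y = step(W[3][0] + W[3][1] * h1 + W[3][2] * h2)
    return h1, h2, y

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h1, h2, y = xor_net(x1, x2)
    print((x1, x2), "hidden units:", (h1, h2), "output:", y)
```

Under these assumptions unit 1 fires only when both inputs are on, unit 2 fires when at least one input is on, and the weight of -2 from unit 1 overrides the excitation from unit 2 for the input (1,1).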

**1. Practical Aspects of Backpropagation**

While it is possible to get excellent fits to training data, applying backpropagation so that the network also performs well on independent test data is fraught with difficulties and pitfalls. Compared with most of the learning systems discussed previously, far more choices have to be made when applying the gradient descent method.

The key choices are:

· *the learning rate and local minima* - the selection of a learning rate is of critical importance in finding the true global minimum of the error distance.

*Backpropagation training with too small a learning rate will make agonizingly slow progress. Too large a learning rate will proceed much faster, but may simply produce oscillations between relatively poor solutions*.

Both of these conditions are generally detectable through experimentation and sampling of results after a fixed number of training epochs.

Typical values for the learning rate parameter are numbers between 0 and 1:

$0.05 < \eta < 0.75$

One would like to use the largest learning rate that still converges to the minimum solution.
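The effect of the learning rate can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not a neural network: it runs plain gradient descent on the one-dimensional error E(w) = w², and the function name `descend` and the three rates are chosen only for the demonstration. The rate at which oscillation sets in depends on the curvature of the error surface, so the specific numbers do not carry over to other problems.

```python
def descend(eta, w=1.0, steps=20):
    """Run plain gradient descent on the toy error E(w) = w**2 and return the path of w."""
    path = [w]
    for _ in range(steps):
        grad = 2.0 * w       # dE/dw for E(w) = w**2
        w = w - eta * grad   # gradient descent step with learning rate eta
        path.append(w)
    return path

for eta in (0.01, 0.4, 1.0):
    path = descend(eta)
    print(f"eta={eta}: first steps {[round(v, 3) for v in path[:4]]}, "
          f"w after {len(path) - 1} steps = {path[-1]:.4g}")
```

On this toy error the smallest rate is still far from the minimum at w = 0 after 20 steps, the intermediate rate converges quickly, and the largest rate jumps back and forth between w = 1 and w = -1 without making progress.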

· *momentum* - empirical evidence shows that adding a term called *momentum* to the backpropagation algorithm can help speed convergence and avoid local minima.

*The idea behind momentum is to stabilize the weight change by making nonradical revisions that combine the gradient descent term with a fraction of the previous weight change*:

$\Delta w(t) = -\partial E_e / \partial w(t) + \alpha \, \Delta w(t-1)$

where $\alpha$ is taken in the range $0 \le \alpha \le 0.9$, and *t* is the index of the current weight change.

This gives the system a certain amount of inertia since the weight vector will tend to continue moving in the same direction unless opposed by the gradient term.

The momentum has the following effects:

- it smooths the weight changes and suppresses cross-stitching, that is, it cancels side-to-side oscillations across the error valley;

- when all weight changes are in the same direction, the momentum amplifies the learning rate, causing faster convergence;

- it enables the search to escape from small local minima on the error surface.

*The hope is that the momentum will allow a larger learning rate and that this will speed convergence and avoid local minima. On the other hand, a learning rate of 1 with no momentum will be much faster when no problem with local minima or non-convergence is encountered.*
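The first two effects of momentum can be seen in a small numerical sketch. It is an illustration only: the function name `momentum_step` is hypothetical, and scaling the gradient term by a learning rate `eta` is an assumption added here, since the update rule in the text writes the gradient term without an explicit rate. The gradient values are made up to represent the two extreme cases.

```python
def momentum_step(grad, prev_delta, eta=0.1, alpha=0.9):
    """Weight change: the gradient term plus a fraction alpha of the previous change."""
    return -eta * grad + alpha * prev_delta

# Case 1: the gradient keeps the same sign, so successive changes accumulate.
delta = 0.0
for _ in range(200):
    delta = momentum_step(grad=1.0, prev_delta=delta)
print("steady gradient, size of weight change:", round(abs(delta), 4))

# Case 2: the gradient flips sign at every step (side-to-side oscillation).
delta = 0.0
for t in range(200):
    delta = momentum_step(grad=(-1.0) ** t, prev_delta=delta)
print("alternating gradient, size of weight change:", round(abs(delta), 4))
```

With a steady gradient the size of the weight change approaches eta / (1 - alpha), ten times the plain gradient step for alpha = 0.9, which is the amplification described above. With an alternating gradient it settles near eta / (1 + alpha), roughly half the plain step, which is the damping of side-to-side oscillations.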

· *sequential or random presentation* - the epoch is the fundamental unit for training, and the length of training is often measured in epochs. During a training epoch in which the weights are revised after each example, the examples can be presented in the same sequential order, or in a different random order for each epoch. Random presentation usually yields better results.

The randomness has advantages and disadvantages:

- *advantages*: it gives the algorithm some stochastic search properties. The weight state tends to jitter around its equilibrium and may occasionally visit nearby points, so it may avoid being trapped in suboptimal weight configurations.
*On-line learning may have a better chance of finding a global minimum than the true gradient descent technique*.

- *disadvantages*: the weight vector never settles to a stable configuration. Having found a good minimum it may then continue to wander around it.
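The two presentation schemes differ only in how the order of the examples is chosen at the start of each epoch. The sketch below shows that difference in isolation, with the weight revision itself left out. The function name `epoch_orders` and its parameters are hypothetical, and the fixed seed is used only to make the run repeatable.

```python
import random

def epoch_orders(examples, epochs, shuffle, seed=0):
    """Return the order in which the examples are presented in each epoch."""
    rng = random.Random(seed)  # private generator so the run is repeatable
    orders = []
    for _ in range(epochs):
        order = list(examples)
        if shuffle:
            rng.shuffle(order)  # a different random order for each epoch
        # In actual training the weights would be revised after each example in `order`.
        orders.append(order)
    return orders

xor_examples = [(0, 0), (0, 1), (1, 0), (1, 1)]
print("sequential:", epoch_orders(xor_examples, epochs=3, shuffle=False))
print("random:    ", epoch_orders(xor_examples, epochs=3, shuffle=True))
```

Every epoch still presents each example exactly once in both schemes; only the order changes.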

· *random initial state* - unlike many other learning systems, the neural network begins in a random state. The network weights are initialized to random numbers, typically in the range between -0.5 and 0.5 (the inputs are usually normalized to numbers between 0 and 1). *Even with identical learning conditions, the random initial weights can lead to results that differ from one training session to another*.
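A minimal sketch of such an initialization follows. The function name `init_weights`, the uniform distribution, and the use of a seed are assumptions made for the illustration; the text only states the typical range of the initial weights.

```python
import random

def init_weights(n_weights, seed, low=-0.5, high=0.5):
    """Draw n_weights initial weights uniformly from the range [low, high]."""
    rng = random.Random(seed)  # the seed fixes the random initial state
    return [rng.uniform(low, high) for _ in range(n_weights)]

# Two training sessions for a network with 9 weights, started from different seeds.
session_a = init_weights(9, seed=1)
session_b = init_weights(9, seed=2)
print([round(w, 3) for w in session_a])
print([round(w, 3) for w in session_b])
```

Recording the seed makes a particular session repeatable, while running several sessions with different seeds gives an impression of how much the results vary with the initial state.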

**Suggested Readings**:

Bishop, C. (1995) *Neural Networks for Pattern Recognition*, Oxford University Press, Oxford, UK, pp. 116-149.