CIS 311: Neural Networks

Multilayer Perceptrons

(Continuation)

2. Backpropagation Learning Algorithm

MLP became applicable on practical tasks after the discovery of a supervised training algorithm for learning their weights, this is the backpropagation learning algorithm. The backpropagation algorithm for training multilayer neural networks is a generalization of the LMS training procedure for nonlinear logistic outputs. As with the LMS procedure, training is iterative with the weights adjusted after the presentation of each example.

The error backpropagation algorithm includes two passes through the network:
- forward pass and
- backward pass.

During the backward pass the weights are adjusted in accordance with the error correction rule. It suggests that the actual network output is subtracted from the given output in the example. The weights are adjusted so as to make the network output more closer to the desired one.

The backpropagation algorithm does gradient descent as it moves in direction opposite to the gradient of the error, that is in direction of the steepest decrease of the error. This is the direction of most rapid error decrease by varying all the weights simultaneously in proportion of how much good is done by the individual changes:

Ñ E( w ) = [ E/w0, E/w1, …, E/ wd ]

This is the gradient descent learning strategy that requires a continuous activation function at the nodes of the MLP network.

The backpropagation training algorithm uses the gradient search technique to minimize a cost function equal to the mean square difference between the desired and actual net outputs. The network is trained initially selecting small random weights and then presenting all training data incrementally. Weights are adjusted after every trial using side information specifying the correct class until weights converge and the cost function is reduced to an acceptable value.

The overall idea of backpropagation relies on examining the interdependencies that influence the computation of the gradient in the weight space:

·effect on a particular weight change from the output of the neuron;

·effect on a particular weight change from all neurons in the next layer;

·effect on a particular weight change from the output nodes.

Backpropagation Training Algorithm

Initialization: Examples {( xe, ye )}e=1N, initial weights wi set to small random values, learning rate h = 0.1
Repeat

for each training example ( x, y ) - calculate the outputs using the sigmoid function:

oj = s ( sj ) = 1 / ( 1 + e-sj ), sj = Si=0d wij oi where oi = xi // comment: at the hidden units j

ok = s ( sk ) = 1 / ( 1 + e-sk ), sk = S i=0d wjk oj // comment: at the output units k

compute the error derivatives bk at the nodes k in the output layer:

bk = ok ( 1 - ok ) [ yk - ok] // comment: effects from the output nodes

compute the changes for weights j®k on connections to nodes in the output layer:

D wjk = h bk oj // comment: effects from the output of the neuron

D w0k = h bk

compute the error derivatives bj for the hidden nodes j with the formula:

bj = oj ( 1 - oj ) [ åk bk wjk] // comment: effects from multiple nodes in the next layer

compute the changes for the weights i®j on connections to nodes in the hidden layer:

D wij = h bj oi // comment: effects from the output of the neuron

D w0j= h bj

- update the weights by the computed changes:

w = w + Dw

until termination condition is satisfied.

3. Derivation of the Backpropagation Algorithm

The backpropagation algorithm for training the multilayer perceptron implements a generalized delta rule according to which with each training example every weight is updated as follows:

w = w + D w

where: D w= - h Ee / w and Ee = ( 1/2 ) Sk ( yk - ok )2

The implementation of the generalized delta rule requires to derive an expression for the computation of the derivatives Ee / w:

Ee / w = ( Ee / s )( s / w )

The first part Ee / s reflects the change of the error as a function of the change in the network weighted input to the unit. The second part s / w reflects the change of the error as a function of the change of particular weight w to that node. Since:

s / w = ( Sl wl ol ) / w = o // comment: from the case wl = w

the expression is reduced as follows: Ee / w = ( Ee / s ) o

Delta Rule for weights j®k on connections to nodes in the output layer

Ee / wjk = ( Ee / sk ) oj

Ee / sk = ( Ee / ok )( ok / sk )

Ee / ok = ( (( 1/2 ) Sl ( yl - ol )2 )) / ok

= ( (( 1/2 )( yk - ok )2 )) / ok // comment: from the case l = k

= ( 1/2 ) 2 ( yk - ok ) [ ( yk - ok ) / ok]

= - ( yk - ok )

ok / sk= s ( sk ) / sk

= ok ( 1 - ok ) // comment: from s ( sk ) = 1 / ( 1 + e-sk )

Therefore:

Ee / sk = - ( yk - ok ) ok ( 1 - ok )

and when we substitute: bk = ok ( 1 - ok ) [ yk - ok]

the Delta rule for the output units becomes:

D wjk = -Ee / wjk = h bk oj

Delta Rule for weights i®j on connections to nodes in the hidden layer

In this case the error depends on the errors committed by all output units:

Ee / wij = ( Ee / sj) oi

Ee / sj = Sk ( Ee / sk )( sk / sj )

= Sk ( -bk ) sk / sj // comment: from Ee / sk = -bk

= Sk ( -bk ) ( sk / oj )( oj / sj )

= Sk ( -bk) wjk ( oj / sj)

= Sk ( -bk) wjkoj ( 1 - oj )

Therefore, when we substitute:

bj = - Ee / sj= oj ( 1 - oj ) Sk ( -bk) wjk

the Delta rule for the hidden units becomes:

Dwij = -Ee / wij = h bj oi

Note: This analysis was made for a single training pattern, but it can be generalized so that:

Etotal / wij= Se Ee / wij

Thus, we just need to sum out weight changes over the examples.