311: Neural Networks
Gradient Descent Learning Algorithms
for Single-layer Perceptrons
1. Unthresholded Perceptron
We need a rule that converges towards the unknown, target function even if the examples are not linearly separable.
The key idea is to use gradient descent to search the space of weight vectors to find weights that best fit the training examples. The learning task is relaxed to that of training an unthresholded Perceptron:
o = Si=0d wi xi
2. The Gradient Descent Rule
The objective is to minimize the following error: E( w ) = ( ½ ) S e ( ye – oe )2
The training is a process of minimizing the error E( w ) in the direction of the steepest most rapid decrease, that is in direction opposite to the gradient:
Ñ
E( w ) = [ ¶E/¶w0, ¶E/¶w1, …, ¶E/¶wd ]which leads to the Gradient descent training rule:
wi = wi - h ¶E/¶wi
The weight update can be derived as follows:
¶
E/¶wi = ¶ ( ( ½ )S e( ye–oe )2)/¶wi= ( ½ )Se¶( ye–oe )2/¶wi
= ( ½ )Se2( ye–oe )¶( ye–oe )/¶wi
= Se( ye–oe )¶( ye–wi xie )/¶wi
= Se( ye–oe )( -xie )
where xie denotes the i-th component of the example
The Gradient descent training rule becomes:
wi = wi + h Se ( ye–oe ) xie
In order to avoid overstepping of the error surface the value of the learning rate h may be gradually reduced as the number of gradient steps grows.
Gradient Descent Learning Algorithm
Initialization: Examples {( xe, ye)}e=1N, initial weights wi set to small random values, learning rate parameter h = 0.1
Repeat
for each training example ( xe, ye )
- calculate the output: o = Si=0d wi xie
- if the Perceptron does not respond correctly compute weight corrections:
Dwi = Dwi + h ( ye - oe ) xie
update the weights with the accumulated error from all examples
wi = wi + Dwi // this is the Gradient Descent Rule
until termination condition is satisfied.
Example: Suppose an example of Perceptron which accepts two inputs x1 and x2, with weights w1 = 0.5 and w2 = 0.3 and w0 = -1.
Let the following example is given: x1 = 2, x2 = 1, y = 0
The output of the Perceptron is :
O = 2 * 0.5 + 1 * 0.3 - 1 = 0.3
The weight updates according to the gradient descent algorithm will be:
Dw1 = ( 0 - 0.3 ) * 2 = - 0.6
Dw2 = ( 0 - 0.3 ) * 1 = - 0.3
Dw0 = ( 0 - 0.3 ) * 1 = - 0.3
Let another example is given: x1 = 1, x2 = 2, y = 1
The output of the Perceptron is :
O = 1 * 0.5 + 2 * 0.3 - 1 = 0.1
The weight updates according to the gradient descent algorithm will be:
Dw1 = - 0.6 + ( 1 - 0.1 ) * 1 = 0.3
Dw2 = - 0.3 + ( 1 - 0.1 ) * 2 = 1.5
Dw0 = - 0.3 + ( 1 - 0.1 ) * 1= 0.6
If there are no more examples in the batch, the weights will be modified as follows:
w1 = 0.5 + 0.3 = 0.8
w2 = 0.3 + 1.5 = 1.8
w0 = - 1 + 0.6 = 1.6
Suggested Readings:
Bishop,C. (1995) "Neural Networks for Pattern Recognition",
Oxford University Press, Oxford, UK, pp.98-105.
Haykin, Simon. (1999). "Neural Networks. A Comprehensive Foundation",
Second Edition, Prentice-Hall, Inc., New Jersey, 1999.