311: Neural Networks

Gradient Descent Learning Algorithms
for Single-layer Perceptrons

1. Unthresholded Perceptron

We need a rule that converges towards the unknown, target function even if the examples are not linearly separable.

The key idea is to use gradient descent to search the space of weight vectors to find weights that best fit the training examples. The learning task is relaxed to that of training an unthresholded Perceptron:

o = Si=0d wi xi

2. The Gradient Descent Rule

The objective is to minimize the following error: E( w ) = ( ½ ) S e ( yeoe )2

The training is a process of minimizing the error E( w ) in the direction of the steepest most rapid decrease, that is in direction opposite to the gradient:

Ñ E( w ) = [ E/w0, E/w1, …, E/wd ]

which leads to the Gradient descent training rule:

wi = wi - h E/wi

The weight update can be derived as follows:

E/wi = ( ( ½ )S e( yeoe )2)/wi

= ( ½ )Se( yeoe )2/wi

= ( ½ )Se2( yeoe )( yeoe )/wi

= Se( yeoe )( yewi xie )/wi

= Se( yeoe )( -xie )

where xie denotes the i-th component of the example

The Gradient descent training rule becomes:

wi = wi + h Se ( yeoe ) xie

In order to avoid overstepping of the error surface the value of the learning rate h may be gradually reduced as the number of gradient steps grows.

Gradient Descent Learning Algorithm

Initialization: Examples {( xe, ye)}e=1N, initial weights wi set to small random values, learning rate parameter h = 0.1

Repeat

for each training example ( xe, ye )

- calculate the output: o = Si=0d wi xie

- if the Perceptron does not respond correctly compute weight corrections:

Dwi = Dwi + h ( ye - oe ) xie

update the weights with the accumulated error from all examples

wi = wi + Dwi // this is the Gradient Descent Rule

until termination condition is satisfied.

Example: Suppose an example of Perceptron which accepts two inputs x1 and x2, with weights w1 = 0.5 and w2 = 0.3 and w0 = -1.

Let the following example is given: x1 = 2, x2 = 1, y = 0
The output of the Perceptron is :

O = 2 * 0.5 + 1 * 0.3 - 1 = 0.3

The weight updates according to the gradient descent algorithm will be:

Dw1 = ( 0 - 0.3 ) * 2 = - 0.6

Dw2 = ( 0 - 0.3 ) * 1 = - 0.3

Dw0 = ( 0 - 0.3 ) * 1 = - 0.3

Let another example is given: x1 = 1, x2 = 2, y = 1

The output of the Perceptron is :

O = 1 * 0.5 + 2 * 0.3 - 1 = 0.1

The weight updates according to the gradient descent algorithm will be:

Dw1 = - 0.6 + ( 1 - 0.1 ) * 1 = 0.3

Dw2 = - 0.3 + ( 1 - 0.1 ) * 2 = 1.5

Dw0 = - 0.3 + ( 1 - 0.1 ) * 1= 0.6

If there are no more examples in the batch, the weights will be modified as follows:

w1 = 0.5 + 0.3 = 0.8

w2 = 0.3 + 1.5 = 1.8

w0 = - 1 + 0.6 = 1.6

Suggested Readings:

Bishop,C. (1995) "Neural Networks for Pattern Recognition",
Oxford University Press, Oxford, UK, pp.98-105.

Haykin, Simon. (1999). "Neural Networks. A Comprehensive Foundation",
Second Edition, Prentice-Hall, Inc., New Jersey, 1999.