
Lecture 10

Neural Networks and


Deep Learning
Dr. Amr El-Wakeel
Lane Department of Computer
Science and Electrical Engineering

Spring 24
Neural Networks and
Deep Learning

Acknowledgment: Dr. Omid Dehzangi


Neural Networks

▪ Advantages
– prediction accuracy is generally high
– robust, works when training examples contain errors
– output may be discrete, real-valued, or a vector of several
discrete or real-valued attributes
– fast evaluation of the learned target function
▪ Criticism
– long training time
– difficult to understand the learned function (weights)
– not easy to incorporate domain knowledge

A Neuron

[Diagram: inputs x0, x1, …, xn with weights w0, w1, …, wn feed a
weighted sum Σ (with bias −μk), followed by an activation function f
that produces the output y.]

• The n-dimensional input vector x is mapped into the variable y by
  means of the scalar product with the weight vector w and a nonlinear
  function mapping
Network Training

▪ The ultimate objective of training
  – obtain a set of weights that classifies almost all the tuples in
    the training data correctly
▪ Steps
– Initialize weights with random values
– Feed the input tuples into the network one by one
– For each unit
• Compute the net input to the unit as a linear combination of all
the inputs to the unit
• Compute the output value using the activation function
• Compute the error
• Update the weights and the bias
Feeding data through the net

(1 × 0.25) + (0.5 × (−1.5)) = 0.25 + (−0.75) = −0.5

Activation: 1 / (1 + e^0.5) = 0.3775
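A quick check of this forward pass in Python, using the inputs (1 and 0.5) and weights (0.25 and −1.5) shown above:

```python
import math

def sigmoid(a):
    """Logistic activation: 1 / (1 + e^(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

inputs  = [1.0, 0.5]     # input values from the example
weights = [0.25, -1.5]   # weights from the example

net = sum(x * w for x, w in zip(inputs, weights))  # weighted sum: -0.5
print(round(sigmoid(net), 4))                      # 0.3775
```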
Typical activation functions

• Logistic sigmoid, aka logit:

  f(a) = σ(a) = 1 / (1 + e^(−a))

• Hyperbolic tangent (normalized to have the same range and slope as
  the sigmoid at a = 0):

  f(a) = tanh(a) = (e^a − e^(−a)) / (e^a + e^(−a))

• Cumulative Gaussian (error function):

  f(a) = 2 ∫_{−∞}^{a} N(x | 0, 1) dx − 1

  – This one has a lighter tail
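A small sketch of the three activations using only the standard library; note that tanh(a) = 2σ(2a) − 1 (the "normalized" relation above) and that the cumulative-Gaussian form 2Φ(a) − 1 equals erf(a/√2):

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def tanh(a):
    return math.tanh(a)  # equivalently 2 * logistic(2 * a) - 1

def cumulative_gaussian(a):
    # 2 * Phi(a) - 1, where Phi is the standard normal CDF
    return math.erf(a / math.sqrt(2.0))

for a in (-2.0, 0.0, 2.0):
    print(a, logistic(a), tanh(a), cumulative_gaussian(a))
```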
Example: Voice Recognition

▪ Task: Learn to discriminate between two different
  voices saying “Hello”

▪ Data
– Sources
• Steve
• David
– Format
• Frequency distribution (60 bins)
• Analogy: cochlea

▪ Network architecture
– Feed forward network
• 60 input (one for each frequency bin)
• 6 hidden
• 2 output (0-1 for “Steve”, 1-0 for “David”)

Presenting the data

[Figure: the frequency-distribution inputs for “Steve” and for “David”
are presented to the network.]

(untrained network)

Steve: outputs 0.43, 0.26
David: outputs 0.73, 0.55
Calculate error

Steve (target 0, 1):

0.43 − 0 = 0.43
0.26 − 1 = −0.74

David (target 1, 0):

0.73 − 1 = −0.27
0.55 − 0 = 0.55

Backprop error and adjust weights

Steve: total absolute error 0.43 + 0.74 = 1.17
David: total absolute error 0.27 + 0.55 = 0.82
▪ Repeat process (sweep) for all training pairs
– Present data
– Calculate error
– Backpropagate error
– Adjust weights
▪ Repeat process multiple times

(trained network)

Steve: outputs 0.01, 0.99
David: outputs 0.99, 0.01
Network Parameters

▪ How are the weights initialized?


▪ How many hidden layers and how many
neurons?
▪ How many examples in the training set?

Weights

▪ In general, initial weights are randomly chosen, with
  typical values between −1.0 and 1.0 or −0.5 and 0.5.
▪ There are two types of NNs:
  – Fixed Networks, where the weights are fixed
  – Adaptive Networks, where the weights are
    changed to reduce prediction error
Size of Training Data
▪ Rule of thumb:
– the number of training examples should be at
least five to ten times the number of weights
of the network.

▪ Other rule:

  N ≥ |W| / (1 − a)

  where |W| = number of weights and a = expected accuracy on the
  test set. E.g., a network with 100 weights and a desired test
  accuracy of 90% needs N ≥ 100 / 0.1 = 1000 examples.
Training Basics

▪ The most basic method of training a neural network is
  trial and error.

▪ If the network isn’t behaving the way it should,
  change the weight of a random link by a random
  amount. If the accuracy of the network declines, undo
  the change and make a different one.

▪ It takes time, but the trial and error method does
  produce results.
Training: Backprop algorithm
▪ The Backprop algorithm searches for weight values
that minimize the total error of the network over the
set of training examples (training set).
▪ Backprop consists of the repeated application of the
following two passes:
– Forward pass: in this step the network is activated on one
example and the error of (each neuron of) the output layer is
computed.
– Backward pass: in this step the network error is used for
updating the weights. Starting at the output layer, the error is
propagated backwards through the network, layer by layer. This is
done by recursively computing the local gradient of each neuron.

Back Propagation

▪ Back-propagation training algorithm

  – Forward step: network activation
  – Backward step: error propagation

▪ Backprop adjusts the weights of the ANN in order to
  minimize the network’s total mean squared error.
Perceptrons
▪ Initial proposal of connectionist networks
▪ Rosenblatt, 50’s and 60’s
▪ Essentially a linear discriminant composed of
  nodes and weights

[Diagram: inputs I1, I2, I3 with weights W1, W2, W3 feed a summation
unit with threshold θ that produces the output O.]

Activation function:

  O = 1 if Σi wi Ii + θ ≥ 0
  O = 0 otherwise
Perceptron Example

Inputs: I1 = 2 with w1 = 0.5, I2 = 1 with w2 = 0.3, threshold θ = −1

2(0.5) + 1(0.3) + (−1) = 0.3 ≥ 0, so O = 1

Learning Procedure:
• Randomly assign weights (between 0 and 1)
• Present inputs from training data
• Get output O; nudge the weights to move the results toward our
  desired output T
• Repeat; stop when there are no errors, or enough epochs have completed
Perceptron Training

wi(t + 1) = wi(t) + Δwi(t)
Δwi(t) = (T − O) Ii

Weights include the threshold θ. T = desired output, O = actual output.
Example: T = 0, O = 1, w1 = 0.5, w2 = 0.3, I1 = 2, I2 = 1, θ = −1

w1(t + 1) = 0.5 + (0 − 1)(2) = −1.5
w2(t + 1) = 0.3 + (0 − 1)(1) = −0.7
θ(t + 1) = −1 + (0 − 1)(1) = −2

If we present this input again, we’d output 0 instead.
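A minimal sketch of this update rule in Python, reproducing the numbers above (the threshold θ is updated with a constant input of 1):

```python
def predict(weights, theta, inputs):
    """Perceptron output: 1 if the weighted sum plus threshold is >= 0."""
    net = sum(w * x for w, x in zip(weights, inputs)) + theta
    return 1 if net >= 0 else 0

weights, theta = [0.5, 0.3], -1.0
inputs, target = [2.0, 1.0], 0

out = predict(weights, theta, inputs)                               # 1, but T = 0
weights = [w + (target - out) * x for w, x in zip(weights, inputs)]
theta += (target - out) * 1.0                                       # threshold input is 1

print(weights, theta)                   # [-1.5, -0.7] -2.0
print(predict(weights, theta, inputs))  # 0: this input is now classified correctly
```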
Using a perceptron network

▪ This network (and others like it) is generally used to learn how to
  make classifications
▪ Assume you have collected some data regarding the diagnosis
of patients with heart disease
– Age, Sex, Chest Pain Type, Resting BPS, Cholesterol, …, Diagnosis
(<50% diameter narrowing, >50% diameter narrowing)

– 67,1,4,120,229,…, 1
– 37,1,3,130,250,… ,0
– 41,0,2,130,204,… ,0

▪ Train network to predict heart disease of new patient

Perceptron

▪ Can add a learning rate to speed up the learning process;
  just multiply it into the delta computation
▪ Essentially a linear discriminant
▪ Perceptron theorem: If a linear discriminant exists that can
separate the classes without error, the training procedure is
guaranteed to find that line or plane.

[Figure: two classes, Class 1 and Class 2, separated by a line.]
Exclusive Or (XOR) Problem

Input: 0,0  Output: 0
Input: 0,1  Output: 1
Input: 1,0  Output: 1
Input: 1,1  Output: 0

[Figure: the four XOR points on the unit square; the 1s and 0s cannot
be divided by a single line.]

XOR Problem: Not Linearly Separable!

We could construct multiple layers of perceptrons to get around this
problem. A typical multi-layered system minimizes LMS error.
Multi-Layer Perceptron

Input vector: xi  →  Input nodes  →  Hidden nodes  →  Output nodes  →  Output vector

Net input to unit j (from units i in the previous layer):

  Ij = Σi wij Oi + θj

Output of unit j (sigmoid):

  Oj = 1 / (1 + e^(−Ij))

Error at an output node j:

  Errj = Oj (1 − Oj)(Tj − Oj)

Error at a hidden node j (k ranges over the nodes in the next layer):

  Errj = Oj (1 − Oj) Σk Errk wjk

Weight and bias updates (l = learning rate):

  wij = wij + (l) Errj Oi
  θj = θj + (l) Errj
Gradient Descent

• Think of the N weights as a point in an N-dimensional space

• Add a dimension for the observed error

• Try to minimize your position on the “error surface”
Error Surface

[Figure: error plotted as a function of the weights in multidimensional
space.]
Compute gradient deltas

• Trying to make the error decrease the fastest
• Compute:
    ∇E = [∂E/∂w1, ∂E/∂w2, …, ∂E/∂wn]
• Change the i-th weight by
    Δwi = −c ∂E/∂wi
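A tiny illustration of this rule on a made-up quadratic error surface E(w) = ||w − w*||²; the minimum w* and the learning rate c = 0.1 are assumptions for the example:

```python
import numpy as np

w_star = np.array([1.0, -2.0])   # hypothetical minimum of the error surface
w = np.zeros(2)                  # initial weights
c = 0.1                          # learning rate

for step in range(100):
    grad = 2.0 * (w - w_star)    # gradE = [dE/dw1, dE/dw2] for E = ||w - w*||^2
    w = w - c * grad             # delta_wi = -c * dE/dwi

print(w)                         # converges toward [1.0, -2.0]
```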
Derivatives of error for weights

• We need a derivative!
• The activation function must be continuous,
  differentiable, non-decreasing, and easy to compute
LMS Learning
• LMS = Least Mean Square learning, more general than the
  previous perceptron learning rule. The concept is to minimize the
  total error, as measured over all training examples P. O is the raw
  output, as calculated by:

    O = Σi wi Ii + θ

    Distance(LMS) = (1/2) ΣP (TP − OP)²

  E.g., if we have two patterns and T1 = 1, O1 = 0.8, T2 = 0, O2 = 0.5,
  then D = (0.5)[(1 − 0.8)² + (0 − 0.5)²] = 0.145

• We want to minimize the LMS error (c = learning rate):

  [Figure: error E as a function of weight W; a step from W(old) to
  W(new) moves downhill on the error curve.]
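Checking the two-pattern example numerically (a minimal sketch; the factor 1/2 matches the Distance(LMS) definition above):

```python
targets = [1.0, 0.0]
outputs = [0.8, 0.5]

lms = 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))
print(lms)  # 0.145
```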
How do we pick c?

1. Tuning set, or

2. Cross validation, or

3. Small for slow, conservative learning

Estimating Error Rates

• Partition: Training-and-testing
– use two independent data sets, e.g., training set (2/3), test
set(1/3)
– used for data set with large number of samples
• Cross-validation
– divide the data set into k subsamples
– use k-1 subsamples as training data and one sub-sample as test
data --- k-fold cross-validation
– for data set with moderate size
• Bootstrapping (leave-one-out)
– for small size data
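A minimal sketch of the k-fold split (index bookkeeping only; the model training and scoring are left abstract):

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(10, k=5):
    print(len(train_idx), len(test_idx))  # 8 2, printed five times
```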
LMS Gradient Descent
• Using LMS, we want to minimize the error. We can do this by finding
the direction on the error surface that most rapidly reduces the error
rate; this is finding the slope of the error function by taking the
derivative. The approach is called gradient descent (similar to hill
climbing). To compute how much to change weight for link k:

  Δwk = −c ∂Error/∂wk ,   with Oj = f(Σk Ik Wk)

Chain rule:

  ∂Error/∂wk = (∂Error/∂Oj)(∂Oj/∂wk),
  where ∂Oj/∂wk = Ik f′(Σk Ik Wk)

  Error = (1/2) ΣP (TP − OP)², so
  ∂Error/∂Oj = ∂[ (1/2)(Tj − Oj)² ]/∂Oj = −(Tj − Oj)

  Δwk = −c (−(Tj − Oj)) Ik f′(activation function)

We can remove the sum over P since we are taking the partial derivative
with respect to Oj.
Activation Function
• To apply the LMS learning rule, also known
as the delta rule, we need a differentiable
activation function.
  Δwk = c Ik (Tj − Oj) f′(activation function)

Old (step threshold):              New (sigmoid):

  O = 1 if Σi wi Ii + θ ≥ 0          O = 1 / (1 + e^(−(Σi wi Ii + θ)))
  O = 0 otherwise
LMS vs. Limiting Threshold
• With the new sigmoidal function that is differentiable, we
can apply the delta rule toward learning.
• Perceptron Method
– Forced output to 0 or 1, while LMS uses the net output
– Guaranteed to separate, if no error and is linearly separable
• Otherwise it may not converge
• Gradient Descent Method:
– May oscillate and not converge
– May converge to wrong answer
– Will converge to some minimum even if the classes are not linearly
separable, unlike the earlier perceptron training method

Backpropagation Networks
• Attributed to Rumelhart and McClelland (the PDP group), mid 80s
• To bypass the linear classification problem, we can
construct multilayer networks. Typically we have fully
connected, feedforward networks.
Input Layer → Hidden Layer → Output Layer

[Diagram: a fully connected feedforward network with inputs I1–I3,
hidden units H1–H2, outputs O1–O2, weights Wi,j and Wj,k, and
constant-1 bias inputs.]

  Hj(x) = 1 / (1 + e^(−Σi wi,j Ii))

  Ok(x) = 1 / (1 + e^(−Σj wj,k Hj))

(The 1’s provide the bias.)
Backprop - Learning

Learning Procedure:

Randomly assign weights (between 0-1)


Present inputs from training data, propagate to outputs
Compute outputs O, adjust weights according to the delta
rule, backpropagating the errors. The weights will be
nudged closer so that the network learns to give the
desired output.
Repeat; stop when no errors, or enough epochs completed

Backprop - Modifying Weights

We had computed:

  Δwk = c Ik (Tj − Oj) f′(activation function)

For the sigmoid f = 1 / (1 + e^(−sum)), the derivative is
f′(sum) = f(sum)(1 − f(sum)), so:

  Δwk = c Ik (Tj − Oj) f(sum)(1 − f(sum))

For an output unit k, f(sum) = Ok, so:

  Δwj,k = c Hj (Tk − Ok) Ok (1 − Ok)

For the hidden units (skipping some math), this is:

  Δwi,j = c Hj (1 − Hj) Ii Σk (Tk − Ok) Ok (1 − Ok) wj,k

(I → H via Wi,j ; H → O via Wj,k)
Example of Back-propagation algorithm

Figure 1: An example of a multilayer feed-forward neural network.
Assume that the learning rate c is 0.9 and the first training example
is X = (1, 0, 1), whose class label is 1.

Note: The sigmoid function is applied at the hidden layer and output layer.
Table 1: Initial input and weight values

x1 x2 x3 | w14  w15  w24 w25 w34  w35 w46  w56  w04  w05 w06
---------+---------------------------------------------------
1  0  1  | 0.2 −0.3 0.4 0.1 −0.5 0.2 −0.3 −0.2 −0.4 0.2 0.1

Table 2: The net input and output calculation

Unit j | Net input Ij                                | Output Oj
-------+---------------------------------------------+----------------------
4      | 0.2 + 0 − 0.5 − 0.4 = −0.7                  | 1/(1+e^0.7) = 0.332
5      | −0.3 + 0 + 0.2 + 0.2 = 0.1                  | 1/(1+e^−0.1) = 0.525
6      | (−0.3)(0.332) − (0.2)(0.525) + 0.1 = −0.105 | 1/(1+e^0.105) = 0.474

Table 3: Calculation of the error at each node

Unit j | Errj
-------+--------------------------------------------------
6      | (0.474)(1 − 0.474)(1 − 0.474) = 0.1311
5      | (0.525)(1 − 0.525)(0.1311)(−0.2) = −0.0065
4      | (0.332)(1 − 0.332)(0.1311)(−0.3) = −0.0087
Table 4: Calculation for weight updating

Weight | New value
-------+----------------------------------------------
w46    | −0.3 + (0.9)(0.1311)(0.332) = −0.261
w56    | −0.2 + (0.9)(0.1311)(0.525) = −0.138
w14    | 0.2 + (0.9)(−0.0087)(1) = 0.192
w15    | −0.3 + (0.9)(−0.0065)(1) = −0.306
w24    | 0.4 + (0.9)(−0.0087)(0) = 0.4
w25    | 0.1 + (0.9)(−0.0065)(0) = 0.1
w34    | −0.5 + (0.9)(−0.0087)(1) = −0.508
w35    | 0.2 + (0.9)(−0.0065)(1) = 0.194
w06    | 0.1 + (0.9)(0.1311) = 0.218
w05    | 0.2 + (0.9)(−0.0065) = 0.194
w04    | −0.4 + (0.9)(−0.0087) = −0.408
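A sketch that reproduces this worked example with NumPy; the unit numbering (4 and 5 hidden, 6 output) follows the tables above:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x = np.array([1.0, 0.0, 1.0])                    # training example X, label 1
t, c = 1.0, 0.9                                  # target and learning rate

W_h = np.array([[0.2, -0.3], [0.4, 0.1], [-0.5, 0.2]])  # w14..w35 (3 inputs x 2 hidden)
b_h = np.array([-0.4, 0.2])                      # w04, w05
W_o = np.array([-0.3, -0.2])                     # w46, w56
b_o = 0.1                                        # w06

# Forward pass (Table 2)
h = sigmoid(x @ W_h + b_h)                       # O4 = 0.332, O5 = 0.525
o = sigmoid(h @ W_o + b_o)                       # O6 = 0.474

# Errors (Table 3)
err_o = o * (1 - o) * (t - o)                    # 0.1311
err_h = h * (1 - h) * err_o * W_o                # -0.0087 (unit 4), -0.0065 (unit 5)

# Weight updates (Table 4)
W_o += c * err_o * h                             # w46 = -0.261, w56 = -0.138
b_o += c * err_o                                 # w06 = 0.218
W_h += c * np.outer(x, err_h)                    # w14 = 0.192, ..., w35 = 0.194
b_h += c * err_h                                 # w04 = -0.408, w05 = 0.194

print(np.round(h, 3), round(o, 3))
print(np.round(W_o, 3), round(b_o, 3))
```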
The decision boundary perspective…

• Initial random weights
• Present a training instance / adjust the weights
  (repeated, instance after instance)
• Eventually…
The effect of local minima

• Because of random weight initialization, each


training run will find a different solution

Regularizing neural networks

Demonstration of over-fitting
(M = # hidden units)

The number of hidden units determines the


complexity of the learned function

Regularizing neural networks

• Use cross-validation to select the network architecture


(number of layers, number of units per layer)
• Add to E a term (λ/2) Σji wji² that penalizes large weights, and use
  cross-validation to select λ. (Empirically, this leads to significant
  improvements in generalization, because producing over-fitted mappings
  requires high curvature and hence large weights. Weight decay keeps
  the weights small and hence the mappings smooth.)
• Use early stopping with cross-validation (next slide)
• Take a Bayesian approach: put a prior on the w’s and
  integrate over them to make predictions
Early Stopping
[Figure: a typical curve showing performance during training. Training
error keeps decreasing over time/iterations, while error on unseen
validation data (estimated via CV) starts to rise at the point where
the network is starting to overfit.]
Early Stopping

• During training, keep track of the network’s performance on a
  separate validation set of data.

• At the point where the error continues to improve on the training
  set but starts to get worse on the validation set, training should
  be stopped, since the network is starting to overfit the training
  data.

• The problem here is that the stopping point is not always clear cut.

MATLAB reference for improving ANN generalization:
http://www.mathworks.com/help/nnet/ug/improve-neural-network-generalization-and-avoid-overfitting.html
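A minimal sketch of the early-stopping rule; train_one_epoch and validation_error are hypothetical callbacks standing in for the training loop and the validation-set evaluation:

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Stop when validation error has not improved for `patience` epochs."""
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()            # one pass over the training data
        err = validation_error()     # error on the held-out validation set
        if err < best_err:
            best_err, best_epoch = err, epoch
            # (in practice, also snapshot the weights here)
        elif epoch - best_epoch >= patience:
            break                    # validation error stopped improving
    return best_epoch, best_err
```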
Backpropagation
• Very powerful: with enough hidden units, it can learn any function!
• Have the problem of Generalization vs. Memorization. With too
many units, we will tend to memorize the input and not generalize
well. Some schemes exist to “prune” the neural network.
• Networks require extensive training, many parameters to fiddle
with. Can be extremely slow to train. May also fall into local
minima.
• Inherently parallel algorithm, ideal for multiprocessor hardware.
• Despite the cons, a very powerful algorithm that has seen
widespread successful deployment.

Applications of Feed-forward nets

– Pattern recognition
• Character recognition
• Face Recognition
• Speech recognition
• Etc.

– Navigation of a car

– Stock-market prediction

Introduction to:
Deep Learning
aka or related to
Deep Neural Networks
DL is providing breakthrough results in speech
recognition and image classification …

See Hinton et al., 2012:
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38131.pdf
MNIST results: http://yann.lecun.com/exdb/mnist/
and: http://people.idsia.ch/~juergen/cvpr2012.pdf
So, 1. what exactly is deep learning?

And, 2. why is it generally better than other methods on
image, speech and certain other types of data?

The short answers


1. ‘Deep Learning’ means using a neural network
with several layers of nodes between input and output

2. the series of layers between input & output do


feature identification and processing in a series of stages,
just as our brains seem to.
OK, but:
3. multilayer neural networks (MLPs) have been around
for 35 years. What’s actually new?
we have always had good algorithms for learning the
weights in networks with 1 hidden layer

but these algorithms are not good at learning the weights for
networks with more hidden layers

what’s new is: algorithms for training many-layer networks


So: multiple layers make sense
Many-layer neural network architectures should be capable of learning the true
underlying features and ‘feature logic’, and therefore generalise very well …
Feature representations

[Figure: raw pixels (pixel 1, pixel 2) fed directly to a learning
algorithm; in pixel space, the Motorbikes and “Non”-Motorbikes examples
are not separable.]

Feature representations

[Figure: the same images passed through a feature representation
(“handle”, “wheel”) before the learning algorithm; in feature space,
the Motorbikes and “Non”-Motorbikes examples separate cleanly.]
How is computer perception done?

Object detection:     Image → low-level vision features → recognition

Audio classification: Audio → low-level audio features → transcribed
                      speech (“This is written text.”)

Helicopter control:   Helicopter → low-level state features → action
Audio features

[Figure: hand-engineered audio features such as Spectrogram, MFCC,
Flux, ZCR, Rolloff.]

Problems of hand-tuned features:
1. They need expert knowledge
2. They are time-consuming and expensive
3. They do not generalize to other domains

Key question: Can we automatically learn a good feature representation?
The goal of Unsupervised Feature Learning

[Figure: unlabeled images → learning algorithm → feature representation.]
Why feature hierarchies

pixels → edges → object parts (combinations of edges) → object models

• Natural progression from low-level to high-level structure, as seen
  in natural complexity
• Easier to monitor what is being learnt and to guide the machine to
  better subspaces
But, until very recently, our weight-learning
algorithms simply did not work on multi-layer
architectures
The new way to train multi-layer NNs…

[Figure: the layers of a deep network are trained greedily: train the
first layer, then the second, then the third, then the fourth, and
finally the output layer.]
The new way to train multi-layer NNs…

EACH of the (non-output) layers is trained
to be an auto-encoder.

Basically, each layer is forced to learn good features that describe
what comes from the previous layer.
Smaller Network: Convolutional Neural Networks

• We know it is good to learn a small model.


• From this fully connected model, do we really need all the
edges?
• Can some of these be shared?
Consider learning an image:

• Some patterns are much smaller than the


whole image

Can represent a small region with fewer parameters

“beak” detector
Same pattern appears in different places:
They can be compressed!

What about training a lot of such “small” detectors,
where each detector must “move around”?

“upper-left
beak” detector

They can be compressed


to the same parameters.

“middle beak”
detector
A convolutional layer
A CNN is a neural network with some convolutional layers
(and some other layers). A convolutional layer has a number
of filters that each perform a convolution operation.

[Figure: a filter acting as a beak detector.]
Convolution

These are the network parameters to be learned:

6 × 6 image:      Filter 1:       Filter 2:
1 0 0 0 0 1        1 −1 −1        −1  1 −1
0 1 0 0 1 0       −1  1 −1        −1  1 −1
0 0 1 1 0 0       −1 −1  1        −1  1 −1
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Each filter detects a small pattern (3 × 3).
Convolution, stride = 1

Place Filter 1 on the top-left 3 × 3 patch of the 6 × 6 image and take
the dot product: the result is 3. Sliding the filter one pixel to the
right gives −1.
Convolution, stride = 2

With stride = 2, Filter 1 jumps two pixels at a time; the first row of
outputs is 3, −3.
Convolution, stride = 1

Sliding Filter 1 over all positions of the 6 × 6 image gives the full
4 × 4 feature map:

 3 −1 −3 −1
−3  1  0 −3
−3 −3  0  1
 3 −2 −2 −1
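A sketch of this operation (cross-correlation, as "convolution" is usually implemented in CNNs) with NumPy, reproducing the 4 × 4 feature map above:

```python
import numpy as np

image = np.array([[1,0,0,0,0,1],
                  [0,1,0,0,1,0],
                  [0,0,1,1,0,0],
                  [1,0,0,0,1,0],
                  [0,1,0,0,1,0],
                  [0,0,1,0,1,0]])

filter1 = np.array([[ 1,-1,-1],
                    [-1, 1,-1],
                    [-1,-1, 1]])

def conv2d(img, filt, stride=1):
    """Valid cross-correlation: slide the filter over the image, dot at each position."""
    fh, fw = filt.shape
    oh = (img.shape[0] - fh) // stride + 1
    ow = (img.shape[1] - fw) // stride + 1
    out = np.zeros((oh, ow), dtype=img.dtype)
    for i in range(oh):
        for j in range(ow):
            patch = img[i*stride:i*stride+fh, j*stride:j*stride+fw]
            out[i, j] = np.sum(patch * filt)
    return out

print(conv2d(image, filter1))            # the 4 x 4 map above, first row 3 -1 -3 -1
print(conv2d(image, filter1, stride=2))  # 2 x 2 map; first row 3 -3
```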
Convolution

Repeat this for each filter. Filter 2:

−1  1 −1
−1  1 −1
−1  1 −1

produces its own 4 × 4 feature map:

−1 −1 −1 −1
−1 −1 −2  1
−1 −1 −2  1
−1  0 −4  3

The two 4 × 4 feature maps together form a 2 × 4 × 4 matrix.
Color image: RGB, 3 channels

[Figure: for a color image, each filter is itself 3 channels deep
(a 3 × 3 kernel per RGB channel), and the 6 × 6 image is a stack of
three 6 × 6 channels.]
Convolution vs. Fully Connected

[Figure: the 6 × 6 image can be flattened into a 36-dimensional vector
x1 … x36 and fed to a fully-connected layer; convolution is the same
computation with most edges removed and the remaining weights shared.]
[Figure: the first output of Filter 1 (value 3) connects to only 9 of
the 36 inputs, not all of them: fewer parameters! The next output
(value −1) connects to its own 9 inputs but re-uses the same 9 weights:
shared weights, so even fewer parameters.]

The whole CNN

Input image → Convolution → Max Pooling → Convolution → Max Pooling
(the Convolution + Max Pooling pair can repeat many times)
→ Flattened → Fully Connected Feedforward network → “cat”, “dog”, …
Max Pooling

Filter 1 feature map:        Filter 2 feature map:
 3 −1 −3 −1                  −1 −1 −1 −1
−3  1  0 −3                  −1 −1 −2  1
−3 −3  0  1                  −1 −1 −2  1
 3 −2 −2 −1                  −1  0 −4  3
Why Pooling

• Subsampling pixels will not change the object: a subsampled bird is
  still a bird

• We can subsample the pixels to make the image smaller:
  fewer parameters to characterize the image


Max Pooling

Applying 2 × 2 max pooling to each 4 × 4 feature map of the 6 × 6 image
gives a new, smaller 2 × 2 image per filter:

Filter 1:   3 0        Filter 2:   −1 1
            3 1                     0 3

Each filter is a channel.
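A minimal NumPy sketch of 2 × 2 max pooling over the two feature maps from the slides, reproducing the 2 × 2 outputs above:

```python
import numpy as np

def max_pool(fmap, size=2):
    """Max over non-overlapping size x size blocks."""
    h, w = fmap.shape
    out = fmap[:h - h % size, :w - w % size]
    out = out.reshape(h // size, size, w // size, size)
    return out.max(axis=(1, 3))

fmap1 = np.array([[ 3,-1,-3,-1],      # Filter 1 feature map from the slides
                  [-3, 1, 0,-3],
                  [-3,-3, 0, 1],
                  [ 3,-2,-2,-1]])
fmap2 = np.array([[-1,-1,-1,-1],      # Filter 2 feature map from the slides
                  [-1,-1,-2, 1],
                  [-1,-1,-2, 1],
                  [-1, 0,-4, 3]])

print(max_pool(fmap1))  # [[3 0] [3 1]]
print(max_pool(fmap2))  # [[-1 1] [0 3]]
```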
The whole CNN

[Figure: convolution turns the 6 × 6 image into 4 × 4 feature maps; max
pooling turns them into a new 2 × 2 image with one channel per filter.
The new image is smaller than the original, and the number of channels
is the number of filters. The Convolution + Max Pooling pair can repeat
many times; the result is then Flattened and fed to a Fully Connected
Feedforward network that outputs “cat”, “dog”, ….]


Summary

A CNN compresses a fully connected network in two ways:

• Reducing the number of connections
• Sharing weights on the edges

Max pooling further reduces the complexity.
Flattening

[Figure: the 2 × 2 pooled maps (3, 0, 3, 1 and −1, 1, 0, 3) are
flattened into a single vector, which feeds a Fully Connected
Feedforward network.]
AlphaGo

[Figure: a neural network maps the board position (a 19 × 19 matrix:
black = 1, white = −1, none = 0) to the next move (19 × 19 positions).]

A fully-connected feedforward network can be used, but a CNN performs
much better.

AlphaGo’s policy network
[Quotation from their Nature article describing the policy network.]

Note: AlphaGo does not use Max Pooling.
CNN in speech recognition

[Figure: a spectrogram (frequency × time) is treated as an image; the
CNN filters move in the frequency direction.]
AlexNet Architecture

[Figure: 81 of the first-layer filters (11 × 11 × 3). They capture
low-level features like oriented edges and blobs.]

[Figure: the top 9 patches that activate each filter in layer 1; each
3 × 3 block shows the top 9 patches for one filter.]

Second layer

[Figure: note how the previous low-level features are combined to
detect slightly more abstract features like textures.]
ConvNets as generic feature extractors

• A well-trained ConvNet is an excellent feature extractor.

• Chop the network at a desired layer and use the output as a feature
  representation to train an SVM on some other vision dataset.

• Improve further by taking a pre-trained ConvNet and re-training it
  on a different dataset. This is called fine-tuning.
Last time: ConvNets

Based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Smaller Network: Recurrent neural network

This is our fully connected network. If the input is x1 … xn and n is
very large and growing, this network would become too large. We now
input one xi at a time and re-use the same edge weights.
Outline

• Sequential prediction problems


• Vanilla RNN unit
– Forward and backward pass
– Back-propagation through time (BPTT)
• Long Short-Term Memory (LSTM) unit
• Gated Recurrent Unit (GRU)
• Applications
Sequential prediction tasks

• In Perceptron, MLP, and CNN, we focused mainly on


prediction problems with fixed-size inputs and outputs
• But what if the input and/or output is a variable-length
sequence?
Text classification

• Sentiment classification: classify a restaurant or movie or


product review as positive or negative

– “The food was really good”


– “The vacuum cleaner broke within two weeks”
– “The movie had slow parts, but overall was worth watching”

• What feature representation or predictor structure can we use


for this problem?
Sentiment classification

• “The food was really good”

[Figure: a Recurrent Neural Network (RNN) reads “The” “food” “was”
“really” “good”, producing hidden states h1 … h5; the final hidden
state h5 (the “memory” or “context”) feeds a classifier.]


Image Caption Generation

[Figure: a CNN encodes the image into h0; the RNN then generates “The”
“dog” “is” “hiding” “STOP”, with a classifier on each hidden state
h1 … h5 and the previous word (starting from “START”) fed back as the
next input.]


Summary: Input-output scenarios

Single → Single       Feed-forward Network
Multiple → Single     Sequence Classification
Single → Multiple     Image Captioning
Multiple → Multiple   Translation


Recurrent Neural Network (RNN)

Output at time t: yt (via a classifier on ht)

Recurrence for the hidden representation at time t:

  ht = fW(xt, ht−1)

(new state = a function fW, with parameters W, of the input at time t
and the old state)

Input at time t: xt
Unrolling the RNN

[Figure: the recurrence unrolled over t = 1, 2, 3: h0 and x1 produce h1
and output y1; h1 and x2 produce h2 and y2; h2 and x3 produce h3 and
y3. The same hidden-layer weights are re-used at every step.]
Vanilla RNN Cell

  ht = fW(xt, ht−1) = tanh( W [xt ; ht−1] )

where [xt ; ht−1] stacks the input and the previous state, and

  tanh(a) = (e^a − e^(−a)) / (e^a + e^(−a)) = 2σ(2a) − 1

J. Elman, Finding structure in time, Cognitive Science 14(2), pp. 179–211, 1990
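A minimal NumPy sketch of this recurrence; the sizes (3-dimensional input, 4-dimensional hidden state) are arbitrary choices for the example, and biases are omitted as in the formula above:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hid = 3, 4
W = rng.standard_normal((n_hid, n_in + n_hid)) * 0.1  # one weight matrix over [x; h]

def rnn_step(x_t, h_prev):
    """h_t = tanh(W [x_t ; h_{t-1}])."""
    return np.tanh(W @ np.concatenate([x_t, h_prev]))

h = np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):  # a length-5 input sequence
    h = rnn_step(x_t, h)                    # the same W is re-used at every step
print(h)
```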
RNN Forward Pass

  ht = tanh( W [xt ; ht−1] )        (shared weights across time steps)
  yt = softmax( Wy ht )
  et = −log( yt(GTt) )              (loss on the ground-truth class GTt)
Backpropagation Through Time (BPTT)

• Most common method used to train RNNs


• The unfolded network (used during forward pass) is
treated as one big feed-forward network that accepts
the whole time series as input
• The weight updates are computed for each copy in
the unfolded network, then summed (or averaged)
and applied to the RNN weights
Unfolded RNN Forward Pass

[Figure: the unrolled network computes h1, h2, h3 and the losses e1,
e2, e3 left to right, using the equations above, with shared weights.]

Unfolded RNN Backward Pass

[Figure: the gradients of e1, e2, e3 flow right to left through the
same unrolled network.]
Backpropagation Through Time (BPTT)

• In practice, truncated BPTT is used: run the RNN forward for k1 time
  steps, then propagate the error backward for k2 time steps

https://machinelearningmastery.com/gentle-introduction-backpropagation-time/
http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf
Recurrent Neural Network
How does an RNN reduce complexity?

• Given a function f: (h′, y) = f(h, x), where h and h′ are vectors
  with the same dimension

[Figure: h0 → f → h1 → f → h2 → f → h3 …, with inputs x1, x2, x3 and
outputs y1, y2, y3.]

No matter how long the input/output sequence is, we only need one
function f. If the f’s were different, it would become a feedforward
NN. This may be treated as another compression relative to a fully
connected network.
Deep RNN:  (h′, y) = f1(h, x),  (g′, z) = f2(g, y)

[Figure: a second recurrent layer f2 is stacked on top of f1; the
outputs y1, y2, y3 of the first layer are the inputs of the second,
which produces z1, z2, z3.]
Bidirectional RNN:  (y, h) = f1(x, h),  (z, g) = f2(x, g),  p = f3(y, z)

[Figure: f1 processes x1, x2, x3 forward while f2 processes them
backward; at each step t, f3 combines the forward output yt and the
backward output zt into pt.]
Pyramid RNN: significantly speeds up training

• Reducing the number of time steps

[Figure: a stack of bidirectional RNN layers in which each layer
halves the number of time steps.]

W. Chan, N. Jaitly, Q. Le and O. Vinyals, “Listen, attend and spell: A neural
network for large vocabulary conversational speech recognition,” ICASSP, 2016
Naïve RNN

[Figure: the cell f maps (h, x) to (h′, y).]

  h′ = σ( Wh h + Wi x )
  y  = softmax( Wo h′ )    (note: y is computed from h′)

We have ignored the bias terms.


Problems with the naïve RNN

• When dealing with a time series, it tends to forget old information.
  When there is a distant relationship of unknown length, we wish to
  have a “memory” for it.
• Vanishing gradient problem.
LSTM

[Figure: the LSTM cell, with the cell state Ct−1 → Ct running along the
top and the hidden state ht−1 → ht along the bottom.]

• The sigmoid gate layers output numbers between 0 and 1 that determine
  how much of each component should be let through; the pink ✗ gates
  are point-wise multiplications.
• Forget gate: this sigmoid gate determines how much of the old cell
  state goes through.
• Input gate: it decides which components are to be updated; C′t
  provides the change contents. Together they update the cell state.
• Output gate: controls what part of the cell state goes into the
  output.
• The core idea is the cell state Ct: it is changed slowly, with only
  minor linear interactions, so it is very easy for information to flow
  along it unchanged. This is how the vanishing gradient problem is
  handled in the LSTM.
• Why sigmoid or tanh? Sigmoid (0 to 1) acts as a gating switch.
  (Could ReLU replace tanh? Possibly.)
RNN vs LSTM
Peephole LSTM

Allows “peeping into the memory”


Naïve RNN vs LSTM

[Figure: the naïve RNN cell maps (ht−1, xt) to (ht, yt); the LSTM cell
additionally carries a cell state, mapping (ct−1, ht−1, xt) to
(ct, ht, yt).]

c changes slowly: ct is ct−1 plus something
h changes faster: ht and ht−1 can be very different


These 4 matrix computations should be done concurrently:

  z  = tanh( W  [xt ; ht−1] )    (updating information)
  zi = σ( Wi [xt ; ht−1] )       (controls the input gate)
  zf = σ( Wf [xt ; ht−1] )       (controls the forget gate)
  zo = σ( Wo [xt ; ht−1] )       (controls the output gate)

With diagonal “peephole” connections, zo, zf, zi are obtained the same
way, but also see ct−1.
Information flow of LSTM (⊙ = element-wise multiply):

  ct = zf ⊙ ct−1 + zi ⊙ z
  ht = zo ⊙ tanh(ct)
  yt = σ( W′ ht )
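A minimal NumPy sketch of one LSTM step following these equations (biases omitted as in the slides; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n_in, n_hid = 3, 4
Wz, Wi, Wf, Wo = (rng.standard_normal((n_hid, n_in + n_hid)) * 0.1 for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    v = np.concatenate([x_t, h_prev])  # [x_t ; h_{t-1}]
    z  = np.tanh(Wz @ v)               # candidate update
    zi = sigmoid(Wi @ v)               # input gate
    zf = sigmoid(Wf @ v)               # forget gate
    zo = sigmoid(Wo @ v)               # output gate
    c_t = zf * c_prev + zi * z         # slowly changing cell state
    h_t = zo * np.tanh(c_t)            # faster-changing hidden state
    return h_t, c_t

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):
    h, c = lstm_step(x_t, h, c)
print(h, c)
```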

LSTM information flow over time

[Figure: the same cell repeated at steps t and t+1: (ct−1, ht−1, xt)
→ (ct, ht, yt) → (ct+1, ht+1, yt+1), with the gates zf, zi, z, zo
computed at each step.]

GRU – gated recurrent unit
(more compression)

[Figure: the GRU cell, with a reset gate and an update gate; ×/* denote
element-wise multiplication.]

It combines the forget and input gates into a single update gate.
It also merges the cell state and hidden state. This is simpler than
the LSTM. There are many other variants too.

GRUs also take xt and ht−1 as inputs. They perform some calculations
and then pass along ht. What makes them different from LSTMs is that
GRUs don’t need the cell state to pass values along. The calculations
within each iteration ensure that the ht values being passed along
either retain a high amount of old information or are jump-started
with a high amount of new information.
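For comparison with the LSTM sketch, a minimal GRU step in the standard reset/update-gate formulation (an assumption about the exact variant; biases omitted, sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n_in, n_hid = 3, 4
Wr, Wz, Wh = (rng.standard_normal((n_hid, n_in + n_hid)) * 0.1 for _ in range(3))

def gru_step(x_t, h_prev):
    v = np.concatenate([x_t, h_prev])
    r = sigmoid(Wr @ v)                                       # reset gate
    z = sigmoid(Wz @ v)                                       # update gate
    h_cand = np.tanh(Wh @ np.concatenate([x_t, r * h_prev]))  # candidate state
    return z * h_prev + (1 - z) * h_cand                      # mix old and new info

h = np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):
    h = gru_step(x_t, h)
print(h)
```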
Feed-forward vs Recurrent Network

1. A feedforward network does not have input at each step
2. A feedforward network has different parameters for each layer

Feedforward (t is the layer):

  at = ft(at−1) = σ( Wt at−1 + bt )

Recurrent (t is the time step):

  at = f(at−1, xt) = σ( Wh at−1 + Wi xt + bi )

Now remove the input xt and the output yt at each step, so that at−1 is
the output of the (t−1)-th layer and at is the output of the t-th
layer, and drop the reset gate of the GRU:

  h′ = σ( W at−1 )
  z  = σ( W′ at−1 )

Highway Network:  at = z ⊙ at−1 + (1 − z) ⊙ h′

• Highway Network                     • Residual Network

[Figure: in a highway block, the gate z controls the mix between the
copied input at−1 and the transformed h′; in a residual block, h′ is
simply added to the copied at−1.]

Training Very Deep Networks: https://arxiv.org/pdf/1507.06228v2.pdf
Deep Residual Learning for Image Recognition: http://arxiv.org/abs/1512.03385

Highway Network Experiments

[Figure: the highway network automatically determines the layers needed.]
Grid LSTM

Memory for both time and depth.

[Figure: a standard LSTM cell maps (c, h, x) to (c′, h′, y); a Grid
LSTM block carries memory along both the depth axis (a → a′, b → b′)
and the time axis (c → c′, h → h′), using the same zf, zi, z, zo
gating machinery.]

The network differs from existing deep LSTM architectures in that the
cells are connected between network layers as well as along the
spatiotemporal dimensions of the data. The network provides a unified
way of using LSTM for both deep and sequential computation.

You can generalize this to 3D, and more.
Applications of LSTM / RNN

• Neural machine translation (LSTM sequence to sequence)
• Sequence-to-sequence chat models
• Baidu’s speech recognition using RNNs
Useful Resources / References

• http://cs231n.stanford.edu/slides/winter1516_lecture10.pdf
• http://www.cs.toronto.edu/~rgrosse/csc321/lec10.pdf

• R. Pascanu, T. Mikolov, and Y. Bengio, On the difficulty of training recurrent neural


networks, ICML 2013
• S. Hochreiter, and J. Schmidhuber, Long short-term memory, Neural computation,
1997 9(8), pp.1735-1780
• F.A. Gers, and J. Schmidhuber, Recurrent nets that time and count, IJCNN 2000
• K. Greff , R.K. Srivastava, J. Koutník, B.R. Steunebrink, and J. Schmidhuber, LSTM: A
search space odyssey, IEEE transactions on neural networks and learning systems,
2016
• K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk,
and Y. Bengio, Learning phrase representations using RNN encoder-decoder for
statistical machine translation, ACL 2014
• R. Jozefowicz, W. Zaremba, and I. Sutskever, An empirical exploration of recurrent
network architectures, JMLR 2015
Questions
