
Lecture 10

Neural Networks and

Deep Learning
Dr. Amr El-Wakeel
Lane Department of Computer
Science and Electrical Engineering

Spring 24

Acknowledgment: Dr. Omid Dehzangi

Neural Networks

▪ Advantages
– prediction accuracy is generally high
– robust, works when training examples contain errors
– output may be discrete, real-valued, or a vector of several
discrete or real-valued attributes
– fast evaluation of the learned target function
▪ Criticism
– long training time
– difficult to understand the learned function (weights)
– not easy to incorporate domain knowledge

3
A Neuron

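[Figure: a single neuron. The input vector x = (x0, x1, …, xn) is multiplied by the weight vector w = (w0, w1, …, wn); the weighted sum, together with the bias term -μk, is passed through the activation function f to give the output y.]

• The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping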

4
Network Training

▪ The ultimate objective of training

– obtain a set of weights that makes almost all the tuples in
the training data classified correctly
▪ Steps
– Initialize weights with random values
– Feed the input tuples into the network one by one
– For each unit
• Compute the net input to the unit as a linear combination of all
the inputs to the unit
• Compute the output value using the activation function
• Compute the error
• Update the weights and the bias
5
Feeding data through the net

$(1 \times 0.25) + (0.5 \times (-1.5)) = 0.25 + (-0.75) = -0.5$

Activation: $\frac{1}{1 + e^{0.5}} = 0.3775$
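The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming the inputs are 1 and 0.5 and the weights are 0.25 and -1.5 (the slide does not say which factor is which); `neuron_output` is a hypothetical helper name.

```python
import math

def neuron_output(inputs, weights):
    """Weighted sum of the inputs followed by the logistic sigmoid."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-net))

# Assumed reading of the slide: inputs 1 and 0.5, weights 0.25 and -1.5.
activation = neuron_output([1.0, 0.5], [0.25, -1.5])
print(round(activation, 4))  # 0.3775
```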
6
Typical activation functions

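• Logistic sigmoid (the inverse of the logit function):

$f(a) = \sigma(a) = \frac{1}{1 + e^{-a}}$

• Hyperbolic tangent:

$f(a) = \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$

• Cumulative Gaussian (error function):

$f(a) = 2\int_{-\infty}^{a} \mathcal{N}(x \mid 0, 1)\,dx - 1$

– This one has a lighter tail

[Figure: the activation functions, normalized to have the same range and slope at a = 0; a second panel shows the same curves with f on a log scale.]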
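The three functions above can be written directly from their formulas. The sketch below uses only the standard library; the function names are illustrative, and the cumulative Gaussian is computed through the identity $2\Phi(a) - 1 = \mathrm{erf}(a/\sqrt{2})$ rather than by integrating.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def tanh(a):
    """Hyperbolic tangent written out from its definition, range (-1, 1)."""
    return (math.exp(a) - math.exp(-a)) / (math.exp(a) + math.exp(-a))

def cumulative_gaussian(a):
    """2 * Phi(a) - 1, which equals erf(a / sqrt(2)); range (-1, 1)."""
    return math.erf(a / math.sqrt(2.0))

for a in (-2.0, 0.0, 2.0):
    print(a, sigmoid(a), tanh(a), cumulative_gaussian(a))
```

Note that tanh is a rescaled sigmoid: tanh(a) = 2σ(2a) - 1.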

7
Example: Voice Recognition

▪ Task: Learn to discriminate between two different

voices saying “Hello”

▪ Data
– Sources
• Steve
• David
– Format
• Frequency distribution (60 bins)
• Analogy: cochlea

8
▪ Network architecture
– Feed forward network
• 60 input (one for each frequency bin)
• 6 hidden
• 2 output (0-1 for “Steve”, 1-0 for “David”)

9
Presenting the data
Steve

David

10
(untrained network)
Steve

0.43

0.26

David

0.73

0.55

11
Calculate error
Steve

|0.43 – 0| = 0.43

|0.26 – 1| = 0.74

David

|0.73 – 1| = 0.27

|0.55 – 0| = 0.55

12
Backprop error and adjust weights
Steve

|0.43 – 0| = 0.43

|0.26 – 1| = 0.74

Total error: 1.17

David

|0.73 – 1| = 0.27

|0.55 – 0| = 0.55

Total error: 0.82
13
▪ Repeat process (sweep) for all training pairs
– Present data
– Calculate error
– Backpropagate error
– Adjust weights
▪ Repeat process multiple times

14
(trained network)
Steve

0.01

0.99

David

0.99

0.01

15
Network Parameters

▪ How are the weights initialized?

▪ How many hidden layers and how many
neurons?
▪ How many examples in the training set?

16
Weights

▪ In general, initial weights are randomly

chosen, with typical values between -1.0 and
1.0 or -0.5 and 0.5.
▪ There are two types of NNs:
– Fixed Networks, where the weights are fixed
– Adaptive Networks, where the weights are
changed to reduce prediction error

17
Size of Training Data
▪ Rule of thumb:
– the number of training examples should be at
least five to ten times the number of weights
of the network.

▪ Other rule:
$N > \frac{|W|}{1 - a}$

where |W| = number of weights and a = expected accuracy on the test set
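As an illustration of both rules, the sketch below counts the weights of the 60-6-2 voice network from the earlier example and applies the two estimates. The helper names are hypothetical, and leaving the bias weights out of the count by default is an assumption.

```python
def count_weights(layer_sizes, include_bias=False):
    """Number of connection weights in a fully connected feed-forward net."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out
        if include_bias:
            total += n_out  # one bias weight per receiving unit
    return total

def min_examples(num_weights, accuracy):
    """Rule N > |W| / (1 - a)."""
    return num_weights / (1.0 - accuracy)

w = count_weights([60, 6, 2])          # 60*6 + 6*2 = 372
print(w, 5 * w, 10 * w)                # rule of thumb: 5 to 10 times |W|
print(min_examples(w, accuracy=0.9))   # roughly 3720 examples for a = 0.9
```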

18
Training Basics

▪ The most basic method of training a neural network is

trial and error.

▪ If the network isn't behaving the way it should,

change the weighting of a random link by a random
amount. If the accuracy of the network declines, undo
the change and make a different one.

▪ It takes time, but the trial and error method does

produce results.

19
Training: Backprop algorithm
▪ The Backprop algorithm searches for weight values
that minimize the total error of the network over the
set of training examples (training set).
▪ Backprop consists of the repeated application of the
following two passes:
– Forward pass: in this step the network is activated on one
example and the error of (each neuron of) the output layer is
computed.
– Backward pass: in this step the network error is used for
updating the weights. Starting at the output layer, the error is
propagated backwards through the network, layer by layer. This is
done by recursively computing the local gradient of each neuron.

20
Back Propagation

▪ Back-propagation training algorithm

Network activation
Forward Step

Error propagation
Backward Step

▪ Backprop adjusts the weights of the ANN in order to

minimize the network total mean squared error.

21
Perceptrons
▪ Initial proposal of connectionist networks
▪ Rosenblatt, 50’s and 60’s
▪ Essentially a linear discriminant composed of
nodes, weights
[Figure: a perceptron with inputs I1, I2, I3, weights W1, W2, W3, and output O, drawn in two equivalent ways.]

Activation Function

$O = \begin{cases} 1 & \text{if } \sum_i w_i I_i + \theta > 0 \\ 0 & \text{otherwise} \end{cases}$
22
Perceptron Example

[Figure: a perceptron with inputs 2 and 1, weights 0.5 and 0.3, and threshold θ = -1.]

$2(0.5) + 1(0.3) + (-1) = 0.3$, so $O = 1$

Learning Procedure:
• Randomly assign weights (between 0 and 1)
• Present inputs from training data
• Get output O, nudge the weights to move the result toward our desired output T
• Repeat; stop when there are no errors, or enough epochs have been completed
23
Perceptron Training

$w_i(t+1) = w_i(t) + \Delta w_i(t)$
$\Delta w_i(t) = (T - O)\, I_i$
The weights include the threshold. T = desired output, O = actual output.
Example: T=0, O=1, W1=0.5, W2=0.3, I1=2, I2=1, Theta=-1

$w_1(t+1) = 0.5 + (0 - 1)(2) = -1.5$

$w_2(t+1) = 0.3 + (0 - 1)(1) = -0.7$
$w_\theta(t+1) = -1 + (0 - 1)(1) = -2$
If we present this input again, we’d output 0 instead
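The update above can be reproduced with a short sketch. The function names and the `rate` parameter (a learning rate of 1, matching the slide) are illustrative, and the threshold is treated as a weight on a constant input of 1.

```python
def perceptron_output(weights, theta, inputs):
    """Fire (1) when the weighted sum plus threshold is positive."""
    net = sum(w * x for w, x in zip(weights, inputs)) + theta
    return 1 if net > 0 else 0

def perceptron_update(weights, theta, inputs, target, rate=1.0):
    """One step of delta_w_i = rate * (T - O) * I_i, threshold included."""
    out = perceptron_output(weights, theta, inputs)
    delta = rate * (target - out)
    new_weights = [w + delta * x for w, x in zip(weights, inputs)]
    new_theta = theta + delta * 1  # the threshold's input is fixed at 1
    return new_weights, new_theta

weights, theta = [0.5, 0.3], -1.0
inputs, target = [2, 1], 0
print(perceptron_output(weights, theta, inputs))   # 1, but the target is 0
weights, theta = perceptron_update(weights, theta, inputs, target)
print(weights, theta)                               # about [-1.5, -0.7] and -2.0
print(perceptron_output(weights, theta, inputs))   # 0
```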

24
Using a perceptron network

▪ This (and other networks) are generally used to learn how to

make classifications
▪ Assume you have collected some data regarding the diagnosis
of patients with heart disease
– Age, Sex, Chest Pain Type, Resting BPS, Cholesterol, …, Diagnosis
(<50% diameter narrowing, >50% diameter narrowing)

– 67,1,4,120,229,…, 1
– 37,1,3,130,250,… ,0
– 41,0,2,130,204,… ,0

▪ Train network to predict heart disease of new patient

25
Perceptron

▪ Can add learning rate to speed up the learning process;

just multiply in with delta computation
▪ Essentially a linear discriminant
▪ Perceptron theorem: If a linear discriminant exists that can
separate the classes without error, the training procedure is
guaranteed to find that line or plane.

Class1 Class2

26
Exclusive Or (XOR) Problem

Input: 0,0 Output: 0
Input: 0,1 Output: 1
Input: 1,0 Output: 1
Input: 1,1 Output: 0

[Figure: the four XOR points plotted in the input plane with their outputs; no single straight line separates the 1s from the 0s.]

XOR Problem: Not Linearly Separable!

We could construct multiple layers of perceptrons to get around this problem. A typical multi-layered system minimizes LMS Error.
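To see why a second layer helps, here is a minimal sketch of a two-layer network of threshold units that computes XOR. The weights are hand-picked for illustration (one hidden unit acts as OR, the other as AND); they are an assumption of this example, not values learned by any training procedure in the lecture.

```python
def step(net):
    """Threshold unit: 1 if the net input is positive, else 0."""
    return 1 if net > 0 else 0

def xor_net(x1, x2):
    h_or = step(x1 + x2 - 0.5)        # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)       # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)   # OR but not AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```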
27
Multi-Layer Perceptron

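[Figure: a multi-layer perceptron. The input vector $x_i$ enters at the input nodes, passes through weights $w_{ij}$ to the hidden nodes, and on to the output nodes, which produce the output vector.]

Net input to unit j: $I_j = \sum_i w_{ij} O_i + \theta_j$
Output of unit j: $O_j = \frac{1}{1 + e^{-I_j}}$
Error at an output node: $Err_j = O_j (1 - O_j)(T_j - O_j)$
Error at a hidden node: $Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}$
Weight update: $w_{ij} = w_{ij} + (l)\, Err_j\, O_i$
Bias update: $\theta_j = \theta_j + (l)\, Err_j$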
28
Gradient Descent

• Think of the N weights as a point in an N-dimensional

space

• Add a dimension for the observed error

• Try to minimize your position on the “error surface”

29
Error Surface

error

weights

Error as function of weights

in multidimensional space
30
Compute
Gradient
deltas

• Trying to make error decrease the fastest

• Compute:
• GradE = [dE/dw1, dE/dw2, . . ., dE/dwn]
• Change i-th weight by
• deltawi = -c * dE/dwi
Derivatives of error for weights

• We need a derivative!
• Activation function must be continuous,
differentiable, non-decreasing, and easy to compute

31
LMS Learning
• LMS = Least Mean Square learning systems, more general than the previous perceptron learning rule. The concept is to minimize the total error, as measured over all training examples, P. O is the raw output, as calculated by:

$O = \sum_i w_i I_i + \theta$

$\text{Distance}(LMS) = \frac{1}{2} \sum_P (T_P - O_P)^2$

E.g. if we have two patterns and T1=1, O1=0.8, T2=0, O2=0.5 then $D = (0.5)[(1-0.8)^2 + (0-0.5)^2] = 0.145$

We want to minimize the LMS:

[Figure: error E plotted against weight W; a gradient step moves from W(old) to W(new) down the slope, with the step size set by the learning rate c.]
32
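The LMS distance and the idea of stepping a weight downhill can be sketched as follows. The single linear unit O = w·I, the starting weight, and the learning rate c = 0.5 are assumptions chosen only to keep the example small.

```python
def lms_distance(targets, outputs):
    """Half the sum of squared differences over all patterns."""
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))

# The two-pattern example from the slide.
d = lms_distance([1, 0], [0.8, 0.5])
print(round(d, 3))  # 0.145

# Gradient descent on one weight of a linear unit O = w * I (illustrative values).
w, c = 0.0, 0.5
I, T = 1.0, 1.0
errors = []
for _ in range(5):
    O = w * I
    errors.append(lms_distance([T], [O]))
    grad = -(T - O) * I      # dE/dw for E = 0.5 * (T - w*I)^2
    w = w - c * grad         # W(new) = W(old) - c * dE/dW
print(errors)
```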
How do we pick c?

1. Tuning set, or

2. Cross validation, or

3. Small for slow, conservative learning

33
Estimating Error Rates

• Partition: Training-and-testing
– use two independent data sets, e.g., training set (2/3), test
set(1/3)
– used for data set with large number of samples
• Cross-validation
– divide the data set into k subsamples
– use k-1 subsamples as training data and one sub-sample as test
data --- k-fold cross-validation
– for data set with moderate size
• Bootstrapping (leave-one-out)
– for small size data
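The k-fold procedure described above can be sketched without any libraries. `k_fold_indices` is a hypothetical helper; it splits sample indices into k contiguous folds, whereas in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(10, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each sample is used for testing exactly once and for training k-1 times; the k test errors are then averaged.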
34
LMS Gradient Descent
• Using LMS, we want to minimize the error. We can do this by finding the direction on the error surface that most rapidly reduces the error rate; this is finding the slope of the error function by taking the derivative. The approach is called gradient descent (similar to hill climbing). To compute how much to change weight for link k:

$\Delta w_k = -c\, \frac{\partial Error}{\partial w_k}$, where $O_j = f\left(\sum I\, W\right)$

Chain rule: $\frac{\partial Error}{\partial w_k} = \frac{\partial Error}{\partial O_j} \cdot \frac{\partial O_j}{\partial w_k}$, with $\frac{\partial O_j}{\partial w_k} = I_k\, f'(\text{ActivationFunction}(I_k W_k))$

$\frac{\partial Error}{\partial O_j} = \frac{\partial\, \frac{1}{2} \sum_P (T_P - O_P)^2}{\partial O_j} = \frac{\partial\, \frac{1}{2} (T_j - O_j)^2}{\partial O_j} = -(T_j - O_j)$

We can remove the sum since we are taking the partial derivative with respect to $O_j$.

$\Delta w_k = -c\, \left(-(T_j - O_j)\right) I_k\, f'(\text{ActivationFunction})$
35
Activation Function
• To apply the LMS learning rule, also known as the delta rule, we need a differentiable activation function.

$\Delta w_k = c\, I_k\, (T_j - O_j)\, f'(\text{ActivationFunction})$

Old: $O = \begin{cases} 1 & \text{if } \sum_i w_i I_i + \theta > 0 \\ 0 & \text{otherwise} \end{cases}$

New: $O = \frac{1}{1 + e^{-\left(\sum_i w_i I_i + \theta\right)}}$
36
LMS vs. Limiting Threshold
• With the new sigmoidal function that is differentiable, we
can apply the delta rule toward learning.
• Perceptron Method
– Forces the output to 0 or 1, while LMS uses the net output
– Guaranteed to find a separating boundary if the classes are linearly separable without error
• Otherwise it may not converge
• Gradient Descent Method:
– May oscillate and not converge
– May converge to wrong answer
– Will converge to some minimum even if the classes are not linearly
separable, unlike the earlier perceptron training method

37
Backpropagation Networks
• Attributed to Rumelhart and McClelland, mid-1980s
• To bypass the linear classification problem, we can
construct multilayer networks. Typically we have fully
connected, feedforward networks.
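[Figure: a fully connected feedforward network with an input layer (I1, I2, I3), a hidden layer (H1, H2), and an output layer (O1, O2). Weights Wi,j connect input to hidden and weights Wj,k connect hidden to output; the extra inputs fixed at 1 are the biases.]

$H(x) = \frac{1}{1 + e^{-\sum_i w_{i,x} I_i}}$

$O(x) = \frac{1}{1 + e^{-\sum_j w_{j,x} H_j}}$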
38
Backprop - Learning

Learning Procedure:

Randomly assign weights (between 0-1)

Present inputs from training data, propagate to outputs
Compute outputs O, adjust weights according to the delta
rule, backpropagating the errors. The weights will be
nudged closer so that the network learns to give the
desired output.
Repeat; stop when no errors, or enough epochs completed

39
Backprop - Modifying Weights

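We had computed:

$\Delta w_k = c\, I_k\, (T_j - O_j)\, f'(\text{ActivationFunction})$

With the sigmoid $f(sum) = \frac{1}{1 + e^{-sum}}$, the derivative is $f'(sum) = f(sum)\,(1 - f(sum))$, so:

$\Delta w_k = c\, I_k\, (T_j - O_j)\, f(sum)\,(1 - f(sum))$

For the Output unit k, f(sum)=O(k). For the output units, this is:

$\Delta w_{j,k} = c\, H_j\, (T_k - O_k)\, O_k (1 - O_k)$

For the Hidden units (skipping some math), this is:

$\Delta w_{i,j} = c\, H_j (1 - H_j)\, I_i \sum_k (T_k - O_k)\, O_k (1 - O_k)\, w_{j,k}$

[Figure: layers I, H, O connected by weights Wi,j and Wj,k.]
40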
Example of Back-propagation algorithm

Figure 1: An example of a multilayer feed-forward neural network. Assume that the

learning rate c is 0.9 and the first training example, X = (1,0,1) whose class label is 1.

Note: The sigmoid function is applied to hidden layer and output layer.

41
Table 1: Initial input and weight values
x1   x2   x3   w14   w15   w24   w25   w34   w35   w46   w56   w04   w05   w06
-----------------------------------------------------------------------------------
1    0    1    0.2   -0.3  0.4   0.1   -0.5  0.2   -0.3  -0.2  -0.4  0.2   0.1

Table 2: The net input and output calculation

Unit j   Net input Ij                                   Output Oj
-----------------------------------------------------------------------------------
4        0.2 + 0 - 0.5 - 0.4 = -0.7                     1/(1+e^0.7) = 0.332
5        -0.3 + 0 + 0.2 + 0.2 = 0.1                     1/(1+e^-0.1) = 0.525
6        (-0.3)(0.332) - (0.2)(0.525) + 0.1 = -0.105    1/(1+e^0.105) = 0.474

Table 3: Calculation of the error at each node

Unit j   Errj
-----------------------------------------------------------------------------
6        (0.474)(1-0.474)(1-0.474) = 0.1311
5        (0.525)(1-0.525)(0.1311)(-0.2) = -0.0065
4        (0.332)(1-0.332)(0.1311)(-0.3) = -0.0087

42
Calculation for weight updating
Weight   New value
------------------------------------------------------------------------------
w46      -0.3 + (0.9)(0.1311)(0.332) = -0.261
w56      -0.2 + (0.9)(0.1311)(0.525) = -0.138
w14      0.2 + (0.9)(-0.0087)(1) = 0.192
w15      -0.3 + (0.9)(-0.0065)(1) = -0.306
w24      0.4 + (0.9)(-0.0087)(0) = 0.4
w25      0.1 + (0.9)(-0.0065)(0) = 0.1
w34      -0.5 + (0.9)(-0.0087)(1) = -0.508
w35      0.2 + (0.9)(-0.0065)(1) = 0.194
w06      0.1 + (0.9)(0.1311) = 0.218
w05      0.2 + (0.9)(-0.0065) = 0.194
w04      -0.4 + (0.9)(-0.0087) = -0.408
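The whole worked example can be replayed in code, which is a useful check on the hand arithmetic. This sketch hard-codes the network from the tables (units 1 to 3 are inputs, 4 and 5 are hidden, 6 is the output); the dictionary layout and variable names are choices made for this illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

c = 0.9                              # learning rate
x = {1: 1, 2: 0, 3: 1}               # training example X = (1, 0, 1)
target = 1                           # class label
w = {(1, 4): 0.2, (1, 5): -0.3, (2, 4): 0.4, (2, 5): 0.1,
     (3, 4): -0.5, (3, 5): 0.2, (4, 6): -0.3, (5, 6): -0.2}
bias = {4: -0.4, 5: 0.2, 6: 0.1}     # w04, w05, w06

# Forward pass: net input, then sigmoid, for hidden units and the output unit.
out = dict(x)
for j in (4, 5):
    out[j] = sigmoid(sum(w[(i, j)] * x[i] for i in (1, 2, 3)) + bias[j])
out[6] = sigmoid(sum(w[(j, 6)] * out[j] for j in (4, 5)) + bias[6])

# Backward pass: output error first, then the hidden errors.
err = {6: out[6] * (1 - out[6]) * (target - out[6])}
for j in (4, 5):
    err[j] = out[j] * (1 - out[j]) * err[6] * w[(j, 6)]

# Weight and bias updates.
new_w = {(i, j): w[(i, j)] + c * err[j] * out[i] for (i, j) in w}
new_bias = {j: bias[j] + c * err[j] for j in bias}

print({k: round(v, 3) for k, v in new_w.items()})
print({k: round(v, 3) for k, v in new_bias.items()})
```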

43
The decision boundary perspective…
[Figure sequence: the decision boundary with the initial random weights; then the boundary after each of several steps of presenting a training instance and adjusting the weights; then the boundary reached eventually.]
The effect of local minima

• Because of random weight initialization, each

training run will find a different solution

[Figure: validation error plotted against M.]
50
Regularizing neural networks

Demonstration of over-fitting
(M = # hidden units)

The number of hidden units determines the

complexity of the learned function

51
Regularizing neural networks

• Use cross-validation to select the network architecture

(number of layers, number of units per layer)
• Add to E a term $(\lambda/2)\sum_{j,i} w_{ji}^2$ that penalizes large weights (weight decay).

Use cross-validation to select λ. (Empirically, this leads to significant improvements in generalization. This is because producing over-fitted mappings requires high curvature and hence large weights. Weight decay keeps the weights small and hence the mappings are smooth.)
• Use early-stopping and cross-validation (next slide)
• Take a Bayesian approach: Put a prior on the w’s and
integrate over them to make predictions
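The weight-decay term from the list above is simple to express in code. In this sketch the function names, the flat list of weights, and the learning rate `c` are assumptions; the only fact used is that the derivative of (λ/2)·w² with respect to w is λ·w.

```python
def weight_decay_penalty(weights, lam):
    """(lambda / 2) * sum of squared weights, added to the error E."""
    return 0.5 * lam * sum(w * w for w in weights)

def decayed_update(weights, grads, c, lam):
    """Gradient step on E plus the penalty: w <- w - c * (dE/dw + lam * w)."""
    return [w - c * (g + lam * w) for w, g in zip(weights, grads)]

weights = [1.0, -2.0]
print(weight_decay_penalty(weights, lam=0.1))
# With a zero error gradient, the penalty alone shrinks every weight toward 0.
print(decayed_update(weights, grads=[0.0, 0.0], c=0.5, lam=0.1))
```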
52
Early Stopping
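[Figure: a typical curve showing performance during training, with error plotted against time/iteration. One curve is the error on the training data; the other is the error on the validation data, that is, performance on unseen data estimated via CV. The point where the validation curve turns upward is marked "Starting to overfit".]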
53
Early Stopping

• During training, keep track of the network’s performance on a

separate validation set of data.

• At the point where error continues to improve on the training

set, but starts to get worse on the validation set, that is when
training should be stopped, since it is starting to overfit on the
training data.

• The problem here is that it is not always clear cut.
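One common way to deal with a validation curve that is not clear cut is to wait a few epochs before giving up. The sketch below assumes a list of per-epoch validation errors and a hypothetical `patience` parameter; neither comes from the lecture.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return (best_epoch, best_error), stopping once the validation error
    has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_err, waited = 0, float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err, waited = epoch, err, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_err

# Made-up validation errors with a small bump before a later improvement.
val = [0.90, 0.70, 0.60, 0.55, 0.57, 0.56, 0.60, 0.50]
print(early_stopping_epoch(val, patience=2))  # stops early, keeps epoch 3
print(early_stopping_epoch(val, patience=5))  # waits longer, finds epoch 7
```

The two calls show the trade-off: a short patience stops at the first bump, while a longer one rides it out at the cost of extra training.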

MATLAB Ref for improving ANN generalization:

http://www.mathworks.com/help/nnet/ug/improve-neural-network-generalization-and-avoid-overfitting.html

54
Backpropagation
• Very powerful - can learn any function, given enough hidden units!
With enough hidden units, we can generate any function.
• Have the problem of Generalization vs. Memorization. With too
many units, we will tend to memorize the input and not generalize
well. Some schemes exist to “prune” the neural network.
• Networks require extensive training, many parameters to fiddle
with. Can be extremely slow to train. May also fall into local
minima.
• Inherently parallel algorithm, ideal for multiprocessor hardware.
• Despite the cons, a very powerful algorithm that has seen
widespread successful deployment.

55
Applications of Feed-forward nets

– Pattern recognition
• Character recognition
• Face Recognition
• Speech recognition
• Etc.

– Navigation of a car

– Stock-market prediction

56
Introduction to:
Deep Learning
aka or related to
Deep Neural Networks
DL is providing breakthrough results in speech
recognition and image classification …
From this Hinton et al 2012 paper:
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38131.pdf

go here: http://yann.lecun.com/exdb/mnist/
From here:
http://people.idsia.ch/~juergen/cvpr2012.pdf
So, 1. what exactly is deep learning ?

And, 2. why is it generally better than other methods on

image, speech and certain other types of data?
The short answers

1. ‘Deep Learning’ means using a neural network
with several layers of nodes between input and output

2. the series of layers between input & output do

feature identification and processing in a series of stages,
just as our brains seem to.
OK, but:
3. multilayer neural networks (MLPs) have been around
for 35 years. What’s actually new?
we have always had good algorithms for learning the
weights in networks with 1 hidden layer

but these algorithms are not good at learning the weights for
networks with more hidden layers

what’s new is: algorithms for training many-layer networks

So: multiple layers make sense
Many-layer neural network architectures should be capable of learning the true
underlying features and ‘feature logic’, and therefore generalise very well …
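Feature representations
[Figure: the raw input (pixel 1, pixel 2) is fed directly to the learning algorithm. Motorbikes and “Non”-Motorbikes are plotted in the input space with axes pixel 1 and pixel 2.]
Feature representations
[Figure: the input is first mapped to a feature representation (handle, wheel) and then fed to the learning algorithm. Motorbikes and “Non”-Motorbikes are plotted in the input space (pixel 1, pixel 2) and in the feature space (“wheel”, “handle”).]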
How is computer perception done?
[Figure: three perception pipelines, each going from raw input to low-level features to a result. Object detection: image → low-level vision features → recognition. Audio classification: audio → low-level audio features → transcribed speech (“This is written text.”). Helicopter control: helicopter → low-level state features → action.]
Audio features
[Figure: examples of hand-designed audio features: Spectrogram, MFCC, Flux, ZCR, Rolloff.]

Problems of hand-tuned features
1. Needs expert knowledge
2. Time-consuming and expensive
3. Does not generalize to other domains

Key question: Can we automatically learn a good feature representation?
The goal of Unsupervised Feature Learning

Unlabeled images

Learning
algorithm

Feature representation
Why feature hierarchies
• Natural progression from low level to high level structure as seen in natural complexity
• Easier to monitor what is being learnt and to guide the machine to better subspaces

[Figure: the feature hierarchy, from bottom to top: pixels → edges → object parts (combination of edges) → object models.]
But, until very recently, our weight-learning
algorithms simply did not work on multi-layer
architectures
The new way to train multi-layer NNs…

Train this layer first

then this layer
then this layer
then this layer
finally this layer
The new way to train multi-layer NNs…

EACH of the (non-output) layers is trained

to be an auto-encoder
Basically, it is forced to learn good
features that describe what comes from
the previous layer
Smaller Network: Convolutional Neural Networks

• We know it is good to learn a small model.

• From this fully connected model, do we really need all the
edges?
• Can some of these be shared?
Consider learning an image:

• Some patterns are much smaller than the

whole image

Can represent a small region with fewer parameters

“beak” detector
Same pattern appears in different places:
They can be compressed!

What about training a lot of such “small” detectors

and each detector must “move around”.

“upper-left
beak” detector

They can be compressed

to the same parameters.

“middle beak”
detector
A convolutional layer
A CNN is a neural network with some convolutional layers (and some other layers). A convolutional layer has a number of filters that perform the convolution operation.

Beak detector

A filter
Convolution

The filter values are the network parameters to be learned. Each filter detects a small pattern (3 x 3).

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2:
-1  1 -1
-1  1 -1
-1  1 -1

…

Convolution

With stride=1, Filter 1 is placed on the top-left 3 x 3 patch of the 6 x 6 image and the dot product is taken, giving 3. Moving the filter one column to the right gives -1.

Convolution

If stride=2, the filter moves two columns at a time, so the first two outputs of Filter 1 are 3 and -3.

Convolution

With stride=1, sliding Filter 1 over the whole 6 x 6 image gives a 4 x 4 output:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

Convolution

Repeat this for each filter. With stride=1, Filter 2 gives a second 4 x 4 feature map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

The two 4 x 4 images form a 2 x 4 x 4 matrix.
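The feature maps on these slides can be reproduced with a plain nested-loop convolution. This is a minimal sketch on Python lists, with no padding and no kernel flip (the usual convention in CNNs); `conv2d` is a hypothetical helper name.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and take the dot product at each position."""
    k = len(kernel)
    n_rows = (len(image) - k) // stride + 1
    n_cols = (len(image[0]) - k) // stride + 1
    out = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out

image = [[1, 0, 0, 0, 0, 1],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 1, 0, 0],
         [1, 0, 0, 0, 1, 0],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 0, 1, 0]]
filter1 = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
filter2 = [[-1, 1, -1], [-1, 1, -1], [-1, 1, -1]]

feature_maps = [conv2d(image, f) for f in (filter1, filter2)]  # 2 x 4 x 4
for fmap in feature_maps:
    for row in fmap:
        print(row)
    print()
print(conv2d(image, filter1, stride=2))
```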
Color image: RGB 3 channels

[Figure: for a color image, the 6 x 6 input has three channels, and Filter 1 and Filter 2 each have three 3 x 3 slices, one per channel.]
Convolution v.s. Fully Connected

[Figure: the 6 x 6 image is convolved with Filter 1 and Filter 2; for comparison, the same image is flattened into the inputs x1, x2, …, x36 of a fully-connected layer.]

[Figure: the 6 x 6 image flattened to 36 inputs. The output neuron with value 3 connects only to inputs 1, 2, 3, 7, 8, 9, 13, 14, 15 (the top-left 3 x 3 patch), with the values of Filter 1 as its weights. It only connects to 9 inputs, not fully connected: fewer parameters!]

[Figure: the neighboring output neuron with value -1 connects to inputs 2, 3, 4, 8, 9, 10, 14, 15, 16 and uses the same Filter 1 values. Shared weights: even fewer parameters.]
The whole CNN

[Figure: image → Convolution → Max Pooling → Convolution → Max Pooling (convolution and max pooling can repeat many times) → Flattened → Fully Connected Feedforward network → cat, dog, ……]
Max Pooling

Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2:
-1  1 -1
-1  1 -1
-1  1 -1

Feature map from Filter 1:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

Feature map from Filter 2:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
Why Pooling

• Subsampling pixels will not change the object

[Figure: a bird image and a subsampled, smaller version that still shows a bird.]

We can subsample the pixels to make the image smaller, so fewer parameters are needed to characterize the image.

Max Pooling

[Figure: 6 x 6 image → Conv → Max Pooling → a new but smaller image (2 x 2). Each filter is a channel.]

Channel from Filter 1:
3 0
3 1

Channel from Filter 2:
-1 1
 0 3
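The pooled values above follow from taking the maximum over each non-overlapping 2 x 2 block of the 4 x 4 feature maps. The sketch below hard-codes those feature maps; `max_pool` and `flatten` are hypothetical helper names, and the channel-by-channel flattening order is an assumption.

```python
def max_pool(feature_map, size=2):
    """Maximum over each non-overlapping size x size block."""
    out = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[r + i][c + j]
                           for i in range(size) for j in range(size)))
        out.append(row)
    return out

def flatten(channels):
    """Concatenate all channel values into one vector for the dense layers."""
    return [v for channel in channels for row in channel for v in row]

map1 = [[3, -1, -3, -1], [-3, 1, 0, -3], [-3, -3, 0, 1], [3, -2, -2, -1]]
map2 = [[-1, -1, -1, -1], [-1, -1, -2, 1], [-1, -1, -2, 1], [-1, 0, -4, 3]]

pooled = [max_pool(map1), max_pool(map2)]
print(pooled)
print(flatten(pooled))
```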
The whole CNN

[Figure: Convolution followed by Max Pooling turns the input into a new image (here the two 2 x 2 channels 3 0 / 3 1 and -1 1 / 0 3) that is smaller than the original image. The number of channels is the number of filters. Convolution and Max Pooling can repeat many times.]

The whole CNN

[Figure: image → Convolution → Max Pooling → a new image → Convolution → Max Pooling → a new image → Flattened → Fully Connected Feedforward network → cat, dog, ……]

Summary

A CNN compresses a fully connected

network in two ways:

• Reducing number of connections

• Shared weights on the edges
• Max pooling further reduces the complexity
Flattening

[Figure: the pooled 2 x 2 channels (3 0 / 3 1 and -1 1 / 0 3) are flattened into a single vector, which is the input to the Fully Connected Feedforward network.]
AlphaGo

[Figure: the board as a 19 x 19 matrix (Black: 1, white: -1, none: 0) → Neural Network → next move (19 x 19 positions).]

A fully-connected feedforward network can be used, but a CNN performs much better.
AlphaGo’s policy network
The following is quotation from their Nature article:
Note: AlphaGo does not use Max Pooling.
CNN in speech recognition

[Figure: a spectrogram (axes: Time and Frequency) is treated as an image and fed to the CNN. The filters move in the frequency direction.]
AlexNet Architecture
Showing 81 filters of
11x11x3.

Capture low-level
features like oriented
edges, blobs.
Top 9 patches that activate each filter

in layer 1
Each 3x3 block shows the top 9 patches
for one filter.
Second layer

Note how the previous

low-level features are
combined to detect a
little more abstract
features like textures.
ConvNets as generic feature extractor

• A well-trained ConvNet is an excellent feature extractor.

• Chop the network at the desired layer and use the output as a feature representation to train an SVM on some other vision dataset.

• Improve further by taking a pre-trained ConvNet and re-training it on a different dataset. This is called fine-tuning.
Last time: ConvNets

Based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Smaller Network: Recurrent neural network

This is our fully connected network. If the input is x1 .... xn and n is very large and growing, this network would become too large. Instead, we now input one xi at a time and re-use the same edge weights.
Outline

• Sequential prediction problems

• Vanilla RNN unit
– Forward and backward pass
– Back-propagation through time (BPTT)
• Long Short-Term Memory (LSTM) unit
• Gated Recurrent Unit (GRU)
• Applications
Sequential prediction tasks

• In Perceptron, MLP, and CNN, we focused mainly on

prediction problems with fixed-size inputs and outputs
• But what if the input and/or output is a variable-length
sequence?
Text classification

• Sentiment classification: classify a restaurant or movie or

product review as positive or negative

– “The food was really good”

– “The vacuum cleaner broke within two weeks”
– “The movie had slow parts, but overall was worth watching”

• What feature representation or predictor structure can we use

for this problem?
Sentiment classification

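• “The food was really good”

[Figure: a Recurrent Neural Network (RNN) reads “The”, “food”, “was”, “really”, “good” one word at a time, producing hidden states h1, h2, h3, h4, h5 (the “memory” or “context”); the final hidden state h5 is passed to the classifier.]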

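Image Caption Generation

[Figure: a CNN encodes the image into the initial hidden state h0. At each step the RNN takes the previous word (“START”, “The”, “dog”, “is”, “hiding”) and the previous hidden state (h0 … h4), produces the next hidden state (h1 … h5), and a classifier outputs the next word (“The”, “dog”, “is”, “hiding”, “STOP”).]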

Summary: Input-output scenarios

Single - Single Feed-forward Network

Multiple - Single Sequence Classification

Single - Multiple Image Captioning

Multiple - Multiple Image Captioning

Multiple - Multiple Translation

Recurrent Neural Network (RNN)

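[Figure: one RNN step. The input at time t, xt, enters the hidden layer, which produces the hidden representation at time t, ht; a classifier maps ht to the output at time t, yt.]

Recurrence:

$h_t = f_W(x_t, h_{t-1})$

where $h_t$ is the new state, $f_W$ is a function of W, $x_t$ is the input at time t, and $h_{t-1}$ is the old state.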
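Unrolling the RNN
[Figure: the RNN unrolled over t = 1, 2, 3. At each step the same hidden layer takes xt and the previous hidden state and produces ht, and a classifier maps ht to yt: (h0, x1) → h1 → y1; (h1, x2) → h2 → y2; (h2, x3) → h3 → y3.]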
Vanilla RNN Cell

$h_t = f_W(x_t, h_{t-1}) = \tanh\left(W \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right)$

J. Elman, Finding structure in time, Cognitive science 14(2), pp. 179–211, 1990
Vanilla RNN Cell

[Figure: the cell takes ht-1 and xt, applies W, and outputs ht; plots of σ(a) and tanh(a).]

$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} = 2\sigma(2a) - 1$
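RNN Forward Pass

$h_t = \tanh\left(W \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right)$

$y_t = \text{softmax}(W_y h_t)$

$e_t = -\log(y_t(GT_t))$, where $GT_t$ is the ground-truth class at time t

[Figure: the unrolled network for t = 1, 2, 3, with shared weights at every step: (h0, x1) → h1 → y1 → e1; (h1, x2) → h2 → y2 → e2; (h2, x3) → h3 → y3 → e3.]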
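The forward pass can be sketched in plain Python to make the weight sharing explicit. The dimensions (2 inputs, 2 hidden units, 3 classes), the weight values, and the toy sequence below are all made up for illustration; biases are omitted, as in the equations.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    m = max(z)                               # subtract the max for stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def rnn_forward(xs, targets, W, W_y, h0):
    """Run the same W and W_y over every time step; return outputs, losses, last h."""
    h, outputs, losses = h0, [], []
    for x, gt in zip(xs, targets):
        h = [math.tanh(a) for a in matvec(W, x + h)]   # W applied to [x_t; h_{t-1}]
        y = softmax(matvec(W_y, h))
        outputs.append(y)
        losses.append(-math.log(y[gt]))                # e_t = -log(y_t(GT_t))
    return outputs, losses, h

W = [[0.5, -0.3, 0.1, 0.2], [-0.2, 0.4, 0.3, -0.1]]   # hidden x (input + hidden)
W_y = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]          # classes x hidden
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [0, 2, 1]

outputs, losses, h_last = rnn_forward(xs, targets, W, W_y, h0=[0.0, 0.0])
print(losses, sum(losses))
```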
Backpropagation Through Time (BPTT)

• Most common method used to train RNNs

[Figure: the unrolled network for t = 1, 2, 3, with $e_t = -\log(y_t(GT_t))$, $y_t = \text{softmax}(W_y h_t)$, and $h_t = \tanh\left(W \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right)$.]
Unfolded RNN Backward Pass

[Figure: the same unfolded network, traversed from the errors e1, e2, e3 back through y1, y2, y3 and h1, h2, h3.]
Backpropagation Through Time (BPTT)

• Most common method used to train RNNs

• The unfolded network (used during forward pass) is
treated as one big feed-forward network that accepts
the whole time series as input
• The weight updates are computed for each copy in the
unfolded network, then summed (or averaged) and
applied to the RNN weights
• In practice, truncated BPTT is used: run the RNN forward
𝑘1 time steps, propagate backward for 𝑘2 time steps
https://machinelearningmastery.com/gentle-introduction-backpropagation-time/
http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf
Recurrent Neural Network
How does RNN reduce complexity?

• Given function f: h’, y = f(h, x), where h and h’ are vectors with the same dimension

[Figure: h0 → f → h1 → f → h2 → f → h3 → ……, with inputs x1, x2, x3 and outputs y1, y2, y3.]

No matter how long the input/output sequence is, we only need one function f. If the f’s are different, then it becomes a feedforward NN. This may be treated as another compression from a fully connected network.
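Deep RNN

h’, y = f1(h, x), g’, z = f2(g, y)

[Figure: two stacked recurrent layers. The lower layer applies f1 along h0, h1, h2, h3, … with inputs x1, x2, x3 and outputs y1, y2, y3; the upper layer applies f2 along g0, g1, g2, g3, … with inputs y1, y2, y3 and outputs z1, z2, z3.]
Bidirectional RNN

y, h = f1(x, h), z, g = f2(g, x), p = f3(y, z)

[Figure: f1 runs over x1, x2, x3 with states h0 … h3 and produces y1, y2, y3; f2 runs over x1, x2, x3 with states g0 … g3 and produces z1, z2, z3; f3 combines yt and zt into pt.]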
Pyramid RNN

• Reducing the number of time steps
• Significantly speeds up training

[Figure: stacked bidirectional RNN layers.]

W. Chan, N. Jaitly, Q. Le and O. Vinyals, “Listen, attend and spell: A neural

network for large vocabulary conversational speech recognition,” ICASSP, 2016
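Naïve RNN

[Figure: the function f takes h and x and produces h’ and y. h’ is computed from Wh h and Wi x; y is computed from Wo h’ through a softmax.]

Note, y is computed from h’. We have ignored the bias.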

Problems with naive RNN

• When dealing with a time series, it tends to

forget old information. When there is a distant
relationship of unknown length, we wish to
have a “memory” to it.
• Vanishing gradient problem.
LSTM

The sigmoid layer outputs numbers between 0 and 1 that determine how much of each component should be let through. The pink X gate is point-wise multiplication.

[Figure: an LSTM cell with inputs Ct-1 and ht-1, showing the forget gate, the input gate, and the output gate.]

• This sigmoid gate determines how much information goes through.
• This decides what info is to be added to the cell state.
• Output gate: controls what goes into the output.

The core idea is this cell state Ct: it is changed slowly, with only minor linear interactions. It is very easy for information to flow along it unchanged.

Why sigmoid or tanh: the sigmoid outputs values between 0 and 1, so it acts as a gating switch. The vanishing gradient problem is already handled in LSTM. Is it OK if ReLU replaces tanh?

It decides what component is to be updated. C’t provides the change contents.

Updating the cell state

Decide what part of the cell state to output
RNN vs LSTM
Peephole LSTM

Allows “peeping into the memory”

Naïve RNN vs LSTM

[Figure: the naïve RNN maps (ht-1, xt) to (ht, yt); the LSTM maps (ct-1, ht-1, xt) to (ct, ht, yt).]

c changes slowly: ct is ct-1 added by something

h changes faster: ht and ht-1 can be very different

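$z = \tanh\left(W \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right)$ (updating information)

$z_i = \sigma\left(W_i \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right)$ (controls input gate)

$z_f = \sigma\left(W_f \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right)$ (controls forget gate)

$z_o = \sigma\left(W_o \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right)$ (controls output gate)

These 4 matrix computations should be done concurrently.

[Figure: information flow of LSTM. ct-1, ht-1, and xt enter the cell, and zf, zi, z, zo are computed from xt and ht-1.]

Information flow of LSTM

With a diagonal “peephole” connection, ct-1 is stacked with xt and ht-1:

$z = \tanh\left(W \begin{pmatrix} x_t \\ h_{t-1} \\ c_{t-1} \end{pmatrix}\right)$

zo, zf, zi are obtained in the same way.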
Information flow of LSTM

$c_t = z_f \odot c_{t-1} + z_i \odot z$

$h_t = z_o \odot \tanh(c_t)$

$y_t = \sigma(W' h_t)$

where $\odot$ is element-wise multiplication.
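One LSTM step, following the equations for z, zi, zf, zo, ct, and ht, can be sketched as below. The one-dimensional sizes and the weight values are made up for illustration, biases and the peephole connection are left out, and the output layer yt is omitted.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lstm_step(x, h_prev, c_prev, W, W_i, W_f, W_o):
    """One LSTM step without biases or peepholes; returns (h_t, c_t)."""
    v = x + h_prev                                   # stacked [x_t; h_{t-1}]
    z = [math.tanh(a) for a in matvec(W, v)]         # candidate update
    z_i = [sigmoid(a) for a in matvec(W_i, v)]       # input gate
    z_f = [sigmoid(a) for a in matvec(W_f, v)]       # forget gate
    z_o = [sigmoid(a) for a in matvec(W_o, v)]       # output gate
    c = [f * cp + i * zz for f, cp, i, zz in zip(z_f, c_prev, z_i, z)]
    h = [o * math.tanh(cc) for o, cc in zip(z_o, c)]
    return h, c

# Illustrative one-dimensional example: one input, one hidden unit.
W, W_i, W_f, W_o = [[0.5, 0.1]], [[0.3, -0.2]], [[1.0, 0.4]], [[-0.6, 0.2]]
h, c = [0.0], [0.0]
for x_t in ([1.0], [0.5], [-1.0]):
    h, c = lstm_step(x_t, h, c, W, W_i, W_f, W_o)
    print(h, c)
```

Because ct is built from ct-1 by a gated addition, the cell state can carry information across many steps with little change when the forget gate stays close to 1.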

LSTM information flow

[Figure: two consecutive LSTM steps. Step t takes ct-1, ht-1, and xt and produces ct, ht, and yt; step t+1 takes ct, ht, and xt+1 and produces ct+1, ht+1, and yt+1.]

GRU – gated recurrent unit

(more compression)

[Figure: a GRU cell with a reset gate and an update gate; X and * denote element-wise multiplication.]

It combines the forget and input gates into a single update gate. It also merges the cell state and hidden state. This is simpler than LSTM. There are many other variants too.

GRUs also take xt and ht-1 as inputs. They perform some calculations and then pass along ht. What makes them different from LSTMs is that GRUs don't need the cell layer to pass values along. The calculations within each iteration ensure that the ht values being passed along either retain a high amount of old information or are jump-started with a high amount of new information.
Feed-forward vs Recurrent Network

1. Feedforward network does not have input at each step

2. Feedforward network has different parameters for each layer

x f1 a1 f2 a2 f3 a3 f4 y

t is layer
at = ft(at-1) = σ(W tat-1 + bt)

h0 f h1 f h2 f h3 f g y4

x1 x2 x3 x4
t is time step

at= f(at-1, xt) = σ(W h at-1 + W ixt + bi)

Highway Network

Starting from the GRU (reset gate r, update gate z, candidate h'):

• No input x_t at each step
• No output y_t at each step
• a_{t-1} is the output of the (t-1)-th layer
• a_t is the output of the t-th layer
• No reset gate

h' = σ(W a_{t-1})
z = σ(W' a_{t-1})
a_t = z ⊙ a_{t-1} + (1 - z) ⊙ h'
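A minimal sketch of one highway layer as written on this slide, with the gate z deciding how much of the previous layer's output is copied through unchanged. The names `highway_layer` and `W_gate` (standing in for W') are assumptions for illustration, and bias terms are omitted because the slide shows none.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def highway_layer(a_prev, W, W_gate):
    """a_t = z * a_{t-1} + (1 - z) * h', with h' and z computed from a_{t-1}."""
    h_new = sigmoid(W @ a_prev)       # transformed signal h'
    z = sigmoid(W_gate @ a_prev)      # gate: fraction of a_{t-1} to carry over
    return z * a_prev + (1.0 - z) * h_new

# Stack several layers of the same width (hypothetical sizes).
rng = np.random.default_rng(0)
n, depth = 4, 6
a = rng.normal(size=n)
for _ in range(depth):
    a = highway_layer(a, rng.normal(scale=0.1, size=(n, n)),
                      rng.normal(scale=0.1, size=(n, n)))
```

When z is close to 1 the layer behaves like an identity mapping, which is what lets the signal pass through many layers.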

• Training Very Deep Networks:
  https://arxiv.org/pdf/1507.06228v2.pdf
• Deep Residual Learning for Image Recognition:
  http://arxiv.org/abs/1512.03385
Grid LSTM

[Figure: a standard LSTM block carries (c, h) to (c', h') along time,
with input x and an output layer on top. A Grid LSTM block
additionally carries (a, b) to (a', b') along depth, so each block
takes c, h, a, b and produces c', h', a', b' using the gates
z_f, z_i, z, z_o and tanh.]

The network differs from existing deep LSTM architectures in that
the cells are connected between network layers as well as along the
spatiotemporal dimensions of the data. The network provides a
unified way of using LSTM for both deep and sequential computation.

You can generalize this to 3D, and more.
Applications of LSTM / RNN
Neural machine translation

LSTM
Sequence to sequence chat model
Baidu’s speech recognition using RNN
Useful Resources / References

• http://cs231n.stanford.edu/slides/winter1516_lecture10.pdf
• http://www.cs.toronto.edu/~rgrosse/csc321/lec10.pdf

• R. Pascanu, T. Mikolov, and Y. Bengio, On the difficulty of training recurrent neural
networks, ICML 2013
• S. Hochreiter, and J. Schmidhuber, Long short-term memory, Neural computation,
9(8), pp.1735-1780, 1997
• F.A. Gers, and J. Schmidhuber, Recurrent nets that time and count, IJCNN 2000
• K. Greff , R.K. Srivastava, J. Koutník, B.R. Steunebrink, and J. Schmidhuber, LSTM: A
search space odyssey, IEEE transactions on neural networks and learning systems,
2016
• K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk,
and Y. Bengio, Learning phrase representations using RNN encoder-decoder for
statistical machine translation, EMNLP 2014
• R. Jozefowicz, W. Zaremba, and I. Sutskever, An empirical exploration of recurrent
network architectures, JMLR 2015
Questions
