Lecture 10
Spring 24
Neural Networks and
Deep Learning
▪ Advantages
– prediction accuracy is generally high
– robust, works when training examples contain errors
– output may be discrete, real-valued, or a vector of several
discrete or real-valued attributes
– fast evaluation of the learned target function
▪ Criticism
– long training time
– difficult to understand the learned function (weights)
– not easy to incorporate domain knowledge
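A Neuron
[Figure: a single neuron. Inputs x0, x1, …, xn are multiplied by weights w0, w1, …, wn and summed; the threshold μk is subtracted; the activation function f produces the output y.]
$y = f\left(\sum_{i=0}^{n} w_i x_i - \mu_k\right)$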
Network Training
Activation: $\frac{1}{1 + e^{0.5}} = 0.3775$
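A minimal sketch of the activation value above, assuming the logistic (sigmoid) function and a net input of -0.5 (inferred from the e^0.5 in the denominator; the slide does not state the net input):

```python
import math

def sigmoid(net):
    """Logistic activation: 1 / (1 + e^(-net))."""
    return 1.0 / (1.0 + math.exp(-net))

# Assumed net input of -0.5, so the denominator is 1 + e^0.5.
activation = sigmoid(-0.5)
print(round(activation, 4))
```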
Typical activation functions
Example: Voice Recognition
▪ Data
– Sources
• Steve
• David
– Format
• Frequency distribution (60 bins)
• Analogy: cochlea
▪ Network architecture
– Feed forward network
• 60 input (one for each frequency bin)
• 6 hidden
• 2 output (0-1 for “Steve”, 1-0 for “David”)
Presenting the data
Steve
David
(untrained network)
Steve
0.43
0.26
David
0.73
0.55
Calculate error
Steve
|0.43 – 0| = 0.43
|0.26 – 1| = 0.74
David
|0.73 – 1| = 0.27
|0.55 – 0| = 0.55
Backprop error and adjust weights
Steve
|0.43 – 0| = 0.43
|0.26 – 1| = 0.74
Total error: 1.17
David
|0.73 – 1| = 0.27
|0.55 – 0| = 0.55
Total error: 0.82
▪ Repeat process (sweep) for all training pairs
– Present data
– Calculate error
– Backpropagate error
– Adjust weights
▪ Repeat process multiple times
(trained network)
Steve
0.01
0.99
David
0.99
0.01
Network Parameters
Weights
Size of Training Data
▪ Rule of thumb:
– the number of training examples should be at least five to ten times the number of weights of the network.
▪ Other rule:
$N > \frac{|W|}{1 - a}$
where N = number of training examples, |W| = number of weights, a = expected accuracy on the test set
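Both rules can be evaluated for a concrete network. The sketch below uses the 60-6-2 voice recognition network from earlier; counting only connection weights (no biases) and an expected accuracy of 0.9 are assumptions made for illustration.

```python
def rule_of_thumb_range(num_weights):
    """Five to ten times the number of weights."""
    return 5 * num_weights, 10 * num_weights

def min_training_examples(num_weights, accuracy):
    """Lower bound N > |W| / (1 - a)."""
    return num_weights / (1.0 - accuracy)

# 60 inputs -> 6 hidden -> 2 outputs, connection weights only (assumption).
num_weights = 60 * 6 + 6 * 2

print(rule_of_thumb_range(num_weights))
print(round(min_training_examples(num_weights, accuracy=0.9)))
```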
Training Basics
Training: Backprop algorithm
▪ The Backprop algorithm searches for weight values
that minimize the total error of the network over the
set of training examples (training set).
▪ Backprop consists of the repeated application of the
following two passes:
– Forward pass: in this step the network is activated on one
example and the error of (each neuron of) the output layer is
computed.
– Backward pass: in this step the network error is used for
updating the weights. Starting at the output layer, the error is
propagated backwards through the network, layer by layer. This is
done by recursively computing the local gradient of each neuron.
Back Propagation
Error propagation
Backward Step
Perceptrons
▪ Initial proposal of connectionist networks
▪ Rosenblatt, 50’s and 60’s
▪ Essentially a linear discriminant composed of
nodes, weights
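[Figure: a perceptron drawn two ways: inputs I1, I2, I3 with weights W1, W2, W3 feeding the output O, the second drawing with an additional constant input of 1.]
Activation Function:
$O = \begin{cases} 1 & \text{if } \sum_i w_i I_i + \theta > 0 \\ 0 & \text{otherwise} \end{cases}$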
Perceptron Example
[Figure: perceptron with inputs I1 = 2 and I2 = 1, weights W1 = 0.5 and W2 = 0.3, and threshold θ = -1.]
Learning Procedure:
• Randomly assign weights (between 0-1)
• Present inputs from training data
• Get output O, nudge weights to give results closer to our
desired output T
• Repeat; stop when no errors, or enough epochs completed
Perceptron Training
$w_i(t + 1) = w_i(t) + \Delta w_i(t)$
$\Delta w_i(t) = (T - O)\, I_i$
Weights including Threshold. T=Desired, O=Actual output.
Example: T=0, O=1, W1=0.5, W2=0.3, I1=2, I2=1,Theta=-1
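The example above can be worked through in a few lines. This is a sketch, assuming the threshold is updated like a weight on a constant input of 1 and that the output fires when the net input is greater than 0; the function names are illustrative.

```python
def perceptron_output(weights, inputs, theta):
    """Step activation on the weighted sum plus threshold."""
    net = sum(w * x for w, x in zip(weights, inputs)) + theta
    return 1 if net > 0 else 0

def perceptron_update(weights, inputs, theta, target):
    """One application of delta_w_i = (T - O) * I_i."""
    out = perceptron_output(weights, inputs, theta)
    delta = target - out
    new_weights = [w + delta * x for w, x in zip(weights, inputs)]
    new_theta = theta + delta * 1  # threshold: weight on a constant input of 1
    return out, new_weights, new_theta

# T=0, W1=0.5, W2=0.3, I1=2, I2=1, Theta=-1
out, new_w, new_theta = perceptron_update([0.5, 0.3], [2, 1], -1, target=0)
print(out, new_w, new_theta)
```

Here the net input is 2(0.5) + 1(0.3) - 1 = 0.3, so O = 1, and every weight is pushed down by its input value.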
Using a perceptron network
– 67,1,4,120,229,…, 1
– 37,1,3,130,250,… ,0
– 41,0,2,130,204,… ,0
Perceptron
Class1 Class2
Exclusive Or (XOR) Problem
1 0
0 1
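[Figure: multilayer feed-forward network, from the input vector x_i through input nodes, hidden nodes, and output nodes to the output vector, with weights w_ij on the links.]
Net input and output of node j:
$I_j = \sum_i w_{ij} O_i + \theta_j$
$O_j = \frac{1}{1 + e^{-I_j}}$
Error at an output node j:
$Err_j = O_j (1 - O_j)(T_j - O_j)$
Error at a hidden node j (k ranges over the nodes of the next layer):
$Err_j = O_j (1 - O_j) \sum_k Err_k \, w_{jk}$
Updates with learning rate l:
$w_{ij} = w_{ij} + (l)\, Err_j \, O_i$
$\theta_j = \theta_j + (l)\, Err_j$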
Gradient Descent
Error Surface
[Figure: error plotted as a surface over the weights.]
• We need a derivative!
• Activation function must be continuous,
differentiable, non-decreasing, and easy to compute
LMS Learning
• LMS = Least Mean Square learning Systems, more general than the
previous perceptron learning rule. The concept is to minimize the
total error, as measured over all training examples, P. O is the raw
output, as calculated by:
$O = \sum_i w_i I_i + \theta$
$\text{Distance}(LMS) = \frac{1}{2} \sum_P (T_P - O_P)^2$
E.g. if we have two patterns and
T1 = 1, O1 = 0.8, T2 = 0, O2 = 0.5, then D = (0.5)[(1 − 0.8)² + (0 − 0.5)²] = 0.145
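The two-pattern example can be checked directly. A short sketch (the function name is illustrative):

```python
def lms_distance(targets, outputs):
    """Half the sum of squared differences over all training patterns."""
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))

# T1=1, O1=0.8, T2=0, O2=0.5
d = lms_distance([1, 0], [0.8, 0.5])
print(round(d, 3))
```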
How do we pick c?
1. Tuning set, or
2. Cross validation, or
Estimating Error Rates
• Partition: Training-and-testing
– use two independent data sets, e.g., training set (2/3), test
set(1/3)
– used for data set with large number of samples
• Cross-validation
– divide the data set into k subsamples
– use k-1 subsamples as training data and one sub-sample as test
data --- k-fold cross-validation
– for data set with moderate size
• Bootstrapping or leave-one-out
– for small size data
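As an illustration of k-fold cross-validation, the sketch below generates the index splits only. It assumes the samples are already in random order (no shuffling is done), and the function name is hypothetical.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_splits(10, 5))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold; the error estimate is the average of the k test errors.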
LMS Gradient Descent
• Using LMS, we want to minimize the error. We can do this by finding
the direction on the error surface that most rapidly reduces the error
rate; this is finding the slope of the error function by taking the
derivative. The approach is called gradient descent (similar to hill
climbing). To compute how much to change weight for link k:
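$\Delta w_k = -c \frac{\partial Error}{\partial w_k}$, with $O_j = f\left(\sum I W\right)$
Chain rule: $\frac{\partial Error}{\partial w_k} = \frac{\partial Error}{\partial O_j} \cdot \frac{\partial O_j}{\partial w_k}$, where $\frac{\partial O_j}{\partial w_k} = I_k \, f'(ActivationFunction(I_k W_k))$
$\frac{\partial Error}{\partial O_j} = \frac{\partial}{\partial O_j} \frac{1}{2} \sum_P (T_P - O_P)^2 = \frac{\partial}{\partial O_j} \frac{1}{2} (T_j - O_j)^2 = -(T_j - O_j)$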
We can remove the sum since we are taking the partial derivative wrt Oj
Activation Function
• To apply the LMS learning rule, also known
as the delta rule, we need a differentiable
activation function.
$\Delta w_k = c\, I_k (T_j - O_j)\, f'(Activation\_Function)$
Old: $O = 1$ if $\sum_i w_i I_i + \theta > 0$, otherwise $O = 0$
New: $O = \frac{1}{1 + e^{-\left(\sum_i w_i I_i + \theta\right)}}$
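A minimal sketch of the delta rule with the sigmoid activation, using the fact that the sigmoid's derivative is O(1 − O). The starting weights reuse the earlier perceptron example; the learning rate c = 0.5 and the number of iterations are arbitrary choices for illustration.

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def delta_rule_step(weights, theta, inputs, target, c):
    """One update: delta_w_k = c * I_k * (T - O) * f'(net), f' = O * (1 - O)."""
    net = sum(w * x for w, x in zip(weights, inputs)) + theta
    out = sigmoid(net)
    grad = (target - out) * out * (1.0 - out)
    new_weights = [w + c * x * grad for w, x in zip(weights, inputs)]
    new_theta = theta + c * grad  # threshold: weight on a constant input of 1
    return out, new_weights, new_theta

weights, theta = [0.5, 0.3], -1.0
first_out, _, _ = delta_rule_step(weights, theta, [2, 1], 0, c=0.5)
for _ in range(200):
    _, weights, theta = delta_rule_step(weights, theta, [2, 1], 0, c=0.5)
final_out, _, _ = delta_rule_step(weights, theta, [2, 1], 0, c=0.5)
print(first_out, final_out)
```

Unlike the perceptron rule, the output moves gradually toward the target instead of flipping between 0 and 1.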
LMS vs. Limiting Threshold
• With the new sigmoidal function that is differentiable, we
can apply the delta rule toward learning.
• Perceptron Method
– Forces the output to 0 or 1, while LMS uses the net output
– Guaranteed to find a separating boundary if the training data contain no errors and are linearly separable
• Otherwise it may not converge
• Gradient Descent Method:
– May oscillate and not converge
– May converge to wrong answer
– Will converge to some minimum even if the classes are not linearly
separable, unlike the earlier perceptron training method
Backpropagation Networks
• Popularized by Rumelhart, Hinton, and Williams in 1986 (often attributed to Rumelhart and McClelland's PDP group)
• To bypass the linear classification problem, we can
construct multilayer networks. Typically we have fully
connected, feedforward networks.
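[Figure: fully connected feedforward network with an input layer (I1, I2, I3), a hidden layer (H1, H2), and an output layer (O1, O2). Weights Wi,j connect input to hidden and Wj,k connect hidden to output; the extra inputs fixed at 1 are the biases.]
$H(x) = \frac{1}{1 + e^{-\sum_i w_{i,x} I_i}}$
$O(x) = \frac{1}{1 + e^{-\sum_j w_{j,x} H_j}}$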
Backprop - Learning
Learning Procedure:
Backprop - Modifying Weights
We had computed:
$\Delta w_{j,k} = c\, H_j (T_k - O_k)\, O_k (1 - O_k)$
For the Hidden units (skipping some math), this is:
[Figure: three layers I, H, O with weights Wi,j between I and H and Wj,k between H and O.]
Example of Back-propagation algorithm
Note: The sigmoid function is applied to hidden layer and output layer.
Initial input and weight values
x1  x2  x3  w14  w15   w24  w25  w34   w35  w46   w56   w04   w05  w06
-----------------------------------------------------------------------------------
1   0   1   0.2  -0.3  0.4  0.1  -0.5  0.2  -0.3  -0.2  -0.4  0.2  0.1
Calculation for weight updating
Weight   New value
------------------------------------------------------------------------------
w46      -0.3 + (0.9)(0.1311)(0.332) = -0.261
w56      -0.2 + (0.9)(0.1311)(0.525) = -0.138
w14       0.2 + (0.9)(-0.0087)(1)    =  0.192
w15      -0.3 + (0.9)(-0.0065)(1)    = -0.306
w24       0.4 + (0.9)(-0.0087)(0)    =  0.4
w25       0.1 + (0.9)(-0.0065)(0)    =  0.1
w34      -0.5 + (0.9)(-0.0087)(1)    = -0.508
w35       0.2 + (0.9)(-0.0065)(1)    =  0.194
w06       0.1 + (0.9)(0.1311)        =  0.218
w05       0.2 + (0.9)(-0.0065)       =  0.194
w04      -0.4 + (0.9)(-0.0087)       = -0.408
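The table can be reproduced with one forward pass and one backward pass. This is a sketch under stated assumptions: units 4 and 5 are hidden, unit 6 is the output, w04, w05, w06 are the biases, the learning rate is the 0.9 that appears in the table, and the target output is 1 (not stated on the slide, but consistent with the error values shown).

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

x = {1: 1.0, 2: 0.0, 3: 1.0}
w = {(1, 4): 0.2, (1, 5): -0.3, (2, 4): 0.4, (2, 5): 0.1,
     (3, 4): -0.5, (3, 5): 0.2, (4, 6): -0.3, (5, 6): -0.2}
bias = {4: -0.4, 5: 0.2, 6: 0.1}
rate = 0.9
target = 1.0  # assumed target for output unit 6

# Forward pass: hidden units 4 and 5, then output unit 6.
out = dict(x)
for j in (4, 5):
    out[j] = sigmoid(sum(w[(i, j)] * out[i] for i in (1, 2, 3)) + bias[j])
out[6] = sigmoid(sum(w[(j, 6)] * out[j] for j in (4, 5)) + bias[6])

# Backward pass: output error first, then hidden errors.
err = {6: out[6] * (1 - out[6]) * (target - out[6])}
for j in (4, 5):
    err[j] = out[j] * (1 - out[j]) * err[6] * w[(j, 6)]

# Weight and bias updates.
new_w = {(i, j): w[(i, j)] + rate * err[j] * out[i] for (i, j) in w}
new_bias = {j: bias[j] + rate * err[j] for j in bias}

for key in sorted(new_w):
    print("w%d%d" % key, round(new_w[key], 3))
```

Small differences in the last digit of intermediate values can arise because the table uses rounded outputs and errors.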
The decision boundary perspective…
[Figure sequence: (1) initial random weights; (2) to (5) present a training instance / adjust the weights; (6) eventually ….]
The effect of local minima
Validation error
M
Regularizing neural networks
Demonstration of over-fitting
(M = # hidden units)
Regularizing neural networks
Starting to overfit
Time/Iteration
Early Stopping
Backpropagation
• Very powerful: given enough hidden units, the network can approximate essentially any function.
• Faces the trade-off between generalization and memorization. With too
many units, the network tends to memorize the training inputs and not generalize
well. Some schemes exist to “prune” the neural network.
• Networks require extensive training and have many parameters to tune.
They can be extremely slow to train and may also get stuck in local
minima.
• Inherently parallel algorithm, ideal for multiprocessor hardware.
• Despite the cons, a very powerful algorithm that has seen
widespread successful deployment.
Applications of Feed-forward nets
– Pattern recognition
• Character recognition
• Face Recognition
• Speech recognition
• Etc.
– Navigation of a car
– Stock-market prediction
Introduction to:
Deep Learning
aka or related to
Deep Neural Networks
DL is providing breakthrough results in speech
recognition and image classification …
From this Hinton et al 2012 paper:
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38131.pdf
go here: http://yann.lecun.com/exdb/mnist/
From here:
http://people.idsia.ch/~juergen/cvpr2012.pdf
So, 1. what exactly is deep learning ?
but these algorithms are not good at learning the weights for
networks with more hidden layers
[Figure: motorbike and “non”-motorbike images plotted in the raw input space (axes: pixel 1, pixel 2) and fed directly to the learning algorithm.]
Feature representations
[Figure: a feature representation (“handle”, “wheel”) maps each input from the input space (axes: pixel 1, pixel 2) to a feature space (axes: “handle”, “wheel”) before the learning algorithm separates motorbikes from “non”-motorbikes.]
How is computer perception done?
Object
detection
Audio
classification This is written text.
Helicopter
control
Low-level state
Helicopter
features Action
Audio features: Spectrogram, MFCC, Flux, ZCR, Rolloff
Problems of hand-tuned features
1. Needs expert knowledge
2. Time-consuming and expensive
3. Does not generalize to other domains
Key question: Can we automatically learn a good feature representation?
The goal of Unsupervised Feature Learning
Unlabeled images
Learning
algorithm
Feature representation
Why feature hierarchies
• Natural progression from low level to high level structure as seen in natural complexity
• Easier to monitor what is being learnt and to guide the machine to better subspaces
[Figure: feature hierarchy, bottom to top: pixels → edges → object parts (combination of edges) → object models.]
But, until very recently, our weight-learning
algorithms simply did not work on multi-layer
architectures
The new way to train multi-layer NNs…
“beak” detector
Same pattern appears in different places:
They can be compressed!
“upper-left
beak” detector
“middle beak”
detector
A convolutional layer
A CNN is a neural network with some convolutional layers
(and some other layers). A convolutional layer has a number
of filters, each of which performs a convolution operation.
Beak detector
A filter
Convolution
6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1
Filter 2:
-1  1 -1
-1  1 -1
-1  1 -1
…
The filter values are the network parameters to be learned.
Each filter detects a small pattern (3 x 3).
Convolution
Filter 1, stride = 1, on the 6 x 6 image: the dot product of the filter with the top-left 3 x 3 patch is 3; shifting the filter one column to the right gives -1.
Convolution
Filter 1, if stride = 2, on the 6 x 6 image: the first two dot products along the top row are 3 and -3.
Convolution
Filter 1, stride = 1, over the whole 6 x 6 image gives a 4 x 4 output:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
Convolution
Repeat this for each filter. Filter 2, stride = 1, over the 6 x 6 image gives a second 4 x 4 output:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
Feature Map: together with the 4 x 4 output of Filter 1, this gives two 4 x 4 images,
forming a 2 x 4 x 4 matrix
Color image: RGB 3 channels
[Figure: the 6 x 6 image and the 3 x 3 Filter 1 and Filter 2, drawn for an image with three color channels.]
Convolution vs. fully connected
[Figure: the 6 x 6 image processed by convolution, compared with the same image flattened into 36 inputs x1, x2, …, x36 of a fully connected layer.]
[Figure: Filter 1 at the top-left position. The output neuron (value 3) only connects to 9 inputs (pixels 1, 2, 3, 7, 8, 9, 13, 14, 15), not fully connected: fewer parameters!]
[Figure: Filter 1 at the next position. The second output neuron (value -1) connects to pixels 2, 3, 4, 8, 9, 10, 14, 15, 16 with shared weights: even fewer parameters.]
The whole CNN
[Figure: image → Convolution → Max Pooling → Convolution → Max Pooling (can repeat many times) → Flattened → Fully Connected Feedforward network → cat, dog, ……]
Max Pooling
Filter 1 output:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
Filter 2 output:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
Why Pooling
Subsampling
[Figure: 6 x 6 image → Conv → Max Pooling → new image but smaller (2 x 2 image). Each filter is a channel.]
Channel from Filter 1:
3 0
3 1
Channel from Filter 2:
-1 1
 0 3
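The convolution and max pooling numbers on these slides can be reproduced with a short sketch in plain Python. It assumes no padding, no kernel flipping (the filter is applied as a sliding dot product, as in the figures), and non-overlapping 2 x 2 pooling; the function names are illustrative.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and take the dot product at each position."""
    k = len(kernel)
    out = []
    for r in range(0, len(image) - k + 1, stride):
        row = []
        for c in range(0, len(image[0]) - k + 1, stride):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out

def max_pool(fmap, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]), size)]
            for r in range(0, len(fmap), size)]

image = [[1, 0, 0, 0, 0, 1],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 1, 0, 0],
         [1, 0, 0, 0, 1, 0],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 0, 1, 0]]
filter1 = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
filter2 = [[-1, 1, -1], [-1, 1, -1], [-1, 1, -1]]

map1 = conv2d(image, filter1)   # 4 x 4 feature map of Filter 1
map2 = conv2d(image, filter2)   # 4 x 4 feature map of Filter 2
pooled1 = max_pool(map1)        # 2 x 2 channel from Filter 1
pooled2 = max_pool(map2)        # 2 x 2 channel from Filter 2
print(pooled1, pooled2)
```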
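The whole CNN
[Figure: image → Convolution → Max Pooling → Convolution → Max Pooling → Flattened → Fully Connected Feedforward network → cat, dog, ……. The pooled 2 x 2 channels (3 0 / 3 1 and -1 1 / 0 3) are flattened into a single vector that feeds the fully connected feedforward network.]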
AlphaGo
[Figure: 19 x 19 matrix (black: 1, white: -1, none: 0) → Neural Network → next move (19 x 19 positions).]
A fully-connected feedforward network can be used, but a CNN performs much better.
AlphaGo’s policy network
The following is a quotation from their Nature article:
Note: AlphaGo does not use Max Pooling.
CNN in speech recognition
Image Time
Spectrogram
AlexNet Architecture
Showing 81 filters of
11x11x3.
Capture low-level
features like oriented
edges, blobs.
Top 9 patches that activate each filter
in layer 1
Each 3x3 block shows the top 9 patches
for one filter.
Second layer
This is our fully connected network. If the number of inputs x1 … xn is very large and growing,
this network would become too large. Instead, we input one xi at a time
and re-use the same edge weights.
Outline
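[Figure: hidden states h0, h1, …, h5 chained over time; the hidden state acts as “memory” / “context” and the last state h5 feeds the classifier.]
[Figure: at time t, the input xt enters the hidden layer, which produces the hidden representation ht; the classifier maps ht to the output yt.]
Recurrence: $h_t = f_W(x_t, h_{t-1})$
where $h_t$ is the new state, $f_W$ is a function of W, $x_t$ is the input at time t, and $h_{t-1}$ is the old state.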
Unrolling the RNN
[Figure: the RNN unrolled for t = 1, 2, 3. At each step the hidden layer takes the input xt and the previous state (starting from h0), produces ht, and the classifier outputs yt.]
Vanilla RNN Cell
$h_t = f_W(x_t, h_{t-1}) = \tanh\left(W \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right)$
[Figure: cell with inputs ht-1 and xt and output ht.]
J. Elman, Finding structure in time, Cognitive science 14(2), pp. 179–211, 1990
Vanilla RNN Cell
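$h_t = f_W(x_t, h_{t-1}) = \tanh\left(W \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right)$
$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} = 2\sigma(2a) - 1$
[Figure: cell with weights W, inputs ht-1 and xt, and output ht; plots of σ(a) and tanh(a).]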
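A small sketch of the vanilla RNN cell and of the identity tanh(a) = 2σ(2a) − 1. The weight values and sizes (one input, two hidden units) are hypothetical, and no bias term is used, matching the formula above.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def rnn_cell(W, x_t, h_prev):
    """h_t = tanh(W [x_t; h_prev]); W has one row per hidden unit."""
    stacked = list(x_t) + list(h_prev)
    return [math.tanh(sum(w * v for w, v in zip(row, stacked))) for row in W]

# Hypothetical weights: 2 hidden units, each seeing 1 input + 2 previous states.
W = [[0.5, -0.2, 0.1],
     [0.3, 0.8, -0.5]]
h1 = rnn_cell(W, [1.0], [0.0, 0.0])
h2 = rnn_cell(W, [0.5], h1)  # the same W is reused at the next time step
print(h1, h2)

# Numerical check of tanh(a) = 2 * sigmoid(2a) - 1 at a few points.
identity_gaps = [abs(math.tanh(a) - (2 * sigmoid(2 * a) - 1)) for a in (-2.0, -0.5, 0.0, 1.0, 3.0)]
```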
RNN Forward Pass
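$h_t = \tanh\left(W \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right)$ (shared weights)
$y_t = \text{softmax}(W_y h_t)$
$e_t = -\log(y_t(GT_t))$, where $GT_t$ is the ground-truth class at time t
[Figure: unrolled over three steps: (h0, x1) → h1, (h1, x2) → h2, (h2, x3) → h3; each ht produces an output yt and an error et.]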
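The three forward-pass equations can be sketched as a loop over time steps. The weights, inputs, and ground-truth labels below are hypothetical; the point is that W and W_y are shared across all steps.

```python
import math

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_forward(W, W_y, xs, targets, h0):
    """Return the final hidden state and the per-step losses e_t."""
    h, losses = list(h0), []
    for x_t, gt in zip(xs, targets):
        h = [math.tanh(s) for s in matvec(W, list(x_t) + h)]  # h_t
        y = softmax(matvec(W_y, h))                           # y_t
        losses.append(-math.log(y[gt]))                       # e_t
    return h, losses

# Hypothetical sizes: 1 input, 2 hidden units, 2 output classes.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W_y = [[1.0, -1.0], [0.5, 2.0]]
h_final, losses = rnn_forward(W, W_y, xs=[[1.0], [0.5], [-1.0]], targets=[0, 1, 1], h0=[0.0, 0.0])
print(losses, sum(losses))
```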
Backpropagation Through Time (BPTT)
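[Figure: the same unrolled network as in the forward pass, with $e_t = -\log(y_t(GT_t))$, $y_t = \text{softmax}(W_y h_t)$, and $h_t = \tanh\left(W \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right)$; the errors e1, e2, e3 are propagated backward through h3, h2, h1.]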
Unfolded RNN Backward Pass
[Figure: backward pass through the unfolded network, with $e_t = -\log(y_t(GT_t))$, $y_t = \text{softmax}(W_y h_t)$, and $h_t = \tanh\left(W \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right)$.]
Backpropagation Through Time (BPTT)
y1 y2 y3
h0 f h1 f h2 f h3 ……
x1 x2 x3
No matter how long the input/output sequence is, we only
need one function f. If the f’s are different, then it becomes a
feedforward NN. This can be viewed as another form of compression
of a fully connected network.
Deep RNN h’,y = f1(h,x), g’,z = f2(g,y)
…
z1 z2 z3
g0 f2 g1 f2 g2 f2 g3 ……
y1 y2 y3
h0 f1 h1 f1 h2 f1 h3 ……
x1 x2 x3
Bidirectional RNN y,h=f1(x,h) z,g = f2(g,x)
x1 x2 x3
g0 f2 g1 f2 g2 f2 g3
z1 z2 z3
p=f3(y,z) f3 p1 f3 p2 f3 p3
y1 y2 y3
h0 f1 h1 f1 h2 f1 h3
x1 x2 x3
Pyramid RNN: significantly speeds up training
[Figure: naïve RNN cell f mapping (h, x) to (h', y): h' is computed from Wh h + Wi x, and y = softmax(Wo h'). Note, y is computed from h’.]
[Figure: LSTM cell with inputs Ct-1 and ht-1, showing the forget gate and the input gate.]
The core idea is the cell state Ct: it is changed slowly, with only minor linear interactions. It is very easy for information to flow along it unchanged.
Why sigmoid or tanh: the sigmoid outputs values between 0 and 1 and acts as a gating switch. The vanishing gradient problem is already handled in LSTM. Is it ok if ReLU replaces tanh?
it decides what component is to be updated.
C’t provides change contents
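[Figure: naïve RNN cell (inputs ht-1, xt; outputs ht, yt) next to an LSTM cell (inputs ct-1, ht-1, xt; outputs ct, ht, yt).]
Four vectors are computed from xt and ht-1:
$z = \tanh\left(W \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right)$ (updating information)
$z^i = \sigma\left(W^i \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right)$ (controls input gate)
$z^f = \sigma\left(W^f \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right)$ (controls forget gate)
$z^o = \sigma\left(W^o \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right)$ (controls output gate)
“Peephole”: ct-1 is stacked with xt and ht-1 as input, and the part of W that multiplies ct-1 is diagonal; zo, zf, zi are obtained in the same way.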
Information flow of LSTM
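($\odot$ denotes element-wise multiply)
$c_t = z^f \odot c_{t-1} + z^i \odot z$
$h_t = z^o \odot \tanh(c_t)$
$y_t = \sigma(W' h_t)$
[Figure: LSTM cell computing zf, zi, z, zo from ht-1 and xt, then ct, ht, and yt.]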
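The gate equations and the information flow can be combined into one step function. This is a sketch without peephole connections, biases, or the output layer y_t; the weight matrices and sizes in the usage example are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def lstm_step(W, W_i, W_f, W_o, x_t, h_prev, c_prev):
    """One LSTM step: returns (h_t, c_t)."""
    v = list(x_t) + list(h_prev)                # stacked input [x_t; h_{t-1}]
    z = [math.tanh(s) for s in matvec(W, v)]    # updating information
    z_i = [sigmoid(s) for s in matvec(W_i, v)]  # input gate
    z_f = [sigmoid(s) for s in matvec(W_f, v)]  # forget gate
    z_o = [sigmoid(s) for s in matvec(W_o, v)]  # output gate
    c_t = [f * c + i * g for f, c, i, g in zip(z_f, c_prev, z_i, z)]
    h_t = [o * math.tanh(c) for o, c in zip(z_o, c_t)]
    return h_t, c_t

# Hypothetical weights: 1 input, 2 hidden units (each row sees 3 values).
W   = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W_i = [[0.1, 0.0, 0.2], [-0.3, 0.4, 0.0]]
W_f = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
W_o = [[0.0, 0.5, 0.5], [0.2, -0.1, 0.3]]
h, c = [0.0, 0.0], [0.0, 0.0]
for x_t in ([1.0], [0.5], [-1.0]):
    h, c = lstm_step(W, W_i, W_f, W_o, x_t, h, c)
print(h, c)
```

Because c_t is a gated sum of the old cell state and the new content, the cell state changes only as much as the forget and input gates allow.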
[Figure: two LSTM steps chained in time: (ct-1, ht-1, xt) produce (ct, ht, yt), which together with xt+1 produce (ct+1, ht+1, yt+1).]
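Feedforward network (t is layer): x → f1 → a1 → f2 → a2 → f3 → a3 → f4 → y
$a^t = f^t(a^{t-1}) = \sigma(W^t a^{t-1} + b^t)$
Recurrent network (t is time step): h0 → f → h1 → f → h2 → f → h3 → f → g → y4, with inputs x1, x2, x3, x4
[Figure: a gated recurrent cell with reset gate r, update gate z, and candidate h', adapted to a feedforward layer: ht-1 becomes at-1 and ht becomes at.]
• No input xt at each step
• No output yt at each step
• at-1 is the output of the (t-1)-th layer
• at is the output of the t-th layer
• No reset gate
Highway Network:
$h' = \sigma(W a^{t-1})$
$z = \sigma(W' a^{t-1})$
$a^t = z \odot a^{t-1} + (1 - z) \odot h'$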
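A highway layer following the three equations above can be sketched as below. The weight values are hypothetical and no bias terms are used, matching the formulas on the slide.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def highway_layer(W, W_gate, a_prev):
    """a_t = z * a_prev + (1 - z) * h', with h' and z computed from a_prev."""
    h = [sigmoid(s) for s in matvec(W, a_prev)]       # candidate h'
    z = [sigmoid(s) for s in matvec(W_gate, a_prev)]  # gate z
    return [zi * ai + (1.0 - zi) * hi for zi, ai, hi in zip(z, a_prev, h)]

# Hypothetical 2-unit layer.
W = [[0.4, -0.6], [0.1, 0.9]]
W_gate = [[3.0, 0.0], [0.0, -3.0]]
a1 = highway_layer(W, W_gate, [1.0, 1.0])
print(a1)
```

When z is close to 1 the layer copies its input through unchanged; when z is close to 0 it behaves like an ordinary sigmoid layer.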
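Grid LSTM
[Figure: a standard LSTM block maps (c, h) to (c', h') along time with input x; a Grid LSTM block additionally maps (a, b) to (a', b') along depth, so it has memory in both dimensions. Inside the Grid LSTM block, the gates zf, zi, z, zo are computed from h and b, and tanh is applied to the updated memory.]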
The network differs from existing deep LSTM architectures in that the cells are connected between
network layers as well as along the spatiotemporal dimensions of the data. The network provides a
unified way of using LSTM for both deep and sequential computation.
You can generalize this to 3D, and more.
Applications of LSTM / RNN
Neural machine translation
LSTM
Sequence to sequence chat model
Baidu’s speech recognition using RNN
Useful Resources / References
• http://cs231n.stanford.edu/slides/winter1516_lecture10.pdf
• http://www.cs.toronto.edu/~rgrosse/csc321/lec10.pdf