Commit 1dac6d9

finish writing README
1 parent 877f1d2 commit 1dac6d9

5 files changed: 166 additions, 13 deletions

README.md

Lines changed: 83 additions & 7 deletions
@@ -224,6 +224,19 @@ for t in range(500):
 ```
 
 ## PyTorch: Defining new autograd functions
+Under the hood, each primitive autograd operator is really two functions that
+operate on Tensors. The *forward* function computes output Tensors from input
+Tensors. The *backward* function receives the gradient of the output Tensors
+with respect to some scalar value, and computes the gradient of the input Tensors
+with respect to that same scalar value.
+
+In PyTorch we can easily define our own autograd operator by defining a subclass
+of `torch.autograd.Function` and implementing the `forward` and `backward` functions.
+We can then use our new autograd operator by constructing an instance and calling it
+like a function, passing Variables containing input data.
+
+In this example we define our own custom autograd function for performing the ReLU
+nonlinearity, and use it to implement our two-layer network:
 
 ```python
 # Code in file autograd/two_layer_net_custom_function.py
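The body of `autograd/two_layer_net_custom_function.py` is not shown in this hunk. For orientation only, a minimal sketch of the pattern the new text describes, assuming the Variables-era `torch.autograd.Function` API (an instance with `forward`/`backward`, called like a function); this is an illustration, not the contents of that file:

```python
import torch
from torch.autograd import Variable

class MyReLU(torch.autograd.Function):
  """Custom autograd operator: forward computes ReLU, backward its gradient."""
  def forward(self, input):
    # Stash the input Tensor so backward can see where it was negative.
    self.save_for_backward(input)
    return input.clamp(min=0)

  def backward(self, grad_output):
    # grad_output is the gradient of the loss with respect to our output;
    # return the gradient of the loss with respect to our input.
    input, = self.saved_tensors
    grad_input = grad_output.clone()
    grad_input[input < 0] = 0
    return grad_input

x = Variable(torch.randn(4, 5), requires_grad=True)
y = MyReLU()(x)         # construct an instance and call it like a function
y.sum().backward()      # gradients flow through our custom backward
```

The backward pass only needs the saved input to decide where the gradient is zeroed, which is why `save_for_backward` is used rather than recomputing the forward result.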
@@ -312,7 +325,22 @@ In TensorFlow, we define the computational graph once and then execute the same
 graph over and over again, possibly feeding different input data to the graph.
 In PyTorch, each forward pass defines a new computational graph.
 
-# TODO: Describe static vs dynamic
+Static graphs are nice because you can optimize the graph up front; for example
+a framework might decide to fuse some graph operations for efficiency, or to
+come up with a strategy for distributing the graph across many GPUs or many
+machines. If you are reusing the same graph over and over, then this potentially
+costly up-front optimization can be amortized as the same graph is rerun over
+and over.
+
+One aspect where static and dynamic graphs differ is control flow. For some models
+we may wish to perform different computation for each data point; for example a
+recurrent network might be unrolled for different numbers of time steps for each
+data point; this unrolling can be implemented as a loop. With a static graph the
+loop construct needs to be a part of the graph; for this reason TensorFlow
+provides operators such as `tf.scan` for embedding loops into the graph. With
+dynamic graphs the situation is simpler: since we build graphs on-the-fly for
+each example, we can use normal imperative flow control to perform computation
+that differs for each input.
 
 To contrast with the PyTorch autograd example above, here we use TensorFlow to
 fit a simple two-layer net:
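Before the elided TensorFlow example, a small hypothetical snippet (not from the repository) that makes the dynamic-graph point above concrete: ordinary Python control flow decides, on each forward pass, how large the recorded graph is:

```python
import random
import torch
from torch.autograd import Variable

x = Variable(torch.randn(8, 16))
w = Variable(torch.randn(16, 16), requires_grad=True)

h = x
# Ordinary imperative control flow: the number of matrix multiplies, and hence
# the graph that autograd records, can differ on every forward pass.
for _ in range(random.randint(1, 4)):
  h = h.mm(w).clamp(min=0)
loss = h.sum()
loss.backward()   # backpropagate through whatever graph this pass happened to build
```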
@@ -382,6 +410,26 @@ with tf.Session() as sess:
 
 
 ## PyTorch: nn
+Computational graphs and autograd are a very powerful paradigm for defining
+complex operators and automatically taking derivatives; however for large
+neural networks raw autograd can be a bit too low-level.
+
+When building neural networks we frequently think of arranging the computation
+into **layers**, some of which have **learnable parameters** which will be
+optimized during learning.
+
+In TensorFlow, packages like [Keras](https://github.com/fchollet/keras),
+[TensorFlow-Slim](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim),
+and [TFLearn](http://tflearn.org/) provide higher-level abstractions over
+raw computational graphs that are useful for building neural networks.
+
+In PyTorch, the `nn` package serves this same purpose. The `nn` package defines a set of
+**Modules**, which are roughly equivalent to neural network layers. A Module receives
+input Variables and computes output Variables, but may also hold internal state such as
+Variables containing learnable parameters. The `nn` package also defines a set of useful
+loss functions that are commonly used when training neural networks.
+
+In this example we use the `nn` package to implement our two-layer network:
 
 ```python
 # Code in file nn/two_layer_net_nn.py
@@ -396,8 +444,10 @@ N, D_in, H, D_out = 64, 1000, 100, 10
 x = Variable(torch.randn(N, D_in))
 y = Variable(torch.randn(N, D_out), requires_grad=False)
 
-# Use the nn package to define our model as a sequence of layers. Each Linear
-# module has its own weight and bias.
+# Use the nn package to define our model as a sequence of layers. nn.Sequential
+# is a Module which contains other Modules, and applies them in sequence to
+# produce its output. Each Linear Module computes output from input using a
+# linear function, and holds internal Variables for its weight and bias.
 model = torch.nn.Sequential(
     torch.nn.Linear(D_in, H),
     torch.nn.ReLU(),
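The hunk cuts off inside the model definition. For context, a sketch of how the rest of such an `nn` training loop commonly looks in this version of the API, reusing `model`, `x`, and `y` from the lines above; an approximation, not the elided file contents:

```python
loss_fn = torch.nn.MSELoss(size_average=False)   # a loss function from the nn package

learning_rate = 1e-4
for t in range(500):
  y_pred = model(x)            # forward pass: a Module is called like a function
  loss = loss_fn(y_pred, y)    # loss is a Variable holding a single number

  model.zero_grad()            # clear gradients accumulated on the previous step
  loss.backward()              # compute gradients of the loss w.r.t. all parameters

  # Update each learnable parameter by mutating its .data Tensor directly.
  for param in model.parameters():
    param.data -= learning_rate * param.grad.data
```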
@@ -438,6 +488,17 @@ for t in range(500):
 
 
 ## PyTorch: optim
+Up to this point we have updated the weights of our models by manually mutating the
+`.data` member for Variables holding learnable parameters. This is not a huge burden
+for simple optimization algorithms like stochastic gradient descent, but in practice
+we often train neural networks using more sophisticated optimizers like AdaGrad,
+RMSProp, Adam, etc.
+
+The `optim` package in PyTorch abstracts the idea of an optimization algorithm and
+provides implementations of commonly used optimization algorithms.
+
+In this example we will use the `nn` package to define our model as before, but we
+will optimize the model using the Adam algorithm provided by the `optim` package:
 
 ```python
 # Code in file nn/two_layer_net_optim.py
@@ -463,9 +524,9 @@ loss_fn = torch.nn.MSELoss(size_average=False)
 # Use the optim package to define an Optimizer that will update the weights of
 # the model for us. Here we will use stochastic gradient descent (SGD), but the
 # optim package contains many other optimization algorithms. The first argument
-# to the SGD constructor tells the optimizer which Variables it should update.
+# to the Adam constructor tells the optimizer which Variables it should update.
 learning_rate = 1e-4
-optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
+optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
 for t in range(500):
   # Forward pass: compute predicted y by passing x to the model.
   y_pred = model(x)
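The hunk ends just inside the training loop; a typical `optim`-based step continues roughly as in the sketch below, reusing `model`, `x`, `y`, `loss_fn`, and `optimizer` from the surrounding lines (an approximation, not the elided lines of the file):

```python
for t in range(500):
  y_pred = model(x)            # forward pass: compute predictions
  loss = loss_fn(y_pred, y)    # compute loss on this batch

  optimizer.zero_grad()        # zero the gradients held by the optimized Variables
  loss.backward()              # backward pass: gradients of loss w.r.t. parameters
  optimizer.step()             # let the Adam optimizer update the weights
```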
@@ -488,6 +549,12 @@ for t in range(500):
 
 
 ## PyTorch: Custom nn Modules
+Sometimes you will want to specify models that are more complex than a sequence of
+existing Modules; for these cases you can define your own Modules by subclassing
+`nn.Module` and defining a `forward` method which receives input Variables and produces
+output Variables using other modules or other autograd operations on Variables.
+
+In this example we implement our two-layer network as a custom Module subclass:
 
 ```python
 # Code in file nn/two_layer_net_module.py
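`nn/two_layer_net_module.py` itself is elided from the hunk; the subclassing pattern described above looks roughly like this sketch (same Variables-era API, not the exact file):

```python
import torch
from torch.autograd import Variable

class TwoLayerNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    # Construct two nn.Linear Modules and assign them as member variables;
    # their learnable parameters are registered with this Module automatically.
    super(TwoLayerNet, self).__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.linear2 = torch.nn.Linear(H, D_out)

  def forward(self, x):
    # Accept a Variable of input data and return a Variable of output data,
    # using Modules plus arbitrary autograd operations on Variables.
    h_relu = self.linear1(x).clamp(min=0)
    return self.linear2(h_relu)

model = TwoLayerNet(D_in=1000, H=100, D_out=10)
y_pred = model(Variable(torch.randn(64, 1000)))
```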
@@ -548,6 +615,16 @@ for t in range(500):
 
 
 ## PyTorch: Control Flow + Weight Sharing
+As an example of dynamic graphs and weight sharing, we implement a very strange
+model: a fully-connected ReLU network that on each forward pass chooses a random
+number between 1 and 4 and uses that many hidden layers, reusing the same weights
+multiple times to compute the innermost hidden layers.
+
+For this model we can use normal Python flow control to implement the loop, and we
+can implement weight sharing among the innermost layers by simply reusing the
+same Module multiple times when defining the forward pass.
+
+We can easily implement this model as a Module subclass:
 
 ```python
 # Code in file nn/dynamic_net.py
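`nn/dynamic_net.py` is likewise elided; the core of the model described above amounts to something like the following sketch (a rough approximation, not the exact file):

```python
import random
import torch

class DynamicNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    super(DynamicNet, self).__init__()
    self.input_linear = torch.nn.Linear(D_in, H)
    self.middle_linear = torch.nn.Linear(H, H)
    self.output_linear = torch.nn.Linear(H, D_out)

  def forward(self, x):
    # Pick a random depth on every forward pass and reuse the *same*
    # middle_linear Module for each extra hidden layer: normal Python flow
    # control plus weight sharing, as described in the README text above.
    h = self.input_linear(x).clamp(min=0)
    for _ in range(random.randint(0, 3)):
      h = self.middle_linear(h).clamp(min=0)
    return self.output_linear(h)
```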
@@ -613,5 +690,4 @@ for t in range(500):
   # Zero gradients, perform a backward pass, and update the weights.
   optimizer.zero_grad()
   loss.backward()
-  optimizer.step()
-```
+  optimizer.step()```

README_raw.md

Lines changed: 76 additions & 1 deletion
@@ -99,6 +99,19 @@ network:
 ```
 
 ## PyTorch: Defining new autograd functions
+Under the hood, each primitive autograd operator is really two functions that
+operate on Tensors. The *forward* function computes output Tensors from input
+Tensors. The *backward* function receives the gradient of the output Tensors
+with respect to some scalar value, and computes the gradient of the input Tensors
+with respect to that same scalar value.
+
+In PyTorch we can easily define our own autograd operator by defining a subclass
+of `torch.autograd.Function` and implementing the `forward` and `backward` functions.
+We can then use our new autograd operator by constructing an instance and calling it
+like a function, passing Variables containing input data.
+
+In this example we define our own custom autograd function for performing the ReLU
+nonlinearity, and use it to implement our two-layer network:
 
 ```python
 :INCLUDE autograd/two_layer_net_custom_function.py
@@ -114,7 +127,22 @@ In TensorFlow, we define the computational graph once and then execute the same
 graph over and over again, possibly feeding different input data to the graph.
 In PyTorch, each forward pass defines a new computational graph.
 
-# TODO: Describe static vs dynamic
+Static graphs are nice because you can optimize the graph up front; for example
+a framework might decide to fuse some graph operations for efficiency, or to
+come up with a strategy for distributing the graph across many GPUs or many
+machines. If you are reusing the same graph over and over, then this potentially
+costly up-front optimization can be amortized as the same graph is rerun over
+and over.
+
+One aspect where static and dynamic graphs differ is control flow. For some models
+we may wish to perform different computation for each data point; for example a
+recurrent network might be unrolled for different numbers of time steps for each
+data point; this unrolling can be implemented as a loop. With a static graph the
+loop construct needs to be a part of the graph; for this reason TensorFlow
+provides operators such as `tf.scan` for embedding loops into the graph. With
+dynamic graphs the situation is simpler: since we build graphs on-the-fly for
+each example, we can use normal imperative flow control to perform computation
+that differs for each input.
 
 To contrast with the PyTorch autograd example above, here we use TensorFlow to
 fit a simple two-layer net:
@@ -125,27 +153,74 @@ fit a simple two-layer net:
 
 
 ## PyTorch: nn
+Computational graphs and autograd are a very powerful paradigm for defining
+complex operators and automatically taking derivatives; however for large
+neural networks raw autograd can be a bit too low-level.
+
+When building neural networks we frequently think of arranging the computation
+into **layers**, some of which have **learnable parameters** which will be
+optimized during learning.
+
+In TensorFlow, packages like [Keras](https://github.com/fchollet/keras),
+[TensorFlow-Slim](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim),
+and [TFLearn](http://tflearn.org/) provide higher-level abstractions over
+raw computational graphs that are useful for building neural networks.
+
+In PyTorch, the `nn` package serves this same purpose. The `nn` package defines a set of
+**Modules**, which are roughly equivalent to neural network layers. A Module receives
+input Variables and computes output Variables, but may also hold internal state such as
+Variables containing learnable parameters. The `nn` package also defines a set of useful
+loss functions that are commonly used when training neural networks.
+
+In this example we use the `nn` package to implement our two-layer network:
 
 ```python
 :INCLUDE nn/two_layer_net_nn.py
 ```
 
 
 ## PyTorch: optim
+Up to this point we have updated the weights of our models by manually mutating the
+`.data` member for Variables holding learnable parameters. This is not a huge burden
+for simple optimization algorithms like stochastic gradient descent, but in practice
+we often train neural networks using more sophisticated optimizers like AdaGrad,
+RMSProp, Adam, etc.
+
+The `optim` package in PyTorch abstracts the idea of an optimization algorithm and
+provides implementations of commonly used optimization algorithms.
+
+In this example we will use the `nn` package to define our model as before, but we
+will optimize the model using the Adam algorithm provided by the `optim` package:
 
 ```python
 :INCLUDE nn/two_layer_net_optim.py
 ```
 
 
 ## PyTorch: Custom nn Modules
+Sometimes you will want to specify models that are more complex than a sequence of
+existing Modules; for these cases you can define your own Modules by subclassing
+`nn.Module` and defining a `forward` method which receives input Variables and produces
+output Variables using other modules or other autograd operations on Variables.
+
+In this example we implement our two-layer network as a custom Module subclass:
 
 ```python
 :INCLUDE nn/two_layer_net_module.py
 ```
 
 
 ## PyTorch: Control Flow + Weight Sharing
+As an example of dynamic graphs and weight sharing, we implement a very strange
+model: a fully-connected ReLU network that on each forward pass chooses a random
+number between 1 and 4 and uses that many hidden layers, reusing the same weights
+multiple times to compute the innermost hidden layers.
+
+For this model we can use normal Python flow control to implement the loop, and we
+can implement weight sharing among the innermost layers by simply reusing the
+same Module multiple times when defining the forward pass.
+
+We can easily implement this model as a Module subclass:
 
 ```python
 :INCLUDE nn/dynamic_net.py

nn/dynamic_net.py

Lines changed: 1 addition & 1 deletion
@@ -67,4 +67,4 @@ def forward(self, x):
   # Zero gradients, perform a backward pass, and update the weights.
   optimizer.zero_grad()
   loss.backward()
-  optimizer.step()
+  optimizer.step()

nn/two_layer_net_nn.py

Lines changed: 4 additions & 2 deletions
@@ -21,8 +21,10 @@
 x = Variable(torch.randn(N, D_in))
 y = Variable(torch.randn(N, D_out), requires_grad=False)
 
-# Use the nn package to define our model as a sequence of layers. Each Linear
-# module has its own weight and bias.
+# Use the nn package to define our model as a sequence of layers. nn.Sequential
+# is a Module which contains other Modules, and applies them in sequence to
+# produce its output. Each Linear Module computes output from input using a
+# linear function, and holds internal Variables for its weight and bias.
 model = torch.nn.Sequential(
     torch.nn.Linear(D_in, H),
     torch.nn.ReLU(),

nn/two_layer_net_optim.py

Lines changed: 2 additions & 2 deletions
@@ -32,9 +32,9 @@
 # Use the optim package to define an Optimizer that will update the weights of
 # the model for us. Here we will use stochastic gradient descent (SGD), but the
 # optim package contains many other optimization algorithms. The first argument
-# to the SGD constructor tells the optimizer which Variables it should update.
+# to the Adam constructor tells the optimizer which Variables it should update.
 learning_rate = 1e-4
-optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
+optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
 for t in range(500):
   # Forward pass: compute predicted y by passing x to the model.
   y_pred = model(x)

0 commit comments
