@@ -224,6 +224,19 @@ for t in range(500):
```

## PyTorch: Defining new autograd functions
+ Under the hood, each primitive autograd operator is really two functions that
+ operate on Tensors. The *forward* function computes output Tensors from input
+ Tensors. The *backward* function receives the gradient of the output Tensors
+ with respect to some scalar value, and computes the gradient of the input Tensors
+ with respect to that same scalar value.
+
+ In PyTorch we can easily define our own autograd operator by defining a subclass
+ of `torch.autograd.Function` and implementing the `forward` and `backward` functions.
+ We can then use our new autograd operator by constructing an instance and calling it
+ like a function, passing Variables containing input data.
+
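+ As a quick sketch of that pattern (a standalone illustration, not the code from the
+ example file below; names and shapes are arbitrary), a custom ReLU operator could be
+ defined and called like this:
+
+ ```python
+ import torch
+ from torch.autograd import Variable
+
+ class MyReLU(torch.autograd.Function):
+   def forward(self, input):
+     # Save the input Tensor so the backward pass can use it.
+     self.save_for_backward(input)
+     return input.clamp(min=0)
+
+   def backward(self, grad_output):
+     # Pass the upstream gradient through only where the input was positive.
+     input, = self.saved_tensors
+     grad_input = grad_output.clone()
+     grad_input[input < 0] = 0
+     return grad_input
+
+ x = Variable(torch.randn(4, 5))
+ relu = MyReLU()   # construct an instance
+ y = relu(x)       # call it like a function, passing Variables
+ ```
+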
+ In this example we define our own custom autograd function for performing the ReLU
+ nonlinearity, and use it to implement our two-layer network:

```python
# Code in file autograd/two_layer_net_custom_function.py
@@ -312,7 +325,22 @@ In TensorFlow, we define the computational graph once and then execute the same
graph over and over again, possibly feeding different input data to the graph.
In PyTorch, each forward pass defines a new computational graph.

- # TODO: Describe static vs dynamic
+ Static graphs are nice because you can optimize the graph up front; for example,
+ a framework might decide to fuse some graph operations for efficiency, or come
+ up with a strategy for distributing the graph across many GPUs or many machines.
+ If you are reusing the same graph over and over, then this potentially costly
+ up-front optimization can be amortized across all of those reruns.
+
+ One aspect where static and dynamic graphs differ is control flow. For some models
+ we may wish to perform different computation for each data point; for example, a
+ recurrent network might be unrolled for a different number of time steps for each
+ data point, and this unrolling can be implemented as a loop. With a static graph the
+ loop construct needs to be a part of the graph; for this reason TensorFlow
+ provides operators such as `tf.scan` for embedding loops into the graph. With
+ dynamic graphs the situation is simpler: since we build graphs on-the-fly for
+ each example, we can use normal imperative flow control to perform computation
+ that differs for each input.
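+
+ As a small standalone illustration of this point (not code from this repository),
+ a data-dependent loop in PyTorch is just ordinary Python:
+
+ ```python
+ import torch
+ from torch.autograd import Variable
+
+ x = Variable(torch.randn(3), requires_grad=True)
+ y = x
+ # The number of iterations depends on the data itself; each forward pass simply
+ # builds a graph containing however many operations actually ran.
+ while y.data.norm() < 10:
+   y = y * 2
+ y.sum().backward()   # gradients flow through whatever graph was built
+ ```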

To contrast with the PyTorch autograd example above, here we use TensorFlow to
fit a simple two-layer net:
@@ -382,6 +410,26 @@ with tf.Session() as sess:


## PyTorch: nn
+ Computational graphs and autograd are a very powerful paradigm for defining
+ complex operators and automatically taking derivatives; however for large
+ neural networks raw autograd can be a bit too low-level.
+
+ When building neural networks we frequently think of arranging the computation
+ into **layers**, some of which have **learnable parameters** which will be
+ optimized during learning.
+
+ In TensorFlow, packages like [Keras](https://github.com/fchollet/keras),
+ [TensorFlow-Slim](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim),
+ and [TFLearn](http://tflearn.org/) provide higher-level abstractions over
+ raw computational graphs that are useful for building neural networks.
+
+ In PyTorch, the `nn` package serves this same purpose. The `nn` package defines a set of
+ **Modules**, which are roughly equivalent to neural network layers. A Module receives
+ input Variables and computes output Variables, but may also hold internal state such as
+ Variables containing learnable parameters. The `nn` package also defines a set of useful
+ loss functions that are commonly used when training neural networks.
+
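+ For instance (a standalone sketch, not code from this repository; the shapes are
+ arbitrary), a single `Linear` Module maps input Variables to output Variables while
+ holding its weight and bias internally, and a loss Module compares predictions
+ against targets:
+
+ ```python
+ import torch
+ from torch.autograd import Variable
+
+ linear = torch.nn.Linear(10, 5)    # holds internal weight and bias Variables
+ x = Variable(torch.randn(3, 10))
+ y = linear(x)                      # input Variables in, output Variables out
+ loss_fn = torch.nn.MSELoss(size_average=False)
+ loss = loss_fn(y, Variable(torch.zeros(3, 5)))
+ ```
+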
+ In this example we use the `nn` package to implement our two-layer network:

```python
# Code in file nn/two_layer_net_nn.py
@@ -396,8 +444,10 @@ N, D_in, H, D_out = 64, 1000, 100, 10
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

- # Use the nn package to define our model as a sequence of layers. Each Linear
- # module has its own weight and bias.
+ # Use the nn package to define our model as a sequence of layers. nn.Sequential
+ # is a Module which contains other Modules, and applies them in sequence to
+ # produce its output. Each Linear Module computes output from input using a
+ # linear function, and holds internal Variables for its weight and bias.
model = torch.nn.Sequential(
  torch.nn.Linear(D_in, H),
  torch.nn.ReLU(),
@@ -438,6 +488,17 @@ for t in range(500):


## PyTorch: optim
+ Up to this point we have updated the weights of our models by manually mutating the
+ `.data` member of Variables holding learnable parameters. This is not a huge burden
+ for simple optimization algorithms like stochastic gradient descent, but in practice
+ we often train neural networks using more sophisticated optimizers like AdaGrad,
+ RMSProp, Adam, etc.
+
+ The `optim` package in PyTorch abstracts the idea of an optimization algorithm and
+ provides implementations of commonly used optimization algorithms.
+
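+ For instance (a standalone sketch, not code from this repository), the different
+ optimizers all share the same interface: construct one with the Variables it should
+ update plus any hyperparameters, then call `step` after each backward pass:
+
+ ```python
+ import torch
+ from torch.autograd import Variable
+
+ w = Variable(torch.randn(5, 3), requires_grad=True)
+ optimizer = torch.optim.RMSprop([w], lr=1e-3)   # or SGD, Adam, Adagrad, ...
+ for _ in range(10):
+   loss = (w * w).sum()
+   optimizer.zero_grad()   # clear gradients from the previous iteration
+   loss.backward()         # compute gradient of the loss with respect to w
+   optimizer.step()        # update w using the RMSprop rule
+ ```
+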
+ In this example we will use the `nn` package to define our model as before, but we
+ will optimize the model using the Adam algorithm provided by the `optim` package:

```python
# Code in file nn/two_layer_net_optim.py
@@ -463,9 +524,9 @@ loss_fn = torch.nn.MSELoss(size_average=False)
# Use the optim package to define an Optimizer that will update the weights of
- # the model for us. Here we will use stochastic gradient descent (SGD), but the
- # optim package contains many other optimization algoriths. The first argument
- # to the SGD constructor tells the optimizer which Variables it should update.
+ # the model for us. Here we will use Adam; the optim package contains many other
+ # optimization algorithms. The first argument to the Adam constructor tells the
+ # optimizer which Variables it should update.
learning_rate = 1e-4
- optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
+ optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
  # Forward pass: compute predicted y by passing x to the model.
  y_pred = model(x)
@@ -488,6 +549,12 @@ for t in range(500):


## PyTorch: Custom nn Modules
+ Sometimes you will want to specify models that are more complex than a sequence of
+ existing Modules; for these cases you can define your own Modules by subclassing
+ `nn.Module` and defining a `forward` which receives input Variables and produces
+ output Variables using other modules or other autograd operations on Variables.
+
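+ As a condensed sketch of that structure (an illustration; the class name and the
+ layers here are just an example), a custom Module subclass looks like this:
+
+ ```python
+ import torch
+
+ class TwoLayerNet(torch.nn.Module):
+   def __init__(self, D_in, H, D_out):
+     # Construct and register the child Modules used in the forward pass.
+     super(TwoLayerNet, self).__init__()
+     self.linear1 = torch.nn.Linear(D_in, H)
+     self.linear2 = torch.nn.Linear(H, D_out)
+
+   def forward(self, x):
+     # Receive input Variables and produce output Variables using child
+     # Modules and autograd operations on Variables.
+     h_relu = self.linear1(x).clamp(min=0)
+     return self.linear2(h_relu)
+ ```
+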
+ In this example we implement our two-layer network as a custom Module subclass:

```python
# Code in file nn/two_layer_net_module.py
@@ -548,6 +615,16 @@ for t in range(500):


## PyTorch: Control Flow + Weight Sharing
+ As an example of dynamic graphs and weight sharing, we implement a very strange
+ model: a fully-connected ReLU network that on each forward pass chooses a random
+ number between 1 and 4 and uses that many hidden layers, reusing the same weights
+ multiple times to compute the innermost hidden layers.
+
+ For this model we can use normal Python flow control to implement the loop, and we
+ can implement weight sharing among the innermost layers by simply reusing the
+ same Module multiple times when defining the forward pass.
+
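+ As a rough sketch of the idea (an illustration, not the exact code from the file
+ below; the names and the random range are just examples), the forward pass can
+ reuse one middle Module a random number of times:
+
+ ```python
+ import random
+ import torch
+
+ class DynamicNet(torch.nn.Module):
+   def __init__(self, D_in, H, D_out):
+     super(DynamicNet, self).__init__()
+     self.input_linear = torch.nn.Linear(D_in, H)
+     self.middle_linear = torch.nn.Linear(H, H)
+     self.output_linear = torch.nn.Linear(H, D_out)
+
+   def forward(self, x):
+     h = self.input_linear(x).clamp(min=0)
+     # Normal Python control flow: a random number of extra hidden layers,
+     # all sharing the weights of the single middle_linear Module.
+     for _ in range(random.randint(0, 3)):
+       h = self.middle_linear(h).clamp(min=0)
+     return self.output_linear(h)
+ ```
+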
+ We can easily implement this model as a Module subclass:

```python
# Code in file nn/dynamic_net.py
@@ -613,5 +690,4 @@ for t in range(500):
  # Zero gradients, perform a backward pass, and update the weights.
  optimizer.zero_grad()
  loss.backward()
-   optimizer.step()
- ```
+   optimizer.step()
+ ```