Commit 1dac6d9

finish writing README
1 parent 877f1d2 commit 1dac6d9

5 files changed: 166 additions, 13 deletions

README.md

Lines changed: 83 additions & 7 deletions
@@ -224,6 +224,19 @@ for t in range(500):
 ```
 
 ## PyTorch: Defining new autograd functions
+Under the hood, each primitive autograd operator is really two functions that
+operate on Tensors. The *forward* function computes output Tensors from input
+Tensors. The *backward* function receives the gradient of the output Tensors
+with respect to some scalar value, and computes the gradient of the input Tensors
+with respect to that same scalar value.
+
+In PyTorch we can easily define our own autograd operator by defining a subclass
+of `torch.autograd.Function` and implementing the `forward` and `backward` functions.
+We can then use our new autograd operator by constructing an instance and calling it
+like a function, passing Variables containing input data.
+
+In this example we define our own custom autograd function for performing the ReLU
+nonlinearity, and use it to implement our two-layer network:
 
 ```python
 # Code in file autograd/two_layer_net_custom_function.py
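The body of `autograd/two_layer_net_custom_function.py` is not shown in this hunk. For orientation only, a minimal sketch of the pattern the new text describes, assuming the Variables-era `torch.autograd.Function` API (an instance with `forward`/`backward`, called like a function); this is an illustration, not the contents of that file:

```python
import torch
from torch.autograd import Variable

class MyReLU(torch.autograd.Function):
  """Custom autograd operator: forward computes ReLU, backward its gradient."""
  def forward(self, input):
    # Stash the input Tensor so backward can see where it was negative.
    self.save_for_backward(input)
    return input.clamp(min=0)

  def backward(self, grad_output):
    # grad_output is the gradient of the loss with respect to our output;
    # return the gradient of the loss with respect to our input.
    input, = self.saved_tensors
    grad_input = grad_output.clone()
    grad_input[input < 0] = 0
    return grad_input

x = Variable(torch.randn(4, 5), requires_grad=True)
y = MyReLU()(x)         # construct an instance and call it like a function
y.sum().backward()      # gradients flow through our custom backward
```

The backward pass only needs the saved input to decide where the gradient is zeroed, which is why `save_for_backward` is used rather than recomputing the forward result.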
@@ -312,7 +325,22 @@ In TensorFlow, we define the computational graph once and then execute the same
 graph over and over again, possibly feeding different input data to the graph.
 In PyTorch, each forward pass defines a new computational graph.
 
-# TODO: Describe static vs dynamic
+Static graphs are nice because you can optimize the graph up front; for example
+a framework might decide to fuse some graph operations for efficiency, or to
+come up with a strategy for distributing the graph across many GPUs or many
+machines. If you are reusing the same graph over and over, then this potentially
+costly up-front optimization can be amortized as the same graph is rerun over
+and over.
+
+One aspect where static and dynamic graphs differ is control flow. For some models
+we may wish to perform different computation for each data point; for example a
+recurrent network might be unrolled for different numbers of time steps for each
+data point; this unrolling can be implemented as a loop. With a static graph the
+loop construct needs to be a part of the graph; for this reason TensorFlow
+provides operators such as `tf.scan` for embedding loops into the graph. With
+dynamic graphs the situation is simpler: since we build graphs on-the-fly for
+each example, we can use normal imperative flow control to perform computation
+that differs for each input.
 
 To contrast with the PyTorch autograd example above, here we use TensorFlow to
 fit a simple two-layer net:
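Before the elided TensorFlow example, a small hypothetical snippet (not from the repository) that makes the dynamic-graph point above concrete: ordinary Python control flow decides, on each forward pass, how large the recorded graph is:

```python
import random
import torch
from torch.autograd import Variable

x = Variable(torch.randn(8, 16))
w = Variable(torch.randn(16, 16), requires_grad=True)

h = x
# Ordinary imperative control flow: the number of matrix multiplies, and hence
# the graph that autograd records, can differ on every forward pass.
for _ in range(random.randint(1, 4)):
  h = h.mm(w).clamp(min=0)
loss = h.sum()
loss.backward()   # backpropagate through whatever graph this pass happened to build
```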
@@ -382,6 +410,26 @@ with tf.Session() as sess:
 
 
 ## PyTorch: nn
+Computational graphs and autograd are a very powerful paradigm for defining
+complex operators and automatically taking derivatives; however for large
+neural networks raw autograd can be a bit too low-level.
+
+When building neural networks we frequently think of arranging the computation
+into **layers**, some of which have **learnable parameters** which will be
+optimized during learning.
+
+In TensorFlow, packages like [Keras](https://github.com/fchollet/keras),
+[TensorFlow-Slim](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim),
+and [TFLearn](http://tflearn.org/) provide higher-level abstractions over
+raw computational graphs that are useful for building neural networks.
+
+In PyTorch, the `nn` package serves this same purpose. The `nn` package defines a set of
+**Modules**, which are roughly equivalent to neural network layers. A Module receives
+input Variables and computes output Variables, but may also hold internal state such as
+Variables containing learnable parameters. The `nn` package also defines a set of useful
+loss functions that are commonly used when training neural networks.
+
+In this example we use the `nn` package to implement our two-layer network:
 
 ```python
 # Code in file nn/two_layer_net_nn.py
@@ -396,8 +444,10 @@ N, D_in, H, D_out = 64, 1000, 100, 10
 x = Variable(torch.randn(N, D_in))
 y = Variable(torch.randn(N, D_out), requires_grad=False)
 
-# Use the nn package to define our model as a sequence of layers. Each Linear
-# module has its own weight and bias.
+# Use the nn package to define our model as a sequence of layers. nn.Sequential
+# is a Module which contains other Modules, and applies them in sequence to
+# produce its output. Each Linear Module computes output from input using a
+# linear function, and holds internal Variables for its weight and bias.
 model = torch.nn.Sequential(
     torch.nn.Linear(D_in, H),
     torch.nn.ReLU(),
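The hunk cuts off inside the model definition. For context, a sketch of how the rest of such an `nn` training loop commonly looks in this version of the API, reusing `model`, `x`, and `y` from the lines above; an approximation, not the elided file contents:

```python
loss_fn = torch.nn.MSELoss(size_average=False)   # a loss function from the nn package

learning_rate = 1e-4
for t in range(500):
  y_pred = model(x)            # forward pass: a Module is called like a function
  loss = loss_fn(y_pred, y)    # loss is a Variable holding a single number

  model.zero_grad()            # clear gradients accumulated on the previous step
  loss.backward()              # compute gradients of the loss w.r.t. all parameters

  # Update each learnable parameter by mutating its .data Tensor directly.
  for param in model.parameters():
    param.data -= learning_rate * param.grad.data
```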
@@ -438,6 +488,17 @@ for t in range(500):
 
 
 ## PyTorch: optim
+Up to this point we have updated the weights of our models by manually mutating the
+`.data` member for Variables holding learnable parameters. This is not a huge burden
+for simple optimization algorithms like stochastic gradient descent, but in practice
+we often train neural networks using more sophisticated optimizers like AdaGrad,
+RMSProp, Adam, etc.
+
+The `optim` package in PyTorch abstracts the idea of an optimization algorithm and
+provides implementations of commonly used optimization algorithms.
+
+In this example we will use the `nn` package to define our model as before, but we
+will optimize the model using the Adam algorithm provided by the `optim` package:
 
 ```python
 # Code in file nn/two_layer_net_optim.py
@@ -463,9 +524,9 @@ loss_fn = torch.nn.MSELoss(size_average=False)
 # Use the optim package to define an Optimizer that will update the weights of
 # the model for us. Here we will use stochastic gradient descent (SGD), but the
 # optim package contains many other optimization algorithms. The first argument
-# to the SGD constructor tells the optimizer which Variables it should update.
+# to the Adam constructor tells the optimizer which Variables it should update.
 learning_rate = 1e-4
-optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
+optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
 for t in range(500):
   # Forward pass: compute predicted y by passing x to the model.
   y_pred = model(x)
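The hunk ends just inside the training loop; a typical `optim`-based step continues roughly as in the sketch below, reusing `model`, `x`, `y`, `loss_fn`, and `optimizer` from the surrounding lines (an approximation, not the elided lines of the file):

```python
for t in range(500):
  y_pred = model(x)            # forward pass: compute predictions
  loss = loss_fn(y_pred, y)    # compute loss on this batch

  optimizer.zero_grad()        # zero the gradients held by the optimized Variables
  loss.backward()              # backward pass: gradients of loss w.r.t. parameters
  optimizer.step()             # let the Adam optimizer update the weights
```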
@@ -488,6 +549,12 @@ for t in range(500):
 
 
 ## PyTorch: Custom nn Modules
+Sometimes you will want to specify models that are more complex than a sequence of
+existing Modules; for these cases you can define your own Modules by subclassing
+`nn.Module` and defining a `forward` method which receives input Variables and produces
+output Variables using other modules or other autograd operations on Variables.
+
+In this example we implement our two-layer network as a custom Module subclass:
 
 ```python
 # Code in file nn/two_layer_net_module.py
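`nn/two_layer_net_module.py` itself is elided from the hunk; the subclassing pattern described above looks roughly like this sketch (same Variables-era API, not the exact file):

```python
import torch
from torch.autograd import Variable

class TwoLayerNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    # Construct two nn.Linear Modules and assign them as member variables;
    # their learnable parameters are registered with this Module automatically.
    super(TwoLayerNet, self).__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.linear2 = torch.nn.Linear(H, D_out)

  def forward(self, x):
    # Accept a Variable of input data and return a Variable of output data,
    # using Modules plus arbitrary autograd operations on Variables.
    h_relu = self.linear1(x).clamp(min=0)
    return self.linear2(h_relu)

model = TwoLayerNet(D_in=1000, H=100, D_out=10)
y_pred = model(Variable(torch.randn(64, 1000)))
```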
@@ -548,6 +615,16 @@ for t in range(500):
 
 
 ## PyTorch: Control Flow + Weight Sharing
+As an example of dynamic graphs and weight sharing, we implement a very strange
+model: a fully-connected ReLU network that on each forward pass chooses a random
+number between 1 and 4 and uses that many hidden layers, reusing the same weights
+multiple times to compute the innermost hidden layers.
+
+For this model we can use normal Python flow control to implement the loop, and we
+can implement weight sharing among the innermost layers by simply reusing the
+same Module multiple times when defining the forward pass.
+
+We can easily implement this model as a Module subclass:
 
 ```python
 # Code in file nn/dynamic_net.py
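`nn/dynamic_net.py` is likewise elided; the core of the model described above amounts to something like the following sketch (a rough approximation, not the exact file):

```python
import random
import torch

class DynamicNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    super(DynamicNet, self).__init__()
    self.input_linear = torch.nn.Linear(D_in, H)
    self.middle_linear = torch.nn.Linear(H, H)
    self.output_linear = torch.nn.Linear(H, D_out)

  def forward(self, x):
    # Pick a random depth on every forward pass and reuse the *same*
    # middle_linear Module for each extra hidden layer: normal Python flow
    # control plus weight sharing, as described in the README text above.
    h = self.input_linear(x).clamp(min=0)
    for _ in range(random.randint(0, 3)):
      h = self.middle_linear(h).clamp(min=0)
    return self.output_linear(h)
```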
@@ -613,5 +690,4 @@ for t in range(500):
   # Zero gradients, perform a backward pass, and update the weights.
   optimizer.zero_grad()
   loss.backward()
-  optimizer.step()
-```
+  optimizer.step()```

README_raw.md

Lines changed: 76 additions & 1 deletion
@@ -99,6 +99,19 @@ network:
 ```
 
 ## PyTorch: Defining new autograd functions
+Under the hood, each primitive autograd operator is really two functions that
+operate on Tensors. The *forward* function computes output Tensors from input
+Tensors. The *backward* function receives the gradient of the output Tensors
+with respect to some scalar value, and computes the gradient of the input Tensors
+with respect to that same scalar value.
+
+In PyTorch we can easily define our own autograd operator by defining a subclass
+of `torch.autograd.Function` and implementing the `forward` and `backward` functions.
+We can then use our new autograd operator by constructing an instance and calling it
+like a function, passing Variables containing input data.
+
+In this example we define our own custom autograd function for performing the ReLU
+nonlinearity, and use it to implement our two-layer network:
 
 ```python
 :INCLUDE autograd/two_layer_net_custom_function.py
@@ -114,7 +127,22 @@ In TensorFlow, we define the computational graph once and then execute the same
 graph over and over again, possibly feeding different input data to the graph.
 In PyTorch, each forward pass defines a new computational graph.
 
-# TODO: Describe static vs dynamic
+Static graphs are nice because you can optimize the graph up front; for example
+a framework might decide to fuse some graph operations for efficiency, or to
+come up with a strategy for distributing the graph across many GPUs or many
+machines. If you are reusing the same graph over and over, then this potentially
+costly up-front optimization can be amortized as the same graph is rerun over
+and over.
+
+One aspect where static and dynamic graphs differ is control flow. For some models
+we may wish to perform different computation for each data point; for example a
+recurrent network might be unrolled for different numbers of time steps for each
+data point; this unrolling can be implemented as a loop. With a static graph the
+loop construct needs to be a part of the graph; for this reason TensorFlow
+provides operators such as `tf.scan` for embedding loops into the graph. With
+dynamic graphs the situation is simpler: since we build graphs on-the-fly for
+each example, we can use normal imperative flow control to perform computation
+that differs for each input.
 
 To contrast with the PyTorch autograd example above, here we use TensorFlow to
 fit a simple two-layer net:
@@ -125,27 +153,74 @@ fit a simple two-layer net:
 
 
 ## PyTorch: nn
+Computational graphs and autograd are a very powerful paradigm for defining
+complex operators and automatically taking derivatives; however for large
+neural networks raw autograd can be a bit too low-level.
+
+When building neural networks we frequently think of arranging the computation
+into **layers**, some of which have **learnable parameters** which will be
+optimized during learning.
+
+In TensorFlow, packages like [Keras](https://github.com/fchollet/keras),
+[TensorFlow-Slim](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim),
+and [TFLearn](http://tflearn.org/) provide higher-level abstractions over
+raw computational graphs that are useful for building neural networks.
+
+In PyTorch, the `nn` package serves this same purpose. The `nn` package defines a set of
+**Modules**, which are roughly equivalent to neural network layers. A Module receives
+input Variables and computes output Variables, but may also hold internal state such as
+Variables containing learnable parameters. The `nn` package also defines a set of useful
+loss functions that are commonly used when training neural networks.
+
+In this example we use the `nn` package to implement our two-layer network:
 
 ```python
 :INCLUDE nn/two_layer_net_nn.py
 ```
 
 
 ## PyTorch: optim
+Up to this point we have updated the weights of our models by manually mutating the
+`.data` member for Variables holding learnable parameters. This is not a huge burden
+for simple optimization algorithms like stochastic gradient descent, but in practice
+we often train neural networks using more sophisticated optimizers like AdaGrad,
+RMSProp, Adam, etc.
+
+The `optim` package in PyTorch abstracts the idea of an optimization algorithm and
+provides implementations of commonly used optimization algorithms.
+
+In this example we will use the `nn` package to define our model as before, but we
+will optimize the model using the Adam algorithm provided by the `optim` package:
 
 ```python
 :INCLUDE nn/two_layer_net_optim.py
 ```
 
 
 ## PyTorch: Custom nn Modules
+Sometimes you will want to specify models that are more complex than a sequence of
+existing Modules; for these cases you can define your own Modules by subclassing
+`nn.Module` and defining a `forward` method which receives input Variables and produces
+output Variables using other modules or other autograd operations on Variables.
+
+In this example we implement our two-layer network as a custom Module subclass:
 
 ```python
 :INCLUDE nn/two_layer_net_module.py
 ```
 
 
 ## PyTorch: Control Flow + Weight Sharing
+As an example of dynamic graphs and weight sharing, we implement a very strange
+model: a fully-connected ReLU network that on each forward pass chooses a random
+number between 1 and 4 and uses that many hidden layers, reusing the same weights
+multiple times to compute the innermost hidden layers.
+
+For this model we can use normal Python flow control to implement the loop, and we
+can implement weight sharing among the innermost layers by simply reusing the
+same Module multiple times when defining the forward pass.
+
+We can easily implement this model as a Module subclass:
 
 ```python
 :INCLUDE nn/dynamic_net.py

nn/dynamic_net.py

Lines changed: 1 addition & 1 deletion
@@ -67,4 +67,4 @@ def forward(self, x):
   # Zero gradients, perform a backward pass, and update the weights.
   optimizer.zero_grad()
   loss.backward()
-  optimizer.step()
+  optimizer.step()

nn/two_layer_net_nn.py

Lines changed: 4 additions & 2 deletions
@@ -21,8 +21,10 @@
 x = Variable(torch.randn(N, D_in))
 y = Variable(torch.randn(N, D_out), requires_grad=False)
 
-# Use the nn package to define our model as a sequence of layers. Each Linear
-# module has its own weight and bias.
+# Use the nn package to define our model as a sequence of layers. nn.Sequential
+# is a Module which contains other Modules, and applies them in sequence to
+# produce its output. Each Linear Module computes output from input using a
+# linear function, and holds internal Variables for its weight and bias.
 model = torch.nn.Sequential(
     torch.nn.Linear(D_in, H),
     torch.nn.ReLU(),

nn/two_layer_net_optim.py

Lines changed: 2 additions & 2 deletions
@@ -32,9 +32,9 @@
 # Use the optim package to define an Optimizer that will update the weights of
 # the model for us. Here we will use stochastic gradient descent (SGD), but the
 # optim package contains many other optimization algorithms. The first argument
-# to the SGD constructor tells the optimizer which Variables it should update.
+# to the Adam constructor tells the optimizer which Variables it should update.
 learning_rate = 1e-4
-optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
+optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
 for t in range(500):
   # Forward pass: compute predicted y by passing x to the model.
   y_pred = model(x)

0 commit comments
