Commit fadd7bb

Merge pull request jcjohnson#19 from jcjohnson/0.4
0.4
2 parents 0f1b88a + 997b3b7 commit fadd7bb

9 files changed: +320 −302 lines


README.md

Lines changed: 156 additions & 147 deletions
Large diffs are not rendered by default.

README_raw.md

Lines changed: 36 additions & 28 deletions
@@ -11,10 +11,15 @@ will have a single hidden layer, and will be trained with gradient descent to
 fit random data by minimizing the Euclidean distance between the network output
 and the true output.
 
+**NOTE:** These examples have been updated for PyTorch 0.4, which made several
+major changes to the core PyTorch API. Most notably, prior to 0.4 Tensors had
+to be wrapped in Variable objects to use autograd; this functionality has now
+been added directly to Tensors, and Variables are now deprecated.
+
 ### Table of Contents
 - <a href='#warm-up-numpy'>Warm-up: numpy</a>
 - <a href='#pytorch-tensors'>PyTorch: Tensors</a>
-- <a href='#pytorch-variables-and-autograd'>PyTorch: Variables and autograd</a>
+- <a href='#pytorch-autograd'>PyTorch: Autograd</a>
 - <a href='#pytorch-defining-new-autograd-functions'>PyTorch: Defining new autograd functions</a>
 - <a href='#tensorflow-static-graphs'>TensorFlow: Static Graphs</a>
 - <a href='#pytorch-nn'>PyTorch: nn</a>
@@ -46,24 +51,24 @@ unfortunately numpy won't be enough for modern deep learning.
 
 Here we introduce the most fundamental PyTorch concept: the **Tensor**. A PyTorch
 Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional
-array, and PyTorch provides many functions for operating on these Tensors. Like
-numpy arrays, PyTorch Tensors do not know anything about deep learning or
-computational graphs or gradients; they are a generic tool for scientific
+array, and PyTorch provides many functions for operating on these Tensors.
+Any computation you might want to perform with numpy can also be accomplished
+with PyTorch Tensors; you should think of them as a generic tool for scientific
 computing.
 
 However unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their
-numeric computations. To run a PyTorch Tensor on GPU, you simply need to cast it
-to a new datatype.
+numeric computations. To run a PyTorch Tensor on GPU, you use the `device`
+argument when constructing a Tensor to place the Tensor on a GPU.
 
 Here we use PyTorch Tensors to fit a two-layer network to random data. Like the
-numpy example above we need to manually implement the forward and backward
-passes through the network:
+numpy example above we manually implement the forward and backward
+passes through the network, using operations on PyTorch Tensors:
 
 ```python
 :INCLUDE tensor/two_layer_net_tensor.py
 ```
 
-## PyTorch: Variables and autograd
+## PyTorch: Autograd
 
 In the above examples, we had to manually implement both the forward and
 backward passes of our neural network. Manually implementing the backward pass
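For reference outside the diff, here is a minimal sketch of the 0.4-style `device` construction described in the hunk above; the tensor sizes and the CPU fallback are illustrative assumptions, not taken from the repo:

```python
import torch

# Pick a device once and pass it to every Tensor constructor; this replaces
# the old pattern of casting Tensors to torch.cuda.FloatTensor.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.randn(64, 1000, device=device)   # lives on the GPU if one is available
w = torch.randn(1000, 100, device=device)
y = x.mm(w).clamp(min=0)                   # computation runs on the same device
print(y.shape, y.device)
```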
@@ -79,18 +84,21 @@ When using autograd, the forward pass of your network will define a
 functions that produce output Tensors from input Tensors. Backpropagating through
 this graph then allows you to easily compute gradients.
 
-This sounds complicated, it's pretty simple to use in practice. We wrap our
-PyTorch Tensors in **Variable** objects; a Variable represents a node in a
-computational graph. If `x` is a Variable then `x.data` is a Tensor, and
-`x.grad` is another Variable holding the gradient of `x` with respect to some
-scalar value.
-
-PyTorch Variables have the same API as PyTorch Tensors: (almost) any operation
-that you can perform on a Tensor also works on Variables; the difference is that
-using Variables defines a computational graph, allowing you to automatically
-compute gradients.
-
-Here we use PyTorch Variables and autograd to implement our two-layer network;
+This sounds complicated, but it's pretty simple to use in practice. If we want to
+compute gradients with respect to some Tensor, then we set `requires_grad=True`
+when constructing that Tensor. Any PyTorch operations on that Tensor will cause
+a computational graph to be constructed, allowing us to later perform backpropagation
+through the graph. If `x` is a Tensor with `requires_grad=True`, then after
+backpropagation `x.grad` will be another Tensor holding the gradient of `x` with
+respect to some scalar value.
+
+Sometimes you may wish to prevent PyTorch from building computational graphs when
+performing certain operations on Tensors with `requires_grad=True`; for example
+we usually don't want to backpropagate through the weight update steps when
+training a neural network. In such scenarios we can use the `torch.no_grad()`
+context manager to prevent the construction of a computational graph.
+
+Here we use PyTorch Tensors and autograd to implement our two-layer network;
 now we no longer need to manually implement the backward pass through the
 network:
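As a quick standalone illustration of the `requires_grad`, `backward()`, and `torch.no_grad()` behavior described in this hunk (the tensors and the scalar function here are arbitrary, not part of the repo):

```python
import torch

w = torch.randn(3, requires_grad=True)  # track operations on w
x = torch.randn(3)                      # no graph is built for x

loss = (w * x).sum() ** 2   # forward pass builds a graph ending in a scalar
loss.backward()             # populates w.grad; x.grad stays None
print(w.grad)

with torch.no_grad():
    w -= 0.1 * w.grad       # in-place update; no graph is recorded here
w.grad.zero_()              # clear the gradient before the next backward pass
```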

@@ -108,7 +116,7 @@ with respect to that same scalar value.
 In PyTorch we can easily define our own autograd operator by defining a subclass
 of `torch.autograd.Function` and implementing the `forward` and `backward` functions.
 We can then use our new autograd operator by constructing an instance and calling it
-like a function, passing Variables containing input data.
+like a function, passing Tensors containing input data.
 
 In this example we define our own custom autograd function for performing the ReLU
 nonlinearity, and use it to implement our two-layer network:
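For orientation, a minimal 0.4-style autograd Function might look like the following; `MyExp` is a hypothetical example rather than the ReLU operator used in the repo, and note that in 0.4 the operator is invoked through its `apply` method (as the file diff below shows) rather than by instantiating the class:

```python
import torch

class MyExp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = x.exp()
        ctx.save_for_backward(y)   # cache the output for the backward pass
        return y

    @staticmethod
    def backward(ctx, grad_output):
        y, = ctx.saved_tensors
        return grad_output * y     # d/dx exp(x) = exp(x)

x = torch.randn(5, requires_grad=True)
y = MyExp.apply(x).sum()
y.backward()
print(torch.allclose(x.grad, x.exp()))  # True
```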
@@ -168,8 +176,8 @@ raw computational graphs that are useful for building neural networks.
 
 In PyTorch, the `nn` package serves this same purpose. The `nn` package defines a set of
 **Modules**, which are roughly equivalent to neural network layers. A Module receives
-input Variables and computes output Variables, but may also hold internal state such as
-Variables containing learnable parameters. The `nn` package also defines a set of useful
+input Tensors and computes output Tensors, but may also hold internal state such as
+Tensors containing learnable parameters. The `nn` package also defines a set of useful
 loss functions that are commonly used when training neural networks.
 
 In this example we use the `nn` package to implement our two-layer network:
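A compressed sketch of the `nn`-package pattern described above; the dimensions match the two-layer example, but the repo's included file may differ in its details:

```python
import torch

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Each Module holds its own learnable parameters internally.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss()  # the nn package also provides common loss functions

y_pred = model(x)             # Modules are called like functions on input Tensors
loss = loss_fn(y_pred, y)
print(loss.item())
```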
@@ -180,8 +188,8 @@ In this example we use the `nn` package to implement our two-layer network:
 
 
 ## PyTorch: optim
-Up to this point we have updated the weights of our models by manually mutating the
-`.data` member for Variables holding learnable parameters. This is not a huge burden
+Up to this point we have updated the weights of our models by manually mutating
+Tensors holding learnable parameters. This is not a huge burden
 for simple optimization algorithms like stochastic gradient descent, but in practice
 we often train neural networks using more sophisticated optimizers like AdaGrad,
 RMSProp, Adam, etc.
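The `optim` pattern in a nutshell; this sketch uses a stand-in `model` and arbitrary hyperparameters rather than the repo's exact script:

```python
import torch

model = torch.nn.Linear(10, 1)          # any Module with learnable parameters
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x, y = torch.randn(8, 10), torch.randn(8, 1)
for t in range(5):
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()   # clear gradients accumulated by previous iterations
    loss.backward()         # compute gradients of the loss w.r.t. the parameters
    optimizer.step()        # let the optimizer update the parameters for us
```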
@@ -200,8 +208,8 @@ will optimize the model using the Adam algorithm provided by the `optim` package
 ## PyTorch: Custom nn Modules
 Sometimes you will want to specify models that are more complex than a sequence of
 existing Modules; for these cases you can define your own Modules by subclassing
-`nn.Module` and defining a `forward` which receives input Variables and produces
-output Variables using other modules or other autograd operations on Variables.
+`nn.Module` and defining a `forward` which receives input Tensors and produces
+output Tensors using other modules or other autograd operations on Tensors.
 
 In this example we implement our two-layer network as a custom Module subclass:
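A minimal custom-Module sketch along the lines the paragraph describes; the class below is illustrative and may differ from the repo's actual implementation:

```python
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        # Child Modules registered here hold the learnable parameters.
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        # Arbitrary Tensor operations and other Modules can be used here.
        h_relu = self.linear1(x).clamp(min=0)
        return self.linear2(h_relu)

model = TwoLayerNet(1000, 100, 10)
y_pred = model(torch.randn(64, 1000))
print(y_pred.shape)  # torch.Size([64, 10])
```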

autograd/two_layer_net_autograd.py

Lines changed: 37 additions & 40 deletions
@@ -1,68 +1,65 @@
 import torch
-from torch.autograd import Variable
 
 """
 A fully-connected ReLU network with one hidden layer and no biases, trained to
 predict y from x by minimizing squared Euclidean distance.
 
 This implementation computes the forward pass using operations on PyTorch
-Variables, and uses PyTorch autograd to compute gradients.
+Tensors, and uses PyTorch autograd to compute gradients.
 
-A PyTorch Variable is a wrapper around a PyTorch Tensor, and represents a node
-in a computational graph. If x is a Variable then x.data is a Tensor giving its
-value, and x.grad is another Variable holding the gradient of x with respect to
-some scalar value.
-
-PyTorch Variables have the same API as PyTorch tensors: (almost) any operation
-you can do on a Tensor you can also do on a Variable; the difference is that
-autograd allows you to automatically compute gradients.
+When we create a PyTorch Tensor with requires_grad=True, then operations
+involving that Tensor will not just compute values; they will also build up
+a computational graph in the background, allowing us to easily backpropagate
+through the graph to compute gradients of some Tensors with respect to a
+downstream loss. Concretely if x is a Tensor with x.requires_grad == True then
+after backpropagation x.grad will be another Tensor holding the gradient of x
+with respect to some scalar value.
 """
 
-dtype = torch.FloatTensor
-# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU
+device = torch.device('cpu')
+# device = torch.device('cuda') # Uncomment this to run on GPU
 
 # N is batch size; D_in is input dimension;
 # H is hidden dimension; D_out is output dimension.
 N, D_in, H, D_out = 64, 1000, 100, 10
 
-# Create random Tensors to hold input and outputs, and wrap them in Variables.
-# Setting requires_grad=False indicates that we do not need to compute gradients
-# with respect to these Variables during the backward pass.
-x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
-y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)
+# Create random Tensors to hold input and outputs
+x = torch.randn(N, D_in, device=device)
+y = torch.randn(N, D_out, device=device)
 
-# Create random Tensors for weights, and wrap them in Variables.
-# Setting requires_grad=True indicates that we want to compute gradients with
-# respect to these Variables during the backward pass.
-w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True)
-w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)
+# Create random Tensors for weights; setting requires_grad=True means that we
+# want to compute gradients for these Tensors during the backward pass.
+w1 = torch.randn(D_in, H, device=device, requires_grad=True)
+w2 = torch.randn(H, D_out, device=device, requires_grad=True)
 
 learning_rate = 1e-6
 for t in range(500):
-    # Forward pass: compute predicted y using operations on Variables; these
-    # are exactly the same operations we used to compute the forward pass using
-    # Tensors, but we do not need to keep references to intermediate values since
-    # we are not implementing the backward pass by hand.
+    # Forward pass: compute predicted y using operations on Tensors. Since w1 and
+    # w2 have requires_grad=True, operations involving these Tensors will cause
+    # PyTorch to build a computational graph, allowing automatic computation of
+    # gradients. Since we are no longer implementing the backward pass by hand we
+    # don't need to keep references to intermediate values.
     y_pred = x.mm(w1).clamp(min=0).mm(w2)
 
-    # Compute and print loss using operations on Variables.
-    # Now loss is a Variable of shape (1,) and loss.data is a Tensor of shape
-    # (1,); loss.data[0] is a scalar value holding the loss.
+    # Compute and print loss. Loss is a Tensor of shape (), and loss.item()
+    # is a Python number giving its value.
     loss = (y_pred - y).pow(2).sum()
-    print(t, loss.data[0])
+    print(t, loss.item())
 
     # Use autograd to compute the backward pass. This call will compute the
-    # gradient of loss with respect to all Variables with requires_grad=True.
-    # After this call w1.grad and w2.grad will be Variables holding the gradient
+    # gradient of loss with respect to all Tensors with requires_grad=True.
+    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
     loss.backward()
 
-    # Update weights using gradient descent; w1.data and w2.data are Tensors,
-    # w1.grad and w2.grad are Variables and w1.grad.data and w2.grad.data are
-    # Tensors.
-    w1.data -= learning_rate * w1.grad.data
-    w2.data -= learning_rate * w2.grad.data
+    # Update weights using gradient descent. For this step we just want to mutate
+    # the values of w1 and w2 in-place; we don't want to build up a computational
+    # graph for the update steps, so we use the torch.no_grad() context manager
+    # to prevent PyTorch from building a computational graph for the updates
+    with torch.no_grad():
+        w1 -= learning_rate * w1.grad
+        w2 -= learning_rate * w2.grad
 
-    # Manually zero the gradients after running the backward pass
-    w1.grad.data.zero_()
-    w2.grad.data.zero_()
+        # Manually zero the gradients after running the backward pass
+        w1.grad.zero_()
+        w2.grad.zero_()

autograd/two_layer_net_custom_function.py

Lines changed: 39 additions & 38 deletions
@@ -1,12 +1,11 @@
 import torch
-from torch.autograd import Variable
 
 """
 A fully-connected ReLU network with one hidden layer and no biases, trained to
 predict y from x by minimizing squared Euclidean distance.
 
 This implementation computes the forward pass using operations on PyTorch
-Variables, and uses PyTorch autograd to compute gradients.
+Tensors, and uses PyTorch autograd to compute gradients.
 
 In this implementation we implement our own custom autograd function to perform
 the ReLU function.
@@ -18,62 +17,64 @@ class MyReLU(torch.autograd.Function):
     torch.autograd.Function and implementing the forward and backward passes
     which operate on Tensors.
     """
-    def forward(self, input):
+    @staticmethod
+    def forward(ctx, x):
         """
-        In the forward pass we receive a Tensor containing the input and return a
-        Tensor containing the output. You can cache arbitrary Tensors for use in the
-        backward pass using the save_for_backward method.
+        In the forward pass we receive a context object and a Tensor containing the
+        input; we must return a Tensor containing the output, and we can use the
+        context object to cache objects for use in the backward pass.
         """
-        self.save_for_backward(input)
-        return input.clamp(min=0)
+        ctx.save_for_backward(x)
+        return x.clamp(min=0)
 
-    def backward(self, grad_output):
+    def backward(ctx, grad_output):
         """
-        In the backward pass we receive a Tensor containing the gradient of the loss
-        with respect to the output, and we need to compute the gradient of the loss
-        with respect to the input.
+        In the backward pass we receive the context object and a Tensor containing
+        the gradient of the loss with respect to the output produced during the
+        forward pass. We can retrieve cached data from the context object, and must
+        compute and return the gradient of the loss with respect to the input to the
+        forward function.
         """
-        input, = self.saved_tensors
-        grad_input = grad_output.clone()
-        grad_input[input < 0] = 0
-        return grad_input
+        x, = ctx.saved_tensors
+        grad_x = grad_output.clone()
+        grad_x[x < 0] = 0
+        return grad_x
 
 
-dtype = torch.FloatTensor
-# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU
+device = torch.device('cpu')
+# device = torch.device('cuda') # Uncomment this to run on GPU
 
 # N is batch size; D_in is input dimension;
 # H is hidden dimension; D_out is output dimension.
 N, D_in, H, D_out = 64, 1000, 100, 10
 
-# Create random Tensors to hold input and outputs, and wrap them in Variables.
-x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
-y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)
+# Create random Tensors to hold input and output
+x = torch.randn(N, D_in, device=device)
+y = torch.randn(N, D_out, device=device)
 
-# Create random Tensors for weights, and wrap them in Variables.
-w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True)
-w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)
+# Create random Tensors for weights.
+w1 = torch.randn(D_in, H, device=device, requires_grad=True)
+w2 = torch.randn(H, D_out, device=device, requires_grad=True)
 
 learning_rate = 1e-6
 for t in range(500):
-    # Construct an instance of our MyReLU class to use in our network
-    relu = MyReLU()
-
-    # Forward pass: compute predicted y using operations on Variables; we compute
-    # ReLU using our custom autograd operation.
-    y_pred = relu(x.mm(w1)).mm(w2)
-
+    # Forward pass: compute predicted y using operations on Tensors; we call our
+    # custom ReLU implementation using the MyReLU.apply function
+    y_pred = MyReLU.apply(x.mm(w1)).mm(w2)
+
     # Compute and print loss
     loss = (y_pred - y).pow(2).sum()
-    print(t, loss.data[0])
+    print(t, loss.item())
 
     # Use autograd to compute the backward pass.
     loss.backward()
 
-    # Update weights using gradient descent
-    w1.data -= learning_rate * w1.grad.data
-    w2.data -= learning_rate * w2.grad.data
+    with torch.no_grad():
+        # Update weights using gradient descent
+        w1 -= learning_rate * w1.grad
+        w2 -= learning_rate * w2.grad
+
+        # Manually zero the gradients after running the backward pass
+        w1.grad.zero_()
+        w2.grad.zero_()
 
-    # Manually zero the gradients after running the backward pass
-    w1.grad.data.zero_()
-    w2.grad.data.zero_()

nn/dynamic_net.py

Lines changed: 4 additions & 5 deletions
@@ -1,6 +1,5 @@
 import random
 import torch
-from torch.autograd import Variable
 
 """
 To showcase the power of PyTorch dynamic graphs, we will implement a very strange
@@ -45,9 +44,9 @@ def forward(self, x):
 # H is hidden dimension; D_out is output dimension.
 N, D_in, H, D_out = 64, 1000, 100, 10
 
-# Create random Tensors to hold inputs and outputs, and wrap them in Variables
-x = Variable(torch.randn(N, D_in))
-y = Variable(torch.randn(N, D_out), requires_grad=False)
+# Create random Tensors to hold inputs and outputs.
+x = torch.randn(N, D_in)
+y = torch.randn(N, D_out)
 
 # Construct our model by instantiating the class defined above
 model = DynamicNet(D_in, H, D_out)
@@ -62,7 +61,7 @@ def forward(self, x):
 
     # Compute and print loss
     loss = criterion(y_pred, y)
-    print(t, loss.data[0])
+    print(t, loss.item())
 
     # Zero gradients, perform a backward pass, and update the weights.
     optimizer.zero_grad()
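For context on what `DynamicNet` does with these Tensors, a model of this kind typically reuses a middle layer a random number of times on every forward call; the sketch below is illustrative and not the file's exact code:

```python
import random
import torch

class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        # A fresh graph is built on every forward call, so ordinary Python
        # control flow (here, a random number of middle layers) is fine.
        h = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h = self.middle_linear(h).clamp(min=0)
        return self.output_linear(h)
```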
