diff --git a/README.md b/README.md
index f58c7c3..65ffc4e 100644
--- a/README.md
+++ b/README.md
@@ -11,10 +11,15 @@ will have a single hidden layer, and will be trained with gradient descent to
 fit random data by minimizing the Euclidean distance between the network output
 and the true output.
 
+**NOTE:** These examples have been updated for PyTorch 0.4, which made several
+major changes to the core PyTorch API. Most notably, prior to 0.4 Tensors had
+to be wrapped in Variable objects to use autograd; this functionality has now
+been added directly to Tensors, and Variables are now deprecated.
+
 ### Table of Contents
 - Warm-up: numpy
 - PyTorch: Tensors
-- PyTorch: Variables and autograd
+- PyTorch: Autograd
 - PyTorch: Defining new autograd functions
 - TensorFlow: Static Graphs
 - PyTorch: nn
@@ -82,37 +87,37 @@ unfortunately numpy won't be enough for modern deep learning.
 
 Here we introduce the most fundamental PyTorch concept: the **Tensor**. A PyTorch
 Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional
-array, and PyTorch provides many functions for operating on these Tensors. Like
-numpy arrays, PyTorch Tensors do not know anything about deep learning or
-computational graphs or gradients; they are a generic tool for scientific
+array, and PyTorch provides many functions for operating on these Tensors.
+Any computation you might want to perform with numpy can also be accomplished
+with PyTorch Tensors; you should think of them as a generic tool for scientific
 computing.
 
 However unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their
-numeric computations. To run a PyTorch Tensor on GPU, you simply need to cast it
-to a new datatype.
+numeric computations. To run a PyTorch Tensor on GPU, you use the `device`
+argument when constructing the Tensor to place it on the GPU.
 
 Here we use PyTorch Tensors to fit a two-layer network to random data. Like the
-numpy example above we need to manually implement the forward and backward
-passes through the network:
+numpy example above, we manually implement the forward and backward
+passes through the network, using operations on PyTorch Tensors:
 
 ```python
 # Code in file tensor/two_layer_net_tensor.py
 import torch
 
-dtype = torch.FloatTensor
-# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU
+device = torch.device('cpu')
+# device = torch.device('cuda') # Uncomment this to run on GPU
 
 # N is batch size; D_in is input dimension;
 # H is hidden dimension; D_out is output dimension.
 N, D_in, H, D_out = 64, 1000, 100, 10
 
 # Create random input and output data
-x = torch.randn(N, D_in).type(dtype)
-y = torch.randn(N, D_out).type(dtype)
+x = torch.randn(N, D_in, device=device)
+y = torch.randn(N, D_out, device=device)
 
 # Randomly initialize weights
-w1 = torch.randn(D_in, H).type(dtype)
-w2 = torch.randn(H, D_out).type(dtype)
+w1 = torch.randn(D_in, H, device=device)
+w2 = torch.randn(H, D_out, device=device)
 
 learning_rate = 1e-6
 for t in range(500):
@@ -121,9 +126,10 @@ for t in range(500):
   h_relu = h.clamp(min=0)
   y_pred = h_relu.mm(w2)
 
-  # Compute and print loss
+  # Compute and print loss; loss is a scalar, and is stored in a PyTorch Tensor
+  # of shape (); we can get its value as a Python number with loss.item().
   loss = (y_pred - y).pow(2).sum()
-  print(t, loss)
+  print(t, loss.item())
 
   # Backprop to compute gradients of w1 and w2 with respect to loss
   grad_y_pred = 2.0 * (y_pred - y)
@@ -138,7 +144,7 @@
   w2 -= learning_rate * grad_w2
 ```
 
-## PyTorch: Variables and autograd
+## PyTorch: Autograd
 
 In the above examples, we had to manually implement both the forward and backward
 passes of our neural network. Manually implementing the backward pass
@@ -154,74 +160,75 @@ When using autograd, the forward pass of your network will define a
 functions that produce output Tensors from input Tensors. Backpropagating through
 this graph then allows you to easily compute gradients.
 
-This sounds complicated, it's pretty simple to use in practice. We wrap our
-PyTorch Tensors in **Variable** objects; a Variable represents a node in a
-computational graph. If `x` is a Variable then `x.data` is a Tensor, and
-`x.grad` is another Variable holding the gradient of `x` with respect to some
-scalar value.
-
-PyTorch Variables have the same API as PyTorch Tensors: (almost) any operation
-that you can perform on a Tensor also works on Variables; the difference is that
-using Variables defines a computational graph, allowing you to automatically
-compute gradients.
-
-Here we use PyTorch Variables and autograd to implement our two-layer network;
+This sounds complicated, but it's pretty simple to use in practice. If we want to
+compute gradients with respect to some Tensor, then we set `requires_grad=True`
+when constructing that Tensor. Any PyTorch operations on that Tensor will cause
+a computational graph to be constructed, allowing us to later perform backpropagation
+through the graph. If `x` is a Tensor with `requires_grad=True`, then after
+backpropagation `x.grad` will be another Tensor holding the gradient of `x` with
+respect to some scalar value.
+
+Sometimes you may wish to prevent PyTorch from building computational graphs when
+performing certain operations on Tensors with `requires_grad=True`; for example
+we usually don't want to backpropagate through the weight update steps when
+training a neural network. In such scenarios we can use the `torch.no_grad()`
+context manager to prevent the construction of a computational graph.
+
+Here we use PyTorch Tensors and autograd to implement our two-layer network;
 now we no longer need to manually implement the backward pass through the
 network:
 
 ```python
 # Code in file autograd/two_layer_net_autograd.py
 import torch
-from torch.autograd import Variable
 
-dtype = torch.FloatTensor
-# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU
+device = torch.device('cpu')
+# device = torch.device('cuda') # Uncomment this to run on GPU
 
 # N is batch size; D_in is input dimension;
 # H is hidden dimension; D_out is output dimension.
 N, D_in, H, D_out = 64, 1000, 100, 10
 
-# Create random Tensors to hold input and outputs, and wrap them in Variables.
-# Setting requires_grad=False indicates that we do not need to compute gradients
-# with respect to these Variables during the backward pass.
-x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
-y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)
+# Create random Tensors to hold input and outputs
+x = torch.randn(N, D_in, device=device)
+y = torch.randn(N, D_out, device=device)
 
-# Create random Tensors for weights, and wrap them in Variables.
-# Setting requires_grad=True indicates that we want to compute gradients with
-# respect to these Variables during the backward pass.
-w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True) -w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True) +# Create random Tensors for weights; setting requires_grad=True means that we +# want to compute gradients for these Tensors during the backward pass. +w1 = torch.randn(D_in, H, device=device, requires_grad=True) +w2 = torch.randn(H, D_out, device=device, requires_grad=True) learning_rate = 1e-6 for t in range(500): - # Forward pass: compute predicted y using operations on Variables; these - # are exactly the same operations we used to compute the forward pass using - # Tensors, but we do not need to keep references to intermediate values since - # we are not implementing the backward pass by hand. + # Forward pass: compute predicted y using operations on Tensors. Since w1 and + # w2 have requires_grad=True, operations involving these Tensors will cause + # PyTorch to build a computational graph, allowing automatic computation of + # gradients. Since we are no longer implementing the backward pass by hand we + # don't need to keep references to intermediate values. y_pred = x.mm(w1).clamp(min=0).mm(w2) - # Compute and print loss using operations on Variables. - # Now loss is a Variable of shape (1,) and loss.data is a Tensor of shape - # (1,); loss.data[0] is a scalar value holding the loss. + # Compute and print loss. Loss is a Tensor of shape (), and loss.item() + # is a Python number giving its value. loss = (y_pred - y).pow(2).sum() - print(t, loss.data[0]) - - # Manually zero the gradients before running the backward pass - w1.grad.data.zero_() - w2.grad.data.zero_() + print(t, loss.item()) # Use autograd to compute the backward pass. This call will compute the - # gradient of loss with respect to all Variables with requires_grad=True. - # After this call w1.grad and w2.grad will be Variables holding the gradient + # gradient of loss with respect to all Tensors with requires_grad=True. + # After this call w1.grad and w2.grad will be Tensors holding the gradient # of the loss with respect to w1 and w2 respectively. loss.backward() - # Update weights using gradient descent; w1.data and w2.data are Tensors, - # w1.grad and w2.grad are Variables and w1.grad.data and w2.grad.data are - # Tensors. - w1.data -= learning_rate * w1.grad.data - w2.data -= learning_rate * w2.grad.data + # Update weights using gradient descent. For this step we just want to mutate + # the values of w1 and w2 in-place; we don't want to build up a computational + # graph for the update steps, so we use the torch.no_grad() context manager + # to prevent PyTorch from building a computational graph for the updates + with torch.no_grad(): + w1 -= learning_rate * w1.grad + w2 -= learning_rate * w2.grad + + # Manually zero the gradients after running the backward pass + w1.grad.zero_() + w2.grad.zero_() ``` ## PyTorch: Defining new autograd functions @@ -234,7 +241,7 @@ with respect to that same scalar value. In PyTorch we can easily define our own autograd operator by defining a subclass of `torch.autograd.Function` and implementing the `forward` and `backward` functions. We can then use our new autograd operator by constructing an instance and calling it -like a function, passing Variables containing input data. +like a function, passing Tensors containing input data. 
In this example we define our own custom autograd function for performing the ReLU nonlinearity, and use it to implement our two-layer network: @@ -242,7 +249,6 @@ nonlinearity, and use it to implement our two-layer network: ```python # Code in file autograd/two_layer_net_custom_function.py import torch -from torch.autograd import Variable class MyReLU(torch.autograd.Function): """ @@ -250,65 +256,68 @@ class MyReLU(torch.autograd.Function): torch.autograd.Function and implementing the forward and backward passes which operate on Tensors. """ - def forward(self, input): + @staticmethod + def forward(ctx, x): """ - In the forward pass we receive a Tensor containing the input and return a - Tensor containing the output. You can cache arbitrary Tensors for use in the - backward pass using the save_for_backward method. + In the forward pass we receive a context object and a Tensor containing the + input; we must return a Tensor containing the output, and we can use the + context object to cache objects for use in the backward pass. """ - self.save_for_backward(input) - return input.clamp(min=0) + ctx.save_for_backward(x) + return x.clamp(min=0) - def backward(self, grad_output): + @staticmethod + def backward(ctx, grad_output): """ - In the backward pass we receive a Tensor containing the gradient of the loss - with respect to the output, and we need to compute the gradient of the loss - with respect to the input. + In the backward pass we receive the context object and a Tensor containing + the gradient of the loss with respect to the output produced during the + forward pass. We can retrieve cached data from the context object, and must + compute and return the gradient of the loss with respect to the input to the + forward function. """ - input, = self.saved_tensors - grad_input = grad_output.clone() - grad_input[input < 0] = 0 - return grad_input + x, = ctx.saved_tensors + grad_x = grad_output.clone() + grad_x[x < 0] = 0 + return grad_x -dtype = torch.FloatTensor -# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU +device = torch.device('cpu') +# device = torch.device('cuda') # Uncomment this to run on GPU # N is batch size; D_in is input dimension; # H is hidden dimension; D_out is output dimension. N, D_in, H, D_out = 64, 1000, 100, 10 -# Create random Tensors to hold input and outputs, and wrap them in Variables. -x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False) -y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False) +# Create random Tensors to hold input and output +x = torch.randn(N, D_in, device=device) +y = torch.randn(N, D_out, device=device) -# Create random Tensors for weights, and wrap them in Variables. -w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True) -w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True) +# Create random Tensors for weights. +w1 = torch.randn(D_in, H, device=device, requires_grad=True) +w2 = torch.randn(H, D_out, device=device, requires_grad=True) learning_rate = 1e-6 for t in range(500): - # Construct an instance of our MyReLU class to use in our network - relu = MyReLU() - - # Forward pass: compute predicted y using operations on Variables; we compute - # ReLU using our custom autograd operation. 
- y_pred = relu(x.mm(w1)).mm(w2) - + # Forward pass: compute predicted y using operations on Tensors; we call our + # custom ReLU implementation using the MyReLU.apply function + y_pred = MyReLU.apply(x.mm(w1)).mm(w2) + # Compute and print loss loss = (y_pred - y).pow(2).sum() - print(t, loss.data[0]) - - # Manually zero the gradients before running the backward pass - w1.grad.data.zero_() - w2.grad.data.zero_() + print(t, loss.item()) # Use autograd to compute the backward pass. loss.backward() - # Update weights using gradient descent - w1.data -= learning_rate * w1.grad.data - w2.data -= learning_rate * w2.grad.data + with torch.no_grad(): + # Update weights using gradient descent + w1 -= learning_rate * w1.grad + w2 -= learning_rate * w2.grad + + # Manually zero the gradients after running the backward pass + w1.grad.zero_() + w2.grad.zero_() + ``` ## TensorFlow: Static Graphs @@ -421,8 +430,8 @@ raw computational graphs that are useful for building neural networks. In PyTorch, the `nn` package serves this same purpose. The `nn` package defines a set of **Modules**, which are roughly equivalent to neural network layers. A Module receives -input Variables and computes output Variables, but may also hold internal state such as -Variables containing learnable parameters. The `nn` package also defines a set of useful +input Tensors and computes output Tensors, but may also hold internal state such as +Tensors containing learnable parameters. The `nn` package also defines a set of useful loss functions that are commonly used when training neural networks. In this example we use the `nn` package to implement our two-layer network: @@ -430,62 +439,71 @@ In this example we use the `nn` package to implement our two-layer network: ```python # Code in file nn/two_layer_net_nn.py import torch -from torch.autograd import Variable + +device = torch.device('cpu') +# device = torch.device('cuda') # Uncomment this to run on GPU # N is batch size; D_in is input dimension; # H is hidden dimension; D_out is output dimension. N, D_in, H, D_out = 64, 1000, 100, 10 -# Create random Tensors to hold inputs and outputs, and wrap them in Variables. -x = Variable(torch.randn(N, D_in)) -y = Variable(torch.randn(N, D_out), requires_grad=False) +# Create random Tensors to hold inputs and outputs +x = torch.randn(N, D_in, device=device) +y = torch.randn(N, D_out, device=device) # Use the nn package to define our model as a sequence of layers. nn.Sequential # is a Module which contains other Modules, and applies them in sequence to # produce its output. Each Linear Module computes output from input using a -# linear function, and holds internal Variables for its weight and bias. +# linear function, and holds internal Tensors for its weight and bias. +# After constructing the model we use the .to() method to move it to the +# desired device. model = torch.nn.Sequential( torch.nn.Linear(D_in, H), torch.nn.ReLU(), torch.nn.Linear(H, D_out), - ) + ).to(device) # The nn package also contains definitions of popular loss functions; in this -# case we will use Mean Squared Error (MSE) as our loss function. -loss_fn = torch.nn.MSELoss(size_average=False) +# case we will use Mean Squared Error (MSE) as our loss function. Setting +# reduction='sum' means that we are computing the *sum* of squared errors rather +# than the mean; this is for consistency with the examples above where we +# manually compute the loss, but in practice it is more common to use mean +# squared error as a loss by setting reduction='elementwise_mean'. 
+loss_fn = torch.nn.MSELoss(reduction='sum') learning_rate = 1e-4 for t in range(500): # Forward pass: compute predicted y by passing x to the model. Module objects # override the __call__ operator so you can call them like functions. When - # doing so you pass a Variable of input data to the Module and it produces - # a Variable of output data. + # doing so you pass a Tensor of input data to the Module and it produces + # a Tensor of output data. y_pred = model(x) - # Compute and print loss. We pass Variables containing the predicted and true - # values of y, and the loss function returns a Variable containing the loss. + # Compute and print loss. We pass Tensors containing the predicted and true + # values of y, and the loss function returns a Tensor containing the loss. loss = loss_fn(y_pred, y) - print(t, loss.data[0]) + print(t, loss.item()) # Zero the gradients before running the backward pass. model.zero_grad() # Backward pass: compute gradient of the loss with respect to all the learnable # parameters of the model. Internally, the parameters of each Module are stored - # in Variables with requires_grad=True, so this call will compute gradients for + # in Tensors with requires_grad=True, so this call will compute gradients for # all learnable parameters in the model. loss.backward() - # Update the weights using gradient descent. Each parameter is a Variable, so + # Update the weights using gradient descent. Each parameter is a Tensor, so # we can access its data and gradients like we did before. - for param in model.parameters(): - param.data -= learning_rate * param.grad.data + with torch.no_grad(): + for param in model.parameters(): + param.data -= learning_rate * param.grad ``` ## PyTorch: optim -Up to this point we have updated the weights of our models by manually mutating the -`.data` member for Variables holding learnable parameters. This is not a huge burden +Up to this point we have updated the weights of our models by manually mutating +Tensors holding learnable parameters. This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisiticated optimizers like AdaGrad, RMSProp, Adam, etc. @@ -499,15 +517,14 @@ will optimize the model using the Adam algorithm provided by the `optim` package ```python # Code in file nn/two_layer_net_optim.py import torch -from torch.autograd import Variable # N is batch size; D_in is input dimension; # H is hidden dimension; D_out is output dimension. N, D_in, H, D_out = 64, 1000, 100, 10 -# Create random Tensors to hold inputs and outputs, and wrap them in Variables. -x = Variable(torch.randn(N, D_in)) -y = Variable(torch.randn(N, D_out), requires_grad=False) +# Create random Tensors to hold inputs and outputs. +x = torch.randn(N, D_in) +y = torch.randn(N, D_out) # Use the nn package to define our model and loss function. model = torch.nn.Sequential( @@ -515,12 +532,12 @@ model = torch.nn.Sequential( torch.nn.ReLU(), torch.nn.Linear(H, D_out), ) -loss_fn = torch.nn.MSELoss(size_average=False) +loss_fn = torch.nn.MSELoss(reduction='sum') # Use the optim package to define an Optimizer that will update the weights of # the model for us. Here we will use Adam; the optim package contains many other -# optimization algoriths. The first argument to the Adam constructor tells the -# optimizer which Variables it should update. +# optimization algorithms. The first argument to the Adam constructor tells the +# optimizer which Tensors it should update. 
learning_rate = 1e-4 optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate) for t in range(500): @@ -529,10 +546,10 @@ for t in range(500): # Compute and print loss. loss = loss_fn(y_pred, y) - print(t, loss.data[0]) + print(t, loss.item()) # Before the backward pass, use the optimizer object to zero all of the - # gradients for the variables it will update (which are the learnable weights + # gradients for the Tensors it will update (which are the learnable weights # of the model) optimizer.zero_grad() @@ -547,15 +564,14 @@ for t in range(500): ## PyTorch: Custom nn Modules Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you can define your own Modules by subclassing -`nn.Module` and defining a `forward` which receives input Variables and produces -output Variables using other modules or other autograd operations on Variables. +`nn.Module` and defining a `forward` which receives input Tensors and produces +output Tensors using other modules or other autograd operations on Tensors. In this example we implement our two-layer network as a custom Module subclass: ```python # Code in file nn/two_layer_net_module.py import torch -from torch.autograd import Variable class TwoLayerNet(torch.nn.Module): def __init__(self, D_in, H, D_out): @@ -569,38 +585,37 @@ class TwoLayerNet(torch.nn.Module): def forward(self, x): """ - In the forward function we accept a Variable of input data and we must return - a Variable of output data. We can use Modules defined in the constructor as - well as arbitrary operators on Variables. + In the forward function we accept a Tensor of input data and we must return + a Tensor of output data. We can use Modules defined in the constructor as + well as arbitrary (differentiable) operations on Tensors. """ h_relu = self.linear1(x).clamp(min=0) y_pred = self.linear2(h_relu) return y_pred - # N is batch size; D_in is input dimension; # H is hidden dimension; D_out is output dimension. N, D_in, H, D_out = 64, 1000, 100, 10 -# Create random Tensors to hold inputs and outputs, and wrap them in Variables -x = Variable(torch.randn(N, D_in)) -y = Variable(torch.randn(N, D_out), requires_grad=False) +# Create random Tensors to hold inputs and outputs +x = torch.randn(N, D_in) +y = torch.randn(N, D_out) -# Construct our model by instantiating the class defined above +# Construct our model by instantiating the class defined above. model = TwoLayerNet(D_in, H, D_out) # Construct our loss function and an Optimizer. The call to model.parameters() # in the SGD constructor will contain the learnable parameters of the two # nn.Linear modules which are members of the model. -criterion = torch.nn.MSELoss(size_average=False) +loss_fn = torch.nn.MSELoss(reduction='sum') optimizer = torch.optim.SGD(model.parameters(), lr=1e-4) for t in range(500): # Forward pass: Compute predicted y by passing x to the model y_pred = model(x) # Compute and print loss - loss = criterion(y_pred, y) - print(t, loss.data[0]) + loss = loss_fn(y_pred, y) + print(t, loss.item()) # Zero gradients, perform a backward pass, and update the weights. optimizer.zero_grad() @@ -626,7 +641,6 @@ We can easily implement this model as a Module subclass: # Code in file nn/dynamic_net.py import random import torch -from torch.autograd import Variable class DynamicNet(torch.nn.Module): def __init__(self, D_in, H, D_out): @@ -664,16 +678,16 @@ class DynamicNet(torch.nn.Module): # H is hidden dimension; D_out is output dimension. 
 N, D_in, H, D_out = 64, 1000, 100, 10
 
-# Create random Tensors to hold inputs and outputs, and wrap them in Variables
-x = Variable(torch.randn(N, D_in))
-y = Variable(torch.randn(N, D_out), requires_grad=False)
+# Create random Tensors to hold inputs and outputs.
+x = torch.randn(N, D_in)
+y = torch.randn(N, D_out)
 
 # Construct our model by instantiating the class defined above
 model = DynamicNet(D_in, H, D_out)
 
 # Construct our loss function and an Optimizer. Training this strange model with
 # vanilla stochastic gradient descent is tough, so we use momentum
-criterion = torch.nn.MSELoss(size_average=False)
+criterion = torch.nn.MSELoss(reduction='sum')
 optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
 for t in range(500):
   # Forward pass: Compute predicted y by passing x to the model
@@ -681,7 +695,7 @@ for t in range(500):
 
   # Compute and print loss
   loss = criterion(y_pred, y)
-  print(t, loss.data[0])
+  print(t, loss.item())
 
   # Zero gradients, perform a backward pass, and update the weights.
   optimizer.zero_grad()
diff --git a/README_raw.md b/README_raw.md
index b81f5c4..8c1fbed 100644
--- a/README_raw.md
+++ b/README_raw.md
@@ -11,10 +11,15 @@ will have a single hidden layer, and will be trained with gradient descent to
 fit random data by minimizing the Euclidean distance between the network output
 and the true output.
 
+**NOTE:** These examples have been updated for PyTorch 0.4, which made several
+major changes to the core PyTorch API. Most notably, prior to 0.4 Tensors had
+to be wrapped in Variable objects to use autograd; this functionality has now
+been added directly to Tensors, and Variables are now deprecated.
+
 ### Table of Contents
 - Warm-up: numpy
 - PyTorch: Tensors
-- PyTorch: Variables and autograd
+- PyTorch: Autograd
 - PyTorch: Defining new autograd functions
 - TensorFlow: Static Graphs
 - PyTorch: nn
@@ -46,24 +51,24 @@ unfortunately numpy won't be enough for modern deep learning.
 
 Here we introduce the most fundamental PyTorch concept: the **Tensor**. A PyTorch
 Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional
-array, and PyTorch provides many functions for operating on these Tensors. Like
-numpy arrays, PyTorch Tensors do not know anything about deep learning or
-computational graphs or gradients; they are a generic tool for scientific
+array, and PyTorch provides many functions for operating on these Tensors.
+Any computation you might want to perform with numpy can also be accomplished
+with PyTorch Tensors; you should think of them as a generic tool for scientific
 computing.
 
 However unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their
-numeric computations. To run a PyTorch Tensor on GPU, you simply need to cast it
-to a new datatype.
+numeric computations. To run a PyTorch Tensor on GPU, you use the `device`
+argument when constructing the Tensor to place it on the GPU.
 
 Here we use PyTorch Tensors to fit a two-layer network to random data. Like the
-numpy example above we need to manually implement the forward and backward
-passes through the network:
+numpy example above, we manually implement the forward and backward
+passes through the network, using operations on PyTorch Tensors:
 
 ```python
 :INCLUDE tensor/two_layer_net_tensor.py
 ```
 
-## PyTorch: Variables and autograd
+## PyTorch: Autograd
 
 In the above examples, we had to manually implement both the forward and backward
 passes of our neural network. Manually implementing the backward pass
@@ -79,18 +84,21 @@ When using autograd, the forward pass of your network will define a
 functions that produce output Tensors from input Tensors. Backpropagating through
 this graph then allows you to easily compute gradients.
 
-This sounds complicated, it's pretty simple to use in practice. We wrap our
-PyTorch Tensors in **Variable** objects; a Variable represents a node in a
-computational graph. If `x` is a Variable then `x.data` is a Tensor, and
-`x.grad` is another Variable holding the gradient of `x` with respect to some
-scalar value.
-
-PyTorch Variables have the same API as PyTorch Tensors: (almost) any operation
-that you can perform on a Tensor also works on Variables; the difference is that
-using Variables defines a computational graph, allowing you to automatically
-compute gradients.
-
-Here we use PyTorch Variables and autograd to implement our two-layer network;
+This sounds complicated, but it's pretty simple to use in practice. If we want to
+compute gradients with respect to some Tensor, then we set `requires_grad=True`
+when constructing that Tensor. Any PyTorch operations on that Tensor will cause
+a computational graph to be constructed, allowing us to later perform backpropagation
+through the graph. If `x` is a Tensor with `requires_grad=True`, then after
+backpropagation `x.grad` will be another Tensor holding the gradient of `x` with
+respect to some scalar value.
+
+Sometimes you may wish to prevent PyTorch from building computational graphs when
+performing certain operations on Tensors with `requires_grad=True`; for example
+we usually don't want to backpropagate through the weight update steps when
+training a neural network. In such scenarios we can use the `torch.no_grad()`
+context manager to prevent the construction of a computational graph.
+
+Here we use PyTorch Tensors and autograd to implement our two-layer network;
 now we no longer need to manually implement the backward pass through the
 network:
 
@@ -108,7 +116,7 @@ with respect to that same scalar value.
 In PyTorch we can easily define our own autograd operator by defining a subclass
 of `torch.autograd.Function` and implementing the `forward` and `backward` functions.
 We can then use our new autograd operator by constructing an instance and calling it
-like a function, passing Variables containing input data.
+like a function, passing Tensors containing input data.
 
 In this example we define our own custom autograd function for performing the
 ReLU nonlinearity, and use it to implement our two-layer network:
@@ -168,8 +176,8 @@ raw computational graphs that are useful for building neural networks.
 
 In PyTorch, the `nn` package serves this same purpose. The `nn` package defines a set of
 **Modules**, which are roughly equivalent to neural network layers. A Module receives
-input Variables and computes output Variables, but may also hold internal state such as
-Variables containing learnable parameters. The `nn` package also defines a set of useful
+input Tensors and computes output Tensors, but may also hold internal state such as
+Tensors containing learnable parameters. The `nn` package also defines a set of useful
 loss functions that are commonly used when training neural networks.
In this example we use the `nn` package to implement our two-layer network: @@ -180,8 +188,8 @@ In this example we use the `nn` package to implement our two-layer network: ## PyTorch: optim -Up to this point we have updated the weights of our models by manually mutating the -`.data` member for Variables holding learnable parameters. This is not a huge burden +Up to this point we have updated the weights of our models by manually mutating +Tensors holding learnable parameters. This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisiticated optimizers like AdaGrad, RMSProp, Adam, etc. @@ -200,8 +208,8 @@ will optimize the model using the Adam algorithm provided by the `optim` package ## PyTorch: Custom nn Modules Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you can define your own Modules by subclassing -`nn.Module` and defining a `forward` which receives input Variables and produces -output Variables using other modules or other autograd operations on Variables. +`nn.Module` and defining a `forward` which receives input Tensors and produces +output Tensors using other modules or other autograd operations on Tensors. In this example we implement our two-layer network as a custom Module subclass: diff --git a/autograd/two_layer_net_autograd.py b/autograd/two_layer_net_autograd.py index ad9f4fa..2a5bb7f 100644 --- a/autograd/two_layer_net_autograd.py +++ b/autograd/two_layer_net_autograd.py @@ -1,68 +1,65 @@ import torch -from torch.autograd import Variable """ A fully-connected ReLU network with one hidden layer and no biases, trained to predict y from x by minimizing squared Euclidean distance. This implementation computes the forward pass using operations on PyTorch -Variables, and uses PyTorch autograd to compute gradients. +Tensors, and uses PyTorch autograd to compute gradients. -A PyTorch Variable is a wrapper around a PyTorch Tensor, and represents a node -in a computational graph. If x is a Variable then x.data is a Tensor giving its -value, and x.grad is another Variable holding the gradient of x with respect to -some scalar value. - -PyTorch Variables have the same API as PyTorch tensors: (almost) any operation -you can do on a Tensor you can also do on a Variable; the difference is that -autograd allows you to automatically compute gradients. +When we create a PyTorch Tensor with requires_grad=True, then operations +involving that Tensor will not just compute values; they will also build up +a computational graph in the background, allowing us to easily backpropagate +through the graph to compute gradients of some downstream (scalar) loss with +respect to a Tensor. Concretely if x is a Tensor with x.requires_grad == True +then after backpropagation x.grad will be another Tensor holding the gradient +of x with respect to some scalar value. """ -dtype = torch.FloatTensor -# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU +device = torch.device('cpu') +# device = torch.device('cuda') # Uncomment this to run on GPU # N is batch size; D_in is input dimension; # H is hidden dimension; D_out is output dimension. N, D_in, H, D_out = 64, 1000, 100, 10 -# Create random Tensors to hold input and outputs, and wrap them in Variables. -# Setting requires_grad=False indicates that we do not need to compute gradients -# with respect to these Variables during the backward pass. 
-x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False) -y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False) +# Create random Tensors to hold input and outputs +x = torch.randn(N, D_in, device=device) +y = torch.randn(N, D_out, device=device) -# Create random Tensors for weights, and wrap them in Variables. -# Setting requires_grad=True indicates that we want to compute gradients with -# respect to these Variables during the backward pass. -w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True) -w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True) +# Create random Tensors for weights; setting requires_grad=True means that we +# want to compute gradients for these Tensors during the backward pass. +w1 = torch.randn(D_in, H, device=device, requires_grad=True) +w2 = torch.randn(H, D_out, device=device, requires_grad=True) learning_rate = 1e-6 for t in range(500): - # Forward pass: compute predicted y using operations on Variables; these - # are exactly the same operations we used to compute the forward pass using - # Tensors, but we do not need to keep references to intermediate values since - # we are not implementing the backward pass by hand. + # Forward pass: compute predicted y using operations on Tensors. Since w1 and + # w2 have requires_grad=True, operations involving these Tensors will cause + # PyTorch to build a computational graph, allowing automatic computation of + # gradients. Since we are no longer implementing the backward pass by hand we + # don't need to keep references to intermediate values. y_pred = x.mm(w1).clamp(min=0).mm(w2) - # Compute and print loss using operations on Variables. - # Now loss is a Variable of shape (1,) and loss.data is a Tensor of shape - # (1,); loss.data[0] is a scalar value holding the loss. + # Compute and print loss. Loss is a Tensor of shape (), and loss.item() + # is a Python number giving its value. loss = (y_pred - y).pow(2).sum() - print(t, loss.data[0]) - - # Manually zero the gradients before running the backward pass - w1.grad.data.zero_() - w2.grad.data.zero_() + print(t, loss.item()) # Use autograd to compute the backward pass. This call will compute the - # gradient of loss with respect to all Variables with requires_grad=True. - # After this call w1.grad and w2.grad will be Variables holding the gradient + # gradient of loss with respect to all Tensors with requires_grad=True. + # After this call w1.grad and w2.grad will be Tensors holding the gradient # of the loss with respect to w1 and w2 respectively. loss.backward() - # Update weights using gradient descent; w1.data and w2.data are Tensors, - # w1.grad and w2.grad are Variables and w1.grad.data and w2.grad.data are - # Tensors. - w1.data -= learning_rate * w1.grad.data - w2.data -= learning_rate * w2.grad.data + # Update weights using gradient descent. 
For this step we just want to mutate + # the values of w1 and w2 in-place; we don't want to build up a computational + # graph for the update steps, so we use the torch.no_grad() context manager + # to prevent PyTorch from building a computational graph for the updates + with torch.no_grad(): + w1 -= learning_rate * w1.grad + w2 -= learning_rate * w2.grad + + # Manually zero the gradients after running the backward pass + w1.grad.zero_() + w2.grad.zero_() diff --git a/autograd/two_layer_net_custom_function.py b/autograd/two_layer_net_custom_function.py index ef75c09..6c768d1 100644 --- a/autograd/two_layer_net_custom_function.py +++ b/autograd/two_layer_net_custom_function.py @@ -1,12 +1,11 @@ import torch -from torch.autograd import Variable """ A fully-connected ReLU network with one hidden layer and no biases, trained to predict y from x by minimizing squared Euclidean distance. This implementation computes the forward pass using operations on PyTorch -Variables, and uses PyTorch autograd to compute gradients. +Tensors, and uses PyTorch autograd to compute gradients. In this implementation we implement our own custom autograd function to perform the ReLU function. @@ -18,62 +17,65 @@ class MyReLU(torch.autograd.Function): torch.autograd.Function and implementing the forward and backward passes which operate on Tensors. """ - def forward(self, input): + @staticmethod + def forward(ctx, x): """ - In the forward pass we receive a Tensor containing the input and return a - Tensor containing the output. You can cache arbitrary Tensors for use in the - backward pass using the save_for_backward method. + In the forward pass we receive a context object and a Tensor containing the + input; we must return a Tensor containing the output, and we can use the + context object to cache objects for use in the backward pass. """ - self.save_for_backward(input) - return input.clamp(min=0) + ctx.save_for_backward(x) + return x.clamp(min=0) - def backward(self, grad_output): + @staticmethod + def backward(ctx, grad_output): """ - In the backward pass we receive a Tensor containing the gradient of the loss - with respect to the output, and we need to compute the gradient of the loss - with respect to the input. + In the backward pass we receive the context object and a Tensor containing + the gradient of the loss with respect to the output produced during the + forward pass. We can retrieve cached data from the context object, and must + compute and return the gradient of the loss with respect to the input to the + forward function. """ - input, = self.saved_tensors - grad_input = grad_output.clone() - grad_input[input < 0] = 0 - return grad_input + x, = ctx.saved_tensors + grad_x = grad_output.clone() + grad_x[x < 0] = 0 + return grad_x -dtype = torch.FloatTensor -# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU +device = torch.device('cpu') +# device = torch.device('cuda') # Uncomment this to run on GPU # N is batch size; D_in is input dimension; # H is hidden dimension; D_out is output dimension. N, D_in, H, D_out = 64, 1000, 100, 10 -# Create random Tensors to hold input and outputs, and wrap them in Variables. -x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False) -y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False) +# Create random Tensors to hold input and output +x = torch.randn(N, D_in, device=device) +y = torch.randn(N, D_out, device=device) -# Create random Tensors for weights, and wrap them in Variables. 
-w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True) -w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True) +# Create random Tensors for weights. +w1 = torch.randn(D_in, H, device=device, requires_grad=True) +w2 = torch.randn(H, D_out, device=device, requires_grad=True) learning_rate = 1e-6 for t in range(500): - # Construct an instance of our MyReLU class to use in our network - relu = MyReLU() - - # Forward pass: compute predicted y using operations on Variables; we compute - # ReLU using our custom autograd operation. - y_pred = relu(x.mm(w1)).mm(w2) - + # Forward pass: compute predicted y using operations on Tensors; we call our + # custom ReLU implementation using the MyReLU.apply function + y_pred = MyReLU.apply(x.mm(w1)).mm(w2) + # Compute and print loss loss = (y_pred - y).pow(2).sum() - print(t, loss.data[0]) - - # Manually zero the gradients before running the backward pass - w1.grad.data.zero_() - w2.grad.data.zero_() + print(t, loss.item()) # Use autograd to compute the backward pass. loss.backward() - # Update weights using gradient descent - w1.data -= learning_rate * w1.grad.data - w2.data -= learning_rate * w2.grad.data + with torch.no_grad(): + # Update weights using gradient descent + w1 -= learning_rate * w1.grad + w2 -= learning_rate * w2.grad + + # Manually zero the gradients after running the backward pass + w1.grad.zero_() + w2.grad.zero_() + diff --git a/nn/dynamic_net.py b/nn/dynamic_net.py index a2e1714..ce4b4a0 100644 --- a/nn/dynamic_net.py +++ b/nn/dynamic_net.py @@ -1,6 +1,5 @@ import random import torch -from torch.autograd import Variable """ To showcase the power of PyTorch dynamic graphs, we will implement a very strange @@ -45,16 +44,16 @@ def forward(self, x): # H is hidden dimension; D_out is output dimension. N, D_in, H, D_out = 64, 1000, 100, 10 -# Create random Tensors to hold inputs and outputs, and wrap them in Variables -x = Variable(torch.randn(N, D_in)) -y = Variable(torch.randn(N, D_out), requires_grad=False) +# Create random Tensors to hold inputs and outputs. +x = torch.randn(N, D_in) +y = torch.randn(N, D_out) # Construct our model by instantiating the class defined above model = DynamicNet(D_in, H, D_out) # Construct our loss function and an Optimizer. Training this strange model with # vanilla stochastic gradient descent is tough, so we use momentum -criterion = torch.nn.MSELoss(size_average=False) +criterion = torch.nn.MSELoss(reduction='sum') optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9) for t in range(500): # Forward pass: Compute predicted y by passing x to the model @@ -62,7 +61,7 @@ def forward(self, x): # Compute and print loss loss = criterion(y_pred, y) - print(t, loss.data[0]) + print(t, loss.item()) # Zero gradients, perform a backward pass, and update the weights. optimizer.zero_grad() diff --git a/nn/two_layer_net_module.py b/nn/two_layer_net_module.py index feb6075..e86127e 100644 --- a/nn/two_layer_net_module.py +++ b/nn/two_layer_net_module.py @@ -1,5 +1,4 @@ import torch -from torch.autograd import Variable """ A fully-connected ReLU network with one hidden layer, trained to predict y from x @@ -22,38 +21,37 @@ def __init__(self, D_in, H, D_out): def forward(self, x): """ - In the forward function we accept a Variable of input data and we must return - a Variable of output data. We can use Modules defined in the constructor as - well as arbitrary operators on Variables. 
+ In the forward function we accept a Tensor of input data and we must return + a Tensor of output data. We can use Modules defined in the constructor as + well as arbitrary (differentiable) operations on Tensors. """ h_relu = self.linear1(x).clamp(min=0) y_pred = self.linear2(h_relu) return y_pred - # N is batch size; D_in is input dimension; # H is hidden dimension; D_out is output dimension. N, D_in, H, D_out = 64, 1000, 100, 10 -# Create random Tensors to hold inputs and outputs, and wrap them in Variables -x = Variable(torch.randn(N, D_in)) -y = Variable(torch.randn(N, D_out), requires_grad=False) +# Create random Tensors to hold inputs and outputs +x = torch.randn(N, D_in) +y = torch.randn(N, D_out) -# Construct our model by instantiating the class defined above +# Construct our model by instantiating the class defined above. model = TwoLayerNet(D_in, H, D_out) # Construct our loss function and an Optimizer. The call to model.parameters() # in the SGD constructor will contain the learnable parameters of the two # nn.Linear modules which are members of the model. -criterion = torch.nn.MSELoss(size_average=False) +loss_fn = torch.nn.MSELoss(reduction='sum') optimizer = torch.optim.SGD(model.parameters(), lr=1e-4) for t in range(500): # Forward pass: Compute predicted y by passing x to the model y_pred = model(x) # Compute and print loss - loss = criterion(y_pred, y) - print(t, loss.data[0]) + loss = loss_fn(y_pred, y) + print(t, loss.item()) # Zero gradients, perform a backward pass, and update the weights. optimizer.zero_grad() diff --git a/nn/two_layer_net_nn.py b/nn/two_layer_net_nn.py index f75fa40..ec4f897 100644 --- a/nn/two_layer_net_nn.py +++ b/nn/two_layer_net_nn.py @@ -1,5 +1,4 @@ import torch -from torch.autograd import Variable """ A fully-connected ReLU network with one hidden layer, trained to predict y from x @@ -10,54 +9,64 @@ but raw autograd can be a bit too low-level for defining complex neural networks; this is where the nn package can help. The nn package defines a set of Modules, which you can think of as a neural network layer that has produces output from -input and may have some trainable weights. +input and may have some trainable weights or other state. """ +device = torch.device('cpu') +# device = torch.device('cuda') # Uncomment this to run on GPU + # N is batch size; D_in is input dimension; # H is hidden dimension; D_out is output dimension. N, D_in, H, D_out = 64, 1000, 100, 10 -# Create random Tensors to hold inputs and outputs, and wrap them in Variables. -x = Variable(torch.randn(N, D_in)) -y = Variable(torch.randn(N, D_out), requires_grad=False) +# Create random Tensors to hold inputs and outputs +x = torch.randn(N, D_in, device=device) +y = torch.randn(N, D_out, device=device) # Use the nn package to define our model as a sequence of layers. nn.Sequential # is a Module which contains other Modules, and applies them in sequence to # produce its output. Each Linear Module computes output from input using a -# linear function, and holds internal Variables for its weight and bias. +# linear function, and holds internal Tensors for its weight and bias. +# After constructing the model we use the .to() method to move it to the +# desired device. model = torch.nn.Sequential( torch.nn.Linear(D_in, H), torch.nn.ReLU(), torch.nn.Linear(H, D_out), - ) + ).to(device) # The nn package also contains definitions of popular loss functions; in this -# case we will use Mean Squared Error (MSE) as our loss function. 
-loss_fn = torch.nn.MSELoss(size_average=False) +# case we will use Mean Squared Error (MSE) as our loss function. Setting +# reduction='sum' means that we are computing the *sum* of squared errors rather +# than the mean; this is for consistency with the examples above where we +# manually compute the loss, but in practice it is more common to use mean +# squared error as a loss by setting reduction='elementwise_mean'. +loss_fn = torch.nn.MSELoss(reduction='sum') learning_rate = 1e-4 for t in range(500): # Forward pass: compute predicted y by passing x to the model. Module objects # override the __call__ operator so you can call them like functions. When - # doing so you pass a Variable of input data to the Module and it produces - # a Variable of output data. + # doing so you pass a Tensor of input data to the Module and it produces + # a Tensor of output data. y_pred = model(x) - # Compute and print loss. We pass Variables containing the predicted and true - # values of y, and the loss function returns a Variable containing the loss. + # Compute and print loss. We pass Tensors containing the predicted and true + # values of y, and the loss function returns a Tensor containing the loss. loss = loss_fn(y_pred, y) - print(t, loss.data[0]) + print(t, loss.item()) # Zero the gradients before running the backward pass. model.zero_grad() # Backward pass: compute gradient of the loss with respect to all the learnable # parameters of the model. Internally, the parameters of each Module are stored - # in Variables with requires_grad=True, so this call will compute gradients for + # in Tensors with requires_grad=True, so this call will compute gradients for # all learnable parameters in the model. loss.backward() - # Update the weights using gradient descent. Each parameter is a Variable, so + # Update the weights using gradient descent. Each parameter is a Tensor, so # we can access its data and gradients like we did before. - for param in model.parameters(): - param.data -= learning_rate * param.grad.data + with torch.no_grad(): + for param in model.parameters(): + param.data -= learning_rate * param.grad diff --git a/nn/two_layer_net_optim.py b/nn/two_layer_net_optim.py index fce735b..84a7f2e 100644 --- a/nn/two_layer_net_optim.py +++ b/nn/two_layer_net_optim.py @@ -1,5 +1,4 @@ import torch -from torch.autograd import Variable """ A fully-connected ReLU network with one hidden layer, trained to predict y from x @@ -17,9 +16,9 @@ # H is hidden dimension; D_out is output dimension. N, D_in, H, D_out = 64, 1000, 100, 10 -# Create random Tensors to hold inputs and outputs, and wrap them in Variables. -x = Variable(torch.randn(N, D_in)) -y = Variable(torch.randn(N, D_out), requires_grad=False) +# Create random Tensors to hold inputs and outputs. +x = torch.randn(N, D_in) +y = torch.randn(N, D_out) # Use the nn package to define our model and loss function. model = torch.nn.Sequential( @@ -27,12 +26,12 @@ torch.nn.ReLU(), torch.nn.Linear(H, D_out), ) -loss_fn = torch.nn.MSELoss(size_average=False) +loss_fn = torch.nn.MSELoss(reduction='sum') # Use the optim package to define an Optimizer that will update the weights of # the model for us. Here we will use Adam; the optim package contains many other # optimization algoriths. The first argument to the Adam constructor tells the -# optimizer which Variables it should update. +# optimizer which Tensors it should update. 
learning_rate = 1e-4 optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate) for t in range(500): @@ -41,10 +40,10 @@ # Compute and print loss. loss = loss_fn(y_pred, y) - print(t, loss.data[0]) + print(t, loss.item()) # Before the backward pass, use the optimizer object to zero all of the - # gradients for the variables it will update (which are the learnable weights + # gradients for the Tensors it will update (which are the learnable weights # of the model) optimizer.zero_grad() diff --git a/tensor/two_layer_net_tensor.py b/tensor/two_layer_net_tensor.py index 0624ebd..72d27b2 100644 --- a/tensor/two_layer_net_tensor.py +++ b/tensor/two_layer_net_tensor.py @@ -13,23 +13,24 @@ The biggest difference between a numpy array and a PyTorch Tensor is that a PyTorch Tensor can run on either CPU or GPU. To run operations on the GPU, -just cast the Tensor to a cuda datatype. +just pass a different value to the `device` argument when constructing the +Tensor. """ -dtype = torch.FloatTensor -# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU +device = torch.device('cpu') +# device = torch.device('cuda') # Uncomment this to run on GPU # N is batch size; D_in is input dimension; # H is hidden dimension; D_out is output dimension. N, D_in, H, D_out = 64, 1000, 100, 10 # Create random input and output data -x = torch.randn(N, D_in).type(dtype) -y = torch.randn(N, D_out).type(dtype) +x = torch.randn(N, D_in, device=device) +y = torch.randn(N, D_out, device=device) # Randomly initialize weights -w1 = torch.randn(D_in, H).type(dtype) -w2 = torch.randn(H, D_out).type(dtype) +w1 = torch.randn(D_in, H, device=device) +w2 = torch.randn(H, D_out, device=device) learning_rate = 1e-6 for t in range(500): @@ -38,9 +39,10 @@ h_relu = h.clamp(min=0) y_pred = h_relu.mm(w2) - # Compute and print loss + # Compute and print loss; loss is a scalar, and is stored in a PyTorch Tensor + # of shape (); we can get its value as a Python number with loss.item(). loss = (y_pred - y).pow(2).sum() - print(t, loss) + print(t, loss.item()) # Backprop to compute gradients of w1 and w2 with respect to loss grad_y_pred = 2.0 * (y_pred - y)