|
61 | 61 | "id": "Ys81cOhXOWUP"
|
62 | 62 | },
|
63 | 63 | "source": [
|
64 |
| - "Experimental support for Cloud TPUs is currently available for Keras\n", |
65 |
| - "and Google Colab. Before you run this Colab notebooks, ensure that\n", |
66 |
| - "your hardware accelerator is a TPU by checking your notebook settings:\n", |
67 |
| - "Runtime > Change runtime type > Hardware accelerator > TPU." |
| 64 | + "Experimental support for Cloud TPUs is currently available for Keras and [Google Colaboratory (Colab)](https://colab.research.google.com). Before you run this Colab notebook, make sure that your hardware accelerator is a TPU by checking your notebook settings: **Runtime** > **Change runtime type** > **Hardware accelerator** > **TPU**." |
68 | 65 | ]
|
69 | 66 | },
|
70 | 67 | {
|
|
96 | 93 | "id": "yDWaRxSpwBN1"
|
97 | 94 | },
|
98 | 95 | "source": [
|
99 |
| - "## TPU Initialization\n", |
100 |
| - "TPUs are usually on Cloud TPU workers which are different from the local process running the user python program. Thus some initialization work needs to be done to connect to the remote cluster and initialize TPUs. Note that the `tpu` argument to `TPUClusterResolver` is a special address just for Colab. In the case that you are running on Google Compute Engine (GCE), you should instead pass in the name of your CloudTPU." |
| 96 | + "## TPU initialization\n", |
| 97 | + "\n", |
| 98 | + "TPUs are typically Cloud TPU workers, which are different from the local process running the user's Python program. Thus, you need to do some initialization work to connect to the remote cluster and initialize the TPUs. Note that the `tpu` argument to `TPUClusterResolver` is a special address just for Colab. If you are running your code on Google Compute Engine (GCE), you should instead pass in the name of your Cloud TPU." |
101 | 99 | ]
|
102 | 100 | },
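A minimal sketch of this initialization sequence, assuming a Colab runtime (where the empty `tpu` argument is resolved automatically):

```python
import tensorflow as tf

# On Colab, the empty `tpu` argument resolves to the attached TPU.
# On Google Compute Engine, pass the name of your Cloud TPU instead.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)

# Initialize the TPU system before running any other TPU work.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))
```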
|
103 | 101 | {
|
|
131 | 129 | },
|
132 | 130 | "source": [
|
133 | 131 | "## Manual device placement\n",
|
134 |
| - "After the TPU is initialized, you can use manual device placement to place the computation on a single TPU device.\n" |
| 132 | + "\n", |
| 133 | + "After the TPU is initialized, you can use manual device placement to place the computation on a single TPU device:\n" |
135 | 134 | ]
|
136 | 135 | },
|
137 | 136 | {
|
|
144 | 143 | "source": [
|
145 | 144 | "a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])\n",
|
146 | 145 | "b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])\n",
|
| 146 | + "\n", |
147 | 147 | "with tf.device('/TPU:0'):\n",
|
148 | 148 | " c = tf.matmul(a, b)\n",
|
| 149 | + "\n", |
149 | 150 | "print(\"c device: \", c.device)\n",
|
150 | 151 | "print(c)"
|
151 | 152 | ]
|
|
157 | 158 | },
|
158 | 159 | "source": [
|
159 | 160 | "## Distribution strategies\n",
|
160 |
| - "Most times users want to run the model on multiple TPUs in a data parallel way. A distribution strategy is an abstraction that can be used to drive models on CPU, GPUs or TPUs. Simply swap out the distribution strategy and the model will run on the given device. See the [distribution strategy guide](./distributed_training.ipynb) for more information." |
| 161 | + "\n", |
| 162 | + "Usually, you would want to run your model on multiple TPUs in a data-parallel way. To distribute your model on multiple TPUs (or other accelerators), TensorFlow offers several distribution strategies. You can replace your distribution strategy and the model will run on any given (TPU) device. Check the [distribution strategy guide](./distributed_training.ipynb) for more information." |
161 | 163 | ]
|
162 | 164 | },
|
163 | 165 | {
|
|
166 | 168 | "id": "DcDPMZs-9uLJ"
|
167 | 169 | },
|
168 | 170 | "source": [
|
169 |
| - "First, creates the `TPUStrategy` object." |
| 171 | + "To demonstrate this, create a `TPUStrategy` object:" |
170 | 172 | ]
|
171 | 173 | },
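A sketch of this step, reusing the `resolver` from the initialization section (older TF 2.x releases expose the same class as `tf.distribute.experimental.TPUStrategy`):

```python
# Build the strategy on top of the already-initialized TPU system.
strategy = tf.distribute.TPUStrategy(resolver)
print("Number of replicas: ", strategy.num_replicas_in_sync)
```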
|
172 | 174 | {
|
|
186 | 188 | "id": "JlaAmswWPsU6"
|
187 | 189 | },
|
188 | 190 | "source": [
|
189 |
| - "To replicate a computation so it can run in all TPU cores, you can simply pass it to `strategy.run` API. Below is an example that all the cores will obtain the same inputs `(a, b)`, and do the matmul on each core independently. The outputs will be the values from all the replicas." |
| 191 | + "To replicate a computation so it can run in all TPU cores, you can pass it into the `strategy.run` API. Below is an example that shows all cores receiving the same inputs `(a, b)` and performing matrix multiplication on each core independently. The outputs will be the values from all the replicas." |
190 | 192 | ]
|
191 | 193 | },
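A sketch of that pattern, reusing the `a` and `b` constants from the manual device placement example:

```python
@tf.function
def matmul_fn(x, y):
  # Every replica receives the same (x, y) and computes the product independently.
  return tf.matmul(x, y)

# strategy.run replicates the computation across all TPU cores and returns
# the per-replica results.
z = strategy.run(matmul_fn, args=(a, b))
print(z)
```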
|
192 | 194 | {
|
|
213 | 215 | },
|
214 | 216 | "source": [
|
215 | 217 | "## Classification on TPUs\n",
|
216 |
| - "As we have learned the basic concepts, it is time to look at a more concrete example. This guide demonstrates how to use the distribution strategy `tf.distribute.TPUStrategy` to drive a Cloud TPU and train a Keras model.\n" |
| 218 | + "\n", |
| 219 | + "Having covered the basic concepts, consider a more concrete example. This section demonstrates how to use the distribution strategy—`tf.distribute.TPUStrategy`—to train a Keras model on a Cloud TPU.\n" |
217 | 220 | ]
|
218 | 221 | },
|
219 | 222 | {
|
|
223 | 226 | },
|
224 | 227 | "source": [
|
225 | 228 | "### Define a Keras model\n",
|
226 |
| - "Below is the definition of MNIST model using Keras, unchanged from what you would use on CPU or GPU. Note that Keras model creation needs to be inside `strategy.scope`, so the variables can be created on each TPU device. Other parts of the code is not necessary to be inside the strategy scope." |
| 229 | + "\n", |
| 230 | + "Start with a definition of a `Sequential` Keras model for image classification on the MNIST dataset using Keras. It's no different than what you would use if you were training on CPUs or GPUs. Note that Keras model creation needs to be inside `strategy.scope`, so the variables can be created on each TPU device. Other parts of the code are not necessary to be inside the strategy scope." |
227 | 231 | ]
|
228 | 232 | },
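The training code later in this guide refers to a `create_model` helper; a plausible sketch of such a helper (the layer sizes here are illustrative assumptions, not taken from the notebook):

```python
def create_model():
  # A small convolutional classifier for 28x28x1 MNIST images; the final
  # Dense layer returns raw logits for the 10 digit classes.
  return tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
      tf.keras.layers.Conv2D(64, 3, activation='relu'),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(128, activation='relu'),
      tf.keras.layers.Dense(10)])
```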
|
229 | 233 | {
|
|
250 | 254 | "id": "qYOYjYTg_31l"
|
251 | 255 | },
|
252 | 256 | "source": [
|
253 |
| - "### Input datasets\n", |
254 |
| - "Efficient use of the `tf.data.Dataset` API is critical when using a Cloud TPU, as it is impossible to use the Cloud TPUs unless you can feed them data quickly enough. See [Input Pipeline Performance Guide](./data_performance.ipynb) for details on dataset performance.\n", |
| 257 | + "### Load the dataset\n", |
| 258 | + "\n", |
| 259 | + "Efficient use of the `tf.data.Dataset` API is critical when using a Cloud TPU, as it is impossible to use the Cloud TPUs unless you can feed them data quickly enough. You can learn more about dataset performance in the [Input pipeline performance guide](./data_performance.ipynb).\n", |
255 | 260 | "\n",
|
256 |
| - "For all but the simplest experimentation (using `tf.data.Dataset.from_tensor_slices` or other in-graph data) you will need to store all data files read by the Dataset in Google Cloud Storage (GCS) buckets.\n", |
| 261 | + "For all but the simplest experimentations (using `tf.data.Dataset.from_tensor_slices` or other in-graph data) you will need to store all data files read by the Dataset in Google Cloud Storage (GCS) buckets.\n", |
257 | 262 | "\n",
|
258 |
| - "For most use-cases, it is recommended to convert your data into `TFRecord` format and use a `tf.data.TFRecordDataset` to read it. See [TFRecord and tf.Example tutorial](../tutorials/load_data/tfrecord.ipynb) for details on how to do this. This, however, is not a hard requirement and you can use other dataset readers (`FixedLengthRecordDataset` or `TextLineDataset`) if you prefer.\n", |
| 263 | + "For most use cases, you're recommended to convert your data into the `TFRecord` format and use a `tf.data.TFRecordDataset` to read it. Check the [TFRecord and tf.Example tutorial](../tutorials/load_data/tfrecord.ipynb) for details on how to do this. It is not a hard requirement and you can use other dataset readers, such as `FixedLengthRecordDataset` or `TextLineDataset`.\n", |
259 | 264 | "\n",
|
260 |
| - "Small datasets can be loaded entirely into memory using `tf.data.Dataset.cache`.\n", |
| 265 | + "You can load entire small datasets into memory using `tf.data.Dataset.cache`.\n", |
261 | 266 | "\n",
|
262 |
| - "Regardless of the data format used, it is strongly recommended that you use large files, on the order of 100MB. This is especially important in this networked setting as the overhead of opening a file is significantly higher.\n", |
| 267 | + "Regardless of the data format used, you're strongly recommended to use large files on the order of 100MB. This is especially important in this networked setting, as the overhead of opening a file is significantly higher.\n", |
263 | 268 | "\n",
|
264 |
| - "Here you should use the `tensorflow_datasets` module to get a copy of the MNIST training data. Note that `try_gcs` is specified to use a copy that is available in a public GCS bucket. If you don't specify this, the TPU will not be able to access the data that is downloaded. " |
| 269 | + "As shown in the code below, you should use the `tensorflow_datasets` module to get a copy of the MNIST training and test data. Note that `try_gcs` is specified to use a copy that is available in a public GCS bucket. If you don't specify this, the TPU will not be able to access the downloaded data. " |
265 | 270 | ]
|
266 | 271 | },
|
267 | 272 | {
|
|
277 | 282 | " dataset, info = tfds.load(name='mnist', split=split, with_info=True,\n",
|
278 | 283 | " as_supervised=True, try_gcs=True)\n",
|
279 | 284 | "\n",
|
| 285 | + " # Normalize the input data.\n", |
280 | 286 | " def scale(image, label):\n",
|
281 | 287 | " image = tf.cast(image, tf.float32)\n",
|
282 | 288 | " image /= 255.0\n",
|
283 |
| - "\n", |
284 | 289 | " return image, label\n",
|
285 | 290 | "\n",
|
286 | 291 | " dataset = dataset.map(scale)\n",
|
287 | 292 | "\n",
|
288 |
| - " # Only shuffle and repeat the dataset in training. The advantage to have a\n", |
| 293 | + " # Only shuffle and repeat the dataset in training. The advantage of having an\n", |
289 | 294 | " # infinite dataset for training is to avoid the potential last partial batch\n",
|
290 |
| - " # in each epoch, so users don't need to think about scaling the gradients\n", |
| 295 | + " # in each epoch, so that you don't need to think about scaling the gradients\n", |
291 | 296 | " # based on the actual batch size.\n",
|
292 | 297 | " if is_training:\n",
|
293 | 298 | " dataset = dataset.shuffle(10000)\n",
|
|
304 | 309 | "id": "mgUC6A-zCMEr"
|
305 | 310 | },
|
306 | 311 | "source": [
|
307 |
| - "### Train a model using Keras high level APIs\n", |
| 312 | + "### Train the model using Keras high-level APIs\n", |
308 | 313 | "\n",
|
309 |
| - "You can train a model simply with Keras fit/compile APIs. Nothing here is TPU specific, you would write the same code below if you had mutliple GPUs and where using a `MirroredStrategy` rather than a `TPUStrategy`. To learn more, check out the [Distributed training with Keras](https://www.tensorflow.org/tutorials/distribute/keras) tutorial." |
| 314 | + "You can train your model with Keras `fit` and `compile` APIs. There is nothing TPU-specific in this step—you write the code as if you were using mutliple GPUs and a `MirroredStrategy` instead of the `TPUStrategy`. You can learn more in the [Distributed training with Keras](https://www.tensorflow.org/tutorials/distribute/keras) tutorial." |
310 | 315 | ]
|
311 | 316 | },
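A hedged sketch of this step, assuming the `create_model` helper above and a `get_dataset(batch_size, is_training)` signature for the dataset helper:

```python
batch_size = 200
steps_per_epoch = 60000 // batch_size    # MNIST has 60,000 training images...
validation_steps = 10000 // batch_size   # ...and 10,000 test images.

# Variable creation (the model) must happen inside the strategy scope.
with strategy.scope():
  model = create_model()
  model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['sparse_categorical_accuracy'])

train_dataset = get_dataset(batch_size, is_training=True)
test_dataset = get_dataset(batch_size, is_training=False)

# The training dataset repeats forever, so steps_per_epoch is required.
model.fit(train_dataset,
          epochs=5,
          steps_per_epoch=steps_per_epoch,
          validation_data=test_dataset,
          validation_steps=validation_steps)
```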
|
312 | 317 | {
|
|
343 | 348 | "id": "8hSGBIYtUugJ"
|
344 | 349 | },
|
345 | 350 | "source": [
|
346 |
| - "To reduce python overhead, and maximize the performance of your TPU, try out the **experimental** `experimental_steps_per_execution` argument to `Model.compile`. Here it increases throughput by about 50%:" |
| 351 | + "To reduce Python overhead and maximize the performance of your TPU, you can pass in the **experimental** argument—`experimental_steps_per_execution`—to `Model.compile`. In this example, it increases throughput by about 50%:" |
347 | 352 | ]
|
348 | 353 | },
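A sketch of the same compile-and-fit step with the experimental argument added (the value 50 is an illustrative choice; anything up to `steps_per_epoch` is reasonable):

```python
with strategy.scope():
  model = create_model()
  model.compile(optimizer='adam',
                # Run 50 training steps per call into the TPU, which cuts down
                # on host-side Python overhead between steps.
                experimental_steps_per_execution=50,
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['sparse_categorical_accuracy'])

model.fit(train_dataset,
          epochs=5,
          steps_per_epoch=steps_per_epoch,
          validation_data=test_dataset,
          validation_steps=validation_steps)
```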
|
349 | 354 | {
|
|
375 | 380 | "id": "0rRALBZNCO4A"
|
376 | 381 | },
|
377 | 382 | "source": [
|
378 |
| - "### Train a model using custom training loop.\n", |
379 |
| - "You can also create and train your models using `tf.function` and `tf.distribute` APIs directly. `strategy.experimental_distribute_datasets_from_function` API is used to distribute the dataset given a dataset function. Note that the batch size passed into the dataset will be per replica batch size instead of global batch size in this case. To learn more, check out the [Custom training with tf.distribute.Strategy](https://www.tensorflow.org/tutorials/distribute/custom_training) tutorial.\n" |
| 383 | + "### Train the model using a custom training loop\n", |
| 384 | + "\n", |
| 385 | + "You can also create and train your model using `tf.function` and `tf.distribute` APIs directly. You can use the `strategy.experimental_distribute_datasets_from_function` API to distribute the dataset given a dataset function. Note that in the example below the batch size passed into the dataset is the per-replica batch size instead of the global batch size. To learn more, check out the [Custom training with tf.distribute.Strategy](https://www.tensorflow.org/tutorials/distribute/custom_training) tutorial.\n" |
380 | 386 | ]
|
381 | 387 | },
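A sketch of the dataset-distribution part described above, assuming the same `get_dataset` helper and global `batch_size`:

```python
# The dataset function is given the per-replica batch size, not the global one.
per_replica_batch_size = batch_size // strategy.num_replicas_in_sync

train_dataset = strategy.experimental_distribute_datasets_from_function(
    lambda _: get_dataset(per_replica_batch_size, is_training=True))
train_iterator = iter(train_dataset)
```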
|
382 | 388 | {
|
|
385 | 391 | "id": "DxdgXPAL6iFE"
|
386 | 392 | },
|
387 | 393 | "source": [
|
388 |
| - "First, create the model, datasets and tf.functions." |
| 394 | + "First, create the model, datasets and tf.functions:" |
389 | 395 | ]
|
390 | 396 | },
|
391 | 397 | {
|
|
396 | 402 | },
|
397 | 403 | "outputs": [],
|
398 | 404 | "source": [
|
399 |
| - "# Create the model, optimizer and metrics inside strategy scope, so that the\n", |
| 405 | + "# Create the model, optimizer and metrics inside the strategy scope, so that the\n", |
400 | 406 | "# variables can be mirrored on each device.\n",
|
401 | 407 | "with strategy.scope():\n",
|
402 | 408 | " model = create_model()\n",
|
|
414 | 420 | "\n",
|
415 | 421 | "@tf.function\n",
|
416 | 422 | "def train_step(iterator):\n",
|
417 |
| - " \"\"\"The step function for one training step\"\"\"\n", |
| 423 | + " \"\"\"The step function for one training step.\"\"\"\n", |
418 | 424 | "\n",
|
419 | 425 | " def step_fn(inputs):\n",
|
420 | 426 | " \"\"\"The computation to run on each TPU device.\"\"\"\n",
|
|
438 | 444 | "id": "Ibi7Z97V6xsQ"
|
439 | 445 | },
|
440 | 446 | "source": [
|
441 |
| - "Then run the training loop." |
| 447 | + "Then, run the training loop:" |
442 | 448 | ]
|
443 | 449 | },
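A sketch of the driver loop, assuming the distributed `train_iterator` above plus `training_loss` and `training_accuracy` metric objects created in the setup cell (those two names are assumptions):

```python
steps_per_epoch = 60000 // batch_size

for epoch in range(5):
  print('Epoch: {}/5'.format(epoch))

  for step in range(steps_per_epoch):
    train_step(train_iterator)

  # `training_loss` and `training_accuracy` are assumed to be tf.keras.metrics
  # objects updated inside `step_fn`.
  print('Loss: {}, accuracy: {}%'.format(
      round(float(training_loss.result()), 4),
      round(float(training_accuracy.result()) * 100, 2)))
  training_loss.reset_states()
  training_accuracy.reset_states()
```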
|
444 | 450 | {
|
|
471 | 477 | "id": "TnZJUM3qIjKu"
|
472 | 478 | },
|
473 | 479 | "source": [
|
474 |
| - "### Improving performance by multiple steps within `tf.function`\n", |
475 |
| - "The performance can be improved by running multiple steps within a `tf.function`. This is achieved by wrapping the `strategy.run` call with a `tf.range` inside `tf.function`, AutoGraph will convert it to a `tf.while_loop` on the TPU worker.\n", |
| 480 | + "### Improving performance with multiple steps inside `tf.function`\n", |
| 481 | + "\n", |
| 482 | + "You can improve the performance by running multiple steps within a `tf.function`. This is achieved by wrapping the `strategy.run` call with a `tf.range` inside `tf.function`, and AutoGraph will convert it to a `tf.while_loop` on the TPU worker.\n", |
476 | 483 | "\n",
|
477 |
| - "Although with better performance, there are tradeoffs comparing with a single step inside `tf.function`. Running multiple steps in a `tf.function` is less flexible, you cannot run things eagerly or arbitrary python code within the steps.\n" |
| 484 | + "Despite the improved performance, there are tradeoffs with this method compared to running a single step inside `tf.function`. Running multiple steps in a `tf.function` is less flexible—you cannot run things eagerly or arbitrary Python code within the steps.\n" |
478 | 485 | ]
|
479 | 486 | },
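A sketch of that pattern; the body of `step_fn` mirrors the single-step version, and `optimizer` is assumed to come from the earlier setup cell:

```python
@tf.function
def train_multiple_steps(iterator, steps):
  """Runs `steps` training steps inside a single tf.function trace."""

  def step_fn(inputs):
    images, labels = inputs
    with tf.GradientTape() as tape:
      logits = model(images, training=True)
      per_example_loss = tf.keras.losses.sparse_categorical_crossentropy(
          labels, logits, from_logits=True)
      loss = tf.nn.compute_average_loss(per_example_loss,
                                        global_batch_size=batch_size)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

  # AutoGraph turns this tf.range loop into a tf.while_loop on the TPU worker.
  for _ in tf.range(steps):
    strategy.run(step_fn, args=(next(iterator),))

# Passing `steps` as a Tensor avoids retracing when the value changes.
train_multiple_steps(train_iterator, tf.convert_to_tensor(steps_per_epoch))
```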
|
480 | 487 | {
|
|
487 | 494 | "source": [
|
488 | 495 | "@tf.function\n",
|
489 | 496 | "def train_multiple_steps(iterator, steps):\n",
|
490 |
| - " \"\"\"The step function for one training step\"\"\"\n", |
| 497 | + " \"\"\"The step function for one training step.\"\"\"\n", |
491 | 498 | "\n",
|
492 | 499 | " def step_fn(inputs):\n",
|
493 | 500 | " \"\"\"The computation to run on each TPU device.\"\"\"\n",
|
|
523 | 530 | "source": [
|
524 | 531 | "## Next steps\n",
|
525 | 532 | "\n",
|
526 |
| - "* [Google Cloud TPU Documentation](https://cloud.google.com/tpu/docs/) - Set up and run a Google Cloud TPU.\n", |
527 |
| - "* [Distributed training with TensorFlow](./distributed_training.ipynb) - How to use distribution strategy and links to many example showing best practices.\n", |
528 |
| - "* [Saving/Loading models with TensorFlow](../tutorials/distribute/save_and_load.ipynb) - How to save and load models with distribution strategies.\n", |
529 |
| - "* [TensorFlow Official Models](https://github.com/tensorflow/models/tree/master/official) - Examples of state of the art TensorFlow 2.x models that are Cloud TPU compatible.\n", |
530 |
| - "* [The Google Cloud TPU Performance Guide](https://cloud.google.com/tpu/docs/performance-guide) - Enhance Cloud TPU performance further by adjusting Cloud TPU configuration parameters for your application." |
| 533 | + "- [Google Cloud TPU Documentation](https://cloud.google.com/tpu/docs/): How to set up and run a Google Cloud TPU.\n", |
| 534 | + "- [Distributed training with TensorFlow](./distributed_training.ipynb): How to use distribution strategies with many example showing best practices.\n", |
| 535 | + "- [Saving/Loading models with TensorFlow](../tutorials/distribute/save_and_load.ipynb): How to save and load models with distribution strategies.\n", |
| 536 | + "- [TensorFlow Official Models](https://github.com/tensorflow/models/tree/master/official): Examples of state-of-the-art TensorFlow 2.x models that are Cloud TPU-compatible.\n", |
| 537 | + "- [Google Cloud TPU Performance Guide](https://cloud.google.com/tpu/docs/performance-guide): Enhance Cloud TPU performance further by adjusting Cloud TPU configuration parameters for your application." |
531 | 538 | ]
|
532 | 539 | }
|
533 | 540 | ],
|
|